Personal knowledge base – Automatically scrape, summarize and store article data



For: Me

Project Goal

The goal of this project was to build a personal knowledge base that I can add to just by sharing a link from my mobile or desktop. It automates extracting, summarizing, and storing web content for personal knowledge management. Specifically, it takes a URL, scrapes the article text, uses an LLM to generate a summary and keywords, and stores the result in a BigQuery table. I use Looker Studio (previously Google Data Studio) as a frontend for querying the data in a user-friendly way.

How it was built

This system is built using a combination of Python, Google Cloud services, Tasker (Android) and web scraping techniques. Here’s a breakdown:

  • Cloud Task Queue: A Flask application receives each URL and creates a Cloud Task, decoupling the web request from the processing logic.
  • Web Scraping: Selenium handles dynamic web page scraping, including JavaScript-rendered content and cookie banners. It extracts the full text and title of the article.
  • Cookie Handling: A custom function using an LLM (OpenAI) identifies and clicks on cookie accept buttons, ensuring full page content is accessible.
  • LLM Summarization: The extracted text is sent to OpenAI’s GPT model using function calling, which returns a concise summary and the top 10 keywords.
  • BigQuery Storage: The summarized data, along with the original URL, is stored in a BigQuery table for easy querying and analysis.
  • Error Handling: The system handles errors at several points, including API errors and JSON parsing errors.
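
To make the summarization, error handling and storage steps more concrete, here is a minimal sketch of how the function-calling output could be validated before it is written to BigQuery. It is not the project's actual code: the function name `store_article_summary`, the helpers `parse_summary_arguments` and `build_row`, and the column names are all assumptions made for illustration. The schema is written in the shape accepted by the `tools` parameter of the OpenAI Chat Completions API, and the network calls themselves (OpenAI, BigQuery) are left out so the example runs on its own.

```python
import json

MAX_KEYWORDS = 10  # the post mentions "top 10 keywords"

# Hypothetical function-calling schema; the real one is not shown in the post.
SUMMARY_TOOL = {
    "type": "function",
    "function": {
        "name": "store_article_summary",  # assumed name
        "description": "Record a concise summary and keywords for an article.",
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "keywords": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["summary", "keywords"],
        },
    },
}


def parse_summary_arguments(raw_arguments: str) -> dict:
    """Validate the JSON argument string returned in a function call.

    Raises ValueError on malformed JSON or missing fields, so the caller
    has a single exception type to handle.
    """
    try:
        data = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    summary = data.get("summary")
    keywords = data.get("keywords")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("missing or empty 'summary'")
    if not isinstance(keywords, list):
        raise ValueError("'keywords' must be a list")

    # Drop non-string or blank entries, then cap the list length.
    cleaned = [k.strip() for k in keywords if isinstance(k, str) and k.strip()]
    return {"summary": summary.strip(), "keywords": cleaned[:MAX_KEYWORDS]}


def build_row(url: str, title: str, parsed: dict) -> dict:
    """Shape one record for storage; the column names are assumptions."""
    return {
        "url": url,
        "title": title,
        "summary": parsed["summary"],
        "keywords": parsed["keywords"],
    }


if __name__ == "__main__":
    raw = '{"summary": "A short recap.", "keywords": ["python", " bigquery "]}'
    row = build_row("https://example.com/post", "Example", parse_summary_arguments(raw))
    print(row)
```

Validating the model output before building the row keeps malformed responses out of the table, and raising a single exception type gives the task handler one place to decide whether to retry or skip the URL.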

Technologies used

  • Tasker (Android)
  • Python
  • Flask
  • Google Cloud Tasks
  • Google Cloud BigQuery
  • Google Looker Studio
  • Selenium
  • OpenAI API (GPT-3.5-turbo)
  • Beautiful Soup
  • Webdriver Manager
