Personal knowledge base - Automatically scrape, summarize and store article data

For: Me

Project Goal

The goal of this project was to build a personal knowledge base that I can add to just by sharing a link from my phone or desktop. It automates extracting, summarizing, and storing web content for personal knowledge management. Specifically, it takes a URL, scrapes the article text, uses an LLM to generate a summary and keywords, and stores the result in a BigQuery table. I use Looker Studio (formerly Google Data Studio) as a frontend to query the data through a user-friendly UI.
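
To make the stored data a bit more concrete, here is a minimal sketch of what one record could look like before it goes into BigQuery. The field names (`url`, `title`, `summary`, `keywords`, `saved_at`) and the `to_row` helper are my assumptions for illustration, not the actual table schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ArticleRecord:
    """One stored article. Field names are illustrative, not the real schema."""
    url: str
    title: str
    summary: str
    keywords: list[str]
    saved_at: str  # ISO 8601 timestamp; an assumed column


def to_row(url, title, summary, keywords):
    """Build a JSON-serializable dict for one article."""
    record = ArticleRecord(
        url=url,
        title=title.strip(),
        summary=summary.strip(),
        # Normalize keywords: trim whitespace, lowercase, drop empty entries.
        keywords=[k.strip().lower() for k in keywords if k.strip()],
        saved_at=datetime.now(timezone.utc).isoformat(),
    )
    return asdict(record)
```

A dict like this could then be passed to the BigQuery Python client, for example with `insert_rows_json`, assuming the table has matching columns and `keywords` is a repeated string field.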

How it was built

This system is built using a combination of Python, Google Cloud services, Tasker (Android) and web scraping techniques. Here’s a breakdown:

Cloud Task Queue: A Flask application receives URLs and enqueues a Cloud Task for each one, decoupling the web request from the processing logic.
Web Scraping: Selenium handles dynamic pages, including JavaScript-rendered content and cookie banners, and extracts the full text and title of the article.
Cookie Handling: A custom function using an LLM (OpenAI) identifies and clicks on cookie accept buttons, ensuring full page content is accessible.
LLM Summarization: The extracted text is processed by OpenAI’s GPT model using function calling, which produces a concise summary and the top 10 keywords.
BigQuery Storage: The summarized data, along with the original URL, is stored in a BigQuery table for easy querying and analysis.
Error Handling: The system handles errors at several points, including API errors and JSON parsing errors.
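
The summarization and error handling steps above can be sketched roughly as follows. With function calling, the model returns its arguments as a JSON string, and that string is not guaranteed to be valid JSON, so it has to be parsed defensively. The schema name `store_article_summary` and the helper `parse_summary_arguments` are hypothetical names for this sketch, not the code the project actually uses.

```python
import json

# Hypothetical function schema sent to the model; names are illustrative.
SUMMARY_FUNCTION = {
    "name": "store_article_summary",
    "description": "Record a concise summary and keywords for an article.",
    "parameters": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "keywords": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 10,
            },
        },
        "required": ["summary", "keywords"],
    },
}


def parse_summary_arguments(arguments_json):
    """Parse the JSON arguments string of a function call.

    Returns (summary, keywords), or None if the payload is unusable.
    """
    try:
        data = json.loads(arguments_json)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or no arguments string at all.
        return None
    if not isinstance(data, dict):
        return None
    summary = data.get("summary")
    keywords = data.get("keywords")
    if not isinstance(summary, str) or not isinstance(keywords, list):
        return None
    # Keep only non-empty string keywords and cap the list at 10.
    cleaned = [k.strip() for k in keywords if isinstance(k, str) and k.strip()]
    return summary.strip(), cleaned[:10]
```

Returning `None` rather than raising lets the caller decide what to do with a bad response, such as retrying the request or skipping the article.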

Technologies used

Tasker (Android)
Python
Flask
Google Cloud Tasks
Google Cloud BigQuery
Google Looker Studio
Selenium
OpenAI API (GPT-3.5-turbo)
Beautiful Soup
Webdriver Manager
