For: Me
Project Goal
The goal of this project was to build a personal knowledge base that I can add to just by sharing a link from my mobile or desktop. It automates extracting, summarizing, and storing web content for personal knowledge management. Specifically, it takes a URL, scrapes the article text, uses an LLM to generate a summary and keywords, and stores the result in BigQuery. I used Looker Studio (previously Google Data Studio) as a frontend for querying the data through a user-friendly UI.
How it was built
This system is built using a combination of Python, Google Cloud services, Tasker (Android) and web scraping techniques. Here’s a breakdown:
- Cloud Task Queue: A Flask application receives URLs and creates a Cloud Task for each one, decoupling the web request from the processing logic.
- Web Scraping: Selenium handles dynamic page scraping, including JavaScript-rendered content and cookie banners, and extracts the full text and title of the article.
- Cookie Handling: A custom function using an LLM (OpenAI) identifies and clicks on cookie accept buttons, ensuring full page content is accessible.
- LLM Summarization: The extracted text is processed by OpenAI’s GPT model using function calling, which returns a concise summary and the top 10 keywords as structured output.
- BigQuery Storage: The summarized data, along with the original URL, is stored in a BigQuery table for easy querying and analysis.
- Error Handling: The system handles errors at several points, including API errors and JSON parsing errors.
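To make the summarization, error handling, and storage steps above more concrete, here is a minimal sketch rather than the project's actual code. With function calling, the model returns its arguments as a JSON string, which can be malformed, so the parser falls back to `None` instead of raising. The function name `store_summary`, the schema fields, and the row columns (`url`, `title`, `summary`, `keywords`, `created_at`) are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical function schema that would be passed to the chat completions
# API so the model replies with structured arguments instead of free text.
SUMMARY_FUNCTION = {
    "name": "store_summary",
    "description": "Record a concise summary and keywords for an article.",
    "parameters": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "keywords": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 10,
            },
        },
        "required": ["summary", "keywords"],
    },
}


def parse_summary_arguments(arguments):
    """Parse the function-call arguments string; return None if unusable."""
    try:
        data = json.loads(arguments)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or a missing arguments string.
        return None
    if not isinstance(data, dict):
        return None
    summary = data.get("summary")
    keywords = data.get("keywords")
    if not isinstance(summary, str) or not isinstance(keywords, list):
        return None
    # Drop blank keywords and enforce the top-10 limit ourselves,
    # since the model is not guaranteed to respect the schema.
    cleaned = [str(k).strip() for k in keywords if str(k).strip()]
    return {"summary": summary.strip(), "keywords": cleaned[:10]}


def build_row(url, title, parsed):
    """Build a JSON-serializable row (assumed column names) for storage."""
    return {
        "url": url,
        "title": title,
        "summary": parsed["summary"],
        "keywords": parsed["keywords"],
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

A row built this way could then be passed to the BigQuery Python client, for example with `client.insert_rows_json(table_id, [row])`, which returns a list of per-row errors that can be logged. Whether the project uses streaming inserts or another load method is not stated, so treat that call as one possible option.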
Technologies used
- Tasker (Android)
- Python
- Flask
- Google Cloud Tasks
- Google Cloud BigQuery
- Google Looker Studio
- Selenium
- OpenAI API (GPT-3.5-turbo)
- Beautiful Soup
- Webdriver Manager