For: Anyone
Project Goal
The primary goal of this project was to create a flexible web scraper that uses AI to extract specific data points from web pages without relying on predefined CSS selectors. It adapts to varied page layouts and content structures, making it accessible to users who may not be familiar with web scraping techniques.
The scraper is deployed on Apify; check it out here!
How it was built
This scraper combines browser automation with Google’s Gemini LLM. Here’s a breakdown of the process:
- Dynamic Element Identification: The scraper uses Gemini to analyze screenshots and identify the location of elements based on user instructions provided in natural language.
- Bounding Box Mapping: The bounding box coordinates returned by Gemini are converted to pixel coordinates and accurately mapped to the DOM structure.
- CSS Selector Refinement: Candidate elements are filtered using a combination of center point and Intersection over Union (IoU) checks. An LLM is then used to select the best CSS selector for each element.
- Data Extraction: Once the CSS selectors are refined, the scraper extracts the text content from the corresponding elements.
- Domain Optimization: The scraper caches refined selectors per domain and reuses them on subsequent pages from the same site, avoiding repeated LLM calls and improving efficiency.
- Cookie Handling: The scraper intelligently detects and clicks cookie consent buttons, using a combination of pre-existing selectors and LLM-powered analysis.
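The bounding-box mapping and candidate filtering in the steps above could be sketched as follows. This is a minimal illustration, not the actor's actual code: it assumes Gemini's documented convention of returning boxes as [ymin, xmin, ymax, xmax] normalized to a 0–1000 grid, and the 0.5 IoU threshold is an arbitrary example value.

```python
def to_pixels(gemini_box, page_width, page_height):
    """Convert a Gemini [ymin, xmin, ymax, xmax] box (0-1000 grid)
    to a (left, top, right, bottom) pixel box."""
    ymin, xmin, ymax, xmax = gemini_box
    return (xmin / 1000 * page_width, ymin / 1000 * page_height,
            xmax / 1000 * page_width, ymax / 1000 * page_height)

def iou(box_a, box_b):
    """Intersection over Union of two (left, top, right, bottom) boxes."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    if right <= left or bottom <= top:
        return 0.0
    inter = (right - left) * (bottom - top)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def filter_candidates(target_box, elements, iou_threshold=0.5):
    """Keep selectors of DOM elements whose center lies inside the
    target box and whose IoU with it passes the threshold.
    `elements` is a list of (pixel_box, css_selector) pairs."""
    cx = (target_box[0] + target_box[2]) / 2
    cy = (target_box[1] + target_box[3]) / 2
    matches = []
    for el_box, selector in elements:
        center_ok = el_box[0] <= cx <= el_box[2] and el_box[1] <= cy <= el_box[3]
        if center_ok and iou(target_box, el_box) >= iou_threshold:
            matches.append(selector)
    return matches
```

Elements that survive both checks become the candidate set from which the LLM picks the final selector.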
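The per-domain selector cache behind the domain-optimization step might look like this sketch; the class and method names are hypothetical, not the actor's real interface.

```python
from urllib.parse import urlparse

class SelectorCache:
    """Caches resolved CSS selectors per domain so repeat pages
    from the same site can skip the screenshot/LLM pipeline."""

    def __init__(self):
        # domain -> {field_name: css_selector}
        self._cache = {}

    def get(self, url, field):
        """Return the cached selector for this URL's domain, or None."""
        return self._cache.get(urlparse(url).netloc, {}).get(field)

    def put(self, url, field, selector):
        """Store a refined selector under the URL's domain."""
        self._cache.setdefault(urlparse(url).netloc, {})[field] = selector
```

A selector refined on one product page (say, for a "price" field) is then reused for every later page on the same domain, falling back to the full pipeline only on a cache miss.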
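The cookie-handling step, trying pre-existing selectors before an LLM fallback, could be sketched like this. The selector list and the `llm_selector_fn` callback are illustrative placeholders; `driver` is any object with Selenium-style `page_source` and `find_element(by, value)`.

```python
# Hypothetical examples of common consent-button selectors, not the actor's real list.
KNOWN_CONSENT_SELECTORS = [
    "#onetrust-accept-btn-handler",
    "button[id*='accept']",
]

def _try_click(driver, selector):
    """Attempt to click the element matching `selector`; report success."""
    try:
        driver.find_element("css selector", selector).click()
        return True
    except Exception:  # not found or not clickable: let the caller try the next one
        return False

def dismiss_cookie_banner(driver, llm_selector_fn):
    """Click a cookie-consent button: known selectors first, LLM fallback second.
    Returns the selector that worked, or None."""
    for selector in KNOWN_CONSENT_SELECTORS:
        if _try_click(driver, selector):
            return selector
    # Fall back to asking the LLM for a selector based on the page HTML.
    selector = llm_selector_fn(driver.page_source)
    return selector if selector and _try_click(driver, selector) else None
```

Trying the cheap, known selectors first keeps LLM calls (and their latency) off the hot path for the many sites that use a standard consent manager.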
Technologies used
- Python: The core programming language.
- Selenium: For browser automation.
- Google Gemini LLM: For image analysis and natural language processing.
- Beautiful Soup 4: For HTML parsing.
- Apify Platform: For actor deployment and data management.