For: Me, Myself and I
Project Goal
This project extracts product data from websites by reading the dataLayer object, a common JavaScript object that Google Analytics and other tracking tools use to store information about the products viewed on a website. Instead of scraping specific HTML elements, as traditional scrapers do, it reads the structured data that is already present in the dataLayer.
The scraper is deployed on Apify. Check it out here!
How it was built
The scraper is built with Python, Selenium, and Apify’s Actor platform. It starts a headless Chrome browser via Selenium and navigates to each specified URL. It then uses a custom JavaScript function to wait until the dataLayer object is available before extracting product information. A recursive function traverses the dataLayer object, searching for dictionaries that contain product data, which are identified by partial matches on keys such as “name”, “id”, and “price”. The extracted data is then pushed to the Apify dataset. If a cookie acceptance prompt is present, the scraper accepts cookies via a CSS selector specified by the user.
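The recursive search described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `PRODUCT_KEY_HINTS`, `looks_like_product`, and `find_products` are hypothetical, and the rule that every hint must match some key is an assumption (the description only says that keys are matched partially).

```python
from typing import Any, Iterator

# Assumed key fragments, taken from the examples given in the description.
PRODUCT_KEY_HINTS = ("name", "id", "price")


def looks_like_product(candidate: dict) -> bool:
    """Return True if every hint is a substring of at least one key."""
    keys = [str(key).lower() for key in candidate]
    return all(any(hint in key for key in keys) for hint in PRODUCT_KEY_HINTS)


def find_products(node: Any) -> Iterator[dict]:
    """Recursively walk dicts and lists, yielding product-like dicts."""
    if isinstance(node, dict):
        if looks_like_product(node):
            yield node
            return  # a matched product is not searched any further
        for value in node.values():
            yield from find_products(value)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from find_products(item)


# Made-up dataLayer content, shaped like a typical e-commerce push.
sample_data_layer = [
    {"event": "gtm.js"},
    {
        "event": "view_item",
        "ecommerce": {
            "items": [
                {"item_name": "Blue Mug", "item_id": "SKU-1", "price": 9.5},
            ]
        },
    },
]

products = list(find_products(sample_data_layer))
```

Substring matching is what lets keys such as `item_name` and `item_id` count as "name" and "id". It can also produce false positives (for example, "id" is a substring of "width"), so a real scraper may need stricter rules.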
Technologies used
- Python: The primary programming language.
- Selenium: For browser automation and interaction.
- Apify Actor: For cloud-based execution and data management.
- Chrome WebDriver: For controlling the headless browser.
- JavaScript: Used to interact with the dataLayer.
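As an illustration of how Python and JavaScript can work together here, the sketch below polls the page until the dataLayer contains entries. The JavaScript snippet and the `wait_for_datalayer` helper are assumptions made for this example, not the project's actual function. The helper only relies on the driver having an `execute_script` method, which Selenium WebDriver instances provide.

```python
import time
from typing import Any, Optional

# Hypothetical snippet: returns dataLayer once it has entries, otherwise null.
DATALAYER_READY_JS = (
    "return (Array.isArray(window.dataLayer) && window.dataLayer.length > 0)"
    " ? window.dataLayer : null;"
)


def wait_for_datalayer(
    driver: Any, timeout: float = 10.0, poll: float = 0.5
) -> Optional[list]:
    """Poll the page until dataLayer is populated or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = driver.execute_script(DATALAYER_READY_JS)
        if result:
            return result  # Selenium converts the JS array to a Python list
        if time.monotonic() >= deadline:
            return None  # the page never exposed a populated dataLayer
        time.sleep(poll)
```

With Selenium installed, a `webdriver.Chrome` instance can be passed in directly after `driver.get(url)`. Selenium's own `WebDriverWait(driver, timeout).until(...)` is an alternative to the hand-written loop.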