For: Me, Myself and I
Project Goal
This project extracts structured data from WordPress websites using REST API calls, with Selenium as a fallback when the API cannot be reached directly. The primary goal is to retrieve posts, media, categories, or comments, along with their associated metadata, and output them in a clean, structured JSON format.
The scraper is deployed on Apify; check it out here!
How it was built
The scraper begins by making REST API requests to the WordPress site, trying both the “pretty” URL format (/wp-json/wp/v2/posts) and the “non-pretty” format (?rest_route=/wp/v2/posts) so that it also works on sites without pretty permalinks. If the API is inaccessible, or if the response is not the expected JSON, the scraper falls back to Selenium: it renders the same URL in a browser and extracts the JSON from the page source. The results are parsed and flattened, and any HTML content is cleaned down to plain text. The scraper also handles pagination, retrieving either all available results or as many as the user-defined limit allows. Finally, the extracted data is pushed to an Apify dataset.
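The request, fallback, and pagination flow described above might be sketched roughly as follows. This is not the project's actual code: the function names (candidate_urls, fetch_with_fallback, collect) are hypothetical, and each "fetcher" is assumed to be any callable that takes a URL and returns parsed JSON, for example one built on Requests and one built on Selenium. The page and per_page query parameters are the standard WordPress REST API pagination parameters.

```python
from typing import Callable, List, Optional
from urllib.parse import urlencode

# Hypothetical type: a fetcher takes a URL and returns parsed JSON (or raises).
Fetcher = Callable[[str], object]


def candidate_urls(site: str, resource: str, page: int = 1, per_page: int = 100) -> List[str]:
    """Return the pretty and non-pretty REST URLs for one page of results."""
    base = site.rstrip("/")
    paging = {"page": page, "per_page": per_page}
    pretty = f"{base}/wp-json/wp/v2/{resource}?{urlencode(paging)}"
    # rest_route is the query-string form used when pretty permalinks are off.
    plain_query = urlencode({"rest_route": f"/wp/v2/{resource}", **paging}, safe="/")
    return [pretty, f"{base}/?{plain_query}"]


def fetch_with_fallback(urls: List[str], fetchers: List[Fetcher]) -> Optional[list]:
    """Try each fetcher in order (e.g. Requests first, Selenium second) on every URL."""
    for fetch in fetchers:
        for url in urls:
            try:
                data = fetch(url)
            except Exception:
                continue  # network error, block page, invalid JSON, ...
            # A collection endpoint should return a JSON array; anything else
            # (such as an error object) counts as "not returned as expected".
            if isinstance(data, list):
                return data
    return None


def collect(site: str, resource: str, fetchers: List[Fetcher],
            max_items: Optional[int] = None, per_page: int = 100) -> list:
    """Walk the pages until the results run out or max_items is reached."""
    items: list = []
    page = 1
    while max_items is None or len(items) < max_items:
        batch = fetch_with_fallback(candidate_urls(site, resource, page, per_page), fetchers)
        if not batch:
            break  # nothing usable came back
        items.extend(batch)
        if len(batch) < per_page:
            break  # a short page means this was the last one
        page += 1
    return items if max_items is None else items[:max_items]
```

In practice the fetchers list would hold the cheap HTTP-based function first and the browser-based one second, so Selenium only runs when the plain request fails. The default per_page of 100 matches the maximum page size the WordPress REST API accepts.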
Technologies used
- Python: The primary programming language for the scraper.
- Apify SDK: For the actor framework, input/output management, and request queue management.
- Requests: To make HTTP requests to the WordPress REST API.
- Selenium: For browser automation and extracting data when REST API access fails.
- Beautiful Soup: For cleaning HTML content.
- JSON: For handling data serialization and deserialization.
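To illustrate the parsing steps these tools are used for, here is a minimal sketch of pulling JSON out of a browser-rendered page, flattening a record, and stripping HTML down to text. It is an assumption-laden illustration, not the project's code: the function names are hypothetical, the underscore separator used when flattening is an arbitrary choice, and the standard library's html.parser stands in for Beautiful Soup so that the sketch has no third-party dependencies. It also assumes the browser wraps raw JSON in a pre element, which is how Chrome typically displays JSON responses.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, optionally only those inside a given tag."""

    def __init__(self, only_inside=None):
        super().__init__()  # convert_charrefs=True: entities arrive decoded
        self.only_inside = only_inside
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.only_inside:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.only_inside and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.only_inside is None or self.depth:
            self.parts.append(data)


def _collect_text(markup, only_inside=None):
    collector = _TextCollector(only_inside)
    collector.feed(markup)
    collector.close()  # flush any buffered trailing text
    return "".join(collector.parts)


def json_from_page_source(page_source):
    """Parse the JSON that a browser rendered inside a <pre> element."""
    # Fall back to the raw source if no <pre> text was found.
    text = _collect_text(page_source, only_inside="pre") or page_source
    return json.loads(text)


def html_to_text(markup):
    """Drop tags and collapse runs of whitespace."""
    return " ".join(_collect_text(markup).split())


def flatten(record, parent="", sep="_"):
    """Turn nested dicts into one level, e.g. title.rendered -> title_rendered."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat
```

With Selenium, the input to json_from_page_source would be the driver's page_source string; a flattened record's content_rendered field could then be passed through html_to_text before being pushed to the dataset.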