Project Goal
The goal of this project was to create a flexible tool that can retrieve specific data fields for multiple entities (companies, URLs, etc.) using OpenAI’s Responses API or by scraping with Playwright. The tool decides automatically whether to use web search or web scraping based on the input the user provides.
Github repo: Multi-Purpose scraper
How it was built
The application leverages Streamlit for the user interface and OpenAI’s Responses API for data retrieval. It uses the `is_url` function to determine whether an input is a URL or a general search term. For URLs, it employs Playwright for asynchronous web scraping, falling back to the `requests` library if Playwright fails. For non-URLs, it uses OpenAI’s built-in `web_search` tool. The application dynamically generates a JSON schema from the user’s requested fields so the data returned by OpenAI is structured, and it includes logic to handle cookie consent dialogs during web scraping.
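Only the name of `is_url` appears above, so the exact heuristics are an assumption; a minimal sketch of the routing decision might look like this (the two helper names are illustrative and are fleshed out in the sketches below):

```python
from urllib.parse import urlparse

def is_url(text: str) -> bool:
    """Heuristic check: treat the input as a URL if it parses with a
    network location that looks like a domain."""
    candidate = text.strip()
    # Allow bare domains like "example.com" by prepending a scheme first.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    return bool(parsed.netloc) and "." in parsed.netloc and " " not in parsed.netloc

def retrieve(entity: str, schema: dict) -> str:
    # Route each input to scraping or web search based on its shape.
    if is_url(entity):
        return scrape_with_playwright(entity)        # hypothetical scraper entry point
    return search_with_openai(entity, schema)        # hypothetical Responses API call
```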
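For the non-URL path, a Responses API call combining the built-in `web_search` tool with a generated JSON schema could look roughly like the following. The model choice is illustrative, and the exact tool type string and structured-output support can vary by model and SDK version:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_with_openai(entity: str, schema: dict) -> dict:
    """Ask the model to find the requested fields via web search and
    return them as JSON matching the generated schema."""
    response = client.responses.create(
        model="gpt-4o",  # illustrative model choice
        tools=[{"type": "web_search"}],
        input=f"Find the requested data fields for: {entity}",
        text={
            "format": {
                "type": "json_schema",
                "name": "entity_data",
                "schema": schema,
                "strict": True,
            }
        },
    )
    return json.loads(response.output_text)
```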
Key components include:

- `multi_company_app.py`: the main Streamlit application that orchestrates the data retrieval process.
- `utils.py`: contains utility functions like `is_url` for URL detection and `create_schema_from_user_requests` for generating JSON schemas (sketched below).
- `web_scraper.py`: implements the `web_scraper` tool using Playwright and `requests` for web scraping (see the fallback sketch after this list).
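Only the name of `create_schema_from_user_requests` is given in the source; a plausible minimal implementation, assuming each user request is a free-text field name and every field is returned as a string, might be:

```python
def create_schema_from_user_requests(requested_fields: list[str]) -> dict:
    """Build a JSON schema whose properties mirror the data fields the
    user asked for, so the model's output can be parsed reliably."""
    properties = {
        field.strip().lower().replace(" ", "_"): {
            "type": "string",
            "description": f"The {field.strip()} of the entity",
        }
        for field in requested_fields
    }
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }

# Example: the user asks for three fields about each company.
schema = create_schema_from_user_requests(["CEO", "headquarters", "founding year"])
```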
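The scraping path with the `requests` fallback and cookie-consent handling could be sketched as follows; the consent selectors and timeouts are assumptions, and the real `web_scraper` tool in the repo likely differs in detail:

```python
import requests
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

CONSENT_SELECTORS = [  # assumed common cookie-banner buttons
    "button:has-text('Accept')",
    "button:has-text('Agree')",
    "#onetrust-accept-btn-handler",
]

async def scrape_with_playwright(url: str) -> str:
    """Render the page in a headless browser, dismiss cookie dialogs,
    and fall back to plain requests if the browser path fails."""
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url, timeout=30_000)
            # Best-effort dismissal of cookie consent dialogs.
            for selector in CONSENT_SELECTORS:
                try:
                    await page.click(selector, timeout=2_000)
                    break
                except Exception:
                    continue
            html = await page.content()
            await browser.close()
    except Exception:
        # Fallback: fetch the raw HTML without JavaScript rendering.
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        html = resp.text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
```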
Technologies used
- Python
- Streamlit
- OpenAI Responses API
- Playwright (async)
- Requests
- Beautiful Soup 4
- Pandas
- python-dotenv
- nest-asyncio
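Streamlit maintains its own event loop, which can conflict with calling `asyncio.run` for the async Playwright scraper from inside a Streamlit script; nest-asyncio is presumably in the list for that reason. A common pattern:

```python
import asyncio
import nest_asyncio

nest_asyncio.apply()  # allow nested event loops inside Streamlit's loop

# The async scraper can now be driven synchronously from Streamlit code:
text = asyncio.run(scrape_with_playwright("https://example.com"))
```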