Project Goal
The goal of this project was to create a flexible tool that retrieves specific data fields for multiple entities (companies, URLs, etc.), either through OpenAI’s Responses API or by scraping web pages with Playwright. The tool automatically decides between web search and web scraping based on the input the user provides.
GitHub repo: Multi-Purpose scraper
How it was built
The application leverages Streamlit for the user interface and OpenAI’s Responses API for data retrieval. It uses the is_url function to determine if an input is a URL or a general search term. For URLs, it employs Playwright for asynchronous web scraping, falling back to the requests library if Playwright fails. For non-URLs, it uses OpenAI’s built-in web_search tool. The application dynamically generates a JSON schema based on user requests to structure the data for OpenAI. It also includes logic to handle cookie consent dialogs during web scraping.
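The two utility functions mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the real `is_url` heuristics and the exact schema shape in `utils.py` may differ (the field type `"string"` for every requested field is an assumption here).

```python
from urllib.parse import urlparse

def is_url(text: str) -> bool:
    """Heuristically decide whether the input is a URL rather than a search term."""
    candidate = text if text.startswith(("http://", "https://")) else f"https://{text}"
    parsed = urlparse(candidate)
    # Require a host part that contains a dot, e.g. "example.com".
    return bool(parsed.netloc) and "." in parsed.netloc

def create_schema_from_user_requests(fields: list[str]) -> dict:
    """Build a JSON schema whose properties mirror the user's requested fields,
    so the Responses API can return structured output."""
    return {
        "type": "object",
        "properties": {field: {"type": "string"} for field in fields},
        "required": fields,
        "additionalProperties": False,
    }
```

With this sketch, an input like `example.com` routes to the scraper, while a free-text query like `best fintech startups` routes to the `web_search` tool.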
Key components include:
- multi_company_app.py: main Streamlit application that orchestrates the data retrieval process.
- utils.py: utility functions such as is_url for URL detection and create_schema_from_user_requests for generating JSON schemas.
- web_scraper.py: implements the web_scraper tool using Playwright and requests for web scraping.
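The Playwright-first, requests-fallback flow in web_scraper.py, including the cookie-consent handling, might look like the sketch below. The consent selectors and function names are illustrative assumptions, not the repo's actual identifiers; injecting `primary`/`fallback` callables is just a device to make the fallback logic easy to exercise.

```python
import asyncio

# Hypothetical consent-button selectors; the real list may differ.
CONSENT_SELECTORS = ["#onetrust-accept-btn-handler", "button:has-text('Accept')"]

async def scrape_with_playwright(url: str) -> str:
    """Render the page in headless Chromium, dismissing cookie dialogs first."""
    from playwright.async_api import async_playwright  # imported lazily

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        for selector in CONSENT_SELECTORS:
            try:
                await page.click(selector, timeout=1000)
                break
            except Exception:
                continue  # selector not present; try the next one
        html = await page.content()
        await browser.close()
        return html

def scrape_with_requests(url: str) -> str:
    """Plain HTTP fallback when Playwright fails (no JavaScript rendering)."""
    import requests

    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def scrape(url: str, primary=None, fallback=None) -> str:
    """Try the Playwright scraper first; fall back to requests on any error."""
    primary = primary or (lambda u: asyncio.run(scrape_with_playwright(u)))
    fallback = fallback or scrape_with_requests
    try:
        return primary(url)
    except Exception:
        return fallback(url)
```

The point of the two-tier design is resilience: JavaScript-heavy pages need a real browser, but a plain HTTP GET still recovers static pages when the browser step fails.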
Technologies used
- Python
- Streamlit
- OpenAI Responses API
- Playwright (async)
- Requests
- Beautiful Soup 4
- Pandas
- python-dotenv
- nest-asyncio