Project Goal
The goal of this project was to create a flexible tool that can retrieve specific data fields for multiple entities (companies, URLs, etc.) using OpenAI’s Responses API or by scraping with Playwright. The tool decides automatically whether to use web search or web scraping based on the input the user provides.
Github repo: Multi-Purpose scraper
How it was built
The application leverages Streamlit for the user interface and OpenAI’s Responses API for data retrieval. It uses the `is_url` function to determine whether an input is a URL or a general search term. For URLs, it employs Playwright for asynchronous web scraping, falling back to the `requests` library if Playwright fails. For non-URLs, it uses OpenAI’s built-in `web_search` tool. The application dynamically generates a JSON schema from the user’s requested fields so the data returned by OpenAI is structured, and it includes logic to handle cookie consent dialogs during web scraping.
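Only the name of `is_url` appears above, so the exact heuristics are an assumption; a minimal sketch of the routing decision might look like this (the two helper names are illustrative and are fleshed out in the sketches below):

```python
from urllib.parse import urlparse

def is_url(text: str) -> bool:
    """Heuristic check: treat the input as a URL if it parses with a
    network location that looks like a domain."""
    candidate = text.strip()
    # Allow bare domains like "example.com" by prepending a scheme first.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    return bool(parsed.netloc) and "." in parsed.netloc and " " not in parsed.netloc

def retrieve(entity: str, schema: dict) -> str:
    # Route each input to scraping or web search based on its shape.
    if is_url(entity):
        return scrape_with_playwright(entity)        # hypothetical scraper entry point
    return search_with_openai(entity, schema)        # hypothetical Responses API call
```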
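For the non-URL path, a Responses API call combining the built-in `web_search` tool with a generated JSON schema could look roughly like the following. The model choice is illustrative, and the exact tool type string and structured-output support can vary by model and SDK version:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_with_openai(entity: str, schema: dict) -> dict:
    """Ask the model to find the requested fields via web search and
    return them as JSON matching the generated schema."""
    response = client.responses.create(
        model="gpt-4o",  # illustrative model choice
        tools=[{"type": "web_search"}],
        input=f"Find the requested data fields for: {entity}",
        text={
            "format": {
                "type": "json_schema",
                "name": "entity_data",
                "schema": schema,
                "strict": True,
            }
        },
    )
    return json.loads(response.output_text)
```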
Key components include:

- `multi_company_app.py`: the main Streamlit application that orchestrates the data retrieval process.
- `utils.py`: contains utility functions like `is_url` for URL detection and `create_schema_from_user_requests` for generating JSON schemas (sketched below).
- `web_scraper.py`: implements the `web_scraper` tool using Playwright and `requests` for web scraping (see the fallback sketch after this list).
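Only the name of `create_schema_from_user_requests` is given in the source; a plausible minimal implementation, assuming each user request is a free-text field name and every field is returned as a string, might be:

```python
def create_schema_from_user_requests(requested_fields: list[str]) -> dict:
    """Build a JSON schema whose properties mirror the data fields the
    user asked for, so the model's output can be parsed reliably."""
    properties = {
        field.strip().lower().replace(" ", "_"): {
            "type": "string",
            "description": f"The {field.strip()} of the entity",
        }
        for field in requested_fields
    }
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }

# Example: the user asks for three fields about each company.
schema = create_schema_from_user_requests(["CEO", "headquarters", "founding year"])
```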
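The scraping path with the `requests` fallback and cookie-consent handling could be sketched as follows; the consent selectors and timeouts are assumptions, and the real `web_scraper` tool in the repo likely differs in detail:

```python
import requests
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

CONSENT_SELECTORS = [  # assumed common cookie-banner buttons
    "button:has-text('Accept')",
    "button:has-text('Agree')",
    "#onetrust-accept-btn-handler",
]

async def scrape_with_playwright(url: str) -> str:
    """Render the page in a headless browser, dismiss cookie dialogs,
    and fall back to plain requests if the browser path fails."""
    try:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url, timeout=30_000)
            # Best-effort dismissal of cookie consent dialogs.
            for selector in CONSENT_SELECTORS:
                try:
                    await page.click(selector, timeout=2_000)
                    break
                except Exception:
                    continue
            html = await page.content()
            await browser.close()
    except Exception:
        # Fallback: fetch the raw HTML without JavaScript rendering.
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        html = resp.text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
```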
Technologies used
- Python
- Streamlit
- OpenAI Responses API
- Playwright (async)
- Requests
- Beautiful Soup 4
- Pandas
- python-dotenv
- nest-asyncio
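Streamlit maintains its own event loop, which can conflict with calling `asyncio.run` for the async Playwright scraper from inside a Streamlit script; nest-asyncio is presumably in the list for that reason. A common pattern:

```python
import asyncio
import nest_asyncio

nest_asyncio.apply()  # allow nested event loops inside Streamlit's loop

# The async scraper can now be driven synchronously from Streamlit code:
text = asyncio.run(scrape_with_playwright("https://example.com"))
```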