
Multi-Purpose Data Finder & Scraper


Project Goal

The goal of this project was to build a flexible tool that retrieves specific data fields for multiple entities (company names, URLs, etc.) using OpenAI’s Responses API or by scraping with Playwright. The tool automatically decides between web search and web scraping based on the input the user provides.

Github repo: Multi-Purpose scraper

How it was built

The application uses Streamlit for the user interface and OpenAI’s Responses API for data retrieval. The is_url function determines whether an input is a URL or a general search term. For URLs, the app scrapes asynchronously with Playwright, falling back to the requests library if Playwright fails; for non-URLs, it uses OpenAI’s built-in web_search tool. The application dynamically generates a JSON schema from the fields the user requests so that OpenAI returns structured data, and it includes logic to dismiss cookie consent dialogs during scraping.
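The routing step can be sketched roughly as follows. This is a minimal illustration, assuming is_url uses the common urllib.parse approach; the actual logic in utils.py may differ:

```python
from urllib.parse import urlparse

def is_url(text: str) -> bool:
    """Heuristic URL check: an http(s) scheme (or a leading 'www.') plus a dotted host."""
    candidate = text.strip()
    if candidate.startswith("www."):
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and "." in parsed.netloc

def route(entity: str) -> str:
    # Dispatch each input: scrape real URLs, web-search everything else.
    if is_url(entity):
        return f"scrape:{entity}"   # would invoke the Playwright-based web_scraper tool
    return f"search:{entity}"       # would invoke OpenAI's built-in web_search tool
```
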

Key components include:

  • multi_company_app.py: Main Streamlit application that orchestrates the data retrieval process.
  • utils.py: Contains utility functions like is_url for URL detection and create_schema_from_user_requests for generating JSON schemas.
  • web_scraper.py: Implements the web_scraper tool using Playwright and requests for web scraping.
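A sketch of what create_schema_from_user_requests in utils.py might produce. The all-strings typing and the field descriptions are assumptions; the real utility may infer richer types before handing the schema to the Responses API’s structured-output option:

```python
def create_schema_from_user_requests(fields: list[str]) -> dict:
    """Build a JSON Schema from user-requested field names.

    Assumes every requested field comes back as a string.
    """
    properties = {
        field: {"type": "string", "description": f"The {field} of the entity"}
        for field in fields
    }
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),   # structured outputs require every field
        "additionalProperties": False,  # and forbid extras
    }
```
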

Technologies used

  • Python
  • Streamlit
  • OpenAI Responses API
  • Playwright (async)
  • Requests
  • Beautiful Soup 4
  • Pandas
  • python-dotenv
  • nest-asyncio
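The Playwright-then-requests flow described above might look like the sketch below. The cookie-consent selectors are illustrative guesses, not the selectors web_scraper.py actually uses, and extract_text is a hypothetical helper for reducing fetched HTML to plain text with Beautiful Soup:

```python
import requests
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    """Strip scripts/styles and collapse whitespace from fetched HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

async def scrape(url: str) -> str:
    # Try Playwright first (renders JavaScript and can dismiss cookie
    # dialogs), falling back to a plain requests fetch if it fails.
    # Inside Streamlit's running event loop, nest-asyncio makes it possible
    # to drive this coroutine with asyncio.run().
    try:
        from playwright.async_api import async_playwright
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            # Illustrative consent-button selectors; real sites vary widely.
            for selector in ("button#accept", "button:has-text('Accept')"):
                try:
                    await page.locator(selector).first.click(timeout=1000)
                    break
                except Exception:
                    pass
            html = await page.content()
            await browser.close()
    except Exception:
        html = requests.get(url, timeout=15).text
    return extract_text(html)
```
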
