Skip to content
Home ยป Automated Exploratory Data Analysis (EDA) with GPT-4 Insights

Automated Exploratory Data Analysis (EDA) with GPT-4 Insights


Project Goal

The primary goal of this project was to automate the Exploratory Data Analysis (EDA) process and leverage the power of GPT-4 to interpret the results. This involved creating a function that performs a comprehensive set of EDA steps, visualizes the data, and then uses GPT-4 to provide insights, identify potential issues, and give recommendations.

How it was built

This project was built using Python, leveraging several key libraries. The core EDA logic was implemented using Pandas for data manipulation, NumPy for numerical operations, Matplotlib and Seaborn for data visualization, and SciPy for statistical analysis. Scikit-learn was used for preprocessing tasks like imputation, scaling, dimensionality reduction (PCA), and clustering (K-means). The statsmodels library was used for time series analysis (if applicable) and ANOVA testing. The OpenAI API was utilized to integrate GPT-4 for automated interpretation of the EDA output and plots. The process involved several key steps:

  • Data Loading and Preprocessing: The input data is loaded into a Pandas DataFrame. Missing values in numeric columns are imputed using the mean, and categorical columns are identified for separate analysis.
  • EDA Operations: The script performs a range of EDA tasks, including: basic dataset information, missing values analysis, descriptive statistics, correlation analysis, distribution analysis (including normality tests), categorical data analysis (value counts and bar plots), outlier detection, time series analysis, dimensionality reduction with PCA, and basic clustering with K-means.
  • Visualization: The project generates a suite of plots, including correlation heatmaps, histograms, Q-Q plots, bar plots for categorical data, time series plots, PCA variance ratio plots, PCA scatter plots, and K-means cluster plots.
  • GPT-4 Integration: All printed output and generated plots are then sent to GPT-4 for analysis. The outputs are converted to base64 strings to include within API calls and ensure data is passed correctly.
  • Interpretation: GPT-4 returns a comprehensive summary of key findings, potential issues, and recommendations, based on the EDA and visualisations.
  • Output Handling: Printed output and plots are captured and saved, then passed to GPT-4 for interpretation.

Technologies used

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • SciPy
  • Scikit-learn
  • statsmodels
  • OpenAI API

Leave a Reply

Your email address will not be published. Required fields are marked *