Project Goal
The primary goal of this project was to analyze and categorize user search queries from websites to understand what users are looking for. This involved cleaning the raw search data, identifying dominant topics within the queries, and extracting key entities. The end result provides insights into user behavior and their specific information needs.
How it was built
The project began by loading search query data from a CSV file using Pandas. The text was then cleaned by trimming whitespace, converting to lowercase, removing punctuation, and replacing digits with the placeholder “NUMBER”. Duplicates were removed so that each unique search session was counted only once. Topic modeling was performed using Gensim’s Latent Dirichlet Allocation (LDA), with the number of topics chosen by comparing coherence scores. Finally, the most relevant topic was identified for each query, and entities were extracted using spaCy’s Dutch language model.
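The cleaning and de-duplication steps can be sketched roughly as follows. This is an illustration rather than the project's actual code: it uses only the Python standard library instead of Pandas so that it runs anywhere, the function names `clean_query` and `clean_and_dedupe` are hypothetical, and it assumes that a run of consecutive digits becomes a single "NUMBER" token and that only ASCII punctuation is stripped.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)
_DIGITS = re.compile(r"\d+")   # assumption: one placeholder per run of digits
_SPACES = re.compile(r"\s+")


def clean_query(raw: str) -> str:
    """Trim, lowercase, strip punctuation, and mask digits in one query."""
    text = raw.strip().lower()
    text = text.translate(_PUNCT_TABLE)
    # Mask digits after lowercasing so the placeholder stays uppercase.
    text = _DIGITS.sub("NUMBER", text)
    # Collapse any whitespace left behind by removed characters.
    return _SPACES.sub(" ", text).strip()


def clean_and_dedupe(queries):
    """Clean every query, drop empty results, keep first occurrences in order."""
    cleaned = (clean_query(q) for q in queries)
    return list(dict.fromkeys(c for c in cleaned if c))
```

De-duplicating after cleaning, rather than before, means that queries differing only in case or punctuation collapse into a single entry.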
Technologies used
- Pandas: For data manipulation and analysis.
- Gensim: For topic modeling (LDA).
- Scikit-learn: For vectorizing the text data.
- Matplotlib: For visualizing coherence scores.
- pyLDAvis: For interactive topic visualization.
- spaCy: For Dutch-language entity extraction.