2025-10-23 12:19:38 +03:30

TEXT CLUSTERING

A pipeline for clustering tweets

Overall Pipeline for Cluster Extraction

1. Convert tweet text to categories using the Gemma model:
Takes about 7 hours for 40,000 tweets.

2. Convert categories to embedding vectors using Jina:
Takes about 3 minutes.

3. Perform clustering with K-Means:
Choose the number of clusters with the highest silhouette score among 2060 groups.
Takes about 5 minutes.

4. Name the clusters using the Gemma model:
Takes about 1 minute.

5. Cluster the generated names using K-Means and group similar names together:
Takes about 1 minute.

6. Use GPT O3 to merge and refine cluster names:
Provide GPT with the list of cluster names and ask it to build new, higher-level clusters.
Takes about 1 minute.

7. Assign each topic to its final cluster using the Gemma model:
Takes about 7 hours.
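
The model selection in step 3 can be sketched as follows. This is a minimal, standard-library illustration and not the repository's code: the real pipeline presumably runs a library K-Means on the Jina embeddings, while `kmeans`, `silhouette`, and `best_k` here are hypothetical helpers, the farthest-point initialization is a substitution chosen to keep the run deterministic, and the candidate range in the usage lines is arbitrary.

```python
import math

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with farthest-point initialization (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidates):
    """Return the candidate k with the highest silhouette score, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy data standing in for embedding vectors: three tight, well-separated blobs.
offsets = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]
k, scores = best_k(points, range(2, 6))
print(k)  # 3
```

Every candidate value of k costs one full clustering run plus a pairwise-distance pass for the score, which is why the search range dominates the runtime of this step.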

Reason for step 5:
If I had given the list of names directly to step 6, GPT wouldn't have performed well.
By first clustering similar names (step 5), the input to GPT became more organized,
which made step 6 much more effective.
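
To make the hand-off from step 5 to step 6 concrete, here is a small sketch of how names grouped by K-Means could be laid out for the model. The helper names and the prompt wording are assumptions made for illustration; the actual prompt sent to GPT is not given in this README.

```python
from collections import defaultdict

def group_names(names, labels):
    """Collect cluster names that share a K-Means label, ordered by label."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)
    return [groups[label] for label in sorted(groups)]

def build_prompt(groups):
    """Render the groups as one line each (wording is hypothetical)."""
    lines = [
        "Merge the cluster names below into a smaller set of higher-level clusters.",
        "Names on the same line were grouped as similar.",
    ]
    for i, group in enumerate(groups, start=1):
        lines.append(f"Group {i}: " + "; ".join(group))
    return "\n".join(lines)

# Example with made-up names and labels.
names = ["Football results", "Fuel prices", "Premier League", "Inflation"]
labels = [0, 1, 0, 1]
print(build_prompt(group_names(names, labels)))
```

Presenting related names side by side lets the model see near-duplicates together instead of having to find them in one long unordered list.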

How to extract the main clusters

Provide an Excel file with a column named "tweet" to the command below. Overall, it takes about 15 hours for 40,000 tweets.

python3 clustering_pipeline.py --input_file tweets_file.xlsx --output_file tweets_file_cluster.xlsx
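
The 15-hour figure can be checked against the per-step durations listed in the pipeline overview. The snippet below only adds up those stated approximations, assuming the steps run one after another; it does not measure anything.

```python
from datetime import timedelta

# Approximate durations as stated in the overview (for 40,000 tweets).
STEP_DURATIONS = [
    ("1. tweets -> categories (Gemma)", timedelta(hours=7)),
    ("2. categories -> embeddings (Jina)", timedelta(minutes=3)),
    ("3. K-Means with silhouette search", timedelta(minutes=5)),
    ("4. name clusters (Gemma)", timedelta(minutes=1)),
    ("5. cluster the names (K-Means)", timedelta(minutes=1)),
    ("6. merge names (GPT O3)", timedelta(minutes=1)),
    ("7. assign final clusters (Gemma)", timedelta(hours=7)),
]

total = sum((duration for _, duration in STEP_DURATIONS), timedelta())
# Share of the runtime spent in the two long Gemma passes (steps 1 and 7).
gemma_share = (STEP_DURATIONS[0][1] + STEP_DURATIONS[-1][1]) / total

print(total)                 # 14:11:00
print(f"{gemma_share:.0%}")  # 99%
```

The sum comes to a little over 14 hours, consistent with the rounded 15-hour estimate, and nearly all of it is spent in steps 1 and 7.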

How to extract sub-clusters

First run the command above, which produces an Excel file with "topic" and "cluster_llm" columns. Then pass that file to the command below.

python3 sub_clustering_pipeline.py --input_file tweets_file_cluster.xlsx --output_file tweets_file_sub_cluster.xlsx 
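
Because each run takes many hours, it can be worth confirming that the input file has the expected columns before starting. The helper below is a hypothetical pre-flight check, not part of the repository; it takes the header row as a plain list, and reading that row from the Excel file (for example with pandas or openpyxl) is left out to keep the sketch dependency-free.

```python
# Column names taken from the usage notes above.
REQUIRED_COLUMNS = {
    "clustering_pipeline.py": ["tweet"],
    "sub_clustering_pipeline.py": ["topic", "cluster_llm"],
}

def missing_columns(script, header):
    """Return the required columns for `script` that are absent from `header`."""
    present = {str(name).strip() for name in header}
    return [col for col in REQUIRED_COLUMNS[script] if col not in present]

# Example: a file that was never run through the main pipeline.
print(missing_columns("sub_clustering_pipeline.py", ["tweet"]))
# ['topic', 'cluster_llm']
```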