# TEXT CLUSTERING

A pipeline for clustering tweets.

## Overall Pipeline for Cluster Extraction

1. Convert tweet text to categories using the Gemma model (a sketch follows this list):
   Takes about 7 hours for 40,000 tweets.
2. Convert the categories to embedding vectors using Jina:
   Takes about 3 minutes.
3. Perform clustering with K-Means (see the second sketch after this list):
   Choose the number of clusters with the highest silhouette score among 20–60 groups.
   Takes about 5 minutes.
4. Name the clusters using the Gemma model:
   Takes about 1 minute.
5. Cluster the generated names using K-Means and group similar names together:
   Takes about 1 minute.
6. Use GPT o3 to merge and refine cluster names (a sketch follows the note on step 5 below):
   Give GPT the list of cluster names and ask it to build new, higher-level clusters.
   Takes about 1 minute.
7. Assign each topic to its final cluster using the Gemma model:
   Takes about 7 hours.
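
This README doesn't show the code for step 1, but it boils down to one prompted generation per tweet. A minimal sketch, assuming a Gemma instruct checkpoint served through Hugging Face `transformers`; the model name and prompt wording here are illustrative assumptions, not the pipeline's actual code:

```python
# Sketch of step 1: tweet text -> short category label.
# Checkpoint and prompt are assumptions; the real pipeline may differ.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-9b-it",  # assumed Gemma checkpoint
    device_map="auto",
)

def tweet_to_category(tweet: str) -> str:
    """Ask the model for a short topic label for one tweet."""
    messages = [{
        "role": "user",
        "content": f"Describe the topic of this tweet in 2-4 words:\n{tweet}",
    }]
    out = generator(messages, max_new_tokens=16, do_sample=False)
    # The chat pipeline returns the whole conversation; the reply is last.
    return out[0]["generated_text"][-1]["content"].strip()
```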
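
Steps 2 and 3 map directly onto `sentence-transformers` and scikit-learn. A sketch of the embedding plus the silhouette sweep, assuming the `jinaai/jina-embeddings-v3` checkpoint (the exact Jina model isn't named in this README):

```python
# Sketch of steps 2-3: embed category strings, then sweep K = 20..60
# with K-Means and keep the K that maximizes the silhouette score.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def embed_categories(categories: list[str]):
    model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
    return model.encode(categories)

def choose_k(embeddings, k_min: int = 20, k_max: int = 60):
    """Return (best_k, labels) for the K with the highest silhouette score."""
    best_score, best_k, best_labels = -1.0, k_min, None
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_score, best_k, best_labels = score, k, labels
    return best_k, best_labels
```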
Reason for step 5:
If I had given the list of names directly to step 6, GPT wouldn't have performed well.
By first clustering similar names (step 5), the input to GPT became more organized,
which made step 6 much more effective.
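
Step 6 is then a single LLM call over that organized list. A minimal sketch with the OpenAI Python client, assuming the `o3` model identifier and an illustrative prompt (the actual prompt isn't shown in this README):

```python
# Sketch of step 6: ask GPT o3 to merge the pre-grouped cluster names
# (the output of step 5) into higher-level clusters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def merge_cluster_names(grouped_names: list[list[str]]) -> str:
    listing = "\n".join(", ".join(group) for group in grouped_names)
    response = client.chat.completions.create(
        model="o3",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": "Each line below is a group of similar cluster names. "
                       "Merge them into a smaller set of higher-level clusters "
                       "and name each one:\n" + listing,
        }],
    )
    return response.choices[0].message.content
```
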
## How to extract main clusters

Give the command below an Excel file that has a column named "tweet".
Overall it takes about 15 hours for 40,000 tweets.

```bash
python3 clustering_pipeline.py --input_file tweets_file.xlsx --output_file tweets_file_cluster.xlsx
```
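
Before committing to the roughly 15-hour run, you can sanity-check that the input file has the expected column (a small sketch, assuming `pandas` with Excel support installed):

```python
import pandas as pd

df = pd.read_excel("tweets_file.xlsx")
assert "tweet" in df.columns, "input file needs a 'tweet' column"
print(len(df), "tweets loaded")
```
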
## How to extract sub clusters

First run the command above; it produces an Excel file with "topic" and "cluster_llm" columns. Then run:

```bash
python3 sub_clustering_pipeline.py --input_file tweets_file_cluster.xlsx --output_file tweets_file_sub_cluster.xlsx
```
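
Likewise, you can verify that the first stage produced the columns this script needs before launching it:

```python
import pandas as pd

df = pd.read_excel("tweets_file_cluster.xlsx")
missing = {"topic", "cluster_llm"} - set(df.columns)
assert not missing, f"run clustering_pipeline.py first; missing columns: {missing}"
```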