embedding_model/notes.txt

** DATASETS for training

1-passage score does not have to be binary (0 or 1). it can be graded between 0 and 1 (0, 0.25, 0.5, 0.75, 1) : we can get this score from an llm and use it in the loss calculation.

2-dataset needs preprocessing : use an llm to remove mislabeled negative or positive passages.

3-miracl dataset: question = 2107 - passage = 21844 : some negative passages can actually be related to the query

4-cross-lingual dataset can be useful : query = first language - passage = second language

5-swim-ir dataset : they have passages and have generated the queries from them : quality is very poor for persian

6-parsinlu dataset: question = 600 - passage = 600 : all are positive

7-persianqa dataset: question = 6306 - passage = fewer than queries : every passage has multiple queries : all are positive - be careful, some queries are impossible to answer

8-pquad dataset: question = 48273 - passage = 10082 : every passage has multiple queries : all are positive - be careful, some queries are impossible to answer

9-longragfa dataset: long documents and queries, meant for evaluation : question = 250, passage = 1500 : not using

10-Synthetic-persian-qa-retrieval dataset : question = 223423, passage = 250000 : negative passages are not always clearly different from the positive ones : needs preprocessing
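
sketch for notes 1 and 2 (graded scores in the loss, llm-based cleanup of negatives). this is only a rough illustration, not the actual training code. assumptions: `judge` is a hypothetical callable wrapping whatever llm we end up using and returning a raw relevance score; the 0.25 cutoff for keeping a negative and the `temperature` parameter are placeholders; soft-label binary cross-entropy is one possible way to use graded scores, other losses may fit better.

```python
import math
from typing import Callable, List, Sequence

# Five-level grading scale from note 1.
GRADES = (0.0, 0.25, 0.5, 0.75, 1.0)


def snap_to_grade(score: float) -> float:
    """Clamp a raw score to [0, 1] and round it to the nearest grade."""
    clamped = min(1.0, max(0.0, score))
    return min(GRADES, key=lambda g: abs(g - clamped))


def filter_negatives(
    query: str,
    negatives: Sequence[str],
    judge: Callable[[str, str], float],  # hypothetical llm wrapper
    max_grade: float = 0.25,             # assumed cutoff, tune per dataset
) -> List[str]:
    """Keep only negatives that the judge grades at or below max_grade."""
    return [p for p in negatives if snap_to_grade(judge(query, p)) <= max_grade]


def graded_bce_loss(
    similarities: Sequence[float],
    grades: Sequence[float],
    temperature: float = 1.0,  # assumed scaling of the similarity score
) -> float:
    """Mean binary cross-entropy with soft (graded) targets.

    Each similarity is treated as a logit; the numerically stable form
    max(z, 0) - z * y + log(1 + exp(-|z|)) is used.
    """
    if len(similarities) != len(grades) or not similarities:
        raise ValueError("need equally sized, non-empty inputs")
    total = 0.0
    for sim, y in zip(similarities, grades):
        z = sim / temperature
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(similarities)
```

with binary labels this reduces to ordinary binary cross-entropy, so the existing all-positive datasets (parsinlu, persianqa, pquad) can be mixed with llm-graded ones without changing the loss.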