[initial commit] data extraction with example data folder
commit 2c66a42a7b
File diff suppressed because one or more lines are too long
@@ -0,0 +1,78 @@
Festivals Special to Women

Among China's ethnic minorities there are many festivals that beautifully display the beauty and wisdom of women.

The Women's Festival of the Hong Dong People

Every year, the eighth day of the fourth lunar month is the women's festival of the Hong Dong people. It is said that this festival originated with the Yang family of the Dong people. In olden times, the brother of "Yang Bamei" (Yang bamei), a heroine of the Dong women, was imprisoned after the defeat of the peasant uprising he had led. One day Yang Bamei went to the prison to visit her brother. She saw that his face was very pale, and asked him whether any courage and bravery still remained in him. He replied: I have the courage, but since I have never had enough to eat in prison, I am always hungry. After returning home, Yang Bamei cooked dishes of rice and other foods and took them to her brother on the eighth day of the fourth lunar month. After eating the food he regained his courage and daring. He immediately took out his hidden weapon and, with the help of the people outside the prison, escaped; they occupied the city of Liuzhou and rejoiced at this victory. From that date on, the women of the Dong people cook food every year on this day and distribute it to their relatives and friends, so that all may share in the joy of the victory over the tyrannical rulers. For this reason the day has been named the women's festival of the Dong people.

The Goddess Festival of the Mosuo People

The Mosuo people regard Mount Gemu (Gemu) as the embodiment of a goddess. Offering sacrifices to the god of Mount Gemu has a history of more than 1,000 years. Every year on the 25th day of the 7th lunar month, the Mosuo make offerings to the goddess and call that day a magnificent festival.

In the morning the Mosuo prepare themselves for the sacrifice to the goddess. When the monks appear on the road riding horses, the Mosuo follow them to the foot of the mountain. To the sound of Buddhist scriptures being recited, the Mosuo crowds bow their heads to the ground before the altar, burn incense in the goddess's temple, and present the sacrificial offerings. Then they sit on the ground in the meadow, light fires, and prepare tea and food.

The "Phoenix Dance" carries the crowds into an atmosphere of joy and happiness. Mosuo girls in beautiful dress, flowers in hand, move back and forth among the "phoenixes" and "lions". At sunset the boys and girls join hands and, singing, give themselves joyfully to the dance.

This day is the carnival celebration of Mosuo women and men.

The Heavenly Woman Festival of the Nu People

The Heavenly Woman Festival is a traditional festival of the Nu people. It is also known as the Flower Festival, and the people celebrate it every year on the fifteenth day of the third lunar month.

On that day the Nu people go to a cave to offer flowers to the heavenly woman. It was said that a beautiful girl named Arong (Arong) stretched a rope-way across the Nu (Nu) river and drew sweet water for the people of this region. Then, to avoid marriage to the chieftain, she fled and hid in a cave, where she later turned into a stone statue. This happened on that very fifteenth day of the third lunar month. In memory of this wise, capable and strong girl, the people of the Nu nation make offerings to her every year on this day. On this day women and men, the old and the children, all go to this cave and offer flowers to the stalactite that symbolizes the heavenly woman, wish one another health and happiness, and then give themselves to drinking, eating and merriment. On the following day, boat-rowing displays and shooting contests are held.

The Sisters' Festival of the Miao People

Every year, starting on the fifteenth day of the third lunar month, the Sisters' Festival, a traditional festival of the Miao people who live along the Qingshui (Ching shui) river in Guizhou province, is held for three days. Nowadays this festival is no longer only a festival of Miao women; it is a festival shared by all the Miao living along the Qingshui river. On the day before the festival, each Miao girl must prepare the "sisters' meal" and gather wild flowers and wild fruit from the mountain. A boy who has come from afar is expected to help the girl collect the wild flowers and fruit, so that on this pretext the two can become acquainted. The most distinctive feature of the Sisters' Festival is eating the sisters' meal. The sisters' meal is prepared as follows: the girl gathers wild flowers and fruit of various colors, steeps them in water, and then dyes glutinous rice with them, so that the sisters' meal turns red, yellow and other colors and takes on a fine color and fragrance. On the day of the festival, the girl puts the sisters' meal in a handkerchief or a basket and presents it to the boy accompanying her. If two red chopsticks are placed on the sisters' meal, it shows that the girl loves the boy and wishes to be his companion. If pepper and onion are placed on the sisters' meal, it shows that the girl does not love that boy; the boy must leave her and look for someone else. If tree leaves and pine needles are placed on the sisters' meal, it shows that the boy should remain hopeful; in that case the boy should give the girl a present and build a closer bond with her.

After eating the sisters' meal, the girls and boys watch bullfighting and cockfighting shows together and join in singing and dancing. For this reason, this festival is counted among the joyful, festive occasions of the young people of the Miao nation.

Source: Chinese newspapers, 6 March 2009
BIN
data_example/raw_date/12-5شخصیت های واقع گرایانه فیلم.docx
Normal file
Binary file not shown.
BIN
data_example/raw_date/اعیاد مخصوص زنان18-1.pdf
Normal file
Binary file not shown.
298
extract_doc_files.py
Normal file
@@ -0,0 +1,298 @@
import mammoth
from pathlib import Path
from bs4 import BeautifulSoup
import json
import re
from datetime import datetime


ROOT_PATH = Path(__file__).parent
DATA_DIR_PATH = ROOT_PATH / "data"
CLEANED_DIR_PATH = ROOT_PATH / "cleaned_dir"


def detect_heading(elem):
    # A paragraph counts as a heading when its only content is a single
    # <strong>/<b> run, i.e. the bold text equals the full paragraph text.
    if elem.name != "p":
        return False

    strongs = elem.find_all(["strong", "b"])
    if len(strongs) != 1:
        return False

    strong_text = strongs[0].get_text(" ", strip=True)
    full_text = elem.get_text(" ", strip=True)

    return strong_text == full_text
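
# A quick check of the heuristic on made-up markup (hypothetical sample,
# not repository data):
#
#   from bs4 import BeautifulSoup
#   sample = BeautifulSoup(
#       "<p><strong>Chapter One</strong></p><p>Body <strong>bold</strong> text.</p>",
#       "html.parser",
#   )
#   paras = sample.find_all("p")
#   detect_heading(paras[0])  # True: the bold run is the whole paragraph
#   detect_heading(paras[1])  # False: bold mixed with plain text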


def get_element_metadata(elem):
    metadata = {
        "tag": elem.name,
        "classes": elem.get("class", []),
        "id": elem.get("id", "")
    }

    if detect_heading(elem):
        metadata["content_type"] = "heading"
    elif elem.name == "p":
        # reaching this branch already implies the paragraph is not a heading
        metadata["content_type"] = "paragraph"

    return metadata


def merge_consecutive_paragraphs(elements):
    # Collapse a chapter's paragraph elements into one merged text block.
    if not elements:
        return []

    texts = [elem.get("text", "") for elem in elements if elem.get("text")]
    if not texts:
        return []

    combined = "\n".join(texts)
    merged = {
        "text": combined,
        "metadata": {"content_type": "merged_paragraph"},
        "element_type": "content"
    }
    return [merged]


def extract_book_structure(soup: BeautifulSoup, input_file: Path):
    farsi_pattern = re.compile(r"[\u0600-\u06FF]+")
    book_data = {
        "document_info": {
            "title": "",
            "source_file": str(input_file),
            "extraction_date": datetime.now().isoformat(),
            "total_chapters": 0
        },
        "chapters": []
    }

    all_elem = soup.find_all(['p'])

    # keep only paragraphs that contain at least one Arabic-script (Farsi) character
    filtered_elem = [
        elem for elem in all_elem
        if elem.get_text(strip=True) and farsi_pattern.search(elem.get_text(strip=True))
    ]

    current_chapter = None

    for elem in filtered_elem:
        text = elem.get_text(" ", strip=True)
        metadata = get_element_metadata(elem)

        # If this element is detected as heading -> start new chapter
        if metadata.get("content_type") == "heading":
            # finalize previous chapter (merge content) if exists
            if current_chapter is not None:
                # merge paragraph content (only if there are elements)
                if current_chapter.get("chapter_content"):
                    current_chapter["chapter_content"] = merge_consecutive_paragraphs(
                        current_chapter["chapter_content"]
                    )
                book_data["chapters"].append(current_chapter)

            # start a new chapter
            current_chapter = {
                "chapter_title": text,
                "chapter_metadata": metadata,
                "chapter_number": len(book_data["chapters"]) + 1,
                "chapter_content": []
            }
            continue

        # Otherwise it's a paragraph element
        element_data = {
            "text": text,
            "metadata": metadata,
            "element_type": "content"
        }

        # If we have a current chapter, append; else create an "Introduction" chapter
        if current_chapter:
            current_chapter["chapter_content"].append(element_data)
        else:
            # create a default intro chapter to hold leading paragraphs
            current_chapter = {
                "chapter_title": "Introduction",
                "chapter_metadata": {"generated": True},
                "chapter_number": len(book_data["chapters"]) + 1,
                "chapter_content": [element_data]
            }

    # finalize the last open chapter
    if current_chapter is not None:
        if current_chapter.get("chapter_content"):
            current_chapter["chapter_content"] = merge_consecutive_paragraphs(
                current_chapter["chapter_content"]
            )
        book_data["chapters"].append(current_chapter)

    book_data["document_info"]["total_chapters"] = len(book_data["chapters"])
    return book_data
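
# For orientation, the structure returned above looks roughly like this
# (values are illustrative only, not real output):
#
# {
#   "document_info": {
#     "title": "",
#     "source_file": "data/sample.docx",
#     "extraction_date": "2009-03-06T12:00:00",
#     "total_chapters": 1
#   },
#   "chapters": [
#     {
#       "chapter_title": "...",
#       "chapter_metadata": {"tag": "p", "classes": [], "id": "", "content_type": "heading"},
#       "chapter_number": 1,
#       "chapter_content": [
#         {
#           "text": "paragraph one\nparagraph two",
#           "metadata": {"content_type": "merged_paragraph"},
#           "element_type": "content"
#         }
#       ]
#     }
#   ]
# }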


def process_one_docx(input_file: Path, output_file: Path, verbose=False):
    try:
        if verbose:
            print(f"Processing file: {input_file}")

        with open(input_file, 'rb') as docx_file:
            result = mammoth.convert_to_html(docx_file)
            html = result.value
            soup = BeautifulSoup(html, 'html.parser')

        book_structure = extract_book_structure(soup, input_file)

        if verbose:
            print(f"Saving to output file: {output_file}")

        with open(output_file, "w", encoding="utf-8") as out_file:
            json.dump(book_structure, out_file, ensure_ascii=False, indent=2)

        return True

    except FileNotFoundError as e:
        raise FileNotFoundError(f"File not found - {e}") from e
    except PermissionError as e:
        raise PermissionError(f"Permission denied - {e}") from e
    except UnicodeDecodeError as e:
        # UnicodeDecodeError needs five positional arguments, so re-raise as
        # ValueError instead of building a new UnicodeDecodeError from a string
        raise ValueError(f"Unable to decode file; try a different encoding - {e}") from e
    except Exception as e:
        raise RuntimeError(f"An unexpected error occurred - {e}") from e
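
# Note: mammoth can also promote Word heading styles to real <h1>/<h2> tags
# through its documented style-map option, which would make the bold-text
# heuristic in detect_heading() unnecessary for well-styled documents.
# A minimal sketch (assuming the source uses the built-in "Heading 1" style):
#
#   style_map = "p[style-name='Heading 1'] => h1:fresh"
#   result = mammoth.convert_to_html(docx_file, style_map=style_map)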


def process_all_files(raw_dir: Path, cleaned_dir: Path):
    Path(cleaned_dir).mkdir(parents=True, exist_ok=True)

    docx_files = list(raw_dir.glob("*.docx"))

    if not docx_files:
        print(f"No .docx files found in directory: {raw_dir}")
        return 0

    print(f"Found {len(docx_files)} .docx files in directory: {raw_dir}")

    for docx_file in docx_files:
        json_file = cleaned_dir / f"{docx_file.stem} extracted.json"

        print(f"Converting {docx_file} to {json_file}")

        process_one_docx(docx_file, json_file, verbose=True)

    print(f"All done. Processed {len(docx_files)} .docx files.")


if __name__ == "__main__":
    process_all_files(DATA_DIR_PATH, CLEANED_DIR_PATH)


# from docx import Document
# from datetime import datetime
# from pathlib import Path
# import re
# import json


# def process_single_docx(input_file: Path, output_file: Path, verbose=False):
#     # process a single docx file
#     try:
#         if verbose:
#             print(f"Loading docx file: {input_file}")

#         doc = Document(input_file)

#     except Exception as e:
#         print(f"Error loading docx file {input_file}: {e}")


# def show_docx_props(doc: Document):

#     props = doc.core_properties

#     print(props.author)
#     print(props.title)
#     print(props.created)
#     print(props.last_modified_by)
#     print(props.subject)
#     print(props.keywords)


# files = [file for file in Path(".").glob("*.docx")]
# for file in files:
#     doc = Document(file)
#     print(file)
#     print(f"Properties for file: {file}")
#     show_docx_props(doc)
#     print("-" * 40)

#     for section in doc.sections:

#     for para in doc.paragraphs:
#         print(para.text, para.style.name)
#         for run in para.runs:
#             data.append({
#                 "text": run.text,
#                 "bold": run.bold,
#                 "italic": run.italic,
#                 "under_line": run.underline
#             })

# print(data)


# import mammoth
# from pathlib import Path

# files = [file for file in Path(".").glob("*.docx")]
# for file in files:
#     print(file)

#     with open(file, "rb") as docx_file:
#         result = mammoth.convert_to_html(docx_file)
#         html = result.value

#     filepath = f"{file.stem}.html"
#     with open(filepath, "w", encoding="utf-8") as html_file:
#         html_file.write(html)


# from docx import Document
# import re
# from pathlib import Path


# def paragraph_with_styles(doc: Document):
#     out = []
#     for i, para in enumerate(doc.paragraphs):
#         style = None

#         style = para.style.name

#         out.append({
#             "index": i,
#             "text": para.text,
#             "style": style
#         })

#     return out


# files = [file for file in Path(".").glob("*.docx")]
# for file in files:
#     doc = Document(file)

#     p = paragraph_with_styles(doc)
#     for item in p:
#         print(item["index"], item["style"], item["text"])
155
extract_pdf_files.py
Normal file
@@ -0,0 +1,155 @@
from pathlib import Path
from PyPDF2 import PdfReader
import pymupdf as pm


ROOT_PATH = Path(__file__).parent
DATA_PATH = ROOT_PATH / "data"
OUTPUT_PATH = ROOT_PATH / "output"


def pdf_is_readable(input_file):
    # A PDF counts as readable when any page yields extractable text;
    # scanned (image-only) PDFs usually yield none and need OCR instead.
    reader = PdfReader(input_file)
    for page in reader.pages:
        text = page.extract_text()
        if text and text.strip():
            return True
    return False
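
# Hypothetical OCR fallback (an assumption, not wired into the pipeline):
# when pdf_is_readable() returns False, PyMuPDF's Tesseract-backed OCR can
# rasterize each page and rebuild a searchable PDF, along the lines of the
# commented experiment further down this file. Requires a local Tesseract
# install; the "fas" (Farsi) language code is an assumed choice.
def ocr_unreadable_pdf(input_file, output_file, language="fas"):
    src = pm.open(input_file)
    res = pm.open()  # new, empty output document
    for page in src:
        pix = page.get_pixmap()  # render the page to an image
        pdfbytes = pix.pdfocr_tobytes(language=language)  # OCR -> one-page PDF
        res.insert_pdf(pm.open("pdf", pdfbytes))
    res.save(output_file)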


# def read_pdf_file(input_file):

#     reader = PdfReader(input_file)
#     pages = reader.pages
#     print(len(pages), type(pages))
#     page0 = pages[0]
#     text = page0.extract_text()
#     print(text)

#     with open("output.txt", "w", encoding="utf-8") as file:
#         file.write(text)

def process_one_file(input_file):
    # Scanned PDFs with no text layer are skipped (they would need OCR first).
    if not pdf_is_readable(input_file):
        return None

    doc = pm.open(input_file)

    all_text = ""
    for page in doc:
        text = page.get_text("text")
        all_text += text + "\n"

    return all_text
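
# page.get_text() also supports richer modes: "blocks" returns positioned
# text blocks, and "dict" exposes per-span font and flag data (the commented
# experiments at the bottom of this file use it to hunt for bold spans).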


def process_all_files(input_dir, output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)

    # bucket the input files by extension
    files = {}
    for file in input_dir.iterdir():
        ext = file.suffix.replace(".", "")

        if ext not in files:
            files[ext] = []

        files[ext].append(file)

    for file in files.get("pdf", []):  # .get() avoids a KeyError when no PDFs exist
        file_text = process_one_file(file)

        if file_text is None:  # unreadable (likely scanned) PDF, skip it
            continue

        output_file = output_dir / f"{file.stem} extracted.txt"

        with open(output_file, "w", encoding="utf-8") as out_file:
            out_file.write(file_text)


# src = pm.open("ocr_needed_sample.pdf")
# res = pm.open()

# for page in src:
#     pix = page.get_pixmap()
#     pdfbytes = pix.pdfocr_tobytes(language="eng")
#     imgpdf = pm.open("pdf", pdfbytes)
#     res.insert_pdf(imgpdf)

# res.save("exported-document.pdf")


if __name__ == "__main__":
    process_all_files(DATA_PATH, OUTPUT_PATH)


# file = files["pdf"][7]
# if not pdf_is_readable(file):
#     print("file is not readable")
# # print(pdf_is_readable("ocr_needed_sample.pdf"))
# print(file)
# # read_pdf_file(file)
# all_text = ""
# doc = pm.open(file)
# all_text = ""
# for page in doc:
#     for block in page.get_text("dict")["blocks"]:
#         print(block)
#     print()
#     print()

# # for page in doc:
# #     text = page.get_text("text")
# #     all_text += text + "\n"
# # with open("output.txt", "w", encoding="utf-8") as file:
# #     file.writelines(all_text)

# all_spans = []

# for page in doc:

#     spans = [
#         {
#             "text": span["text"],
#             "flags": span["flags"],
#             "page": page.number + 1
#         }
#         for block in page.get_text("dict")["blocks"] if block.get("lines")
#         for line in block["lines"]
#         for span in line["spans"]
#     ]

#     all_spans.extend(spans)

# for s in all_spans:
#     if s["flags"] > 4:
#         print(s)


# with open("output.txt", "w", encoding="utf-8") as file:
#     file.writelines(all_text)

# blocks = page.get_text("blocks")  # for larger text blocks

# texts = []
# # Extract detailed info with font
# for block in page.get_text("dict")["blocks"]:
#     for line in block.get("lines", []):
#         for span in line["spans"]:
#             text = span["text"]
#             font = span["font"]  # font name
#             size = span["size"]  # font size
#             flags = span["flags"]
#             texts.append({
#                 "text": text, "font": font, "size": size, "flags": flags
#             })

# for elem in texts:
#     print(elem)