
Can pydictor find and remove non-UTF-8 characters? #34

Open
Privacy6484847 opened this issue Jan 8, 2022 · 19 comments
@Privacy6484847

Can pydictor find and remove non-UTF-8 characters?

@LandGrey
Owner

LandGrey commented Jan 9, 2022


pydictor does not currently include this function; you can use another tool, such as iconv, to do it.
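For reference, a minimal Python sketch of such a filter, mimicking what iconv -f utf-8 -t utf-8 -c does by dropping byte sequences that are not valid UTF-8 (a hypothetical helper, not part of pydictor):

```python
# Hypothetical helper (not part of pydictor): keep only valid UTF-8,
# similar to `iconv -f utf-8 -t utf-8 -c in.txt -o out.txt`.
def strip_non_utf8(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for raw_line in src:
            # errors="ignore" silently drops undecodable byte sequences
            dst.write(raw_line.decode("utf-8", errors="ignore"))

strip_non_utf8("test.txt", "clean_test.txt")
```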

@Privacy6484847
Author

Thanks for the answer.

I applied the iconv tool with this command: iconv -f utf-8 -t utf-8 -c test.txt -o clean_test.txt.
Then I used pydictor on clean_test.txt with: pydictor --len 6 20 -tool handler clean_test.txt -o super_clean_test.txt

But I got the following error:

File "E:\pydictor-2.0.5\pydictor.py", line 107, in <module> tool_parser() 
File "E:\pydictor-2.0.5\lib\parse\argsparse.py", line 104, in tool_parser get_handler_dic(pyoptions.args_tool[1]) 
File "E:\pydictor-2.0.5\tools\handler.py", line 24, in get_handler_dic for item in f.readlines(): 
File "C:\Python310\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
 UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3858: character maps to <undefined>
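The traceback shows the wordlist being read with Windows' default cp1252 codec rather than UTF-8, so even a cleaned UTF-8 file can fail on bytes that cp1252 cannot map. A minimal sketch of a decoding-tolerant read loop that avoids this (an assumption about the kind of fix, not pydictor's actual handler code):

```python
# Sketch of a decoding-tolerant read loop (assumed fix, not pydictor's actual handler code):
# open with an explicit encoding instead of the platform default (cp1252 on Windows)
# and drop any byte sequences that still fail to decode.
with open("clean_test.txt", "r", encoding="utf-8", errors="ignore") as f:
    for item in f:
        word = item.strip()
        # ... handle the word ...
```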

@LandGrey
Owner


I just added a printable-character filter tool to pydictor.
You can download the latest pydictor version (2.1.5.6) and use the command python pydictor.py --len 6 20 -tool printabler test.txt to get your wordlist.
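Presumably the printabler tool keeps only printable characters; the general idea is roughly this (an assumption about the behaviour, not pydictor's actual implementation):

```python
import string

# Rough sketch of a printable-character filter
# (assumption about what "printabler" does, not pydictor's actual code).
PRINTABLE = set(string.printable)

def printable_only(line):
    # keep only characters Python classifies as printable ASCII
    return "".join(ch for ch in line if ch in PRINTABLE)
```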

@Privacy6484847
Author

Thank you for your help.
Unfortunately, the issue is still occurring. You can test it with this file I found online: https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html

@LandGrey
Owner


I fixed the bug.
Please download pydictor version 2.1.5.7 and use the command python pydictor.py --len 6 20 -tool printabler test.txt.

@Privacy6484847
Author

Thank you for your effort.
Unfortunately, I got the same error.

[screenshot]

@LandGrey
Owner


Try version 2.1.5.8.

@Privacy6484847
Author

You did it! Amazing!!
Thank you so so much!! 💯

[screenshot]

@Privacy6484847
Author

Update: pydictor now works well with non-UTF-8 characters.
The issue I have now is that I hit a memory limit.

[screenshot]

@LandGrey
Owner


Try the latest version, 2.1.6.0.

@Privacy6484847
Author

Privacy6484847 commented Jan 15, 2022

I tried version 2.1.6.0 and got different behavior, but the same end result.
The memory behavior changed: instead of shooting straight up, memory rose slowly and steadily for about 5 minutes, then it eventually gave the memory error.

During the process:

[screenshot]

At the end of the process:

[screenshot]

[screenshot]

@LandGrey
Owner


That's caused by the in-memory "remove duplicate file lines while preserving order" step.
Your input files must be huge; I will consider a better way to fix it.
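Order-preserving de-duplication usually keeps every unique line it has seen in a set, so memory grows with the number of unique lines. The general pattern looks roughly like this (a sketch of the technique, not pydictor's exact code):

```python
# Sketch of order-preserving de-duplication (general technique, not pydictor's exact code).
# The `seen` set holds every unique line, so memory grows with the unique line count.
def unique_preserve_order(lines):
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line
```

With a 400 GB input, this set alone can easily exhaust RAM, which matches the steady memory growth described above.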

@LandGrey
Owner


Maybe you can try version 2.1.7.0.

@Privacy6484847
Author

Unfortunately, the same behavior. For info, the file size of this dictionary is 400 GB.

[screenshot]

[screenshot]

@LandGrey
Owner


Download the latest version, 2.1.7.1. I reduced the pyoptions.memory_unique_max_lines_count variable in pydictor's lib/data/data.py file to 10000000; just try again.
If you get the same behavior, you can reduce the variable further and try again.

@Privacy6484847
Author

OK. I'm experimenting with the variables. I would like to ask if there is a way for pydictor to show its progress as a percentage, or in lines processed. Any indication of the progress made by pydictor would be very helpful for troubleshooting, and also for benchmarking the performance effect of different variables :)
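A progress readout could be as simple as printing a running line count; something along these lines (an illustrative sketch only, not an existing pydictor feature):

```python
import sys

# Illustrative sketch of a line-count progress readout (not an existing pydictor feature):
# report every `every` lines so huge files show visible progress.
def process_with_progress(path, every=1_000_000):
    count = 0
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for count, line in enumerate(f, 1):
            # ... process the line ...
            if count % every == 0:
                sys.stderr.write(f"\rprocessed {count:,} lines")
    sys.stderr.write(f"\rprocessed {count:,} lines\n")
```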

@Privacy6484847
Author

So, after a lot of testing: reducing the pyoptions.memory_unique_max_lines_count variable lowers performance dramatically, and it just makes the error take longer to appear.
I even tried with a relatively small file of 30 GB and gave Windows a 60 GB paging file; pydictor still ended up with a memory error.
Maybe instead of relying on the paging file, pydictor should incrementally save the processed data directly to the hard drive (see the sketch below). :)
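One common way to do that is external de-duplication: hash-partition lines into temporary files on disk so that each partition fits in memory, then de-duplicate each partition on its own. A rough sketch of the idea (illustrative only, not pydictor code; note it does not preserve the original line order):

```python
import os
import tempfile

# Rough sketch of disk-backed de-duplication (illustrative only, not pydictor code).
# Lines are hash-partitioned into bucket files so only one bucket's unique lines
# are ever held in memory; the trade-off is that the original order is not kept.
def dedupe_on_disk(src_path, dst_path, buckets=256):
    tmp_dir = tempfile.mkdtemp()
    bucket_files = [open(os.path.join(tmp_dir, f"bucket_{i}.txt"), "w+", encoding="utf-8")
                    for i in range(buckets)]
    with open(src_path, "r", encoding="utf-8", errors="ignore") as src:
        for line in src:
            if not line.endswith("\n"):
                line += "\n"  # normalize so bucket lines never run together
            bucket_files[hash(line) % buckets].write(line)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for bucket in bucket_files:
            bucket.seek(0)
            seen = set()  # only this bucket's unique lines live in memory
            for line in bucket:
                if line not in seen:
                    seen.add(line)
                    dst.write(line)
            bucket.close()
```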

@tl123987

Could you add an option to limit the number of lines generated for the dictionary?

@tl123987

[screenshot]

Author, why am I getting an error when I write my own rules?
