Getting wrong URL when there is a dot before the URL #36
Comments
Thanks for reporting this issue. I see the problem, but it is not that easy to solve. The main reason is that the string is not typographically correct: there should be a space after the dot (or the three dots). I will keep it in mind when I make some changes. Honestly, I will try to solve other issues first and then this one. Or, if you have some free time, you can send me a pull request with a fix. I would appreciate any help!
I have a rather inelegant solution which uses a regex to remove runs of more than one consecutive dot. Would this be an acceptable fix? If so, I will send a pull request.
Well, it depends: where exactly does the regexp remove the dots? Globally removing runs of two or more dots is not a general solution and may give incorrect results.
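For reference, a minimal sketch of the kind of regex pre-processing discussed above (the function name and exact substitution are my own, not part of URLExtract). It collapses runs of two or more dots into a dot plus a space, so a URL glued to a word by an ellipsis becomes a separate token. As noted, applying this globally can corrupt text where consecutive dots are legitimate, so it is a workaround rather than a general fix:

```python
import re

def collapse_dot_runs(text: str) -> str:
    """Replace runs of two or more dots with '. ' so that a URL glued
    to the preceding word by an ellipsis becomes a separate token.

    Caveat: this also rewrites any other legitimate run of dots in the
    input, which is why it is not a general solution.
    """
    return re.sub(r"\.{2,}", ". ", text)

print(collapse_dot_runs("My name is claim...https://t.co/SZlazvFzYx"))
# -> "My name is claim. https://t.co/SZlazvFzYx"
```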
I have run into many incorrectly extracted URLs because of this issue. What's more, the dot is not the only problem; the at sign, colon, plus sign, etc. cause it too. With the following input...
And there might be more.
Any updates on this matter?
@nicolasassi Not right now. However, if somebody has an idea how to solve this issue properly, feel free to start a discussion or create a PR so we can discuss the actual code. @nicolasassi What I am referring to is that I do not have a good way to determine where the domain starts when there is no whitespace. I cannot figure out any good general solution. But if you know what you are parsing, you can pre-process the text, or you can try to use … Do you think it would help in your case?
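As one concrete illustration of the pre-processing idea (my own sketch, not URLExtract code, and it only helps when the URL carries an explicit http(s) scheme): insert a space wherever a scheme is glued directly to preceding text, then run the extractor on the cleaned string.

```python
import re

def split_glued_schemes(text: str) -> str:
    """Insert a space before any http:// or https:// that is glued
    directly to preceding non-whitespace text.

    Only helps for URLs with an explicit scheme; bare domains such as
    'www.example.com' glued to other text are left untouched.
    """
    return re.sub(r"(?<=\S)(?=https?://)", " ", text)

print(split_glued_schemes("My name is claim...https://t.co/SZlazvFzYx"))
# -> "My name is claim... https://t.co/SZlazvFzYx"
```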
@lipoja This is a tricky problem indeed... In my opinion, pre-processing the text or using … is the way to go for now. I'm going to put some thought into this problem... maybe we can find a way to fix it in the parser itself.
I am looking at this library as a tool that returns as many domains as it finds, even when they are "wrong", and there needs to be some post-processing. We are trying to cover all general issues without limiting the number of returned results. Of course the results should be correct if possible, but I would rather return a domain that contains other text (e.g. website+www.example.com) than return nothing at all. That way users can at least see what was found and tune their parser or do some filtering. But I would really appreciate any help in any form (discussion of ideas, PRs, ...). Thank you!
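In that spirit, here is a hypothetical post-processing step a user could apply to the returned candidates (the function name and logic are mine, not part of the library's API): if a candidate contains an explicit scheme somewhere after the start, drop whatever was glued in front of it; scheme-less candidates are kept as-is for further user-side filtering.

```python
def strip_glued_prefix(candidate: str) -> str:
    """If an extracted candidate contains an explicit scheme after
    position 0, drop the text glued in front of it.

    Scheme-less candidates (e.g. 'website+www.example.com') are
    returned unchanged so the caller can apply their own filtering.
    """
    for scheme in ("https://", "http://"):
        i = candidate.find(scheme)
        if i > 0:
            return candidate[i:]
    return candidate

print(strip_glued_prefix("claim...https://t.co/SZlazvFzYx"))
# -> "https://t.co/SZlazvFzYx"
```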
For this text:

```python
extractor.find_urls("My name is claim...https://t.co/SZlazvFzYx")
```

the URL extractor returns:

```python
['claim...https://t.co/SZlazvFzYx']
```