move dns checking to dedicated class and add concurrency #92

nicolasassi · 2021-06-19T17:10:18Z

Implemented the ideas discussed in #91.

Also moved all DNS checking for find_urls and has_urls so that all found URLs can be checked concurrently if the user needs it.
I kept all instances of DNS checking in the abstract methods for backwards compatibility, but marked them as DEPRECATED and removed their effects. Maybe we could remove them altogether in the future.
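
For readers following along, here is a minimal sketch of what checking several hosts concurrently can look like. It is not the code from this PR: the PR uses Pebble, while this sketch swaps in the standard library's `concurrent.futures` so it runs without extra dependencies. The names `resolve_hosts` and `resolver` are hypothetical, and the resolver is injectable so the function can be exercised without network access.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def resolve_hosts(
    hosts: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
    max_workers: Optional[int] = None,
) -> Dict[str, Optional[str]]:
    """Resolve every host concurrently; map each host to its IP or None.

    Hypothetical helper for illustration, not part of urlextract.
    """

    def _lookup(host: str) -> Optional[str]:
        try:
            return resolver(host)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            return None

    # De-duplicate while keeping first-seen order, so each host is looked up once.
    unique_hosts = list(dict.fromkeys(hosts))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        ips = list(pool.map(_lookup, unique_hosts))
    return dict(zip(unique_hosts, ips))
```

Threads are a reasonable fit here because DNS lookups spend their time waiting on the network; a per-lookup timeout, as proposed in this PR, would need extra handling that this sketch leaves out.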

nicolasassi · 2021-06-20T16:28:15Z

I'm not really sure how dependencies work in this project, but I tried adding Pebble to both setup.py and requirements.txt. I hope that fixes the problem.

lipoja · 2021-06-20T19:21:12Z

You should be able to run tox locally to test whether everything works.
requirements.txt should be used to install all dependencies for development.
setup.py should contain all basic dependencies necessary to run urlextract.

lipoja · 2021-06-20T22:07:07Z

urlextract/urlextract_core.py

@@ -745,13 +717,18 @@ def gen_urls(self, text, check_dns=False, get_indices=False):
            # move cursor right after found TLD
            tld_pos += len(tld) + offset

-    def find_urls(self, text, only_unique=False, check_dns=False, get_indices=False):
+    def find_urls(self, text, only_unique=False, check_dns=False, get_indices=False, timeout=None,
+                  accept_on_timeout=False, max_workers=None, max_tasks=None):


What I am wondering here is: aren't there already too many parameters for this find function?
Don't you think it would be sufficient to let the user set these parameters ahead of time by modifying properties of the DNSCheck class?

What is your opinion?

nicolasassi · 2021-06-21

We could write some setter methods for the dns_checker class in the URLExtract class, and the user could set them before running find_urls. But I would add the default values for dns_checker to the docs of find_urls, in case the user sets check_dns = True and is not aware of them. What do you think?

lipoja · 2021-06-25

I vote for setter methods in the dns_check class. Ideally, all methods related to DNS checks should be in that class (not directly in the URLExtract class).
And yes, everything should be documented, and everything should have reasonable default values.
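
A minimal sketch of the setter approach discussed in this thread, under stated assumptions: the class name `DNSCheckSettings` is hypothetical, the defaults are copied from the `find_urls` signature in the diff above, and the validation rules are illustrative rather than anything the PR specifies.

```python
from typing import Optional


class DNSCheckSettings:
    """Hypothetical holder for DNS-check options (not the PR's DNSCheck class)."""

    def __init__(self) -> None:
        # Defaults copied from the find_urls signature in the diff above.
        self._timeout: Optional[float] = None
        self._accept_on_timeout: bool = False
        self._max_workers: Optional[int] = None

    @property
    def timeout(self) -> Optional[float]:
        """Seconds to wait per lookup; None is assumed to mean no limit."""
        return self._timeout

    @timeout.setter
    def timeout(self, value: Optional[float]) -> None:
        if value is not None and value <= 0:
            raise ValueError("timeout must be a positive number or None")
        self._timeout = value

    @property
    def accept_on_timeout(self) -> bool:
        """Whether a host whose lookup timed out is still accepted."""
        return self._accept_on_timeout

    @accept_on_timeout.setter
    def accept_on_timeout(self, value: bool) -> None:
        self._accept_on_timeout = bool(value)

    @property
    def max_workers(self) -> Optional[int]:
        """Upper bound on concurrent lookups; None is assumed to defer to the pool."""
        return self._max_workers

    @max_workers.setter
    def max_workers(self, value: Optional[int]) -> None:
        if value is not None and value < 1:
            raise ValueError("max_workers must be at least 1 or None")
        self._max_workers = value
```

With something like this, a caller would configure the checker once and keep the `find_urls` call short, and the documented defaults would live in one place.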

lipoja · 2021-06-20T22:18:29Z

Thank you for your time and effort working on this issue. I really appreciate that!
Let's discuss this code; I believe we can merge it within a few iterations.

I have not had a chance to review it all yet; I want to go deeper once I have more time.

FYI: Do you think we could also add some tests for it? I have not thought about it much yet, but I think there should be some.

nicolasassi · 2021-06-21T17:04:46Z

@lipoja I fixed most of what was asked. Some changes are still under discussion (such as whether find_urls should have so many args), so there are still changes to be made. Also, the tests are not passing.

For some reason, with my changes the results of DNS checking are no longer being saved in the cache (that's why the tests are not passing now), and I can't figure out why. Could you please take a look?

Thanks!

lipoja · 2021-07-08T12:04:06Z

@nicolasassi Sorry for the delay, lack of time it is ... family comes first these days.

What I was able to determine is that your second commit is breaking the tests, so maybe we can dig deeper around that.
It is about moving the cache into one function. At first glance I do not see anything wrong, though; it needs more debugging.

I will look into it when I find some time (might be on the weekend, but no promises).

lipoja · 2021-07-11T20:25:00Z

@nicolasassi OK, I found it and fixed it:
socket.gethostbyname(host) accepts hosts only. Unfortunately, since you are working with URLs and not just hosts, some pre-processing is needed. You have to modify _get_host() to this:

    def _get_host(self, host: str):
        """
        Get the IP address of a given host.

        :param str host: the host (or URL) to get the IP address from
        :return: a tuple of the given host and its IP address
            (a string of the form '255.255.255.255') if found,
            e.g. ('host.com', '255.255.255.255')
        :rtype: tuple
        """
        tmp_url = host
        scheme_pos = host.find('://')
        if scheme_pos == -1:
            tmp_url = 'http://' + host

        url_parts = uritools.urisplit(tmp_url)
        tmp_host = url_parts.gethost()

        if isinstance(tmp_host, ipaddress.IPv4Address):
            return host, tmp_host

        try:
            return host, socket.gethostbyname(tmp_host)
        ...

Thank you for your work, it looks like the parallel processing might work well... I want to review and test it more once this fix is in place.
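
The pre-processing step described above can be tried in isolation. The following is a sketch only: it swaps `uritools` for the standard library's `urllib.parse` so it runs without third-party packages, and the names `extract_hostname`, `get_host`, and `resolver` are hypothetical. The resolver is injectable so the logic can be checked without real DNS lookups.

```python
import ipaddress
import socket
from typing import Callable, Optional, Tuple
from urllib.parse import urlsplit


def extract_hostname(url_or_host: str) -> Optional[str]:
    """Return the hostname part of a URL or bare host, or None if absent."""
    # Without a scheme, urlsplit treats the input as a path, so add one first.
    if "://" not in url_or_host:
        url_or_host = "http://" + url_or_host
    return urlsplit(url_or_host).hostname


def get_host(
    url_or_host: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Tuple[str, Optional[str]]:
    """Return (original input, IP address string or None)."""
    hostname = extract_hostname(url_or_host)
    if hostname is None:
        return url_or_host, None
    try:
        # IP literals need no DNS lookup; return them as plain strings.
        return url_or_host, str(ipaddress.ip_address(hostname))
    except ValueError:
        pass  # not an IP literal, fall through to the resolver
    try:
        return url_or_host, resolver(hostname)
    except OSError:
        # Lookup failures (socket.gaierror, socket.herror) mean "not found".
        return url_or_host, None
```

The point carried over from the fix is that the resolver only ever sees the bare hostname, while the caller still gets back the string it passed in.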

nicolasassi · 2021-07-19T13:43:24Z

@lipoja Sorry for the delay, I'm also kind of busy these days. As soon as I have some time, I'm going to implement the changes you suggested and add more tests. Thank you for your time!

lipoja · 2021-08-24T14:16:35Z

@nicolasassi Do you think I can take this over and finish that PR in case I have some time?

nicolasassi · 2021-08-24T14:33:43Z

@lipoja Sure! Life has been crazy and unfortunately time is short... I still hope I can take some time to focus on this project and finish it, but just in case, if you have some time, feel free to take over.

I hope my contribution has already helped somehow, and feel free to @ me if you need help with this or anything else in the future.

nicolas.sassi added 3 commits June 17, 2021 13:23

lint file

2075e99

move logic to check dns from _is_domain_valid to dedicated function

a8aefa2

move logic to check dns to dedicated class with concurrency enabled

6f7bdc5

lipoja self-requested a review June 19, 2021 20:59

add Pebble as dependency

03516f2

lipoja requested changes Jun 20, 2021

nicolas.sassi added 7 commits June 21, 2021 13:17

import DNSCheck on init

6a5434d

fix setting values to properties

aedc73c

import DNSCheck and lint

e623d8d

use socket from dns_check

82aaeab

fix tests to use socket from checker module

befaefa

improve docs and return list of hosts instead of bool

518374b

make function complete_url public and reimplement dns_check in abstract functions

7214188
