Downloading and caching extracted #15

kgrz · 2013-08-09T14:53:40Z

Most of the logic inside the #get_page now extracted to the Upton::Downloader class. I did remove most of the encoding specific parts. IF they are needed, I will add them in.

Extracting the downloading and caching logic to its own class. This cleans up the main class quite a bit. Some of the explicit encoding options are removed. modified: spec/spec_helper.rb The repo uses webmock which makes stubbing HTTP connections easier

kgrz · 2013-08-09T15:55:18Z

@jeremybmerrill Before merging this commit, please check if you have any encoding-related test cases that fail. If yes, I should add them to the Downloader class and the corresponding spec.

jeremybmerrill · 2013-08-09T22:35:23Z

Oh, this is awesome! Thanks! I want to take a closer look, but I expect to merge this soon.

I stupidly forgot to note where the encoding stuff was broken, but I'll try to find it and write a test with it. I definitely didn't put in that encoding stuff for no reason... :)

I may put back in those accept headers, since some sites don't work right without them; I'll put them in some Downloader class variable though. I was able to find which site I was scraping that gave me this problem and I'll try to reduce it to a minimal test case.

jeremybmerrill · 2013-08-14T02:38:48Z

Using tmpdir is an interesting choice... I suppose it makes sense as a default, since debugging files won't need to persist after a reboot (in most cases) and the default can be overridden if you want to hold onto the stashed files. Good idea!

I'm curious, before I merge, why did you want to write cached files with md5 filenames? I'd love to hear your thought process. (Worried about collisions with old method of removing special chars from URI?) I think having human-readable filenames is a benefit, since I sometimes look at the stashed files.

I haven't been able to come up with a test case for either the accept headers or the encoding stuff. If it comes up, I'll re-add it.

kgrz · 2013-08-14T11:14:48Z

I'm curious, before I merge, why did you want to write cached files with md5 filenames? I'd love to hear your thought process. (Worried about collisions with old method of removing special chars from URI?)

I think I wrote the first implementation of the caching part as one where the downloaded files are kept in separate directories named after the URL in-turn grouped under the main stash directory — We could add paginated pages under the same URL hostname or parent page to the same directory — and so, since there would be too many parameters to gsub from the URL, I just hashed the name. When I extracted the functionality into a gem, I thought it made more sense to just generalize the name and make sure a unique file is saved per URL.

In retrospect, however, I suppose this is a fair choice in avoiding collisions as you mentioned and also to deal with UTF-8 characters in the URL that might be difficult to gsub using regex.

dannguyen · 2013-08-14T14:44:24Z

Maybe there should be an option in the Upton constructor to allow MD5 hashing? I do think that that use case ends up being inferior when it comes to stashing on the file system. However, if Upton is extended to allow storage into a database, that kind of hashing would be useful.

jeremybmerrill · 2013-08-14T15:35:21Z

Good call, Dan.

I'll cut out the md5 stuff and then merge (when I get a chance, which may not be today), and we can continue this discussion on #20 to eventually improve url_to_filename.

I'll note that I really like the idea of putting stashed pages in subfolders (though it makes sense for that to live in Upton proper, not the downloader)

kgrz · 2013-08-14T15:59:57Z

I do think that that use case ends up being inferior when it comes to stashing on the file system.

I didn't get this. We are not MD5-ing the whole file. Just the name of the file. It straight away makes it easy to deal with the edge cases mentioned in #20.

jeremybmerrill · 2013-08-14T16:15:03Z

I think MD5 makes perfectly good sense for the downloader. In Upton proper, however, I think it comes with a very big downside: namely, that you can't easily figure out where a specific stashed page is to examine the stashed HTML (for whatever reason). I don't think CGI.escape and the stuff discussed in #20 is hard enough to justify not doing it.

dannguyen · 2013-08-14T16:26:52Z

@kgrz Sorry, what I meant was that if it is important for the user to manage/read through the stash, then MD5 is not a good solution...however, in the last 5 minutes, I've since changed my mind on that :)

#20 (comment)

(tl;dr for what seems like the typical Upton use-case, in which a user has implemented a successful scraper and data extraction routine and just needs to run it from time to time, the stash should be of no interest to the user, so why make it readable in the first place?)

I think if the choice is between MD5 or doing the somewhat difficult work of keeping filenames readable (and preventing as many collisions as possible), I think MD5 seems like the more reasonable tradeoff at this point.

jeremybmerrill · 2013-08-15T03:07:20Z

What do you all think (keeping in mind discussion on #20?) about keeping the md5 filename method, keeping the gsub method in hopes of eventually implementing aCGI.escape -type method, and, if @dannguyen writes a PR, adding a SQlite method? Which is the default isn't a big deal to me; md5 solves most of the problems in easy cases, but having the other options available easily is important.

jeremybmerrill · 2013-08-15T14:45:23Z

(I made that merge at home, but forgot to push last night; will push to master tonight)

PS: Good call on the file tree structure changes, i.e. lib/upton/utils.rb etc.

dannguyen · 2013-08-15T18:44:07Z

I haven't taken a close look at the new downloading/caching code yet, may try to look at it this weekend. My first step, if it already isn't, would be to create an abstract UptonStorage class that interfaces with the file system or db (or whatever)...for some reason though, that sounds like a "lets make an ORM, but for files!" project, which seems weird :)

Then the next step would be to figure out a basic schema for each saved page...probably something like a table with columns for URL, hostname (for quicker indexing, even if redundant), joined to another table that contains just blobs for content.

kgrz added 2 commits August 9, 2013 20:20

Added webmock to the development bundle

57703cc

dannguyen mentioned this pull request Aug 14, 2013

Improving url_to_filename #20

Open

jeremybmerrill merged commit 57703cc into propublica:master Aug 16, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Downloading and caching extracted #15

Downloading and caching extracted #15

kgrz commented Aug 9, 2013

kgrz commented Aug 9, 2013

jeremybmerrill commented Aug 9, 2013

jeremybmerrill commented Aug 14, 2013

kgrz commented Aug 14, 2013

dannguyen commented Aug 14, 2013

jeremybmerrill commented Aug 14, 2013

kgrz commented Aug 14, 2013

jeremybmerrill commented Aug 14, 2013

dannguyen commented Aug 14, 2013

jeremybmerrill commented Aug 15, 2013

jeremybmerrill commented Aug 15, 2013

dannguyen commented Aug 15, 2013

Downloading and caching extracted #15

Downloading and caching extracted #15

Conversation

kgrz commented Aug 9, 2013

kgrz commented Aug 9, 2013

jeremybmerrill commented Aug 9, 2013

jeremybmerrill commented Aug 14, 2013

kgrz commented Aug 14, 2013

dannguyen commented Aug 14, 2013

jeremybmerrill commented Aug 14, 2013

kgrz commented Aug 14, 2013

jeremybmerrill commented Aug 14, 2013

dannguyen commented Aug 14, 2013

jeremybmerrill commented Aug 15, 2013

jeremybmerrill commented Aug 15, 2013

dannguyen commented Aug 15, 2013