
Improving url_to_filename #20

Open
dannguyen opened this issue Aug 14, 2013 · 7 comments

Comments

@dannguyen
Contributor

A tangent to this discussion: #15

The gsubbing of all non-word characters may cause more collisions than desired in some edge cases, and more importantly, makes it difficult/impossible to reverse the filename and get the original URL. Maybe it's possible to use CGI.escape to convert to a proper file name and then CGI.unescape to reverse the change?

 CGI.escape 'https://github.com/propublica/upton/issues/new#hashbang+42'
 => "https%3A%2F%2Fgithub.com%2Fpropublica%2Fupton%2Fissues%2Fnew%23hashbang%2B42"
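For example (a quick sketch, not Upton code), the escape is fully reversible and the result contains no path separators:

```ruby
require 'cgi'

url = 'https://github.com/propublica/upton/issues/new#hashbang+42'
escaped = CGI.escape(url)
# escaped has no "/" characters, so it is safe to use as a single filename
restored = CGI.unescape(escaped)
restored == url  # => true
```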

Second issue: extremely long filenames

I've run into this before but am not going to take the time to reproduce it...operating systems may restrict the length of a filename to something shorter than the saved URL. This could be mitigated by saving the file to subdirectories according to its URL path:

http://example.com/path/to/filename.html

Gets saved to:

http%3A%2F%2Fexample.com/path/to/filename.html
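One way to sketch that mapping (`url_to_stash_path` is a hypothetical helper name, not anything in Upton): escape only the scheme-plus-host, and keep the URL path as real subdirectories so no single path component gets too long:

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: escape the scheme and host into one directory name,
# then reuse the URL's own path as nested subdirectories.
def url_to_stash_path(url)
  uri = URI.parse(url)
  host_dir = CGI.escape("#{uri.scheme}://#{uri.host}")
  File.join(host_dir, uri.path)
end

url_to_stash_path('http://example.com/path/to/filename.html')
# => "http%3A%2F%2Fexample.com/path/to/filename.html"
```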

In some extreme scraping cases, having tens of thousands of files in the stash directory might cause some performance issues. I'm not sure how fast it would be to look up a file's identity through a recursive subdirectory search (but that's probably premature optimization).

@jeremybmerrill
Contributor

I like @kgrz's idea, as adapted by you, dan, about not stashing everything in a flat directory...

e.g. stashes/#{clean_index_url}/#{clean_url}/

I do think readable filenames are important. CGI.escape will probably do this work for us. Might be worthwhile still to bake in a way to deal with colliding filenames (e.g. appending -2,-3, etc.).
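A minimal sketch of that collision handling (`uncollided_filename` is a hypothetical helper, not existing Upton code):

```ruby
# Hypothetical helper: if a stash filename already exists, append -2, -3, ...
# until we find a name that doesn't collide.
def uncollided_filename(dir, name)
  candidate = File.join(dir, name)
  return candidate unless File.exist?(candidate)
  n = 2
  n += 1 while File.exist?(File.join(dir, "#{name}-#{n}"))
  File.join(dir, "#{name}-#{n}")
end
```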

Nothing wrong with your suggestion of letting people choose between CGI.escape and md5 in the options hash, though I'm still not sure if there's a benefit to md5 ever. (It might be limited to the downloader itself and overridden in Upton proper?)

@dannguyen
Contributor Author

Come to think of it, though, how useful is it for the user to recognize the stashed filenames? If the scrape goes well, they shouldn't ever have to worry about the stash. When a scrape fails, a user may want to find the stashed file to debug it, but in my opinion, that's kind of an anti-pattern. Finding that actual file (i.e. by looking through the OS Finder) will vary in difficulty, depending on the number of pages scraped and the ugliness of the URL...they should be used to debugging against the live page at the remote URL.

Moreover, the users probably should not have to worry about the existence of the stash at all, as it seems to be mainly a system to speed up performance (through caching)...if users intend to be saving these pages for exploratory purposes (beyond scraping the datapoints), then they should be using a site-spidering library like http://anemone.rubyforge.org/

MD5 may not be the ultimate solution (for example, you might find it better to ultimately have Upton stash in SQLite by default), but it may be worthwhile in the long run to not make users think of a readable-stash as an actual feature...it's not useful in the majority of cases but is pretty hard to perfect and maintain.

@jeremybmerrill
Contributor

In most cases, the live site is a better thing to debug against, for sure. But there are a few cases where what Upton gets isn't easily replicable in the browser; e.g. debugging logins or wacky sites that behave differently based on POST data or headers. It's useful in these cases to have access to the pages themselves.

One middle ground would be to output a CSV mapping URLs to md5 hashes... but that might be an unnecessary complication.

@dannguyen
Contributor Author

Yeah, one thing that just came to mind...what has been useful to me is to have, if not readable cache-file names, reversible cache names...(the solution just ends up being both). The use case is that I've hammered a site but am not sure whether I got all the pages I wanted...so I write myself a little script that tells me how many (or what) pages I've downloaded from a given domain, or for a particular scrape. This is pretty much impossible if the URLs are hashed with a one-way function.

But again, that's an edge case (i.e. you're not expecting the average Upton user to be writing their own scraping-status-report tool).
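The little script I mean could look something like this (a sketch assuming a flat stash directory of CGI-escaped filenames; `pages_per_domain` is a made-up name):

```ruby
require 'cgi'
require 'uri'

# Hypothetical report script: because CGI-escaped stash names are reversible,
# the original URLs can be recovered from the filenames and tallied per domain.
def pages_per_domain(stash_dir)
  Dir.children(stash_dir)
     .map { |f| CGI.unescape(f) }
     .group_by { |u| URI.parse(u).host }
     .transform_values(&:size)
end
```

With one-way MD5 names, the unescape step has nothing to work with, which is the point above.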

Ultimately, I think that the happy medium will be to abstract the caching system from Upton's core spidering/scraping functionality so that you can drop in SQLite, which would let you handle all the edge and collision cases, while providing a better interface for the users who do want to query what pages they've downloaded (and when). And for users who don't want to go through the hassle of installing SQLite, you still have your default save to the /stash folder...I think it's safe to assume that users who don't/can't handle the hassle of installing SQLite will have relatively basic scraping needs. And as they scrape bigger things, well, they'll just have to make the jump in complexity themselves :)

@jeremybmerrill
Contributor

The readable url method I added (which is off by default, in favor of md5 filenames) in #15, which I just merged, uses the first 233 characters of the gsubbed URL, then a timestamp. This will solve the collision issue, while retaining readability. Obviously still open to better suggestions and the SQLite solution.
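A rough approximation of that scheme (the gsub pattern and method name here are guesses, not the code merged in #15):

```ruby
# Hypothetical approximation: gsub non-word characters to underscores,
# truncate to the first 233 characters, then append a timestamp so two
# URLs with identical prefixes still get distinct filenames.
def readable_filename(url)
  slug = url.gsub(/\W/, '_')[0, 233]
  "#{slug}-#{Time.now.to_i}"
end
```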

One good call @kgrz made was, as you suggested, pulling the caching stuff into its own file, lib/upton/downloader.rb, to get it out of the way. I think @kgrz is considering turning it into a gem on its own, if I understand correctly, which'd be super neat.

@kgrz
Contributor

kgrz commented Aug 16, 2013

Yep. Ignoring the reversibility of the MD5-ed filenames was my mistake; it never crossed my mind. As @dannguyen suggested, an SQLite or similar interface should be available. I will see if I can make that easy and include it in the gem. That way, Upton need not be polluted with all that logic.

@jeremybmerrill
Contributor

Hmm... I should have seen that coming, but timestamps don't work, since they're not reversible. If I'm looking for a cached file, I don't know what time it was cached. (Without the as-yet-non-existent database.)
