
linkypedia v2


Why v2.0?

linkypedia v1.0 demonstrated that there is quite a bit of interest in monitoring how Wikipedians use publicly available Web content when they verify the substance of articles. However, it soon became clear that the initial scope of linkypedia was somewhat limited, given the interests of some of its users:

  • Interest is not limited to cultural heritage institutions. There is also interest from media organizations like the New York Times, commercial sites like Flickr, and government institutions like NIH, many of which have large numbers of links (>100k).
  • There is interest in statistics aggregated at levels other than a particular domain name, for example grouping by domain name and by top-level domain (.com, .edu, .org, etc.).
  • Some of the sites submitted were from non-English-speaking countries (France, Germany, Spain, Mexico), which would have been better served by looking at their respective Wikipedias in addition to the English Wikipedia.
  • Enough sites were added that a serial crawler (a crawler that updates links one site at a time in a single process) was no longer sufficient to keep on top of all the links in a day.
  • Some reports for large sites (>100k links) proved to take quite a bit of time (sometimes 10 seconds or more for the page to render).
  • The addition of user accounts might allow more user-centered views: to allow people to see what links are present in pages they have contributed to on Wikipedia, or which Wikipedia pages they have referenced on Twitter, Facebook or on their blog…and what members of their social circle have done as well.
  • Separation of user profile pages from article pages using the page namespace table.
  • Utilities to help organizations figure out what articles need to be created or enhanced on Wikipedia.

What Needs to Change?

  • Provide user accounts that allow users to log in and monitor particular domains. This will allow domains to be grouped together (e.g. www.worldcat.org and worldcat.org; a host-normalization sketch follows this list), and will also provide an idea of who is using Linkypedia and to what ends. Accounts will be free of course, but an account will be required for adding a domain to monitor. Anyone will be able to monitor an established domain.
  • Pre-populate the database with external link and page dumps from the major language Wikipedias. This should allow reporting on higher-level aggregate data that does not depend on users typing in a domain name. It should also enable Linkypedia to respond more quickly with initial overviews for particular domains, since an initial crawl will not be necessary. It should be possible to reload these tables routinely when new dumps become available.
  • Augment the raw link data with a data-mart style schema that summarizes aggregate information, so that expensive queries aren’t re-run constantly during the request cycle. The data mart should be regenerated at regular intervals and will include relations like links per domain, per TLD, per language, per category, and per page namespace (talk, profile, etc.); a rough rollup sketch follows this list.
  • Parallelize the crawl process using something like Celery, so that multiple crawlers can pick up jobs, possibly on different machines (a Celery sketch follows this list). It will be important to bear in mind that parallelization will put an increased load on the Wikipedia API and the linkypedia database.
  • Figure out how to integrate RDF data from dbpedia so that it can be made available as metadata to interested parties. Perhaps include dbpedia resources by reference, or somehow download the data and then re-publish it? A SPARQL sketch follows this list.
  • Streamlined curation tools for rapidly assigning types to links.
  • JavaScript widgets that publishers could put in their pages to provide context from Wikipedia. This could both enrich publisher-specific views and drive traffic to Wikipedia from potential editors. It could also give linkypedia some user data to chew on.
  • More generalized support for link sources other than Wikipedia. It ought to be possible to have linkypedia consume feeds from other places, looking for links in that content.
  • More thoughtful integration of [page view stats](http://dumps.wikimedia.org/other/pagecounts-raw/) from Wikimedia. Specifically, when new pages are added to linkypedia the stats collection only starts at that point, which can look misleading when mixed in with pages that have been monitored for longer and have accumulated larger counts. It would be nice to have time series data to look at trends over time for a given website (a sketch for parsing the hourly pagecounts files follows this list).
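
For the domain-grouping idea, here is a minimal host-normalization sketch in Python. The function name and the www-stripping rule are illustrative assumptions, not linkypedia's actual behavior:

```python
# Hypothetical helper: normalize submitted hosts so that www.worldcat.org
# and worldcat.org end up grouped under a single domain record.
from urllib.parse import urlparse

def normalize_host(url_or_host):
    """Lower-case the host name and strip a leading 'www.' prefix."""
    host = urlparse(url_or_host).netloc or url_or_host
    host = host.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host

assert normalize_host("http://www.worldcat.org/oclc/1") == "worldcat.org"
assert normalize_host("worldcat.org") == "worldcat.org"
```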
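
For the data mart, a rough sketch of how one rollup might be regenerated with the Django ORM. The `Link` and `DomainStats` models and the `linkypedia.models` module path are hypothetical stand-ins for whatever schema is settled on:

```python
# Hypothetical models: Link holds one row per external-link occurrence,
# DomainStats holds one pre-computed summary row per domain.
from django.db.models import Count
from linkypedia.models import Link, DomainStats  # assumed module path

def rebuild_domain_stats():
    """Recompute per-domain link counts; meant to run on a schedule,
    not during the request cycle."""
    rows = Link.objects.values("domain").annotate(total=Count("id"))
    DomainStats.objects.all().delete()
    DomainStats.objects.bulk_create(
        [DomainStats(domain=r["domain"], link_count=r["total"]) for r in rows]
    )
```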
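
For the parallel crawl, a minimal Celery sketch. The task name, broker URL, and rate limit are assumptions, and the task body is left as a stub:

```python
from celery import Celery

# Hypothetical broker URL; any Celery-supported broker would do.
app = Celery("linkypedia", broker="redis://localhost:6379/0")

@app.task(rate_limit="10/m")  # throttle workers to go easy on the Wikipedia API
def crawl_domain(domain_name):
    """Fetch the external links for one domain and update the database."""
    # ... query the Wikipedia API (e.g. list=exturlusage) and save results ...
    pass

def schedule_crawls(domain_names):
    """Enqueue one job per monitored domain; workers on any machine pick them up."""
    for name in domain_names:
        crawl_domain.delay(name)
```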
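
One possible way to include dbpedia data by reference is to query the public SPARQL endpoint on demand. A sketch using SPARQLWrapper follows; the choice of library and of the abstract property are assumptions:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_abstract(resource_name):
    """Fetch the English abstract for a dbpedia resource, e.g. 'OCLC'."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?abstract WHERE {
          <http://dbpedia.org/resource/%s> <http://dbpedia.org/ontology/abstract> ?abstract .
          FILTER (lang(?abstract) = "en")
        }
    """ % resource_name)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    bindings = results["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None
```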
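
For the page view stats, the hourly pagecounts-raw files are whitespace-delimited lines of project, title, request count, and bytes transferred. A sketch of folding one hourly file into per-page counts (the file-handling details are assumptions):

```python
import gzip
from collections import defaultdict

def hourly_views(path, project="en", titles=None):
    """Return {title: request count} for one hourly pagecounts file."""
    counts = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4 or parts[0] != project:
                continue
            title, requests = parts[1], int(parts[2])
            if titles is None or title in titles:
                counts[title] += requests
    return dict(counts)
```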

More?

Check out the v2 branch if you are interested in following along.

Please add your ideas, comments here!
