clarin-eric/linkchecker

Link Checker

Introduction

The Link Checker is a StormCrawler adaptation for URL checking. Instead of crawling, it checks the status of URLs and persists the results in a database (currently MariaDB/MySQL).

Important note
The Link Checker is not a stand-alone application but a Storm topology which runs inside a cluster. For testing only, we provide a class which runs as a stand-alone application, preferably in your IDE. It should not be run in production.

For more information on Storm topologies, see the documentation of the Apache Storm project.

Building and running the Link Checker topology

Building the Link Checker topology

  1. Clone this repository to your workspace
  2. Go inside the Link Checker directory and build a jar by calling the Maven wrapper with the command
    ./mvnw clean install

You may use your own Maven installation instead of the Maven wrapper to build the topology, but the wrapper is the safer choice, since the build is tested with it. If anything goes wrong at build time, first make sure that you are using the Maven wrapper.

Setting up a storm cluster

For remote cluster setup, see the documentation of the Apache Storm project.

Deploying the Link Checker topology to the cluster

To deploy your Link Checker topology to the cluster, use the command
<storm directory>/bin/storm jar <Link Checker directory>/target/linkchecker-<version>.jar org.apache.storm.flux.Flux -e -r -R linkchecker.flux

For more information on the parameters, see the Flux chapter of the Apache Storm documentation.

Testing in local mode in your IDE

As mentioned before, the Link Checker project provides a class for testing the Link Checker in local mode in your favorite IDE, without the need to set up a cluster.

  1. Clone this repository into an IDE workspace
  2. Set the environment variables used in src/test/resources/linkchecker-test-conf.yaml in the IDE's run configuration
  3. Execute class eu.clarin.linkchecker.LinkcheckerTestApp (under src/test/java)

Simple Explanation of the current implementation

Our SQL database has these tables:

  1. url: This is the table that linkchecker reads the URLs to check from. So this will be populated by another application (in our case curation-module or linkchecker-api).
  2. status: This is the table that linkchecker saves the results into.
  3. history: If a URL is checked more than once, the previous checking result is saved in the history table and the record in the status table is updated.
  4. obsolete: A flat table which keeps records for a while after they have been purged from the other tables.
  5. providerGroup
  6. context: The table saves the context (the file or the upload) in which the link is found
  7. url_context: Joins the url table n-n to the context table, so that each URL may appear in different contexts. The table also contains the last time the link was ingested and a boolean flag which indicates whether the join is still active. Only URLs which have at least one active join are considered for checking!
  8. client: This table is used to identify the link source.

The creation script is available in the linkchecker-persistence API project.
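The semantics of the url_context, status and history tables described above can be sketched in memory as follows. This is only a minimal illustration under stated assumptions, not the actual schema or the linkchecker-persistence API: the class `TableSketch`, its nested types and all field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the table semantics; all names here are hypothetical. */
class TableSketch {

    /** Stand-in for one row in the status or history table. */
    static final class CheckResult {
        final String url;
        final int statusCode;
        final long checkedAtMillis;

        CheckResult(String url, int statusCode, long checkedAtMillis) {
            this.url = url;
            this.statusCode = statusCode;
            this.checkedAtMillis = checkedAtMillis;
        }
    }

    /** Stand-in for one row in url_context: joins a URL to a context. */
    static final class UrlContext {
        final String url;
        final String context;
        final boolean active;

        UrlContext(String url, String context, boolean active) {
            this.url = url;
            this.context = context;
            this.active = active;
        }
    }

    final Map<String, CheckResult> status = new HashMap<>();
    final List<CheckResult> history = new ArrayList<>();
    final List<UrlContext> urlContexts = new ArrayList<>();

    /** A URL is considered for checking only if at least one of its joins is active. */
    boolean isCheckable(String url) {
        for (UrlContext uc : urlContexts) {
            if (uc.url.equals(url) && uc.active) {
                return true;
            }
        }
        return false;
    }

    /** Stores the latest result; a previous result, if any, is moved to history. */
    void saveResult(CheckResult result) {
        CheckResult previous = status.put(result.url, result);
        if (previous != null) {
            history.add(previous);
        }
    }
}
```

In this sketch, checking the same URL twice leaves exactly one entry for it in `status` (the latest) and one in `history` (the earlier result), which mirrors the description of the status and history tables.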

linkchecker.flux defines the components (spouts, bolts and streams) of our topology and loads the configuration file linkchecker-conf.yaml.

  1. eu.clarin.linkchecker.spout.LPASpout uses the linkchecker-persistence API to fill up a buffer with URLs to check.
  2. com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt partitions the URLs by a configured criterion.
  3. eu.clarin.linkchecker.bolt.MetricsFetcherBolt fetches the URLs. It sends redirects back to URLPartitionerBolt and sends the rest down the stream to StatusUpdaterBolt. It is a modification of com.digitalpebble.stormcrawler.bolt.FetcherBolt.
  4. eu.clarin.linkchecker.bolt.StatusUpdaterBolt persists the results in the status table of the database via the linkchecker-persistence API.
  5. eu.clarin.linkchecker.bolt.SimpleStackBolt persists the latest checking results into a Java Object file for use in curation-web
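The routing decisions in the component list above can be sketched as plain Java, independent of Storm. This is a hedged illustration, not the code of the actual bolts: the class `RoutingSketch` and its methods are hypothetical, partitioning by host is just one possible criterion, and treating every 3xx response except 304 as a redirect is an assumption of this sketch.

```java
import java.net.URI;
import java.util.Locale;

/** Sketch of the routing decisions in the topology; all names are hypothetical. */
class RoutingSketch {

    /** Where a fetched URL is sent next. */
    enum Target { URL_PARTITIONER, STATUS_UPDATER }

    /** Partition key when partitioning by host (assumed criterion for this sketch). */
    static String partitionKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /**
     * Redirects are fed back to the partitioner so the redirect target gets fetched;
     * every other result goes on to the status updater.
     * Assumption: a redirect is any 3xx status code other than 304.
     */
    static Target route(int httpStatus) {
        boolean redirect = httpStatus >= 300 && httpStatus < 400 && httpStatus != 304;
        return redirect ? Target.URL_PARTITIONER : Target.STATUS_UPDATER;
    }
}
```

Under these assumptions, `RoutingSketch.route(301)` yields `URL_PARTITIONER`, while `RoutingSketch.route(200)` and `RoutingSketch.route(404)` both yield `STATUS_UPDATER`.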