Link Checker

Introduction

The Link Checker is a StormCrawler adaptation for URL checking. Instead of crawling, it checks the status of URLs and persists the results in a database (currently MariaDB/MySQL).

Important note
The Link Checker is not a stand-alone application but a Storm topology that runs inside a cluster. For testing only, we provide a class that runs as a stand-alone application, preferably in your IDE. Do not run this class in production.

For more information on Storm topologies, see the documentation of the Apache Storm project.

Building and running the Link Checker topology

Building the Link Checker topology

Clone this repository to your workspace
Go inside the Link Checker directory and build a jar by calling the Maven wrapper with the command
./mvnw clean install

You may use your own Maven installation instead of the Maven wrapper to build the topology, but the wrapper is the safer choice because it is the tested one. If anything goes wrong at build time, first make sure that you were using the Maven wrapper.

Setting up a Storm cluster

For a remote cluster setup, see the documentation of the Apache Storm project.

Deploying the Link Checker topology to the cluster

To deploy your Link Checker topology to the cluster, use the command
<storm directory>/bin/storm jar <Link Checker directory>/target/linkchecker-<version>.jar org.apache.storm.flux.Flux -e -r -R linkchecker.flux

For more information on the parameters, see the Flux chapter of the Apache Storm documentation.

Testing in local mode in your IDE

As mentioned before, the Link Checker project provides a class for testing the Link Checker in local mode in your favorite IDE, without the need to set up a cluster.

Clone this repository into an IDE workspace
Set the environment variables used in src/test/resources/linkchecker-test-conf.yaml in the IDE's run configuration for the application
Execute class eu.clarin.linkchecker.LinkcheckerTestApp (under src/test/java)
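
Before launching the test class, it can help to confirm that the run configuration really passes the expected environment variables. The following is a minimal sketch of such a check. The class EnvCheck and the three variable names are hypothetical placeholders and not part of the project; use the names that actually appear in src/test/resources/linkchecker-test-conf.yaml.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EnvCheck {

    // Returns the names from `required` that are absent or blank in `env`.
    static List<String> missing(List<String> required, Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : required) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical variable names; replace them with the ones
        // referenced in linkchecker-test-conf.yaml.
        List<String> required = List.of("DATABASE_URI", "DATABASE_USER", "DATABASE_PASSWORD");
        List<String> absent = missing(required, System.getenv());
        if (absent.isEmpty()) {
            System.out.println("All required environment variables are set.");
        } else {
            System.out.println("Missing environment variables: " + absent);
        }
    }
}
```

Running this helper with the same run configuration as the test class shows whether any variable is missing before the topology starts.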

Simple Explanation of the current implementation

Our SQL database has the following tables:

url: The table from which the Link Checker reads the URLs to check. It is populated by another application (in our case curation-module or linkchecker-api).
status: The table into which the Link Checker saves the results.
history: If a URL is checked more than once, the previous checking result is saved in the history table and the record in the status table is updated.
obsolete: A flat table that keeps records for a while after they have been purged from the other tables.
providerGroup
context: The table that stores the context (the file or the upload) in which the link was found.
url_context: Joins the url table to the context table in a many-to-many relation, so that each URL may appear in several contexts. The table also stores the last time the link was ingested and a boolean flag that indicates whether the join is still active. Only URLs that have at least one active join are considered for checking!
client: The table is basically used to identify the link source.
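
The interplay of the status, history and url_context tables can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the class TableSketch, its field names and the reduction of a checking result to a bare status code are invented for illustration and do not reflect the real schema or the linkchecker-persistence API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TableSketch {

    // One row of url_context: a join between a URL and a context.
    static final class UrlContext {
        final String url;
        final String context;
        final boolean active;

        UrlContext(String url, String context, boolean active) {
            this.url = url;
            this.context = context;
            this.active = active;
        }
    }

    // status: at most one (latest) result per URL.
    final Map<String, Integer> status = new HashMap<>();
    // history: all earlier results per URL, oldest first.
    final Map<String, List<Integer>> history = new HashMap<>();
    // url_context: many-to-many joins between URLs and contexts.
    final List<UrlContext> urlContext = new ArrayList<>();

    // Saves a new result; an existing result for the URL moves to history first.
    void saveResult(String url, int statusCode) {
        Integer previous = status.put(url, statusCode);
        if (previous != null) {
            history.computeIfAbsent(url, k -> new ArrayList<>()).add(previous);
        }
    }

    // A URL is considered for checking only if at least one of its joins is active.
    boolean isCheckable(String url) {
        return urlContext.stream().anyMatch(j -> j.url.equals(url) && j.active);
    }

    public static void main(String[] args) {
        TableSketch tables = new TableSketch();
        tables.urlContext.add(new UrlContext("http://example.org/", "upload-1", true));
        tables.saveResult("http://example.org/", 404);
        tables.saveResult("http://example.org/", 200);
        System.out.println("status:    " + tables.status);
        System.out.println("history:   " + tables.history);
        System.out.println("checkable: " + tables.isCheckable("http://example.org/"));
    }
}
```

In the real database these steps are SQL operations on the tables listed above; the sketch only mirrors the rule that the status table holds the latest result while older results accumulate in history.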

The creation script is available in the linkchecker-persistence API project.

linkchecker.flux defines the components (spouts, bolts and streams) of our topology and loads the configuration file linkchecker-conf.yaml.

eu.clarin.linkchecker.spout.LPASpout uses the linkchecker-persistence API to fill up a buffer with URLs to check.
com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt partitions the URLs by a configured criterion.
eu.clarin.linkchecker.bolt.MetricsFetcherBolt fetches the URLs. It sends redirects back to URLPartitionerBolt and sends all other results down the stream to StatusUpdaterBolt. It is a modification of com.digitalpebble.stormcrawler.bolt.FetcherBolt.
eu.clarin.linkchecker.bolt.StatusUpdaterBolt persists the results in the status table of the database via the linkchecker-persistence API.
eu.clarin.linkchecker.bolt.SimpleStackBolt persists the latest checking results into a Java object file for use in curation-web.
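
The routing between the partitioner, the fetcher and the status updater can be sketched in plain Java. This is not the code of the bolts themselves: the class PartitionSketch is hypothetical, the partitioning criterion is assumed to be the host name, and treating every 3xx status code as a redirect is a simplification.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PartitionSketch {

    // Returns the partition key of a URL: here its lower-cased host (assumed criterion).
    // URI.create throws IllegalArgumentException for malformed URLs.
    static String partitionKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // Groups URLs by partition key, preserving the order of first appearance.
    static Map<String, List<String>> partition(List<String> urls) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(partitionKey(url), k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    // Names the component that receives a fetch result next:
    // redirects go back to the partitioner, everything else to the status updater.
    static String nextComponent(int statusCode) {
        boolean redirect = statusCode >= 300 && statusCode < 400;
        return redirect ? "URLPartitionerBolt" : "StatusUpdaterBolt";
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.org/a",
                "https://example.org/b",
                "http://other.example.net/");
        System.out.println(partition(urls));
        System.out.println(nextComponent(301));
        System.out.println(nextComponent(200));
    }
}
```

Grouping by host means all URLs of one server share a partition key; sending redirects back through the partitioner lets the redirect target be assigned a key of its own.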
