framework.js is a framework for writing Zotero translators, especially screen scrapers. It is intended to make it easier for new and experienced developers to write Zotero translators by removing as much boilerplate code as possible.
To use framework.js, paste the contents of the framework.js file into your translator, between the { ... } header and the code at the bottom. On Unix, you can use the Makefile to translate .js.in files into .js files. See, e.g., SFGate.js.in. You will need to modify the FILES variable in the Makefile to include your target .js file.
SFGate.js.in shows a complete example. Between the boilerplate, you will see a JavaScript object that looks like:
    {
        itemType : 'newspaperArticle',
        // Selector plus .text(): read the title from the page's meta tag.
        title : FW.Xpath('//head/meta[@name="title"]/@content').text(),
        // Plain string: the same for every item.
        publicationTitle : "San Francisco Chronicle",
        date :
            FW.Xpath('//div[@class="articleheadings"]//p[@class="date"]').text(),
        // Strip everything from the first comma on, then clean as an author name.
        creators :
            FW.Xpath('//div[@class="articleheadings"]//p[@class="byline author vcard"]').
                text().remove(/,.*$/).cleanAuthor("author"),
        attachments : {
            url : FW.Url(),
            title : "SFGate Snapshot",
            type : "text/html"
        }
    }
Each key, e.g., itemType or title, is generally a Zotero metadata field. Each value is one of three things: a string, which is used unchanged for every item; a filter; or a function.
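The three kinds of value can be told apart with a simple type check. The sketch below is a simplified model and not the framework.js source: the resolveValue helper, the evaluate method name, and the (doc, url) arguments passed to functions are all assumptions made for illustration.

```javascript
// Simplified model of how a field value might be resolved.
// resolveValue, the evaluate() method name and the (doc, url)
// arguments are hypothetical; they are not taken from framework.js.
function resolveValue(value, doc, url) {
    if (typeof value === "function") {
        // A function computes the value from the page itself.
        return value(doc, url);
    }
    if (value !== null && typeof value === "object" &&
        typeof value.evaluate === "function") {
        // A filter-like object knows how to evaluate itself.
        return value.evaluate(doc, url);
    }
    // Anything else (for example a string) is used unchanged.
    return value;
}

// Stand-in for a filter: always yields the page URL in a list.
const urlFilter = { evaluate: (doc, url) => [url] };

const fields = {
    publicationTitle: "San Francisco Chronicle",     // string
    url: urlFilter,                                  // filter
    extra: (doc, url) => "Scraped from " + url       // function
};

// Resolve every field against a made-up page URL.
const resolved = {};
for (const key of Object.keys(fields)) {
    resolved[key] = resolveValue(fields[key], null, "http://example.com/a");
}
```

In this model the string passes through untouched, while the filter and the function both see the page, which is why a constant such as publicationTitle needs no selector at all.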
Filters are the primary way that metadata is generated. A filter starts with a selector, which selects some part of the document. Often this is FW.Xpath, though FW.Url also appears above. The filter then applies a series of transformations to the list of results, and the final result is what is added to the metadata field.
For instance, the creators metadata field above first selects all results from the document that match //div[@class="articleheadings"... as seen above. Following this, each result has its text content generated with the .text() filter. Then anything matching the regex /,.*$/ is removed. Finally, the Zotero.Utilities.cleanAuthor method is called with the arguments (string, "author"). The resulting values (remember that there may have been multiple matches to the XPath expression) are added to the creators field of the generated Zotero item.
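The creators chain can be traced on sample data with a toy model. This is only a sketch: ToyFilter, fromList, and the simplified cleanAuthor are hypothetical stand-ins rather than framework.js or Zotero code, and the byline strings are made up to represent what .text() might return.

```javascript
// Toy model of a filter chain. ToyFilter, fromList and the simplified
// cleanAuthor below are hypothetical stand-ins, not framework.js code.
class ToyFilter {
    constructor(results) {
        this.results = results;
    }
    // Start a chain from an already-selected list of strings
    // (a real selector would pull these from the document).
    static fromList(list) {
        return new ToyFilter(list.slice());
    }
    // Apply one transformation to every result in the list.
    map(fn) {
        return new ToyFilter(this.results.map(fn));
    }
    // Remove whatever matches the given regular expression.
    remove(regex) {
        return this.map((s) => s.replace(regex, ""));
    }
    // Very rough stand-in for Zotero's author cleaning: split the
    // name on the last space into first and last name.
    cleanAuthor(creatorType) {
        return this.map((s) => {
            const name = s.trim();
            const i = name.lastIndexOf(" ");
            return {
                firstName: i === -1 ? "" : name.slice(0, i),
                lastName: i === -1 ? name : name.slice(i + 1),
                creatorType: creatorType
            };
        });
    }
}

// Text content of two made-up byline elements.
const bylines = [
    "Jane Doe, Chronicle Staff Writer",
    "John Q. Public, Special to The Chronicle"
];

const creators = ToyFilter.fromList(bylines)
    .remove(/,.*$/)          // drop the first comma and everything after it
    .cleanAuthor("author")   // turn each name into a creator object
    .results;
```

Each step maps over the whole list, so two matching bylines yield two creator entries without any explicit loop in the scraper definition.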
If multiple FW.Scrapers are defined, choosing good detect criteria is essential. The detect field is used to decide whether or not a scraper should be used.
If a scraper's detect criteria evaluate to a non-empty array, that scraper is used for the page. If there is one scraper without a detect, it is always used. If more than one scraper qualifies, either because its detect evaluates to a non-empty array or because it has no detect, the behavior is undefined.
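The selection rule can be modelled in a few lines. This is a sketch under stated assumptions: the scraper objects, the matchingScrapers helper, and the idea of detect as a plain function returning an array are all hypothetical, and the "document" is just a made-up list of class names.

```javascript
// Toy model of the detect rule. The scraper objects and the
// matchingScrapers helper are hypothetical; detect is modelled as a
// plain function returning an array, which is an assumption.
function matchingScrapers(scrapers, doc) {
    return scrapers.filter((scraper) => {
        if (scraper.detect === undefined) {
            // No detect: the scraper always applies.
            return true;
        }
        const found = scraper.detect(doc);
        // Only a non-empty array counts as a match.
        return Array.isArray(found) && found.length > 0;
    });
}

// A made-up "document": just the list of CSS classes present.
const articlePage = { classes: ["articleheadings", "byline"] };

const articleScraper = {
    name: "article",
    detect: (doc) => doc.classes.filter((c) => c === "articleheadings")
};
const blogScraper = {
    name: "blog",
    detect: (doc) => doc.classes.filter((c) => c === "blogpost")
};

const matches = matchingScrapers([articleScraper, blogScraper], articlePage);
// Exactly one match is the well-defined case; more than one
// would fall into the undefined behavior described above.
const ambiguous = matches.length > 1;
```

Checking for more than one match while developing a translator is a cheap way to notice overlapping detect criteria before they cause unpredictable results.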
For more information, please contact Erik Hetzner at egh@e6h.org or the zotero-dev mailing list.