Add body property #2

Open
ianstormtaylor opened this issue May 25, 2016 · 5 comments

Comments

@ianstormtaylor

Add a body property that returns the cleaned body of the article.

@oyeanuj

oyeanuj commented Jul 16, 2016

I'd be super interested in this!

@mikuhl-dev

Yes please!

@Kikobeats changed the title from "add body property" to "Add body property" on Jul 27, 2019

@julianpinedayyz

+1

@janzheng

janzheng commented May 19, 2020

I modified the code a bit and saved it into a separate function and it works like a charm:

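const Readability = require('readability')
const jsdom = require('jsdom')
// memoizeOne is used below but was not imported in the original snippet;
// assuming it comes from @metascraper/helpers, as in metascraper-readability
const { memoizeOne } = require('@metascraper/helpers')

const { JSDOM, VirtualConsole } = jsdom
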
export const readabilityScraper = () => {

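  // Local replacement for the helper's composeRule: runs the extractor and
  // returns the `from` property of its result as-is, with no validation
  const composeRule = fn => ({ from, to = from, ...opts }) => async ({
    htmlDom,
    url
  }) => {
    const data = await fn(htmlDom, url)
    return data[from]
  }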

  const readability = memoizeOne(($, url) => {
    const dom = new JSDOM($.html(), { url, virtualConsole: new VirtualConsole() })
    const reader = new Readability(dom.window.document)
    const article = reader.parse()

    /*
      This article object will contain the following properties:
      title: article title
      content: HTML string of processed article content
      length: length of an article, in characters
      excerpt: article description, or short excerpt from the content
      byline: author metadata
      dir: content direction
    */
    return article
  })

  const getReadability = composeRule(readability)

  const rules = {
    description: getReadability({ from: 'excerpt', to: 'description' }),
    publisher: getReadability({ from: 'siteName', to: 'publisher' }),
    author: getReadability({ from: 'byline', to: 'author' }),
    title: getReadability({ from: 'title' }),
    dir: getReadability({ from: 'dir' }),
    length: getReadability({ from: 'length' }),
    body: getReadability({ from: 'content' }),
  }
  return rules
}

Then, when I set up metascraper:

const metascraper = require('metascraper')([
  readabilityScraper(),
])

I had to modify composeRule because the regular helper has a validator function that filters out "foreign" keys like body.

@vpul

vpul commented Jun 18, 2020

I made a slight change to @janzheng's code: return data[from] only if data exists. Without this check, the function crashes on non-article URLs.

const composeRule = fn => ({ from, to = from, ...opts }) => async ({
  htmlDom,
  url
}) => {
  const data = await fn(htmlDom, url)
  // The extractor can return nothing for non-article pages
  if (data) {
    return data[from]
  }
}
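
To see what the guard changes, here is a minimal self-contained sketch. It assumes the extractor returns null when no article is found (Readability's parse() can do this), and it uses a hypothetical stub, fakeReadability, in place of the real memoized Readability call, so the URLs and returned fields are illustrative only:

```javascript
// Guarded composeRule, same shape as the snippet above.
const composeRule = fn => ({ from, to = from, ...opts }) => async ({
  htmlDom,
  url
}) => {
  const data = await fn(htmlDom, url)
  if (data) {
    return data[from]
  }
}

// Hypothetical stub standing in for the Readability-backed function:
// resolves to an article-like object for one URL pattern, null otherwise.
const fakeReadability = async (htmlDom, url) =>
  url.endsWith('/article')
    ? { title: 'Hello', content: '<p>Hello</p>' }
    : null

const bodyRule = composeRule(fakeReadability)({ from: 'content', to: 'body' })

// Runs the rule against an "article" URL and a "non-article" URL.
async function demo () {
  const hit = await bodyRule({ htmlDom: null, url: 'https://example.com/article' })
  const miss = await bodyRule({ htmlDom: null, url: 'https://example.com/' })
  return { hit, miss }
}
```

With the guard, the non-article case resolves to undefined instead of throwing a TypeError on data[from], so the rule simply yields no value for that field.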
