Undocumented patterns #249

Open
j0k3r opened this issue Feb 1, 2017 · 2 comments

Comments

@j0k3r
Collaborator

j0k3r commented Feb 1, 2017

I'm trying to add some validation tests for the siteconfig files to catch mistakes in them, and I found some undocumented patterns in several of them (following that, I'll submit some fixes for other files).

Here is the list:

I was wondering if these patterns are obsolete, new, unused, etc.
I can't find them in the documentation or in the current open source version of Full-Text RSS. Have they been introduced in the current version of Full-Text RSS? (which would mean we can't see how they are handled)

Let me know 🙂

@j0k3r j0k3r mentioned this issue Feb 1, 2017
@fivefilters
Owner

fivefilters commented Feb 1, 2017

Hey, thanks for the list. Most of these are carried over from Instapaper when I imported their site rules. They no longer have them public, but it used to be open for anyone to contribute (like this repository). I didn't implement all their directives, so most of these will just be ignored. Here's the list from Instapaper (at least I think all of these are from them, some might be users experimenting/guessing):

  • convert_double_br_tags
  • strip_comments
  • move_into
  • autodetect_next_page
  • dissolve
  • footnotes
  • wrap_in

Of these, I'd like to implement dissolve. I think that removes the containing element without removing its contents. It would have been useful for that French site which wrapped regular words in special links (to a dictionary, I think); we ended up with a somewhat hacky solution instead.
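
To make the idea concrete, here is a minimal Python sketch of what dissolve might do, assuming it means "remove the element, keep its children in place". The helper name `dissolve` and the sample markup are hypothetical, and elements are selected by tag name here rather than by XPath, because Python's `xml.dom.minidom` has no XPath support. This is not how Instapaper or Full-Text RSS implements anything.

```python
from xml.dom import minidom


def dissolve(node):
    """Remove `node` from the tree but leave its children where it stood.

    Hypothetical helper: an assumption about what `dissolve` means.
    """
    parent = node.parentNode
    while node.firstChild is not None:
        # insertBefore detaches the child from `node` and re-attaches it
        # to `parent`, immediately before `node`.
        parent.insertBefore(node.firstChild, node)
    parent.removeChild(node)


# Made-up sample: a regular word wrapped in a dictionary link.
doc = minidom.parseString('<p>Un <a href="/dict/mot">mot</a> simple</p>')
for link in list(doc.getElementsByTagName("a")):
    dissolve(link)

result = doc.documentElement.toxml()
print(result)
```

With this reading, the link wrapper disappears while the word it contained stays in the paragraph.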

These others are implemented in Full-Text RSS:

native_ad_clue
Introduced in Full-Text RSS 3.4. Used to identify if a given article is a native ad. Ad Detector has a lot of rules.

if_page_contains
Introduced in Full-Text RSS 3.5. This is only used with single_page_link at the moment. Added to make single_page_link directives conditional. Sometimes these rules use XPath functions like concat, like in the example you linked to:

  single_page_link: concat(//meta[@property="og:url"]/@content, '?print=1')
  if_page_contains: //a[contains(@class, "articleNav")]

Here, single_page_link will always return a string, so even if the meta element doesn't exist, you'll get '?print=1'. For some sites, the single page view is only available on multi-page articles. When constructing URLs like this, we need a way to make it conditional. Otherwise we'd end up redirecting to a non-existent page, or simply unnecessarily requesting another page when the current one contains everything we need. So that's what if_page_contains does at the moment.
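
The interaction between the two directives can be sketched roughly as follows. This is only an illustration in Python, not the Full-Text RSS code: the function name `single_page_url` and the sample pages are made up, and the two XPath expressions are translated by hand because the standard library's ElementTree does not support `concat()` or `contains()`.

```python
import xml.etree.ElementTree as ET


def single_page_url(root):
    """Return the print-view URL, or None when the condition is not met."""
    # if_page_contains: //a[contains(@class, "articleNav")]
    has_nav = any("articleNav" in a.get("class", "") for a in root.iter("a"))
    if not has_nav:
        return None  # condition failed: stay on the current page

    # single_page_link: concat(//meta[@property="og:url"]/@content, '?print=1')
    meta = root.find(".//meta[@property='og:url']")
    base = meta.get("content", "") if meta is not None else ""
    # Like concat(), this always yields a string, even if the meta is missing.
    return base + "?print=1"


# Made-up pages: one multi-page article, one single-page article.
multi_page = ET.fromstring(
    '<html><head><meta property="og:url" content="http://example.com/a"/></head>'
    '<body><a class="articleNav next" href="?p=2">2</a></body></html>'
)
one_page = ET.fromstring("<html><head/><body><p>Short</p></body></html>")

print(single_page_url(multi_page))
print(single_page_url(one_page))
```

Without the `has_nav` check, the second page would still produce '?print=1' and trigger a pointless (or broken) request.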

single_page_link_in_feed
This one should be documented, but it's not widely used. Basically the same as single_page_link but applied to the original feed item's description. So safe to ignore if the input URL is not a feed. See this question and our help page.

@j0k3r
Collaborator Author

j0k3r commented Feb 2, 2017

> Of these, I'd like to implement dissolve.

It might be a good idea. From what I understand, it'll flatten the target node?
Like:

<ul>
  <li>
    <div>my text</div>
  </li>
</ul>

If I have dissolve: //ul/li, it'll turn the node into:

    <div>my text</div>

Am I right?

Thanks for the explanation of the patterns implemented in Full-Text RSS.

For the unused list, maybe we can just remove them from siteconfig to avoid confusion?

  • convert_double_br_tags
  • strip_comments
  • move_into
  • autodetect_next_page
  • footnotes
  • wrap_in
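
A cleanup like this could be scripted. The sketch below is a rough idea in Python, assuming each siteconfig line has the form `directive: value`; the function name `strip_unused` and the sample config text are made up for illustration.

```python
# Directives listed above as unused (dissolve is deliberately left out).
UNUSED = {
    "convert_double_br_tags",
    "strip_comments",
    "move_into",
    "autodetect_next_page",
    "footnotes",
    "wrap_in",
}


def strip_unused(config_text):
    """Drop every line whose directive name is in UNUSED (hypothetical helper)."""
    kept = []
    for line in config_text.splitlines():
        # The directive name is whatever precedes the first colon.
        name = line.split(":", 1)[0].strip()
        if name in UNUSED:
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up sample config.
sample = "title: //h1\nstrip_comments: yes\nbody: //article\nfootnotes: no"
cleaned = strip_unused(sample)
print(cleaned)
```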
