Need help to find a fingerprint for 60+ ippen.media newssites #926

Open
HolgerAusB opened this issue Feb 6, 2022 · 18 comments

@HolgerAusB
Collaborator

ippen.media is a big German news company with a large number of (mostly local) German newspapers and magazines and their corresponding websites. All or most of these websites use the same (ugly) CMS. Most of the clutter is already filtered by the standard fivefilters rules; I have now managed to write a config that strips the few remaining annoying parts.

But no one wants to maintain 60+ config files, one for each domain. My skills aren't sufficient to find a single reliable fingerprint for the array in custom_config.php, so I just put my local news sites in there, e.g.

'<meta content="https://www.fr.de" property="og:site_name"/>' => array('hostname'=>'fingerprint.ippen.media', 'head'=>true),
'<meta content="https://www.fnp.de" property="og:site_name"/>' => array('hostname'=>'fingerprint.ippen.media', 'head'=>true),

That works for me, but not for hosted versions. So could someone please find a single, reliable fingerprint that matches most ippen.media websites?

Here are some of the bigger papers:
https://www.fr.de/
https://www.fnp.de/
https://www.hna.de/
https://www.hanauer.de/
https://www.kreiszeitung.de/

To get the original feed, append rssfeed.rdf to the URL of the category page,
e.g. https://www.fr.de/hessen/rssfeed.rdf
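
For scripting this, a minimal Python sketch of the URL rule described above. The function name `feed_url` is made up for illustration; only the `rssfeed.rdf` suffix and the example URL come from this thread.

```python
def feed_url(category_url: str) -> str:
    """Build the feed URL for a category page by appending rssfeed.rdf.

    A trailing slash on the category URL is tolerated but not required.
    """
    return category_url.rstrip("/") + "/rssfeed.rdf"


print(feed_url("https://www.fr.de/hessen/"))
```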

Or is there another way to put custom fingerprints in a separate file? I think that if @fivefilters blew up his array with 60+ domains, it could slow down performance for ALL users, which should not happen!

I tried to use the div around the IM logo in the upper right of the page, but found that fingerprinting only works in the <head> and not in the body. Or did I make a mistake there?
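
To make the head-versus-body behaviour concrete, here is a rough Python sketch of how a fingerprint table like the one above could be applied. It is not FTR's actual PHP code: the names `match_fingerprint` and `HEAD_RE` are invented for illustration, and it assumes that `'head' => true` restricts the search to the `<head>` element.

```python
import re

# Mirrors the PHP array above: HTML fragment -> options (illustrative copy).
FINGERPRINTS = {
    '<meta content="https://www.fr.de" property="og:site_name"/>': {
        "hostname": "fingerprint.ippen.media", "head": True},
    '<meta content="https://www.fnp.de" property="og:site_name"/>': {
        "hostname": "fingerprint.ippen.media", "head": True},
}

# Captures everything between <head ...> and </head>; does not match <header>.
HEAD_RE = re.compile(r"<head\b[^>]*>(.*?)</head>", re.IGNORECASE | re.DOTALL)


def match_fingerprint(html, fingerprints=FINGERPRINTS):
    """Return the hostname of the first matching fingerprint, or None."""
    head_match = HEAD_RE.search(html)
    head = head_match.group(1) if head_match else ""
    for fragment, opts in fingerprints.items():
        # Assumption: head=True means only the <head> section is searched.
        haystack = head if opts.get("head") else html
        if fragment in haystack:
            return opts["hostname"]
    return None
```

Under this assumption, a fragment that only occurs in the body (such as the div around the logo) never matches an entry marked `head`, which would explain the observation above.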

@HolgerAusB
Collaborator Author

An include command could also help. Then I could create the 60+ files just once and put an include_config .ippen.media.txt line in each of them, while the real code lives in .ippen.media.txt.

Another advantage here would be that you could make additional settings per site if necessary.

@HolgerAusB
Collaborator Author

I think I will try to fingerprint this line:
'var ippenErr = [], ippenPrevEH' => array('hostname'=>'fingerprint.ippen.media', 'head'=>true),
It looks like ippen-specific code.

@fivefilters, if my test on the weekend succeeds, would you put this in your standard config.php? Especially on fivefilters.org in your live system. The release for self-hosters will surely take longer, won't it? I would then write a note for them in .ippen.media.txt.

@fivefilters
Owner

fivefilters commented Feb 11, 2022

@HolgerAusB Thanks, we'll take a look soon and try to add it to the config. You're right that at the moment the config.php is only updated with each new release. It would be nice to have a way of updating these via the site config files somehow. We've had a suggestion of linking site config files together using symlinks before, but I'm worried it'll create difficulty inside git, zip, and the different platforms people use to run the software.

@HolgerAusB
Collaborator Author

@fivefilters, I'm sorry to disturb you. Do you have a rough estimate of when you can add the fingerprint to the config? I don't want to rush you. I just have to make the first fix inside .ippen.media. If you don't have time right now, I would create copies for the larger newspapers' domains and put them in the PR. (see also #928)

@HolgerAusB
Collaborator Author

Sorry @fivefilters, @j0k3r, me again. It seems that my fingerprint is not found under some circumstances, even though it is there. So instead I will write a script that copies an ippen template for every site, picking the test link from a second file on my side. Then I will upload all the files and PR the whole bunch.

@j0k3r
Collaborator

j0k3r commented Mar 21, 2022

Just tried on f43.me (which uses graby, which uses these site configs) and the fingerprint I suggested in your PR is working great:
[image]

@HolgerAusB
Collaborator Author

@j0k3r, I don't really know what to do now. YOUR suggestion currently works only in Graby/Wallabag. FTR currently only searches within <head> for the fingerprint. My search string is currently not matched on some of the sites, even though it exists. Maybe ippen returns a different header when it detects a User-Agent that may not be a browser.

And even if @fivefilters finds time to include one of the fingerprints, you never know how long it will be valid. If the provider changes something, it may take again weeks until it is included in a new version of FTR.

Even if it means more work for me, it is more flexible if I just run my script and create 50+ individual sitename.de.txt files from a template.
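
Such a generator script could look roughly like the following Python sketch. It is not the script actually used here: the template body is a placeholder, the `generate_configs` name and the hostname-to-test-link mapping are assumptions, and only the idea of one sitename.de.txt file per site with its own `test_url` line comes from this thread.

```python
from pathlib import Path
import tempfile

# Placeholder template; the real shared ippen.media rules are not shown here.
TEMPLATE = """# Generated from a shared ippen.media template (placeholder)
# shared extraction rules would go here
test_url: {test_url}
"""


def generate_configs(sites: dict[str, str], out_dir: Path) -> list[Path]:
    """Write one <hostname>.txt site config per entry and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for hostname, test_url in sorted(sites.items()):
        path = out_dir / f"{hostname}.txt"
        path.write_text(TEMPLATE.format(test_url=test_url), encoding="utf-8")
        written.append(path)
    return written


# Hypothetical input: hostname -> test link (placeholder URLs, not real test links).
sites = {
    "fr.de": "https://www.fr.de/hessen/",
    "fnp.de": "https://www.fnp.de/",
}
with tempfile.TemporaryDirectory() as tmp:
    for path in generate_configs(sites, Path(tmp)):
        print(path.name)
```

Keeping the test links in a separate mapping means a template change only requires rerunning the script.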

@HolgerAusB
Collaborator Author

So I have now finished generating config files for 56 news sites. I'm just not sure whether to open the PR or wait for an answer.

@j0k3r
Collaborator

j0k3r commented Mar 22, 2022

Wait for @fivefilters' answer.

@fivefilters
Owner

fivefilters commented Mar 22, 2022

Thanks @HolgerAusB @j0k3r! I'm happy for these to be merged until we can improve the fingerprint handling in Full-Text RSS. I think in the future these fingerprints should really exist outside of the code so they can be updated without the need for new code releases.

@HolgerAusB
Collaborator Author

Just a new idea: how about referring to another config file? All 56 site files could then contain something like this:
include_conf: .ippen.media
while the real code is in .ippen.media.txt
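
As an illustration of the proposal, the following Python sketch expands `include_conf` lines in place. `include_conf` is only the directive suggested in this comment, not an existing FTR or Graby feature; the `resolve_includes` helper, the in-memory `configs` mapping and the `strip_id_or_class` value are hypothetical.

```python
def resolve_includes(name, configs, _seen=frozenset()):
    """Return the lines of config `name` with include_conf lines expanded.

    `configs` maps a config name (file name without .txt) to its text.
    """
    if name in _seen:
        raise ValueError(f"circular include_conf: {name}")
    seen = _seen | {name}
    resolved = []
    for line in configs[name].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "include_conf":
            # Replace the directive with the lines of the referenced config.
            resolved.extend(resolve_includes(value.strip(), configs, seen))
        else:
            resolved.append(line)
    return resolved


# Hypothetical contents; the rule value below is a placeholder.
configs = {
    ".ippen.media": "strip_id_or_class: placeholder-id\n",
    "fr.de": "include_conf: .ippen.media\ntest_url: https://www.fr.de/hessen/\n",
}
print("\n".join(resolve_includes("fr.de", configs)))
```

Expanding the include where it appears lets each site file add its own lines, such as a site-specific `test_url`, around the shared rules.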

@fivefilters
Owner

fivefilters commented Mar 26, 2022

I quite like this suggestion. There was a proposal a while back using symlinks to achieve something similar, but I was concerned the symlinks wouldn't survive different systems and packaging (e.g. zip, where symlinks aren't part of the format itself).

I think this has the advantage that we can also add a test_url specific for the entry, which should hopefully allow us to catch problems in the future. What do you think @j0k3r?

@j0k3r
Collaborator

j0k3r commented Mar 26, 2022

That's an interesting suggestion.

I think I still prefer the fingerprint because it avoids having many files.

But the idea of having one real test_url per site is great.

One question comes to mind: should the included config file have a custom name, like base.example.com.txt (let's say like an abstract class), to avoid being used as a real site config, OR can it be used as a real site config?

@fivefilters
Owner

I like having the fingerprint option too, in such a way that they can be maintained independent of the code. So I'm open to implementing both options.

To me, the fingerprint option was intended to be more far-reaching, for example for platforms like WordPress.com or Medium.com that might use a HTML template we can target, but let people use their own domain names, resulting in thousands or millions of sites we can't possibly know about or wish to store individually in this repository.

In this particular case, it's quite a lot of sites, so I can see the fingerprint option being better suited for it, especially if more get added and we don't want to track them ourselves. And if there are some sites within this group that would benefit from having a test_url, I think we could have site config files with only test_url lines, which would allow us to monitor for changes.

@HolgerAusB
Collaborator Author

Follow-up from #1184
@fivefilters, @j0k3r
The fingerprints have changed since last year.

For FTR:
'<script>window.dataLayer = window.dataLayer||[];([{"de.ippen-digital.story.onlineId"' => array('hostname'=>'fingerprint.ippen.media', 'head'=>true)

For Graby/Wallabag:
see j0k3r/graby#338

I checked with FTR and Wallabag; it should work now.

@HolgerAusB
Collaborator Author

@fivefilters the Substack fingerprint isn't live in the current release of FTR, right?

@fivefilters
Owner

@HolgerAusB It is, but it's a little longer. We shortened it recently as the longer one didn't appear on the custom domains tested. That shorter one which I shared earlier will be in the next version.

@HolgerAusB
Collaborator Author

Added the Substack fingerprint to my graby PR, thank you.
