Search can't find text having soft hyphens and/or ligature control characters #133

Open
Moonbase59 opened this issue Nov 21, 2023 · 2 comments · May be fixed by #134

@Moonbase59

Today I was scratching my head, because TNT Search didn’t find pages that definitely contain my search words. Until I realized that most of my text contains soft hyphens, to give the renderer some hyphenation hints for our looong German words:

[Screenshot, 2023-11-21: Nite Radio blog page nite-radio-laeuft]

Now of course the search won’t find something like Text&shy;datei, or Textdatei with an invisible U+00AD inside, and a user cannot know how I hyphenated my text.

It’s even worse with ligatures, which are heavily used in German Fraktursatz as well as in Arabic/Persian/Indian languages to control how a word actually looks. This is mostly done using Unicode U+200C (zero-width non-joiner) and U+200D (zero-width joiner).
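
To make the failure concrete, here is a minimal sketch using plain substring matching. It does not use the plugin's actual tokenizer, and the sample sentence and variable names are made up for illustration:

```php
<?php
// Illustration only: the sample sentence and variable names are made up.
$page = "Die Text\u{00AD}datei liegt in der zweiten Auf\u{200C}lage bei.";
$invisible = ["\u{00AD}", "\u{200C}", "\u{200D}"];

// Raw text: the hidden code points split the words, so the lookup fails.
$foundRaw = strpos($page, 'Textdatei') !== false;

// Same text with the three code points removed: both words are found.
$cleaned = str_replace($invisible, '', $page);
$foundCleaned = strpos($cleaned, 'Textdatei') !== false
    && strpos($cleaned, 'Auflage') !== false;

var_dump($foundRaw, $foundCleaned); // bool(false), bool(true)
```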

Here’s my proposal for better search:

Since we’re already "cleaning" the searched pages in getCleanContent() (in file user/plugins/tntsearch/classes/GravTNTSearch.php), we might as well remove these in-word Unicode control characters before looking for a match.

I have tried this here, using Grav v1.7.43, Admin v1.10.43, TNT Search v3.4.0, and it works well, just by adding:

// 2023-11-21 MCH - Remove some in-word Unicode that regularly breaks searches
$problematic = [
    '/&shy;/i', '/&#173;/', '/&#xAD;/i', '/\x{00AD}/u', // soft hyphen
    '/&zwj;/i', '/&#8205;/', '/&#x200D;/i', '/\x{200D}/u', // zero-width joiner
    '/&zwnj;/i', '/&#8204;/', '/&#x200C;/i', '/\x{200C}/u', // zero-width non-joiner
];
$content = preg_replace($problematic, '', $content) ?? $content;

in getCleanContent(). As you can see, we have to check the four most common variants of each character (named entity, decimal character reference, hexadecimal character reference, and the literal character), since article editors could use any of them in their Markdown text. Some lucky ones even have keyboards with these characters on them.
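
For anyone who wants to try the idea outside the plugin, here is a standalone sketch that folds the same four variants into one pattern. The function name `stripInWordControls()` is hypothetical and not part of TNT Search; it only assumes the entity spellings named above:

```php
<?php
// Hypothetical helper, not part of the plugin: strips soft hyphens and
// joiner controls in their entity and literal forms.
function stripInWordControls(string $content): string
{
    $pattern = '/&(?:shy|zwj|zwnj);'          // named entities
             . '|&#(?:173|8204|8205);'        // decimal references
             . '|&#x(?:AD|200C|200D);'        // hexadecimal references
             . '|[\x{00AD}\x{200C}\x{200D}]'  // literal characters
             . '/iu';

    // preg_replace() returns null on failure (e.g. invalid UTF-8), so fall back.
    return preg_replace($pattern, '', $content) ?? $content;
}

$samples = [
    'Text&shy;datei',
    'Text&#173;datei',
    'Text&#xAD;datei',
    "Text\u{00AD}datei",
];
foreach ($samples as $sample) {
    echo stripInWordControls($sample), "\n"; // prints "Textdatei" each time
}
```

A single alternation avoids running twelve separate replacements, at the cost of being a little harder to read than the list form.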

I think this change would improve the TNT Search plugin a lot, since it could then find text even when it has been typographically enhanced on the website. Of course, one could no longer search for the removed entities themselves (like &shy;), but that shouldn’t be a problem.

Strictly speaking, user input from the search box should also have these removed, but I assume a website user would probably never enter a soft hyphen or ligature control character in the search box. At least I wouldn’t enter Text&shy;datei, Brot&zwnj;zeit or Auf&zwnj;lage (or use the invisible keys) but would instead search for a simple textdatei, brotzeit or auflage:

[Screenshot, 2023-11-21: Nite Radio search page (“Suche”)]
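
Should the search-box input ever need the same treatment, a minimal sketch could look like this. `normalizeQuery()` is a hypothetical name, and where it would hook into the plugin is left open, since the proposal does not touch the input handlers:

```php
<?php
// Hypothetical helper: cleans a raw search-box value before it is passed on.
// Only literal characters are handled, assuming users do not type entities.
function normalizeQuery(string $query): string
{
    $cleaned = preg_replace('/[\x{00AD}\x{200C}\x{200D}]/u', '', $query) ?? $query;

    return trim($cleaned);
}

echo normalizeQuery(" Brot\u{200C}zeit "), "\n"; // prints "Brotzeit"
```

Applying the same removal to both the indexed content and the query keeps the two sides comparable, even for text pasted from a page that contains these characters.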

If there are no objections, I could prepare a pull request.

@rhukster
Member

Pull requests are always welcome. Cheers.

@Moonbase59
Author

Done for your testing. Didn’t touch any input handlers or version numbers.
Let me know!
