Search can't find text having soft hyphens and/or ligature control characters #133

Moonbase59 · 2023-11-21T14:24:46Z

Today I was scratching my head because TNT Search didn’t find pages that definitely contain my search words, until I realized that most of my text contains soft hyphens, which give the renderer some hyphenation hints for our looong German words.

Now of course the search won’t find something like Text&shy;datei or Textdatei (with an invisible U+00AD inside), and a user cannot know how I hyphenated my text.

It’s even worse with ligatures, which are heavily used in German Fraktursatz as well as in Arabic/Persian/Indian languages, to control what a word actually looks like. This is mostly done using the Unicode characters U+200C (zero-width non-joiner) and U+200D (zero-width joiner).

Here’s my proposal for better search:

Since we’re already "cleaning" the searched pages in getCleanContent() (in file user/plugins/tntsearch/classes/GravTNTSearch.php), we might as well remove these in-word Unicode control characters before looking for a match.

I have tried this here, using Grav v1.7.43, Admin v1.10.43, TNT Search v3.4.0, and it works well, just by adding:

// 2023-11-21 MCH - Remove some in-word Unicode that regularly breaks searches
$problematic = [
    '/&shy;/i', '/&#173;/', '/&#x00AD;/i', '/\x{00AD}/u', // soft hyphen
    '/&zwj;/i', '/&#8205;/', '/&#x200D;/i', '/\x{200D}/u', // zero-width joiner
    '/&zwnj;/i', '/&#8204;/', '/&#x200C;/i', '/\x{200C}/u', // zero-width non-joiner
];
$content = preg_replace($problematic, '', $content) ?? $content;

in getCleanContent(). As you can see, we have to check the four most common variants of each character (named entity, decimal entity, hexadecimal entity, and the raw character), since article editors could use any of them in their Markdown text. Some lucky ones even have keyboards with these characters on them.
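To try the four variants outside of Grav, here is a minimal standalone sketch. The function name `stripInWordControls()` and the sample strings are hypothetical and not part of the plugin; the patterns are the ones listed above.

```php
<?php
// Hypothetical helper, not part of TNT Search: applies the same patterns
// as the snippet above to an arbitrary string.
function stripInWordControls(string $content): string
{
    $problematic = [
        '/&shy;/i', '/&#173;/', '/&#x00AD;/i', '/\x{00AD}/u',   // soft hyphen
        '/&zwj;/i', '/&#8205;/', '/&#x200D;/i', '/\x{200D}/u',  // zero-width joiner
        '/&zwnj;/i', '/&#8204;/', '/&#x200C;/i', '/\x{200C}/u', // zero-width non-joiner
    ];
    // preg_replace() returns null on error (e.g. invalid UTF-8 with /u),
    // so fall back to the unmodified content in that case.
    return preg_replace($problematic, '', $content) ?? $content;
}

// One sample per variant: named, decimal, hexadecimal, raw character.
$samples = [
    'Text&shy;datei',
    'Text&#173;datei',
    'Text&#x00AD;datei',
    "Text\u{00AD}datei",
    'Auf&zwnj;lage',
];
foreach ($samples as $sample) {
    echo stripInWordControls($sample), PHP_EOL;
}
```

Each sample should come out as the plain word, which is what the search index then sees. Note that the hexadecimal patterns only match the zero-padded four-digit form; a spelling like `&#xAD;` would need an extra pattern.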

I guess this change will improve the TNT Search Plugin a lot, since it will be able to find text even if it has been typographically enhanced on the web site. Of course one couldn’t search for the removed entities themselves anymore (like &shy;), but that shouldn’t be a problem, I think.

Strictly speaking, user input from the search box should also have these removed, but I assume a website user would probably never enter a soft hyphen or ligature control character in the search box. At least I wouldn’t enter Text&shy;datei, Brot&zwnj;zeit or Auf&zwnj;lage (or use the invisible keys) but instead use a simple textdatei, brotzeit or auflage for searching.
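If the search input were to be cleaned as well, one possible approach is sketched below. This is an assumption for illustration only, not how the plugin handles input: `normalizeQuery()` is a hypothetical name, and decoding entities first is a different technique from the pattern list above. It relies on PHP’s built-in `html_entity_decode()` to turn named and numeric entities into raw characters, then removes the three raw characters in one pass.

```php
<?php
// Hypothetical helper, not part of TNT Search: strips soft hyphens and
// zero-width (non-)joiners from a search query.
function normalizeQuery(string $query): string
{
    // Decode entities such as &shy; or &#8204; into their raw characters.
    // Caveat: this also decodes unrelated entities (e.g. &amp; becomes &).
    $decoded = html_entity_decode($query, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Remove U+00AD, U+200C and U+200D; keep the decoded query if the
    // regex fails (preg_replace() returns null on error).
    return preg_replace('/[\x{00AD}\x{200C}\x{200D}]/u', '', $decoded) ?? $decoded;
}

echo normalizeQuery('Brot&zwnj;zeit'), PHP_EOL;
echo normalizeQuery("Text\u{00AD}datei"), PHP_EOL;
```

Whether decoding every entity in a query is desirable depends on how the rest of the search pipeline treats characters like `&`, so this would need checking against the plugin’s actual input handling.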

If there are no objections, I could prepare a pull request.


rhukster · 2023-11-21T14:53:29Z

Pull requests are always welcome. Cheers.

Moonbase59 · 2023-11-21T16:01:41Z

Done for your testing. Didn’t touch any input handlers or version numbers.
Let me know!

Moonbase59 linked a pull request Nov 21, 2023 that will close this issue

Remove &shy; &zwj; &zwnj; from page before comparison #134

Open
