fix #228 (failures with quotes) #613

lmmarsano · 2018-12-08T03:22:15Z

I'm not fluent in C++, so a close review with guidance, or someone continuing from where I started, would probably be best.
This tries to fix #228.

According to the ChangeLog (hunspell/ChangeLog, lines 89 to 95 in 4ddd8ed)

> * better apostrophe usage:
> - WORDCHARS only with one of the Unicode or ASCII apostrophe
> results extended word tokenization: both of them will be part of
> the words (if they are inside: eg. word's, but not words').
> - convert Unicode apostrophes to ASCII ones for 8-bit dictionaries
> (eg. English dictionaries), or for UTF-8 dictionaries only
> with ASCII apostrophe supports (eg. French dictionaries).

tokenization should treat interior apostrophes as part of words and exclude boundary apostrophes.
However, the test provided in lmmarsano/hunspell@c825888 fails its assertion; please check out that commit to see:
luism@lmm-notebook:~/project/hunspell/tests$ ./test.sh apostrophe.dic
=============================================
Fail in apostrophe.good. Good words recognised as wrong:
'is'

The master branch reports 40 failures in AppVeyor, which I made sure to preserve in this branch.
fixes #228 and related issues

According to the ChangeLog

> * better apostrophe usage:
> - WORDCHARS only with one of the Unicode or ASCII apostrophe
> results extended word tokenization: both of them will be part of
> the words (if they are inside: eg. word's, but not words').
> - convert Unicode apostrophes to ASCII ones for 8-bit dictionaries
> (eg. English dictionaries), or for UTF-8 dictionaries only
> with ASCII apostrophe supports (eg. French dictionaries).

tokenization should treat interior apostrophes as part of words and exclude boundary apostrophes. However, the provided test fails the assertion.

Add `TextParser::wordchar_apostrophe` protected property to abstract queries that apostrophe is a `WORDCHAR`.

Modify `TextParser::next_token`
- in non word case, ignore `'\''`
- in word case
  - substitute `TextParser::wordchar_apostrophe`
  - rewrite branches regarding apostrophe as nested branches to introduce boolean variables `is_wordchar_apostrophe` & `is_end_apostrophe` to track whether a `WORDCHAR` apostrophe was encountered and whether it's at the end of a word
  - split the state transition branch into a separate statement so it can handle apostrophes
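The intended behaviour (interior apostrophes kept, boundary apostrophes dropped) can be illustrated with a minimal stand-alone sketch. This is not the code in this PR and not hunspell's `TextParser`: the function name `tokenize` and the `apostrophe_is_wordchar` parameter are hypothetical, it treats only ASCII letters as word characters, and it borrows just the `is_end_apostrophe` flag name from the commit message above.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-alone tokenizer (an assumption for illustration only).
// ASCII letters are word characters; an ASCII apostrophe is kept only when
// it sits between two letters and apostrophe_is_wordchar is true.
std::vector<std::string> tokenize(const std::string& text,
                                  bool apostrophe_is_wordchar) {
  std::vector<std::string> tokens;
  std::string word;                // token being built; empty means non word state
  bool is_end_apostrophe = false;  // true while the last char of word is an apostrophe

  // Close the current word, dropping a trailing (boundary) apostrophe.
  auto flush = [&]() {
    if (is_end_apostrophe) word.pop_back();
    if (!word.empty()) tokens.push_back(word);
    word.clear();
    is_end_apostrophe = false;
  };

  for (unsigned char c : text) {
    if (std::isalpha(c)) {
      // A letter after a pending apostrophe makes that apostrophe interior.
      word.push_back(static_cast<char>(c));
      is_end_apostrophe = false;
    } else if (c == '\'' && apostrophe_is_wordchar && !word.empty() &&
               !is_end_apostrophe) {
      // Tentatively keep the apostrophe; it may still turn out to be trailing.
      word.push_back('\'');
      is_end_apostrophe = true;
    } else {
      // Non word character, leading apostrophe, or second apostrophe in a row.
      flush();
    }
  }
  flush();
  return tokens;
}
```

Under these assumptions, `tokenize("'is' word's words'", true)` yields `is`, `word's` and `words`, which is the outcome the failing `'is'` case in the test expects.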

rename apostrophe test to apostrophe1
add apostrophe2 with `WORDCHARS ’`

add `TextParser::is_apostrophe` method to test whether current character is an apostrophe
in non word case of `TextParser::next_token`, skip apostrophes
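As a rough illustration of an apostrophe check that covers both the ASCII apostrophe and the Unicode one used in the apostrophe2 test, here is a sketch over a UTF-8 byte string. The names `apostrophe_length` and `skip_apostrophes` are hypothetical, and the actual signature and behaviour of `TextParser::is_apostrophe` in this PR may well differ; in particular, handling U+2019 here is an assumption.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not the PR's TextParser::is_apostrophe).
// Returns the byte length of an apostrophe starting at pos:
// 1 for ASCII ', 3 for U+2019 encoded as UTF-8, 0 if there is none.
std::size_t apostrophe_length(const std::string& line, std::size_t pos) {
  if (pos >= line.size()) return 0;
  if (line[pos] == '\'') return 1;
  static const std::string kRightQuote = "\xE2\x80\x99";  // U+2019 in UTF-8
  if (line.compare(pos, kRightQuote.size(), kRightQuote) == 0) return 3;
  return 0;
}

// Hypothetical helper: in the non word state, advance past any run of
// apostrophes and return the index of the first non apostrophe byte.
std::size_t skip_apostrophes(const std::string& line, std::size_t pos) {
  while (std::size_t n = apostrophe_length(line, pos)) pos += n;
  return pos;
}
```

Returning a byte length rather than a boolean is a design choice of this sketch: it lets the caller step over a multi-byte apostrophe without re-decoding it.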

Jamim · 2024-01-29T03:22:44Z

Hello @lmmarsano,
Would you mind updating this PR in order to resolve conflicts?
Thanks in advance!

lmmarsano added 4 commits December 7, 2018 04:11

add failing unicode apostrophe test

bf4d7db

pass apostrophe2 test

608c119

lmmarsano changed the title ~~WIP: fix #228 (failures with quotes)~~ fix #228 (failures with quotes) Apr 18, 2019
