fix #228 (failures with quotes) #613

Open
wants to merge 4 commits into
base: master


lmmarsano

I'm not fluent in C++, so a close review with guidance, or someone picking up from where I started, would probably be best.
This tries to fix #228.

According to the ChangeLog

hunspell/ChangeLog

Lines 89 to 95 in 4ddd8ed

* better apostrophe usage:
- WORDCHARS only with one of the Unicode or ASCII apostrophe
results extended word tokenization: both of them will be part of
the words (if they are inside: eg. word's, but not words').
- convert Unicode apostrophes to ASCII ones for 8-bit dictionaries
(eg. English dictionaries), or for UTF-8 dictionaries only
with ASCII apostrophe supports (eg. French dictionaries).

tokenization should treat interior apostrophes as part of words and exclude boundary apostrophes.
However, the test provided in lmmarsano/hunspell@c825888 fails the assertion; please check out that commit to see.

luism@lmm-notebook:~/project/hunspell/tests$ ./test.sh apostrophe.dic
=============================================
Fail in apostrophe.good. Good words recognised as wrong:
'is'
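
To make the expected behaviour concrete, here is a minimal standalone sketch of the ChangeLog rule quoted above: an apostrophe counts as part of a word only when it sits between two letters. This is not Hunspell's `TextParser`; `is_letter` and `tokenize` are hypothetical names, and the sketch assumes ASCII input and the ASCII apostrophe only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Hunspell: true for characters that always
// belong to a word in this sketch (ASCII letters only).
static bool is_letter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Split `line` into words. An ASCII apostrophe is kept only when it sits
// between two letters; apostrophes at a word boundary are dropped.
std::vector<std::string> tokenize(const std::string& line) {
    std::vector<std::string> tokens;
    std::string current;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        // Interior means: a word is already in progress and a letter follows.
        const bool interior_apostrophe =
            c == '\'' && !current.empty() &&
            i + 1 < line.size() && is_letter(line[i + 1]);
        if (is_letter(c) || interior_apostrophe) {
            current.push_back(c);
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Under this rule, the input `'is'` from the failing test would tokenize to the single word `is`, `word's` would stay intact, and `words'` would lose its trailing apostrophe.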

The master branch already reports 40 failures in AppVeyor; I made sure this branch leaves that count unchanged.
fixes #228 and related issues

According to the ChangeLog
> 	* better apostrophe usage:
> 	- WORDCHARS only with one of the Unicode or ASCII apostrophe
> 	  results extended word tokenization: both of them will be part of
> 	  the words (if they are inside: eg. word's, but not words').
> 	- convert Unicode apostrophes to ASCII ones for 8-bit dictionaries
> 	  (eg. English dictionaries), or for UTF-8 dictionaries only
> 	  with ASCII apostrophe supports (eg. French dictionaries).
tokenization should treat interior apostrophes as part of words and exclude boundary apostrophes.
However, the provided test fails the assertion.
Add `TextParser::wordchar_apostrophe` protected property to abstract queries of whether the apostrophe is a `WORDCHAR`.
Modify `TextParser::next_token`
- in non word case, ignore `'\''`
- in word case
	- substitute `TextParser::wordchar_apostrophe`
	- rewrite the apostrophe branches as nested branches, introducing boolean variables `is_wordchar_apostrophe` & `is_end_apostrophe` to track whether a `WORDCHAR` apostrophe was encountered and whether it's at the end of a word
	- split the state transition branch into a separate statement so it can handle apostrophes
rename apostrophe test to apostrophe1
add apostrophe2 with `WORDCHARS ’`
add `TextParser::is_apostrophe` method to test whether current character is an apostrophe
in non word case of `TextParser::next_token`, skip apostrophes
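
As a rough illustration of the last two commit notes, the sketch below shows one way an apostrophe check and the skip in the non-word state could look. It is not the code in this PR: `apostrophe_length` and `skip_apostrophes` are hypothetical free functions, and the sketch assumes the input is UTF-8, where the Unicode apostrophe `’` (U+2019) is the three-byte sequence `E2 80 99`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not Hunspell's API: returns the byte length of the
// apostrophe starting at s[i], or 0 if there is none.
//   '\''            -> 1 byte  (ASCII apostrophe, U+0027)
//   "\xE2\x80\x99"  -> 3 bytes (right single quotation mark, U+2019, UTF-8)
std::size_t apostrophe_length(const std::string& s, std::size_t i) {
    if (i >= s.size()) return 0;
    if (s[i] == '\'') return 1;
    // compare() clamps the length, so a truncated sequence never matches.
    if (s.compare(i, 3, "\xE2\x80\x99") == 0) return 3;
    return 0;
}

// Skip any run of apostrophes starting at position i, as a tokenizer in its
// non-word state might; returns the index of the first non-apostrophe byte.
std::size_t skip_apostrophes(const std::string& s, std::size_t i) {
    while (std::size_t n = apostrophe_length(s, i)) i += n;
    return i;
}
```

Returning a byte length rather than a plain boolean lets the caller advance past either apostrophe form without re-decoding the character.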
lmmarsano changed the title from "WIP: fix #228 (failures with quotes)" to "fix #228 (failures with quotes)" on Apr 18, 2019

Jamim commented Jan 29, 2024

Hello @lmmarsano,
Would you mind updating this PR in order to resolve conflicts?
Thanks in advance!

Development

Successfully merging this pull request may close these issues.

Failures with quotes
2 participants