Parsing text for individual words #77

Open
sparticus1701 opened this issue Dec 5, 2022 · 3 comments

@sparticus1701

This is more of a question, but I'd like to use this in a project I'm working on. From what I can tell, WordList.Check is designed to check single words.

Are there any recommendations for a tool that breaks sentences up into the words that should be checked? A naive way would be to just use string.split(), but I'd like to see if there's a tool that automatically handles numbers, currency, and sentence punctuation. I've been looking at some NLP tools, but I'm wondering if you've used anything in particular.

@aarondandy
Owner

I have not. I made this port to build a spell checker for Roslyn, but got bored before that could ever happen. Splitting identifiers into words is much easier than NLP though 😁. I haven't done anything with human grammars yet, so I don't think I can point you in the right direction for that.

@funex

funex commented Dec 6, 2022

@sparticus1701 Have a look at this closed issue; it answers your question: #75

"The text boundary positions are found according to the rules described in Unicode Standard Annex 29, Text Boundaries, and Unicode Standard Annex 14, Line Breaking Properties. These are available at http://www.unicode.org/reports/tr14/ and http://www.unicode.org/reports/tr29/."
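For anyone who wants a rough starting point before reaching for a full UAX #29 implementation, here is a minimal sketch in Python of the kind of pre-filtering asked about above. It is only a regex approximation, not the Unicode word-boundary algorithm from the quote: it drops any whitespace-delimited token containing a digit (numbers, currency amounts, dates), strips surrounding punctuation, and keeps internal apostrophes. The names `extract_words`, `find_misspellings` and `is_correct` are hypothetical; `is_correct` stands in for whatever single-word check you use, such as a wrapper around WordList.Check.

```python
import re

# A run of letters (no digits, no underscore), optionally joined by
# internal apostrophes, e.g. "doesn't". Hyphenated words are split apart.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

# Whitespace-delimited chunks; used to discard anything mixed with digits.
TOKEN_RE = re.compile(r"\S+")


def extract_words(text):
    """Return candidate words for spell checking (regex approximation)."""
    words = []
    for token in TOKEN_RE.findall(text):
        if any(ch.isdigit() for ch in token):
            continue  # skip numbers, currency amounts, dates, "3rd", etc.
        words.extend(WORD_RE.findall(token))
    return words


def find_misspellings(text, is_correct):
    """Return the words that the caller-supplied checker rejects.

    `is_correct` is a hypothetical callback taking one word and
    returning True if it is spelled correctly.
    """
    return [word for word in extract_words(text) if not is_correct(word)]
```

Whether hyphenated compounds and apostrophes should be kept or split depends on the dictionary in use, so treat those choices as assumptions to adjust. For scripts without spaces between words, a real UAX #29 segmenter is the better option.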

@funex

funex commented Nov 21, 2023

@aarondandy I would close this one.


3 participants