Index +/- in chemical formulae #172

Open
aaccomazzi opened this issue May 27, 2020 · 0 comments

Users have asked us for the ability to search for chemical symbols such as CO2+. Right now this is impossible because the tokenizer strips the trailing + sign. The same applies to the minus sign.

The other thing we'll have to deal with in this context is superscripts and subscripts, which are now handled with the HTML tags <SUP> and <SUB>.

Here is an example of what the Classic tokenizer did for some of these cases:

INPUT:    Index H-alpha and H&alpha;+ and H<SUB>&alpha;</SUB> and as well
OUTPUT:   INDEX HALPHA HALPHA+ HALPHA WELL H ALPHA HALPHA HALPHA
POSITION: 1     2      3       4      5    2 2     3      4      

(Note: SOLR does not translate the &alpha; entity into the Unicode glyph α (U+03B1).)

Here is an example of how chemical formulae are handled:

INPUT:    test formula H2O+ test CNO-SI+ test PSi+O3- end
OUTPUT:   TEST FORMULA H2O+ TEST CNO-SI+ TEST PSI+O3- END CNO- SI+ PSI+ O3- H2O CNO SI PSI O3 
POSITION: 1    2       3    4    5       6    7       8   5    5   7    7   3   5   5  7   7 

Essentially, this handling applies to sequences matching ([A-Z][A-Za-z0-9]*[+-])*.
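
To make the expected behaviour concrete, here is a minimal Python sketch that reproduces the Classic output for the chemical formula example above. It is not the Classic or SOLR implementation: the function name `expand_formula_tokens` is hypothetical, tokens are split on whitespace only, and stop-word removal, entities and <SUB>/<SUP> handling are left out. The rules are inferred from the example: a token made up entirely of signed segments is indexed whole, as its individual signed segments (when there is more than one), and as the segments with the sign stripped, all at the position of the original token.

```python
import re

# One signed segment: a capital letter, any alphanumerics, then a trailing + or -.
SEGMENT = re.compile(r"[A-Z][A-Za-z0-9]*[+-]")
# A whole token that consists only of such segments (at least one).
FORMULA = re.compile(r"(?:[A-Z][A-Za-z0-9]*[+-])+")


def expand_formula_tokens(text):
    """Return (term, position) pairs; variants share the position of their source token."""
    whole, signed, bare = [], [], []
    for position, token in enumerate(text.split(), start=1):
        # Every token is indexed as-is (uppercased), sign included.
        whole.append((token.upper(), position))
        if not FORMULA.fullmatch(token):
            continue
        segments = SEGMENT.findall(token)
        # Compound formulae such as CNO-SI+ also get each signed segment.
        if len(segments) > 1:
            signed.extend((s.upper(), position) for s in segments)
        # Each segment is also indexed without its trailing sign.
        bare.extend((s[:-1].upper(), position) for s in segments)
    return whole + signed + bare


if __name__ == "__main__":
    for term, position in expand_formula_tokens(
        "test formula H2O+ test CNO-SI+ test PSi+O3- end"
    ):
        print(position, term)
```

Emitting the variants at the same position as the original token is what would let both `H2O+` and `H2O` match inside a phrase query without shifting the neighbouring terms.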
