isuppercase/islowercase fail on invalid characters #54343

Open
Seelengrab opened this issue May 3, 2024 · 3 comments · May be fixed by #54393
Labels
domain:unicode Related to unicode characters and encodings

Seelengrab commented May 3, 2024

MWE:

julia> isuppercase('\xf0\x8e\x80\x80')
ERROR: Base.InvalidCharError{Char}('\xf0\x8e\x80\x80')
Stacktrace:
 [1] throw_invalid_char(c::Char)
   @ Base ./char.jl:86
 [2] UInt32
   @ ./char.jl:133 [inlined]
 [3] isuppercase(c::Char)
   @ Base.Unicode ./strings/unicode.jl:403
 [4] top-level scope
   @ REPL[12]:1

julia> Base.ismalformed('\xf0\x8e\x80\x80')
false

Either throwing here is intended behavior, or we can safely return false, as is already done for malformed characters. Does utf8proc handle invalid/malformed chars on its own? The docs aren't clear about this.

@Seelengrab Seelengrab added the domain:unicode Related to unicode characters and encodings label May 3, 2024

stevengj commented May 7, 2024

I think we should clearly be returning false here, similar to malformed characters.

Malformed chars can never get passed to utf8proc in the first place — if there is no way to convert them to a UInt32 codepoint, you can't pass them to the utf8proc API.

On invalid codepoints, utf8proc_isupper(codepoint) should return false.

stevengj commented May 7, 2024

Isn't this a bug in ismalformed? If it can't be converted to a codepoint, isn't it malformed?

Or should we have another predicate in this case, where it's failing because it is an overlong encoding (Base.is_overlong_enc is returning true in UInt32(c))?
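
For reference, a minimal check of how the existing predicates classify the character from the MWE. This assumes Base.ismalformed and Base.isoverlong (both unexported internals of Base, so subject to change) behave as described in this thread:

```julia
# The character from the MWE: a 4-byte sequence encoding a code point
# that would fit in fewer bytes, i.e. an overlong encoding.
c = '\xf0\x8e\x80\x80'

Base.ismalformed(c)  # false, as reported in the MWE
Base.isoverlong(c)   # true: the overlong case that UInt32(c) rejects
isvalid(c)           # false
```

So the character is not well-formed enough to have a codepoint, yet ismalformed alone does not flag it.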

stevengj commented May 7, 2024

Maybe

isuppercase(c::AbstractChar) = ismalformed(c) ? false :
    Bool(@assume_effects :foldable @ccall utf8proc_isupper(UInt32(c)::UInt32)::Cint)

should just be calling isvalid(c) instead of ismalformed(c)?

Or better yet just (ismalformed(c) | isoverlong(c)) since utf8proc checks for the other cases.

Or better yet, shouldn't we have a predicate

hascodepoint(c::AbstractChar) = !(ismalformed(c) | isoverlong(c))

to check whether one can call codepoint(c) (== UInt32(c))?
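
A rough sketch of how such a predicate could guard the case check, written as user-level code rather than a patch to Base. The names hascodepoint_sketch and safe_isuppercase are hypothetical and exist only for illustration; Base.ismalformed and Base.isoverlong are unexported internals:

```julia
# Hypothetical helper mirroring the predicate proposed above; not a Base API.
hascodepoint_sketch(c::AbstractChar) = !(Base.ismalformed(c) | Base.isoverlong(c))

# Guarded variant: skip the case check (and its UInt32 conversion)
# for chars flagged as malformed or overlong, returning false instead.
safe_isuppercase(c::AbstractChar) = hascodepoint_sketch(c) && isuppercase(c)

safe_isuppercase('A')                 # true
safe_isuppercase('a')                 # false
safe_isuppercase('\xf0\x8e\x80\x80')  # false (overlong), no InvalidCharError
safe_isuppercase('\xff')              # false (malformed)
```

The short-circuiting && means isuppercase is never reached for the overlong character from the MWE, so the wrapper behaves the same whether or not the underlying function has been fixed.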

@stevengj stevengj linked a pull request May 7, 2024 that will close this issue