`isuppercase`/`islowercase` fail on invalid characters #54343

Seelengrab · 2024-05-03T09:41:45Z

MWE:

julia> isuppercase('\xf0\x8e\x80\x80')
ERROR: Base.InvalidCharError{Char}('\xf0\x8e\x80\x80')
Stacktrace:
 [1] throw_invalid_char(c::Char)
   @ Base ./char.jl:86
 [2] UInt32
   @ ./char.jl:133 [inlined]
 [3] isuppercase(c::Char)
   @ Base.Unicode ./strings/unicode.jl:403
 [4] top-level scope
   @ REPL[12]:1

julia> Base.ismalformed('\xf0\x8e\x80\x80')
false

Either this is a requirement, or we can safely return false here, as is done for malformed characters. Does utf8proc handle invalid/malformed chars on its own? The docs aren't clear about this.

stevengj · 2024-05-07T17:09:51Z

I think we should clearly be returning false here, similar to malformed characters.

Malformed chars can never get passed to utf8proc in the first place — if there is no way to convert them to a UInt32 codepoint, you can't pass them to the utf8proc API.

On invalid codepoints, utf8proc_isupper(codepoint) should return false.

stevengj · 2024-05-07T17:15:19Z

Isn't this a bug in ismalformed? If it can't be converted to a codepoint, isn't it malformed?

Or should we have another predicate in this case, where it's failing because it is an overlong encoding (Base.is_overlong_enc is returning true in UInt32(c))?
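To make the distinction concrete, here is a small sketch that runs the existing predicates on the character from the MWE. `Base.ismalformed` and `Base.isoverlong` are unexported internals, so their availability may vary between Julia versions; the surrogate `Char(0xd800)` is an extra example (not from the report) of a character that is invalid but still has a codepoint.

```julia
# Classify the character from the MWE with the existing predicates.
c = '\xf0\x8e\x80\x80'

malformed = Base.ismalformed(c)  # false, as shown in the report
overlong  = Base.isoverlong(c)   # true: 4-byte overlong encoding
valid     = isvalid(c)           # false

# Assumed extra example: a surrogate is invalid, yet neither malformed
# nor overlong, so conversion to a codepoint still succeeds.
s = Char(0xd800)
surrogate_valid = isvalid(s)     # false
surrogate_cp    = codepoint(s)   # 0x0000d800
```

So `ismalformed` alone does not cover every character for which `UInt32(c)` throws, while `isvalid` also rejects characters (such as surrogates) that convert without error.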

stevengj · 2024-05-07T17:19:34Z

Maybe

julia/base/strings/unicode.jl

Lines 414 to 415 in dbf0bab

    
           isuppercase(c::AbstractChar) = ismalformed(c) ? false : 
        
               Bool(@assume_effects :foldable @ccall utf8proc_isupper(UInt32(c)::UInt32)::Cint)

should just be calling isvalid(c) instead of ismalformed(c)?

Or better yet just (ismalformed(c) | isoverlong(c)) since utf8proc checks for the other cases.

Or better yet, shouldn't we have a predicate

hascodepoint(c::AbstractChar) = !(ismalformed(c) | isoverlong(c))

to check whether one can call codepoint(c) (== UInt32(c))?
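A minimal sketch of how such a guard might be used, assuming the definition proposed above. The names `has_codepoint_sketch` and `isuppercase_safe` are hypothetical and chosen to avoid clashing with anything in Base; this is not the implementation in the linked pull request.

```julia
# Hypothetical helper mirroring the proposed predicate: true when
# codepoint(c) can be computed without throwing.
has_codepoint_sketch(c::AbstractChar) =
    !(Base.ismalformed(c) | Base.isoverlong(c))

# Hypothetical wrapper: return false instead of throwing when the
# character cannot be converted to a codepoint.
isuppercase_safe(c::AbstractChar) = has_codepoint_sketch(c) && isuppercase(c)

r_overlong = isuppercase_safe('\xf0\x8e\x80\x80')  # overlong char from the MWE
r_upper    = isuppercase_safe('A')
r_lower    = isuppercase_safe('a')
```

The short-circuit `&&` ensures `isuppercase` is never reached for characters without a codepoint, so the wrapper behaves the same whether or not the underlying function has been fixed.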

Seelengrab added the domain:unicode label (Related to unicode characters and encodings) May 3, 2024

stevengj linked a pull request May 7, 2024 that will close this issue

add hascodepoint(c::AbstractChar) and use it #54393

Open
