Support non-unicode hostname #153

Open
frankdilo opened this issue Sep 27, 2023 · 3 comments
Comments

@frankdilo

URLExtract does not match this URL as it should: сайт.com

@Olaf-

Olaf- commented Oct 27, 2023

This also applies to other examples like rohlík.cz or neovlivní.cz.

@lipoja
Owner

lipoja commented Dec 26, 2023

@frankdilo, @Olaf-: Unfortunately, those URLs are not valid according to the RFC.

RFC3986
host = IP-literal / IPv4address / reg-name
where
reg-name = *( unreserved / pct-encoded / sub-delims )
and from that
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
and from that and RFC2234
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

As you can see, a domain name can't contain non-ASCII characters (letters with accents, hooks, and so on).
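To make the grammar above concrete, here is a minimal Python sketch of a `reg-name` check. The helper names `is_rfc3986_reg_name` and `to_ascii_host` are made up for illustration and are not part of URLExtract. The second helper relies on Python's built-in `idna` codec (IDNA 2003) to produce the ASCII (Punycode) form of a host, which is an assumption about how such names could be made to fit the grammar, not something URLExtract does.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims )   (RFC 3986, section 3.2.2)
REG_NAME = re.compile(
    r"(?:[A-Za-z0-9\-._~]"      # unreserved
    r"|%[0-9A-Fa-f]{2}"         # pct-encoded
    r"|[!$&'()*+,;=])*"         # sub-delims
)


def is_rfc3986_reg_name(host: str) -> bool:
    """Return True if the whole host string matches the reg-name grammar."""
    return REG_NAME.fullmatch(host) is not None


def to_ascii_host(host: str) -> str:
    """Hypothetical helper: ASCII (Punycode) form via Python's idna codec."""
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    for host in ("example.com", "сайт.com", "rohlík.cz"):
        print(host, is_rfc3986_reg_name(host), to_ascii_host(host))
```

The raw hosts from this issue fail the check, while their Punycode forms pass it, since those contain only ASCII letters, digits, hyphens and dots.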

I am open to discussion, but I would suggest a workaround: convert all characters to ASCII, then use URLExtract to find the URLs and their positions, and extract the URLs from the original text at those positions.
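A rough sketch of that workaround in Python, under some assumptions: every non-ASCII character is replaced by a single ASCII placeholder so that positions stay aligned with the original text, and the URL finder is passed in as a callable returning `(start, end)` spans. The names `mask_non_ascii`, `find_urls_in_original` and `simple_finder` are hypothetical; `simple_finder` is only a regex stand-in so the example runs without URLExtract installed.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]


def mask_non_ascii(text: str, placeholder: str = "x") -> str:
    """Replace each non-ASCII character with one ASCII character (length is preserved)."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


def find_urls_in_original(text: str, finder: Callable[[str], List[Span]]) -> List[str]:
    """Run the finder on the masked text, then slice the same spans out of the original."""
    masked = mask_non_ascii(text)
    return [text[start:end] for start, end in finder(masked)]


# Stand-in finder (an assumption for this sketch, not URLExtract's logic).
_DOMAIN = re.compile(r"\b(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}\b")


def simple_finder(ascii_text: str) -> List[Span]:
    return [m.span() for m in _DOMAIN.finditer(ascii_text)]


if __name__ == "__main__":
    print(find_urls_in_original("Visit сайт.com or rohlík.cz today", simple_finder))
```

With URLExtract itself, the finder would wrap whatever call returns URL positions (I believe `find_urls` has a `get_indices` option, but check the documentation). One limitation of masking: a non-ASCII TLD is masked as well, so a TLD-list based finder will most likely not recognize it.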

@hwo411

hwo411 commented Mar 4, 2024

Also applies to fully Cyrillic domains like сайт.рф (even if you prepend it with https://). Would be great to see it fixed.

E.g., twitter-text in Ruby handles this properly: https://github.com/twitter/twitter-text/blob/master/rb/lib/twitter-text/regex.rb#L257
