Error in read_xml.raw: Input is not proper UTF-8, indicate encoding ! #70

Open
DanChaltiel opened this issue Mar 26, 2023 · 0 comments

Hi,

Running spelling::spell_check_test() fails on the crosstable package with the following error:

spelling::spell_check_package()
#>Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>  Input is not proper UTF-8, indicate encoding !
#>Bytes: 0x93 0x63 0x79 0x94 [9]

I have no clue where this error can come from and the error message is unfortunately not very informative.

Would it be possible to fail early in spelling rather than in xml2, so that the file path appears in the error message?
Of course, it would be even better to also get the line number and the specific offending character!

Note that in this case, UTF-8 is the declared encoding in the package's DESCRIPTION and in the RStudio settings. R CMD check completes without error, so I guess the encoding problem is not that severe, don't you think?

REPREX

EDIT

After more debugging, it seems to pertain to this line:

doc <- xml2::xml_ns_strip(xml2::read_xml(md))

In my case, it pointed to my README.md file, which indeed contained special characters. I have no idea how they ended up there though, and they are far too numerous for me to correct manually (a knitting problem from README.Rmd, I guess).
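For anyone hitting the same thing: the bytes in the error message (0x93 and 0x94) are the curly double quotes of the Windows-1252 code page, so the offending lines may simply be in that encoding. The following is a minimal sketch, not something spelling does, that locates the invalid lines with base R and re-encodes only those. The demo file and the assumption that the source encoding is CP1252 are mine; check the bytes of your own file before applying it.

```r
# Build a small demo file containing the bytes from the error message
# (0x93 0x63 0x79 0x94), i.e. "cy" wrapped in Windows-1252 curly quotes.
path <- tempfile(fileext = ".md")
writeBin(
  c(charToRaw("A clean line\n"),
    as.raw(c(0x93, 0x63, 0x79, 0x94)),
    charToRaw("\n")),
  path
)

# Read the file as-is and find the lines that are not valid UTF-8
text <- readLines(path, warn = FALSE)
bad <- which(!validUTF8(text))

# ASSUMPTION: the invalid lines are encoded in Windows-1252 (CP1252).
# Re-encode only those lines and leave the valid ones untouched.
fixed <- text
fixed[bad] <- iconv(text[bad], from = "CP1252", to = "UTF-8")

# Every line should now be valid UTF-8
all(validUTF8(fixed))
```

Once the result looks right, writing `fixed` back with `writeLines(fixed, path, useBytes = TRUE)` would replace the file contents. If the file is generated from README.Rmd, the fix probably belongs in the source file instead.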

EDIT2

Since this confusing problem is not that rare (#52, #58, #62), a fix would probably be useful.

Here are some proposals:

  1. simply use a tryCatch() on xml2::xml_ns_strip() so that we can add path in the error message
  2. add a warning in the specific case of non-UTF8 characters:
  text <- readLines(path, warn = FALSE, encoding = "UTF-8")
  # validUTF8() is FALSE for each line that contains invalid UTF-8 bytes
  invalid <- !validUTF8(text)
  if (any(invalid)) {
    warning("The file ", path, " has non-UTF-8 characters on lines: ",
            paste(which(invalid), collapse = ", "))
  }
  3. use this trick from xfun::read_utf8() to ignore the problem (spell_check_package() will have no error):
  opts = options(encoding = "native.enc")
  on.exit(options(opts), add = TRUE)
  text <- readLines(path, warn = FALSE, encoding = "UTF-8")

All three could be done at the same time. I can make a PR if needed.
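The first proposal has no code above, so here is a minimal sketch of what I have in mind. `with_path_in_error()` is a hypothetical helper name, not an existing function in spelling or xml2; it re-raises any error with the file path prepended. To keep the sketch runnable without xml2 installed, a plain `stop()` stands in for the failing `xml2::read_xml()` call.

```r
# Hypothetical helper: evaluate `expr` and, if it fails, re-raise the error
# with the file path prepended to the original message.
with_path_in_error <- function(path, expr) {
  tryCatch(
    expr,  # lazily evaluated here, so errors are caught by this tryCatch()
    error = function(e) {
      stop("Failed to parse file '", path, "': ", conditionMessage(e),
           call. = FALSE)
    }
  )
}

# In spelling, the wrapped call would presumably be the line quoted earlier:
#   doc <- with_path_in_error(path, xml2::xml_ns_strip(xml2::read_xml(md)))

# Stand-in for the xml2 failure; capture the new message to inspect it
res <- tryCatch(
  with_path_in_error(
    "README.md",
    stop("Input is not proper UTF-8, indicate encoding !")
  ),
  error = function(e) conditionMessage(e)
)
print(res)

# Expressions that succeed are returned unchanged
ok <- with_path_in_error("README.md", 1 + 1)
```

The original xml2 message is kept in full, so nothing is lost compared to the current behaviour; the user just learns which file to open.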
