Add support for spell checking roxygen comments #3

jimhester · 2017-09-07T20:08:50Z

`roxygen2::parse_file()` parses the roxygen comments in each file. Text from relevant tags is then checked with `hunspell::hunspell()` to find misspelled words. Because roxygen does not store the original positions of parsed tags, we then need to find the locations of the misspelled words in the original roxygen comment lines of the source. This is done by `find_word_positions()`.

roxygen2::parse_file() is not in the current CRAN version of roxygen2, but I believe @hadley will be submitting a new version to CRAN in the next week or so.

I used Rcpp mainly for convenience; if you would prefer to remove the dependency, I can do so.

`find_word_positions()` returns both the line and the start column of each word. This was done to support a later enhancement of having RStudio Markers for misspelled words, as suggested in r-lib/devtools#1564. That will require additional changes in other parts of the code, so I will do it in a separate PR in the future.
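
For readers following along, here is a minimal sketch of the kind of search described above: given the raw comment lines and a list of misspelled words, report the line and start column of every whole-word match. It is plain standard C++ rather than Rcpp, and it is not the code from this PR; the `WordPosition` struct, the 1-based numbering, and the alphanumeric word-boundary rule are all assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: one match, with 1-based line and column.
struct WordPosition {
    std::string word;
    std::size_t line;
    std::size_t start;
};

// Assumed boundary rule: a word is delimited by non-alphanumeric characters.
static bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0;
}

// Illustrative stand-in, not the PR's implementation.
std::vector<WordPosition> find_word_positions(
    const std::vector<std::string>& lines,
    const std::vector<std::string>& words) {
    std::vector<WordPosition> out;
    for (const std::string& word : words) {
        if (word.empty()) continue;
        for (std::size_t i = 0; i < lines.size(); ++i) {
            const std::string& text = lines[i];
            std::size_t pos = text.find(word);
            while (pos != std::string::npos) {
                std::size_t end = pos + word.size();
                bool left_ok = pos == 0 || !is_word_char(text[pos - 1]);
                bool right_ok = end == text.size() || !is_word_char(text[end]);
                // Record only whole-word matches.
                if (left_ok && right_ok) {
                    out.push_back({word, i + 1, pos + 1});
                }
                pos = text.find(word, pos + 1);
            }
        }
    }
    return out;
}
```

Returning the column as well as the line is what would let a caller place an editor marker on the word itself rather than on the whole line.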

codecov-io · 2017-09-07T20:20:21Z

Codecov Report

Merging #3 into master will decrease coverage by 5.99%.
The diff coverage is 1.69%.

@@          Coverage Diff           @@
##           master      #3   +/-   ##
======================================
- Coverage    45.2%   39.2%   -6%     
======================================
  Files           6       6           
  Lines         250     278   +28     
======================================
- Hits          113     109    -4     
- Misses        137     169   +32

Impacted Files	Coverage Δ
R/spell-check.R	`35.77% <0%> (+1.03%)`	⬆️
R/check-files.R	`42.85% <0%> (-16.08%)`	⬇️
src/find_word_positions.cpp	`3.84% <3.84%> (ø)`
R/language.R
R/remove-chunks.R	`88.88% <0%> (+2.52%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 38e0782...46cf627.

hadley · 2017-09-07T20:37:13Z

roxygen2 should generally be storing the positions of the tags (because they're used for errors)

jeroen · 2017-09-07T20:45:55Z

Currently it reports spelling errors in both the roxygen comments and the Rd files. Perhaps that is sensible, but we could also just skip Rd files that start with the `% Generated by roxygen2: ...` header?
Is it really necessary to source the actual R code? Perhaps we should prefilter only the `#'` lines? Some packages do funky things, or their code can only be sourced after a proper `./configure`. E.g. the curl package errors for this reason:

> spell_check_package("~/workspace/curl")
 Error in (function() { : 
  Failed to find 'tools/option_table.txt' from:/Users/jeroen/workspace/spelling
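
Both suggestions above can be sketched without evaluating any R code. The following is a rough illustration in plain C++, not something this PR implements: `is_roxygen_generated()` and `roxygen_lines()` are hypothetical helper names, and only the header prefix quoted in the comment is matched, since the rest of the header is elided there.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// True when the first line of an Rd file starts with the roxygen2 marker.
// Only the prefix quoted in the thread is checked (assumption).
bool is_roxygen_generated(const std::string& first_line) {
    const std::string marker = "% Generated by roxygen2";
    return first_line.compare(0, marker.size(), marker) == 0;
}

// Keep only roxygen comment lines (optional indentation, then #'),
// paired with their 1-based line numbers in the source file.
std::vector<std::pair<std::size_t, std::string>> roxygen_lines(
    const std::vector<std::string>& source) {
    std::vector<std::pair<std::size_t, std::string>> out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        std::size_t first = source[i].find_first_not_of(" \t");
        if (first != std::string::npos &&
            source[i].compare(first, 2, "#'") == 0) {
            out.emplace_back(i + 1, source[i]);
        }
    }
    return out;
}
```

Keeping the original line number next to each retained line means positions can still be reported against the source file. The trade-off raised in the next reply still applies: a pure text prefilter cannot see the objects that roxygen normally inspects.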

jimhester · 2017-09-07T20:58:56Z

I think roxygen needs access to the objects in general, but perhaps for this use case we might be able to avoid it? I don't think the current API has any way to do that, however.

jeroen · 2017-09-08T10:10:03Z

I'm on vacation next week; I will review this when I'm back.

jeroen

Finally time again to look at this. Tested on a few packages, but I'm getting a lot of false positives. Some examples:

Parser should ignore inline code chunks. Currently we get lots of false positives for code inside backticks (for markdown-roxygen) or `\code{}` chunks.
Parser should skip `\href{}` and `\url{}`.
Parser should skip `@examples` and other non-text block tags.

Note how the Rd parser uses `tools::RdTextFilter`, which says:

This function blanks out all non-text in an Rd file, for spell checking or other uses.

Ideally the roxygen2 spell checker should behave similarly.
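
The "blank out" idea in the quoted description can be illustrated with a small sketch: overwrite non-text spans with spaces so the remaining words keep their original columns. This is plain C++ written for this note, not how `tools::RdTextFilter` or this PR works; `blank_non_text()` is a hypothetical name, and the handling of `\href{}` (blank the first argument, keep the link text) is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Overwrite text[from, to) with spaces so later columns do not shift.
static void blank(std::string& text, std::size_t from, std::size_t to) {
    for (std::size_t i = from; i < to && i < text.size(); ++i) text[i] = ' ';
}

// Hypothetical filter: blanks inline markdown code and selected Rd macros.
std::string blank_non_text(std::string text) {
    // Inline markdown code: `...`
    std::size_t open = text.find('`');
    while (open != std::string::npos) {
        std::size_t close = text.find('`', open + 1);
        if (close == std::string::npos) break;  // unmatched backtick: leave it
        blank(text, open, close + 1);
        open = text.find('`', close + 1);
    }
    // Rd macros whose first argument is not prose.
    const std::vector<std::string> macros = {"\\code{", "\\url{", "\\href{"};
    for (const std::string& macro : macros) {
        std::size_t start = text.find(macro);
        while (start != std::string::npos) {
            std::size_t i = start + macro.size();
            int depth = 1;  // track nested braces inside the argument
            while (i < text.size() && depth > 0) {
                if (text[i] == '{') ++depth;
                else if (text[i] == '}') --depth;
                ++i;
            }
            blank(text, start, i);
            start = text.find(macro, i);
        }
    }
    return text;
}
```

Because the string length never changes, any word found in the filtered text sits at the same column as in the original comment line.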

jimhester · 2017-11-06T13:08:14Z

Do you have an example package where you are seeing this? Pretty sure the code already does the things you mention.

jeroen · 2017-11-06T13:46:13Z

For example, spell checking the spelling package itself gives:

> spelling::spell_check_package(use_wordlist = FALSE)
  WORD       FOUND IN
CMD        description:3
hunspell   spell_check_files.Rd:17
           spell_check_package.Rd:21
           wordlist.Rd:19
pkg        spell-check.R:21, 22
           spell_check_package.Rd:19
           wordlist.Rd:17
rmd        spell-check.R:5, 22
           spell_check_package.Rd:33
rnw        spell-check.R:5, 22
           spell_check_package.Rd:33

Here the rmd and rnw at spell-check.R:22 are actually inline markdown code. Also pkg at spell-check.R:21 refers to a parameter name, not actual text.

jimhester · 2017-11-06T14:46:11Z

Ah yes, now I remember. These problems stem from the fact that these terms are misspelled elsewhere in the same roxygen blocks. Because roxygen does not provide accurate line information for parsed blocks (r-lib/roxygen2#664), we don't have the information to find the actual location of the misspelled word, which is why we have to use `find_word_positions()` to find the location of the misspelled words in the un-parsed roxygen tags. Because this second search does not use parsed text, it does not differentiate between these things, so you will get false positives from words that are misspelled elsewhere in the same roxygen block.

In the cases you cite, it is true that one of the matches should be ignored, but the other match is normal text. E.g. `pkg` is a parameter name in line 21, but it is normal text in line 22:

spelling/R/spell-check.R

Line 22 in 65d419d

    
           #' @param vignettes also check all `rmd` and `rnw` files in the pkg `vignettes` folder

And `rmd` and `rnw` are inline code in line 22, but they are again normal text in line 5:

spelling/R/spell-check.R

Line 5 in 65d419d

    
           #' Parse and spell check R package manual pages, rmd/rnw vignettes, and text fields in the

If we had accurate line information from roxygen this effect would be greatly diminished, as we would only search the exact line for the misspelled word.
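
To make the effect concrete, here is a small sketch comparing a block-wide search with a search restricted to a known line. It is plain C++ written for illustration, not code from this PR; `lines_containing()` is a hypothetical helper, and the two-line block used in the usage note is made up rather than copied from `spell-check.R`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the 1-based numbers of the lines in [first, last] that contain
// `word`. A simple substring test is enough to show the effect.
std::vector<std::size_t> lines_containing(
    const std::vector<std::string>& lines, const std::string& word,
    std::size_t first, std::size_t last) {
    std::vector<std::size_t> hits;
    for (std::size_t n = first; n <= last && n <= lines.size(); ++n) {
        if (n >= 1 && lines[n - 1].find(word) != std::string::npos) {
            hits.push_back(n);
        }
    }
    return hits;
}
```

With a hypothetical block whose first line is `#' @param pkg path to the package` and whose second line mentions `pkg` in running text, searching lines 1 to 2 reports both lines, while searching only line 2 reports just the genuine hit. The first call corresponds to the current block-wide behavior; the second is what accurate line information from roxygen would allow.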

jimhester requested a review from jeroen September 7, 2017 20:08

jimhester force-pushed the roxygen branch from 14b880d to 46cf627 Compare September 7, 2017 20:09

jimhester mentioned this pull request Sep 8, 2017

Include line information for tags in roxy_block r-lib/roxygen2#664

Closed

jeroen reviewed Nov 4, 2017

jeroen force-pushed the roxygen branch from 7cb8133 to 46cf627 Compare December 13, 2017 13:36

jeroen mentioned this pull request Dec 12, 2018

Spell checking hadley/adv-r#1166

Closed
