Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate CSV linter #3493

Open
wesley-dean-flexion opened this issue Apr 18, 2024 · 3 comments
Open

Integrate CSV linter #3493

wesley-dean-flexion opened this issue Apr 18, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@wesley-dean-flexion
Copy link
Contributor

wesley-dean-flexion commented Apr 18, 2024

tl;dr: I would like to add csvclean (from csvkit) as a linter. I'm happy to do the work if people think this is a good idea.

Is your feature request related to a problem? Please describe.

One of my repos includes CSV files and they can sub-optimal. Just as we have linters (and reformaters) for JSON, XML, and YAML, I would like to add a CSV linter.

Describe the solution you'd like

There exists a package, csvkit, that includes a tool to lint and cleanup CSV files:

https://csvkit.readthedocs.io/en/latest/scripts/csvclean.html

I would like to add CSV_CSVCLEAN (the name isn't consequential to me; I just picked it because of YAML_YAMLLINT) that would lint the list of files. When run with APPLY_FIXES, it would not include the --dry-run flag to csvclean; when run without APPLY_FIXES set, it would include the --dry-run flag).

Running csvclean on a CSV file results in two files being created, per the documentation

Outputs [basename]_out.csv and [basename]_err.csv, the former containing all valid rows and the latter containing all error rows along with line numbers and descriptions

I noticed that running csvclean on a known messy file (i.e., one that produces errors due to being not totally valid) will NOT set $? but it will generate [basename]_err.csv, so something like this might be helpful:

csvcleanwrapper() {
  csvkit "$@" "$(-z "$APPLY_FIXES" ] && echo "--dry-run")" -- - && [ -e stderr_err.csv ]
}

(so $? is set, $@ would have the additional arguments, $APPLY_FIXES is set to something when we want to fix stuff, etc.. point: a little "syntactic sugar" could be helpful in making it work the way we want.)

Describe alternatives you've considered

I haven't thought through very many alternatives. I did look through prettier to see if it could clean up CSV like it can for YAML and such; however, it does not appear to have that functionality. If it's there and I missed it, cool, there's that much less work that needs to be done.

Additional context

There are a few images on Docker Hub that provide csvkit, but they're largely several years old. For what it's worth, csvkit regularly provides revisions, the most recent of which (latest / v1.5.0) was released on 28 March, 2024. (point: existing images are behind the current release). I can put together a pipeline to watch the csvkit repo for new releases and package / publish updated images.

I'm happy to do the work to implement this and submit a PR assuming folks are cool with the idea.

The fact that CSV has a bunch of limitations, that JSON, TOML, XML, or YAML (etc.) may be a better match to represent data. That's given and I don't dispute it. Unfortunately, it's not my call about how the data are represented but I do have responsibilities to make sure the pipeline from developer to production detects (and notifies me on) as much noise as possible.

@wesley-dean-flexion wesley-dean-flexion added the enhancement New feature or request label Apr 18, 2024
@nvuillam
Copy link
Member

Hi @wesley-dean-flexion :)

That seems to be a good idea , you have my go to start implementing :)

About the complexity to call csvkit, you might need to create a python class to handle it :)

@wesley-dean-flexion
Copy link
Contributor Author

wesley-dean-flexion commented Apr 22, 2024

(apologies.. I muscle-memoried cvsclean instead of csvclean when creating the branch... ugh...)

Started with csv-clean which is not yet ready for anything. A few questions:

  1. linter_name is the command to run to do the linting; in this case, it would be csvclean because the name of the executable is csvclean .. right?
  2. cli_lint_extra_args can be used to pass --dry-run while cli_lint_fix_remove_args can also be set to --dry-run so that in non-fix mode, it'll pass --dry-run while fix mode will not pass --dry-run... right?
  3. we can use Python regex mechanics (e.g., (?i) to make a regex case-insensitive, [[:space:]]+ for 1 or more white spaces, etc.).. right?
  4. csvclean gives differently-formatted output depending on if it's run with --dry-run or not:
$ csvclean --dry-run acronyms.csv
Line 289: Expected 4 columns, found 5 columns
Line 1196: Expected 4 columns, found 6 columns
Line 1241: Expected 4 columns, found 8 columns
Line 1242: Expected 4 columns, found 2 columns
Line 1307: Expected 4 columns, found 3 columns

### note: NO acronyms_err.csv generated here

$ csvclean acronyms.csv
5 errors logged to acronyms_err.csv

### note: acronyms_err.csv and acronyms_out.csv ARE generated here

so, this is where a wrapper which would live in the linters directory would reside... right? The Python class would look to see if we're running in fix mode or not and apply the --dry-run flag as-needed, grab the correct output, make sure the original file is what's pushed when in fix mode, etc.. right?

@wesley-dean-flexion
Copy link
Contributor Author

I'm working with @jpmckinney on some interface changes (wireservice/csvkit#1239) that ought to simplify this integration. As a result, when v2.0.0 comes out, a lot of what I wrote before will no longer matter.

Additionally, I submitted wireservice/csvkit#1240 to containerize the tool and publish official images that could be used instead of building the tool via pip during the MegaLinter build process. Hopefully this will simplify the build and isolate MegaLinter from any build problems, interface refactoring, etc..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants