A simple command-line tool to see the difference between two CSV files.

blue-monk/csv-diff-python2

csv-diff-python2

🌿 Overview

A simple command-line tool to see the difference between two CSV files.

The tool can report differences in either of the following styles:

  1. Report the number of differences and line numbers
  2. Report diff marks along with the contents of each CSV line
    • You can choose the following report styles
      • Horizontal (Side-by-side) display style
      • Vertical display style
    • You can choose to report only the lines with differences or all lines

🌿 Why csv-diff?

The diff command compares files line by line and is unaware of key columns (like primary keys in a database). It can therefore give undesired results when detecting differences between CSV files that have key columns.

For example, consider comparing the contents of tables in two databases that cannot access each other. One approach is to export each table as a CSV file and compare the files. Because the diff command pays no attention to key columns, it may compare lines that have different keys, so it cannot judge differences accurately on a per-key basis.

This tool, on the other hand, recognizes key columns when detecting differences. Specify the key columns as an argument at run time, and rows are matched by key before they are compared.
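
To make the problem concrete, here is a minimal Python sketch. It is not taken from csvdiff.py: the rows and the helper names pair_by_position and pair_by_key are made up for illustration, and pairing by position is only a deliberately naive stand-in for a comparison that ignores keys.

```python
def pair_by_position(lhs_rows, rhs_rows):
    """Pair rows purely by their position in the file (keys are ignored)."""
    return list(zip(lhs_rows, rhs_rows))


def pair_by_key(lhs_rows, rhs_rows, key_index=0):
    """Pair each left row with the right row that has the same key value."""
    rhs_by_key = dict((row[key_index], row) for row in rhs_rows)
    return [(row, rhs_by_key.get(row[key_index])) for row in lhs_rows]


# Hypothetical data: the right side has one extra record (id 1) at the top.
lhs = [["2", "apple"], ["3", "banana"]]
rhs = [["1", "cherry"], ["2", "apple"], ["3", "banana"]]

# Position-based pairing compares id 2 with id 1, and id 3 with id 2,
# so every pair looks different.
positional = pair_by_position(lhs, rhs)

# Key-based pairing lines up id 2 with id 2, and id 3 with id 3,
# so the only real difference is the extra record on the right.
keyed = pair_by_key(lhs, rhs)

print(positional)
print(keyed)
```

The dictionary lookup is used here only to keep the sketch short; it holds one whole file in memory, which is not how a low-memory tool would be expected to work.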

🌿 Features

  • The CSV delimiter, line terminator, presence of a header, and so on are detected automatically (each can also be specified explicitly)
  • Rows are matched on the key columns before they are compared
  • You can specify columns to exclude from the comparison
  • Differences can be displayed side-by-side (more suitable when the number of columns is small)
  • Differences can be displayed vertically (more suitable when the number of columns is large)
  • Differences are indicated by the following marks, which we call DIFF-MARK
    • !: There is a difference
    • <: Exists only on the left side
    • >: Exists only on the right side
  • It is also possible to display only the number of differences and the line numbers of the lines that differ
  • A comma-delimited file can be compared with a tab-delimited file
  • Low memory consumption
  • Only Python standard modules are used and the tool is provided as a single file, so it is easy to install even in an isolated environment
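
As an illustration of automatic delimiter detection and of comparing a comma-delimited file with a tab-delimited one, the sketch below uses csv.Sniffer from the standard library. This is an assumption made for the example: the README does not say how csvdiff.py performs its detection. The helper read_rows and the sample text are hypothetical.

```python
import csv


def read_rows(text):
    """Parse CSV text, guessing the delimiter with csv.Sniffer.

    Assumption: csv.Sniffer stands in for whatever detection csvdiff.py
    really uses. Candidates are limited to comma and tab.
    """
    dialect = csv.Sniffer().sniff(text, delimiters=",\t")
    rows = list(csv.reader(text.splitlines(), dialect))
    return dialect.delimiter, rows


# Hypothetical contents of two files that use different delimiters.
comma_text = "id,name,price\n1,apple,100\n2,banana,200\n"
tab_text = "id\tname\tprice\n1\tapple\t100\n2\tbanana\t250\n"

comma_delim, comma_rows = read_rows(comma_text)
tab_delim, tab_rows = read_rows(tab_text)

# Once both files are parsed into lists of fields, the original delimiter
# no longer matters: only the field values are compared.
changed = [i for i, (left, right) in enumerate(zip(comma_rows, tab_rows))
           if left != right]
print(changed)
```

Because the comparison works on parsed fields rather than on raw lines, the two inputs do not need to share a delimiter.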

🌿 Requirements

Runtime

  • Python 2.7.18 or later

    If you want to use it with Python3, please use csv-diff-python3.

CSV files

  • Must be sorted by key columns
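
The sorting requirement fits a merge-style comparison, in which both files are read once, in step, and only the current row from each side is kept in memory. The sketch below shows that general technique; it is an assumption about the approach, not the actual code of csvdiff.py. The names key_of and merge_diff are hypothetical, and the "=" mark for identical rows is invented for this example. The rows are the data rows of sample_lhs.csv and sample_rhs.csv from the example later in this README.

```python
def key_of(row, key_indices):
    """Build the comparison key from the key columns."""
    return tuple(row[i] for i in key_indices)


def merge_diff(lhs_rows, rhs_rows, key_indices):
    """Yield (mark, lhs_row, rhs_row, differing_column_indices).

    Both inputs must already be sorted by the key columns. Keys are compared
    as strings here, so the sort order must be a string sort as well.
    """
    lhs_iter, rhs_iter = iter(lhs_rows), iter(rhs_rows)
    lhs, rhs = next(lhs_iter, None), next(rhs_iter, None)
    while lhs is not None or rhs is not None:
        lkey = key_of(lhs, key_indices) if lhs is not None else None
        rkey = key_of(rhs, key_indices) if rhs is not None else None
        if rhs is None or (lhs is not None and lkey < rkey):
            yield "<", lhs, None, []          # key exists only on the left
            lhs = next(lhs_iter, None)
        elif lhs is None or rkey < lkey:
            yield ">", None, rhs, []          # key exists only on the right
            rhs = next(rhs_iter, None)
        else:                                 # same key: compare the columns
            cols = [i for i, (a, b) in enumerate(zip(lhs, rhs)) if a != b]
            yield ("!" if cols else "="), lhs, rhs, cols
            lhs, rhs = next(lhs_iter, None), next(rhs_iter, None)


lhs_rows = [
    ["key1-2", "value1-2", "key2-2", "value2-2", "20201224T035908"],
    ["key1-3", "value1-3", "key2-3", "value2-3", "20201224T180527"],
    ["key1-4", "value1-4", "key2-4", "value2-4", "20201225T104851"],
    ["key1-5", "value1-5", "key2-5", "value2-5", "20201225T142142"],
]
rhs_rows = [
    ["key1-1", "value1-1", "key2-1", "value2-1", "20210108T142358"],
    ["key1-2", "value1-3", "key2-2", "value2-z", "20210108T174216"],
    ["key1-4", "value1-4", "key2-4", "value2-4", "20210109T090245"],
    ["key1-5", "value1-v", "key2-5", "value2-5", "20210109T111231"],
]

results = list(merge_diff(lhs_rows, rhs_rows, key_indices=(0, 2)))
for mark, left, right, cols in results:
    print("%s %s" % (mark, cols))
```

If the inputs were not sorted, a key that appears later in one file would already have been passed over in the other and would be reported as existing on one side only, which is why the sort order matters for this kind of comparison.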

🌿 Installation

With pip

pip install git+https://github.com/blue-monk/csv-diff-python2

It may be safer to install it in a virtual environment created with virtualenv.

Manual installation

Place csvdiff.py in any directory on a machine where Python 2.7 is installed.
It is easier to use if you place it in a directory that is on your PATH.

🌿 Run

If installed with pip

$ csvdiff2 -h

If installed manually

$ python csvdiff.py -h

or

$ chmod +x csvdiff.py
$ ./csvdiff.py -h

🌿 How to use

See the Wiki for more details.

Get help

$ ./csvdiff.py -h

One example

The following example uses the sample data in appendix/csv_samples/.

Sample data

Suppose the key columns are column 0 and column 2 (column indices are 0-based).

  • sample_lhs.csv

    head1, head2, head3, head4, head5
    key1-2, value1-2, key2-2, value2-2, 20201224T035908
    key1-3, value1-3, key2-3, value2-3, 20201224T180527
    key1-4, value1-4, key2-4, value2-4, 20201225T104851
    key1-5, value1-5, key2-5, value2-5, 20201225T142142
    
  • sample_rhs.csv

    head1, head2, head3, head4, head5
    key1-1, value1-1, key2-1, value2-1, 20210108T142358
    key1-2, value1-3, key2-2, value2-z, 20210108T174216
    key1-4, value1-4, key2-4, value2-4, 20210109T090245
    key1-5, value1-v, key2-5, value2-5, 20210109T111231
    

Example of use

To show only the lines that differ, use the -d (--show-difference-only) option.
To also show the number of differences, add the -c (--show-count) option.

$ ../../src/csvdiff2/csvdiff.py sample_lhs.csv sample_rhs.csv -k 0,2 -dc

============ Report ============

* Differences
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_lhs.csv                                                        sample_rhs.csv                                                       Column indices with difference
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                   >  2 ['key1-1', 'value1-1', 'key2-1', 'value2-1', '20210108T142358']
2 ['key1-2', 'value1-2', 'key2-2', 'value2-2', '20201224T035908']  !  3 ['key1-2', 'value1-3', 'key2-2', 'value2-z', '20210108T174216']  @ [1, 3, 4]
3 ['key1-3', 'value1-3', 'key2-3', 'value2-3', '20201224T180527']  <
4 ['key1-4', 'value1-4', 'key2-4', 'value2-4', '20201225T104851']  !  4 ['key1-4', 'value1-4', 'key2-4', 'value2-4', '20210109T090245']  @ [4]
5 ['key1-5', 'value1-5', 'key2-5', 'value2-5', '20201225T142142']  !  5 ['key1-5', 'value1-v', 'key2-5', 'value2-5', '20210109T111231']  @ [1, 4]

* Count & Row number
same lines           : 0
left side only    (<): 1 :-- Row Numbers      -->: [3]
right side only   (>): 1 :-- Row Numbers      -->: [2]
with differences  (!): 3 :-- Row Number Pairs -->: [(2, 3), (4, 4), (5, 5)]
  • Differences are indicated by the following DIFF-MARKs

    • ! : There is a difference
    • < : Exists only on the left side
    • > : Exists only on the right side
  • The number shown before each CSV row is its line number in the actual file

    • Line numbers are 1-based
  • For rows with differences, the indices of the differing columns are shown after @

    • Column indices are 0-based

🌿 Notices

  • For large amounts of data

    A horizontal report takes longer than a vertical report,
    because all lines are scanned in advance to collect the information needed to format the report.
    For large amounts of data, consider using a vertical report.
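
The reason a side-by-side layout needs a scan in advance can be sketched as follows: the width of the left column depends on the longest left-hand text, which is not known until every row has been seen. The functions side_by_side and vertical, and both output layouts, are hypothetical illustrations and do not reproduce the tool's real formatting. The sketch also keeps the rows in a list for brevity, whereas a file-based tool could read its inputs twice instead.

```python
def side_by_side(pairs):
    """Format (left, mark, right) triples in aligned columns.

    Two passes are needed: one to find the widest left-hand text,
    and one to format every line using that width.
    """
    pairs = list(pairs)                                        # pass 1
    width = max([len(left) for left, _, _ in pairs] or [0])
    return ["%-*s  %s  %s" % (width, left, mark, right)        # pass 2
            for left, mark, right in pairs]


def vertical(pairs):
    """Yield each pair as soon as it arrives; no width is required."""
    for left, mark, right in pairs:
        yield "%s\n  L %s\n  R %s" % (mark, left, right)


# Hypothetical report entries.
entries = [("a", "!", "b"), ("long left", "<", "")]

horizontal_lines = side_by_side(entries)
vertical_blocks = list(vertical(entries))

for line in horizontal_lines:
    print(line)
```

In the vertical form each entry can be written out immediately, so no pass over the whole input is needed before output starts.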

🌿 Known Issues

  • Problems with reporting Japanese characters (and probably other multibyte characters)

    For example,
    when a UTF-8 CSV file contains Japanese text,
    the Japanese part is displayed in the report as an escaped UTF-8 byte string.
    For now, I'm not sure how to handle Japanese in Python 2.
    However, the difference detection itself appears to work without any problem.

  • Workaround for files with only one line

    If the CSV file contains only one line, that line is recognized as a header.
    Specify the option -H n to have the file treated as a CSV without a header.
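
Regarding the first issue above, one possible explanation is offered here as an assumption, not as a diagnosis of csvdiff.py. The report shows each row in Python list notation, and when a list is converted to text, Python applies repr() to each element, which escapes every non-ASCII byte of a byte string. The sketch below reproduces that effect and shows that decoding each field before joining avoids the escaping. The row contents are hypothetical.

```python
# The first field holds the UTF-8 bytes of two Japanese characters, written
# with escapes so that this file itself stays pure ASCII.
row = [u"\u65e5\u672c".encode("utf-8"), b"key1"]

# Converting the whole list to text applies repr() to every element,
# so each non-ASCII byte appears as a \x.. escape.
as_list = repr(row)

# Decoding each field and joining the resulting text keeps the characters.
as_text = u", ".join(field.decode("utf-8") for field in row)

print(as_list)
```

The field values themselves are unchanged by how they are displayed, which is consistent with the observation that difference detection still works.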

🌿 Contributing

Reporting Bugs

Python 2.7 reached its end of life on January 1st, 2020.
We don't intend to do any maintenance on the Python 2 version.
Consider using csv-diff-python3.

However, minor bugs may still be addressed, so please report them by opening an issue.

🌿 License

csv-diff-python2 is released under the MIT license. Please read the LICENSE file for more information.