The data in the latest (1.6.0-2020-01-12) exported dataset in CSV format (in repositories and projects_with_repository_data) has columns messed, and so completely junk #2887

Open
KOLANICH opened this issue Nov 22, 2021 · 4 comments


KOLANICH commented Nov 22, 2021

The screenshots are from LibreOffice, but other software sees the data as messed up too.

[screenshot a]
[screenshot c]

Also some rows contain junk:

[screenshot b]

The projects file seems to be OK.

Other files haven't been tested.

@KOLANICH KOLANICH changed the title The data in the latest (1.6.0-2020-01-12) exported dataset (in repositories and projects_with_repository_data) has columns messed, and so completely junk The data in the latest (1.6.0-2020-01-12) exported dataset in CSV format (in repositories and projects_with_repository_data) has columns messed, and so completely junk Nov 22, 2021
@ftarlaci

The issue you describe is a simple column-shift problem, at least for the projects_with_repository_fields file. It can be resolved by loading the file into a Pandas or Dask dataframe (in Python) with the index_col=False argument, or the equivalent of this behavior in other languages.
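A minimal sketch of what index_col=False changes in pandas. The three-column sample below is made up for illustration and is not taken from the dataset; it only assumes that the shift comes from a trailing delimiter on each data row, which is the case this option is documented for.

```python
import io

import pandas as pd

# Hypothetical sample: the header has 3 fields, but every data row ends with
# a trailing comma and therefore parses as 4 fields.
SAMPLE = "a,b,c\n4,apple,bat,\n8,orange,cow,\n"

# Default behaviour: pandas treats the first field of each row as the index,
# so the remaining values slide one column to the left.
shifted = pd.read_csv(io.StringIO(SAMPLE))

# index_col=False tells pandas not to use the first column as the index,
# so the values line up with the header again and the extra field is dropped.
aligned = pd.read_csv(io.StringIO(SAMPLE), index_col=False)

print(shifted)
print(aligned)
```

This only helps when every row is off by the same single trailing field. pandas may also emit a ParserWarning about the header length not matching the data.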

@KOLANICH

The problem is that the rows are not uniformly shifted. Some lines are shifted by one amount, other lines by another, so the same columns contain different data on different lines (at least as exploration in LO Calc showed). Fixing the data would need nontrivial logic, which likely won't work reliably. So the data is completely junk.
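One way to check whether the shift is uniform is to count the fields on every row and compare against the header. The sketch below uses only the Python standard library; the function name field_count_histogram and the sample rows are hypothetical and not part of the dataset or of any tool mentioned in this thread.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines):
    """Return (number of header fields, Counter of field counts per data row)."""
    reader = csv.reader(lines)
    header = next(reader)
    counts = Counter(len(row) for row in reader)
    return len(header), counts


# Hypothetical sample: one well-formed row, one row with a single extra field,
# and one row with two extra fields.
sample = io.StringIO(
    "id,name,stars\n"
    "1,foo,10\n"
    "2,bar,20,\n"
    "3,baz,extra,30,\n"
)

expected, counts = field_count_histogram(sample)

# More than one distinct field count means the rows are not shifted uniformly.
uniform = len(counts) == 1
print(expected, dict(counts), uniform)
```

A histogram with a single bucket points to a uniform shift that a loader option could absorb; several buckets suggest per-row repair would be needed. This check cannot see shifts that leave the field count unchanged.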

Also, I am not going to use pandas; pandas is damn slow. I am going to use a custom importer in C++ using Ben Strasser's fast CSV parsing lib (the schema is defined at compile time).


ftarlaci commented Jan 3, 2022

Well, isn't that the beauty of open source: you work on making it better if you can? Anyway, I would like to leave you with one of my favorite quotes:
"Everyone in open source is doing everyone else a favor to varying levels of commitment. We should treat one another accordingly."

Good luck.


KOLANICH commented Jan 4, 2022

You are right. But I am out of capacity to work on this project too. In fact, I am not even sure that these datasets are going to be useful for the study at all.

Good luck.

Thanks.
