Update MysqlImport.php #4007

tridoxx · 2023-08-29T09:00:37Z

The detectDelimiter function in the code aims to automatically detect the delimiter used in a CSV file. This is useful because in CSV files, the delimiter (the character that separates fields) can vary depending on regional settings and language. For example, in the United States and other English-speaking countries, the most common delimiter is a comma (,), while in some European and Latin American countries, a semicolon (;) is used as the delimiter.

The detectDelimiter function works as follows:

It defines a set of candidate delimiters to check: , (comma), ; (semicolon), and \t (tab).

It opens the CSV file and reads a limited number of lines (in this case, a maximum of 10 lines) for analysis.

For each candidate delimiter, it computes the average number of columns in the sampled lines by dividing the total number of columns across all lines by the number of lines.

The delimiter that produces the highest average number of columns is considered the most likely delimiter and is selected as the detected delimiter.

In summary, this function helps automatically determine the correct delimiter for a given CSV file, which is especially useful when working with CSV files that may have been generated in different regional settings or languages, ensuring that the import process is accurate without requiring manual intervention to specify the delimiter.
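To make the steps above concrete, here is a minimal sketch of the heuristic as described. It is not the code from this PR: the function name `detectDelimiterSketch`, its parameters, and the use of `str_getcsv()` for column counting are assumptions made for illustration.

```php
<?php

/**
 * Hypothetical sketch of the heuristic described above (not the PR's code).
 *
 * Samples up to $max_lines non-empty lines and returns the candidate
 * delimiter that yields the highest average column count.
 */
function detectDelimiterSketch(string $path, int $max_lines = 10): string {
  $delimiters = [',', ';', "\t"];

  $handle = fopen($path, 'r');
  if ($handle === FALSE) {
    throw new RuntimeException("Unable to open $path");
  }

  // Read only a small sample so large files are never fully loaded.
  $sample_lines = [];
  while (count($sample_lines) < $max_lines && ($line = fgets($handle)) !== FALSE) {
    $line = rtrim($line, "\r\n");
    if ($line !== '') {
      $sample_lines[] = $line;
    }
  }
  fclose($handle);

  // Comma is the fallback when the sample is empty or nothing scores higher.
  $best = ',';
  $best_average = 0.0;
  foreach ($delimiters as $delimiter) {
    $total = 0;
    foreach ($sample_lines as $line) {
      // str_getcsv() honours quoted fields, unlike a plain explode().
      $total += count(str_getcsv($line, $delimiter, '"', '\\'));
    }
    $average = $sample_lines ? $total / count($sample_lines) : 0.0;
    if ($average > $best_average) {
      $best_average = $average;
      $best = $delimiter;
    }
  }
  return $best;
}
```

Because the comparison is strict, ties go to the first candidate in the list, so a file with a single column falls back to a comma.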

Note: to use a semicolon as the delimiter, you need to edit docroot/core/lib/Drupal/Core/Database/Connection.php and change 'allow_delimiter_in_query' from false to true. After that change, the front end and the database are structured correctly.

fixes [org/repo/issue#]

Test coverage exists
Documentation exists

QA Steps

To test, you can use the harvest process below. Its data.json contains only 3 datasets.

drush dkan:harvest:register '{ "identifier": "50_datasets", "extract": { "type": "\Harvest\ETL\Extract\DataJson", "uri": "https://raw.githubusercontent.com/tridoxx/urlsdatosabiertos/main/medatapequeno.json" }, "transforms": [], "load": { "type": "\Drupal\harvest\Load\Dataset" } }'
drush dkan:harvest:run 50_datasets
drush queue:run datastore_import

[x] Add manual QA steps in checklist format for a reviewer to perform to confirm that the feature or fix is working. Include as much detail as possible so that the reviewer doesn't lose time figuring out how to perform the steps.


dafeder · 2023-10-04T16:08:35Z

@tridoxx interesting approach, and I think this has a lot of potential. Are you sure the highest number of columns is the best measure of which is the best delimiter? It seems like there are a lot of cases where this would not be true. Imagine a tab-separated file with only three columns, but one of them was a long text field where there were often several commas? I would recommend also checking to ensure that the number of columns per row is identical; if not, we have clearly not correctly identified the delimiter.
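One way the consistency check suggested here might look is sketched below. The function name `pickConsistentDelimiter` and the rule of returning NULL when no candidate splits every row into the same number of columns are assumptions for illustration, not part of the PR.

```php
<?php

/**
 * Hypothetical sketch: accept a delimiter only if every sampled row
 * splits into the same number of columns, and that number exceeds one.
 *
 * Returns NULL when no candidate passes the check.
 */
function pickConsistentDelimiter(array $sample_lines, array $delimiters = [',', ';', "\t"]): ?string {
  $best = NULL;
  $best_columns = 1;
  foreach ($delimiters as $delimiter) {
    $counts = [];
    foreach ($sample_lines as $line) {
      $counts[] = count(str_getcsv($line, $delimiter, '"', '\\'));
    }
    // A single distinct count means every row has the same width.
    $is_consistent = $counts && count(array_unique($counts)) === 1;
    if ($is_consistent && $counts[0] > $best_columns) {
      $best_columns = $counts[0];
      $best = $delimiter;
    }
  }
  return $best;
}
```

With a tab-separated sample whose text column contains a varying number of commas, the comma candidate produces uneven row widths and is rejected, while the tab candidate passes.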

Also, to merge this it would need to meet Drupal coding standards and contain tests for the new methods. Thanks!

paul-m · 2023-10-05T17:52:16Z

modules/datastore/modules/datastore_mysql_import/src/Service/MysqlImport.php

+    $sample_lines = [];
+    $line_count = 0;
+
+    while (($line = fgets($handle)) !== false && $line_count < $max_lines) {


Perhaps use file() instead, so PHP can do the array-making instead of your code. Also, some CSV files are very very large and might break this by running out of memory.
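Since `file()` loads the whole file into an array, it would hit the memory concern raised in the same comment on very large CSVs. A possible middle ground, sketched here under the assumption that only a small sample is needed, is to iterate an `SplFileObject` lazily and stop early. The helper name `readSampleLines` is hypothetical.

```php
<?php

/**
 * Hypothetical helper: collect up to $max_lines non-empty lines.
 *
 * SplFileObject reads one line per iteration, so breaking out of the
 * loop leaves the rest of the file unread.
 */
function readSampleLines(string $path, int $max_lines = 10): array {
  $file = new SplFileObject($path, 'r');
  $lines = [];
  foreach ($file as $line) {
    if (count($lines) >= $max_lines) {
      break;
    }
    // Strip line endings by hand and skip blank lines.
    $line = rtrim((string) $line, "\r\n");
    if ($line !== '') {
      $lines[] = $line;
    }
  }
  return $lines;
}
```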

paul-m · 2023-10-05T17:55:30Z

modules/datastore/modules/datastore_mysql_import/src/Service/MysqlImport.php

+    foreach ($delimiters as $delimiter) {
+      $column_counts = [];
+
+      foreach ($sample_lines as $sample_line) {


It's very likely that there are more lines than delimiters, so putting the lines loop inside the delimiters loop means if the delimiter isn't comma, you'll go through this process for all lines at least twice.

It's also possible that there's only one column with no actual delimiters... in which case we'll do it three times and end up with default of comma.
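A sketch of one way to address both points: visit each sampled line once, tally column counts for every candidate as it goes, and report "no delimiter found" explicitly instead of silently defaulting to a comma. The function name `chooseDelimiterSinglePass` and the NULL return for single-column input are assumptions, not behaviour from the PR.

```php
<?php

/**
 * Hypothetical sketch: one pass over the lines, one tally per delimiter.
 *
 * Returns NULL when no candidate ever produces more than one column,
 * so the caller can decide what the fallback should be.
 */
function chooseDelimiterSinglePass(array $sample_lines, array $delimiters = [',', ';', "\t"]): ?string {
  $totals = array_fill_keys($delimiters, 0);

  // Lines form the outer loop, so each line is visited exactly once.
  foreach ($sample_lines as $line) {
    foreach ($delimiters as $delimiter) {
      $totals[$delimiter] += count(str_getcsv($line, $delimiter, '"', '\\'));
    }
  }

  // A total equal to the line count means one column per line.
  $best = NULL;
  $best_total = count($sample_lines);
  foreach ($totals as $delimiter => $total) {
    if ($total > $best_total) {
      $best_total = $total;
      $best = $delimiter;
    }
  }
  return $best;
}
```

Each line is still parsed once per candidate, so the total work is similar. What changes is that the sample is traversed a single time and the single-column case becomes visible to the caller.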

paul-m requested changes Oct 5, 2023
