Update MysqlImport.php #4007
base: 2.x
@tridoxx interesting approach, and I think this has a lot of potential. Are you sure the highest number of columns is the best measure of which is the best delimiter? It seems like there are a lot of cases where this would not be true. Imagine a tab-separated file with only three columns, but one of them was a long text field where there were often several commas? I would recommend also checking to ensure that the number of columns per row is identical; if not, we have clearly not correctly identified the delimiter. Also, to merge this it would need to meet Drupal coding standards and contain tests for the new methods. Thanks!
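To make the suggested check concrete, here is a minimal sketch of a consistency-based detector. The function names `consistentColumnCount` and `detectConsistentDelimiter` are hypothetical and not part of this PR; the sketch assumes the sample lines have already been read into an array. In the sample data, splitting on commas averages 3.5 columns (6 and 1) against 3 for tabs, so a highest-average rule would pick the comma, while the consistency check rejects it.

```php
<?php

/**
 * Hypothetical helper: the column count shared by every row when split on
 * $delimiter, or null when the rows disagree (or there are no rows).
 */
function consistentColumnCount(array $lines, string $delimiter): ?int {
  $counts = [];
  foreach ($lines as $line) {
    // Use the column count as an array key to collect the distinct values.
    $counts[count(str_getcsv($line, $delimiter, '"', '\\'))] = true;
  }
  return count($counts) === 1 ? array_key_first($counts) : null;
}

/**
 * Hypothetical detector: pick the delimiter that yields the same column
 * count (greater than one) on every row. Returns null when none qualifies.
 */
function detectConsistentDelimiter(array $lines, array $delimiters = [',', ';', "\t"]): ?string {
  $best = null;
  $bestCount = 1;
  foreach ($delimiters as $delimiter) {
    $count = consistentColumnCount($lines, $delimiter);
    if ($count !== null && $count > $bestCount) {
      $best = $delimiter;
      $bestCount = $count;
    }
  }
  return $best;
}

// Tab-separated sample: three columns, with commas inside the text field.
$lines = [
  "1\tRed, green, blue, cyan, magenta, and yellow\tx",
  "2\tNo commas here\ty",
];
// detectConsistentDelimiter($lines) returns "\t".
```

Returning null rather than a default leaves it to the caller to decide what happens when no candidate splits the sample cleanly.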
$sample_lines = [];
$line_count = 0;

while (($line = fgets($handle)) !== false && $line_count < $max_lines) {
Perhaps use file() instead, so PHP can do the array-making instead of your code. Also, some CSV files are very very large and might break this by running out of memory.
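As one way to address the memory concern, here is a hedged sketch that reads a bounded sample without loading the whole file. It uses `SplFileObject`, which streams line by line, where `file()` would build an array of every line. The helper name `readSampleLines` is hypothetical and not part of this PR.

```php
<?php

/**
 * Hypothetical helper: return up to $maxLines non-empty lines from $path.
 * Only the lines actually visited are held in memory.
 */
function readSampleLines(string $path, int $maxLines = 10): array {
  // Throws RuntimeException if the file cannot be opened.
  $file = new SplFileObject($path, 'r');
  // Strip the trailing newline from each line as it is read.
  $file->setFlags(SplFileObject::DROP_NEW_LINE);

  $lines = [];
  foreach ($file as $line) {
    if (count($lines) >= $maxLines) {
      break;
    }
    // Skip blank lines, including the empty one after a final newline.
    if ($line !== '') {
      $lines[] = $line;
    }
  }
  return $lines;
}
```

The loop stops as soon as the sample is full, so the cost depends on `$maxLines` rather than on the size of the CSV.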
foreach ($delimiters as $delimiter) {
  $column_counts = [];

  foreach ($sample_lines as $sample_line) {
It's very likely that there are more lines than delimiters, so putting the lines loop inside the delimiters loop means if the delimiter isn't comma, you'll go through this process for all lines at least twice.
It's also possible that there's only one column with no actual delimiters... in which case we'll do it three times and end up with default of comma.
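A rough sketch of the alternative nesting, under the assumption that the sample lines are already in an array: each line is visited once and tallied against every candidate. The number of split operations stays the same; only the loop order changes. The sketch also returns null for a single-column file rather than falling back to comma. Both function names are hypothetical.

```php
<?php

/**
 * Hypothetical helper: visit each sample line once and add up, per candidate
 * delimiter, how many columns that line would split into.
 */
function tallyColumns(array $lines, array $delimiters = [',', ';', "\t"]): array {
  $totals = array_fill_keys($delimiters, 0);
  foreach ($lines as $line) {
    foreach ($delimiters as $delimiter) {
      $totals[$delimiter] += count(str_getcsv($line, $delimiter, '"', '\\'));
    }
  }
  return $totals;
}

/**
 * Hypothetical picker: the delimiter with the highest total, or null when no
 * candidate splits any line (every total equals the number of lines).
 */
function pickDelimiterOrNull(array $lines): ?string {
  $totals = tallyColumns($lines);
  // Sort by total, highest first, keeping the delimiter keys.
  arsort($totals);
  $best = array_key_first($totals);
  return $totals[$best] > count($lines) ? $best : null;
}
```

With a null result the caller can decide explicitly whether comma is an acceptable default for a one-column file.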
The detectDelimiter function in the code aims to automatically detect the delimiter used in a CSV file. This is useful because in CSV files, the delimiter (the character that separates fields) can vary depending on regional settings and language. For example, in the United States and other English-speaking countries, the most common delimiter is a comma (,), while in some European and Latin American countries, a semicolon (;) is used as the delimiter.
The detectDelimiter function works as follows:
It defines a set of possible delimiters to check: , (comma), ; (semicolon), and \t (tab).
It opens the CSV file and reads a limited number of lines (in this case, a maximum of 10 lines) for analysis.
For each possible delimiter, it computes the average number of columns in the analyzed lines by dividing the total number of columns across all lines by the number of lines.
The delimiter that produces the highest average number of columns is considered the most likely one and is returned as the detected delimiter.
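The steps above can be sketched roughly as follows. This is an illustration of the described heuristic, not the code in this PR: the function name `detectDelimiterByAverage`, its signature, and the use of `str_getcsv` for splitting are all assumptions.

```php
<?php

/**
 * Hypothetical sketch of the averaging heuristic: sample up to $maxLines
 * lines and return the candidate with the highest average column count.
 */
function detectDelimiterByAverage(string $path, int $maxLines = 10): string {
  $delimiters = [',', ';', "\t"];

  $handle = fopen($path, 'r');
  if ($handle === false) {
    throw new RuntimeException("Cannot open $path");
  }
  // Read at most $maxLines non-empty lines.
  $lines = [];
  while (count($lines) < $maxLines && ($line = fgets($handle)) !== false) {
    $line = rtrim($line, "\r\n");
    if ($line !== '') {
      $lines[] = $line;
    }
  }
  fclose($handle);

  // Comma is the fallback when the file is empty or all candidates tie.
  $best = ',';
  $bestAverage = 0.0;
  foreach ($delimiters as $delimiter) {
    $total = 0;
    foreach ($lines as $line) {
      $total += count(str_getcsv($line, $delimiter, '"', '\\'));
    }
    $average = $lines ? $total / count($lines) : 0.0;
    if ($average > $bestAverage) {
      $best = $delimiter;
      $bestAverage = $average;
    }
  }
  return $best;
}
```

Because the comparison is strict, ties go to the earlier candidate, so a file with no delimiters at all resolves to comma.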
In summary, this function helps automatically determine the correct delimiter for a given CSV file, which is especially useful when working with CSV files that may have been generated in different regional settings or languages, ensuring that the import process is accurate without requiring manual intervention to specify the delimiter.
However, to use a semicolon you need to edit docroot/core/lib/Drupal/Core/Database/Connection.php and change 'allow_delimiter_in_query' from false to true. With that change, the front end and the database are structured correctly.
fixes [org/repo/issue#]
QA Steps
To test, you can use the harvest process below; its data.json contains only 3 datasets.
drush dkan:harvest:register '{ "identifier": "50_datasets", "extract": { "type": "\\Harvest\\ETL\\Extract\\DataJson", "uri": "https://raw.githubusercontent.com/tridoxx/urlsdatosabiertos/main/medatapequeno.json" }, "transforms": [], "load": { "type": "\\Drupal\\harvest\\Load\\Dataset" } }'
drush dkan:harvest:run 50_datasets
drush queue:run datastore_import