Update MysqlImport.php #4007

Open
wants to merge 1 commit into
base: 2.x
Conversation

tridoxx commented Aug 29, 2023

The detectDelimiter function in the code aims to automatically detect the delimiter used in a CSV file. This is useful because in CSV files, the delimiter (the character that separates fields) can vary depending on regional settings and language. For example, in the United States and other English-speaking countries, the most common delimiter is a comma (,), while in some European and Latin American countries, a semicolon (;) is used as the delimiter.

The detectDelimiter function works as follows:

It defines a set of possible delimiters to check: , (comma), ; (semicolon), and \t (tab).

It opens the CSV file and reads a limited number of lines (in this case, a maximum of 10 lines) for analysis.

For each possible delimiter, it computes the average number of columns in the analyzed lines by dividing the total number of columns across all lines by the number of lines.

The delimiter that produces the highest average number of columns is considered the most likely delimiter and is selected as the detected delimiter.

In summary, this function helps automatically determine the correct delimiter for a given CSV file, which is especially useful when working with CSV files that may have been generated in different regional settings or languages, ensuring that the import process is accurate without requiring manual intervention to specify the delimiter.
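The steps above can be sketched roughly as follows. This is only an illustration of the described heuristic, not the code in this PR: the function name `detect_delimiter_sketch()`, its signature, and the comma fallback for an empty file are assumptions.

```php
<?php

/**
 * Illustrative sketch (hypothetical helper, not the PR's method).
 *
 * Picks the candidate delimiter that yields the highest average column
 * count over the first $max_lines lines of the file.
 */
function detect_delimiter_sketch(string $path, int $max_lines = 10): string {
  $delimiters = [',', ';', "\t"];

  $handle = fopen($path, 'r');
  if ($handle === FALSE) {
    throw new RuntimeException("Cannot open $path");
  }

  // Read at most $max_lines lines; the rest of the file is never touched.
  $sample_lines = [];
  while (count($sample_lines) < $max_lines && ($line = fgets($handle)) !== FALSE) {
    $sample_lines[] = rtrim($line, "\r\n");
  }
  fclose($handle);

  // Assumed fallback: comma when the file is empty or nothing scores higher.
  $best = ',';
  $best_average = 0.0;
  foreach ($delimiters as $delimiter) {
    $total = 0;
    foreach ($sample_lines as $sample_line) {
      $total += count(str_getcsv($sample_line, $delimiter, '"', '\\'));
    }
    $average = $sample_lines ? $total / count($sample_lines) : 0.0;
    // Strict comparison keeps the earlier candidate on a tie.
    if ($average > $best_average) {
      $best = $delimiter;
      $best_average = $average;
    }
  }
  return $best;
}
```

For a file whose lines look like `id;name;city`, the semicolon splits each line into three columns while comma and tab leave one, so the sketch returns `;`.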

However, to use a semicolon delimiter you need to edit docroot/core/lib/Drupal/Core/Database/Connection.php and change 'allow_delimiter_in_query' from false to true.

With that change, the front end and the database are structured correctly.

fixes [org/repo/issue#]

  • Test coverage exists
  • Documentation exists

QA Steps

To test, you can use this harvest process; its data.json has only 3 datasets.

  • [ x] Add manual QA steps in checklist format for a reviewer to perform to confirm that the feature or fix is working. Include as much details as possible so that the reviewer doesn't lose time figuring out how to perform steps.

dafeder (Member) commented Oct 4, 2023

@tridoxx interesting approach, and I think this has a lot of potential. Are you sure the highest number of columns is the best measure of which delimiter is correct? It seems like there are a lot of cases where this would not be true. Imagine a tab-separated file with only three columns, one of which is a long text field that often contains several commas. I would recommend also checking that the number of columns per row is identical; if it is not, we have clearly not identified the delimiter correctly.

Also, to merge this it would need to meet Drupal coding standards and contain tests for the new methods. Thanks!
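A minimal sketch of the consistency check suggested above, assuming the sample lines are already in an array. Both function names are hypothetical, and falling back to comma when no candidate qualifies is an assumption rather than anything decided in this thread.

```php
<?php

/**
 * Hypothetical helper: returns the column count shared by every sample line
 * for the given delimiter, or NULL when the rows disagree.
 */
function consistent_column_count(array $sample_lines, string $delimiter): ?int {
  $counts = [];
  foreach ($sample_lines as $line) {
    $counts[] = count(str_getcsv($line, $delimiter, '"', '\\'));
  }
  $unique = array_unique($counts);
  return count($unique) === 1 ? reset($unique) : NULL;
}

/**
 * Hypothetical helper: picks the delimiter that splits every row into the
 * same number of columns (more than one). Assumed fallback: comma.
 */
function detect_consistent_delimiter(array $sample_lines, array $delimiters = [',', ';', "\t"]): string {
  $best = ',';
  $best_columns = 1;
  foreach ($delimiters as $delimiter) {
    $columns = consistent_column_count($sample_lines, $delimiter);
    if ($columns !== NULL && $columns > $best_columns) {
      $best = $delimiter;
      $best_columns = $columns;
    }
  }
  return $best;
}

// Tab-separated sample whose free-text column contains many commas.
$lines = [
  "id\tname\tnotes",
  "1\tAna\tapples, pears, plums, figs, kiwis, limes",
  "2\tLuis\tred, green, blue, cyan, pink, gray",
];
$delimiter = detect_consistent_delimiter($lines);
```

In this sample, splitting on commas gives 1, 6, and 6 columns per row (an average above 3), so an average-based heuristic would pick the comma. The consistency check rejects it because the rows disagree, and the tab, which gives 3 columns on every row, is selected instead.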

$sample_lines = [];
$line_count = 0;

while (($line = fgets($handle)) !== false && $line_count < $max_lines) {
Contributor

Perhaps use file() instead, so PHP can do the array-making instead of your code. Also, some CSV files are very very large and might break this by running out of memory.
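One possible middle ground, sketched here as an assumption rather than as what the comment proposes: `file()` loads the entire file into an array, whereas wrapping an `SplFileObject` in a `LimitIterator` lets PHP do the iteration while only the first few lines are read. The helper name `read_sample_lines()` is hypothetical.

```php
<?php

/**
 * Hypothetical helper: returns up to $max_lines non-empty lines from the
 * start of a file without reading the rest of it.
 */
function read_sample_lines(string $path, int $max_lines = 10): array {
  $file = new SplFileObject($path, 'r');
  // Strip line endings and skip blank lines (SKIP_EMPTY needs READ_AHEAD).
  $file->setFlags(
    SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY
  );

  $lines = [];
  foreach (new LimitIterator($file, 0, $max_lines) as $line) {
    $lines[] = $line;
  }
  return $lines;
}
```

Memory use is bounded by the sample size rather than the file size, which matters for the very large CSV files mentioned above.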

foreach ($delimiters as $delimiter) {
  $column_counts = [];

  foreach ($sample_lines as $sample_line) {
Contributor

It's very likely that there are more lines than delimiters, so putting the lines loop inside the delimiters loop means that if the delimiter isn't a comma, you'll go through this process for all lines at least twice.

It's also possible that there's only one column with no actual delimiters, in which case we'll do it three times and end up with the default of comma.
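A rough sketch of the inverted loop, with each sample line visited once and every candidate tallied as it goes. The helper name `tally_delimiters()` and the choice to return NULL for a single-column file are assumptions for illustration; the total work is still lines times delimiters, but the lines are walked only once.

```php
<?php

/**
 * Hypothetical helper: one pass over the sample lines, tallying the extra
 * columns each candidate delimiter produces. Returns NULL when no candidate
 * ever splits a line, so the caller can tell "single column" from "comma".
 */
function tally_delimiters(array $sample_lines, array $delimiters = [',', ';', "\t"]): ?string {
  $totals = array_fill_keys($delimiters, 0);

  foreach ($sample_lines as $line) {
    foreach ($delimiters as $delimiter) {
      // Columns beyond the first that this delimiter yields on this line.
      $totals[$delimiter] += count(str_getcsv($line, $delimiter, '"', '\\')) - 1;
    }
  }

  // Highest tally first; keys (the delimiters) are preserved.
  arsort($totals);
  $best = array_key_first($totals);
  return $totals[$best] > 0 ? $best : NULL;
}
```

Returning NULL for a file with no delimiters makes the single-column case explicit instead of silently reporting a comma.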

3 participants