
Relational Databases (MySQL, PostgreSQL) Integration #195

Open · wants to merge 58 commits into main

Conversation

@dsaad68 (Contributor) commented Apr 5, 2024

cuallee

  • relational databases (MySQL, PostgreSQL)
  • unit test
  • docs
  • pyspark
  • snowpark
  • pandas
  • duckdb
  • polars
  • other

Relational Databases (MySQL, PostgreSQL) Integration

Tables in relational databases can now be checked and validated by using Polars with the ConnectorX engine to read from the database. ConnectorX, written in Rust, has native support for Apache Arrow, enabling it to transfer data directly into a Polars DataFrame without copying it (zero-copy). More information about the ConnectorX engine can be found here.

This feature allows for the future integration of other DBMS, such as Redshift (via the PostgreSQL protocol) and ClickHouse (via the MySQL protocol), into cuallee.

✨ Feature Enhancements:

  • Added support for validating tables in PostgreSQL and MySQL, enabling more robust data integrity checks.
  • Added necessary dependencies to pyproject.toml.

🧪 Testing:

  • Implemented unit tests for all Compute methods to ensure their reliability and correctness. (Unit tests for methods that are not yet implemented are included but marked as skipped.)
  • Added PyTest fixtures for automated unit tests for PostgreSQL and MySQL.
  • Added test cases as init-db-psql.sql and init-db-mysql.sql.
  • Added a unit test in test_validate.py to validate the type of the output DataFrame, ensuring the result is of type polars_dataframe.
  • Added a unit test in test_validate.py to validate the error when a column is not found in the table.

📖 Documentation:

  • Updated the README.md
  • Added a how-to for setting up the test environment.

⛑️ Known Issues:

  • It's important to note that most of PostgreSQL's Compute methods are inherited from DuckDB's Compute methods, so any change to DuckDB's Compute methods will affect PostgreSQL's.

  • The has_std check could fail due to floating-point precision. The approach suggested below could resolve it, but it requires introducing a precision-error parameter:

    # PostgreSQL
    f"ABS(CAST(STDDEV_SAMP({rule.column}) AS FLOAT) - CAST({rule.value} AS FLOAT)) < {precision_error}"
    
    # MySQL
    f"""IF(
          ABS(CAST(STDDEV_SAMP({rule.column}) AS FLOAT) - CAST({rule.value} AS FLOAT)) < {precision_error},
          'True', 'False')
    """
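The brittleness can be reproduced outside SQL. The snippet below is a minimal Python illustration, where `precision_error` is the hypothetical tolerance parameter mentioned above:

```python
# Illustration of the floating-point issue behind has_std: a user-supplied
# (rounded) expected value will not equal STDDEV_SAMP's result bit-for-bit.
import statistics

values = [10.0, 10.1, 9.9, 10.2, 9.8]
observed = statistics.stdev(values)   # what STDDEV_SAMP would compute
expected = 0.158114                   # rounded value asserted by the user

exact_match = observed == expected    # brittle: fails on rounding alone
precision_error = 1e-6                # hypothetical tolerance parameter
tolerant_match = abs(observed - expected) < precision_error
```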

🦺 Limitations:

PostgreSQL

Not all checks are currently available for PostgreSQL. Unavailable checks:

  • is_daily
  • has_entropy
  • has_workflow

MySQL

Not all checks are currently available for MySQL. Unavailable checks:

  • is_daily
  • has_entropy
  • has_workflow
  • has_percentile
  • has_correlation
  • is_inside_interquartile_range

✈️ Future Improvements:

  1. Improve the Compute methods to support complex queries:
    Because the Compute methods are combined into a single unified query, checks that require multiple statements or subqueries cannot currently be expressed. This part needs to be improved to support complex queries.

    # In `psql_validation.py` and `mysql_validation.py`
    import operator
    
    unified_columns = ",\n\t".join([
            # Equivalent to Compute(check.table_name).<rule.method>(rule)
            operator.methodcaller(rule.method, rule)(Compute(check.table_name))
            + f' AS "{rule.key}"'
            for rule in check.rules ])
    
    unified_query = f"""
    SELECT
    \t{unified_columns}
    FROM
    \t{check.table_name}
    """
  2. Inheritance enhancement:
    A significant portion of PostgreSQL's Compute functionality is derived from DuckDB's Compute methods.
    Therefore, any modification to DuckDB's Compute methods directly impacts PostgreSQL.
    Consolidating these shared methods into a separate base class would make them easier to manage.

@canimus (Owner) commented Apr 13, 2024

@dsaad68 thanks for this great addition. I will conduct a review and proceed with the integration.

@dsaad68 (Contributor, Author) commented Apr 30, 2024

If there are any issues with this suggestion, please let me know.

@canimus (Owner) commented May 11, 2024

> If there are any issues with this suggestion, please let me know.

Hi @dsaad68, there are no issues with the submission. We are in an ongoing review for the Journal of Open Source Software, and the reviewers have indicated that some documentation is missing. Before adding more functionality, we would like to pass the JOSS review with the consolidated docs, at least for the core classes and modules. I think once we pass this hurdle, we will add the new functionality. I hope that explains the delay. Thank you in advance.

@dsaad68 (Contributor, Author) commented May 11, 2024

@canimus I totally understand, good luck. Let me know if I can help.
