Automatically Infer Data Schema for Inserted Data #2068

Closed
jieguangzhou opened this issue May 17, 2024 · 1 comment
jieguangzhou commented May 17, 2024

When inserting data that contains special types (for example, a PIL image or a NumPy array), the schema should be inferred automatically.

Steps:

  1. Inspect the data to infer the schema.
  2. Check if the table and schema are already created; if not, automatically create the table.
  3. Insert the data.
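
The three steps above can be sketched in plain Python. This is only an illustration, not the superduperdb implementation: `ToyDB`, `infer_schema`, and `PRIMITIVES` are hypothetical names, an in-memory dict stands in for the real database, and the standard-library `array` type stands in for a "special" value.

```python
from array import array

# Hypothetical: values of these types need no special encoding.
PRIMITIVES = (str, int, float, bool, type(None))


def infer_schema(record):
    """Step 1: map each non-primitive field to the name of its Python type."""
    return {
        key: type(value).__name__
        for key, value in record.items()
        if not isinstance(value, PRIMITIVES)
    }


class ToyDB:
    """In-memory stand-in for a database; not the superduperdb API."""

    def __init__(self):
        self.schemas = {}  # table name -> inferred schema
        self.rows = {}     # table name -> list of records

    def insert(self, table, records):
        if not records:
            return
        # Step 2: create the table from the inferred schema if it is missing.
        if table not in self.schemas:
            self.schemas[table] = infer_schema(records[0])
            self.rows[table] = []
        # Step 3: insert the data.
        self.rows[table].extend(records)


db = ToyDB()
db.insert("documents", [{"name": "a", "array": array("i", [1, 2, 3])}])
print(db.schemas["documents"])
```

Inferring from the first record only is a simplification made for this sketch; a real implementation would also have to decide what to do when later records disagree with the stored schema.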

Therefore, users no longer need to deal with schemas at all, unless they want to define a custom schema.
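
One way to keep custom schemas possible alongside inference is to let explicitly declared fields take priority over inferred ones. The following is a minimal sketch under that assumption; `infer_fields`, `resolve_schema`, and the string `"my_custom_encoder"` are hypothetical placeholders and not part of superduperdb.

```python
def infer_fields(record):
    """Hypothetical inference: label non-primitive fields by their type name."""
    primitives = (str, int, float, bool, type(None))
    return {
        key: type(value).__name__
        for key, value in record.items()
        if not isinstance(value, primitives)
    }


def resolve_schema(record, custom_fields=None):
    """Merge inferred fields with user overrides; overrides win on conflict."""
    schema = infer_fields(record)
    schema.update(custom_fields or {})
    return schema


record = {"title": "doc", "payload": b"\x00\x01", "tags": ["a", "b"]}

# No customization: every special field is inferred.
inferred = resolve_schema(record)
# The user customizes only the field they care about.
customized = resolve_schema(record, {"payload": "my_custom_encoder"})

print(inferred)
print(customized)
```

With this design, a user who never passes `custom_fields` does not see the schema at all, while a user with special requirements overrides only the relevant fields.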

Here is the current user experience:

@pytest.mark.parametrize(
    "db", [DBConfig.mongodb_empty, DBConfig.sqldb_empty], indirect=True
)
def test_insert_with_schema(db):
    import numpy as np
    import PIL.Image

    from superduperdb.ext.numpy.encoder import NumpyDataTypeFactory
    from superduperdb.ext.pillow.encoder import pil_image

    data = {
        'img': PIL.Image.open('test/material/data/test.png'),
        'array': np.array([1, 2, 3]),
    }

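    # The schema must be declared by hand: one encoder per special field.
    schema = Schema(
        'schema',
        fields={
            'img': pil_image,
            'array': NumpyDataTypeFactory.create(data['array']),
        },
    )

    # The table must be registered with that schema before any insert.
    table = Table('documents', schema=schema)
    db.add(table)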

    table_or_collection = db['documents']
    datas = [Document(data)]

    table_or_collection.insert(datas).execute()

    datas_from_db = list(table_or_collection.select().execute())
    for d, d_db in zip(datas, datas_from_db):
        assert d['img'].size == d_db['img'].size
        assert np.all(d['array'] == d_db['array'])

Here is the optimized user experience:

@pytest.mark.parametrize(
    "db", [DBConfig.mongodb_empty, DBConfig.sqldb_empty], indirect=True
)
def test_insert_with_schema(db):
    import numpy as np
    import PIL.Image

    data = {
        'img': PIL.Image.open('test/material/data/test.png'),
        'array': np.array([1, 2, 3]),
    }

    # No Schema or Table setup: the schema is inferred when the data is inserted.
    table_or_collection = db['documents']
    datas = [Document(data)]

    table_or_collection.insert(datas).execute()

    datas_from_db = list(table_or_collection.select().execute())
    for d, d_db in zip(datas, datas_from_db):
        assert d['img'].size == d_db['img'].size
        assert np.all(d['array'] == d_db['array'])

jieguangzhou commented

Closing; this is the same issue as #2068.
