Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Azure Table Storage integration #13335

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

Falven
Copy link
Contributor

@Falven Falven commented May 7, 2024

Description

This PR introduces four new classes, AzureKVStore, AzureDocumentStore, AzureIndexStore and AzureChatStore that integrate with Azure Table Storage via the azure-data-table SDK to provide key-value, document, index, and chat history storage functionality, respectively. These changes will work for any of Azure's NoSQL data storage offerings supported by azure-data-table (CosmosDB, Azure Table...)

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Ran various configurations from the sample notebook testing both the AzureDocumentStore and AzureIndexStore and therefore the AzureKVStore. Added documents to the AzureDocumentStore and verified they were uploaded to an Azure Table Storage. Additionally I added various indexes: summary, vector, and keyword, verified they are reflected in Azure Table Storage. Finally, I tested persisting and querying each index. Also tested the AzureChatStore in it's own separate Notebook.

  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods
pre-commit install
pre-commit installed at .git/hooks/pre-commit
git ls-files | xargs pre-commit run black --files
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Initializing environment for https://github.com/psf/black-pre-commit-mirror.
[INFO] Initializing environment for https://github.com/psf/black-pre-commit-mirror:.[jupyter].
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-mypy.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-mypy:types-requests,types-Deprecated,types-redis,types-setuptools,types-PyYAML,types-protobuf==4.24.0.4.
[INFO] Initializing environment for https://github.com/adamchainz/blacken-docs.
[INFO] Initializing environment for https://github.com/adamchainz/blacken-docs:black==23.10.1.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier:prettier@3.0.3.
[INFO] Initializing environment for https://github.com/codespell-project/codespell.
[INFO] Initializing environment for https://github.com/codespell-project/codespell:tomli.
[INFO] Initializing environment for https://github.com/srstevenson/nb-clean.
[INFO] Initializing environment for https://github.com/pappasam/toml-sort.
[INFO] Installing environment for https://github.com/psf/black-pre-commit-mirror.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/adamchainz/blacken-docs.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
black-src................................................................Passed
black-docs-py........................................(no files to check)Skipped
black-docs-text..........................................................Passed
pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files
pre-commit installed at .git/hooks/pre-commit
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-mypy.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/codespell-project/codespell.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/srstevenson/nb-clean.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pappasam/toml-sort.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check BOM - deprecated: use fix-byte-order-marker........................Passed
check for merge conflicts................................................Passed
check for broken symlinks............................(no files to check)Skipped
check toml...............................................................Passed
check yaml...........................................(no files to check)Skipped
detect private key.......................................................Passed
fix end of files.........................................................Passed
mixed line ending........................................................Passed
trim trailing whitespace.................................................Passed
ruff.....................................................................Passed
black-src................................................................Passed
mypy.....................................................................Passed
black-docs-py........................................(no files to check)Skipped
black-docs-text..........................................................Passed
prettier.................................................................Passed
codespell................................................................Passed
nb-clean.............................................(no files to check)Skipped
toml-sort-fix............................................................Passed

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label May 7, 2024
@Falven Falven changed the title Feature/azure_table_storage feat: Azure Table Storage May 7, 2024
@Falven Falven changed the title feat: Azure Table Storage feat: Azure Table Storage integration May 7, 2024
@Falven
Copy link
Contributor Author

Falven commented May 7, 2024

Working on a Notebook use case.

Copy link
Collaborator

@logan-markewich logan-markewich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh this looks pretty good!

If you add some example (either to the readme or a notebook, or both), I can merge 👍🏻 (and also assuming cicd passes lol)

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 8, 2024
@Falven
Copy link
Contributor Author

Falven commented May 8, 2024

tbh this looks pretty good!

If you add some example (either to the readme or a notebook, or both), I can merge 👍🏻 (and also assuming cicd passes lol)

There's one other core issue I am trying to solve - MongoDB supports native BSON serialization/deserialization of complex values. Azure Table Storage does not, so I need to add this. Don't merge this one yet. 😅

@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels May 8, 2024
@logan-markewich
Copy link
Collaborator

Ah good catch @Falven ! Let me know when this is closer to done and I'll come back for a final check/linting and merge 👍🏻

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Falven Falven force-pushed the feature/azure_table_storage branch 10 times, most recently from 855652c to 33b2d2a Compare May 9, 2024 17:36
@Falven
Copy link
Contributor Author

Falven commented May 9, 2024

Ah good catch @Falven ! Let me know when this is closer to done and I'll come back for a final check/linting and merge 👍🏻

@logan-markewich all done, including a sample Notebook. There was a little more complexity because of Property Limitations. Properties that exceed the limit needed to be split. It is working well in my testing.

FYI, an observation, docstore/ref_doc_info puts all node_ids into a single property value. This will become huge even for a small number of documents. We should consider devising an alternative strategy to store these values rather than all in one property.

@logan-markewich
Copy link
Collaborator

@Falven isn't ref_doc_info stored per parent document ID? If a single document has many nodes, yes it will he big, but it's not every node in the index

(I might be misremembering how that works though)

@Falven
Copy link
Contributor Author

Falven commented May 9, 2024

@Falven isn't ref_doc_info stored per parent document ID? If a single document has many nodes, yes it will he big, but it's not every node in the index

(I might be misremembering how that works though)

@logan-markewich right now we're storing one Document/record in docstore/ref_doc_info with metadata and node_ids properties. This node_ids property is a giant array of GUIDs. Because they're in one giant array, they are convenient to retrieve but will quickly overfill property size limitations for Azure Storage (1 property can have as much as 64KiB data) and other DBs.

How I am solving this right now is by serializing the data into multiple properties, node_ids_part_1 -> node_ids_part_n, where each property has 64KiB of serialized parts of the GUID array which are concatenated and serialized back into a single GUID array on retrieval.

However with this approach you will still quickly saturate the overall record size requirements of 1MiB for Azure Storage, 2MB for CosmosDB and 16MB for MongoDB.

A better approach would be to redesign this document type to split the referenced nodes across more than 1 record. We can consider this in a future modification.

Here are some screenshots of the storage:
CleanShot 2024-05-10 at 14 58 59@2x
CleanShot 2024-05-10 at 14 59 13@2x
CleanShot 2024-05-10 at 14 59 26@2x
CleanShot 2024-05-10 at 14 59 43@2x

@Falven
Copy link
Contributor Author

Falven commented May 10, 2024

Added AzureChatStore to store chat history remotely in Azure Table Storage or CosmosDB.
Includes accompanying documentation and notebook.

CleanShot 2024-05-10 at 15 09 51@2x
CleanShot 2024-05-10 at 15 10 02@2x

One caveat is that it has some shared code with AzureKVStore. Not sure where to extract this common code to since these are separate packages/integrations.

@logan-markewich
Copy link
Collaborator

logan-markewich commented May 10, 2024

@Falven Wouldn't It have to be a HUGE document to take up more than 16MB of UUIDs? Like... hundreds of thousands of IDs, which would mean one document had that many nodes?

On the duplicated code side -- I'm kind of fine with it for now. We could consider a utils package folder for common shared utils. I know other packages like huggingface would have a similar need

Something like
from llama_index.utils.azure import ...
from llama_index.utils.huggingface import ...

With each of those being a namespaced package

@Falven
Copy link
Contributor Author

Falven commented May 10, 2024

@Falven Wouldn't It have to be a HUGE document to take up more than 16MB of UUIDs? Like... hundreds of thousands of IDs, which would mean one document had that many nodes?

On the duplicated code side -- I'm kind of fine with it for now. We could consider a utils package folder for common shared utils. I know other packages like huggingface would have a similar need

Something like from llama_index.utils.azure import ... from llama_index.utils.huggingface import ...

With each of those being a namespaced package

Yes, for MongoDB this is not as relevant due to the larger limit. But in the case of Azure Table Storage there is a 64 KiB max record property size, this means that around 1800 36-character strings (UUIDS) max would fit in a single property.
I think it's the KeywordIndex that generates a ton of UUIDs/node_ids (10648 to be exact which is 851842 bytes ~831KiB and would need to be split into 14 properties node_id_part_1 -> node_id_part_14 for the Paul Graham essay which is not a big file). Would this likely be due to it having a UUID/node_id per keyword or something? The point is that some index types will quickly exhaust this property limit and splitting into parts allows the record to remain below the 1MiB limit, but may not be the most elegant solution here.

CleanShot 2024-05-10 at 19 51 55@2x

@Falven Falven marked this pull request as draft May 12, 2024 03:39
@Falven Falven force-pushed the feature/azure_table_storage branch from 3bdae70 to 9191946 Compare May 20, 2024 05:49
@Falven
Copy link
Contributor Author

Falven commented May 20, 2024

@logan-markewich All done. Included the namespaced shared package. Some tests unrelated to this PR seem to be failing, not sure how we deal with that so we can merge.

Copy link
Collaborator

@logan-markewich logan-markewich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort on this @Falven ! A ton of features in this PR, and I think its good to go

@logan-markewich
Copy link
Collaborator

ugh one GitHub issue after another 😢 Sorry, trying to figure out the CICD

@logan-markewich logan-markewich enabled auto-merge (squash) May 22, 2024 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants