feat: Azure Table Storage integration #13335
base: main
Conversation
Working on a Notebook use case.
llama-index-integrations/storage/docstore/llama-index-storage-docstore-azure/README.md
tbh this looks pretty good!
If you add some examples (either to the readme or a notebook, or both), I can merge 👍🏻 (and also assuming CI/CD passes lol)
There's one other core issue I am trying to solve: MongoDB supports native BSON serialization/deserialization of complex values. Azure Table Storage does not, so I need to add this. Don't merge this one yet. 😅
Ah, good catch @Falven! Let me know when this is closer to done and I'll come back for a final check/linting and merge 👍🏻
Force-pushed from 855652c to 33b2d2a
@logan-markewich all done, including a sample Notebook. There was a little more complexity because of property limitations: properties that exceed the limit needed to be split. It is working well in my testing. FYI, an observation:
@Falven isn't ref_doc_info stored per parent document ID? If a single document has many nodes, yes, it will be big, but it's not every node in the index (I might be misremembering how that works though)
@logan-markewich right now we're storing one Document/record. How I am solving this right now is by serializing the data into multiple properties. However, with this approach you will still quickly hit the overall record size limits: 1 MiB for Azure Table Storage, 2 MB for CosmosDB, and 16 MB for MongoDB. A better approach would be to redesign this document type to split the referenced nodes across more than one record. We can consider this in a future modification.
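The multi-property splitting described here can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the function names, the `data_` property prefix, and treating the 64 KiB string-property limit as raw bytes are all assumptions.

```python
import json

PROPERTY_LIMIT = 64 * 1024  # assumed max size of one Table Storage property

def split_value(value: dict) -> dict:
    """Serialize a value and shard it across numbered properties."""
    blob = json.dumps(value)
    return {
        f"data_{i}": blob[start:start + PROPERTY_LIMIT]
        for i, start in enumerate(range(0, len(blob), PROPERTY_LIMIT))
    }

def join_value(entity: dict) -> dict:
    """Reassemble a value from its numbered shard properties."""
    parts = sorted(
        (k for k in entity if k.startswith("data_")),
        key=lambda k: int(k.split("_")[1]),  # numeric sort, so data_10 > data_9
    )
    return json.loads("".join(entity[k] for k in parts))
```

As the comment notes, this only works around the per-property cap; the whole entity is still bounded by the per-record limit, which is why splitting across multiple records is floated as the longer-term fix.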
@Falven Wouldn't it have to be a HUGE document to take up more than 16MB of UUIDs? Like... hundreds of thousands of IDs, which would mean one document had that many nodes? On the duplicated code side, I'm kind of fine with it for now. We could consider a shared package, with each of those being a namespaced package.
Yes, for MongoDB this is less relevant due to the larger limit. But Azure Table Storage has a 64 KiB max record property size, which means at most around 1,800 36-character strings (UUIDs) would fit in a single property.
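The back-of-the-envelope arithmetic behind that figure, treating the 64 KiB property limit as raw bytes:

```python
# Azure Table Storage caps a single property at 64 KiB; a canonical UUID
# string ("8-4-4-4-12" format) is 36 characters.
PROPERTY_LIMIT_BYTES = 64 * 1024   # 65,536
UUID_TEXT_LEN = 36
print(PROPERTY_LIMIT_BYTES // UUID_TEXT_LEN)  # → 1820, i.e. "around 1800"
```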
…complex value serialization/deserialization.
…rresponding deserialization logic.
…able Storage or CosmosDB.
…mitations for RefDocInfo.
Force-pushed from 3bdae70 to 9191946
@logan-markewich All done. Included the namespaced shared package. Some tests unrelated to this PR seem to be failing; I'm not sure how we deal with that so we can merge.
Thanks for the effort on this @Falven! A ton of features in this PR, and I think it's good to go
ugh, one GitHub issue after another 😢 Sorry, trying to figure out the CI/CD
Description

This PR introduces four new classes, `AzureKVStore`, `AzureDocumentStore`, `AzureIndexStore`, and `AzureChatStore`, that integrate with Azure Table Storage via the `azure-data-tables` SDK to provide key-value, document, index, and chat history storage functionality, respectively. These changes work with any of Azure's NoSQL storage offerings supported by `azure-data-tables` (CosmosDB, Azure Table...).

Version Bump?

Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package.)

Type of Change

Please delete options that are not relevant.
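As background for how a key-value layer over Azure Table Storage works: table entities are flat dictionaries addressed by `PartitionKey` and `RowKey`, so complex values must be serialized into string properties. A minimal hypothetical sketch (the function name and property layout are assumptions, not the PR's actual API):

```python
import json

def to_entity(namespace: str, key: str, value: dict) -> dict:
    # Table entities are flat dicts addressed by PartitionKey + RowKey;
    # nested values must be serialized into a string property.
    return {
        "PartitionKey": namespace,
        "RowKey": key,
        "data": json.dumps(value),
    }

entity = to_entity("docstore", "node-123", {"text": "hello"})
```

A real write would then go through the `azure-data-tables` SDK, e.g. `TableClient.upsert_entity(entity)`.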
How Has This Been Tested?

Ran various configurations from the sample notebook, testing both the `AzureDocumentStore` and `AzureIndexStore`, and therefore the `AzureKVStore`. Added documents to the `AzureDocumentStore` and verified they were uploaded to Azure Table Storage. Additionally, I added various indexes (summary, vector, and keyword) and verified they are reflected in Azure Table Storage. Finally, I tested persisting and querying each index. Also tested the `AzureChatStore` in its own separate Notebook.

Suggested Checklist:
Ran `make format; make lint` to appease the lint gods