feat: Azure Table Storage integration #13335

Falven · 2024-05-07T17:53:30Z

Description

This PR introduces four new classes, AzureKVStore, AzureDocumentStore, AzureIndexStore and AzureChatStore that integrate with Azure Table Storage via the azure-data-table SDK to provide key-value, document, index, and chat history storage functionality, respectively. These changes will work for any of Azure's NoSQL data storage offerings supported by azure-data-table (CosmosDB, Azure Table...)

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Yes
No

Type of Change

Please delete options that are not relevant.

New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Ran various configurations from the sample notebook testing both the AzureDocumentStore and AzureIndexStore and therefore the AzureKVStore. Added documents to the AzureDocumentStore and verified they were uploaded to an Azure Table Storage. Additionally I added various indexes: summary, vector, and keyword, verified they are reflected in Azure Table Storage. Finally, I tested persisting and querying each index. Also tested the AzureChatStore in it's own separate Notebook.

Added new notebook (that tests end-to-end)
I stared at the code and made sure it makes sense

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

pre-commit install
pre-commit installed at .git/hooks/pre-commit
git ls-files | xargs pre-commit run black --files
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Initializing environment for https://github.com/psf/black-pre-commit-mirror.
[INFO] Initializing environment for https://github.com/psf/black-pre-commit-mirror:.[jupyter].
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-mypy.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-mypy:types-requests,types-Deprecated,types-redis,types-setuptools,types-PyYAML,types-protobuf==4.24.0.4.
[INFO] Initializing environment for https://github.com/adamchainz/blacken-docs.
[INFO] Initializing environment for https://github.com/adamchainz/blacken-docs:black==23.10.1.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Initializing environment for https://github.com/pre-commit/mirrors-prettier:prettier@3.0.3.
[INFO] Initializing environment for https://github.com/codespell-project/codespell.
[INFO] Initializing environment for https://github.com/codespell-project/codespell:tomli.
[INFO] Initializing environment for https://github.com/srstevenson/nb-clean.
[INFO] Initializing environment for https://github.com/pappasam/toml-sort.
[INFO] Installing environment for https://github.com/psf/black-pre-commit-mirror.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/adamchainz/blacken-docs.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
black-src................................................................Passed
black-docs-py........................................(no files to check)Skipped
black-docs-text..........................................................Passed
pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files
pre-commit installed at .git/hooks/pre-commit
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/charliermarsh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-mypy.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pre-commit/mirrors-prettier.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/codespell-project/codespell.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/srstevenson/nb-clean.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/pappasam/toml-sort.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check BOM - deprecated: use fix-byte-order-marker........................Passed
check for merge conflicts................................................Passed
check for broken symlinks............................(no files to check)Skipped
check toml...............................................................Passed
check yaml...........................................(no files to check)Skipped
detect private key.......................................................Passed
fix end of files.........................................................Passed
mixed line ending........................................................Passed
trim trailing whitespace.................................................Passed
ruff.....................................................................Passed
black-src................................................................Passed
mypy.....................................................................Passed
black-docs-py........................................(no files to check)Skipped
black-docs-text..........................................................Passed
prettier.................................................................Passed
codespell................................................................Passed
nb-clean.............................................(no files to check)Skipped
toml-sort-fix............................................................Passed

Falven · 2024-05-07T20:20:12Z

Working on a Notebook use case.

llama-index-integrations/storage/docstore/llama-index-storage-docstore-azure/README.md

logan-markewich

tbh this looks pretty good!

If you add some example (either to the readme or a notebook, or both), I can merge 👍🏻 (and also assuming cicd passes lol)

Falven · 2024-05-08T16:15:27Z

tbh this looks pretty good!

If you add some example (either to the readme or a notebook, or both), I can merge 👍🏻 (and also assuming cicd passes lol)

There's one other core issue I am trying to solve - MongoDB supports native BSON serialization/deserialization of complex values. Azure Table Storage does not, so I need to add this. Don't merge this one yet. 😅

logan-markewich · 2024-05-09T03:43:15Z

Ah good catch @Falven ! Let me know when this is closer to done and I'll come back for a final check/linting and merge 👍🏻

review-notebook-app · 2024-05-09T04:01:43Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Falven · 2024-05-09T18:35:35Z

Ah good catch @Falven ! Let me know when this is closer to done and I'll come back for a final check/linting and merge 👍🏻

@logan-markewich all done, including a sample Notebook. There was a little more complexity because of Property Limitations. Properties that exceed the limit needed to be split. It is working well in my testing.

FYI, an observation, docstore/ref_doc_info puts all node_ids into a single property value. This will become huge even for a small number of documents. We should consider devising an alternative strategy to store these values rather than all in one property.

logan-markewich · 2024-05-09T20:43:26Z

@Falven isn't ref_doc_info stored per parent document ID? If a single document has many nodes, yes it will he big, but it's not every node in the index

(I might be misremembering how that works though)

Falven · 2024-05-09T21:06:32Z

@Falven isn't ref_doc_info stored per parent document ID? If a single document has many nodes, yes it will he big, but it's not every node in the index

(I might be misremembering how that works though)

@logan-markewich right now we're storing one Document/record in docstore/ref_doc_info with metadata and node_ids properties. This node_ids property is a giant array of GUIDs. Because they're in one giant array, they are convenient to retrieve but will quickly overfill property size limitations for Azure Storage (1 property can have as much as 64KiB data) and other DBs.

How I am solving this right now is by serializing the data into multiple properties, node_ids_part_1 -> node_ids_part_n, where each property has 64KiB of serialized parts of the GUID array which are concatenated and serialized back into a single GUID array on retrieval.

However with this approach you will still quickly saturate the overall record size requirements of 1MiB for Azure Storage, 2MB for CosmosDB and 16MB for MongoDB.

A better approach would be to redesign this document type to split the referenced nodes across more than 1 record. We can consider this in a future modification.

Here are some screenshots of the storage:

Falven · 2024-05-10T19:10:29Z

Added AzureChatStore to store chat history remotely in Azure Table Storage or CosmosDB.
Includes accompanying documentation and notebook.

One caveat is that it has some shared code with AzureKVStore. Not sure where to extract this common code to since these are separate packages/integrations.

logan-markewich · 2024-05-10T20:52:54Z

@Falven Wouldn't It have to be a HUGE document to take up more than 16MB of UUIDs? Like... hundreds of thousands of IDs, which would mean one document had that many nodes?

On the duplicated code side -- I'm kind of fine with it for now. We could consider a utils package folder for common shared utils. I know other packages like huggingface would have a similar need

Something like
from llama_index.utils.azure import ...
from llama_index.utils.huggingface import ...

With each of those being a namespaced package

Falven · 2024-05-10T23:42:23Z

@Falven Wouldn't It have to be a HUGE document to take up more than 16MB of UUIDs? Like... hundreds of thousands of IDs, which would mean one document had that many nodes?

On the duplicated code side -- I'm kind of fine with it for now. We could consider a utils package folder for common shared utils. I know other packages like huggingface would have a similar need

Something like from llama_index.utils.azure import ... from llama_index.utils.huggingface import ...

With each of those being a namespaced package

Yes, for MongoDB this is not as relevant due to the larger limit. But in the case of Azure Table Storage there is a 64 KiB max record property size, this means that around 1800 36-character strings (UUIDS) max would fit in a single property.
I think it's the KeywordIndex that generates a ton of UUIDs/node_ids (10648 to be exact which is 851842 bytes ~831KiB and would need to be split into 14 properties node_id_part_1 -> node_id_part_14 for the Paul Graham essay which is not a big file). Would this likely be due to it having a UUID/node_id per keyword or something? The point is that some index types will quickly exhaust this property limit and splitting into parts allows the record to remain below the 1MiB limit, but may not be the most elegant solution here.

…complex value serialization/deserialization.

…rresponding deserialization logic.

…able Storage or CosmosDB.

…mitations for RefDocInfo.

Falven · 2024-05-20T18:47:07Z

@logan-markewich All done. Included the namespaced shared package. Some tests unrelated to this PR seem to be failing, not sure how we deal with that so we can merge.

logan-markewich

Thanks for the effort on this @Falven ! A ton of features in this PR, and I think its good to go

logan-markewich · 2024-05-22T00:56:02Z

ugh one GitHub issue after another 😢 Sorry, trying to figure out the CICD

dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label May 7, 2024

Falven changed the title ~~Feature/azure_table_storage~~ feat: Azure Table Storage May 7, 2024

Falven changed the title ~~feat: Azure Table Storage~~ feat: Azure Table Storage integration May 7, 2024

logan-markewich reviewed May 8, 2024

View reviewed changes

llama-index-integrations/storage/docstore/llama-index-storage-docstore-azure/README.md Show resolved Hide resolved

logan-markewich approved these changes May 8, 2024

View reviewed changes

dosubot bot added the lgtm This PR has been approved by a maintainer label May 8, 2024

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels May 8, 2024

Falven force-pushed the feature/azure_table_storage branch 10 times, most recently from 855652c to 33b2d2a Compare May 9, 2024 17:36

Falven requested a review from logan-markewich May 10, 2024 19:42

Falven marked this pull request as draft May 12, 2024 03:39

Falven added 17 commits May 20, 2024 01:48

Add configurable partition_key, collection/table name sanitation and …

2aabbda

…complex value serialization/deserialization.

Added basic Azure Table index store.

20a67df

Added key and input sanitation, logic for property size limits and co…

4497ece

…rresponding deserialization logic.

Add example notebook.

00c7818

Update serialization/deserialization logic to account for Cosmos.

1bd5558

Update Google Colab link.

e1c12b5

Pass service mode.

f2a891d

Added more documentation.

c95d681

Update class_author fields.

496d3a2

Decompose _serialize and _deserialize for clarity.

a54d863

Implement AzureChatStore to persist conversation history to Azure T…

125f19a

…able Storage or CosmosDB.

Readibility enhancements.

c8caf31

AzureDocumentStore implementation bypassing property and item size li…

bf05289

…mitations for RefDocInfo.

Fix comments, add select.

c9095de

Add shared namespaced package.

ea059a8

Fix

2ea3bfd

Use comprehension here.

9191946

Falven force-pushed the feature/azure_table_storage branch from 3bdae70 to 9191946 Compare May 20, 2024 05:49

Update partition key strategy.

7d288d6

Falven and others added 4 commits May 21, 2024 11:12

Pipeline complaining about this missing file.

8da8b59

Merge branch 'main' into feature/azure_table_storage

cd82ada

remove lock

66ea9b9

missing header

eab7812

logan-markewich approved these changes May 22, 2024

View reviewed changes

missing BUILD file

1259a7f

Merge branch 'main' into feature/azure_table_storage

1a77a2e

logan-markewich enabled auto-merge (squash) May 22, 2024 20:53

Merge branch 'main' into feature/azure_table_storage

d541dd8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Azure Table Storage integration #13335

feat: Azure Table Storage integration #13335

Falven commented May 7, 2024 •

edited

Falven commented May 7, 2024

logan-markewich left a comment

Falven commented May 8, 2024 •

edited

logan-markewich commented May 9, 2024

review-notebook-app bot commented May 9, 2024

Falven commented May 9, 2024

logan-markewich commented May 9, 2024

Falven commented May 9, 2024 •

edited

Falven commented May 10, 2024 •

edited

logan-markewich commented May 10, 2024 •

edited

Falven commented May 10, 2024 •

edited

Falven commented May 20, 2024 •

edited

logan-markewich left a comment

logan-markewich commented May 22, 2024

feat: Azure Table Storage integration #13335

Are you sure you want to change the base?

feat: Azure Table Storage integration #13335

Conversation

Falven commented May 7, 2024 • edited

Description

Version Bump?

Type of Change

How Has This Been Tested?

Suggested Checklist:

Falven commented May 7, 2024

logan-markewich left a comment

Choose a reason for hiding this comment

Falven commented May 8, 2024 • edited

logan-markewich commented May 9, 2024

review-notebook-app bot commented May 9, 2024

Falven commented May 9, 2024

logan-markewich commented May 9, 2024

Falven commented May 9, 2024 • edited

Falven commented May 10, 2024 • edited

logan-markewich commented May 10, 2024 • edited

Falven commented May 10, 2024 • edited

Falven commented May 20, 2024 • edited

logan-markewich left a comment

Choose a reason for hiding this comment

logan-markewich commented May 22, 2024

Falven commented May 7, 2024 •

edited

Falven commented May 8, 2024 •

edited

Falven commented May 9, 2024 •

edited

Falven commented May 10, 2024 •

edited

logan-markewich commented May 10, 2024 •

edited

Falven commented May 10, 2024 •

edited

Falven commented May 20, 2024 •

edited