
Support reading from Azure with adlsv2 managed identity #2005

Open
jaychia opened this issue Mar 12, 2024 · 15 comments
@jaychia
Contributor

jaychia commented Mar 12, 2024

Is your feature request related to a problem? Please describe.

In Azure, sometimes users may use Azure managed identity instead of just pure credentials.

Daft should support this. This may require some API changes on the AzureConfig.

Resources:

(What are managed identities for Azure resources?) https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview
(How to use it from databricks) https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/azure-managed-identities
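To make the API-change idea concrete, here is a hypothetical sketch of how an extended AzureConfig might distinguish managed-identity auth from key-based auth. The field names and the `credential_kind` helper are illustrative assumptions, not Daft's actual AzureConfig API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch only: these field names are assumptions for
# illustration, not Daft's real AzureConfig.
@dataclass
class AzureConfig:
    storage_account: Optional[str] = None
    access_key: Optional[str] = None       # classic key-based credential
    use_managed_identity: bool = False     # proposed: authenticate via managed identity
    client_id: Optional[str] = None        # optional, for user-assigned identities

    def credential_kind(self) -> str:
        """Pick an auth strategy based on which fields were provided."""
        if self.use_managed_identity:
            return "managed_identity"
        if self.access_key is not None:
            return "access_key"
        return "anonymous"
```

With something like this, a user on an Azure VM or Databricks cluster could set `use_managed_identity=True` and skip passing keys entirely.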

@jaychia jaychia added the p1 Important to tackle soon, but preemptable by p0 label Mar 12, 2024
@jaychia jaychia added p2 Nice to have features and removed p1 Important to tackle soon, but preemptable by p0 labels Apr 22, 2024
@djouallah

I tried using Daft to read an Iceberg table via a catalog. It worked well with S3, but I got a "not implemented" error when using Azure ADLSGen2. Is support planned?

@jaychia
Contributor Author

jaychia commented May 5, 2024

Hi @djouallah !

This is planned, but requires some work on our end to correctly set up the Azure environment to test it. Let me bump up the priority on this :)

@jaychia jaychia added p0 Priority 0 - to be addressed immediately and removed p2 Nice to have features labels May 5, 2024
@jaychia
Contributor Author

jaychia commented May 5, 2024

@djouallah could you provide an example of your setup to help guide development of this feature?

@djouallah

I am getting this error:

DaftCoreException: DaftError::External Source not yet implemented: abfss

@jaychia
Contributor Author

jaychia commented May 6, 2024

> Source not yet implemented: abfss

Thanks! Actually this error might be because you're using abfss:// instead of abfs://.

Daft currently recognizes only az:// and abfs:// as valid ABFS URLs. Could you try using abfs:// instead and let me know if that works/how it fails?
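Until abfss:// is mapped natively, one user-side workaround is to normalize the scheme before handing the path to Daft. A minimal sketch; `normalize_abfss` is a hypothetical helper, not part of Daft:

```python
def normalize_abfss(url: str) -> str:
    """Rewrite abfss:// URLs to abfs:// so they hit the same Azure reader.

    abfss is the TLS-secured variant of the abfs scheme; the path
    component is identical, so only the scheme prefix needs to change.
    """
    if url.startswith("abfss://"):
        return "abfs://" + url[len("abfss://"):]
    return url
```

Note this only helps when you control the URL; as discussed below, paths coming from a catalog cannot be rewritten this way.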

@jaychia jaychia removed the p0 Priority 0 - to be addressed immediately label May 6, 2024
@samster25
Member

It might be a simple fix to just map abfss:// to our Azure reader. It would be great if you can try it out @djouallah!

@djouallah

Can you map it on your end, please? Changing from abfss to abfs will break other engines :(

@samster25
Member

@djouallah Yeah happy to map it on our end, we just want to verify that it indeed fixes your issue.
If you could run this on your end

df = daft.read_parquet("abfs://PATH_TO_PARQUET_FILE_THAT_IS_ABFSS")

and verify that it works

@djouallah

djouallah commented May 7, 2024

I understand, but I am reading from a catalog, so I can't change the URL.

[screenshot]

@samster25
Member

Actually, I just took a look at Azure's fsspec implementation, and they simply map abfss to the same handler. I'll push up a PR!

@samster25
Member

PR UP: #2244

@samster25
Member

samster25 commented May 9, 2024

@djouallah The latest version of daft should now have the fix for reading abfss!

pip install getdaft==0.2.24

@djouallah

@samster25 thanks, now I get this:

[screenshot of error]

although the credentials are already defined in the catalog:


from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.io.fsspec import FsspecFileIO
catalog = SqlCatalog(
    "default",
    **{
        "uri"                : postgresql_db ,
        "adlfs.account-name" : account_name ,
        "adlfs.account-key"  : AZURE_STORAGE_ACCOUNT_KEY,
        "adlfs.tenant-id"    : azure_storage_tenant_id,
        "py-io-impl"         : "pyiceberg.io.fsspec.FsspecFileIO"
    },
)
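For context on what an engine has to do here: the catalog carries the Azure credentials under `adlfs.`-prefixed keys, and a connector can strip that prefix to build its own storage options. A rough sketch of that translation (the function name and output keys are illustrative assumptions, not PyIceberg's or Daft's actual internals):

```python
def adlfs_options(catalog_props: dict) -> dict:
    """Collect 'adlfs.'-prefixed catalog properties into a flat
    storage-options dict, e.g. 'adlfs.account-name' -> 'account-name'."""
    prefix = "adlfs."
    return {
        key[len(prefix):]: value
        for key, value in catalog_props.items()
        if key.startswith(prefix)
    }
```

A connector that skips this step (or misses one of the keys) would fail to authenticate even though the catalog was configured correctly, which matches the behavior reported here.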

Polars works fine:

[screenshot]

@samster25
Member

Ah, we might have missed an option when converting the Iceberg credentials to our IOConfig. I'll make a fix ASAP!

What are you currently passing in storage_options for Polars?

@djouallah

No, Polars just figures it out by default; I did not add anything.

@kevinzwang kevinzwang self-assigned this May 15, 2024