Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion/tableau): optionally ingest multiple sites and create site containers #10498

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

haeniya
Copy link
Contributor

@haeniya haeniya commented May 14, 2024

Description

With this PR, the Tableau ingestion can be configured to ingest multiple sites at once and to add the sites as containers. We have around 26 sites in the company, and it would be pretty cumbersome to manage 26 ingestions separately. Additionally, because we have so many sites and projects, we need the sites to show up as containers in Datahub.

To achieve this I introduced some new optional config properties: ingest_multiple_sites, add_site_container and site_name_pattern. All existing recipes should continue to work as-is.

A recipe could look something like this:

source:
  type: tableau
  config:
    # Coordinates
    connect_uri: https://tableau1-test.swisscom.com
    ingest_multiple_sites: True
    add_site_container: True
    site_name_pattern:
      allow: ["^Data.*$"]
    platform_instance: tableau_test

    # Credentials
    username: "${TABLEAU_USER}"
    password: "${TABLEAU_PASSWORD}"

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

And that's how it looks like after ingesting:
Bildschirmfoto 2024-05-14 um 11 17 57

I already saw this asked a few times in the Slack channel as well:

Please let me know what you think and I will proceed with adding some documentation and tests to this.

Many thanks.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels May 14, 2024
@@ -762,8 +809,9 @@ def _populate_projects_registry(self) -> None:

def _authenticate(self) -> None:
try:
self.server = self.config.make_tableau_client()
logger.info("Authenticated to Tableau server")
site_content_url = self.current_site.content_url if self.current_site else self.config.site
Copy link
Contributor Author

@haeniya haeniya May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done because I saw there is some code that does re-authentication:


With this we always re-authenticate with the current site or fallback to the site property provided in the config (when authenticating the first time there's no current_site set).


# We rely on automatic browse paths (v2) when creating containers. That's why we need to sort the projects here.
# Otherwise, nested projects will not have the correct browse paths if not created in correct order / hierarchy.
self.tableau_project_registry = OrderedDict(sorted(projects_to_ingest.items(), key=lambda item: len(item[1].path)))
Copy link
Contributor Author

@haeniya haeniya May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to a bug we were experiencing when dealing with nested projects. If nested projects were added before their parent the browse path (v2) would be incorrect (wrong hierarchy). Let me know if there is a better way to do this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works great, and is a good catch

if site.state != 'Active' or not self.config.site_name_pattern.allowed(site.name):
continue
self.current_site = site
self.server.auth.switch_site(site)
Copy link
Contributor Author

@haeniya haeniya May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed, that the documentation states that this method is not allowed with Tableau Cloud: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_authentication.htm#switch_site
Unfortunately, I'm not able to test it with Tableau Cloud. I had a solution in a previous commit that did the switch manually (by just re-authenticating). We could go back to this if needed. WDYT? Do we need to support Tableau Cloud?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we do need support for tableau cloud, so it would be better to manually switch

One other thing I need to test on my end - I want to double check that listing sites is supported with tableau cloud

@haeniya
Copy link
Contributor Author

haeniya commented May 21, 2024

I tried to fix the tests but for some reason when I run pytest tests/integration/tableau/test_tableau_ingest.py --update-golden-files most of the golden files get changed to an empty list which I don't think is correct. Btw - I already started to mock the new Tableau server methods in tableau_ingest_common. Any idea what I'm doing wrong?

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had some comments about code structuring, and we'll probably need a test for the ingest_multiple_sites and add_site_container functionality

add_site_container: bool = Field(
False,
description="When enabled, sites are added as containers and therefore visible in the folder structure within Datahub.",
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually it will make sense to make ingest_multiple_sites and add_site_container default to enabled, right

@@ -536,6 +560,7 @@ def __init__(
self.tableau_project_registry: Dict[str, TableauProject] = {}
self.workbook_project_map: Dict[str, str] = {}
self.datasource_project_map: Dict[str, str] = {}
self.current_site: Optional[SiteItem] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call _init_ingestion_variables here

I'd prefer to have all variable initialization in a single spot

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

broadly, resetting initialization variables is usually a sign that the class isn't specific enough. I think it makes sense to have a TableauSiteSource class that is site-specific, and the TableauSource simply lists the sites and then calls TableauSiteSource.get_workunits_internal()

Most of the existing methods would move to the TableauSiteSource, so this would be a pretty small refactoring. The state management would also become a lot easier to follow and structurally guarantee that we don't forget to reset some variable, and would mean that we need fewer tests for this code

if (
site.state != "Active"
or not self.config.site_name_pattern.allowed(site.name)
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to log these as skipped

if site.state != 'Active' or not self.config.site_name_pattern.allowed(site.name):
continue
self.current_site = site
self.server.auth.switch_site(site)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we do need support for tableau cloud, so it would be better to manually switch

One other thing I need to test on my end - I want to double check that listing sites is supported with tableau cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants