Redshift Ingestion broken in 0.13.2 #10435

Open

joshua-pgatour opened this issue May 6, 2024 · 4 comments
Labels: bug (Bug report)

@joshua-pgatour

Describe the bug

Execution finished with errors.
{'exec_id': '2236cd44-90eb-4781-9563-05ea51f4bbd4',
 'infos': ['2024-05-03 19:27:34.726045 INFO: Starting execution for task with name=RUN_INGEST',
           '2024-05-03 19:55:40.923833 INFO: Caught exception EXECUTING task_id=2236cd44-90eb-4781-9563-05ea51f4bbd4, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': ['2024-05-03 19:55:40.923635 ERROR: The ingestion process was killed by signal SIGKILL likely because it ran out of memory. You can '
            'resolve this issue by allocating more memory to the datahub-actions container.']}

I have tried increasing memory up to 32 GB and I still get this error. I've turned off lineage and profiling, and even tried just tables, no views.
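
For context, the memory increase is applied through standard Kubernetes resource limits on the datahub-actions pod; a minimal sketch is below (the Helm values path for the actions container is an assumption and may differ by chart version):

```yaml
# Sketch only: raising the memory limit on the datahub-actions pod via
# standard Kubernetes resource limits. The acryl-datahub-actions values key
# is an assumption based on the DataHub Helm chart layout.
acryl-datahub-actions:
  resources:
    requests:
      memory: "8Gi"
    limits:
      memory: "32Gi"
```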

Occasionally I will get this error instead of the memory one:

  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
    yield from self.extract_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
    lineage_extractor.populate_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
    table_renames, all_tables_set = self._process_table_renames(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 872, in _process_table_renames
    all_tables[database][schema].add(prev_name)
KeyError: 'pgat_competitions_x'

In this case it seems to be trying to reference a schema name that I have filtered out in the ingest recipe.
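
For reference, the allow entry in my recipe is a regex, so the schema named in the KeyError should never match it. Below is a minimal sketch of how the filter applies, with my (unconfirmed) reading of the traceback in the comments:

```yaml
# Sketch of how the recipe's allow pattern is evaluated (as a regex, not a glob):
schema_pattern:
  allow:
    - 'curated_.*'   # matches schemas whose names start with "curated_"
# 'pgat_competitions_x' does not match, so it is excluded from extraction.
# My assumption from the traceback: _process_table_renames in lineage.py builds
# its rename map from query history that schema_pattern does not filter, so it
# can reference a schema that was never added to all_tables, hence the KeyError.
```
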
joshua-pgatour added the bug (Bug report) label May 6, 2024
@hsheth2 (Collaborator) commented May 8, 2024

@joshua-pgatour what CLI version is this with?

I merged a fix related to this in #9967

@joshua-pgatour (Author) commented May 9, 2024

Thank you for the reply. I have tried going back one version at a time through the datahub-actions Docker Hub releases, and the KeyError seems to stop happening around v10. However, I still have a memory issue: I have pretty much maxed out the size my pod can be and it still fails with a SIGKILL from running out of memory. Any suggestions for getting around this? Here is my current recipe:

```yaml
source:
  type: redshift
  config:
    host_port: ''
    database: pgat
    username:
    table_lineage_mode: mixed
    include_table_lineage: false
    include_tables: true
    include_views: false
    profiling:
      enabled: false
      profile_table_level_only: false
    stateful_ingestion:
      enabled: true
    password: '${redshift_secret2}'
    schema_pattern:
      allow:
        - 'curated_.*'
```

@joshua-pgatour (Author)

So I figured out how to change the CLI version in the ingest recipe. I'm sorry, I thought it was controlled by the datahub-actions container version; I didn't know it was set in the recipe. The 0.10.5.1 CLI works fine on 16 GB of memory and there is no KeyError. I will experiment to find the point where this breaks, but I have to believe there's a memory leak in newer versions.

@joshua-pgatour (Author)

I can confirm that v0.13.2 has the memory problem. v0.13.1 works, but in my testing the ingest process has slowed significantly since 0.12.
