-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add verbose logging option to ProviderDataIngester
#4068
base: main
Are you sure you want to change the base?
Conversation
To log for a certain DAG set SHOULD_VERBOSE_LOG to a list containing that dag id e.g. ["smk_workflow"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
Thanks for you feedback @sarayourfriend! I pushed some commits and I'm running the linter locally now but I gotta go now. I will hopefully push the linting changes tonight. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much @nicoepp! I've left a few suggestions to clarify the (admittedly confusing 😄) distinction between terms like "record"/"batch item" in the logs. In addition, I think we should add at least one more verbose log to log the full result of get_batch_data
.
Since that is quite the verbose log, I've also asked that we not enable this logging by default locally, treating this more as an optional development tool. (CC @AetherUnbound to give you a chance to disagree with me if that's not what you had in mind! 😄)
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
I'm playing around with the idea to log the full result of I modified my code now to just log the first 5 and the last 5 dictionaries in each list returned by |
Absolutely perfect, even just the first five is probably fine for our needs! |
Here is the full log file (from the screenshot) for better reference: |
Ok. Thanks for the feedback. I'm only logging the first 5 dictionaries of the batch for now. I also pushed a commit making the verbose logging always controlled by the Airflow variable. No more checking the "ENVIRONMENT" variable while deciding to log or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Just a few very small comments and then we should ship this 💯
@@ -401,6 +407,7 @@ def get_batch(self, query_params: dict) -> tuple[list | None, bool]: | |||
|
|||
# Build a list of records from the response | |||
batch = self.get_batch_data(response_json) | |||
self._verbose_log("Batch data:", batch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._verbose_log("Batch data:", batch) | |
self._verbose_log( | |
f"Got batch with {len(batch)} items. The first items are:", | |
batch[:5] if batch else None | |
) |
Can we update this message with the actual batch len, to make sure that it's not misleading that only 5 are actually logged?
We could also do the list slicing here, and then have _verbose_log
log all the items in data
to make it more flexible for future use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm running into several problems changing the code as suggested:
It seems like batch is not always a list, it sometimes can be a dict which makes the tests fail since a dict cannot be sliced. This was not a problem if we did the slicing inside the _verbose_log
method because we would first check if the Variable
was set and the tests never activate verbose logging. See following screenshot showing the test that is mocking a dict batch instead of a list batch.
Also, the len(batch)
fails since batch can be None
. So, code that could handle all this would look something like this:
# ...
# Build a list of records from the response
batch = self.get_batch_data(response_json)
batch_to_log = batch if batch and not isinstance(batch, dict) else []
self._verbose_log(
f"Got batch with {len(batch_to_log)} items. The first items are:",
batch_to_log[:5]
)
# Optionally, apply some logic to the response to determine whether
# ingestion should continue or if should be short-circuited. By default
# this will return True and ingestion continues.
should_continue = self.get_should_continue(response_json)
return batch, should_continue
This code isn't the nicest looking IMO. Is this what we want? Ideas on how to improve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stacimc I pushed a commit now that tries to address this with code that is a bit nicer looking. E.g. for the exceptional case where batch is a dict
in the tests I thought it was better to do the handling of that and the batch slicing inside the _verbose_log
method. So, I added an optional param to indicate if you want to slice the batch and by how much before you print it.
Can you take a look and tell what you think?
logger.info(msg) | ||
if data: | ||
for item in data[:5]: | ||
logger.info(f"\t{item}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the added formatting, much visually cleaner! However instead of logging each line separately, I think we should do this all in one log statement by instead appending the items from data
to the msg
before logging. Something like:
if data:
msg += "\n".join([f"\t{item}" for item in data])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed code that does all this in one log statement now
I've also just noticed the catalog test failures -- that's because we are mocking all the results of MockVariable.get.side_effect = [
20, # ingestion_limit
{}, # skipped_ingestion_errors
+ [], # should_verbose_log
] |
Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal Excluding weekend1 days, this PR was ready for review 6 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s)2. @nicoepp, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nico, I'm converting this to a draft. Please let us know when this is ready for review. Thanks!
catalog/dags/providers/provider_api_scripts/provider_data_ingester.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Krystle Salazar <github@krysal.co>
and make amount of lines logged configurable
To log for a certain DAG set SHOULD_VERBOSE_LOG to a list containing that dag id.
For example, set the SHOULD_VERBOSE_LOG Airflow variable to: ["smk_workflow"]
Fixes
Fixes #1420 by @AetherUnbound
Description
Testing Instructions
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin