Feed URL has been added to the stats. #5982
base: master
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##           master    #5982   +/-   ##
=======================================
  Coverage   89.19%   89.19%
=======================================
  Files         162      162
  Lines       11287    11289       +2
  Branches     1833     1833
=======================================
+ Hits        10067    10069       +2
  Misses        928      928
  Partials      292      292
```
```diff
@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
+        self.crawler.stats.set_value(f"feedexport/feed_url/{feed_uri.split('/')[-1]}", feed_uri)
```
Maybe this should be a Scrapy extension that users can opt in to.
What would you prefer? @Gallaecio @wRAR
```diff
@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
```
Avoid this type of text transformation; it is quite likely to break in the future if the log format changes. If you need `slot.uri`, you can pass it as an additional argument.
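A hedged sketch of that suggestion, with a minimal stand-in for the stats collector (`FakeStats` and the exact callback signature are illustrative, not Scrapy's actual internals):

```python
# Sketch of the reviewer's suggestion: receive the slot's URI as an explicit
# argument instead of re-parsing the log message. FakeStats is a minimal
# stand-in for Scrapy's stats collector; the names here are illustrative.
class FakeStats:
    def __init__(self):
        self.values = {}

    def set_value(self, key, value):
        self.values[key] = value

def handle_store_success(stats, logmsg, uri):
    # uri comes straight from the slot, so no fragile string splitting
    # of the log message is needed.
    stats.set_value(f"feedexport/feed_url/{uri.rsplit('/', 1)[-1]}", uri)

stats = FakeStats()
handle_store_success(stats, "Stored json feed (1 items) in: testing/test.json",
                     "testing/test.json")
print(stats.values)  # {'feedexport/feed_url/test.json': 'testing/test.json'}
```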
```diff
@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
+        self.crawler.stats.set_value(f"feedexport/feed_url/{feed_uri.split('/')[-1]}", feed_uri)
```
Please use alternatives like https://docs.python.org/3/library/pathlib.html or https://docs.python.org/3/library/urllib.parse.html to operate on URI paths.
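For illustration, a sketch of what that could look like with the standard library (the helper name `feed_file_name` is hypothetical):

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def feed_file_name(uri: str) -> str:
    # Hypothetical helper: take the last path component of a feed URI using
    # stdlib parsing instead of hand-rolled "/" splitting.
    return PurePosixPath(urlsplit(uri).path).name

print(feed_file_name("s3://my-bucket/exports/test.json"))  # test.json
print(feed_file_name("testing/test.json"))                 # test.json
```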
If multiple feeds are exported to the same top-level path (for example, to the same bucket), only the last one will be kept. That behavior could be acceptable depending on your use case, but at the very least it should be clearly documented.
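A tiny sketch of the collision this comment describes (the bucket names are made up):

```python
# Two feeds whose URIs end in the same file name map to the same stat key,
# so the second write overwrites the first.
stats = {}
for uri in ("s3://bucket-a/items.json", "s3://bucket-b/items.json"):
    stats[f"feedexport/feed_url/{uri.rsplit('/', 1)[-1]}"] = uri

print(stats)  # {'feedexport/feed_url/items.json': 's3://bucket-b/items.json'}
```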
Apart from the concerns in my previous comments, you will also need to add some docs and unit tests for the PR.
To be honest, I am not sure it makes sense to add this to the stats at all. What’s the scenario we want to address here? Not being sure that the right values are being used? Maybe logging the contents of
Yeah, I'm not sure this is useful either.
The idea is to make it easier for Spidermon monitors to check that feeds have been uploaded, and also to provide a quick way to find the uploaded feed URL rather than searching across millions of log lines.
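To illustrate the use case, this is roughly the kind of check a Spidermon monitor could run against a job's stats once the feed URL is recorded there (the stats dict below is an invented example, not real job output):

```python
# Invented example of job stats containing the proposed key.
job_stats = {
    "item_scraped_count": 120,
    "feedexport/feed_url/test.json": "testing/test.json",
}

# Collect every exported feed URL by the suffix of its stat key.
feed_urls = {
    key.split("/", 2)[-1]: value
    for key, value in job_stats.items()
    if key.startswith("feedexport/feed_url/")
}

assert feed_urls, "no feed was exported"
print(feed_urls)  # {'test.json': 'testing/test.json'}
```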
The log message would be at the beginning, so the number of log messages should not be an issue.
This makes sense, and I get that a log message does not work for this, but I wonder if there is a better way, or at least whether the stat should be defined differently. Given there are
I feel like this is not the only time something has been added to the stats just so Spidermon can take it from there, but I don't think we have better ways to pass this.
In this PR I have added the feed URL to the stats of the spider's job. Now anybody can easily get the feed URL from the stats of a job instead of searching for it in the job's logs.
Testing:
I have uploaded these changes to the internal PyPI server. You can easily install this version of Scrapy by running the command below.
`pip install scrapy` (Add remaining command from internal PYPI docs)
After installing this version of Scrapy, you can run the test spider with the following command.
`scrapy crawl feed_exporter_tester -o testing/test.json`
After the run completes, you can check the stats of the job; you should see a stat like `'feedexport/feed_url/test.json': 'testing/test.json'`.