
Datapusher COPY mode #221
Closed · wants to merge 7 commits
Conversation

@jqnatividad (Contributor) commented Jan 3, 2021:

Resolves #220.

When a queued file's size is greater than COPY_MODE_SIZE (in bytes) and a COPY_ENGINE_WRITE_URL is specified, datapusher uses Postgres COPY, similar to xloader; otherwise, it falls back to the existing datapusher logic.

The main difference is that we still use messytables to guess the column data types, which xloader currently doesn't do.

Also fixes #219, as the Datastore message is now more informative.
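As a rough illustration of that dispatch (not the PR's actual code; the setting names come from this description, while `should_use_copy` and the config object are hypothetical):

```python
def should_use_copy(file_size, config):
    """Illustrative only: choose between Postgres COPY and the existing
    Datastore API push, based on file size and configuration."""
    copy_mode_size = int(config.get('COPY_MODE_SIZE', 0) or 0)
    write_url = config.get('COPY_ENGINE_WRITE_URL')
    # COPY is used only when a direct write connection to the datastore
    # database is configured and the queued file exceeds the threshold.
    return bool(write_url) and copy_mode_size > 0 and file_size > copy_mode_size
```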

datapusher-copy

use postgres COPY for faster datapusher
@jqnatividad (Contributor, Author) commented:
I'm currently working on an implementation where a typical resource runs into the millions of rows.

At the time I started, xloader had not yet been migrated to CKAN 2.9, so I used datapusher and tried to improve it with HA capabilities, which were merged upstream.

Even with multiple workers it was still too slow for my use case, so I investigated switching to xloader. However, xloader doesn't do type-guessing, which my client requires, hence this PR.

I borrowed heavily from xloader. :)

It is very fast now, even a tad faster than xloader. I squeezed some more performance out of COPY by using the COPY FREEZE option, which obviates the need to maintain WAL logs during the COPY, and by running a VACUUM ANALYZE immediately afterwards.
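Roughly, that sequence looks like the following psycopg2 sketch (a minimal illustration, not this PR's code; the DSN, table name, and file path are placeholders, and FREEZE only applies when the table was created or truncated in the same transaction):

```python
import psycopg2

conn = psycopg2.connect("postgresql://ckan:pass@localhost/datastore")  # placeholder DSN
with conn:  # one transaction, committed on exit
    with conn.cursor() as cur:
        # FREEZE requires the table to have been created or truncated in
        # the current transaction, so truncate first.
        cur.execute('TRUNCATE TABLE "my_resource"')
        with open("resource.csv") as f:
            cur.copy_expert(
                'COPY "my_resource" FROM STDIN '
                'WITH (FORMAT CSV, HEADER TRUE, FREEZE TRUE)',
                f,
            )
# VACUUM cannot run inside a transaction block, so switch to autocommit.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('VACUUM ANALYZE "my_resource"')
conn.close()
```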

@amercader @davidread, I'd appreciate your review/feedback.

PEP8 reformatting;
Improve Datapusher non-COPY messages;
remove unnecessary TRYs in COPY mode
@jqnatividad (Contributor, Author) commented Jan 3, 2021:

Took the opportunity to fully fix #219. Messages are now emitted only every 20 seconds, and they include the row count and elapsed time.
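The throttling works along these lines (a minimal sketch; the 20-second interval is from this comment, while `push_chunk` and `send_message` are hypothetical stand-ins for datapusher's actual chunk insert and logging calls):

```python
import time

REPORT_INTERVAL = 20  # seconds between progress messages

def push_with_progress(chunks, push_chunk, send_message):
    """Illustrative only: report progress at most every REPORT_INTERVAL
    seconds, including the running row count and elapsed time."""
    start = last_report = time.monotonic()
    count = 0
    for chunk in chunks:
        push_chunk(chunk)      # stand-in for the actual Datastore insert
        count += len(chunk)
        now = time.monotonic()
        if now - last_report >= REPORT_INTERVAL:
            send_message("Pushed {:,} rows in {:.1f} seconds".format(count, now - start))
            last_report = now
    send_message("Done: {:,} rows in {:.1f} seconds".format(count, time.monotonic() - start))
```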

Here are some benchmarks for the same file, a sample of NYC 311 calls with 100k rows, on a VirtualBox VM running Postgres 11 and CKAN 2.9.1 on an 8 GB Ubuntu 16.04 instance.

Existing datapusher process: 288.17 seconds
Using COPY: 77.24 seconds, including a VACUUM ANALYZE.

(Screenshots of the runs, taken 2021-01-03 at 17:18 and 17:19.)

when COPY fails, we need it in the info message.
Going back to old behavior. Misunderstood the meaning of 'url_type'
 If COPY_MODE_SIZE is zero, or the filesize is less than the COPY_MODE_SIZE threshold in bytes, push thru Datastore API.

Otherwise, use COPY if we have a COPY_WRITE_ENGINE_URL.

Corrected minor typos and streamlined exception handling.
with open(tmp.name, 'rb') as f:
    header_line = f.readline()
try:
    sniffer = csv.Sniffer()

@EricSoroos commented on this diff:
If I'm understanding this correctly, you're pushing the file to the database as raw CSV, relying on the database's native interpretation of the data types, nulls, etc. for conversion. This will only work for files that came in as CSV, not ones uploaded as .xls/.ods and interpreted/converted by Messytables. Since the only criterion for choosing legacy vs. COPY is file size, some previously handled files would no longer be handled.

@jqnatividad (Contributor, Author) replied:

@EricSoroos, this is an old PR. I have since created a fork of Datapusher (https://github.com/dathere/datapusher-plus) that also supports xls/ods files, always uses Postgres COPY, drops the legacy datapusher code, and replaces messytables with qsv.

Would be interested in your feedback.

@jqnatividad (Contributor, Author) commented:

Closing this now that there is https://github.com/dathere/datapusher-plus

Successfully merging this pull request may close these issues:
- Use Postgres COPY in Datapusher
- replace 'saving chunk' message with a meaningful message