ckanext-activity: performance improvements #8169

Draft: duttonw wants to merge 2 commits into master from ckanext-activity_performance_improvements


@duttonw duttonw commented Apr 8, 2024

Fixes #7953

Problem: Dashboards take over 180 seconds to load for a logged-in normal user, and sending email notifications takes over 60 minutes for fewer than 60 users. The cause is high CPU and disk utilization on Postgres from inefficient, data-heavy queries.

Steps to reproduce:
Have a couple of datasets that have been updated every 15 minutes for the last 10 years, creating 'activity' diffs. Have about 2.1 million rows in the activity table with the data blob field fully populated.

Create a default user and follow the datasets (or the org). Visit the dashboard and see how long it takes to load. Then trigger email notifications for updated datasets/resources.

Example org activity stream: https://www.data.qld.gov.au/organization/environment-science-and-innovation

Proposed fixes:

Speed up activity stream loading

ckanext-activity plugin:

  • Skip or lazy-load heavy CLOB column
  • Use subselect instead of union for more effective indexing/querying
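The two ideas above can be sketched as follows. This is a minimal illustration, not the code in this PR: it uses the standard-library `sqlite3` module rather than Postgres and SQLAlchemy, and the simplified schema, the `follower` table, the index, and the `dashboard_stream` helper are all assumptions made for the example.

```python
import sqlite3

# Hypothetical, simplified schema. The real CKAN activity table differs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (
        id TEXT PRIMARY KEY,
        timestamp TEXT NOT NULL,
        user_id TEXT NOT NULL,
        object_id TEXT NOT NULL,
        activity_type TEXT NOT NULL,
        data TEXT                      -- large JSON blob
    );
    CREATE INDEX idx_activity_object_ts ON activity (object_id, timestamp);
    CREATE TABLE follower (user_id TEXT, object_id TEXT);
""")

big_blob = "x" * 10000  # stands in for a heavy point-in-time snapshot
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?, ?, ?, ?)",
    [
        (f"a{i}", f"2024-04-08T00:{i:02d}:00", "u-editor",
         f"ds{i % 3}", "changed package", big_blob)
        for i in range(30)
    ],
)
conn.executemany(
    "INSERT INTO follower VALUES (?, ?)",
    [("u1", "ds0"), ("u1", "ds1")],
)

# Every column except the heavy `data` blob.
LIGHT_COLUMNS = "id, timestamp, user_id, object_id, activity_type"


def dashboard_stream(conn, user_id, limit=10):
    """Return light activity rows for the objects a user follows.

    One SELECT filtered by a subselect replaces a UNION of per-source
    queries, and the `data` column is never read.
    """
    sql = f"""
        SELECT {LIGHT_COLUMNS}
        FROM activity
        WHERE object_id IN (
            SELECT object_id FROM follower WHERE user_id = ?
        )
        ORDER BY timestamp DESC
        LIMIT ?
    """
    return conn.execute(sql, (user_id, limit)).fetchall()


stream = dashboard_stream(conn, "u1", limit=5)
```

Because the listing query never names `data`, the blob does not have to be read or shipped for rows that are only being listed, and the single filtered SELECT gives the planner one place to apply an index and the limit.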

These changes have been deployed and tested on www.data.qld.gov.au under commit qld-gov-au@88d932e, which sits on top of CKAN 2.10.4. This has allowed us to re-enable hourly email notifications. They had been putting too much CPU and disk load on the database and causing batch-layer cron runs to overlap ever since we moved from 2.9 to 2.10, when the activity table's 'data' column started to be populated with JSON blobs of the point-in-time history.

Features:

  • includes tests covering changes
  • includes updated documentation
  • includes user-visible changes
  • includes API changes
  • includes bugfix for possible backport

Please [X] all the boxes above that apply

Co-Author: @ThrawnCA

Problem: when the activity table is very large and holds huge CLOB data (for example, a resource that has been updated every 15 minutes for the last 5 years), joins and unions in Postgres make subprocesses pull in the full tables, most of which is then thrown away.

Fix: ignore the data column until it is needed. We usually only want data for delta diffing; for email notifications and dashboard viewing it is not needed. When it is needed, we only need 30 subselects by primary key, which is super fast and cacheable.
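The "load it only when needed, by primary key" idea can be sketched like this. It is a hedged illustration under stated assumptions, not the implementation in this PR: it uses the standard-library `sqlite3` module, a cut-down `activity` table, and a hypothetical `LazyActivity` wrapper.

```python
import json
import sqlite3

# Hypothetical cut-down table; only the columns needed for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (id TEXT PRIMARY KEY, activity_type TEXT, data TEXT)"
)
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [
        (f"a{i}", "changed package",
         json.dumps({"package": {"title": f"Dataset {i}"}}))
        for i in range(100)
    ],
)


class LazyActivity:
    """Light activity row whose `data` is fetched on first access."""

    def __init__(self, conn, activity_id, activity_type):
        self._conn = conn
        self.id = activity_id
        self.activity_type = activity_type
        self._data = None
        self._loaded = False

    @property
    def data(self):
        # Single-row lookup by primary key, cached on the instance.
        if not self._loaded:
            row = self._conn.execute(
                "SELECT data FROM activity WHERE id = ?", (self.id,)
            ).fetchone()
            self._data = json.loads(row[0]) if row and row[0] else None
            self._loaded = True
        return self._data


# A page of 30 rows is listed without touching the blob column.
page = [
    LazyActivity(conn, activity_id, activity_type)
    for activity_id, activity_type in conn.execute(
        "SELECT id, activity_type FROM activity ORDER BY id LIMIT 30"
    )
]
```

Only code that actually reads `page[n].data`, such as a diff view, pays for the blob, and at most one primary-key lookup is issued per row on the page.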

- Skip or lazy-load heavy CLOB column
- Use subselect instead of union for more effective indexing
@duttonw duttonw force-pushed the ckanext-activity_performance_improvements branch from db1d02f to 52f3c99 Compare April 8, 2024 21:14