Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smarter deletion of temp datasets #3743

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

kaixi-wang
Copy link
Contributor

@kaixi-wang kaixi-wang commented Oct 31, 2023

Changes:

  • add a flag for not deleting datasets referenced in saved views
  • don't add deleted datasets to singletons list

Todo:

  • add tests

@kaixi-wang kaixi-wang self-assigned this Oct 31, 2023
@codecov
Copy link

codecov bot commented Oct 31, 2023

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (1ebf3f0) 16.17% compared to head (f623b3a) 16.17%.
Report is 5 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3743   +/-   ##
========================================
  Coverage    16.17%   16.17%           
========================================
  Files          641      641           
  Lines        74137    74137           
  Branches       982      982           
========================================
  Hits         11988    11988           
  Misses       62149    62149           
Flag Coverage Δ
app 16.17% <55.31%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...perators/src/OperatorInvocationRequestExecutor.tsx 0.00% <ø> (ø)
...ackages/relay/src/fragments/mediaFieldsFragment.ts 100.00% <100.00%> (ø)
app/packages/relay/src/graphQLSyncFragmentAtom.ts 52.17% <100.00%> (ø)
app/packages/state/src/recoil/index.ts 100.00% <100.00%> (ø)
app/packages/state/src/recoil/selectors.ts 50.98% <ø> (ø)
app/packages/state/src/recoil/mediaFields.ts 83.33% <83.33%> (ø)
app/packages/operators/src/built-in-operators.ts 0.00% <0.00%> (ø)
app/packages/state/src/recoil/schema.ts 39.97% <42.10%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@brimoor brimoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific problem you're trying to solve? There should be another way to solve it.

Views like ToPatches() that use temporary datasets are designed to automatically regenerate their backing datasets if they're ever deleted:

if state != last_state or not fod.dataset_exists(name):

In fact it is preferable that these temporary datasets do get deleted; if the old collection still exists, then loading a saved patches view would just use it, even though the root dataset may have changed significantly since the view was saved. The patches view would then be out of sync, which is counter to the goal of dataset views: they're supposed to be a virtual concept that's always in sync (yes, creating temporary collections is an expensive operation; but I didn't have a better idea at the time to implement these views).

The way to force a patches view to "resync" with the underlying dataset is to call view.reload(), which, case and point, deletes the temporary dataset, because this is a safe thing to do:

self._patches_dataset.delete()

@brimoor
Copy link
Contributor

brimoor commented Oct 31, 2023

Tracking whether a collection is in-use or not is a relevant problem that connects with work that @swheaton is doing on versioning. But the relevant definition of "in use" is whether a user is literally using the dataset at that moment, not whether a saved view that's not currently being used stores a reference to a dataset.

@kaixi-wang
Copy link
Contributor Author

kaixi-wang commented Oct 31, 2023

Is there a specific problem you're trying to solve? There should be another way to solve it.

Views like ToPatches() that use temporary datasets are designed to automatically regenerate their backing datasets if they're ever deleted:

if state != last_state or not fod.dataset_exists(name):

In fact it is preferable that these temporary datasets do get deleted; if the old collection still exists, then loading a saved patches view would just use it, even though the root dataset may have changed significantly since the view was saved. The patches view would then be out of sync, which is counter to the goal of dataset views: they're supposed to be a virtual concept that's always in sync (yes, creating temporary collections is an expensive operation; but I didn't have a better idea at the time to implement these views).

The way to force a patches view to "resync" with the underlying dataset is to call view.reload(), which, case and point, deletes the temporary dataset, because this is a safe thing to do:

self._patches_dataset.delete()

regenerating the view is expensive and can take time, so when loading saved views in the app, if the temp dataset doesn't already exist, the app will throw an error. Also, because build view is called multiple times when a single view is loaded, the app will try to create the same temp dataset, leading to a duplicate index error.

I think a better approach is to update the temp dataset when the root dataset is updated rather than just randomly delete and completely regenerate views, especially if the datasets are large.

@brimoor
Copy link
Contributor

brimoor commented Oct 31, 2023

Regenerating the temp dataset only when the root dataset is updated would have ideal properties from a reuse standpoint, but we don't have a way to track Dataset.last_updated_at just yet. It does connect with work that @swheaton is doing on Sample.last_updated_at though. Any efforts to reuse, retain, or otherwise change the behavior of temporary dataset cleanup should really be done in conjunction with dataset versioning, as they both use "hidden datasets" and have similar cleanup needs.

If loading a saved view in the App builds the view multiple times, I can see how that could lead to temporary patch datasets getting generated multiple times concurrently, which is wasteful from a computation standpoint. Sounds like we should try to remove duplicate view load calls, for efficiency but also just for cleanliness.

Can you provide a reproducible example of seeing a duplicate index error? Can't guess how/why that would be happening.

@manivoxel51
Copy link
Contributor

If loading a saved view in the App builds the view multiple times,

@kaixi-wang can you share a video on how to confirm we are loading a view multiple times when saved view loads? I just checked the network tab and I only see a set_view call once. maybe you mean in the SDK side?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants