Smarter deletion of temp datasets #3743

kaixi-wang · 2023-10-31T00:29:46Z

Changes:

add a flag for not deleting datasets referenced in saved views
don't add deleted datasets to singletons list

Todo:

add tests

…ingleton

codecov · 2023-10-31T00:35:33Z

Codecov Report

Attention: 21 lines in your changes are missing coverage. Please review.

Comparison is base (1ebf3f0) 16.17% compared to head (f623b3a) 16.17%.
Report is 5 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #3743   +/-   ##
========================================
  Coverage    16.17%   16.17%           
========================================
  Files          641      641           
  Lines        74137    74137           
  Branches       982      982           
========================================
  Hits         11988    11988           
  Misses       62149    62149

Flag	Coverage Δ
app	`16.17% <55.31%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
...perators/src/OperatorInvocationRequestExecutor.tsx	`0.00% <ø> (ø)`
...ackages/relay/src/fragments/mediaFieldsFragment.ts	`100.00% <100.00%> (ø)`
app/packages/relay/src/graphQLSyncFragmentAtom.ts	`52.17% <100.00%> (ø)`
app/packages/state/src/recoil/index.ts	`100.00% <100.00%> (ø)`
app/packages/state/src/recoil/selectors.ts	`50.98% <ø> (ø)`
app/packages/state/src/recoil/mediaFields.ts	`83.33% <83.33%> (ø)`
app/packages/operators/src/built-in-operators.ts	`0.00% <0.00%> (ø)`
app/packages/state/src/recoil/schema.ts	`39.97% <42.10%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

brimoor

Is there a specific problem you're trying to solve? There should be another way to solve it.

Views like ToPatches() that use temporary datasets are designed to automatically regenerate their backing datasets if they're ever deleted:

fiftyone/fiftyone/core/stages.py

Line 7453 in 4407043

if state != last_state or not fod.dataset_exists(name):

In fact it is preferable that these temporary datasets do get deleted; if the old collection still exists, then loading a saved patches view would just use it, even though the root dataset may have changed significantly since the view was saved. The patches view would then be out of sync, which is counter to the goal of dataset views: they're supposed to be a virtual concept that's always in sync (yes, creating temporary collections is an expensive operation; but I didn't have a better idea at the time to implement these views).

The way to force a patches view to "resync" with the underlying dataset is to call view.reload(), which, case and point, deletes the temporary dataset, because this is a safe thing to do:

fiftyone/fiftyone/core/patches.py

Line 335 in 4407043

self._patches_dataset.delete()

brimoor · 2023-10-31T01:46:21Z

Tracking whether a collection is in-use or not is a relevant problem that connects with work that @swheaton is doing on versioning. But the relevant definition of "in use" is whether a user is literally using the dataset at that moment, not whether a saved view that's not currently being used stores a reference to a dataset.

kaixi-wang · 2023-10-31T16:21:07Z

Is there a specific problem you're trying to solve? There should be another way to solve it.

Views like ToPatches() that use temporary datasets are designed to automatically regenerate their backing datasets if they're ever deleted:

fiftyone/fiftyone/core/stages.py

Line 7453 in 4407043

if state != last_state or not fod.dataset_exists(name):

In fact it is preferable that these temporary datasets do get deleted; if the old collection still exists, then loading a saved patches view would just use it, even though the root dataset may have changed significantly since the view was saved. The patches view would then be out of sync, which is counter to the goal of dataset views: they're supposed to be a virtual concept that's always in sync (yes, creating temporary collections is an expensive operation; but I didn't have a better idea at the time to implement these views).

The way to force a patches view to "resync" with the underlying dataset is to call view.reload(), which, case and point, deletes the temporary dataset, because this is a safe thing to do:

fiftyone/fiftyone/core/patches.py

Line 335 in 4407043

self._patches_dataset.delete()

regenerating the view is expensive and can take time, so when loading saved views in the app, if the temp dataset doesn't already exist, the app will throw an error. Also, because build view is called multiple times when a single view is loaded, the app will try to create the same temp dataset, leading to a duplicate index error.

I think a better approach is to update the temp dataset when the root dataset is updated rather than just randomly delete and completely regenerate views, especially if the datasets are large.

brimoor · 2023-10-31T19:01:31Z

Regenerating the temp dataset only when the root dataset is updated would have ideal properties from a reuse standpoint, but we don't have a way to track Dataset.last_updated_at just yet. It does connect with work that @swheaton is doing on Sample.last_updated_at though. Any efforts to reuse, retain, or otherwise change the behavior of temporary dataset cleanup should really be done in conjunction with dataset versioning, as they both use "hidden datasets" and have similar cleanup needs.

If loading a saved view in the App builds the view multiple times, I can see how that could lead to temporary patch datasets getting generated multiple times concurrently, which is wasteful from a computation standpoint. Sounds like we should try to remove duplicate view load calls, for efficiency but also just for cleanliness.

Can you provide a reproducible example of seeing a duplicate index error? Can't guess how/why that would be happening.

manivoxel51 · 2023-11-01T13:21:26Z

If loading a saved view in the App builds the view multiple times,

@kaixi-wang can you share a video on how to confirm we are loading a view multiple times when saved view loads? I just checked the network tab and I only see a set_view call once. maybe you mean in the SDK side?

check for datasets in saved views and don't add deleted datasets to s…

f623b3a

…ingleton

kaixi-wang self-assigned this Oct 31, 2023

brimoor requested changes Oct 31, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Smarter deletion of temp datasets #3743

Smarter deletion of temp datasets #3743

kaixi-wang commented Oct 31, 2023 •

edited

codecov bot commented Oct 31, 2023 •

edited

brimoor left a comment

brimoor commented Oct 31, 2023

kaixi-wang commented Oct 31, 2023 •

edited

brimoor commented Oct 31, 2023

manivoxel51 commented Nov 1, 2023

Smarter deletion of temp datasets #3743

Are you sure you want to change the base?

Smarter deletion of temp datasets #3743

Conversation

kaixi-wang commented Oct 31, 2023 • edited

codecov bot commented Oct 31, 2023 • edited

Codecov Report

brimoor left a comment

Choose a reason for hiding this comment

brimoor commented Oct 31, 2023

kaixi-wang commented Oct 31, 2023 • edited

brimoor commented Oct 31, 2023

manivoxel51 commented Nov 1, 2023

kaixi-wang commented Oct 31, 2023 •

edited

codecov bot commented Oct 31, 2023 •

edited

kaixi-wang commented Oct 31, 2023 •

edited