Smarter deletion of temp datasets #3743
base: develop
Codecov Report
Additional details and impacted files:
@@           Coverage Diff            @@
##           develop    #3743   +/-   ##
========================================
  Coverage    16.17%   16.17%
========================================
  Files          641      641
  Lines        74137    74137
  Branches       982      982
========================================
  Hits         11988    11988
  Misses       62149    62149
Is there a specific problem you're trying to solve? There should be another way to solve it.
Views like ToPatches() that use temporary datasets are designed to automatically regenerate their backing datasets if they're ever deleted:
fiftyone/fiftyone/core/stages.py
Line 7453 in 4407043
if state != last_state or not fod.dataset_exists(name):
In fact it is preferable that these temporary datasets do get deleted; if the old collection still exists, then loading a saved patches view would just use it, even though the root dataset may have changed significantly since the view was saved. The patches view would then be out of sync, which is counter to the goal of dataset views: they're supposed to be a virtual concept that's always in sync (yes, creating temporary collections is an expensive operation; but I didn't have a better idea at the time to implement these views).
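To make the check quoted above concrete, here is a minimal sketch built on a toy in-memory registry rather than the real database. `REGISTRY`, `dataset_exists`, and `load_patches` are hypothetical stand-ins, not FiftyOne APIs; only the shape of the condition mirrors the quoted line from stages.py.

```python
# Toy stand-in for the database of datasets: name -> list of patch labels.
# Everything here is hypothetical and only illustrates the condition above.
REGISTRY = {}


def dataset_exists(name):
    """Hypothetical analogue of an existence check on a backing dataset."""
    return name in REGISTRY


def load_patches(name, root, state, last_state):
    """Return (patches, regenerated) for the temporary dataset `name`.

    The backing dataset is rebuilt only if the stage's state changed or the
    dataset was deleted; otherwise whatever is stored is reused as-is.
    """
    regenerated = False
    if state != last_state or not dataset_exists(name):
        REGISTRY[name] = [label for sample in root for label in sample]
        regenerated = True
    return REGISTRY[name], regenerated


root = [["cat", "dog"], ["bird"]]
state = {"field": "detections"}

# First load: there is no previous state, so the patches are generated
first, rebuilt_first = load_patches("tmp", root, state, last_state=None)

# The root dataset changes, but the stage's state does not, so the old
# patches are reused and are now out of sync with the root
root.append(["fish"])
stale, rebuilt_stale = load_patches("tmp", root, state, last_state=state)

# Deleting the temporary dataset forces a regeneration on the next load
del REGISTRY["tmp"]
fresh, rebuilt_fresh = load_patches("tmp", root, state, last_state=state)
```

In this toy model the second load returns the old patches without `"fish"`, which is the out-of-sync case described above; only the load after deletion picks up the new sample.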
The way to force a patches view to "resync" with the underlying dataset is to call view.reload(), which, case in point, deletes the temporary dataset, because this is a safe thing to do:
fiftyone/fiftyone/core/patches.py
Line 335 in 4407043
self._patches_dataset.delete()
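As a rough illustration of that reload behavior, the sketch below models a view whose backing data is dropped and rebuilt on `reload()`. `ToyPatchesView` and its attributes are hypothetical and are not the classes in patches.py; the real implementation deletes an actual dataset rather than a Python list.

```python
class ToyPatchesView:
    """Hypothetical model of a view backed by a temporary dataset."""

    def __init__(self, root):
        self.root = root      # list of samples, each a list of labels
        self.builds = 0       # how many times the backing data was built
        self._patches = None
        self._build()

    def _build(self):
        # Flatten every label in the root into one list of patches
        self._patches = [label for sample in self.root for label in sample]
        self.builds += 1

    def patches(self):
        # Return a copy so callers cannot mutate the backing data
        return list(self._patches)

    def reload(self):
        # Drop the old backing data first, then rebuild it from the root
        self._patches = None
        self._build()


root = [["cat"], ["dog"]]
view = ToyPatchesView(root)

root.append(["bird"])      # the root changes after the view was created
before = view.patches()    # still reflects the old root
view.reload()
after = view.patches()     # back in sync with the root
```

Dropping the backing data is safe here for the same reason given above: it can always be rebuilt from the root.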
Tracking whether a collection is in-use or not is a relevant problem that connects with work that @swheaton is doing on versioning. But the relevant definition of "in use" is whether a user is literally using the dataset at that moment, not whether a saved view that's not currently being used stores a reference to a dataset.
Regenerating the view is expensive and can take time, so when loading saved views in the app, if the temp dataset doesn't already exist, the app will throw an error. Also, because build view is called multiple times when a single view is loaded, the app will try to create the same temp dataset more than once, leading to a duplicate index error. I think a better approach is to update the temp dataset when the root dataset is updated, rather than randomly deleting and completely regenerating views, especially if the datasets are large.
Regenerating the temp dataset only when the root dataset is updated would have ideal properties from a reuse standpoint, but we don't have a way to track
If loading a saved view in the App builds the view multiple times, I can see how that could lead to temporary patch datasets getting generated multiple times concurrently, which is wasteful from a computation standpoint. Sounds like we should try to remove duplicate view load calls, for efficiency but also just for cleanliness.
Can you provide a reproducible example of seeing a duplicate index error? Can't guess how/why that would be happening.
@kaixi-wang can you share a video showing how to confirm that we are loading a view multiple times when a saved view loads? I just checked the network tab and I only see a set_view call once. Maybe you mean on the SDK side?