Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pagination support to Files Source plugins #18059

Open
wants to merge 31 commits into
base: dev
Choose a base branch
from

Conversation

davelopez
Copy link
Contributor

@davelopez davelopez commented Apr 26, 2024

Currently, when listing contents in a remote file source, we retrieve all possible files and directories inside, then perform client-side pagination and search. While this approach is enough for small sources with a limited number of files, larger repositories could potentially contain a vast number of files, making navigation within these sources very slow and difficult.

This addresses the issue by passing limit and offset parameters to paginate the contents. However, this approach comes with its own drawbacks. Specifically, we must carefully handle sorting, filtering, and searching server-side, which may prove challenging depending on the plugin implementation. Also, pagination is only applicable in a "non-recursive" view of the file source.

The pull request includes the following changes:

  • A new serializable property for each plugin called supports_pagination. This property determines if the plugin can paginate the contents server-side so the client can delegate that.
  • Implementation of pagination for every PyFilesystem2-based and Invenio-based plugin.
  • A new plugin named TempFileSource, which simplifies testing of PyFilesystem2-based plugins. This plugin utilizes the built-in OS Filesystem. I wonder if we could even replace the posix plugin with this... unless there are specific functionalities unique to the posix plugin 🤔
  • Fix an issue in PyFilesystem2 plugins when listing them recursively, which was detected during testing with the new TempFileSource plugin.
  • A new serializable property for each plugin called supports_search. This property determines if the plugin can filter results by a search query. The syntax of the query is up to the plugin implementation, but a basic "search by name" should be supported as a minimum.
  • A new serializable property for each plugin called supports_sorting. Most plugins (if any) won't support this, but it can signal the client to disable sorting in the UI when pagination is supported but server-side sorting is not, so the listings are always in sync.
  • Listing file source contents will now return the total number of matches for the request. This is to be able to handle pagination correctly when using server-side pagination.

FilesSourcesPagination

TODO

  • Support server-side pagination.
    • Add total_matches as an additional return value when listing contents.
  • Support server-side filtering/search.
  • Support server-side sorting.
  • Update UI to use server-side pagination

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jmchilton
Copy link
Member

Looking great so far!

@davelopez davelopez force-pushed the file_sources_pagination branch 2 times, most recently from 591200e to d3cef4e Compare May 3, 2024 11:31
This probably never worked as the information required to serialize the files was not there.
This flag will inform if the file source plugin is capable of filtering the list of contents by providing a search query string. The supported query syntax is up to the file source implementation, but a basic "search by name" should be supported.
This is necessary to handle pagination properly in the UI.
@davelopez
Copy link
Contributor Author

davelopez commented May 17, 2024

I guess this test failure:

tests/files/test_gcsfs.py:23: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/files/_util.py:193: in assert_simple_file_realize
    res, count = file_source.list("/", recursive=recursive, user_context=user_context)
galaxy/files/sources/__init__.py:441: in list
    return self._list(path, recursive, user_context, opts, limit, offset, query)
galaxy/files/sources/_pyfilesystem2.py:76: in _list
    count = self._get_total_matches_count(h, path, filter)
galaxy/files/sources/_pyfilesystem2.py:93: in _get_total_matches_count
    directories_count = fs.glob(directory_glob_pattern).count()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Globber(GCSFS('genomics-public-data', root_path=''), '//*/')

    def count(self):
        # type: () -> Counts
        """Count files / directories / data in matched paths.
    
        Example:
            >>> my_fs.glob('**/*.py').count()
            Counts(files=2, directories=0, data=55)
    
        Returns:
            `~Counts`: A named tuple containing results.
    
        """
        directories = 0
        files = 0
        data = 0
        for _path, info in self._make_iter(namespaces=["details"]):
            if info.is_dir:
                directories += 1
            else:
                files += 1
>           data += info.size
E           TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

Is caused either by an upstream implementation error or by this known limitation of GCSFS https://fs-gcsfs.readthedocs.io/en/stable/#limitations

As a workaround, I will catch the error and default to 0 as directories are not a "real" thing in Google Cloud Storage or try to call fix_storage() as stated in the documentation but this might make navigating this file source slower 🤔

Update

Using fix_storage() didn't solve the issue and created more issues, so I'll try the other workaround even if pagination might not work correctly with the Google Cloud Storage plugging. In the worst case, we can set supports_pagination to false for this plugin and fall back to the existing behavior.

@davelopez davelopez marked this pull request as ready for review May 17, 2024 09:37
@github-actions github-actions bot added this to the 24.1 milestone May 17, 2024
@davelopez
Copy link
Contributor Author

This should be finally ready for review. The remaining converter and integration test failures seem unrelated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants