Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BaseFilesystemReader interface + implement in SimpleDirectoryReader #13424

Merged
merged 6 commits into from May 17, 2024

Conversation

Javtor
Copy link
Contributor

@Javtor Javtor commented May 11, 2024

Description

This is just the changes made to llama-index-core in #13408. Opened a separate PR for these because the changes done to integrations there (like S3 and SharePoint) depend on the core package version being bumped first to pass CI.

#13408 also contains the unit tests for the implementation in SimpleDirectoryReader

Version Bump?

  • Yes

Type of Change

  • New feature (non-breaking change which adds functionality)

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 11, 2024
Copy link
Contributor

@sourabhdesai sourabhdesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly have a question here about whether the read_file methods should actually be returning the file content 🤔 e.g. bytes?

return self.list_files(**kwargs)

@abstractmethod
def get_file_info(self, input_file: Path, **kwargs) -> Dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am wondering if this should return a FileInfo/FileMetadata pydantic class which just contains some minimum required metadata fields to standardize this a bit 🤔

Comment on lines 42 to 43
def read_file(self, input_file: Path, **kwargs) -> List[Document]:
"""Read file from filesystem and return documents."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this return the file content instead? to separate the file parsing from file loading

@@ -71,6 +71,141 @@ class Config:
arbitrary_types_allowed = True


class BaseResourcesReader(BaseReader):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, and probably fine to merge for now.

But in general, nesting base class under base class can be a confusing pattern to maintain 😅 Longer term I'd prefer just a single base class, or mixin's that add functionality

Because now we have BaseReader -> BaseResourcesReader -> BaseFileSystemReader ->

@@ -43,7 +43,7 @@ name = "llama-index-core"
packages = [{include = "llama_index"}]
readme = "README.md"
repository = "https://github.com/run-llama/llama_index"
version = "0.10.37"
version = "0.10.37.post1"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi this would need to be manually published after merging (happy to do that)

from llama_index.core.async_utils import run_jobs, get_asyncio_module
from llama_index.core.schema import Document
from tqdm import tqdm


class BaseFilesystemReader(BaseResourcesReader, ABC):
@abstractmethod
def read_file_content(self, input_file: Path, **kwargs) -> bytes:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of curious how this differs from load_resource() -- feels like it would be the same method?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this returns bytes

Copy link
Collaborator

@logan-markewich logan-markewich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for now to unblock work, but

  • there are too many base classes
  • BaseFilesSystemReader doesn't seem entirely needed? But I could be wrong

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 16, 2024
@Javtor Javtor merged commit 980e617 into main May 17, 2024
8 checks passed
@Javtor Javtor deleted the javier/fs-reader-interface-simpledirreader branch May 17, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants