New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BaseFilesystemReader interface + implement in SimpleDirectoryReader #13424
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mainly have a question here about whether the read_file
methods should actually be returning the file content 🤔 e.g. bytes
?
return self.list_files(**kwargs) | ||
|
||
@abstractmethod | ||
def get_file_info(self, input_file: Path, **kwargs) -> Dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am wondering if this should return a FileInfo
/FileMetadata
pydantic class which just contains some minimum required metadata fields to standardize this a bit 🤔
def read_file(self, input_file: Path, **kwargs) -> List[Document]: | ||
"""Read file from filesystem and return documents.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this return the file content instead? to separate the file parsing from file loading
…rface-simpledirreader
@@ -71,6 +71,141 @@ class Config: | |||
arbitrary_types_allowed = True | |||
|
|||
|
|||
class BaseResourcesReader(BaseReader): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, and probably fine to merge for now.
But in general, nesting base class under base class can be a confusing pattern to maintain 😅 Longer term I'd prefer just a single base class, or mixin's that add functionality
Because now we have BaseReader -> BaseResourcesReader -> BaseFileSystemReader ->
@@ -43,7 +43,7 @@ name = "llama-index-core" | |||
packages = [{include = "llama_index"}] | |||
readme = "README.md" | |||
repository = "https://github.com/run-llama/llama_index" | |||
version = "0.10.37" | |||
version = "0.10.37.post1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi this would need to be manually published after merging (happy to do that)
from llama_index.core.async_utils import run_jobs, get_asyncio_module | ||
from llama_index.core.schema import Document | ||
from tqdm import tqdm | ||
|
||
|
||
class BaseFilesystemReader(BaseResourcesReader, ABC): | ||
@abstractmethod | ||
def read_file_content(self, input_file: Path, **kwargs) -> bytes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of curious how this differs from load_resource()
-- feels like it would be the same method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this returns bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for now to unblock work, but
- there are too many base classes
- BaseFilesSystemReader doesn't seem entirely needed? But I could be wrong
…rface-simpledirreader
Description
This is just the changes made to
llama-index-core
in #13408. Opened a separate PR for these because the changes done to integrations there (like S3 and SharePoint) depend on the core package version being bumped first to pass CI.#13408 also contains the unit tests for the implementation in SimpleDirectoryReader
Version Bump?
Type of Change