Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs/perf best practices update #7586

Closed

Conversation

dvryaboy
Copy link

Closes #(Insert issue number closed by this PR)

Change Description

Adds some more details to perf tips regarding reading directly from the backing object store, based on conversation in Slack [1][2]

Copy link
Contributor

@itaiad200 itaiad200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the awesome contribution! Our performance guidelines definitely has room for improvements. Few comments:

  1. I don't think Fuse examples are relevant in this doc.
  2. Should import really go under "Operate directly on the storage"? I'll let @talSofer to decide.

lakeFS offers multiple ways to do that:
lakeFS offers multiple ways to do that.

### Use zero-copy import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is the appropriate location for this section. For example, I could import data to lakeFS and then read it directly from lakeFS. So that flow isn't "operating directly on the object store"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, The zero-copy import section talks about how to regularly feed data into lakeFS rather than how to interact with data already managed by lakeFS.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you prefer to organize this information?

The existing flow makes sense to me but I'm not dead set on it.

The reason it makes sense to me is that when I looked at the LakeFS architecture diagram, the first question in my mind was "how do I bypass the lakefs service for bulk operations on data", as that's the obvious bottleneck assuming the object storage is S3 or similar. This then leads to 3 questions, which are addressed in this section:

  1. If I have data already in S3, how do I make LakeFS aware of it without re-copying? Answer: use zero-copy writes / imports . (This is where DVC fell out of our evaluation, btw...)
  2. If I already have a large dataset in LakeFS, how do I add new data? 2 answers: another zero copy import is fine, or write by performing metadata operations in lakefs and writing directly to storage
  3. How do I do efficient, scaleable reads? Answer: get the urls from the metadata service, talk to S3/GCS directly using the urls.

And the fuse stuff got in here because the follow up to the third question is, what if I am not writing my own reader, but rather using fuse to mount a bucket - does this mean I'm SOL for using LakeFS at all, and if not, do all the reads go through the slow way, streaming data through the LakeFS service? And the answer to that is also no, LakeFS thought of that, it's all done via symlinks and you can read directly from the storage.

docs/understand/performance-best-practices.md Outdated Show resolved Hide resolved
docs/understand/performance-best-practices.md Show resolved Hide resolved
* Read an object getObject (lakeFS OpenAPI) and add `--presign`. You'll get a link to download an object.
* Use statObject (lakeFS OpenAPI) which will return `physical_address` that is the actual S3/GCS path that you can read with any S3/GCS client.

### Reading directly from GCS when using GCS-Fuse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this belongs in a dedicated GCS-Fuse page, not in a general best performance practices guide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adjusted the wording and section header level a bit to clarify the relevance, but I do think it's worth calling out here because "how the heck will this work scaleably, if at all" is a performance question, and it doesn't hurt to answer it here and point at the GCS-Fuse page for details.

The GCS-Fuse page should probably live under integrations/gcs-fuse and perhaps also linked from deploy/gcp.md, not inside Vertex, as it's not just vertex that uses it, but I'll save that for a separate pull request :)

Also while we are on the topic, this is a very clever way to integrate with Fuse, kudos.

docs/understand/performance-best-practices.md Outdated Show resolved Hide resolved
Copy link
Contributor

@talSofer talSofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dvryaboy for your valued contribution!

I added some comments and agreed with @itaiad200's suggestions to

  1. Relocate the GCS-Fuse docs to a dedicated page. They are already at https://docs.lakefs.io/integrations/vertex_ai.html#using-lakefs-with-cloud-storage-fuse. We will be happy to see contributions/changes to that page if you are interested!
  2. Keep the use zero-copy import section in its original place.

We will be happy to get your thoughts on this :)

This is achieved by `lakectl fs upload --pre-sign` (Docs: [lakectl-upload][lakectl-upload]). The equivalent OpenAPI endpoint will return a URL to which the user can upload the file(s) in question.

### Read directly from the object store
lakeFS maintains versions by keeping track of each file; the commit and branch paths (`my-repo/commits/{commit_id}`, `my-repo/branches/main`) are virtual and resolved by the lakefs service to iterate through appropriate files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this part belongs to the concepts and model page.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a version of this is in the concepts and model page ("A lot of what lakeFS does is to manage how lakeFS paths translate to physical paths on the object store.", etc).

Do you feel this sort of thing needs to be DRYed up? My thought was that "repetition doesn't spoil the prayer", as the saying goes, and a brief mention here helps set the context without expecting the reader to have gone in detail through other pages.

@@ -31,9 +26,61 @@ When accessing data using the branch name (e.g. `lakefs://repo/main/path`) lakeF
For more information, see [how uncommitted data is managed in lakeFS][representing-refs-and-uncommitted-metadata].

## Operate directly on the storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to breakdown this section down into sub-sections, i'd suggest that each section describe a way to operate directly on the storage rather than distinguishing between reads and writes that are supported by most ways. That is, I'd use a structure like:

Operate directly on the storage

Pre-sign URLs

lakeFS Hadoop Filesystem

Staging API

WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that can work but can't personally commit to that as I don't quite grok the details of the staging API and the lakeFS HDFS setup (sadly, I know more than I would like to about HDFS itself :) ).

Is this something you'd like to address in this PR or is it possible to address in a subsequent one?

lakeFS offers multiple ways to do that:
lakeFS offers multiple ways to do that.

### Use zero-copy import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, The zero-copy import section talks about how to regularly feed data into lakeFS rather than how to interact with data already managed by lakeFS.

@dvryaboy
Copy link
Author

@talSofer and @itaiad200 , thank you for the feedback!
I incorporated a few of your suggestions and explained my motivation/reasoning for a couple of the things you pushed back on. LMK what you think after reading the explanation. In terms of which of these things I hold on to strongly vs weakly, I have a moderate preference for keeping gcs-fuse and high level concept info in here even if it's duplicated elsewhere, and a low level of preference for organizing things the way I did but can be easily convinced to organize differently if my arguments didn't sway you.

@dvryaboy dvryaboy requested a review from talSofer April 24, 2024 21:04
Copy link

This PR is now marked as stale after 30 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

@github-actions github-actions bot added the stale label May 25, 2024
Copy link

github-actions bot commented Jun 1, 2024

Closing this PR because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants