Docs/perf best practices update #7586

dvryaboy · 2024-03-20T21:51:15Z

Closes #(Insert issue number closed by this PR)

Change Description

Adds some more details to perf tips regarding reading directly from the backing object store, based on conversation in Slack [1][2]

docs/understand/performance-best-practices.md

itaiad200

Thanks for the awesome contribution! Our performance guidelines definitely have room for improvement. A few comments:

I don't think Fuse examples are relevant in this doc.
Should import really go under "Operate directly on the storage"? I'll let @talSofer decide.

itaiad200 · 2024-04-07T12:26:47Z

docs/understand/performance-best-practices.md

-lakeFS offers multiple ways to do that:
+lakeFS offers multiple ways to do that.
+
+### Use zero-copy import


I'm not sure that this is the appropriate location for this section. For example, I could import data to lakeFS and then read it directly from lakeFS. So that flow isn't "operating directly on the object store"

+1, The zero-copy import section talks about how to regularly feed data into lakeFS rather than how to interact with data already managed by lakeFS.

How would you prefer to organize this information?

The existing flow makes sense to me but I'm not dead set on it.

The reason it makes sense to me is that when I looked at the lakeFS architecture diagram, the first question in my mind was "how do I bypass the lakeFS service for bulk operations on data", as that's the obvious bottleneck assuming the object storage is S3 or similar. This then leads to 3 questions, which are addressed in this section:

If I have data already in S3, how do I make lakeFS aware of it without re-copying? Answer: use zero-copy writes / imports. (This is where DVC fell out of our evaluation, btw...)

If I already have a large dataset in lakeFS, how do I add new data? 2 answers: another zero-copy import is fine, or write by performing metadata operations in lakeFS and writing directly to storage.

How do I do efficient, scalable reads? Answer: get the URLs from the metadata service, then talk to S3/GCS directly using those URLs.

The Fuse material got in here because of the follow-up to the third question: what if I am not writing my own reader, but rather using Fuse to mount a bucket? Does this mean I'm SOL for using lakeFS at all, and if not, do all the reads go the slow way, streaming data through the lakeFS service? The answer to that is also no: lakeFS thought of that. It's all done via symlinks, and you can read directly from the storage.
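
To make the answer to the third question concrete, here is a minimal Python sketch of the two pieces a custom reader would need: building a statObject request and splitting the returned `physical_address` so any S3/GCS client can fetch the object. The route `/api/v1/repositories/{repo}/refs/{ref}/objects/stat` and the `presign` query parameter are my understanding of the lakeFS OpenAPI and should be checked against the published spec. The helper names `stat_object_url` and `split_physical_address`, the endpoint host, and the sample address are hypothetical.

```python
from urllib.parse import quote, urlencode, urlsplit


def stat_object_url(endpoint: str, repo: str, ref: str, path: str,
                    presign: bool = False) -> str:
    """Build the URL of a statObject request.

    Assumption: the route and the `presign` parameter match the lakeFS
    OpenAPI; verify against the spec before relying on this.
    """
    query = {"path": path}
    if presign:
        query["presign"] = "true"
    base = endpoint.rstrip("/")
    return (
        f"{base}/api/v1/repositories/{quote(repo, safe='')}"
        f"/refs/{quote(ref, safe='')}/objects/stat?{urlencode(query)}"
    )


def split_physical_address(address: str) -> tuple[str, str, str]:
    """Split 's3://bucket/key' or 'gs://bucket/key' into (scheme, bucket, key)."""
    parts = urlsplit(address)
    if parts.scheme not in ("s3", "gs"):
        raise ValueError(f"not an s3:// or gs:// address: {address!r}")
    return parts.scheme, parts.netloc, parts.path.lstrip("/")


if __name__ == "__main__":
    # Hypothetical endpoint, repository, and object path.
    print(stat_object_url("https://lakefs.example.com", "my-repo", "main",
                          "data/part-0.parquet"))
    # Hypothetical physical address, shaped like a statObject response field.
    print(split_physical_address("s3://my-bucket/my-repo/data/abc123"))
```

The request itself would be sent with the lakeFS access key pair as credentials. After that, the bucket and key go to a regular S3 or GCS client, so the object bytes never pass through the lakeFS service. When pre-signing is requested, the returned address is expected to be an HTTPS URL that can be fetched directly, in which case `split_physical_address` is not needed.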

docs/understand/performance-best-practices.md

itaiad200 · 2024-04-07T12:33:17Z

docs/understand/performance-best-practices.md

+* Read an object getObject (lakeFS OpenAPI) and add `--presign`. You'll get a link to download an object.
+* Use statObject (lakeFS OpenAPI) which will return `physical_address` that is the actual S3/GCS path that you can read with any S3/GCS client.
+
+### Reading directly from GCS when using GCS-Fuse


I believe this belongs in a dedicated GCS-Fuse page, not in a general best performance practices guide.

I adjusted the wording and section header level a bit to clarify the relevance, but I do think it's worth calling out here because "how the heck will this work scalably, if at all" is a performance question, and it doesn't hurt to answer it here and point at the GCS-Fuse page for details.

The GCS-Fuse page should probably live under integrations/gcs-fuse and perhaps also be linked from deploy/gcp.md, not inside Vertex, as it's not just Vertex that uses it, but I'll save that for a separate pull request :)

Also while we are on the topic, this is a very clever way to integrate with Fuse, kudos.

docs/understand/performance-best-practices.md

talSofer

Thank you @dvryaboy for your valued contribution!

I added some comments and agreed with @itaiad200's suggestions to:

Relocate the GCS-Fuse docs to a dedicated page. They are already at https://docs.lakefs.io/integrations/vertex_ai.html#using-lakefs-with-cloud-storage-fuse. We will be happy to see contributions/changes to that page if you are interested!
Keep the use zero-copy import section in its original place.

We will be happy to get your thoughts on this :)

talSofer · 2024-04-08T10:14:34Z

docs/understand/performance-best-practices.md

+This is achieved by `lakectl fs upload --pre-sign` (Docs: [lakectl-upload][lakectl-upload]). The equivalent OpenAPI endpoint will return a URL to which the user can upload the file(s) in question.
+
+### Read directly from the object store
+lakeFS maintains versions by keeping track of each file; the commit and branch paths (`my-repo/commits/{commit_id}`, `my-repo/branches/main`) are virtual and resolved by the lakefs service to iterate through appropriate files. 


I believe that this part belongs to the concepts and model page.

I think a version of this is in the concepts and model page ("A lot of what lakeFS does is to manage how lakeFS paths translate to physical paths on the object store.", etc).

Do you feel this sort of thing needs to be DRYed up? My thought was that "repetition doesn't spoil the prayer", as the saying goes, and a brief mention here helps set the context without expecting the reader to have gone in detail through other pages.
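
For readers who have not gone through the concepts page, the idea that branch and commit paths are virtual can be illustrated with a toy model. This is a deliberately simplified sketch, not how lakeFS stores metadata internally: each ref is treated as a plain mapping from logical paths to physical addresses, and all names and addresses are hypothetical.

```python
from typing import Dict

# Toy model only: a ref (branch or commit) maps logical paths to the
# physical addresses of objects in the backing store.
RefTable = Dict[str, Dict[str, str]]

refs: RefTable = {
    "main": {"data/a.parquet": "s3://bucket/ns/data/obj-001"},
}


def create_branch(table: RefTable, source: str, name: str) -> None:
    """Branching copies the mapping (metadata), not the objects."""
    table[name] = dict(table[source])


def import_object(table: RefTable, ref: str, path: str, physical: str) -> None:
    """A zero-copy import registers an existing object under a logical path."""
    table[ref][path] = physical


def resolve(table: RefTable, ref: str, path: str) -> str:
    """Translate (ref, logical path) into a physical address."""
    return table[ref][path]


if __name__ == "__main__":
    create_branch(refs, "main", "dev")
    import_object(refs, "dev", "data/b.parquet", "s3://other-bucket/raw/b.parquet")
    print(resolve(refs, "dev", "data/a.parquet"))
    print(resolve(refs, "dev", "data/b.parquet"))
```

In this model, both branches resolve `data/a.parquet` to the same physical object, and the import on `dev` adds a metadata entry without moving any data. That is the property that makes direct reads from the object store possible once the address is known.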

talSofer · 2024-04-08T10:18:08Z

docs/understand/performance-best-practices.md

@@ -31,9 +26,61 @@ When accessing data using the branch name (e.g. `lakefs://repo/main/path`) lakeF
 For more information, see [how uncommitted data is managed in lakeFS][representing-refs-and-uncommitted-metadata].

 ## Operate directly on the storage


If we want to break this section down into sub-sections, I'd suggest that each sub-section describe a way to operate directly on the storage, rather than distinguishing between reads and writes, which most of these ways support. That is, I'd use a structure like:

Operate directly on the storage

Pre-sign URLs

lakeFS Hadoop Filesystem

Staging API

WDYT?

I think that can work but can't personally commit to that as I don't quite grok the details of the staging API and the lakeFS HDFS setup (sadly, I know more than I would like to about HDFS itself :) ).

Is this something you'd like to address in this PR or is it possible to address in a subsequent one?

talSofer · 2024-04-08T10:19:24Z

docs/understand/performance-best-practices.md

-lakeFS offers multiple ways to do that:
+lakeFS offers multiple ways to do that.
+
+### Use zero-copy import


+1, The zero-copy import section talks about how to regularly feed data into lakeFS rather than how to interact with data already managed by lakeFS.

dvryaboy · 2024-04-10T19:42:59Z

@talSofer and @itaiad200 , thank you for the feedback!
I incorporated a few of your suggestions and explained my motivation/reasoning for a couple of the things you pushed back on. LMK what you think after reading the explanation. In terms of which of these things I hold on to strongly vs weakly: I have a moderate preference for keeping GCS-Fuse and high-level concept info in here even if it's duplicated elsewhere, and a weak preference for organizing things the way I did. I can be easily convinced to organize differently if my arguments didn't sway you.

github-actions · 2024-05-25T01:45:58Z

This PR is now marked as stale after 30 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions · 2024-06-01T01:49:31Z

Closing this PR because it has been stale for 7 days with no activity.

dvryaboy added 2 commits March 20, 2024 14:43

Add more detailed docs regarding reading direct from object store

cde5a88

typo fix

8f464b2

emulatorchen reviewed Apr 1, 2024

docs/understand/performance-best-practices.md

itaiad200 requested changes Apr 7, 2024

talSofer requested changes Apr 8, 2024

Small improvements to address PR comments

87775f6

dvryaboy requested a review from talSofer April 24, 2024 21:04

github-actions bot added the stale label May 25, 2024

github-actions bot closed this Jun 1, 2024
