Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diff-able documents #16290

Open
davidbrochart opened this issue May 6, 2024 · 9 comments
Open

Diff-able documents #16290

davidbrochart opened this issue May 6, 2024 · 9 comments

Comments

@davidbrochart
Copy link
Contributor

Problem

There is work happening on supporting suggestions in RTC, just like in Google Docs. In the backend, the architecture consists of creating a fork of a shared document, that is used to receive the suggestion changes as well as the changes from the root document. When the suggestion is accepted, the fork is merged back into the root document.
In the current solution, the frontend is quite limited: one chooses to view either the root document or the fork. This has the advantage of fitting easily into the existing architecture (we just unplug the shared document provider from the root and plug it into the fork), but the UI is quite poor: one cannot see the differences e.g. side by side (like in nbdime) or in a unified view like in Google Docs.

Proposed Solution

To be able to show document diffs, I think that we would need a document model that can hold multiple shared documents. This would be very dependent on the underlying CRDT technology (i.e. Yjs), because we need to interpret deltas in order to show diffs (e.g. this cell was deleted, this text was inserted, etc.).
This issue is to start a discussion on whether to change the existing document models so that they support diffs, or to start new document models. Starting new models would probably result in a lot of code duplication from the existing ones, but changing existing models would make them rely on CRDTs even more.
I might miss other points, so I'd like to hear what people think about that.

@jupyterlab-probot jupyterlab-probot bot added the status:Needs Triage Applied to new issues that need triage label May 6, 2024
@fcollonval
Copy link
Member

I don't understand the solution.

Displaying a diff between two models, is comparing instantaneous snapshots of two models and producing a diff model (unrelated to the document model) for it (this is for example what nbdime is doing). What is the advantage of having a parent document?

An advantage I see to manipulate CRDT models is in the creation of the diff model as it could be simplified if the common base version is known and the modifications compressed/simplified between that common version and the current versions.

@davidbrochart
Copy link
Contributor Author

But nbdime is trying to infer changes, right? With CRDTs, and the root-fork model we have more information, because the diff is just the result of all deltas. For instance, moving a cell is a CRDT operation that is not ambiguous, but without this information we just have two cells that seem identical at different indices, that look like a move operation. There might be other cases that are impossible to infer exactly, while it is possible with CRDTs because we have the history of all changes.
Another case that comes to mind, some text that is deleted and replaced with some other text, maybe several times. We will be able to show the deletions (with strikethrough) and the additions separately.
Also, without CRDTs there is a performance cost, as the diff has to be computed each time there is a change, and this is going to be more and more complex with time. With CRDTs, we just interpret deltas as they come, live.

@fcollonval
Copy link
Member

But nbdime is trying to infer changes, right?

Yes

With CRDTs, and the root-fork model we have more information, because the diff is just the result of all deltas. For instance, moving a cell is a CRDT operation that is not ambiguous, but without this information we just have two cells that seem identical at different indices, that look like a move operation. There might be other cases that are impossible to infer exactly, while it is possible with CRDTs because we have the history of all changes. Another case that comes to mind, some text that is deleted and replaced with some other text, maybe several times. We will be able to show the deletions (with strikethrough) and the additions separately.

I agree with that. This was my last comment.

Note

Move object (aka cell) does not exist in yjs

Also, without CRDTs there is a performance cost, as the diff has to be computed each time there is a change, and this is going to be more and more complex with time. With CRDTs, we just interpret deltas as they come, live.

You will need to compute a diff anyway.

For example if you have the text evolution:

  1. Hello John
  2. Hello Mary
  3. Hello Johnny

When diffing 3 and 1, the user should see: addition of 'ny' at position 10. I don't know how easy that can be inferred from modification 1 (replace 'John' by 'Mary') and modification 2 (replace 'Mar' by 'Johnn').


Nevertheless, could you elaborate how having a document model that holding multiple shared documents will help?

@fcollonval
Copy link
Member

fcollonval commented May 7, 2024

One additional point, producing an easy-to-understand diff model is hard as a efficient diff model for the computer (like CRDTs) may not be easy to understand for the end user. I'm typically thinking of character based diff vs word/semantic group diff. You can look for pathological examples in the work done in VS Code for their second diff viewer: https://code.visualstudio.com/updates/v1_81#_diff-editor

@davidbrochart
Copy link
Contributor Author

davidbrochart commented May 7, 2024

Move object (aka cell) does not exist in yjs

Hmm wasn't it released as an alpha in Yjs? At least y-crdt has move_to for Array, see this issue.

When diffing 3 and 1, the user should see: addition of 'ny' at position 10

Or should the user see "Hello John Mary Johnny", meaning all intermediate deltas should be displayed? Then the processing doesn't increase with time, since we're just incrementally displaying the deltas.

Nevertheless, could you elaborate how having a document model that holding multiple shared documents will help?

I'm not sure about that, that's also why I would like your feedback. Do you think we could keep the current document models, and have a new diff-able "meta" document model that would hold instances of individual document models? It seems to me that at the minimum we need a document model that knows how to display diffs between individual document models.

@fcollonval
Copy link
Member

Hmm wasn't it released as an alpha in Yjs? At least y-crdt has move_to for Array, see this issue.

I don't see it in YArray. And the reference move PR on Yjs, I have been monitoring has not seen much activity.

Or should the user see "Hello John Mary Johnny", meaning all intermediate deltas should be displayed? Then the processing doesn't increase with time, since we're just incrementally displaying the deltas.

I don't think so. Because I gave you a case with only few changes. But if we get 10 modifications on the same line it won't be readable. And again, I'm not sure the CRDT will be user friendly like John Mary Johnny.

Nevertheless, could you elaborate how having a document model that holding multiple shared documents will help?

I'm not sure about that, that's also why I would like your feedback. Do you think we could keep the current document models, and have a new diff-able "meta" document model that would hold instances of individual document models? It seems to me that at the minimum we need a document model that knows how to display diffs between individual document models.

We certainly gonna need some kind of new model. But does it need to keep track of the individual document model. I don't know. By default I would say no on the ground of keeping thing simple and less coupling is simpler.

@davidbrochart
Copy link
Contributor Author

I don't think so. Because I gave you a case with only few changes. But if we get 10 modifications on the same line it won't be readable. And again, I'm not sure the CRDT will be user friendly like John Mary Johnny.

In Google Docs, the "Editing" mode corresponds to what you are suggesting: only the final result is displayed. But the "Suggesting" mode shows all intermediate changes, just like I shown.

We certainly gonna need some kind of new model. But does it need to keep track of the individual document model. I don't know. By default I would say no on the ground of keeping thing simple and less coupling is simpler.

I guess my question was really focused on a suggesting mode, where a user must be notified of other users making suggestions on the same document. In Google Docs, suggestion notifications appear on the right with a window allowing to accept it or discard it. So I would say that the new document model has to keep track of all the fork models that correspond to suggestions.

image

@echarles
Copy link
Member

Thx for starting the discussion @davidbrochart. Is the discussion around a general diffing feature, or specific to the suggesting feature. I would assume those feature to be quite different (even if they have intersections). We could add commenting to the picture if we want to expand.

@davidbrochart
Copy link
Contributor Author

My use-case is implementing an RTC suggestion system, where users can see the diffs in a unified view. Whether this can go into a general diffing system is the question.
For instance, nbdime is a diffing system for notebooks, but it doesn't fit the requirements of being real-time, nor providing a unified view (diffs are shown side by side).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants