use tree-sitter #174

Open
tpapp opened this issue Jun 6, 2022 · 37 comments

Comments

@tpapp
Collaborator

tpapp commented Jun 6, 2022

tree-sitter is a framework for incremental parsing of source code. The Julia implementation is now, in my opinion, fairly mature.

I am asking for comments about replacing our ad-hoc regexp-based parsing mechanisms with it. Specifically, it would help with

  1. font locking,
  2. indentation,
  3. code navigation,

resolving a host of issues.

I am aware that it isn't perfect, as nothing is, but at least improvements would go to a repository that helps all Julia users, not just those who use Emacs.

EDIT Some links:

  1. Mastering Emacs has a nice writeup about tree-sitter, and also mentions combobulate which also looks useful
  2. tree-sitter docs
@tpapp
Collaborator Author

tpapp commented Jun 6, 2022

@FelipeLema: since you wrote an indentation solution using tree-sitter, I would appreciate your comments. I recall you mentioning it a while ago.

@FelipeLema
Contributor

FelipeLema commented Jun 6, 2022

I agree that maintaining the regex parser is currently taking a toll on code readability and maintainability, even for people who are well versed in (Emacs) Lisp. I agree that tree-sitter is a tool that has many eyes on it ("it is production quality") and that end users have only good things to say about it.

However, I'd say it gets complicated for Julia. There are two big aspects to consider: how well the parser is maintained and how well it can be maintained (by the Julia community).

First, not only is the Julia tree-sitter parser not complete, but its pace may not fit the requirements of the Julia community (incl tooling people & Julia devs). It didn't fit me particularly, I ran into this problem rather quickly. When I started using the tree-sitter indent tool I found a bug. Then I reported it, debugged it, proposed a fix and I'm still waiting on the problem to be addressed. Can the tree-sitter parser be hijacked by the julia community so we can improve it at a quicker pace? I honestly don't know.

After that, I ended up writing a Julia code formatter because I thought it would be easier to do so than to push for changes upstream in the Julia tree-sitter parser. The name of the tool I wrote is actually a misnomer because it parses the code in a Julia buffer and does operations on its AST just as tree-sitter does.

So this brings up the second concern about tree-sitter: it is written in another language (and right now we need to write fixes in JS for the Julia parser). Correct me if I'm wrong, but I believe that most of the Julia community understands the long-term problems of maintaining several languages for their workflows. I would personally rather avoid writing JS if possible (which is what I ended up doing in the paragraph above).

My recommendation is to use CSTParser (or even the parser that comes with Julia binaries) to parse the AST and to copy code from tree-sitter.el to handle the items mentioned in the top entry of this discussion. From my experience with the Julia code formatter, I'd recommend using DaemonMode for Emacs-Julia communication (to have little-to-no response delay), since using json-rpc, as LSP does, may cause problems on Windows.

Using CSTParser may have a positive effect as it lowers the barriers for Julia end users to participate in maintaining this package (kinda what the Racket community was betting on when they switched to ChezScheme).

All this being said, I want to note that I use tree-sitter in neovim on an everyday basis and that everything (except for the Julia tree-sitter parser) works A-OK.

@tpapp
Collaborator Author

tpapp commented Jun 7, 2022

@FelipeLema: thanks for the detailed explanation (incidentally, would you consider giving the PR you mentioned a friendly ping? Perhaps it was just overlooked --- that happens).

Conceptually, I can think of the following components for the features we need:

  1. a parsing framework, if applicable (eg tree-sitter, LSP)
  2. Julia-specific part within that framework (eg tree-sitter-julia, LanguageServer.jl)
  3. IDE-specific integration with the framework that calls this code, if applicable (emacs-tree-sitter, emacs-lsp)
  4. a bit of mode-specific glue that calls this

Currently in julia-mode we pretty much do everything above with hacks using regexps.

CSTParser.jl does 1+2. We would have to maintain part 3 ourselves, most likely using a daemon-based approach you outline. The advantage is indeed doing a lot in Julia. The disadvantage is all the framework associated with maintaining a running instance of Julia --- doable but somewhat heavyweight.

Tree-sitter would allow us to combine effort for 1, 2, and 3 with other projects (editors other than Emacs, languages other than Julia). That said, I realize that a lot of layers can introduce problems, too.

Interestingly, the LSP spec includes semantic tokens since 3.16. I wonder if that's supported in practice for Julia with Emacs, @gdkrmr and @non-Jedi, it would be great if you could share your thoughts about this. If we could make that work, it would cover pretty much everything for us.

@dahtah

dahtah commented Jun 7, 2022

Note that julia-snail already includes an interface to CSTParser.jl. It still relies on julia-mode for syntax highlighting and formatting, though.

@non-Jedi
Contributor

non-Jedi commented Jun 7, 2022

Interestingly, the LSP spec includes semantic tokens since 3.16. I wonder if that's supported in practice for Julia with Emacs, @gdkrmr and @non-Jedi, it would be great if you could share your thoughts about this. If we could make that work, it would cover pretty much everything for us.

LanguageServer.jl doesn't support the semantictokens set of capabilities at this time unfortunately. I'm also really not sure if the architecture of the LSP would give you sufficient responsiveness for indentation and syntax highlighting. Any time you press enter, emacs would have to make a round-trip with the language server (plaintext over pipes) before deciding on indentation level.

@FelipeLema
Contributor

FelipeLema commented Jun 7, 2022

Worth noting: there's active (very active? somewhat active?) support for parsing Julia in Scintilla.

Using Scintilla would have the benefit of active support, but we would have to integrate it into Emacs ourselves.

@tpapp
Collaborator Author

tpapp commented Jun 7, 2022

I did not dig into the details, but I am still under the impression that tree-sitter would be the path of least resistance, because of:

  1. existing Emacs integration,
  2. existing Julia support, even if we need to fix things,
  3. it is fast,
  4. no need to spin up a Julia instance, so we could keep things reasonably simple in julia-emacs.

@non-Jedi
Contributor

non-Jedi commented Jun 7, 2022

I'm generally on board with integrating with tree-sitter. Even if support for julia syntax isn't perfect, it's probably better than what we have now especially wrt indentation.

We would need to update our lowest supported emacs version to 25 for dynamic module support.

@tpapp
Collaborator Author

tpapp commented Jun 8, 2022

That's fine with me, Emacs 25 was released almost 6 years ago.

@gdkrmr

gdkrmr commented Jun 8, 2022

I don't really have any experience with syntax highlighting in Emacs, please correct me if I am wrong

  • I don't think doing it through LanguageServer.jl/CSTParser.jl would be a good idea without a fallback. There is too much that can go wrong if you have to spin up Julia processes in the background. I also like the current modular approach: if for some reason lsp-julia fails, julia-mode is still very much usable (and users can switch to eglot). Including CSTParser.jl would probably mean that we have to put everything in a large monolith (why spin up two separate Julia processes if CSTParser and LanguageServer could run in a single process).
  • tree-sitter sounds interesting and I don't mind increasing the minimum Emacs version to 25.
  • tree-sitter uses dynamic modules which should be much more responsive than queries to an external Julia binary.

Questions:

  • Would it be possible to wrap the Julia Language Server in a dynamic module?
  • How does VSCode do syntax highlighting? Are they using CSTParser?

@ronisbr
Contributor

ronisbr commented Dec 16, 2022

Is there any way to enable tree-sitter in Julia mode? I installed and enabled it, but it does not seem to have any effect.

@FelipeLema
Contributor

can you paste or point to the code you're dealing with?

@ronisbr
Contributor

ronisbr commented Dec 17, 2022

Nothing in particular. I am just trying to see if we can get better performance. For example, scrolling this file is somewhat slow for me:

https://github.com/ronisbr/PrettyTables.jl/blob/master/src/backends/text/print.jl

@gdkrmr

gdkrmr commented Dec 17, 2022

Emacs 29 will add native tree-sitter support! Does anyone know how it works? Will there be an extra process or is it going to be a dynamic module? tree-sitter-julia also seems to be reasonably active, not sure if it is ready yet (https://github.com/tree-sitter/tree-sitter-julia).

@ronisbr
Contributor

ronisbr commented Dec 19, 2022

To use tree-sitter, you need to rewrite the major mode. I am doing some experiments in a julia-ts-mode with very good results. There are rough edges, and I do not know yet whether they are caused by Emacs or by the Julia grammar. I will publish this file tomorrow so people can test it.

@ronisbr
Contributor

ronisbr commented Dec 19, 2022

Here is the major mode:

https://github.com/ronisbr/julia-ts-mode

You need to add the file julia-ts-mode.el to your path and add (require 'julia-ts-mode). Notice that you also need to install the Julia tree-sitter grammar.
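A minimal init-file sketch of the steps above, assuming Emacs 29 built with tree-sitter support, plus git and a C compiler for building the grammar. The directory lisp/julia-ts-mode under the Emacs configuration directory is a hypothetical location for a local copy of julia-ts-mode.el; the grammar URL is the tree-sitter-julia repository linked earlier in this thread.

```elisp
(require 'treesit)

;; Hypothetical path: wherever julia-ts-mode.el was cloned or copied.
(add-to-list 'load-path
             (expand-file-name "lisp/julia-ts-mode" user-emacs-directory))

;; Tell Emacs where the Julia grammar sources live.
(add-to-list 'treesit-language-source-alist
             '(julia "https://github.com/tree-sitter/tree-sitter-julia"))

;; Build and install the grammar only if it is not already available.
(unless (treesit-language-available-p 'julia)
  (treesit-install-language-grammar 'julia))

(require 'julia-ts-mode)
```

After this, M-x julia-ts-mode in a .jl buffer should switch to the tree-sitter mode.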

I have to say that I am really amazed at how easy it was to set everything up, and it is definitely much faster than the current mode. Now I need to work on navigation and imenu support.

@non-Jedi
Contributor

This is awesome. Thanks @ronisbr. I'll need to compile emacs 29 for myself to try this. Maybe you could instead define julia-ts-mode as a derived mode of julia-mode and only override indent-line-function and font-lock-defaults (that way we don't lose the pieces of julia-mode not related to indentation and font-locking)?
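The derived-mode idea could look roughly like the sketch below. It is an illustration under assumptions, not the code in julia-ts-mode: the mode name my-julia-ts-mode is hypothetical, and the node name line_comment is an assumption about the Julia grammar that should be checked against the installed version.

```elisp
(require 'treesit)
(require 'julia-mode)

(define-derived-mode my-julia-ts-mode julia-mode "Julia[ts]"
  "Hypothetical tree-sitter variant of `julia-mode' (illustration only)."
  (when (treesit-ready-p 'julia)
    (treesit-parser-create 'julia)
    ;; Assumption: the grammar names its comment node `line_comment'.
    (setq-local treesit-font-lock-settings
                (treesit-font-lock-rules
                 :language 'julia
                 :feature 'comment
                 '((line_comment) @font-lock-comment-face)))
    (setq-local treesit-font-lock-feature-list '((comment)))
    ;; Installs tree-sitter font-locking (and tree-sitter indentation, if
    ;; `treesit-simple-indent-rules' is set); the rest of the mode is
    ;; inherited from `julia-mode'.
    (treesit-major-mode-setup)))
```

Because the body only replaces font-locking (and optionally indentation), things like the syntax table and LaTeX symbol substitution would still come from julia-mode.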

@tpapp would we be willing to make future releases of julia-mode only compatible with emacs 29+? This seems like a feature which would make doing so worthwhile.

@ronisbr
Contributor

ronisbr commented Dec 19, 2022

Hi @non-Jedi !

This is awesome. Thanks @ronisbr. I'll need to compile emacs 29 for myself to try this. Maybe you could instead define julia-ts-mode as a derived mode of julia-mode and only override indent-line-function and font-lock-defaults (that way we don't lose the pieces of julia-mode not related to indentation and font-locking)?

The idea is to test the tree-sitter integration and then commit it to this repository. I will not register julia-ts-mode. Many major modes define something just as you said. Thus, users can select whether they want the old behavior or tree-sitter, if available. I think this is the best scenario.

@tpapp would we be willing to make future releases of julia-mode only compatible with emacs 29+? This seems like a feature which would make doing so worthwhile.

Yes, we will probably need to require Emacs 29 to make this integration work.

@tpapp
Collaborator Author

tpapp commented Dec 19, 2022

@non-Jedi: yes, definitely. @ronisbr: thanks for doing this. I believe that this is the best way to solve a long list of problems.

@ronisbr
Contributor

ronisbr commented Dec 19, 2022

Perfect! I will ping this thread when I finish the initial version so that you can help me to integrate everything :)

@ronisbr
Contributor

ronisbr commented Jan 5, 2023

Just an update:

I have been using Julia tree-sitter grammar for almost 3 weeks now. My doom configuration with this mode is here: https://github.com/ronisbr/doom.d/tree/emacs-29

Everything is working wonderfully! I found just two minor issues reported here:

tree-sitter/tree-sitter-julia#88

tree-sitter/tree-sitter-julia#73

The experience so far has been amazing.

@chriselrod

chriselrod commented Jan 6, 2023

eglot-jl currently explicitly lists julia-mode as a dependency, but it seems like it'd be straightforward to give julia-ts-mode a try. I'm not sure what I'd be looking for -- just see if I don't encounter problems, and maybe if it feels snappier?
I'm currently using the builtin c++-ts-mode, and maybe it adds more features, but I haven't tried them yet.
mark-paragraph still seems dumb / based on spacing rather than syntax.

function foo()

#mark begins
end
function bar()
# mark ends

end

Or maybe I need to look into documentation, and I'm supposed to replace functions like mark-paragraph with tree-sitter-powered versions.

@ronisbr
Contributor

ronisbr commented Jan 6, 2023

just see if I don't encounter problems, and maybe if it feels snappier?

Yes!

Or maybe I need to look into documentation, and I'm supposed to replace functions like mark-paragraph with tree-sitter-powered versions.

There is some support for navigation, but I did not change anything related to mark-paragraph.

@non-Jedi
Contributor

non-Jedi commented Jan 6, 2023

@chriselrod Looking at eglot-jl again, I don't think julia-mode is needed as a dependency. All that would be required to use it with julia-ts-mode would be to modify eglot-server-programs to instead include '(julia-ts-mode . eglot-jl--ls-invocation).
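As a sketch of that change, assuming eglot-jl is installed and loaded before a server is started (eglot-jl--ls-invocation is the function named above; registering both modes in one entry is a choice made here so that either mode starts the same server):

```elisp
(with-eval-after-load 'eglot
  ;; `eglot-jl--ls-invocation' is defined by eglot-jl, so that package
  ;; must be loaded before Eglot tries to start the language server.
  (add-to-list 'eglot-server-programs
               '((julia-mode julia-ts-mode) . eglot-jl--ls-invocation)))
```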

@ronisbr I haven't had a chance to build emacs 29 and test this, but would you mind skimming through the open issues when you get a chance and seeing which ones would be solved by your julia-ts-mode implementation? I would like to test for at least #118, #111, #56, #12, #11 (especially #11!!) #3, #2. If you don't have time, that's understandable, and I'll take a look when I get a chance.

@chriselrod

chriselrod commented Jan 10, 2023

With this, eglot-jl should be compatible with both julia-mode and julia-ts-mode: non-Jedi/eglot-jl#36
(I just swapped julia-mode for (julia-mode julia-ts-mode)).

Building Emacs is fairly straightforward.
On Fedora, you can install all the dependencies via:

sudo dnf install -y dnf-utils libgccjit-devel libtree-sitter-devel stow
sudo yum-builddeps emacs # install a million deps

Then to build

# cd somewhere/so/you/do/not/clutter
git clone git://git.savannah.gnu.org/emacs.git
cd emacs
./autogen.sh
mkdir build
cd build
CFLAGS="-O3 -march=native -fno-semantic-interposition" CXXFLAGS="-O3 -march=native -fno-semantic-interposition" ../configure --with-native-compilation --with-wide-int --with-json --with-tree-sitter
time make NATIVE_FULL_AOT=1 -j(nproc) # if using fish
# time make NATIVE_FULL_AOT=1 -j$(nproc) # if not using fish
sudo make install prefix=/usr/local/stow/emacs
cd /usr/local/stow
sudo stow emacs

stow is nice so you can easily clear out something you've installed (sudo stow -D emacs will remove all the symlinks it creates).

You may want to change flags/configuration options/etc.
I obviously added native compilation and treesitter above.
You may want pure-GTK if you're using Wayland, but I need to build with X because I'm using EXWM.

@chriselrod

chriselrod commented Jan 10, 2023

#118
julia-ts-mode only highlights quote and end.

#111
New line after quote does not indent; typing without tab:

map(1:3) do x
x
end
f(map(1:3) do x
x
end)

mark and tab:

map(1:3) do x
  x
end
f(map(1:3) do x
  x
end)

#56

x .|>
  f

It initially didn't have the indent, but as soon as I made another line, it auto-indented.

#12

module A
import Base: *
a = 1
b = *
c = 2
end

is what I get typing it out (no spurious indent).
Mark and tab preserves the correct relative indent, except it indents everything inside the module.
The module keyword is not itself highlighted.

module A
  import Base: *
  a = 1
  b = *
  c = 2
end

#11
This behavior is customizable https://github.com/ronisbr/julia-ts-mode/blob/197e6e81a8d3d519df81fd21931a1e156ec1fc10/julia-ts-mode.el#L44-L84

Typing it out:

function1(a, b, c
  d, e, f)
function2(
a, b, c
  d, e, f)
for i in Float64[1, 2, 3, 4
  5, 6, 7, 8]
end
for i in Float64[
1, 2, 3, 4
  5, 6, 7, 8]
end
a = function3(function()
return 1
end)
a = function4(
function ()
return 1
end)

Neither leading 5 is highlighted, but all the other numbers are.
Mark and tab:

function1(a, b, c
  d, e, f)
function2(
  a, b, c
  d, e, f)
for i in Float64[1, 2, 3, 4
  5, 6, 7, 8]
end
for i in Float64[
  1, 2, 3, 4
  5, 6, 7, 8]
end
a = function3(function()
  return 1
end)
a = function4(
  function ()
    return 1
  end)

So, still some problems.
-[ ] Leading numbers on new line not highlighted
-[ ] Light on different faces/highlighting in general
-[ ] Not very eager about indenting new lines, but seems pretty good in terms of consistent indentation levels when you ask for them.

I'd like the in keyword in the loops to be highlighted, as well as of course the 5s.

As for speed -- I'm not sure.
I had a file with a large literal matrix defined, and it seemed slower than I remember my experience being yesterday with julia-mode.

@chriselrod

#3
No special highlighting for user

println("hello $user")

#2
No highlighting for any of these variables.

@chriselrod

chriselrod commented Jan 10, 2023

Just to confirm, with this file:
https://gist.github.com/chriselrod/5d09f5156ee49f1d2822df1638093b76#file-highssimplex-jl
julia-mode seems quite fast, while julia-ts-mode lags for seconds or more while scrolling up and down.

julia-ts-mode highlights the giant matrix, but julia-mode does not.
Perhaps that is the cause of the performance difference?
I'll have to try more normal files.

Thankfully, we have 5k and 6k long otherwise more typical .jl files at work to test on. =)
EDIT: it seems fast at navigating those.

@ronisbr
Contributor

ronisbr commented Jan 10, 2023

Thanks for the amazing investigation @chriselrod !

#118
julia-ts-mode only highlights quote and end.

Ooops :D I forgot to add anything related to interpolation expressions. Now we have:

[Screenshot 2023-01-10 at 10:11:41]

-[ ] Not very eager about indenting new lines, but seems pretty good in terms of consistent indentation levels when you ask for them.

I saw problems like this in other languages as well. It seems to be a limitation of either the Emacs implementation or tree-sitter itself.
For example, when you type:

if a == 2
end

And press enter after 2, the line is not indented. The reason is that there is no node inside the if statement. Hence, Emacs just does not know that we want to shift the indentation. I have no idea how to fix it. Perhaps tree-sitter/tree-sitter-julia#73 will improve it.

Neither leading 5 is highlighted, but all the other numbers are.

This happens because it is a syntax error. We can highlight errors. However, the grammar is not 100% and some errors are false positives.

#3
No special highlighting for user

println("hello $user")

Done! I added the support for string interpolations. The font face is the same as in the constant, but bold.

[Screenshot 2023-01-10 at 10:36:49]

(The underlines are LSP errors)

#2
No highlighting for any of these variables.

I think I did not fully understand what the desired behavior is. Can you please explain it to me?

julia-ts-mode highlights the giant matrix, but julia-mode does not.
Perhaps that is the cause of the performance difference?
I'll have to try more normal files.

Yes! If you set treesit-font-lock-level to 2, where the literals are not highlighted, it is much faster.

@ronisbr
Contributor

ronisbr commented Jan 10, 2023

I'd like the in keyword in the loops to be highlighted, as well as of course the 5s.

The in highlighting is done! Highlighting the 5 in your example is not possible because tree-sitter reports a syntax error there.

[Screenshot 2023-01-10 at 10:54:03]

By the way, I will add the error in the last font lock level together with the operators. Thus, the user can decide.

Now, if you set treesit-font-lock-level to 4, you will see:

[Screenshot 2023-01-10 at 10:58:00]

@non-Jedi
Contributor

non-Jedi commented Jan 10, 2023

#2
No highlighting for any of these variables.

I think I did not fully understand what the desired behavior is. Can you please explain it to me?

Any time a variable is assigned to, the variable name should be highlighted with font-lock-variable-name-face, e.g.:

  • the x in x = 5
  • a and b in let a = 1, b = 2
  • x and y in x, y = 4, 5
  • a and b in a = 5 + (b = 3)
  • a in global a
  • b in local b
  • i in for i=1:10 (this one is arguable)
  • x and y and the T in the where clause in function f(x::T, y) where T (this one is arguable)

But there are similar forms which should not be highlighted which makes this difficult to do without the full parser we get with tree-sitter, for example, calling a function with a keyword argument, named tuples, and setindex! sugar (not sure if we should consider setproperty! as variable assignment for this purpose...).

@ronisbr
Contributor

ronisbr commented Jan 10, 2023

Hi @non-Jedi ,

Thanks!

Everything that is not arguable was implemented. However, we might have corner cases.

I added the variable highlighting to level 3 (the default). This is the specification for each level:

Level 1 usually contains only comments and definitions.
Level 2 usually adds keywords, strings, constants, types, etc.
Level 3 usually represents a full-blown fontification, including
assignment, constants, numbers, properties, etc.
Level 4 adds everything else that can be fontified: delimiters,
operators, brackets, all functions and variables, etc.
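In a tree-sitter major mode, these levels correspond to the sublists of treesit-font-lock-feature-list. The fragment below is only a sketch of how such a list might be laid out inside a mode's setup code; the feature names are illustrative assumptions and are not taken from julia-ts-mode.

```elisp
;; Feature names are illustrative; each one must match a :feature used
;; in the mode's `treesit-font-lock-rules'.
(setq-local treesit-font-lock-feature-list
            '((comment definition)            ; level 1
              (keyword string constant type)  ; level 2
              (assignment number property)    ; level 3
              ;; level 4: everything else, including syntax errors
              (delimiter operator bracket function variable error)))
```

Setting treesit-font-lock-level to N then enables the features in the first N sublists.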

Now we have:

[Screenshot 2023-01-10 at 16:12:33]

@chriselrod

Thanks for all the great work!
(Of course, thanks to everyone developing the packages I use.)

Yes! If you set treesit-font-lock-level to 2, where the literals are not highlighted, it is much faster.

For now I went in the other direction, and am trying 4.

(defun treesit-font-lock-level-4 ()
  (setq-local treesit-font-lock-level 4)
  (treesit-font-lock-recompute-features))
(add-hook 'julia-ts-mode-hook #'treesit-font-lock-level-4)

Something else I noticed: macros aren't highlighted.

@ronisbr
Contributor

ronisbr commented Jan 11, 2023

Thanks for all the great work!

You're welcome! I also want to point out the AMAZING work of Julia tree-sitter grammar developers (@maxbrunsfeld, @savq, and others). In all this time, I only found very minor issues! It is amazing!

For now I went in the other direction, and am trying 4.

Me too! However, it can slow things down. I noticed that the problem occurs when there is a lot of highlighting on the screen. It seems that it can handle big files pretty well (I did not test this deeply).

Something else I noticed: macros aren't highlighted.

I did not understand, it seems to be working here:

[Screenshot 2023-01-11 at 09:27:04]

By the way, until julia-ts-mode is merged here, I needed to replicate some functionality. Hence, I copied all the code related to LaTeX symbol substitution. Now, I think it is working to the point I can start using it daily.

@chriselrod

I did not understand, it seems to be working here:

Hmm -- I'm on a different computer with the same config as before, and I now see the same thing you showed.
Perhaps I was on outdated versions. I'll let you know if I see anything different.

@ronisbr
Contributor

ronisbr commented Jan 11, 2023

By the way, until julia-ts-mode is merged here, I needed to replicate some functionality. Hence, I copied all the code related to LaTeX symbol substitution. Now, I think it is working to the point I can start using it daily.

I undid this given the amazing advice of @non-Jedi to make julia-ts-mode a derived mode of julia-mode.

@ronisbr
Contributor

ronisbr commented Jan 13, 2023

Update!

After a lot of problems, I managed to add an option to select which kind of indentation the user wants after an assignment. Hence, we can now select:

    var = a + b + c +
          d + e +
          f

or

    var = a + b + c +
        d + e +
        f


7 participants