
Meta: Exact-printer Mega-issue #7544

Open
emilypi opened this issue Aug 14, 2021 · 62 comments · May be fixed by #9385

Comments

@emilypi
Member

emilypi commented Aug 14, 2021

What is this issue?

A central place to discuss the issue of a Cabal exact-printer so that we can push forward the work all in one place. The discussion currently stretches back across many issues spanning the past 6 years.

What is it?

An exact-printer, as inspired by ghc-exactprint, is a byte-for-byte bidirectional parser and pretty printer, which guarantees that the following constructs exist and are principled:

  • A .cabal file source tree representation, which is in bijection with the package description for a .cabal file for a given spec version. This representation should be parametrized with all offset and encoding information along with file data.
  • A method for introducing deltas to .cabal file entries. I should be able to update and transform individual components of data deep in the representation without modifying the broader structure.
  • A pretty printer which adds offset annotations to the specification fragment so that the output has a byte-for-byte round trip on no-op parse/print and print/parse tests (a sketch of this property follows the list).
  • Tests which confirm this in cabal-testsuite and cabal install's test suite for the individual commands.
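
Concretely, the round-trip property the last two bullets describe looks something like this (parseExact, printExact and CabalSourceTree are hypothetical placeholders, not existing Cabal API):

import Data.ByteString (ByteString)

-- Hypothetical placeholders for whatever we end up building.
data CabalSourceTree = CabalSourceTree          -- stand-in for the annotated source tree

parseExact :: ByteString -> Either String CabalSourceTree
parseExact = error "not implemented yet"

printExact :: CabalSourceTree -> ByteString
printExact = error "not implemented yet"

-- A no-op parse/print must reproduce the input byte for byte.
prop_parsePrint :: ByteString -> Bool
prop_parsePrint src =
  case parseExact src of
    Left _     -> True                          -- unparseable input is out of scope here
    Right tree -> printExact tree == src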

Why do we need it?

This work is tied very closely to the format, init, gen-bounds and other work, as shown in historical discussions.

This would free us up for several important quality-of-life improvements to the cabal-install ecosystem, as well as for general consumers of the Cabal library API. We would like to have this in by Cabal-3.8.0.0. The important things this enables include:

  • cabal gen-bounds - will be able to update build-depends stanzas in a principled manner, so that bounds management can be automated in the future if the user so desires.
  • cabal format - we hope that the .cabal format can have a canonical shape, so we can limit the number of deviations from the norm, or ad-hoc formatting choices, present in the ecosystem.
  • cabal init becomes much easier to manage, and we no longer have to dedicate redundant types, formatting, and pretty printing to this effort.
  • Transforming cabal files, with something like module discovery, becomes that much simpler.

Who's in charge of this?

I am currently overseeing @ptkato, who has been tasked with taking this on. Please consider him and me points of contact for this work, and join us in libera.chat#hackage to discuss.

@ptkato @gbaz @davean

@gbaz
Collaborator

gbaz commented Aug 14, 2021

We have an intermediate "lexed tree" structure in https://github.com/haskell/cabal/blob/1be581d5463adebf8dc8cc8a5375a9ca3051970a/Cabal/src/Distribution/Fields/Field.hs

One idea I would throw out is turning that structure into the main structure we get exact-printing working with (making it comment- and whitespace-preserving). Then transforms on cabal files involve parsing to that, running the FieldGrammar parser to get the nice fully typed structure, modifying that, emitting it back to Fields, and then calculating the delta to apply the changes thus made to the original exact Fields.
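
For reference, the lexed-tree types in that module look roughly like this (paraphrased from Distribution.Fields.Field; the ann parameter is where positions, and potentially comments and whitespace, would live):

data Field ann
  = Field   !(Name ann) [FieldLine ann]
  | Section !(Name ann) [SectionArg ann] [Field ann]

data Name      ann = Name      !ann !FieldName    -- FieldName is a ByteString
data FieldLine ann = FieldLine !ann !ByteString

data SectionArg ann
  = SecArgName  !ann !ByteString
  | SecArgStr   !ann !ByteString
  | SecArgOther !ann !ByteString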

@emilypi
Member Author

emilypi commented Aug 14, 2021

There's good prior art here: https://github.com/phadej/cabal-fmt

@tonymorris

I've written a few of these for other formats. If you need help @ptkato then @emilypi can hook us up.

@Mikolaj
Member

Mikolaj commented Aug 14, 2021

See also the old github project about that: https://github.com/haskell/cabal/projects/9

I've updated it with the related issues I've found, and also marked all those issues with a label, so the index is available in both forms now. If the project is not going to be used, please remove it; the labelling will remain, preserving the data.

@ptkato
Collaborator

ptkato commented Sep 16, 2021

One idea I would throw out is turning that structure into the main structure we get exactprinting working with (making it comment and whitespace preserving).

I combed through cabal-fmt and here, and I must say, I'm not a fan of how the comments would need to be dealt with if we use that approach. Yes, for solving it works great, it cuts down a considerable amount of time, but for an exact-printer it seems... odd (and potentially makes it more complicated than it should be). I really do believe that the right way would be to parse it all at once, as opposed to appending the comments afterwards.

@phadej
Collaborator

phadej commented Sep 16, 2021

Consider that the Field type is not the most refined representation of cabal files. It's like parsing a JSON file into Value with aeson. There's still the step of taking that structure and (also) parsing it further into GenericPackageDescription.

The Cabal library completely lacks an AST representation of .cabal files: we don't really need that intermediate step, though it would help report warnings more precisely in Check. That, however, is not the most important piece of functionality.


GHC / ghc-exactprint doesn't put comments into AST either. There is very good reason: literally everything would need to be intercalated with comments (as {- ... -} comments can be literally anywhere in between).

The cabal format is special in that line comments are always on their own line (!!!), so we could intercalate them in the Field representation. Fine.

BUT, if we want to represent field contents as a data structure too, then we face the same problem as GHC:

build-depends:
   base
      -- minimum is GHC-7.0
      >= 4.3
      -- and maximum ..
      && <= 4.17

I expect that cabal-exactprint would allow me to change the 4.17 into 4.18 preserving all the formatting, and with representation close to what is written (i.e. AST node for &&, >= etc.).

I don't see how comments can be attached to that. (I suspect that ghc-exactprint has some logic so it can fit comments between tokens it prints).
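
To make that concrete, a hypothetical annotated constraint AST (nothing that exists in Cabal today) might look like the sketch below; the open question is which node the comments from the example above would hang off of:

import Distribution.Types.Version (Version)

-- Hypothetical sketch only.
data VersionExpr ann
  = VGE  ann Version                              -- >= 4.3
  | VLE  ann Version                              -- <= 4.17
  | VAnd ann (VersionExpr ann) (VersionExpr ann)  -- &&

-- Does "-- minimum is GHC-7.0" attach to the >= node, to the && node,
-- or to the surrounding build-depends entry?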

@jneira
Member

jneira commented Sep 17, 2021

Sure, @alanz could help us to compare ghc-exactprint with the requirements for cabal.

@ptkato
Collaborator

ptkato commented Sep 18, 2021

BUT, if we want to represent field contents as a data structure too, then we face the same problem as GHC:

...

I expect that cabal-exactprint would allow me to change the 4.17 into 4.18 preserving all the formatting, and with representation close to what is written (i.e. AST node for &&, >= etc.).

Indeed, a tad more complex AST would be needed to take care of that... shape, but I believe that's about it. And yes, input from @alanz would certainly be appreciated.

@alanz
Collaborator

alanz commented Sep 19, 2021

Sorry all, I have been ignoring this issue for a while, my attention being elsewhere. I will take a decent look in the next few days.

@davean

davean commented Sep 23, 2021

I don't see how the multiple-comments thing is an issue. Your annotation functor would of course contain a list of (maybe) comments; in most cases it would be empty, in others it would be a singleton, and in the most complicated case it would be comments and skips. Of course you can do the same sort of thing in GHC's case, but it becomes complicated enough there to defeat the point, I think.

Even with such information it'll be complicated to get correct - whether a comment associates with the line above or below isn't apparent. For example, if you delete data, does a comment go with that data or not?

My annotation would be something like [Maybe (WhiteSpace, Comment)] (illustrative only, of course).
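
Spelled out, one possible reading of that annotation (illustrative names only, following the description above):

newtype WhiteSpace = WhiteSpace Int   -- a run of blank space / blank lines
newtype Comment    = Comment String

-- Nothing entries are the "skips"; Just entries are comments with their preceding whitespace.
type CommentAnn = [Maybe (WhiteSpace, Comment)]

noComment :: CommentAnn
noComment = []

single :: CommentAnn
single = [Just (WhiteSpace 1, Comment "minimum is GHC-7.0")]

withSkips :: CommentAnn
withSkips = [Just (WhiteSpace 0, Comment "first"), Nothing, Just (WhiteSpace 2, Comment "second")]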

@alanz
Collaborator

alanz commented Sep 23, 2021

First quick comment

GHC / ghc-exactprint doesn't put comments into AST either.

Prior to GHC 9.2 this was true; from GHC 9.2 they are in the AST, as part of the exact-print annotations. But that is just the mechanics of data capture.

In ghc-exactprint we work with an item having some form of "anchor", which represents the top-left point from which it gets rendered, much like in a normal pretty printer. This means that whatever leading indentation exists at that point gets preserved as we print items contained in the one being printed.

We capture comments as being contained within the span of the AST item being printed, or coming after it for top level items.

This is good enough for exact-printing an unchanged AST. If the AST needs to be modified prior to printing, we do a comment re-balancing process, where we try to keep comments associated with the AST item they logically belong to. This means a preceding documentation comment will be kept with the item following it, and comments between things associate with the thing without a gap, so a trailing comment on the same line, or without a line break, will associate with the prior one. This is all very heuristic-driven, and up for optimization.

To be able to handle that case, as the exact-printer finds comments it throws them into a queue, and then prints any occurring between the current print head position and the next output item. (Logically; the actual mechanics are a bit more complex because everything is deltas.)

And some day I need to capture all of this in a proper document. Perhaps someone should offer to co-author something with me, to hold my feet to the fire of actually documenting it.

@phadej
Collaborator

phadej commented Sep 23, 2021

Prior to GHC 9.2 this is true, from GHC 9.2 they are in the AST, as part of the exact print annotations. But that is just mechanics of data capture.

That's the idea behind the Field type having an ann parameter, and that's where cabal-fmt puts attached comments. I.e. there isn't a separate Comment AST node, just as there isn't in GHC.

@ptkato
Collaborator

ptkato commented Oct 4, 2021

Okay, I think we should keep things as simple as possible, as long as it covers our needs. Taking into account the input from this issue, I think one acceptable shape for our AST would be something along the lines of:

data ExactPrint ann
  = Field   ann String [ExactPrint ann]         -- your field (or section) with its children
  | Value   ann String (Maybe (ExactPrint ann)) -- the fields' values
  | Cont    ann String (ExactPrint ann)         -- this would mold a single value across multiple lines
  | Comment ann String                          -- comment
  | Empty   ann                                 -- empty lines, and also to stop the recursion in Cont

It would allow enough flexibility to depict a cabal-like format in its entirety while being quite expressive and not overly complicated. Thoughts?

@phadej
Collaborator

phadej commented Oct 4, 2021

I really, really would like to have build-depends as an AST so I can modify it without thinking about how it's represented as Strings and glued together to make the contents of that field.

As I said, nodes for &&, <package-name>, constraints (>= 1.2.3) etc.

Because I think the current Field representation already contains enough information to exact-print it. (The position of the : after a field name is not recorded, but that's a minor problem.) We need to do some work to attach comments, but cabal-fmt demonstrates that it's possible. I don't think it's very useful though. I wouldn't call that "cabal exactprint".

@ptkato
Collaborator

ptkato commented Oct 4, 2021

Do we have any other case where specific treatment would be desired, or is it something exclusive to build-depends?

@gbaz
Collaborator

gbaz commented Oct 5, 2021

I think we've let the perfect be the enemy of the good for far too long. A nice special representation for build-depends (or any other multiline field) can be built on top of any of the proposed structures, and that should be fine. Even without that, we have 90% of the use cases that we really, really want this feature for unblocked. ptkato: go for it!

@ffaf1
Collaborator

ffaf1 commented Oct 6, 2023

Hello Oleksii,

I don't want to encode global positions into the tree, as tree modifications can break those references.

Is this valid even for relative (to an anchor) positions?

@BurningWitness

BurningWitness commented Oct 6, 2023

@ffaf1
No, the tree requires relative positions to be encodable. I'm bringing the global position issue up because that's something that both Cabal-syntax's lexer and GHC's parser support, so there might be expectations concerning this.

> Distribution.Fields.Parser.readFields "foo\n  bar: baz"
Right [Section (Name (Position 1 1) "foo") [] [Field (Name (Position 2 3) "bar") [FieldLine (Position 2 8) "baz"]]]

@liamzee
The discoveries are gained from reverse engineering; I have no interest in rewriting Cabal-syntax's lexer. Also note that when I make points I'm not really hitting a dead end; there's simply more than one way to go about solving each one, and I'm choosing the path that looks sound to me.

I am indeed making an attempt, both because this looks like a week of novel work and because I have a good mental model for it; however, whether anyone actually agrees with said model is still unknown to me.

@liamzee
Collaborator

liamzee commented Oct 6, 2023

@BurningWitness At the very least, the community seems to be missing a good exact-parser for .cabal files, and this seems to be an urgent need given the desire for cabal add and better HLS cabal file capabilities.

The best-known attempt I'm familiar with is the following:

https://github.com/VeryMilkyJoe/haskell-language-server

I am eagerly poring over your notes, so even if you lose interest, your work remains useful. Thanks for the updates!

@LemonjamesD

Is this supposed to be a separate parser or a replacement for the existing parser in cabal now?

@fendor
Collaborator

fendor commented Oct 13, 2023

Separate parser with a different goal compared to Cabal's internal one.
However, we are currently in the process of evaluating how much work it would be to fix Cabal's parser to give HLS what it needs.

@BurningWitness

It's a separate parser because Cabal-syntax is already huge, and it already does things in a way that's fundamentally incompatible with this approach.

As the resulting parser simply parses anything that has the format of a .cabal file, any library downstream should be free to make a parser from the resulting AST to whatever data it needs. If the parser delivered is good enough, then later on perhaps there will be talk of splitting Cabal-syntax further; however, right now it's too early to discuss that.

@gbaz
Collaborator

gbaz commented Oct 13, 2023

Note that there is I think a perfectly cromulent and not too difficult proposal above that would retrofit cabal's existing parser with sufficient exact-parsing capabilities. See this message and the discussion prior:

#7544 (comment)

I am aware people are interested in pursuing different independent projects and we welcome that. However, this is a good proposal, and I have not seen any objection to it other than that it requires some work to get up to speed with the cabal parser, which is complex, in order to be capable of doing this. So I would encourage people to seriously consider it.

@BurningWitness

I don't know how well that representation matches reality, but the annotationless one I have isn't all that long either.

My variant
{-# LANGUAGE DerivingStrategies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Data.Text (Text)
import qualified Data.Text.Lazy as Lazy

-- | Context-dependent whitespace.
newtype Offset = Offset Int
                 deriving newtype Show

-- | Context-independent whitespace.
newtype Whitespace = Whitespace Int
                     deriving newtype Show

-- | Anything that follows two consecutive hyphens. Lasts until the end of the line.
data Comment = Comment
                 Whitespace -- ^ Before double hyphens
                 Whitespace -- ^ Between double hyphens and text
                 Text
               deriving Show



-- | Any Unicode characters, excluding C0 control codes, '{', '}', ':'.
newtype Heading = Heading Text
                  deriving newtype Show



-- | Any Unicode characters, excluding C0 control codes, '{', '}', ':' and spaces.
newtype Name = Name Text
               deriving newtype Show

-- | Field contents at the same line as the declaration.
data Inline = Inline
                Text -- ^ Includes preceding whitespace
            
            | NewlineI

              deriving Show

-- | Field contents at the lines following the declaration.
data Line = Line
              Text -- ^ Includes preceding whitespace

          | CommentL Comment

          | NewlineL

            deriving Show



-- | Curly bracket syntax, together with the preceding empty space
data Curlies a = CommentC Comment (Curlies a)

               | NewlineC (Curlies a)

               | Curlies
                   Whitespace -- ^ Before left bracket
                   a
                   Whitespace -- ^ Before right bracket

                 deriving Show



-- | List with an alternative unparsed representation.
data Gradual u a = Entry a (Gradual u a)
                 | More u
                 | End
                   deriving Show

-- | Section contents with the curly bracket alternative.
data Section = CurlS (Curlies (Gradual [Line] Node))
             | NormalS
                 (Maybe Comment)         -- ^ Inline comment
                 (Gradual [Line] Node)
               deriving Show



-- | Field contents.
data Contents = Contents Inline [Line]
                deriving Show

-- | Field contents with the curly bracket alternative.
data Field = CurlF (Curlies Contents)
           | NormalF Contents
             deriving Show



data Node = Section
              Offset
              Heading
              Section

          | Field
              Offset
              Name
              Whitespace -- ^ Between field name and colon
              Field

          | CommentT Comment

          | NewlineT

            deriving Show



newtype Layout = Layout (Gradual Lazy.Text Node)
                 deriving newtype Show

The big problem in my case isn't the definition, it's parsing:

  • Pretty much all elements have ambiguous boundaries; the only clearly delimited one is a comment line.

  • The meaning of section/field contents depends on whether the first encountered real line starts with a curly bracket or not.

  • Comments, while immediately available at the topmost level, actually belong to the element at the bottom and as such need to be carried over incrementally.

All three are solved with lookahead, so the parser gets quite convoluted.

@Bodigrim
Collaborator

(random thoughts and snippets)

One can parse and annotate each field with its source like this:

-- Assuming the usual Cabal-syntax imports (readFields, Field, Name, FieldLine,
-- SectionArg, Position), Data.List imported qualified as L, and a helper
-- splitAtPosition :: Position -> ByteString -> (ByteString, ByteString)
-- that splits the source at a given row/column (definition not shown).
readFieldsAnnotatedWithSource :: ByteString -> Maybe [Field ByteString]
readFieldsAnnotatedWithSource bs =
  either (const Nothing) (Just . snd . L.mapAccumR annotateField maxBoundPos) (readFields bs)
  where
    annotateField :: Position -> Field Position -> (Position, Field ByteString)
    annotateField finishPos = \case
      Field (Name pos name) fls -> (pos, Field (Name (getSrcBetween pos finishPos') name) fls')
        where
          (finishPos', fls') = L.mapAccumR annotateFieldLine finishPos fls
      Section (Name pos name) args fs -> (pos, Section (Name (getSrcBetween pos finishPos'') name) args' fs')
        where
          (finishPos', fs') = L.mapAccumR annotateField finishPos fs
          (finishPos'', args') = L.mapAccumR annotateSectionArg finishPos' args

    annotateFieldLine :: Position -> FieldLine Position -> (Position, FieldLine ByteString)
    annotateFieldLine finishPos (FieldLine pos xs) = (pos, FieldLine (getSrcBetween pos finishPos) xs)

    annotateSectionArg :: Position -> SectionArg Position -> (Position, SectionArg ByteString)
    annotateSectionArg finishPos = \case
      SecArgName pos xs -> (pos, SecArgName (getSrcBetween pos finishPos) xs)
      SecArgStr pos xs -> (pos, SecArgStr (getSrcBetween pos finishPos) xs)
      SecArgOther pos xs -> (pos, SecArgOther (getSrcBetween pos finishPos) xs)

    getSrcBetween :: Position -> Position -> ByteString
    getSrcBetween from to = snd $ splitAtPosition from $ fst $ splitAtPosition to bs

    maxBoundPos :: Position
    maxBoundPos = Position maxBound maxBound

Then manipulate [Field ByteString], then restore the source file with foldMap fold :: [Field ByteString] -> ByteString. This, up to a certain limit, just works (a round-trip sketch follows the list):

  • The legacy format of sections in curly braces is a pain: closing braces get associated with the latest, deepest token. A modern tool can just refuse to support it.
  • Comma-separated lists are troublesome: leading commas could be associated with either the preceding or the following token.
  • Figuring out the correct indentation in the middle of a section might be annoying; it would probably be wiser to preserve both annotations, e.g. [Field (Position, ByteString)].
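
A round-trip check for the unmodified case is then roughly the sketch below (fold here is Data.Foldable.fold; this only holds within the limits listed above):

-- Sketch: a no-op parse followed by concatenating the source annotations
-- should reproduce the input.
roundTrips :: ByteString -> Bool
roundTrips bs =
  case readFieldsAnnotatedWithSource bs of
    Nothing     -> True   -- unparseable input is out of scope here
    Just fields -> foldMap fold fields == bs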

@gbaz
Collaborator

gbaz commented Oct 18, 2023

I don't know how well that representation matches reality, but the annotationless one I have isn't all that long either.

The point of the proposed representation I linked is not brevity, but rather that it "forgets down" to the existing Field representation, suitable for use downstream in the existing parser pipeline. This means that it can be added as an extension to the existing parser, while a whole new AST would likely not be so simple. The other point of that proposed representation is that it parses efficiently, as an extension to the existing efficient parser.

@BurningWitness

Then you (the maintainers) will have to agree on what the scope of this task is and how you see Cabal's parsing working in the future.

In my view the current bidirectional all-versions-in-one double-specialization parser is extremely unwieldy (e.g. buildInfoFieldGrammar). I suspect this is, at least in part, the reason why format changes over the versions have been quite anemic, only adding and deprecating specific fields, never cleaning up.

If everyone is satisfied with the status quo and a mere extension of the lexer is deemed the easiest approach, then I'm not the person to solve this as I have no interest in tag soup parsers.

@gbaz
Collaborator

gbaz commented Oct 19, 2023

I think an independent library that can parse, modify, and exactprint cabal files would be quite nice. The concern would be that it might not keep pace with the spec, while doing so in an integrated way would. If your intent remains, as stated above, to write an independent library that does not replace what currently exists, then please go for it! All sorts of independent tools can be written on top of it and integrated with cabal using the external command system, and this could well be a good approach.

That said, I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc. At this point, all we can do is evolve it.

@ulysses4ever
Collaborator

The concern would be that it might not keep pace with the spec

The spec doesn't change so very often, I thought.

@BurningWitness

BurningWitness commented Oct 20, 2023

I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc.

While it is a lot of work due to the sheer number of field formats and caveats, I don't see either efficiency or backwards compatibility as issues:

  • The exact parser should be able to stream its results one node at a time. One node in this case is (section line* | field line* | comment | newline), where line* in sections can be parsed into node*. Some minor inefficiencies will arise due to streaming, e.g. parsing some of the next line just to know whether it belongs to this node or not, but I wouldn't sweat it.

  • Every format version can have a separate representation (data Cabal = CabalV1_0 ... | CabalV2_0 ... | ...), each with its own exact -> representation parser. Due to the aforementioned lack of changes between versions, the only version that truly matters is the latest one, all previous ones are simply clones with some fields added or removed.

On top of that, all of this should be completely pure: downloading all of Hackage, and then checking the exactness of the exact parser and the errors emitted by the representation one, should be good enough tests for both of them.

@andreabedini
Collaborator

Similarly to what @gbaz says, I welcome and support anybody who wants to work on an independent parsing library.

BUT I also doubt the long-term benefits of such efforts, since Cabal-syntax would remain the reference implementation of the syntax[1] and any separate implementation could not guarantee to stay compatible in the future. Indeed, Hackage already has a handful of different parsers; do they work?

Please do dive into the code, try your ideas! IMHO the only way out of the tarpit is through iterative improvements. I know the parsing part can be daunting (from memory I count 4 "levels": the lexer; the parser from tokens to [Field Position]; the conversion from [Field Position] to Fields (don't even ask); the parser from Fields to GenericPackageDescription (using the FieldGrammar infrastructure)).

Try to take them apart, try to replace one: why do they exist? What do they do? Can we evolve them? Which bit first? And how?

Many would prefer working on a greenfield project, but Cabal has many years of history and many constraints to satisfy. Herein lies the challenge.

[1] which is not formally specified elsewhere; perhaps one could start from this first. But we would need a way to determine whether such a specification matches the current implementation. Any idea is welcome.

@liamzee
Collaborator

liamzee commented Oct 20, 2023

@andreabedini

But there are short-term benefits, no? Projects like HLS are waiting for a good exact parser of any kind, and while it might be nice to retrofit the parser in Cabal-syntax in the long term, in the short term this scratches an itch and allows tooling that requires Cabal exact parsing to move forward.

@andreabedini
Collaborator

Yes, there are short-term benefits. I cannot disagree with that. Exploring the design space is also something that can become a long-term benefit; but there is also the usual trap of getting 80% of the design right and never finishing the remaining 20%.

@gbaz
Collaborator

gbaz commented Oct 21, 2023

The spec doesn't change so very often, I thought.

It changes slightly (in terms of additions of new fields) with almost every Cabal major release:

https://cabal.readthedocs.io/en/stable/file-format-changelog.html

@tonyday567

the only way out of the tarpit is through iterative improvements. I know the parsing part can be daunting (from memory I count 4 "levels": the lexer; the parser from tokens to [Field Position]; the conversion from [Field Position] to Fields (don't even ask); the parser from Fields to GenericPackageDescription (using the FieldGrammar infrastructure)).

Try to take them apart, try to replace one: why do they exist? What do they do? Can we evolve them? Which bit first? And how?

I would love to collaborate on a project with this type of specification. Something that could start small - I like the idea of just learning a bit about the answers to these questions. An initial aim could be just to get a "rough guide" to cabal syntax, and the library, that could maybe help the next generation of intrepid explorers.

I don't know too much about the code base, but I have parsed cabal files using the library and some (flatparse) logic.

@jappeace
Collaborator

@gbaz

That said, I do not see a path to replacing the entire existing cabal parser, with its needs for efficiency, back-compat, etc. At this point what we can do is only evolve it.

I get the impression this is somewhat demotivating to potential new contributors, so I wish to suggest a path to replacing the cabal parser:

  • initial parser/printer is developed as a separate library,
  • it provides a mapping (or mappings) to the current cabal ast (if you can call it that), e.g. ExactAst -> [Field Position] etc.
    • This can be tagged by version as well, it doesn't have to be a single mapping!
    • This would also give compatibility with all existing cabal code.
  • we use tests to assert correctness: all golden tests that pass with the current cabal parser should have the same output with the exact-print parser and its mapping:
    • we can contribute more tests to the existing cabal library if the maintainers feel they're lacking. This can be upstreamed immediately!
    • the mapping to the legacy AST can be used to see if both parsers do the same (a comparison sketch follows at the end of this comment).
    • the exact printer also gets the tests described originally.
      • e.g.: have a byte-for-byte round trip on no-op parse/print and print/parse tests.
  • we set up benchmarking to assert it's fast enough.

Once the maintainers feel confident the exact-print parser is good enough in terms of correctness and performance,
we can swap out the legacy parser for the exact-print one with the help of the mapping.
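
For the "see if both parsers do the same" bullet, a comparison along these lines could work; the first two arguments stand in for the hypothetical exact-print parser and its mapping, while readFields is Cabal's existing entry point:

import qualified Data.ByteString as BS
import Distribution.Fields.Field (Field)
import Distribution.Fields.Parser (readFields)
import Distribution.Parsec.Position (Position)

-- Sketch: both parsers must agree on the legacy representation of a file.
agreesWithCabal
  :: (BS.ByteString -> Either String ast)  -- hypothetical exact-print parser
  -> (ast -> [Field Position])             -- hypothetical mapping to the legacy representation
  -> FilePath
  -> IO Bool
agreesWithCabal parseExact toFields path = do
  src <- BS.readFile path
  pure $ case (readFields src, parseExact src) of
    (Right reference, Right ast) -> toFields ast == reference
    _                            -> False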

@BurningWitness

BurningWitness commented Oct 24, 2023

Having thought about it a bit more: curly bracket syntax forces parsing to be depth-first, as any deeper level may introduce a breaking offset change:

foo
    bar {
         baz
              this {
that
}
              end }

As such streaming is limited to one top-level element at a time, and to prove the element ends at a correct point its entire skeleton must be parsed.


For any future format-inventors:

I think the sane approach would be to instead use a special literal at the start of the line, only interpreting it as an offset break if its offset is smaller than that of the element it's inside of:

foo
    bar
>        baz
>              this
>> that
>              end

The same rule would make sense for comments inside fields:

    foo: -- bar    < not a comment
--       baz       < comment
      -- this      < not a comment

@andreabedini
Collaborator

@jappeace

I understand how @gbaz's comment might have come across as demotivating. But "evolve it" and "replace it" are not necessarily in opposition.

Just like Theseus's ship, we can evolve it by replacing one part at a time (or should I say replace it by evolving one part at a time? 🤔).

I wish to suggest a path to replacing cabal parser:

I appreciate you have been spending time on this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?

  • initial parser/printer is developed as a separate library,

This is done; here is a freshly baked new parser written by @VeryMilkyJoe. I also notice that @liamzee has already contributed to it.

I might disagree with some of the technical choices but let's just roll with the CabalAST as it is implemented there.

To have a chance of being usable in Cabal-syntax, it has to depend only on boot packages, so you would have to replace megaparsec with parsec.

* it provides a mapping (or mappings) to the current cabal ast (if you can call it that), eg `ExactAst -> [Field Position]` etc.

You are welcome to do this.

  * This can be tagged by version as well, it doesn't have to be a single mapping!

Which version are you talking about here? The Field syntax does not depend on any version. I am not sure whether CabalAST makes more assumptions than Field.

  * This would also give compatibility with all existing cabal code.

At the cost of improving anything. If Cabal operated on Field Position, we would lose all the annotations that CabalAST provides.

Also, you will realise that CabalAST does not mention build-depends or anything like that. Why is that? It's because Field (which I believe CabalAST was modelled on) is a more basic syntax structure.

The actual field definitions are in the various FieldGrammars, a static description of the fields that is used for both parsing and pretty-printing. It is general enough that it is also used to generate the field syntax reference.

GenericPackageDescription is parsed from Field using this infrastructure. I assume you could parse it from CabalAST instead; but how are you going to keep the annotations?

* we use tests to assert correctness: all golden tests that pass with the current cabal parser should have the same output with the exact-print parser and its mapping:

Define "output": Field Position, Fields, GenericPackageDescription?

  * we can contribute more tests to the existing cabal library if the maintainers feel they're lacking. This can be upstreamed immediatly!

(while considering myself only a contributor) Yes please!

You can also use the test-suite hackage-tests in Cabal-tests.

  * the mapping to legacy AST can be used to see if both parsers do the same.

Again, which legacy AST are you referring to here?

* we set up benchmarking to assert it's fast enough.

In terms of benchmarks I believe we have only cabal-benchmarks/bench/CabalBenchmarks.hs.

As far as I understand:

  • in terms of speed, timing hackage-tests might be enough
  • memory usage is also important; the main constraint comes from hackage-server, which uses Cabal-syntax. Having a benchmark to measure memory consumption in a use case like hackage-server would be very valuable. You are welcome to contribute one.

Once the maintainers feel confident the exact-print parser is good enough in terms of correctness and performance, we can swap out the legacy parser for the exact-print one with the help of the mapping.

Once the problems above have been resolved, I believe nobody will stand in the way.

I apologise if I have come across as a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved.

Of course, I would also be happy to be called out if someone thinks that I am too entrenched in the current architecture and everything can be easily simplified. Show me!

@jappeace
Collaborator

I appreciate you have been spending time on this and I will comment on some technical aspects of it; but what I want to understand first is: are you willing to do the work?

No, the intention was to hire @LemonjamesD to do this, but she has to know what to do.
I feel your comment makes it a lot clearer what the actual issue is. Thanks for that :)

At the cost of improving anything. If Cabal operated on Field Position, we would lose all the annotations that CabalAST provides.

Yes, so I get the impression the best approach is to modify types such as Fields and GenericPackageDescription in place to add exact-printing support (1),
and deal with the obnoxious number of compile errors.

Define "output": Field Position, Fields, GenericPackageDescription?

Any of those would work, depending on what you're mapping into,
if you'd do the separate AST approach.

Again which legacy AST you are referring to here?

I perhaps mistakenly assumed the current way of parsing needed replacing.

I apologise if I have come across as a bit rude. By looking at the codebase, I think these are the problems we have to face. I am eager to see them discussed or better still solved.

It's fine, you provided a goldmine of information; at least I can now make a sorta-scoped contract rather than "oh go do this thing".


New approach based on (1):

  • add some sort of pretty function for generic package descriptions:
    exactPrint :: GenericPackageDescription -> Text
    
    (don't know if something such as the pretty package could be used)
  • set up an extremely basic test for byte-for-byte roundtripping.
  • start adding Position to various fields in GenericPackageDescription, keeping the test passing.
  • add more tests that are more realistic
  • repeat until all of Hackage can be parsed.

This approach makes the work easier to split as well, because you can make the tests pass one at a time, where a test is a cabal file being round-tripped.
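
A minimal shape for that round-trip test could be the following sketch; exactPrint is the function proposed above, and the parser argument is a hypothetical entry point producing the annotated GenericPackageDescription:

import qualified Data.Text as T
import qualified Data.Text.IO as T
import Control.Monad (unless)

-- Sketch only: both function arguments are placeholders for (1).
roundTripFile
  :: (T.Text -> Either String gpd)  -- hypothetical annotated parser
  -> (gpd -> T.Text)                -- the proposed exactPrint
  -> FilePath
  -> IO ()
roundTripFile parse exactPrint path = do
  src <- T.readFile path
  case parse src of
    Left err  -> fail (path ++ ": " ++ err)
    Right gpd -> unless (exactPrint gpd == src) $
      fail (path ++ ": exactPrint output differs from input")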

@BurningWitness BurningWitness linked a pull request Oct 31, 2023 that will close this issue
@BurningWitness

BurningWitness commented Nov 8, 2023

Seems like the last idiosyncrasy on my list is section name parsing, allowing

> foo
>   { bar "b -- {a}:z" }
[Section "foo" [] [Section "bar" [SecArgStr "b -- {a}:z"] []]]

No manifests on Hackage utilize this as a format escape; only the ds-kanren/metric patches do. foo:bar, I assume, clashes directly with the existing console command format of pkg:test:name, while foo bar always has to be quoted. As such I don't see a good reason to maintain any of this.
