Javascript Camxes Parser

Javascript Camxes is a Node.js implementation of a parser for the Lojban PEG grammar. Several variants of the grammar exist; the current standard PEG grammar is in the file camxes.peg, and the corresponding Javascript parser is camxes.js. The parser is generated automatically from the PEG grammar using the PEG.js library. The main function provided by camxes.js is camxes.parse(lojban_text), which takes Lojban text as input and returns its parse tree if the text is grammatical; otherwise it throws a syntax error exception.
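As a minimal sketch of the return-or-throw behaviour described above, the helper below wraps a call to parse in a try/catch. The helper name tryParse and the stub parser are hypothetical and exist only so the example runs on its own; with the real parser you would presumably pass the object loaded from camxes.js instead (the exact require path depends on your setup).

```javascript
// Hypothetical helper: wraps any parser object exposing parse(text).
// With the real parser, pass the module loaded from camxes.js instead
// of the stub below (for example require("./camxes.js"), path assumed).
function tryParse(parser, lojbanText) {
  try {
    return { ok: true, tree: parser.parse(lojbanText) };
  } catch (err) {
    // The parser signals ungrammatical input by throwing.
    return { ok: false, error: err };
  }
}

// Stand-in parser, NOT the Lojban grammar: it arbitrarily rejects
// blank input so that both code paths can be exercised.
const stubParser = {
  parse(text) {
    if (text.trim() === "") throw new SyntaxError("stub: nothing to parse");
    return ["text", text];
  },
};

console.log(tryParse(stubParser, "ti melbi").ok); // true
console.log(tryParse(stubParser, "   ").ok);      // false
```

Returning a small result object instead of letting the exception propagate is only one possible design; it keeps the grammatical and ungrammatical cases symmetrical for the caller.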

Javascript Camxes' parse tree structure

The parse tree generated by the parser is a nested array representing an abstract syntax tree.

If we take a simple Lojban input such as "mi citka lo plise", a very simplified parse tree for it would look like:

text ── sentence ── proposition ┬─ argument ── KOhA ── "mi"
                                │ 
                                ├─ predicate ── gismu ── "citka"
                                │ 
                                └─ argument ┬─ LE ── "lo"
                                            │ 
                                            └─ gismu ── "plise"

This can be represented as a nested list (with each tree node enclosed in brackets "[]") as follows:

[text, [sentence, [proposition, [argument, [KOhA, "mi"]], [predicate, [gismu, "citka"]], [argument, [LE, "lo"], [gismu, "plise"]]]]]

The parse trees generated by the Camxes parser are encoded as such nested lists, but they are much more complex than this simplified example: the generated tree has a node for each satisfied rule of the PEG grammar, and it also contains all the morphological information (e.g. consonant clusters, diphthongs…). The raw output of Camxes therefore contains a great deal of detail that makes it hard to read as is. However, this raw parse tree can later be simplified so that only the desired nodes are kept (for example, all the morphology nodes can be pruned).

Below is an example of the raw, untrimmed output returned by Camxes when the Lojban sentence "ti melbi" is given as input:

["text",["text_1",["paragraphs",["paragraph",["statement",["statement_1",["statement_2",["statement_3",["sentence",[["terms",["terms_1",["terms_2",["term",["term_1",["sumti",["sumti_1",["sumti_2",["sumti_3",["sumti_4",["sumti_5",["sumti_6",["KOhA_clause",["KOhA_pre",["KOhA",[["t","t"],["i","i"]]],["spaces"," "]]]]]]]]]]]]]]],["CU"]],["bridi_tail",["bridi_tail_1",["bridi_tail_2",["bridi_tail_3",["selbri",["selbri_1",["selbri_2",["selbri_3",["selbri_4",["selbri_5",["selbri_6",["tanru_unit",["tanru_unit_1",["tanru_unit_2",["BRIVLA_clause",["BRIVLA_pre",["BRIVLA",["gismu",[["consonant",["syllabic",["m","m"]]],["stressed_vowel",["vowel",["e","e"]]],["consonant",["syllabic",["l","l"]]]],["consonant",["voiced",["b","b"]]],["vowel",["i","i"]]]]]]]]]]]]]]]],["tail_terms",["VAU"]]]]]]]]]]]]]]]

Daunting, isn't it?

Fortunately, a postprocessor script (camxes_postproc.js) comes with the parser. It has options to trim the raw parse tree to various degrees, from simply removing the morphological information to keeping only full words, with brackets merely showing syntactic grouping.

Here is what the previous raw parse tree looks like after the morphological information has been trimmed away:

["text",["text_1",["paragraphs",["paragraph",["statement",["statement_1",["statement_2",["statement_3",["sentence",[["terms",["terms_1",["terms_2",["term",["term_1",["sumti",["sumti_1",["sumti_2",["sumti_3",["sumti_4",["sumti_5",["sumti_6",["KOhA_clause",["KOhA_pre",["KOhA","ti"],["spaces"," "]]]]]]]]]]]]]]],["CU"]],["bridi_tail",["bridi_tail_1",["bridi_tail_2",["bridi_tail_3",["selbri",["selbri_1",["selbri_2",["selbri_3",["selbri_4",["selbri_5",["selbri_6",["tanru_unit",["tanru_unit_1",["tanru_unit_2",["BRIVLA_clause",["BRIVLA_pre",["BRIVLA",["gismu","melbi"]]]]]]]]]]]]]],["tail_terms",["VAU"]]]]]]]]]]]]]]]
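The effect of this trimming step can be illustrated with a short sketch. It is not the code of camxes_postproc.js: the function names leafText and collapseWords and the small set of word-level node names are assumptions made for illustration. The idea is to replace the morphological subtree of a word node with the concatenation of its letters.

```javascript
// Concatenate every terminal string found under a node, skipping node names.
function leafText(node) {
  if (!Array.isArray(node)) return "";
  // A leading string is the node name; everything else is a child.
  const children = typeof node[0] === "string" ? node.slice(1) : node;
  return children
    .map((c) => (typeof c === "string" ? c : leafText(c)))
    .join("");
}

// Replace the subtree of every node whose name is in `wordNodes` by its text.
// Name-only nodes such as ["VAU"] are left untouched.
function collapseWords(node, wordNodes) {
  if (!Array.isArray(node)) return node;
  if (typeof node[0] === "string" && wordNodes.has(node[0]) && node.length > 1) {
    return [node[0], leafText(node)];
  }
  return node.map((child) => collapseWords(child, wordNodes));
}

// Illustrative set of word-level node names, not an exhaustive list.
const WORD_NODES = new Set(["KOhA", "PA", "gismu"]);

const raw = ["KOhA_pre", ["KOhA", [["t", "t"], ["i", "i"]]], ["spaces", " "]];
console.log(JSON.stringify(collapseWords(raw, WORD_NODES)));
// ["KOhA_pre",["KOhA","ti"],["spaces"," "]]
```

A real postprocessor would need the full list of word-level rules from the grammar; the set above only covers the node names that appear in the examples in this document.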

Here's the same tree after the removal of all the non-terminal nodes and node names, as well as whitespace:

[["ti","CU"],["melbi","VAU"]]

Already far more readable, isn't it?
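For comparison, here is a rough sketch of how the words and elided terminators could be pulled out of a morphology-trimmed tree. Unlike the postprocessor output shown above, it returns a flat list with no syntactic grouping; the function name listWords and the abbreviated input tree are assumptions made for this example.

```javascript
// Walk a morphology-trimmed tree and collect words in order.
// Name-only nodes (elided terminators such as ["VAU"]) contribute their name;
// ["spaces", " "] nodes are skipped.
function listWords(node, out = []) {
  if (!Array.isArray(node)) return out;
  const named = typeof node[0] === "string";
  if (named && node.length === 1) {
    out.push(node[0]); // elided element restored by the parser
  } else if (named && node.length === 2 && typeof node[1] === "string") {
    if (node[0] !== "spaces") out.push(node[1]); // terminal node: keep its text
  } else {
    for (const child of named ? node.slice(1) : node) listWords(child, out);
  }
  return out;
}

// Abbreviated version of the trimmed "ti melbi" tree (most levels omitted).
const trimmed = [
  "sentence",
  [["terms", ["KOhA_pre", ["KOhA", "ti"], ["spaces", " "]]], ["CU"]],
  ["bridi_tail", ["BRIVLA", ["gismu", "melbi"]], ["tail_terms", ["VAU"]]],
];

console.log(listWords(trimmed)); // [ 'ti', 'CU', 'melbi', 'VAU' ]
```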

However, all the detailed information contained in the raw output can be valuable if you intend to write software that interprets Lojban or translates to or from it.

Interpreting the parse tree notation

Now let's take a more detailed look at how to interpret this nested list representation of the parse tree.

First of all, in the raw parse tree, each named node corresponds to one rule in the PEG grammar (camxes.peg for the standard grammar). It is important to be at least vaguely acquainted with the formal grammar to understand the parse tree and the node names.

Next, each node (each array/list in the nested array/list representation) in the parse tree optionally begins with a node label (the name of the node). Each node name is identical to the name of the corresponding PEG rule in the formal grammar. If the first element of a node is a string (as opposed to an array), then it is the name of the node. If the first element is itself a list/array, then the node has no label/name (it is an anonymous node).

Some terminal nodes (nodes containing no sub-array) contain only a selmaho name (e.g. "KU", "VAU", "KUhO"). Those stand for terminators that have been elided in the input Lojban text, but that have been restored by the parser.

Anonymous (nameless) nodes exist in the parse tree because parenthesized groups in the PEG grammar generate anonymous nodes. For example, with a PEG rule like selbri_2 <- selbri_3 (CO_clause free* selbri_2)? and "broda co brode" as the input, the rule will generate ["selbri_2",["selbri_3",...],[["CO_clause",...],["selbri_2",...]]], with an anonymous node containing the "co brode" part.

Finally, if a node is of the form ["string","string"], for example ["gismu","melbi"], then it is a terminal node whose first element is its name, and whose second element is the terminal Lojban text element (letter, word…).
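The rules above can be summarized in a small classification function. This is only a sketch based on the conventions described in this section; the function name classifyNode and the category labels are made up for the example.

```javascript
// Classify a parse tree node according to the conventions described above.
function classifyNode(node) {
  if (!Array.isArray(node)) return "not a node";
  // No leading string: the node has no name (it comes from a parenthesized group).
  if (typeof node[0] !== "string") return "anonymous";
  // Only a name, e.g. ["VAU"]: an elided element restored by the parser.
  if (node.length === 1) return "elided";
  // ["name", "text"], e.g. ["gismu", "melbi"]: a terminal node.
  if (node.length === 2 && typeof node[1] === "string") return "terminal";
  // A name followed by at least one sub-array.
  return "named non-terminal";
}

console.log(classifyNode(["gismu", "melbi"]));            // terminal
console.log(classifyNode(["VAU"]));                       // elided
console.log(classifyNode([["t", "t"], ["i", "i"]]));      // anonymous
console.log(classifyNode(["BRIVLA", ["gismu", "melbi"]])); // named non-terminal
```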

Here is an example of how to interpret the following piece of nested list notation (with morphological information kept), corresponding to the Lojban "ro ti" (with a whitespace and an elided terminator BOI):

["sumti_4",["sumti_5",["quantifier",["number",["PA_clause",["PA_pre",["PA",[["r","r"],["o","o"]]],["spaces"," "]]]],["BOI"]],["sumti_6",["KOhA_clause",["KOhA_pre",["KOhA",[["t","t"],["i","i"]]]]]]]]

sumti_4 ── sumti_5 ┬─ quantifier ┬─ number ── PA_clause ── PA_pre ┬─ PA ── ∅ ┬─ r ── "r"
                   │             │                                │          │
                   │             │                                │          └─ o ── "o"
                   │             └─ BOI                           │
                   │                                              └─ spaces ── " "
                   │
                   └─ sumti_6 ── KOhA_clause ── KOhA_pre ── KOhA ── ∅ ┬─ t ── "t"
                                                                      │
                                                                      └─ i ── "i"

Note: in the above syntax tree, the "∅" after "PA" and "KOhA" stand for anonymous nodes.
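To inspect larger trees in the same spirit, a nested list can be printed as an indented outline, with "∅" marking anonymous nodes as in the diagram above. The function renderTree below is a hypothetical helper written for this document, not part of Camxes or its postprocessor.

```javascript
// Render a parse tree as an indented outline, one node per line.
// Anonymous nodes are shown as "∅"; terminal text is shown in quotes.
function renderTree(node, depth = 0, lines = []) {
  const indent = "  ".repeat(depth);
  if (typeof node === "string") {
    lines.push(indent + JSON.stringify(node));
    return lines;
  }
  const named = typeof node[0] === "string";
  lines.push(indent + (named ? node[0] : "∅"));
  for (const child of named ? node.slice(1) : node) {
    renderTree(child, depth + 1, lines);
  }
  return lines;
}

const pa = ["PA", [["r", "r"], ["o", "o"]]];
console.log(renderTree(pa).join("\n"));
// PA
//   ∅
//     r
//       "r"
//     o
//       "o"
```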