tjvr/a-parser
What is this?

This is a work-in-progress parser generator: a framework for generating parsers from a grammar definition. It takes inspiration from Nearley and other projects.

A parser is used to turn text (such as the source code for a programming language) into a parse tree. This tree-like structure captures the hierarchy of the source code, and is classically the input to a compiler, which traverses the tree to emit machine code. (This is a rough simplification: most production compilers use several different intermediate tree formats.)

A parser is usually used with a tokenizer (or "lexer"). The tokenizer does the "dumb" job of splitting the text into "words", called "tokens"; the parser does the "smart" job of recognising sequences of such tokens. We recommend Moo as a very fast and friendly tokenizer.
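For example, a minimal Moo tokenizer might look like this (the token names and patterns here are illustrative, not part of this project):

const moo = require('moo')

// An illustrative tokenizer: four token types plus a whitespace rule.
const lexer = moo.compile({
  ws:         /[ \t]+/,
  number:     /[0-9]+/,
  identifier: /[a-zA-Z_][a-zA-Z0-9_]*/,
  lparen:     '(',
  rparen:     ')',
})

lexer.reset('foo(42)')
for (const tok of lexer) {
  console.log(tok.type, JSON.stringify(tok.value))
}
// identifier "foo"
// lparen "("
// number "42"
// rparen ")"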

So, as far as this project is concerned: a Parser turns a stream of Tokens into a tree of Nodes.

To construct a Parser, you create a Grammar, and then pick a parsing algorithm to use.

Parse trees

The grammar/node file contains types for representing a parse tree. These are loosely modelled on the Node objects in the ESTree spec, which is based on a production JavaScript parser.

interface Node {
    type: string;
    region: Region | null;
    ...attrs
}

Parse trees are made up of Nodes. Grammars include annotations describing how to build a parse tree from the matched input, which lets you conveniently omit syntactic information (such as whitespace, or operator tokens) from your parse tree, without writing any grammar rule post-processors or tree traversal code.

A Node includes a Region, which describes the location in the source file where the node was found. This is very useful for generating descriptive semantic error messages from your compiler: for example, you might like to highlight the region in which a type error was found.

interface Region {
    start: Pos;
    end: Pos;
    buffer: string;
}

A Region consists of a start position (just before the first character that was matched) and an end position (just after the last character that was matched). It also includes a pointer to the entire source text.

A Pos is a simple pair of line and column numbers. Like Moo, these start from one. The character offset from the beginning of the source is also included.

interface Pos {
    line: number; // >= 1
    col: number; // >= 1
    offset: number;
}

A Node additionally includes attributes, based on the annotations in the grammar. An attribute value can be one of three kinds:

  • another Node
  • the value of a token
  • an Array (a list of Nodes, or of token values)
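As a sketch, a single node might combine all three kinds of attribute (the CallExpression shape here is hypothetical, for illustration only):

{
  type: "CallExpression",
  region: { /* ... */ },
  callee: { type: "Atom", /* ... */ }, // another Node
  name: "print",                       // the value of a token
  args: [ /* more Nodes */ ],          // an Array
}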

Grammar definitions

Grammars are defined using a custom syntax, from which a parser can be generated.

(You might wonder how this syntax is defined itself: in fact, it has its own grammar, and can parse itself! We call this "bootstrapping".)

The grammar module contains code to interpret this syntax.

TODO explain atoms: what does "if" match?

Post-processing

By default, rules will match input, but won't return anything. Annotations are used to describe how to construct a parse tree from what was matched. Rules without annotations always produce the value null.

if_stmt -> "if" expr "then" block
// null

Rules can be annotated with a node type. Any number of children may then be annotated with an attribute name. For example, this rule will produce a node of type IfStatement, with attributes cond and body.

if_stmt IfStatement -> "if" cond:expr "then" body:block
// {type: "IfStatement", cond: ..., body: ...}
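Concretely, assuming expr and block rules that produce their own nodes, the resulting tree might look like this; note that the "if" and "then" tokens are omitted, and the region is attached automatically:

{
  type: "IfStatement",
  region: { /* covers the whole statement */ },
  cond: { /* the Node produced by expr */ },
  body: { /* the Node produced by block */ },
}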

You cannot annotate children unless the rule itself is annotated; otherwise, there would be no node to attach the attributes to.

if_stmt -> "if" cond:expr "then" body:block
// Not allowed

Root Annotation

It's often useful to pass a node through unchanged, without wrapping it in another object. For example, you might have a number rule that can match a float. The root annotation (which can be thought of as an empty attribute name) does exactly this: it passes the child through unchanged.

float_literal Literal -> value:"float"
// {type: "Literal", value: 3.14}

number -> float_literal
// null

number -> :float_literal
// {type: "Literal", value: 3.14}

Parentheses are another good example. You must define a syntax rule for brackets -- otherwise they couldn't be parsed! -- but usually you don't want them to appear in the final parse tree.

x -> "(" :x ")"

The root annotation can only be applied to a single child of a rule.

foo -> :bar :quxx
// Not allowed

The root annotation cannot be combined with other annotations.

foo Object -> :bar
// Not allowed

num Object -> :expr "+" other:expr
// Not allowed

List Annotations

It's often useful to parse lists (e.g. statements in a program; comma-separated arguments to a function call).

You could define lists yourself, by inventing a node type for linked lists.

// A program must have at least one statement.
program StatementList -> head:statement

// A program is a statement followed by the rest of the program.
program StatementList -> head:statement tail:program

Each node in the linked list will have the type StatementList.
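Parsing three statements with these rules produces a nested structure along these lines (stmt1, stmt2 and stmt3 stand for the nodes produced by statement):

{
  type: "StatementList",
  head: stmt1,
  tail: {
    type: "StatementList",
    head: stmt2,
    tail: { type: "StatementList", head: stmt3 },
  },
}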

These rules are also right-recursive; we prefer left-recursive rules. More importantly, there is a special built-in list type [], so you don't have to work with linked lists yourself.

The last item in the list is annotated as the root attribute. The special list attribute [] is used for the rest of the list.

// A program must have at least one statement.
program [] -> :statement

// A program ends with a statement.
program [] -> []:program :statement
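These rules produce a plain flat Array instead of a nested structure. Parsing the same three statements yields something like:

[stmt1, stmt2, stmt3]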

The rule must be annotated with the special list type []. One of the children may then be annotated as the root attribute; another child may then additionally be annotated with the list attribute.

You cannot have a list just contain itself; and as above, the root annotation can only be applied to a single child of a rule.

program [] -> []:program
// Not allowed

program [] -> []:program :statement :statement
// Not allowed

Apart from these restrictions, you can use these annotations anywhere in the rule.

// Body is one or more lines separated by semicolons
body [] -> :statement
body [] -> []:body ";" :statement

// Arguments are zero or more expressions separated by commas
args [] -> 
args [] -> []:args "," :expr
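As a sketch, the resulting attribute values are plain arrays, with the separator tokens omitted (stmt1, expr1 and so on stand for the child nodes):

[stmt1, stmt2, stmt3] // body: "a; b; c" (semicolons omitted)
[]                    // args: ""       (zero expressions)
[expr1, expr2]        // args: "x, y"   (commas omitted)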

EBNF Operators

Three regex-like operators for optional tokens and repetition are provided:

Option ?

val? matches zero or one occurrence of val. It expands to the generated rule val?:

val? -> :val
val? ->

In the expression key:val?, if val is not present, then key will be null.

One or many +

val+ matches one or many occurrences of val. It expands to the generated rule val+:

val+ [] -> []:val+ :val
val+ [] -> :val

In the expression key:val+, key will always contain a non-empty array.

Zero or many *

val* matches zero or many occurrences of val. It expands to the generated rule val*:

val* [] -> []:val* :val
val* [] ->

In the expression key:val*, key will always contain an array, but it may be empty.
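To summarize, a sketch of the attribute values these operators produce (val1 and val2 stand for nodes produced by val):

// key:val?  =>  key is null, or val1
// key:val+  =>  key is [val1], [val1, val2], ... (never empty)
// key:val*  =>  key is [], [val1], [val1, val2], ...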

Parsing

Various parsing algorithms may be implemented over time. Initially, there is a proof-of-concept parser which uses Nearley's implementation of the Earley algorithm.

This project doesn't have a name yet, so the example below calls it foo. Using a parser will look something like this:

const moo = require('moo')
const foo = require('foo') // import the core library
const compileNearley = require('foo/compile-nearley') // import the Earley algorithm

// Get our tokenizer ready
const lexer = moo.compile(...)

// Define our Grammar
const myGrammar = foo.grammar(`
  program -> :expr+

  expr Quote -> "'" list:expr
  expr List -> "(" items:expr+ ")"
  expr Atom -> name:"identifier"
  expr Literal -> value:"number"
  expr Literal -> value:"string"
`)

// Construct a Parser from our Grammar, using the Earley algorithm
const parser = compileNearley(myGrammar)

// Begin a new Parse
parser.reset()

// Tokenize the source code
lexer.reset(source)

for (let tok of lexer) {
  // Feed each token to the parser
  try {
    parser.eat(tok)
  } catch (err) {
    throw new Error(lexer.formatError("syntax error: " + err.message))
  }
}

// Get the final parse tree
let parseTree
try {
  parseTree = parser.result()
} catch (err) {
  throw new Error(lexer.formatError("unexpected EOF: " + err.message))
}
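Since program passes through the root of an expr+ list, the final result is an array of expr nodes. For a source string like "(add 1 2)", the parse tree might look like this (a sketch; token values are shown as raw strings, though a Moo value transform could convert them):

[
  {
    type: "List",
    region: { /* ... */ },
    items: [
      { type: "Atom",    region: { /* ... */ }, name: "add" },
      { type: "Literal", region: { /* ... */ }, value: "1" },
      { type: "Literal", region: { /* ... */ }, value: "2" },
    ],
  },
]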
