Syntax — Specification

Metanotation
- Regular expressions
- The context-free grammar
Lexical Structure
Syntactic Analysis

Updated: December 22, 2019

Metanotation

A combination of the following metalanguages and metalinguistic formalisms is used in this section:

plain English,
MANOOL expressions (whereby involving a kind of metacircular description),
regular expressions, and
a formal context-free grammar¹ with attributes (i.e., semantic values).

All MANOOL expressions involved as metacircular descriptions are to be considered in the binding environment of the MANOOL standard library.

Regular expressions

A regular expression regexp that describes a lexical category <lexcat>L (which corresponds to a terminal symbol from the context-free grammar) is provided according to the following format:

<lexcat>L -> regexp [[ ... ]]

as in the following example:

<lit>L -> (<letter> | "_") (<letter> | "_" | <digit>)*  [[ {if E <> "_" then MakeSym[E] else MakeSym[]} ]]

Alternatively, a lexical category may be denoted simply as "chars"L as in the example:

";"L -> ";"  [[ Nil ]]

Auxiliary (helper) definitions may also be provided, e.g.:

<digit> -> "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Inside regexp

a single literal character or a sequence thereof is enclosed in either " or ' (double or single quotes),
a mere concatenation of subexpressions means sequencing,
* (an asterisk metaoperator) means repetition zero or more times of the preceding item,
| (a vertical bar) separates syntactic alternatives (and is read “or”),
* bind stronger than | and plain concatenations whereas the later bind stronger than |,
( and ) (parentheses) are used for explicit grouping subexpressions, and
... (ellipses) are used to represent “obvious” omissions and have a rather informal character.

At the end of each “lexical production” a semantic evaluation function as a MANOOL expression is provided enclosed in [[ and ]] whose purpose is to describe construction of the corresponding lexical (semantic) value (i.e., an attribute value from the standpoint of attribute grammars) out of the lexeme text, denoted as E (from “lexical element”).

The context-free grammar

Each production of the context-free grammar for a nonterminal symbol (i.e., a syntactic category) <syncat>S is provided according to the following format:

<syncat>S -> rhs [[ ... ]]

where no metaoperators occur except | (which is just an optional shorthand notation to represent several alternative productions for the same left-hand side).

Inside all regular expressions and productions, "" means “ε” (i.e., the empty string) and is useful to reinforce clarity.

Again, at the end of each production a semantic evaluation function as a MANOOL expression may be provided enclosed in [[ and ]] that constructs the corresponding semantic value out of the semantic values of constituents, accessible as elements of the array A (from “attributes”). In this case each corresponding symbol in the right-hand-side of the production is followed by the corresponding index between [ and ] (brackets), as in the example:

<args'>S  ->  <datum>S[0] ";"L <args'>S[2]  [[ MakeList[{array of A[0]} + A[2]] ]]

The default semantic evaluation function is assumed to be simply A[0].

Lexical Structure

During the translation phase of lexical analysis, the abstract machine decomposes the input character string into a concatenation of lexical elements and lexical separators (or just separators, for short). The later are ignored except as they serve to separate the former. At least one separator is required between an integer or symbol literal and a nearby integer or symbol literal.²

A lexical separator is either a whitespace character or a comment.

Lexical elements are classified into

literals (integer, string, and symbol),
infix operators (equivalence/association, relational, additive, and multiplicative),
unary operators (prefix and postfix), and
delimiters and punctuators.

Source character set and encoding

This specification deliberately makes no provisions for particular character sets or encodings to be used to interpret program units as text files. Moreover, it is generally possible to represent MANOOL programs as series of bytes (octets) in a manner agnostic to character set and character encoding. Such representation, however, shall cover the ASCII character set and shall be backward compatible with ASCII in respect to character encoding. As an extreme example, it would be perfectly valid to assume the UTF-8 character encoding for some source code fragments and ISO-8859-1 for other fragments, even in the same source file.

Program units are composed only of the following graphical ASCII characters:

26 uppercase letters:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

26 lowercase letters:

a b c d e f g h i j k l m n o p q r s t u v w x y z

10 decimal digits:
```
0 1 2 3 4 5 6 7 8 9
```

30 special characters:

! " # $ % & ' ( ) * + - . / : ; < = > ? @ [ \ ] ^ _ { | } ~

plus the following whitespace ASCII characters:

SP — space;
HT — tab;
VT — vertical tab;
FF — form feed;
CR — carriage return;
LF — line feed (line separator);

except that string literals and comments may contain any additional characters (regardless of encoding).

As a special case, the abstract machine considers a zero byte (corresponding to the ASCII NUL character) to be an optional end-of-file marker (effectively ignoring it and the rest of the file). A line separator character at end of file in a program unit is recommended but not required as per this specification.

Note that MANOOL is a case-sensitive language, so the abstract machine considers lowercase and uppercase letters as distinct for all purposes. Also note that the following two characters are illegal in program units anywhere outside of string literals and comments and are reserved for future use:

, — comma;
` — ASCII grave accent or backquote.

Literals

In this section the following helper definitions hold:

<letter> ->  "A" | "B" | "C" | ... | "Z" | "a" | "b" | "c" | ... | "z"
<digit>  ->  "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<any char except '"' and LF> -> ...

Integer literals

An integer literal consists of one or more decimal digits and denotes a lexical value of type Integer by using the conventional decimal notation:

<lit>L -> <digit> <digit>*  [[ Int[E] ]]

Examples:

0   1   2   5   10   123   2018   140737488355327

String literals

A string literal consists of zero or more characters — other than double quotes (") and line separators — enclosed in double quotes. With the quotes stripped off, a string literal denotes a lexical value of type String, as is:

<lit>L -> '"' <any char except '"' and LF>* '"'  [[ E[Range[1; Size[E] - 1]] ]]

Examples:

""   "foo"   "This is a string"   "manool.org.18/std/0.6/all"

Alternatively, \} and \{ may be used instead of double quotes.³ In this case the lexical value may contain anything except \{. Thus,

\}
  <p>This is a paragraph.</p>
\{

is equivalent to

(Lf + Sp + Sp + "<p>This is a paragraph.</p>" + Lf)$

Symbol literals

A symbol literal consists of one or more letters, digits, and underscore characters (_) and starts with a letter or underscore. Each occurrence in a program unit of a symbol literal denotes a lexical value of type Symbol. For any symbol literal distinct from a single underscore, the value is obtained by constructing a symbol out of the literal (i.e., straight out of the text); otherwise, the value is represented by a unique uninterned symbol generated by the abstract machine during the lexical analysis phase on each occurrence of the literal:

<lit>L -> (<letter> | "_") (<letter> | "_" | <digit>)*  [[ {if E <> "_" then MakeSym[E] else MakeSym[]} ]]

Examples:

A   WriteLine   extern   Log10   _   _Num

Note that negative integers, some categories of strings, extremely long strings, and some categories of symbols have no literal representation whatsoever.

Infix operators

An infix operator is similar to a symbol literal, although unlike the later it consists of one or two special characters and is classified from the syntactic analysis standpoint as either <equ op>L, <rel op>L, <add op>L, or <mul op>L:

Equivalence/association operator:
```
<equ op>L -> "="  [[ Sym[E] ]]
```

Relational operators:

<rel op>L -> "==" | "<>" | "<" | "<=" | ">" | ">="  [[ Sym[E] ]]

Additive operators:

<add op>L -> "+" | "-" | "|"  [[ Sym[E] ]]

Multiplicative operators:

<mul op>L -> "*" | "/" | "&"  [[ Sym[E] ]]

Unary operators

A unary operator is similar to a symbol literal, although unlike the later it consists of a single special character and is classified from the syntactic analysis standpoint as either <pref op>L or <post op>L:

Prefix operator:
```
<pref op>L -> "~"  [[ Sym[E] ]]
```

Postfix operators:

<post op>L -> "!" | "#" | "$" | "%" | "'" | "?" | "@" | "^"  [[ Sym[E] ]]

Delimiters, punctuators

The only delimiter in MANOOL is ;:

";"L -> ";"  [[ Nil ]]

Punctuators in MANOOL are

(  )  .  :  [  ]  {  }

Delimiters and punctuators lack any meaningful lexical value and serve only for grouping syntactically other values and overriding operator precedence and associativity.

Comments

MANOOL syntax allows for two kinds of comments: line comments and block comments.

Line comments

A line comment starts with two adjacent - (ASCII hyphen-minus or dash) characters and extends up to the nearest end of line. Here is an example of a line comment:

-- This is a line comment

Block comments

Block comments may be recursively nested inside one another. A block comment starts with a combination of /* (ASCII solidus or slash plus asterisk) characters and extends up through the matching combination */, which shall not immediately precede an asterisk. Inside a block comment, combinations /* and */ do not start or end a nested comment when encountered within what would look like a line comment or string literal if it had occurred outside of a comment. This applies even to malformed string literals implicitly terminating at end of line.⁴ This is a complete, properly terminated block comment (provided no asterisk immediately follows it):

/* This is a block comment
  */*** This is a nested comment
    -- This is a line ***/comment/***
    Out.WriteLine["Comments terminate on */ and start on /*"]
    "Malformed ***/string literal/*** ends here ->
  end of comment ***/
end of comment */

Syntactic Analysis

On the second translation phase, the abstract machine parses the input in the form of string of terminal symbols according to the context-free grammar provided below, builds up a parse tree, and then constructs the AST intermediate representation.

Context-free grammar (with attributes and semantic evaluation functions):

<start>S  ->  <datum>S
<datum>S  ->  <datum'>S
<datum>S  ->  <datum'>S[0] <equ op>L[1] <datum'>S[2]  [[ MakeList[{array of A[1]; A[0]; A[2]}] ]]

<datum'>S ->  <simple>S
<datum'>S ->  <simple>S[0] <rel op>L[1] <simple>S[2]  [[ MakeList[{array of A[1]; A[0]; A[2]}] ]]

<simple>S ->  <term>S
<simple>S ->  <simple>S[0] <add op>L[1] <term>S[2]  [[ MakeList[{array of A[1]; A[0]; A[2]}] ]]

<term>S   ->  <factor>S
<term>S   ->  <term>S[0] <mul op>L[1] <factor>S[2]  [[ MakeList[{array of A[1]; A[0]; A[2]}] ]]

<factor>S ->  <prim>S
<factor>S ->  <pref op>L[0] <factor>S[1]  [[ MakeList[{array of A[0]; A[1]}] ]]

<prim>S   ->  <prim'>S
<prim>S   ->  <prim>S[0] "["L <args>S[2] "]"L  [[ MakeList[{array of A[0]} + A[2]] ]]
<prim>S   ->  <prim>S[0] "."L <prim'>S[2] "["L <args>S[4] "]"L  [[ MakeList[{array of A[2]; A[0]} + A[4]] ]]
<prim>S   ->  <prim>S[0] <post op>L[1]  [[ MakeList[{array of A[1]; A[0]}] ]]

<prim'>S  ->  <lit>L[0]  [[ A[0] ]]  |  "("L <op>S[1] ")"L  [[ A[1] ]]
<prim'>S  ->  "{"L <list>S[1] "}"L  [[ A[1] ]]  |  "("L <datum>S[1] ")"L  [[ A[1] ]]

<op>S     ->  <equ op>L | <rel op>L | <add op>L | <mul op>L | <pref op>L | <post op>L

<args>S   ->  <args'>S  |  ""  [[ MakeList[{array}] ]]

<args'>S  ->  <datum>S[0] <args>S[1]  [[ MakeList[{array of A[0]} + A[1]] ]]
<args'>S  ->  <datum>S[0] ";"L <args'>S[2]  [[ MakeList[{array of A[0]} + A[2]] ]]

<list>S   ->  <list'>S  |  ""  [[ MakeList[{array}] ]]

<list'>S  ->  <datum>S[0] <list>S[1]  [[ MakeList[{array of A[0]} + A[1]] ]]
<list'>S  ->  <datum>S[0] ";"L <list'>S[2]  [[ MakeList[{array of A[0]} + A[2]] ]]
<list'>S  ->  <datum>S[0] ":"L <list'>S[2]  [[ MakeList[{array of A[0]; A[2]}] ]]

where MakeList is some (pure) function, and for any array A of ASTs, the following condition is met:

MakeList[A].IsList[] & (Size[MakeList[A]] == Size[A]) & {for {E1 = MakeList[A]; E2 = A} all E1 == E2}