peg(3)


grammar::peg(3tcl)       Grammar operations and usage       grammar::peg(3tcl)

______________________________________________________________________________

NAME
       grammar::peg - Create and manipulate parsing expression grammars

SYNOPSIS
       package require Tcl  8.4

       package require snit

       package require grammar::peg  ?0.1?

       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?

       pegName destroy

       pegName clear

       pegName = srcPEG

       pegName --> dstPEG

       pegName serialize

       pegName deserialize serialization

       pegName is valid

       pegName start ?pe?

       pegName nonterminals

       pegName nonterminal add nt pe

       pegName nonterminal delete nt1 ?nt2 ...?

       pegName nonterminal exists nt

       pegName nonterminal rename nt ntnew

       pegName nonterminal mode nt ?mode?

       pegName nonterminal rule nt

       pegName unknown nonterminals

______________________________________________________________________________

DESCRIPTION
       This package provides a container class for parsing expression grammars
       (Short: PEG).  It allows the incremental definition of the grammar, its
       manipulation  and querying of the definition.  The package neither pro-
       vides complex operations on the grammar, nor has it the ability to exe-
       cute  a  grammar  definition for a stream of symbols.  Two packages re-
       lated to this one are grammar::mengine  and  grammar::peg::interpreter.
       The first of them defines a general virtual machine for the matching of
       a character stream, and the second implements an interpreter for  pars-
       ing expression grammars on top of that virtual machine.

   TERMS & CONCEPTS
       PEGs  are similar to context-free grammars, but not equivalent; in some
       cases PEGs are strictly more powerful than context-free grammars (there
       exist PEGs for some non-context-free languages).  The formal mathemati-
       cal definition of parsing expressions and parsing  expression  grammars
       can be found in section PARSING EXPRESSION GRAMMARS.

       In  short,  we have terminal symbols, which are the most basic building
       blocks for sentences, and nonterminal symbols with  associated  parsing
       expressions,  defining  the grammatical structure of the sentences. The
       two sets of symbols are distinctive, and do not overlap. When  speaking
       about  symbols  the  word  "symbol" is often left out. The union of the
       sets of terminal and nonterminal symbols is called the set of symbols.

       Here the set of terminal symbols is not explicitly managed, but implic-
       itly defined as the set of all characters. Note that this means that we
       inherit from Tcl the ability to handle all of Unicode.

       A pair of nonterminal and parsing expression is also called a grammati-
       cal  rule,  or rule for short. In the context of a rule the nonterminal
       is often called the left-hand-side (LHS), and  the  parsing  expression
       the right-hand-side (RHS).

       The  start  expression  of a grammar is a parsing expression from which
       all the sentences contained in the language specified  by  the  grammar
       are  derived.  To make the understanding of this term easier let us as-
       sume for a moment that the RHS of each rule, and the start  expression,
       is  either  a sequence of symbols, or a series of alternate parsing ex-
       pressions.  In the latter case the rule can be seen as a set of  rules,
       each  providing one alternative for the nonterminal.  A parsing expres-
       sion A' is now a derivation of a parsing expression A if we pick one of
       the  nonterminals N in the expression, and one of the alternative rules
       R for N, and then replace the nonterminal in A with the RHS of the cho-
       sen  rule.  Here  we  can see why the terminal symbols are called such.
       They cannot be expanded any further, thus terminate the process of  de-
       riving new expressions.  An example

                  Rules
                    (1)  A <- a B c
                    (2a) B <- d B
                    (2b) B <- e

                  Some derivations, using starting expression A.

                    A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c

       A  derived  expression  containing only terminal symbols is a sentence.
       The set of all sentences which can be derived from the start expression
       is the language of the grammar.

       Some definitions for nonterminals and expressions:

       [1]    A  nonterminal A is called reachable if it is possible to derive
              a parsing expression from the start expression which contains A.

       [2]    A nonterminal A is called useful if it is possible to  derive  a
              sentence from it.

       [3]    A  nonterminal A is called recursive if it is possible to derive
              a parsing expression from it which contains A, again.

       [4]    The FIRST set of a nonterminal A contains all the symbols  which
              can  occur of as the leftmost symbol in a parsing expression de-
              rived from A. If the FIRST set contains A itself then that  non-
              terminal is called left-recursive.

       [5]    The  LAST  set of a nonterminal A contains all the symbols which
              can occur of as the rightmost symbol in a parsing expression de-
              rived  from  A. If the LAST set contains A itself then that non-
              terminal is called right-recursive.

       [6]    The FOLLOW set of a nonterminal A contains all the symbols which
              can occur after A in a parsing expression derived from the start
              expression.

       [7]    A nonterminal (or parsing expression) is called nullable if  the
              empty sentence can be derived from it.

       And based on the above definitions for grammars:

       [1]    A  grammar G is recursive if and only if it contains a nontermi-
              nal A which is recursive. The terms left-  and  right-recursive,
              and useful are analogously defined.

       [2]    A  grammar  is  minimal if it contains only reachable and useful
              nonterminals.

       [3]    A grammar is wellformed if it is not left-recursive. Such  gram-
              mars  are also complete, which means that they always succeed or
              fail on all input sentences. For an incomplete  grammar  on  the
              other  hand  input sentences exist for which an attempt to match
              them against the grammar will not terminate.

       [4]    As we wish to allow ourselves to build a  grammar  incrementally
              in  a container object we will encounter stages where the RHS of
              one or more rules reference symbols which are not yet  known  to
              the  container.  Such  a grammar we call invalid.  We cannot use
              the term incomplete as this term is already taken, see the  last
              item.

   CONTAINER CLASS API
       The package exports the API described here.

       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
              The command creates a new container object for a parsing expres-
              sion grammar and returns the fully qualified name of the  object
              command as its result. The API the returned command is following
              is described in the section CONTAINER OBJECT API. It may be used
              to  invoke  various  operations on the container and the grammar
              within.

              The new container, i.e. grammar will be empty if no src is spec-
              ified. Otherwise it will contain a copy of the grammar contained
              in the src.  The src has to be a container object reference  for
              all  operators except deserialize.  The deserialize operator re-
              quires src to be the serialization of a parsing expression gram-
              mar instead.

              An  empty  grammar has no nonterminal symbols, and the start ex-
              pression is the empty expression, i.e. epsilon. It is valid, but
              not useful.

   CONTAINER OBJECT API
       All grammar container objects provide the following methods for the ma-
       nipulation of their contents:

       pegName destroy
              Destroys the grammar, including its storage space and associated
              command.

       pegName clear
              Clears  out  the definition of the grammar contained in pegName,
              but does not destroy the object.

       pegName = srcPEG
              Assigns the contents of the grammar contained in srcPEG to  peg-
              Name,  overwriting any existing definition.  This is the assign-
              ment operator for grammars. It copies the grammar  contained  in
              the  grammar  object  srcPEG over the grammar definition in peg-
              Name. The old contents of pegName are deleted by this operation.

              This operation is in effect equivalent to

                  pegName deserialize [srcPEG serialize]

       pegName --> dstPEG
              This is the reverse assignment operator for grammars. It  copies
              the  automation contained in the object pegName over the grammar
              definition in the object dstPEG.  The old contents of dstPEG are
              deleted by this operation.

              This operation is in effect equivalent to

                  dstPEG deserialize [pegName serialize]

       pegName serialize
              This  method  serializes the grammar stored in pegName. In other
              words it returns a tcl value completely describing that grammar.
              This  allows,  for  example, the transfer of grammars over arbi-
              trary channels, persistence, etc.  This method is also the basis
              for both the copy constructor and the assignment operator.

              The  result of this method has to be semantically identical over
              all implementations of the grammar::peg interface. This is  what
              will  enable  us  to copy grammars between different implementa-
              tions of the same interface.

              The result is a list of four elements with the following  struc-
              ture:

              [1]    The constant string grammar::peg.

              [2]    A dictionary. Its keys are the names of all known nonter-
                     minal symbols, and their associated values are the  pars-
                     ing expressions describing their sentennial structure.

              [3]    A dictionary. Its keys are the names of all known nonter-
                     minal symbols, and their associated  values  hints  to  a
                     matcher  regarding  the  semantic  values produced by the
                     symbol.

              [4]    The last item is a parsing expression, the start  expres-
                     sion of the grammar.

       Assuming the following PEG for simple mathematical expressions

                  Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
                  Sign       <- '+' / '-'
                  Number     <- Sign? Digit+
                  Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
                  MulOp      <- '*' / '/'
                  Factor     <- Term (AddOp Term)*
                  AddOp      <- '+'/'-'
                  Term       <- Number

       a possible serialization is

                  grammar::peg \
                  {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \
                   Factor     {x Term {* {x AddOp Term}}} \
                   Term       Number \
                   MulOp      {/ * /} \
                   AddOp      {/ + -} \
                   Number     {x {? Sign} {+ Digit}} \
                   Sign       {/ + -} \
                   Digit      {/ 0 1 2 3 4 5 6 7 8 9} \
                  } \
                  {Expression value     Factor     value \
                   Term       value     MulOp      value \
                   AddOp      value     Number     value \
                   Sign       value     Digit      value \
                  }
                  Expression

       A possible one, because the order of the nonterminals in the dictionary
       is not relevant.

       pegName deserialize serialization
              This is the complement to serialize.  It  replaces  the  grammar
              definition  in pegName with the grammar described by the serial-
              ization value. The old contents of pegName are deleted  by  this
              operation.

       pegName is valid
              A  predicate. It tests whether the PEG in pegName is valid.  See
              section TERMS & CONCEPTS for  the  definition  of  this  grammar
              property.  The result is a boolean value. It will be set to true
              if the PEG has the tested property, and false otherwise.

       pegName start ?pe?
              This method defines the start expression of the grammar. It  re-
              places  the previously defined start expression with the parsing
              expression pe.  The method fails and throws an error if pe  does
              not  contain a valid parsing expression as specified in the sec-
              tion PARSING EXPRESSIONS. In that case the  existing  start  ex-
              pression is not changed.  The method returns the empty string as
              its result.

              If the method is called without an argument it will  return  the
              currently defined start expression.

       pegName nonterminals
              Returns the set of all nonterminal symbols known to the grammar.

       pegName nonterminal add nt pe
              This  method  adds the nonterminal nt and its associated parsing
              expression pe to the set of nonterminal symbols and rules of the
              PEG  contained  in  the  object  pegName.   The method fails and
              throws an error if either the string nt is already  known  as  a
              symbol of the grammar, or if pe does not contain a valid parsing
              expression as specified in the section PARSING  EXPRESSIONS.  In
              that  case  the  current set of nonterminal symbols and rules is
              not changed.  The method returns the empty string as its result.

       pegName nonterminal delete nt1 ?nt2 ...?
              This method removes the named symbols nt1, nt2 from the  set  of
              nonterminal  symbols of the PEG contained in the object pegName.
              The method fails and throws an error if any of  the  strings  is
              not  known as a nonterminal symbol. In that case the current set
              of nonterminal symbols is not changed.  The method  returns  the
              empty string as its result.

              The  stored  grammar becomes invalid if the deleted nonterminals
              are referenced by the RHS of still-known rules.

       pegName nonterminal exists nt
              A predicate. It tests whether the nonterminal symbol nt is known
              to  the  PEG in pegName.  The result is a boolean value. It will
              be set to true if the symbol nt is known, and false otherwise.

       pegName nonterminal rename nt ntnew
              This method renames the nonterminal symbol  nt  to  ntnew.   The
              method  fails and throws an error if either nt is not known as a
              nonterminal, or if ntnew is a known symbol.  The method  returns
              the empty string as its result.

       pegName nonterminal mode nt ?mode?
              This  mode returns or sets the semantic mode associated with the
              nonterminal symbol nt. If no mode is specified the current  mode
              of  the  nonterminal  is returned. Otherwise the current mode is
              set to mode.  The method fails and throws an error if nt is  not
              known  as a nonterminal.  The grammar interpreter implemented by
              the package grammar::peg::interpreter recognizes  the  following
              modes:

              value  The  semantic  value  of  the nonterminal is the abstract
                     syntax tree created from the AST's of the RHS and a  node
                     for the nonterminal itself.

              match  The  semantic value of the nonterminal is an the abstract
                     syntax tree consisting of single a node  for  the  string
                     matched  by  the  RHS.  The ASTs generated by the RHS are
                     discarded.

              leaf   The semantic value of the nonterminal is an the  abstract
                     syntax tree consisting of single a node for the nontermi-
                     nal itself. The ASTs generated by the RHS are discarded.

              discard
                     The nonterminal has no semantic value. The ASTs generated
                     by the RHS are discarded (as well).

       pegName nonterminal rule nt
              This  method  returns the parsing expression associated with the
              nonterminal nt.  The method fails and throws an error if  nt  is
              not known as a nonterminal.

       pegName unknown nonterminals
              This method returns a list containing the names of all nontermi-
              nal symbols which are referenced on the  RHS  of  a  grammatical
              rule,  but  have  no  rule  definining their structure. In other
              words, a list of the nonterminal symbols which make the  grammar
              invalid. The grammar is valid if this list is empty.

   PARSING EXPRESSIONS
       Various methods of PEG container objects expect a parsing expression as
       their argument, or will return such. This section specifies the  format
       such parsing expressions are in.

       [1]    The  string  epsilon is an atomic parsing expression. It matches
              the empty string.

       [2]    The string alnum is an atomic parsing expression. It matches any
              alphanumeric character.

       [3]    The string alpha is an atomic parsing expression. It matches any
              alphabetical character.

       [4]    The string dot is an atomic parsing expression. It  matches  any
              character.

       [5]    The  expression  [list  t x] is an atomic parsing expression. It
              matches the terminal string x.

       [6]    The expression [list n A] is an atomic  parsing  expression.  It
              matches the nonterminal A.

       [7]    For  parsing expressions e1, e2, ... the result of [list / e1 e2
              ... ] is a parsing expression as  well.   This  is  the  ordered
              choice, aka prioritized choice.

       [8]    For  parsing expressions e1, e2, ... the result of [list x e1 e2
              ... ] is a parsing expression as well.  This is the sequence.

       [9]    For a parsing expression e the result of [list * e] is a parsing
              expression as well.  This is the kleene closure, describing zero
              or more repetitions.

       [10]   For a parsing expression e the result of [list + e] is a parsing
              expression  as  well.   This is the positive kleene closure, de-
              scribing one or more repetitions.

       [11]   For a parsing expression e the result of [list & e] is a parsing
              expression as well.  This is the and lookahead predicate.

       [12]   For a parsing expression e the result of [list ! e] is a parsing
              expression as well.  This is the not lookahead predicate.

       [13]   For a parsing expression e the result of [list ? e] is a parsing
              expression as well.  This is the optional input.

       Examples of parsing expressions where already shown, in the description
       of the method serialize.

PARSING EXPRESSION GRAMMARS
       For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where

       o      VN is a set of nonterminal symbols,

       o      VT is a set of terminal symbols,

       o      R is a finite set of rules, where each rule is a pair  (A,e),  A
              in VN, and e a parsing expression.

       o      eS is a parsing expression, the start expression.

       Further constraints are

       o      The intersection of VN and VT is empty.

       o      For  all  A  in  VT exists exactly one pair (A,e) in R. In other
              words, R is a function from nonterminal symbols to  parsing  ex-
              pressions.

       Parsing expression are inductively defined via

       o      The empty string (epsilon) is a parsing expression.

       o      A terminal symbol a is a parsing expression.

       o      A nonterminal symbol A is a parsing expression.

       o      e1e2  is  a parsing expression for parsing expressions e1 and 2.
              This is called sequence.

       o      e1/e2 is a parsing expression for parsing expressions e1 and  2.
              This is called ordered choice.

       o      e*  is  a  parsing  expression for parsing expression e. This is
              called zero-or-more repetitions, also known as kleene closure.

       o      e+ is a parsing expression for parsing  expression  e.  This  is
              called  one-or-more  repetitions,  also known as positive kleene
              closure.

       o      !e is a parsing expression for parsing expression  e1.  This  is
              called a not lookahead predicate.

       o      &e  is  a  parsing expression for parsing expression e1. This is
              called an and lookahead predicate.

       PEGs are used to define a grammatical structure for streams of  symbols
       over  VT.  They  are  a modern phrasing of older formalisms invented by
       Alexander Birham. These formalisms  were  called  TS  (TMG  recognition
       scheme),  and  gTS  (generalized  TS).  Later they were renamed to TPDL
       (Top-Down Parsing Languages) and gTPDL (generalized TPDL).

       They can be easily implemented by recursive descent parsers with  back-
       tracking. This makes them relatives of LL(k) Context-Free Grammars.

REFERENCES
       [1]    The   Packrat  Parsing  and  Parsing  Expression  Grammars  Page
              [http://www.pdos.lcs.mit.edu/~baford/packrat/], by  Bryan  Ford,
              Massachusetts  Institute  of  Technology. This is the main entry
              page to PEGs, and their realization through Packrat Parsers.

       [2]    Parsing      Techniques      -      A      Practical       Guide
              [http://www.cs.vu.nl/~dick/PTAPG.html],  an online book offering
              a clear, accessible, and thorough discussion of  many  different
              parsing  techniques  with  their interrelations and applicabili-
              ties, including error recovery techniques.

       [3]    Compilers and Compiler  Generators  [http://scifac.ru.ac.za/com-
              pilers/], an online book using CoCo/R, a generator for recursive
              descent parsers.

BUGS, IDEAS, FEEDBACK
       This document, and the package it describes, will  undoubtedly  contain
       bugs  and  other  problems.   Please  report such in the category gram-
       mar_peg of the Tcllib Trackers  [http://core.tcl.tk/tcllib/reportlist].
       Please  also  report any ideas for enhancements you may have for either
       package and/or documentation.

       When proposing code changes, please provide unified diffs, i.e the out-
       put of diff -u.

       Note  further  that  attachments  are  strongly  preferred over inlined
       patches. Attachments can be made by going  to  the  Edit  form  of  the
       ticket  immediately  after  its  creation, and then using the left-most
       button in the secondary navigation bar.

KEYWORDS
       LL(k), TDPL,  context-free  languages,  expression,  grammar,  parsing,
       parsing  expression,  parsing  expression grammar, push down automaton,
       recursive descent, state, top-down parsing languages, transducer

CATEGORY
       Grammars and finite automata

COPYRIGHT
       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>

tcllib                                0.1                   grammar::peg(3tcl)
Man(1) output converted with man2html
list of all man pages