File: lex.el.html
This file handles the creation of lexical analyzers for different languages in Emacs Lisp. The purpose of a lexical analyzer is to convert a buffer into a list of lexical tokens. Each token contains the token class (such as 'number, 'symbol, 'IF, etc.) and the location in the buffer where it was found. Optionally, a token may also contain a string representing what is at the designated buffer location.
Tokens are pushed onto a token stream, which is simply a list of all the lexical tokens produced from the analyzed region. The token stream is then handed to the grammar, which parses the file.
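As an illustrative sketch (the exact token shape depends on the analyzers used), a lexical token is conventionally a cons of the token class and the bounds it covers; the positions below are hypothetical:

```elisp
;; Sketch of the conventional token shape, (CLASS START . END).
(let ((tok '(IF 1 . 3)))        ; class 'IF, covering buffer positions 1-3
  (list (car tok)               ; token class  => IF
        (cadr tok)              ; start        => 1
        (cddr tok)))            ; end          => 3
```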
; How it works
Each analyzer specifies a condition and forms. These conditions
and forms are assembled into a function by define-lex that does
the lexical analysis.
In the lexical analyzer created with define-lex, each condition
is tested for a given point. When the condition is true, the forms
run.
The forms can push a lexical token onto the token stream. The forms must also move the current analyzer point. If the analyzer point is moved without pushing a token, then the matched syntax is effectively ignored, or skipped.
Thus, starting at the beginning of a region to be analyzed, each condition is tested. One of them will match, a lexical token may be pushed, and the point is moved to the end of the matched text. At the new position, the process repeats until the end of the specified region is reached.
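The loop described above can be pictured with the following simplified sketch. It is not the actual expansion produced by define-lex, and the function name and the CONDITION-N/FORMS-N placeholders are hypothetical:

```elisp
;; Schematic only: each analyzer contributes one `cond' clause.
(defun my-hypothetical-lexer (start end)
  (goto-char start)
  (let ((semantic-lex-token-stream nil))
    (while (< (point) end)
      (cond (CONDITION-1 FORMS-1)    ; first matching analyzer wins
            (CONDITION-2 FORMS-2)    ; FORMS must advance point
            ;; ... one clause per analyzer, tested in order ...
            ))
    (nreverse semantic-lex-token-stream)))
```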
; How to use semantic-lex
To create a lexer for a language, use the define-lex macro.
The define-lex macro accepts a list of lexical analyzers. Each
analyzer is created with define-lex-analyzer, or one of the
derivative macros. A single analyzer defines a regular expression
to match text in a buffer, and a short segment of code to create
one lexical token.
Each analyzer has a NAME, DOC, a CONDITION, and possibly some
FORMS. The NAME is the name used in define-lex. The DOC
describes what the analyzer should do.
The CONDITION evaluates the text at the current point in the current buffer. If CONDITION is true, then the FORMS will be executed.
The purpose of the FORMS is to push new lexical tokens onto the list of tokens for the current buffer, and to move point after the matched text.
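For example, a lexer could be assembled from the built-in analyzers documented later in this file. The name my-simple-lexer is hypothetical, and the ordering shown is one plausible choice, since earlier analyzers take precedence:

```elisp
(define-lex my-simple-lexer
  "Hypothetical lexer built from standard Semantic analyzers."
  semantic-lex-ignore-whitespace      ; skip spaces and tabs
  semantic-lex-ignore-newline         ; skip newlines
  semantic-lex-ignore-comments        ; skip comments
  semantic-lex-number                 ; push 'number tokens
  semantic-lex-symbol-or-keyword      ; push 'symbol or keyword tokens
  semantic-lex-string                 ; push 'string tokens
  semantic-lex-paren-or-list          ; handle open parens/lists
  semantic-lex-close-paren            ; handle close parens
  semantic-lex-punctuation            ; push 'punctuation tokens
  semantic-lex-default-action)        ; fallback when nothing else matches
```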
Some macros for creating one analyzer are:
define-lex-analyzer - A generic analyzer associating any style of
condition to forms.
define-lex-regex-analyzer - Matches a regular expression.
define-lex-simple-regex-analyzer - Matches a regular expression,
and pushes the match.
define-lex-block-analyzer - Matches list syntax, and
handles open/close delimiters.
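As a sketch of the second macro, assuming the argument order NAME DOC REGEXP TOKEN-SYMBOL (the analyzer name and token class below are made up for illustration):

```elisp
;; Hypothetical analyzer: match dotted version strings like "1.2.3"
;; and push one 'VERSION token covering the match.
(define-lex-simple-regex-analyzer my-lex-version
  "Match dotted version numbers and push a VERSION token."
  "[0-9]+\\.[0-9]+\\.[0-9]+" 'VERSION)
```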
These macros are used by the grammar compiler when lexical
information is specified in a grammar:
define-lex- * -type-analyzer - Matches syntax specified in
a grammar, and pushes one token for it. The * would
be sexp for things like lists or strings, and
string for things that need to match some special
string, such as "\\\\." where a literal match is needed.
; Lexical Tables
There are tables of different symbols managed in semantic-lex.el. They are:
Lexical keyword table - A Table of symbols declared in a grammar
file with the %keyword declaration.
Keywords are used by semantic-lex-symbol-or-keyword
to create lexical tokens based on the keyword.
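For instance, given a grammar containing a %keyword declaration, the keyword table can be queried with semantic-lex-keyword-p. This is a sketch; it assumes the table for the current buffer was populated by that grammar:

```elisp
;; In the grammar file:
;;   %keyword IF "if"
;; Then, in a buffer whose lexer uses that grammar's keyword table:
(semantic-lex-keyword-p "if")   ; non-nil: "if" maps to the class IF
(semantic-lex-keyword-p "elif") ; nil, unless also declared
```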
Lexical type table - A table of symbols declared in a grammar
file with the %type declaration.
The grammar compiler uses the type table to create new
lexical analyzers. These analyzers are then used when
a new lexical analyzer is made for a language.
; Lexical Types
A lexical type defines a kind of lexical analyzer that will be automatically generated from a grammar file based on some predetermined attributes. For now, these two attributes are recognized:
* matchdatatype: defines the kind of lexical analyzer. That is:
- regexp: define a regexp analyzer (see
define-lex-regex-type-analyzer)
- string: define a string analyzer (see
define-lex-string-type-analyzer)
- block: define a block type analyzer (see
define-lex-block-type-analyzer)
- sexp: define a sexp analyzer (see
define-lex-sexp-type-analyzer)
- keyword: define a keyword analyzer (see
define-lex-keyword-type-analyzer)
* syntax: defines the syntax that matches a syntactic
expression. When the syntax is matched, the corresponding type
analyzer is entered and the resulting match data is
interpreted based on the kind of analyzer (see matchdatatype
above).
The following lexical types are predefined:
+-------------+---------------+--------------------------------+
| type | matchdatatype | syntax |
+-------------+---------------+--------------------------------+
| punctuation | string        | "\\(\\s.\\|\\s$\\|\\s'\\)+"    |
| keyword     | keyword       | "\\(\\sw\\|\\s_\\)+"           |
| symbol      | regexp        | "\\(\\sw\\|\\s_\\)+"           |
| string      | sexp          | "\\s\""                        |
| number      | regexp        | semantic-lex-number-expression |
| block | block | "\\s(\\|\\s)" |
+-------------+---------------+--------------------------------+
In a grammar you must use a %type expression to automatically generate the corresponding analyzers of that type.
Here is an example that auto-generates punctuation analyzers, with 'matchdatatype and 'syntax predefined (see the table above):
%type <punctuation> ;; will auto-generate this kind of analyzer
It is equivalent to writing:
%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+" matchdatatype string
;; Some punctuation tokens based on the type defined above
%token <punctuation> NOT "!"
%token <punctuation> NOTEQ "!="
%token <punctuation> MOD "%"
%token <punctuation> MODEQ "%="
; On the Semantic 1.x lexer
In semantic 1.x, the lexical analyzer was an all-purpose routine. To boost efficiency, the analyzer is now a series of routines that are assembled at build time into a single routine. This eliminates unneeded if statements, speeding up the lexer.
Defined variables (48)
semantic-flex-depth | Default flexing depth. |
semantic-flex-enable-bol | When flexing, report beginning of lines as syntactic elements. |
semantic-flex-enable-newlines | When flexing, report newlines as syntactic elements. |
semantic-flex-enable-whitespace | When flexing, report whitespace as syntactic elements. |
semantic-flex-extensions | Buffer local extensions to the lexical analyzer. |
semantic-flex-keywords-obarray | Buffer local keyword obarray for the lexical analyzer. |
semantic-flex-syntax-modifications | Changes to the syntax table for this buffer. |
semantic-flex-tokens | An alist of semantic token types. |
semantic-flex-unterminated-syntax-end-function | Function called when unterminated syntax is encountered. |
semantic-ignore-comments | Default comment handling. |
semantic-lex-analysis-bounds | The bounds of the current analysis. |
semantic-lex-analyzer | The lexical analyzer used for a given buffer. |
semantic-lex-beginning-of-line | Detect and create a beginning of line token (BOL). |
semantic-lex-block-streams | Streams of tokens inside collapsed blocks. |
semantic-lex-charquote | Detect and create charquote tokens. |
semantic-lex-close-paren | Detect and create a close parenthesis token. |
semantic-lex-comment-regex | Regular expression for identifying comment start during lexical analysis. |
semantic-lex-comments | Detect and create a comment token. |
semantic-lex-comments-as-whitespace | Detect comments and create a whitespace token. |
semantic-lex-current-depth | The current depth as tracked through lexical functions. |
semantic-lex-debug | When non-nil, debug the local lexical analyzer. |
semantic-lex-debug-analyzers | Non-nil means to debug analyzers with syntax protection. |
semantic-lex-default-action | The default action when no other lexical actions match text. |
semantic-lex-depth | Default lexing depth. |
semantic-lex-end-point | The end point as tracked through lexical functions. |
semantic-lex-ignore-comments | Detect and create a comment token. |
semantic-lex-ignore-newline | Detect and ignore newline tokens. |
semantic-lex-ignore-whitespace | Detect and skip over whitespace tokens. |
semantic-lex-maximum-depth | The maximum depth of parenthesis as tracked through lexical functions. |
semantic-lex-newline | Detect and create newline tokens. |
semantic-lex-newline-as-whitespace | Detect and create newline tokens. |
semantic-lex-number | Detect and create number tokens. |
semantic-lex-number-expression | Regular expression for matching a number. |
semantic-lex-open-paren | Detect and create an open parenthesis token. |
semantic-lex-paren-or-list | Detect open parenthesis. |
semantic-lex-punctuation | Detect and create punctuation tokens. |
semantic-lex-punctuation-type | Detect and create a punctuation type token. |
semantic-lex-reset-functions | Abnormal hook used by major-modes to reset lexical analyzers. |
semantic-lex-string | Detect and create a string token. |
semantic-lex-symbol-or-keyword | Detect and create symbol and keyword tokens. |
semantic-lex-syntax-modifications | Changes to the syntax table for this buffer. |
semantic-lex-syntax-table | Syntax table used by lexical analysis. |
semantic-lex-token-stream | The current token stream we are collecting. |
semantic-lex-tokens | An alist of semantic token types. |
semantic-lex-types-obarray | Buffer local types obarray for the lexical analyzer. |
semantic-lex-unterminated-syntax-end-function | Function called when unterminated syntax is encountered. |
semantic-lex-whitespace | Detect and create whitespace tokens. |
semantic-number-expression | See variable ‘semantic-lex-number-expression’. |