File: lex.el.html

This file handles the creation of lexical analyzers for different languages in Emacs Lisp. The purpose of a lexical analyzer is to convert a buffer into a list of lexical tokens. Each token contains the token class (such as 'number, 'symbol, 'IF, etc.) and the location in the buffer where it was found. Optionally, a token also contains a string representing what is at the designated buffer location.

Tokens are pushed onto a token stream, which is basically a list of all the lexical tokens from the analyzed region. The token stream is then handed to the grammar, which parses the file.

; How it works

Each analyzer specifies a condition and forms. These conditions and forms are assembled into a function by define-lex that does the lexical analysis.

In the lexical analyzer created with define-lex, each condition is tested for a given point. When the condition is true, the forms run.

The forms can push a lexical token onto the token stream. The analyzer forms must also move the current analyzer point. If the analyzer point is moved without pushing a token, then the matched syntax is effectively ignored, or skipped.

Thus, starting at the beginning of a region to be analyzed, each condition is tested. One will match, and a lexical token might be pushed, and the point is moved to the end of the lexical token identified. At the new position, the process occurs again until the end of the specified region is reached.
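
The loop described above can be modeled in a few lines of plain Emacs Lisp. The following is only a sketch of the idea, not the code that define-lex generates: the function name my-toy-lex, the hard-coded conditions, and the (CLASS START . END) token shape are assumptions made for illustration.

```elisp
;; Toy model of the condition/forms loop.  `my-toy-lex' is a
;; hypothetical name; this is not the code that define-lex generates.
(defun my-toy-lex (start end)
  "Return a list of (CLASS START . END) tokens found between START and END."
  (let ((stream nil))
    (save-excursion
      (goto-char start)
      (while (< (point) end)
        (cond
         ;; Condition: whitespace.  Forms: move point, push nothing,
         ;; so the matched text is skipped.
         ((looking-at "[ \t\n]+")
          (goto-char (match-end 0)))
         ;; Condition: digits.  Forms: push a token, then move point.
         ((looking-at "[0-9]+")
          (push (cons 'number (cons (match-beginning 0) (match-end 0)))
                stream)
          (goto-char (match-end 0)))
         ;; Condition: identifier-like text.
         ((looking-at "[[:alpha:]_][[:alnum:]_]*")
          (push (cons 'symbol (cons (match-beginning 0) (match-end 0)))
                stream)
          (goto-char (match-end 0)))
         ;; Catch-all: treat one character as punctuation.
         (t
          (push (cons 'punctuation (cons (point) (1+ (point)))) stream)
          (forward-char 1)))))
    ;; Tokens were pushed newest-first; return them in buffer order.
    (nreverse stream)))

(with-temp-buffer
  (insert "x = 42")
  (my-toy-lex (point-min) (point-max)))
;; => ((symbol 1 . 2) (punctuation 3 . 4) (number 5 . 7))
```

The whitespace branch moves point without pushing anything, which is exactly the "ignored, or skipped" case. The order of the branches matters, because the first condition that matches wins.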

; How to use semantic-lex

To create a lexer for a language, use the define-lex macro.

The define-lex macro accepts a list of lexical analyzers. Each analyzer is created with define-lex-analyzer, or one of the derivative macros. A single analyzer defines a regular expression to match text in a buffer, and a short segment of code to create one lexical token.

Each analyzer has a NAME, DOC, a CONDITION, and possibly some FORMS. The NAME is the name used in define-lex. The DOC describes what the analyzer should do.

The CONDITION evaluates the text at the current point in the current buffer. If CONDITION is true, then the FORMS will be executed.

The purpose of the FORMS is to push new lexical tokens onto the list of tokens for the current buffer, and to move point after the matched text.

Some macros for creating one analyzer are:

  define-lex-analyzer - A generic analyzer associating any style of
             condition to forms.
  define-lex-regex-analyzer - Matches a regular expression.
  define-lex-simple-regex-analyzer - Matches a regular expression,
             and pushes the match.
  define-lex-block-analyzer - Matches list syntax, and handles
             open/close delimiters.
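
As a minimal sketch of how these macros fit together, the following defines three analyzers and combines them with define-lex. All names starting with my- are hypothetical. The sketch assumes that the library is loadable as semantic/lex (as in the copy bundled with Emacs), that semantic-lex-init must be called in the buffer before the lexer runs, and that the buffer uses a syntax table where letters and digits have word syntax and spaces have whitespace syntax.

```elisp
(require 'semantic/lex)

;; Skip whitespace: move the analyzer end point, push no token.
(define-lex-regex-analyzer my-lex-skip-whitespace
  "Detect and skip over whitespace."
  "\\s-+"
  (setq semantic-lex-end-point (match-end 0)))

;; Push an `integer' token covering the whole match.
(define-lex-simple-regex-analyzer my-lex-integer
  "Detect and create integer tokens."
  "[0-9]+"
  'integer)

;; Push a `word' token; pushing the token also moves the end point.
(define-lex-regex-analyzer my-lex-word
  "Detect and create word tokens."
  "\\(\\sw\\|\\s_\\)+"
  (semantic-lex-push-token
   (semantic-lex-token 'word (match-beginning 0) (match-end 0))))

;; Analyzers are tried in the order listed.  `my-lex-integer' comes
;; before `my-lex-word' because digits also have word syntax.
(define-lex my-simple-lexer
  "Hypothetical lexer for words and integers."
  my-lex-skip-whitespace
  my-lex-integer
  my-lex-word
  semantic-lex-default-action)

(with-temp-buffer
  (insert "foo 42 bar")
  (semantic-lex-init)                   ; sets up the lexer syntax table
  (mapcar #'semantic-lex-token-class
          (my-simple-lexer (point-min) (point-max))))
;; => (word integer word)
```

Ending the list with semantic-lex-default-action means that text matched by no other analyzer signals an error instead of being silently passed over.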

These macros are used by the grammar compiler when lexical information is specified in a grammar:
  define-lex-*-type-analyzer - Matches syntax specified in
             a grammar, and pushes one token for it. The * would
             be sexp for things like lists or strings, and
             string for things that need to match some special
             string, such as "\\." where a literal match is needed.

; Lexical Tables

There are tables of different symbols managed in semantic-lex.el. They are:

  Lexical keyword table - A table of symbols declared in a grammar
          file with the %keyword declaration.
          Keywords are used by semantic-lex-symbol-or-keyword
          to create lexical tokens based on the keyword.

  Lexical type table - A table of symbols declared in a grammar
          file with the %type declaration.
          The grammar compiler uses the type table to create new
          lexical analyzers. These analyzers are then used when
          a new lexical analyzer is made for a language.
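
A keyword table can also be built and queried by hand, which may help when experimenting. The sketch below assumes that SPECS is a list of (NAME . TOKSYM) pairs, that PROPSPECS is a list of (NAME PROPERTY VALUE) entries, and that the lookup functions consult the buffer-local variable semantic-flex-keywords-obarray. The keywords and the summary text are made up for illustration.

```elisp
(require 'semantic/lex)

(with-temp-buffer
  ;; Install a small, hand-written keyword table in this buffer.
  (setq semantic-flex-keywords-obarray
        (semantic-lex-make-keyword-table
         '(("if" . IF) ("else" . ELSE))
         '(("if" summary "if (<condition>) ..."))))
  (list (semantic-lex-keyword-p "if")      ; token class for a keyword
        (semantic-lex-keyword-p "while")   ; nil: not in the table
        (semantic-lex-keyword-get "if" 'summary)))
;; => (IF nil "if (<condition>) ...")
```

In normal use the table is generated from the %keyword declarations of a grammar rather than written by hand.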

; Lexical Types

A lexical type defines a kind of lexical analyzer that will be automatically generated from a grammar file based on some predetermined attributes. For now these two attributes are recognized :

* matchdatatype : define the kind of lexical analyzer. That is :

  - regexp : define a regexp analyzer (see
    define-lex-regex-type-analyzer)

  - string : define a string analyzer (see
    define-lex-string-type-analyzer)

  - block : define a block type analyzer (see
    define-lex-block-type-analyzer)

  - sexp : define a sexp analyzer (see
    define-lex-sexp-type-analyzer)

  - keyword : define a keyword analyzer (see
    define-lex-keyword-type-analyzer)

* syntax : define the syntax that matches a syntactic
  expression. When syntax is matched the corresponding type
  analyzer is entered and the resulting match data will be
  interpreted based on the kind of analyzer (see matchdatatype
  above).

The following lexical types are predefined :

+-------------+---------------+--------------------------------+
| type        | matchdatatype | syntax                         |
+-------------+---------------+--------------------------------+
| punctuation | string        | "\\(\\s.\\|\\s$\\|\\s'\\)+"    |
| keyword     | keyword       | "\\(\\sw\\|\\s_\\)+"           |
| symbol      | regexp        | "\\(\\sw\\|\\s_\\)+"           |
| string      | sexp          | "\\s\""                        |
| number      | regexp        | semantic-lex-number-expression |
| block       | block         | "\\s(\\|\\s)"                  |
+-------------+---------------+--------------------------------+

In a grammar you must use a %type expression to automatically generate the corresponding analyzers of that type.

Here is an example that auto-generates punctuation analyzers with 'matchdatatype and 'syntax predefined (see table above):

%type <punctuation> ;; will auto-generate this kind of analyzers

It is equivalent to write :

%type <punctuation> syntax "\\(\\s.\\|\\s$\\|\\s'\\)+" matchdatatype string

;; Some punctuation based on the type defined above

%token <punctuation> NOT "!"
%token <punctuation> NOTEQ "!="
%token <punctuation> MOD "%"
%token <punctuation> MODEQ "%="
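
For illustration, an analyzer corresponding to the declarations above might look roughly like the following when written by hand with define-lex-string-type-analyzer. This is a sketch under assumptions: the analyzer name is hypothetical, MATCHES is assumed to be an alist of (TOKEN . STRING) pairs, and DEFAULT is assumed to be the token class used when the matched text is not in that alist. The code actually emitted by the grammar compiler may differ.

```elisp
(require 'semantic/lex)

;; Hypothetical hand-written counterpart of the %type and %token
;; declarations above.
(define-lex-string-type-analyzer my-lex-punctuation-string-analyzer
  "String analyzer for <punctuation> tokens."
  "\\(\\s.\\|\\s$\\|\\s'\\)+"           ; syntax from the %type declaration
  '((NOT   . "!")                       ; one entry per %token declaration
    (NOTEQ . "!=")
    (MOD   . "%")
    (MODEQ . "%="))
  'punctuation)                         ; class for unlisted punctuation
```

Which characters the syntax expression matches depends on the syntax table of the buffer being analyzed, so the same analyzer can behave differently from one major mode to another.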


; On the Semantic 1.x lexer

In Semantic 1.x, the lexical analyzer was an all-purpose routine. To boost efficiency, the analyzer is now a series of routines that are constructed at build time into a single routine. This eliminates unneeded if statements and speeds up the lexer.

Defined variables (48)

semantic-flex-depth - Default flexing depth.
semantic-flex-enable-bol - When flexing, report beginning of lines as syntactic elements.
semantic-flex-enable-newlines - When flexing, report newlines as syntactic elements.
semantic-flex-enable-whitespace - When flexing, report whitespace as syntactic elements.
semantic-flex-extensions - Buffer local extensions to the lexical analyzer.
semantic-flex-keywords-obarray - Buffer local keyword obarray for the lexical analyzer.
semantic-flex-syntax-modifications - Changes to the syntax table for this buffer.
semantic-flex-tokens - An alist of semantic token types.
semantic-flex-unterminated-syntax-end-function - Function called when unterminated syntax is encountered.
semantic-ignore-comments - Default comment handling.
semantic-lex-analysis-bounds - The bounds of the current analysis.
semantic-lex-analyzer - The lexical analyzer used for a given buffer.
semantic-lex-beginning-of-line - Detect and create a beginning of line token (BOL).
semantic-lex-block-streams - Streams of tokens inside collapsed blocks.
semantic-lex-charquote - Detect and create charquote tokens.
semantic-lex-close-paren - Detect and create a close parenthesis token.
semantic-lex-comment-regex - Regular expression for identifying comment start during lexical analysis.
semantic-lex-comments - Detect and create a comment token.
semantic-lex-comments-as-whitespace - Detect comments and create a whitespace token.
semantic-lex-current-depth - The current depth as tracked through lexical functions.
semantic-lex-debug - When non-nil, debug the local lexical analyzer.
semantic-lex-debug-analyzers - Non-nil means to debug analyzers with syntax protection.
semantic-lex-default-action - The default action when no other lexical actions match text.
semantic-lex-depth - Default lexing depth.
semantic-lex-end-point - The end point as tracked through lexical functions.
semantic-lex-ignore-comments - Detect and create a comment token.
semantic-lex-ignore-newline - Detect and ignore newline tokens.
semantic-lex-ignore-whitespace - Detect and skip over whitespace tokens.
semantic-lex-maximum-depth - The maximum depth of parenthesis as tracked through lexical functions.
semantic-lex-newline - Detect and create newline tokens.
semantic-lex-newline-as-whitespace - Detect and create newline tokens.
semantic-lex-number - Detect and create number tokens.
semantic-lex-number-expression - Regular expression for matching a number.
semantic-lex-open-paren - Detect and create an open parenthesis token.
semantic-lex-paren-or-list - Detect open parenthesis.
semantic-lex-punctuation - Detect and create punctuation tokens.
semantic-lex-punctuation-type - Detect and create a punctuation type token.
semantic-lex-reset-functions - Abnormal hook used by major-modes to reset lexical analyzers.
semantic-lex-string - Detect and create a string token.
semantic-lex-symbol-or-keyword - Detect and create symbol and keyword tokens.
semantic-lex-syntax-modifications - Changes to the syntax table for this buffer.
semantic-lex-syntax-table - Syntax table used by lexical analysis.
semantic-lex-token-stream - The current token stream we are collecting.
semantic-lex-tokens - An alist of semantic token types.
semantic-lex-types-obarray - Buffer local types obarray for the lexical analyzer.
semantic-lex-unterminated-syntax-end-function - Function called when unterminated syntax is encountered.
semantic-lex-whitespace - Detect and create whitespace tokens.
semantic-number-expression - See variable ‘semantic-lex-number-expression’.

Defined functions (78)

define-lex(NAME DOC &rest ANALYZERS)
define-lex-analyzer(NAME DOC CONDITION &rest FORMS)
define-lex-block-analyzer(NAME DOC SPEC1 &rest SPECS)
define-lex-block-type-analyzer(NAME DOC SYNTAX MATCHES)
define-lex-keyword-type-analyzer(NAME DOC SYNTAX)
define-lex-regex-analyzer(NAME DOC REGEXP &rest FORMS)
define-lex-regex-type-analyzer(NAME DOC SYNTAX MATCHES DEFAULT)
define-lex-sexp-type-analyzer(NAME DOC SYNTAX TOKEN)
define-lex-simple-regex-analyzer(NAME DOC REGEXP TOKSYM &optional INDEX &rest FORMS)
define-lex-string-type-analyzer(NAME DOC SYNTAX MATCHES DEFAULT)
semantic-comment-lexer(START END &optional DEPTH LENGTH)
semantic-lex(START END &optional DEPTH LENGTH)
semantic-lex-beginning-of-line()
semantic-lex-buffer(&optional DEPTH)
semantic-lex-catch-errors(SYMBOL &rest FORMS)
semantic-lex-charquote()
semantic-lex-close-paren()
semantic-lex-comments()
semantic-lex-comments-as-whitespace()
semantic-lex-debug(ARG)
semantic-lex-debug-break(TOKEN)
semantic-lex-default-action()
semantic-lex-end-block(SYNTAX)
semantic-lex-expand-block-specs(SPECS)
semantic-lex-highlight-token(TOKEN)
semantic-lex-ignore-comments()
semantic-lex-ignore-newline()
semantic-lex-ignore-whitespace()
semantic-lex-init()
semantic-lex-keyword-get(NAME PROPERTY)
semantic-lex-keyword-invalid(NAME)
semantic-lex-keyword-p(NAME)
semantic-lex-keyword-put(NAME PROPERTY VALUE)
semantic-lex-keyword-set(NAME VALUE)
semantic-lex-keyword-symbol(NAME)
semantic-lex-keyword-value(NAME)
semantic-lex-keywords(&optional PROPERTY)
semantic-lex-list(SEMLIST DEPTH)
semantic-lex-make-keyword-table(SPECS &optional PROPSPECS)
semantic-lex-make-type-table(SPECS &optional PROPSPECS)
semantic-lex-map-keywords(FUN &optional PROPERTY)
semantic-lex-map-symbols(FUN TABLE &optional PROPERTY)
semantic-lex-map-types(FUN &optional PROPERTY)
semantic-lex-newline()
semantic-lex-newline-as-whitespace()
semantic-lex-number()
semantic-lex-one-token(ANALYZERS)
semantic-lex-open-paren()
semantic-lex-paren-or-list()
semantic-lex-preset-default-types()
semantic-lex-punctuation()
semantic-lex-punctuation-type()
semantic-lex-push-token(TOKEN &rest BLOCKSPECS)
semantic-lex-start-block(SYNTAX)
semantic-lex-string()
semantic-lex-symbol-or-keyword()
semantic-lex-test(ARG)
semantic-lex-token(SYMBOL START END &optional STR)
semantic-lex-token-bounds(TOKEN)
semantic-lex-token-class(TOKEN)
semantic-lex-token-end(TOKEN)
semantic-lex-token-p(THING)
semantic-lex-token-start(TOKEN)
semantic-lex-token-text(TOKEN)
semantic-lex-token-with-text-p(THING)
semantic-lex-token-without-text-p(THING)
semantic-lex-type-get(TYPE PROPERTY &optional NOERROR)
semantic-lex-type-invalid(TYPE)
semantic-lex-type-p(TYPE)
semantic-lex-type-put(TYPE PROPERTY VALUE &optional ADD)
semantic-lex-type-set(TYPE VALUE)
semantic-lex-type-symbol(TYPE)
semantic-lex-type-value(TYPE &optional NOERROR)
semantic-lex-types(&optional PROPERTY)
semantic-lex-unterminated-syntax-detected(SYNTAX)
semantic-lex-unterminated-syntax-protection(SYNTAX &rest FORMS)
semantic-lex-whitespace()
semantic-simple-lexer(START END &optional DEPTH LENGTH)

Defined faces (0)