Function: rx--normalise-char-pattern

rx--normalise-char-pattern is a byte-compiled function defined in rx.el.gz.

Signature

(rx--normalise-char-pattern FORM)

Documentation

Normalize FORM as a pattern matching a single character.

Characters become strings, `any' forms and character classes become `rx--char-alt' forms, user-definitions and `eval' forms are expanded, and `or', `not' and `intersection' forms are normalized recursively.

A `rx--char-alt' form is shaped (rx--char-alt INTERVALS . CLASSES) where INTERVALS is a sorted list of disjoint nonadjacent intervals, each a cons of characters, and CLASSES an unordered list of unique name-normalised character classes.
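
The rules above can be tried out with a few calls. This is a minimal sketch, assuming an Emacs whose rx.el defines this internal function as listed under Source Code below; internal helpers may change or disappear between versions. The results in the comments follow from that listing. The exact INTERVALS and CLASSES contents depend on `rx--parse-any' and `rx--char-classes', which are not shown here, so for those cases only the head of the result is given.

```elisp
;; Sketch only: `rx--normalise-char-pattern' is an internal helper of rx.el
;; and may be missing or behave differently in other Emacs versions.
(require 'rx)

;; A character becomes a one-character string.
(rx--normalise-char-pattern ?a)             ; => "a"

;; `|' is rewritten to `or', and the arguments are normalised recursively.
(rx--normalise-char-pattern '(| ?a "b"))    ; => (or "a" "b")

;; `not' keeps its operator; only the argument is normalised.
(rx--normalise-char-pattern '(not ?a))      ; => (not "a")

;; `any' forms and character-class symbols both yield `rx--char-alt' forms.
(car (rx--normalise-char-pattern '(any "a-c" digit)))  ; => rx--char-alt
(car (rx--normalise-char-pattern 'digit))              ; => rx--char-alt

;; Other built-in forms are returned unchanged.
(rx--normalise-char-pattern '(seq "a" "b")) ; => (seq "a" "b")
```

Because the function only recurses through `or', `not' and `intersection', a character nested inside any other built-in form such as `seq' is left as it is.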

Source Code

;; Defined in /usr/src/emacs/lisp/emacs-lisp/rx.el.gz
;; FIXME: flatten nested `or' patterns when performing char-pattern combining.
;; The only reason for not flattening is to ensure regexp-opt processing
;; (which we do for entire `or' patterns, not subsequences), but we
;; obviously want to translate
;;   (or "a" space (or "b" (+ nonl) word) "c")
;;   -> (or (in "ab" space) (+ nonl) (in "c" word))

;; FIXME: normalise `seq', both the construct and implicit sequences,
;; so that they are flattened, adjacent strings concatenated, and
;; empty strings removed. That would give more opportunities for regexp-opt:
;;  (or "a" (seq "ab" (seq "c" "d") "")) -> (or "a" "abcd")

;; FIXME: Since `rx--normalise-char-pattern' recurses through `or', `not' and
;; `intersection', we may end up normalising subtrees multiple times
;; which wastes time (but should be idempotent).
;; One way to avoid this is to aggressively normalise the entire tree
;; before translating anything at all, but we must then recurse through
;; all constructs and probably copy them.
;; Such normalisation could normalise synonyms, eliminate `minimal-match'
;; and `maximal-match' and convert affected `1+' to either `+' or `+?' etc.
;; We would also consolidate the user-def lookup, both modern and legacy,
;; in one place.

(defun rx--normalise-char-pattern (form)
  "Normalize FORM as a pattern matching a single-character.
Characters become strings, `any' forms and character classes become
`rx--char-alt' forms, user-definitions and `eval' forms are expanded,
and `or', `not' and `intersection' forms are normalized recursively.

A `rx--char-alt' form is shaped (rx--char-alt INTERVALS . CLASSES)
where INTERVALS is a sorted list of disjoint nonadjacent intervals,
each a cons of characters, and CLASSES an unordered list of unique
name-normalised character classes."
  (defvar rx--builtin-forms)
  (defvar rx--builtin-symbols)
  (cond ((consp form)
         (let ((op (car form))
               (body (cdr form)))
           (cond ((memq op '(or |))
                  ;; Normalise the constructor to `or' and the args recursively.
                  (cons 'or (mapcar #'rx--normalise-char-pattern body)))
                 ;; Convert `any' forms and char classes now so that we
                 ;; don't need to do it later on.
                 ((memq op '(any in char))
                  (cons 'rx--char-alt (rx--parse-any body)))
                 ((memq op '(not intersection))
                  (cons op (mapcar #'rx--normalise-char-pattern body)))
                 ((eq op 'eval)
                  (rx--normalise-char-pattern (rx--expand-eval body)))
                 ((memq op rx--builtin-forms) form)
                 ((let ((expanded (rx--expand-def-form form)))
                    (and expanded
                         (rx--normalise-char-pattern expanded))))
                 (t form))))
        ;; FIXME: Should we expand legacy definitions from
        ;; `rx-constituents' here as well?
        ((symbolp form)
         (cond ((let ((class (assq form rx--char-classes)))
                  (and class
                       `(rx--char-alt nil . (,(cdr class))))))
               ((memq form rx--builtin-symbols) form)
               ((let ((expanded (rx--expand-def-symbol form)))
                  (and expanded
                       (rx--normalise-char-pattern expanded))))
               (t form)))
        ((characterp form)
         (char-to-string form))
        (t form)))