Function: define-coding-system
define-coding-system is a byte-compiled function defined in
mule.el.gz.
Signature
(define-coding-system NAME DOCSTRING &rest PROPS)
Documentation
Define NAME (a symbol) as a coding system with DOCSTRING and attributes.
The remaining arguments must come in pairs ATTRIBUTE VALUE. ATTRIBUTE may be any symbol.
A coding system specifies a rule to decode (i.e. to convert a byte sequence to a character sequence) and a rule to encode (the opposite of decoding).
Decoding takes at most three steps. The first step converts a byte sequence to a character sequence using one of Emacs's internal routines, selected by the :coding-type attribute. The optional second step converts that character sequence (the result of the first step) using the translation table specified by the :decode-translation-table attribute. The optional third step converts the result using the Lisp function specified by the :post-read-conversion attribute.
Encoding also takes at most three steps, which are the reverse of the decoding steps. The optional first step converts a character sequence to another character sequence using the Lisp function specified by the :pre-write-conversion attribute. The optional second step converts the result using the translation table specified by the :encode-translation-table attribute. The third step converts the result to a byte sequence using one of Emacs's internal routines, selected by the :coding-type attribute.
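As a small illustration of the two directions, the following sketch round-trips one character through the built-in utf-8 coding system with encode-coding-string and decode-coding-string. It assumes the stock utf-8 definition, which sets no translation tables or conversion functions, so only the :coding-type step is exercised.

```elisp
;; Encoding: character sequence -> byte sequence (a unibyte string).
(let ((bytes (encode-coding-string "é" 'utf-8)))
  (list (multibyte-string-p bytes)   ; nil: the result holds raw bytes
        (append bytes nil)))         ; the two bytes #xC3 #xA9
;; => (nil (195 169))

;; Decoding: byte sequence -> character sequence.
(decode-coding-string (unibyte-string #xC3 #xA9) 'utf-8)
;; => "é"
```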
The following attributes have special meanings. Those labeled as
"(required)" should not be omitted.
:mnemonic (required)
VALUE is a character to display on the mode line for the coding system.
:coding-type (required)
VALUE specifies the format of the byte sequences that the coding
system decodes from and encodes to. It must be one of charset, utf-8,
utf-16, iso-2022, emacs-mule, shift-jis, ccl,
raw-text, or undecided.
If VALUE is charset, the coding system is for handling a
byte sequence in which each byte or every two- to four-byte
sequence represents a character code of a charset specified
by the :charset-list attribute.
If VALUE is utf-8, the coding system is for handling Unicode
UTF-8 byte sequences. See also the documentation of the
attribute :bom.
If VALUE is utf-16, the coding system is for handling Unicode
UTF-16 byte sequences. See also the documentation of the
attributes :bom and :endian.
If VALUE is iso-2022, the coding system is for handling byte
sequences conforming to ISO/IEC 2022. See also the documentation
of the attributes :charset-list, :flags, and :designation.
If VALUE is emacs-mule, the coding system is for handling
byte sequences which Emacs 20 and 21 used for their internal
representation of characters.
If VALUE is shift-jis, the coding system is for handling byte
sequences of Shift_JIS format. See also the attribute :charset-list.
If VALUE is ccl, the coding system uses CCL programs to decode
and encode byte sequences. The CCL programs must be
specified by the attributes :ccl-decoder and :ccl-encoder.
If VALUE is raw-text, the coding system decodes byte sequences
without any conversions.
:eol-type
VALUE is the EOL (end-of-line) format of the coding system. It must be
one of unix, dos, mac. The symbol unix means Unix-like EOL
(i.e., a single LF character), dos means DOS-like EOL (i.e., a sequence
of CR followed by LF), and mac means Mac-like EOL (i.e., a single CR).
If omitted, Emacs detects the EOL format automatically when decoding.
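The effect of the three EOL formats can be seen with the built-in utf-8 family, used here only as a convenient stand-in for a coding system defined with or without :eol-type. This is a sketch for evaluation in a scratch buffer, not part of the function's documentation.

```elisp
;; Decoding with a -dos variant turns CR LF into a single newline.
(decode-coding-string "one\r\ntwo" 'utf-8-dos)  ; => "one\ntwo"

;; Encoding writes the EOL format of the chosen variant.
(encode-coding-string "one\ntwo" 'utf-8-dos)    ; => "one\r\ntwo"
(encode-coding-string "one\ntwo" 'utf-8-mac)    ; => "one\rtwo"

;; The EOL type is reported as 0 (unix), 1 (dos), or 2 (mac) ...
(coding-system-eol-type 'utf-8-unix)  ; => 0
(coding-system-eol-type 'utf-8-dos)   ; => 1
;; ... or as a vector of subsidiary coding systems when the format
;; is left to automatic detection.
(coding-system-eol-type 'utf-8)       ; => [utf-8-unix utf-8-dos utf-8-mac]
```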
:charset-list (required if :coding-type is charset or shift-jis)
VALUE must be a list of charsets supported by the coding system.
If :coding-type is charset, then when decoding and encoding with the
coding system, if a character belongs to multiple charsets in the
list, the charset that comes first in the list is selected.
If :coding-type is iso-2022, VALUE may be iso-2022, which
indicates that the coding system supports all ISO-2022 based
charsets.
If :coding-type is shift-jis, VALUE must be a list of three
or four charsets supported by the Shift_JIS encoding scheme. The
first charset (one-dimensional) is for code space 0x00..0x7F, the
second (one-dimensional) for 0xA1..0xDF, the third (two-dimensional)
for 0x8140..0xEFFC, and the optional fourth (three-dimensional) for
0xF040..0xFCFC.
If :coding-type is emacs-mule, VALUE may be emacs-mule,
which indicates that the coding system supports all charsets that
have the :emacs-mule-id property.
:ascii-compatible-p
If VALUE is non-nil, the coding system decodes all 7-bit bytes into the corresponding ASCII characters, and encodes all ASCII characters back to the corresponding 7-bit bytes. VALUE defaults to nil.
:decode-translation-table
VALUE must be a translation table to use on decoding.
:encode-translation-table
VALUE must be a translation table to use on encoding.
:post-read-conversion
VALUE must be a function to call after some text is inserted and
decoded by the coding system itself and before any functions in
after-insert-file-functions are called. This function is passed one
argument: the number of characters in the text to convert, with
point at the start of the text. The function should leave point
and the match data unchanged, and should return the new character
count. Note that this function should avoid reading from files
or receiving text from subprocesses -- anything that could invoke
decoding; if it must do so, it should bind
coding-system-for-read to a value other than the current
coding-system, to avoid infinite recursion.
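A :post-read-conversion function following the contract above might look like the sketch below. The names my-demo-tabs-to-spaces and my-demo-utf-8-tabless are hypothetical, and the tab replacement is only a placeholder for whatever conversion is actually needed. The function narrows to the decoded text, preserves point and the match data, and returns the character count.

```elisp
(defun my-demo-tabs-to-spaces (len)
  "Replace tabs with spaces in the LEN characters after point.
Return the character count of the converted text."
  (save-excursion
    (save-match-data
      (save-restriction
        (narrow-to-region (point) (+ (point) len))
        (subst-char-in-region (point-min) (point-max) ?\t ?\s)
        ;; A one-for-one replacement keeps the length, but computing it
        ;; from the region also suits conversions that change the length.
        (- (point-max) (point-min))))))

;; Hypothetical coding system: UTF-8 plus the conversion above.
(define-coding-system 'my-demo-utf-8-tabless
  "UTF-8 variant that turns tabs into spaces after decoding (demo)."
  :coding-type 'utf-8
  :mnemonic ?U
  :charset-list '(unicode)
  :post-read-conversion #'my-demo-tabs-to-spaces)

;; Try it on a short byte string.
(decode-coding-string "a\tb" 'my-demo-utf-8-tabless)
```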
:pre-write-conversion
VALUE must be a function to call after all functions in
write-region-annotate-functions and buffer-file-format are
called, and before the text is encoded by the coding system
itself. This function should convert the whole text in the
current buffer, and leave the match data unchanged. For backward
compatibility, this function is passed two arguments which can be
ignored. Note that this function should avoid writing to files
or sending text to subprocesses -- anything that could invoke
encoding; if it must do so, it should bind
coding-system-for-write to a value other than the current
coding-system, to avoid infinite recursion.
:default-char
VALUE must be a character. On encoding, characters that are not supported by the coding system are each replaced with VALUE. If not specified, the default is the space character #x20.
:for-unibyte
VALUE non-nil means that visiting a file with the coding system results in a unibyte buffer.
:mime-charset
VALUE must be a symbol whose name is that of a MIME charset converted to lower case.
:mime-text-unsuitable
VALUE non-nil means the :mime-charset property names a charset which is unsuitable for the top-level media of type "text".
:flags
VALUE must be a list of symbols that control the ISO-2022 converter.
Each must be a member of the list coding-system-iso-2022-flags
(which see). This attribute is meaningful only when :coding-type
is iso-2022.
:designation
VALUE must be a vector [G0-USAGE G1-USAGE G2-USAGE G3-USAGE]. GN-USAGE specifies the usage of graphic register GN as follows.
If it is nil, no charset can be designated to GN.
If it is a charset, the charset is initially designated to GN, and never used by the other charsets.
If it is a list, the elements must be charsets, nil, 94, or 96. GN
can be used by all the listed charsets. If the list contains 94, any
iso-2022 charset whose code-space ranges are 94 long can be designated
to GN. If the list contains 96, any charsets whose ranges are
96 long can be designated to GN. If the first element is a charset,
that charset is initially designated to GN.
This attribute is meaningful only when :coding-type is iso-2022.
:bom
This attribute specifies whether the coding system uses a "byte order
mark" (BOM). VALUE must be nil, t, or a cons cell of coding systems whose
:coding-type is utf-16 or utf-8.
If the value is nil, on decoding, don't treat the first two bytes as a BOM, and on encoding, don't produce BOM bytes.
If the value is t, on decoding, skip the first two bytes as a BOM, and on encoding, produce BOM bytes according to the value of :endian.
If the value is a cons cell, on decoding, check the first two bytes. If they are 0xFE 0xFF, use the car part coding system of the value. If they are 0xFF 0xFE, use the cdr part coding system of the value. Otherwise, treat them as bytes for a normal character. On encoding, produce BOM bytes according to the value of :endian.
This attribute is meaningful only when :coding-type is utf-16 or
utf-8.
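To see which bytes the BOM settings produce, the sketch below encodes a single character with several built-in coding systems and lists the resulting bytes. The two-byte marks apply to UTF-16; the UTF-8 signature is the three bytes 0xEF 0xBB 0xBF. The coding system names are the stock ones shipped with Emacs, assumed to be present.

```elisp
;; No BOM is written by plain utf-8.
(append (encode-coding-string "a" 'utf-8) nil)
;; => (97)

;; The UTF-8 signature is three bytes: #xEF #xBB #xBF.
(append (encode-coding-string "a" 'utf-8-with-signature) nil)
;; => (239 187 191 97)

;; UTF-16 writes a two-byte BOM whose order follows the endianness.
(append (encode-coding-string "a" 'utf-16be-with-signature) nil)
;; => (254 255 0 97)
(append (encode-coding-string "a" 'utf-16le-with-signature) nil)
;; => (255 254 97 0)

;; On decoding, the BOM is skipped rather than returned as text.
(decode-coding-string (unibyte-string #xFE #xFF #x00 #x61)
                      'utf-16be-with-signature)
;; => "a"
```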
:endian
VALUE must be big or little, specifying big-endian or
little-endian byte order respectively. The default value is big.
Changing this attribute is only meaningful when :coding-type
is utf-16.
:ccl-decoder (required if :coding-type is ccl)
VALUE is a CCL program name defined by define-ccl-program. The
CCL program reads a byte sequence and writes a character sequence
as a decoding result.
:ccl-encoder (required if :coding-type is ccl)
VALUE is a CCL program name defined by define-ccl-program. The
CCL program reads a character sequence and writes a byte sequence
as an encoding result.
:inhibit-null-byte-detection
VALUE non-nil means Emacs should ignore null bytes on code detection.
See the variable inhibit-null-byte-detection. This attribute
is meaningful only when :coding-type is undecided.
If VALUE is t, Emacs will ignore null bytes unconditionally while
detecting encoding. If VALUE is non-nil and not t, Emacs will
ignore null bytes if inhibit-null-byte-detection is non-nil.
:inhibit-iso-escape-detection
VALUE non-nil means Emacs should ignore ISO-2022 escape sequences on
code detection. See the variable inhibit-iso-escape-detection.
This attribute is meaningful only when :coding-type is
undecided.
If VALUE is t, Emacs will ignore escape sequences unconditionally
while detecting encoding. If VALUE is non-nil and not t, Emacs
will ignore escape sequences if inhibit-iso-escape-detection is
non-nil.
:prefer-utf-8
VALUE non-nil means Emacs prefers UTF-8 on code detection for
non-ASCII files. This attribute is meaningful only when
:coding-type is undecided.
Probably introduced at or before Emacs version 23.1.
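A minimal call might look like the following sketch. It supplies the two required attributes plus the :charset-list that the charset coding type needs. The name my-demo-latin-1 is hypothetical. Because :eol-type is omitted, the EOL format is left to detection and the -unix, -dos, and -mac variants are created alongside the base coding system.

```elisp
;; Hypothetical single-byte coding system built on the iso-8859-1 charset.
(define-coding-system 'my-demo-latin-1
  "Hypothetical single-byte coding system covering ISO-8859-1 (demo)."
  :coding-type 'charset
  :mnemonic ?1
  :charset-list '(iso-8859-1))

;; Each character maps to the single byte given by its charset code.
(append (encode-coding-string "é" 'my-demo-latin-1) nil)
;; => (233)
(decode-coding-string (unibyte-string 233) 'my-demo-latin-1)
;; => "é"

;; With :eol-type omitted, the three EOL variants are defined as well.
(coding-system-eol-type 'my-demo-latin-1)
;; => [my-demo-latin-1-unix my-demo-latin-1-dos my-demo-latin-1-mac]
```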
Source Code
;; Defined in /usr/src/emacs/lisp/international/mule.el.gz
(defun define-coding-system (name docstring &rest props)
"Define NAME (a symbol) as a coding system with DOCSTRING and attributes.
The remaining arguments must come in pairs ATTRIBUTE VALUE. ATTRIBUTE
may be any symbol.
A coding system specifies a rule to decode (i.e. to convert a
byte sequence to a character sequence) and a rule to encode (the
opposite of decoding).
The decoding is done by at most 3 steps; the first is to convert
a byte sequence to a character sequence by one of Emacs'
internal routines specified by `:coding-type' attribute. The
optional second step is to convert the character sequence (the
result of the first step) by a translation table specified
by `:decode-translation-table' attribute. The optional third step
is to convert the above result by a Lisp function specified
by `:post-read-conversion' attribute.
The encoding is done by at most 3 steps, which are the reverse
of the decoding steps. The optional first step converts a
character sequence to another character sequence by a Lisp
function specified by `:pre-write-conversion' attribute. The
optional second step converts the above result by a translation
table specified by `:encode-translation-table' attribute. The
third step converts the above result to a byte sequence by one
of the Emacs's internal routines specified by the `:coding-type'
attribute.
The following attributes have special meanings. Those labeled as
\"(required)\" should not be omitted.
`:mnemonic' (required)
VALUE is a character to display on mode line for the coding system.
`:coding-type' (required)
VALUE specifies the format of byte sequence the coding system
decodes and encodes to. It must be one of `charset', `utf-8',
`utf-16', `iso-2022', `emacs-mule', `shift-jis', `ccl',
`raw-text', `undecided'.
If VALUE is `charset', the coding system is for handling a
byte sequence in which each byte or every two- to four-byte
sequence represents a character code of a charset specified
by the `:charset-list' attribute.
If VALUE is `utf-8', the coding system is for handling Unicode
UTF-8 byte sequences. See also the documentation of the
attribute `:bom'.
If VALUE is `utf-16', the coding system is for handling Unicode
UTF-16 byte sequences. See also the documentation of the
attributes `:bom' and `:endian'.
If VALUE is `iso-2022', the coding system is for handling byte
sequences conforming to ISO/IEC 2022. See also the documentation
of the attributes `:charset-list', `:flags', and `:designation'.
If VALUE is `emacs-mule', the coding system is for handling
byte sequences which Emacs 20 and 21 used for their internal
representation of characters.
If VALUE is `shift-jis', the coding system is for handling byte
sequences of Shift_JIS format. See also the attribute `:charset-list'.
If VALUE is `ccl', the coding system uses CCL programs to decode
and encode byte sequences. The CCL programs must be
specified by the attributes `:ccl-decoder' and `:ccl-encoder'.
If VALUE is `raw-text', the coding system decodes byte sequences
without any conversions.
`:eol-type'
VALUE is the EOL (end-of-line) format of the coding system. It must be
one of `unix', `dos', `mac'. The symbol `unix' means Unix-like EOL
\(i.e., a single LF character), `dos' means DOS-like EOL \(i.e., a sequence
of CR followed by LF), and `mac' means Mac-like EOL \(i.e., a single CR).
If omitted, Emacs detects the EOL format automatically when decoding.
`:charset-list' (required if `:coding-type' is `charset' or `shift-jis')
VALUE must be a list of charsets supported by the coding system.
If `:coding-type' is `charset', then on decoding and encoding by the
coding system, if a character belongs to multiple charsets in the
list, a charset that comes first in the list is selected.
If `:coding-type' is `iso-2022', VALUE may be `iso-2022', which
indicates that the coding system supports all ISO-2022 based
charsets.
If `:coding-type' is `shift-jis', VALUE must be a list of three
to four charsets supported by Shift_JIS encoding scheme. The
first charset (one dimension) is for code space 0x00..0x7F, the
second (one dimension) for 0xA1..0xDF, the third (two dimension)
for 0x8140..0xEFFC, the optional fourth (three dimension) for
0xF040..0xFCFC.
If `:coding-type' is `emacs-mule', VALUE may be `emacs-mule',
which indicates that the coding system supports all charsets that
have the `:emacs-mule-id' property.
`:ascii-compatible-p'
If VALUE is non-nil, the coding system decodes all 7-bit bytes into
the corresponding ASCII characters, and encodes all ASCII characters
back to the corresponding 7-bit bytes. VALUE defaults to nil.
`:decode-translation-table'
VALUE must be a translation table to use on decoding.
`:encode-translation-table'
VALUE must be a translation table to use on encoding.
`:post-read-conversion'
VALUE must be a function to call after some text is inserted and
decoded by the coding system itself and before any functions in
`after-insert-file-functions' are called. This function is passed one
argument: the number of characters in the text to convert, with
point at the start of the text. The function should leave point
and the match data unchanged, and should return the new character
count. Note that this function should avoid reading from files
or receiving text from subprocesses -- anything that could invoke
decoding; if it must do so, it should bind
`coding-system-for-read' to a value other than the current
coding-system, to avoid infinite recursion.
`:pre-write-conversion'
VALUE must be a function to call after all functions in
`write-region-annotate-functions' and `buffer-file-format' are
called, and before the text is encoded by the coding system
itself. This function should convert the whole text in the
current buffer, and leave the match data unchanged. For backward
compatibility, this function is passed two arguments which can be
ignored. Note that this function should avoid writing to files
or sending text to subprocesses -- anything that could invoke
encoding; if it must do so, it should bind
`coding-system-for-write' to a value other than the current
coding-system, to avoid infinite recursion.
`:default-char'
VALUE must be a character. On encoding, characters that are not
supported by the coding system are each replaced with VALUE. If
not specified, the default is the space character #x20.
`:for-unibyte'
VALUE non-nil means that visiting a file with the coding system
results in a unibyte buffer.
`:mime-charset'
VALUE must be a symbol whose name is that of a MIME charset converted
to lower case.
`:mime-text-unsuitable'
VALUE non-nil means the `:mime-charset' property names a charset which
is unsuitable for the top-level media of type \"text\".
`:flags'
VALUE must be a list of symbols that control the ISO-2022 converter.
Each must be a member of the list `coding-system-iso-2022-flags'
\(which see). This attribute is meaningful only when `:coding-type'
is `iso-2022'.
`:designation'
VALUE must be a vector [G0-USAGE G1-USAGE G2-USAGE G3-USAGE].
GN-USAGE specifies the usage of graphic register GN as follows.
If it is nil, no charset can be designated to GN.
If it is a charset, the charset is initially designated to GN, and
never used by the other charsets.
If it is a list, the elements must be charsets, nil, 94, or 96. GN
can be used by all the listed charsets. If the list contains 94, any
iso-2022 charset whose code-space ranges are 94 long can be designated
to GN. If the list contains 96, any charsets whose ranges are
96 long can be designated to GN. If the first element is a charset,
that charset is initially designated to GN.
This attribute is meaningful only when `:coding-type' is `iso-2022'.
`:bom'
This attribute specifies whether the coding system uses a \"byte order
mark\". VALUE must be nil, t, or a cons cell of coding systems whose
`:coding-type' is `utf-16' or `utf-8'.
If the value is nil, on decoding, don't treat the first two bytes as
BOM, and on encoding, don't produce BOM bytes.
If the value is t, on decoding, skip the first two bytes as BOM, and on
encoding, produce BOM bytes according to the value of `:endian'.
If the value is a cons cell, on decoding, check the first two bytes.
If they are 0xFE 0xFF, use the car part coding system of the value.
If they are 0xFF 0xFE, use the cdr part coding system of the value.
Otherwise, treat them as bytes for a normal character. On encoding,
produce BOM bytes according to the value of `:endian'.
This attribute is meaningful only when `:coding-type' is `utf-16' or
`utf-8'.
`:endian'
VALUE must be `big' or `little' specifying big-endian and
little-endian respectively. The default value is `big'.
Changing this attribute is only meaningful when `:coding-type'
is `utf-16'.
`:ccl-decoder' (required if `:coding-type' is `ccl')
VALUE is a CCL program name defined by `define-ccl-program'. The
CCL program reads a byte sequence and writes a character sequence
as a decoding result.
`:ccl-encoder' (required if `:coding-type' is `ccl')
VALUE is a CCL program name defined by `define-ccl-program'. The
CCL program reads a character sequence and writes a byte sequence
as an encoding result.
`:inhibit-null-byte-detection'
VALUE non-nil means Emacs should ignore null bytes on code detection.
See the variable `inhibit-null-byte-detection'. This attribute
is meaningful only when `:coding-type' is `undecided'.
If VALUE is t, Emacs will ignore null bytes unconditionally while
detecting encoding. If VALUE is non-nil and not t, Emacs will
ignore null bytes if `inhibit-null-byte-detection' is non-nil.
`:inhibit-iso-escape-detection'
VALUE non-nil means Emacs should ignore ISO-2022 escape sequences on
code detection. See the variable `inhibit-iso-escape-detection'.
This attribute is meaningful only when `:coding-type' is
`undecided'.
If VALUE is t, Emacs will ignore escape sequences unconditionally
while detecting encoding. If VALUE is non-nil and not t, Emacs
will ignore escape sequences if `inhibit-iso-escape-detection' is
non-nil.
`:prefer-utf-8'
VALUE non-nil means Emacs prefers UTF-8 on code detection for
non-ASCII files. This attribute is meaningful only when
`:coding-type' is `undecided'."
  (declare (indent defun))
  (let* ((common-attrs (mapcar 'list
                               '(:mnemonic
                                 :coding-type
                                 :charset-list
                                 :ascii-compatible-p
                                 :decode-translation-table
                                 :encode-translation-table
                                 :post-read-conversion
                                 :pre-write-conversion
                                 :default-char
                                 :for-unibyte
                                 :plist
                                 :eol-type)))
         (coding-type (plist-get props :coding-type))
         (spec-attrs (mapcar 'list
                             (cond ((eq coding-type 'iso-2022)
                                    '(:initial
                                      :reg-usage
                                      :request
                                      :flags))
                                   ((eq coding-type 'utf-8)
                                    '(:bom))
                                   ((eq coding-type 'utf-16)
                                    '(:bom
                                      :endian))
                                   ((eq coding-type 'ccl)
                                    '(:ccl-decoder
                                      :ccl-encoder
                                      :valids))
                                   ((eq coding-type 'undecided)
                                    '(:inhibit-null-byte-detection
                                      :inhibit-iso-escape-detection
                                      :prefer-utf-8))))))
    (dolist (slot common-attrs)
      (setcdr slot (plist-get props (car slot))))
    (dolist (slot spec-attrs)
      (setcdr slot (plist-get props (car slot))))
    (if (eq coding-type 'iso-2022)
        (let ((designation (plist-get props :designation))
              (flags (plist-get props :flags))
              (initial (make-vector 4 nil))
              (reg-usage (cons 4 4))
              request elt)
          (dotimes (i 4)
            (setq elt (aref designation i))
            (cond ((charsetp elt)
                   (aset initial i elt)
                   (setq request (cons (cons elt i) request)))
                  ((consp elt)
                   (aset initial i (car elt))
                   (if (charsetp (car elt))
                       (setq request (cons (cons (car elt) i) request)))
                   (dolist (e (cdr elt))
                     (cond ((charsetp e)
                            (setq request (cons (cons e i) request)))
                           ((eq e 94)
                            (setcar reg-usage i))
                           ((eq e 96)
                            (setcdr reg-usage i))
                           ((eq e t)
                            (setcar reg-usage i)
                            (setcdr reg-usage i)))))))
          (setcdr (assq :initial spec-attrs) initial)
          (setcdr (assq :reg-usage spec-attrs) reg-usage)
          (setcdr (assq :request spec-attrs) request)
          ;; Change :flags value from a list to a bit-mask.
          (let ((bits 0)
                (i 0))
            (dolist (elt coding-system-iso-2022-flags)
              (if (memq elt flags)
                  (setq bits (logior bits (ash 1 i))))
              (setq i (1+ i)))
            (setcdr (assq :flags spec-attrs) bits))))
    ;; Add :name and :docstring properties to PROPS.
    (setq props
          (cons :name (cons name (cons :docstring (cons docstring props)))))
    (setcdr (assq :plist common-attrs) props)
    (apply #'define-coding-system-internal
           name (mapcar #'cdr (append common-attrs spec-attrs)))))