File: ucs-normalize.el.html
This program passes the NormalizationTest-5.2.0.txt conformance test.
References: https://www.unicode.org/reports/tr15/ https://www.unicode.org/review/pr-29.html
HFS-Normalization: Reference: https://developer.apple.com/library/archive/technotes/tn/tn1150.html
HFS-Normalization excludes the following ranges from decomposition:
U+02000 .. U+02FFF :: Punctuation, symbols, dingbats, arrows, etc.
(Characters in this region will be composed.)
U+0F900 .. U+0FAFF :: CJK Compatibility Ideographs.
U+2F800 .. U+2FFFF :: CJK Compatibility Ideographs.
HFS-Normalization is useful for normalizing text involving CJK Ideographs.
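The exclusion rule above can be illustrated with a minimal sketch. It uses Python's standard `unicodedata` module rather than this library's own tables, and the names `HFS_EXCLUDED_RANGES`, `is_hfs_excluded` and `hfs_decompose_sketch` are hypothetical. It only shows how the excluded ranges are skipped during decomposition; it does not reorder combining marks across character boundaries, which a full normalizer would also do.

```python
import unicodedata

# Ranges excluded from decomposition, copied from the list above.
HFS_EXCLUDED_RANGES = ((0x2000, 0x2FFF), (0xF900, 0xFAFF), (0x2F800, 0x2FFFF))


def is_hfs_excluded(ch):
    """Return True if ch lies in one of the excluded ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in HFS_EXCLUDED_RANGES)


def hfs_decompose_sketch(text):
    """Canonically decompose each character unless it is excluded.

    Simplification (assumption of this sketch): combining marks are not
    reordered across character boundaries.
    """
    return "".join(
        ch if is_hfs_excluded(ch) else unicodedata.normalize("NFD", ch)
        for ch in text
    )
```

With this sketch, U+F900 is left untouched, whereas plain NFD maps it to U+8C48; an ordinary precomposed letter such as U+00E9 is still decomposed.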
Implementation Notes on NFC/HFS-NFC.
<Stages>   Decomposition   Composition
NFD:       'nfd            nil
NFC:       'nfd            t
NFKD:      'nfkd           nil
NFKC:      'nfkd           t
HFS-NFD:   'hfs-nfd        'hfs-nfd-comp-p
HFS-NFC:   'hfs-nfd        t
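The two-stage structure in the table can be sketched as follows: each form is a decomposition stage, optionally followed by a composition stage. This is an illustration using Python's standard `unicodedata` module, not the library's code; `STAGES` and `normalize_by_stages` are hypothetical names, and the HFS rows are left out because `unicodedata` has no HFS variant.

```python
import unicodedata

# form -> (decomposition stage, whether a composition stage follows)
STAGES = {
    "NFD": ("NFD", False),
    "NFC": ("NFD", True),
    "NFKD": ("NFKD", False),
    "NFKC": ("NFKD", True),
}


def normalize_by_stages(form, text):
    """Normalize text by running the decomposition and composition stages separately."""
    decomposition, compose = STAGES[form]
    decomposed = unicodedata.normalize(decomposition, text)
    if not compose:
        return decomposed
    # On fully decomposed input, NFC only performs canonical composition.
    return unicodedata.normalize("NFC", decomposed)
```

Running the stages separately should give the same result as asking `unicodedata.normalize` for the form directly, which is what makes the table a useful way to organize the six variants.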
Algorithm for Normalization
Before normalization, the following data are prepared.
1. quick-check-list
quick-check-list consists of characters that will be decomposed
during normalization. It includes composition-exclusions,
singletons, non-starter-decompositions and decomposable
characters.
quick-check-regexp matches the above characters plus
combining characters.
2. decomposition-translation
decomposition-translation is a translation table that will be
used to decompose the characters.
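A minimal sketch of preparing these two pieces of data, using Python's standard `unicodedata` and `re` modules instead of the library's own tables. The function name `build_tables` and the default code point range are assumptions made to keep the example small; the library covers the full Unicode range.

```python
import re
import unicodedata


def build_tables(first=0x00A0, last=0x036F):
    """Build a quick-check pattern and a decomposition table for a small range."""
    quick_check = []   # characters that change under canonical decomposition
    combining = []     # characters with a non-zero combining class
    translation = {}   # code point -> decomposed string, usable with str.translate
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        if decomposed != ch:
            quick_check.append(ch)
            translation[cp] = decomposed
        if unicodedata.combining(ch):
            combining.append(ch)
    # One character class that matches anything that may need normalization.
    char_class = "".join(re.escape(c) for c in quick_check + combining)
    return re.compile("[" + char_class + "]"), translation
```

The pattern lets a scan skip over text that cannot change, and the table can then be applied to the remaining spans, for example with `text.translate(translation)`.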
Normalization Process
A. Searching (ucs-normalize-region)
The region is searched with quick-check-regexp to find the next
point that may need normalization.
B. Identification of Normalization Block
(1) start of the block
If the matched character is a starter and does not combine
with the previous character, the block begins at the
matched character. If the matched character is a combining
character, the block begins at the previous character.
(2) end of the block
The block ends at a non-composable starter character.
C. Decomposition (ucs-normalize-block)
The entire block is decomposed using the
decomposition-translation table.
D. Sorting and Composition of Smaller Blocks (ucs-normalize-block-compose-chars)
The block is split into multiple smaller blocks at starter
characters. Each smaller block is sorted, and composed if necessary.
E. Composition of Entire Block (ucs-normalize-compose-chars)
The composed blocks are collected and composed again.
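Steps C to E can be sketched roughly as below. This is an illustration in Python with the standard `unicodedata` module, not the library's implementation, and all function names are hypothetical. It treats the whole input as a single normalization block (so steps A and B are skipped) and uses `unicodedata.normalize("NFC", ...)` as a stand-in for the composition steps.

```python
import unicodedata


def is_starter(ch):
    """A starter is a character whose canonical combining class is 0."""
    return unicodedata.combining(ch) == 0


def split_by_starters(text):
    """Split text into smaller blocks, each beginning at a starter (step D)."""
    blocks = []
    for ch in text:
        if is_starter(ch) or not blocks:
            blocks.append(ch)
        else:
            blocks[-1] += ch
    return blocks


def sort_marks(block):
    """Stable-sort the combining marks of one block by combining class (step D)."""
    head = block[:1] if is_starter(block[0]) else ""
    marks = block[len(head):]
    return head + "".join(sorted(marks, key=unicodedata.combining))


def normalize_sketch(text, compose=True):
    """Decompose, sort, and optionally compose one block of text."""
    # Step C: decompose every character on its own, as a table lookup would.
    decomposed = "".join(unicodedata.normalize("NFD", ch) for ch in text)
    # Step D: split at starters and sort the marks inside each smaller block.
    joined = "".join(sort_marks(b) for b in split_by_starters(decomposed))
    # Step E (stand-in): compose the collected blocks.
    return unicodedata.normalize("NFC", joined) if compose else joined
```

Because Python's `sorted` is stable, marks with equal combining classes keep their relative order, which canonical ordering requires.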
Defined variables (2)
ucs-normalize-combining-chars-regexp :: Regular expression to match a sequence of combining characters.
ucs-normalize-decomposition-pair-to-primary-composite :: Hash table mapping a decomposed pair to its primary composite.
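To illustrate what these two variables hold, here is a hedged Python analogue built from the standard `unicodedata` and `re` modules. It is not how the library constructs its own data; `COMBINING_SEQUENCE`, `build_pair_table` and the code point ranges are assumptions chosen to keep the example small.

```python
import re
import unicodedata

# Analogue of the combining-characters regexp, limited to U+0300..U+036F.
_MARKS = "".join(
    chr(cp) for cp in range(0x0300, 0x0370) if unicodedata.combining(chr(cp))
)
COMBINING_SEQUENCE = re.compile("[" + _MARKS + "]+")


def build_pair_table(first=0x00A0, last=0x1EFF):
    """Map (first, second) decomposed pairs to their primary composite."""
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        fields = unicodedata.decomposition(ch).split()
        # Keep canonical two-character decompositions only; compatibility
        # decompositions start with a "<tag>" field.
        if len(fields) != 2 or fields[0].startswith("<"):
            continue
        pair = (chr(int(fields[0], 16)), chr(int(fields[1], 16)))
        # Skip characters that composition never produces, such as
        # composition exclusions and non-starter decompositions.
        if unicodedata.normalize("NFC", "".join(pair)) == ch:
            table[pair] = ch
    return table
```

A lookup such as `build_pair_table()[("e", "\u0301")]` then yields the precomposed letter, while a pair that belongs to a composition exclusion has no entry.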