Go_Learning_Fundamental_Text_Normalization

Refs: Text normalization in Go - The Go Programming Language

Prior knowledge

Before learning this part, we should know some theory about Unicode equivalence (see Unicode equivalence - Wikipedia).

  • canonically equivalent:

    Sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
    -- wiki
  • compatibility equivalent:

    Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
    -- wiki
  • Composing, Decomposing:

    • The former replaces runes that can combine into a single rune with this single rune.
    • The latter breaks runes apart into their components.
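To make the idea of canonical equivalence concrete, here is a minimal sketch that uses only the standard library. It takes the two sequences from the quote above (U+00F1, and U+006E followed by U+0303) and shows that they differ in rune count and byte length, so Go's == treats them as different strings. The helper name describe is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper that returns the number of runes
// and the number of bytes in s.
func describe(s string) (runeCount, byteLen int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	composed := "\u00f1"    // "ñ" as a single code point
	decomposed := "n\u0303" // "n" followed by the combining tilde

	cr, cb := describe(composed)
	dr, db := describe(decomposed)
	fmt.Printf("composed:   %+q runes=%d bytes=%d\n", composed, cr, cb)
	fmt.Printf("decomposed: %+q runes=%d bytes=%d\n", decomposed, dr, db)

	// The sequences are canonically equivalent, but == compares raw bytes.
	fmt.Println(composed == decomposed)
}
```

This is why a normalization step is usually needed before comparing text that may come from different sources.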

Key Points

                Composing   Decomposing
Canonical       NFC         NFD
Compatibility   NFKC        NFKD

Example

Difference Between NFC and NFD

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	str := "é"
	nfc := norm.NFC.String(str) // NFC form: canonical composition
	nfd := norm.NFD.String(str) // NFD form: canonical decomposition

	// Print each result as text, as hex bytes, and as ASCII-escaped runes.
	fmt.Printf("NFC String RESULT IS: %s\n", nfc)
	fmt.Printf("NFD String RESULT IS: %s\n", nfd)
	fmt.Printf("NFC Bytes RESULT IS: % x\n", nfc)
	fmt.Printf("NFD Bytes RESULT IS: % x\n", nfd)
	fmt.Printf("NFC Rune RESULT IS: %+q\n", nfc)
	fmt.Printf("NFD Rune RESULT IS: %+q\n", nfd)

	// Check whether the two forms are byte-for-byte equal.
	fmt.Println(nfc == nfd) // Output: false
}

Output:

NFC String RESULT IS: é
NFD String RESULT IS: é
NFC Bytes RESULT IS: c3 a9
NFD Bytes RESULT IS: 65 cc 81
NFC Rune RESULT IS: "\u00e9"
NFD Rune RESULT IS: "e\u0301"
false

From the output above, the NFC and NFD results look the same when printed, but they are different byte sequences. NFC: \xc3\xa9, NFD: \x65\xcc\x81.

In this example, we used the letter “é” (“e” with an acute accent). Normalizing the string to NFC and to NFD shows the difference: the NFC form composes the letter and its accent into the single code point U+00E9 (“é”), while the NFD form decomposes it into two code points, “e” (U+0065) followed by the combining acute accent (U+0301). Because the byte sequences differ, comparing the two results with == yields false.

Difference Between NFC and NFKC

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	str := "㎡"
	nfc := norm.NFC.String(str)   // NFC form: canonical composition
	nfkc := norm.NFKC.String(str) // NFKC form: compatibility composition

	// Print each result as text, as hex bytes, and as ASCII-escaped runes.
	fmt.Printf("NFC String RESULT IS: %s\n", nfc)
	fmt.Printf("NFKC String RESULT IS: %s\n", nfkc)
	fmt.Printf("NFC Bytes RESULT IS: % x\n", nfc)
	fmt.Printf("NFKC Bytes RESULT IS: % x\n", nfkc)
	fmt.Printf("NFC Rune RESULT IS: %+q\n", nfc)
	fmt.Printf("NFKC Rune RESULT IS: %+q\n", nfkc)

	// Check whether the two forms are byte-for-byte equal.
	fmt.Println(nfc == nfkc) // Output: false
}

Output:

NFC String RESULT IS: ㎡
NFKC String RESULT IS: m2
NFC Bytes RESULT IS: e3 8e a1
NFKC Bytes RESULT IS: 6d 32
NFC Rune RESULT IS: "\u33a1"
NFKC Rune RESULT IS: "m2"
false

In this example, we use the special character “㎡” (U+33A1), the unit symbol for square meters. Normalizing the string to NFC and to NFKC shows the difference: the NFC form leaves it as is, while the NFKC form converts it to “m2”, replacing the superscript 2 that denotes the square with a plain digit 2.

Thus, in some cases the NFC and NFKC forms of the same string are not equal.