Go_Learning_Fundamental_String_2

Refs: Strings, bytes, runes and characters in Go - The Go Programming Language

Prior knowledge

Before reading this part, we should know some theory about UTF-8 and Unicode, as well as the basic usage of strings in Go.

Key Points

Here are the key points of this post.

  • A string holds arbitrary bytes.
  • A string literal, without byte-level escapes, always holds a valid UTF-8 sequence.
  • Go source code is always UTF-8.

A string holds arbitrary bytes.

In Go, a string is in effect a read-only slice of bytes. Let’s start with the first point: a string holds arbitrary bytes.
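To make "read-only slice of bytes" concrete, here is a minimal sketch. The helper name byteValues and the sample word "gopher" are illustrative assumptions, not part of the original post. It shows that len counts bytes, that indexing yields a single byte, and that a string can only be changed by first copying it into a []byte.

```go
package main

import "fmt"

// byteValues returns a copy of the bytes held by s.
// The []byte conversion copies the data, so the caller can modify
// the result without touching the original string.
func byteValues(s string) []byte {
	return []byte(s)
}

func main() {
	s := "gopher"

	fmt.Println(len(s))      // len counts bytes
	fmt.Printf("%c\n", s[0]) // indexing yields one byte

	b := byteValues(s)
	b[0] = 'G' // allowed: b is a separate, writable copy
	// s[0] = 'G' would not compile, because strings are read-only.

	fmt.Println(s, string(b)) // the original string is unchanged
}
```

The commented-out assignment is the point of the sketch: the compiler rejects writes through a string index, which is what "read-only" means here.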

Here is an example. I create a string literal with byte-level escapes: \xbd\xb2\x3d\xbc\x20\xe2\xad\x90. The \xNN notation defines a string constant holding some peculiar byte values.

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

Print this sample with different formats, like this:

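package main

import "fmt"

func main() {
	// Byte-level escapes: most of these bytes are not valid UTF-8.
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

	// %s prints the bytes as they are.
	fmt.Printf("Print as string: %s \n", sample)
	fmt.Printf("\n")

	// Indexing a string yields individual bytes; %x prints each in hex.
	fmt.Print("Print as Byte loop: ")
	for i := 0; i < len(sample); i++ {
		fmt.Printf("%x ", sample[i])
	}
	fmt.Printf("\n")
	fmt.Printf("\n")

	// %q escapes non-printable byte sequences.
	fmt.Printf("Printf with quota: %q\n\n", sample)

	// %+q additionally escapes every non-ASCII character.
	fmt.Printf("Printf with plus quota: %+q\n", sample)
}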

Output:

Print as string: ��=� ⭐ 

Print as Byte loop: bd b2 3d bc 20 e2 ad 90

Printf with quota: "\xbd\xb2=\xbc ⭐"

Printf with plus quota: "\xbd\xb2=\xbc \u2b50"

Explanation:

  • Because the sample contains bytes that are not valid UTF-8, printing it directly with fmt.Printf("Print as string: %s \n", sample) produces the mess ��=� ⭐.

  • To find out what the string really holds, I take it apart and examine each byte with the %x verb. The byte-by-byte output is bd b2 3d bc 20 e2 ad 90, which matches the declaration const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90".

  • There’s more. The %q verb escapes any non-printable byte sequences in a string, so the output "\xbd\xb2=\xbc ⭐" is unambiguous. If we look at this output, we find one ASCII equals sign, one ASCII space, and one yellow star Unicode symbol.

    The yellow star ⭐ has the Unicode value U+2B50, encoded in UTF-8 as the bytes e2 ad 90.
  • The %+q verb escapes not only non-printable sequences but also any non-ASCII bytes, and it exposes the Unicode values of properly formatted UTF-8: "\xbd\xb2=\xbc \u2b50".

So we have it: a string holds arbitrary bytes. It is only a bunch of bytes, which means that random bytes will most likely form an invalid UTF-8 sequence, as in the example above.
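The same conclusion can be checked programmatically. The sketch below is an illustration rather than part of the original example: it uses utf8.ValidString and a range loop from the standard unicode/utf8 package, and the helper name countInvalid is an assumption made for this post. A range loop over a string decodes one rune per step and yields utf8.RuneError with a width of one byte whenever it meets a byte that is not valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalid reports how many bytes of s cannot be decoded as UTF-8.
func countInvalid(s string) int {
	n := 0
	for i, r := range s {
		if r != utf8.RuneError {
			continue
		}
		// A width of 1 means a decode failure; a genuine U+FFFD
		// stored in the string is three bytes wide.
		if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
			n++
		}
	}
	return n
}

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

	fmt.Println(utf8.ValidString(sample)) // false
	fmt.Println(countInvalid(sample))     // 3: the bytes bd, b2 and bc
}
```

The three invalid bytes correspond to the three � symbols in the %s output above, while 3d, 20 and e2 ad 90 decode cleanly to =, a space and ⭐.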

A string literal without byte-level escapes always holds valid UTF-8

A string literal without byte-level escapes always holds a valid UTF-8 sequence. We know that when we store a character in a string, it is stored as its byte-by-byte representation. Let’s see what happens with an example:

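package main

import "fmt"

func main() {
	// A raw string literal: it holds exactly the source text between the back quotes.
	const yellowStar = `⭐`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", yellowStar)
	fmt.Printf("\n")

	// %+q escapes non-ASCII characters, exposing the Unicode value.
	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", yellowStar)
	fmt.Printf("\n")

	// Print the UTF-8 bytes stored in the string, one at a time.
	fmt.Printf("hex bytes: ")
	for i := 0; i < len(yellowStar); i++ {
		fmt.Printf("%x ", yellowStar[i])
	}
	fmt.Printf("\n")
}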

Output:

plain string: ⭐
quoted string: "\u2b50"
hex bytes: e2 ad 90

Explanation:

  • The Unicode value of the character ⭐ is U+2B50 (written \u2b50 in Go), represented in UTF-8 as the bytes e2 ad 90.

  • Because Go source code is encoded as UTF-8, when the source code is written, the text editor (VS Code, Typora ...) places the UTF-8 encoding of the symbol into the source text.

    In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.
  • To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
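As a rough check of this summary, the sketch below compares several ways of writing the star. The helper name encode and the variable names are assumptions made for illustration; utf8.EncodeRune and utf8.ValidString come from the standard unicode/utf8 package. The raw literal, the \u escape and the explicit byte escapes all produce the same three bytes, while a truncated byte escape produces a string that is not valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 encoding of a single rune as a string.
func encode(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return string(buf[:n])
}

func main() {
	raw := `⭐`                  // raw literal: the UTF-8 source text itself
	escaped := "\u2b50"         // Unicode escape: the compiler writes the UTF-8 bytes
	byteEsc := "\xe2\xad\x90"   // byte escapes that happen to form valid UTF-8
	broken := "\xe2\xad"        // byte escapes cut short: not valid UTF-8

	fmt.Println(raw == escaped, raw == byteEsc, raw == encode(0x2B50)) // true true true
	fmt.Println(utf8.ValidString(raw), utf8.ValidString(broken))       // true false
}
```

Only the byte-level \x escapes let a literal step outside valid UTF-8, which is why the summary says "almost always".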