Go_Learning_Fundamental_String_2

Refs: Strings, bytes, runes and characters in Go - The Go Programming Language

Prior knowledge

Before reading this part, you should know some theory about Unicode and UTF-8, and the basic usage of strings in Go.

Key Points

Here are the key points of this post.

A string holds arbitrary bytes.
A string literal, without byte-level escapes, always holds a valid UTF-8 sequence.
Go source code is always UTF-8.

A string holds arbitrary bytes.

In Go, a string is in effect a read-only slice of bytes. Let’s start with the first point: a string holds arbitrary bytes.

Here is an example. I create a string literal with byte-level escapes: \xbd\xb2\x3d\xbc\x20\xe2\xad\x90. The \xNN notation defines a string constant holding some peculiar byte values.

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

Print this sample in several different formats:

package main

import "fmt"

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

	// %s writes the raw bytes as they are; invalid UTF-8 shows up garbled.
	fmt.Printf("Print as string: %s \n", sample)
	fmt.Printf("\n")

	// Indexing a string yields individual bytes, not characters.
	fmt.Print("Print as Byte loop: ")
	for i := 0; i < len(sample); i++ {
		fmt.Printf("%x ", sample[i])
	}
	fmt.Printf("\n")
	fmt.Printf("\n")

	// %q escapes non-printable byte sequences.
	fmt.Printf("Printf with quote: %q\n\n", sample)

	// %+q additionally escapes every non-ASCII character.
	fmt.Printf("Printf with plus quote: %+q\n", sample)
}

Output:

Print as string: ��=� ⭐ 

Print as Byte loop: bd b2 3d bc 20 e2 ad 90 

Printf with quote: "\xbd\xb2=\xbc ⭐"

Printf with plus quote: "\xbd\xb2=\xbc \u2b50"

Explanation:

Because the sample contains bytes that are not valid UTF-8, printing it directly with fmt.Printf("Print as string: %s \n", sample) produces the garbled result ��=� ⭐.
To find out what the string really holds, I take it apart and examine each byte with the %x verb. The byte-by-byte output is bd b2 3d bc 20 e2 ad 90, which matches the bytes in the declaration const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90".
There’s more. The %q verb escapes any non-printable byte sequences in a string, so the output "\xbd\xb2=\xbc ⭐" is unambiguous. Looking at this output, we find one ASCII equals sign, one ASCII space, and one Unicode yellow star.
The yellow star ⭐ has the Unicode value U+2B50, encoded in UTF-8 as the bytes e2 ad 90.
The %+q verb escapes not only non-printable sequences but also any non-ASCII bytes, so it exposes the Unicode values of properly formatted UTF-8: "\xbd\xb2=\xbc \u2b50".

So we can conclude that a string holds arbitrary bytes. It is only a sequence of bytes, which means that random bytes will usually form an invalid UTF-8 sequence, as in the example above.
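One way to check this conclusion in code is the standard library's unicode/utf8 package. The following is a minimal sketch rather than part of the original walkthrough; the helper name describe is hypothetical and exists only for illustration. It reports how many bytes a string holds and whether those bytes form valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper for this sketch: it returns the byte
// length of s and whether those bytes form a valid UTF-8 sequence.
func describe(s string) (byteLen int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"
	const star = "\xe2\xad\x90" // only the last three bytes of sample

	n, ok := describe(sample)
	fmt.Printf("sample: %d bytes, valid UTF-8: %t\n", n, ok)

	n, ok = describe(star)
	fmt.Printf("star:   %d bytes, valid UTF-8: %t\n", n, ok)
}
```

The whole sample is reported as invalid because bytes such as \xbd cannot stand on their own in UTF-8, while the three star bytes taken alone are valid.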

String literals always hold valid `utf-8` sequence

A string literal, without byte-level escapes, always holds a valid UTF-8 sequence. When we store a character in a string, it is stored as its sequence of bytes. Let’s see what happens with an example:

package main

import "fmt"

func main() {
	const yelloStar = `⭐`

	fmt.Printf("plain string: ")
	fmt.Printf("%s", yelloStar)
	fmt.Printf("\n")

	fmt.Printf("quoted string: ")
	fmt.Printf("%+q", yelloStar)
	fmt.Printf("\n")

	fmt.Printf("hex bytes: ")
	for i := 0; i < len(yelloStar); i++ {
		fmt.Printf("%x ", yelloStar[i])
	}
	fmt.Printf("\n")
}

Output:

plain string: ⭐
quoted string: "\u2b50"
hex bytes: e2 ad 90

Explanation:

The Unicode code point of the character ⭐ is U+2B50, represented in UTF-8 as the bytes e2 ad 90.
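To see where the bytes e2 ad 90 come from, the sketch below encodes U+2B50 twice: once with utf8.EncodeRune from the standard library, and once by hand using the three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx. The helper encodeThreeByte is a hypothetical name introduced for this illustration, and it assumes a code point in the range U+0800 to U+FFFF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeThreeByte is a hypothetical helper for this sketch. It fills in the
// three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx by hand and is only
// meaningful for code points from U+0800 to U+FFFF (surrogates excluded).
func encodeThreeByte(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),          // leading byte: top 4 bits
		0x80 | (byte(r>>6) & 0x3F),  // continuation byte: middle 6 bits
		0x80 | (byte(r) & 0x3F),     // continuation byte: low 6 bits
	}
}

func main() {
	const star rune = 0x2B50

	// Standard library encoding.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, star)
	fmt.Printf("utf8.EncodeRune: % x\n", buf[:n])

	// Manual encoding with bit operations.
	manual := encodeThreeByte(star)
	fmt.Printf("by hand:         % x\n", manual[:])
}
```

Both lines should show the same three bytes that the hex loop printed for the string literal.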

Because Go source code is UTF-8, when the source is written, the text editor (VS Code, Typora, ...) places the UTF-8 encoding of the symbol ⭐ into the source text.

In short, Go source code is UTF-8, so *the source code for the string literal is UTF-8 text*. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.
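The difference between raw literals, Unicode escapes, and byte-level escapes can be compared side by side. This is a small sketch under the assumptions stated in the comments; summarize is a hypothetical helper that formats a string with %+q together with its byte length and UTF-8 validity.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// summarize is a hypothetical helper for this sketch: it shows the escaped
// form of s, its length in bytes, and whether it is valid UTF-8.
func summarize(s string) string {
	return fmt.Sprintf("%+q len=%d valid=%t", s, len(s), utf8.ValidString(s))
}

func main() {
	const raw = `⭐`              // raw literal: exactly the UTF-8 source text
	const escaped = "\u2b50"       // Unicode escape, stored as UTF-8
	const byBytes = "\xe2\xad\x90" // byte escapes that happen to form valid UTF-8
	const rawBackslash = `\xbd`    // raw literals do not interpret escapes: four ASCII bytes
	const broken = "\xbd"          // a single byte that is not valid UTF-8 by itself

	// All three spellings of the star produce the same bytes.
	fmt.Println(raw == escaped && escaped == byBytes)

	for _, s := range []string{raw, rawBackslash, broken} {
		fmt.Println(summarize(s))
	}
}
```

Only the interpreted literal with the \xbd escape ends up holding invalid UTF-8; the raw literal with the same characters simply stores a backslash, an x, a b, and a d.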

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.
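Strings do not have to come from literals. As a rough sketch, the program below builds the earlier sample from a byte slice at run time and also illustrates the read-only nature of strings. The helper fromBytes is a hypothetical name used only here.

```go
package main

import "fmt"

// fromBytes is a hypothetical helper for this sketch. The conversion copies
// the bytes into a new string and performs no UTF-8 validation.
func fromBytes(b []byte) string {
	return string(b)
}

func main() {
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

	raw := []byte{0xbd, 0xb2, 0x3d, 0xbc, 0x20, 0xe2, 0xad, 0x90}
	s := fromBytes(raw)

	// Same bytes, whether built at run time or written as a literal.
	fmt.Println(s == sample)

	// Changing the slice afterwards does not affect the string, which
	// holds its own read-only copy of the bytes.
	raw[0] = 'A'
	fmt.Printf("%+q\n", s)

	// s[0] = 'A' // would not compile: a string cannot be modified through an index
}
```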
