Refs: Strings, bytes, runes and characters in Go - The Go Programming Language
Prior knowledge
Before reading this part, you should know some theory about UTF-8 and Unicode, and the basic usage of strings in Go.
Key Points
Here are the key points of this post.
- A string holds arbitrary bytes.
- A string literal, without byte-level escapes, always holds a valid UTF-8 sequence.
- Go source code is always UTF-8.
A string holds arbitrary bytes.
In Go, a string is in effect a read-only slice of bytes. Let's start with the first point: a string holds arbitrary bytes.
Here is an example. I create a string literal with byte-level escapes: \xbd\xb2\x3d\xbc\x20\xe2\xad\x90. The \xNN notation lets a string literal contain arbitrary byte values.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"
Print this sample with different formats, like this:
package main
Output:
Print as string: ��=� ⭐
Explanation:
Because sample contains bytes that are not valid UTF-8, printing it directly with fmt.Printf("Print as string: %s \n", sample) gives the messy result ��=� ⭐.
To find out what the string really holds, I split it apart and examined each byte with the %x verb. The byte-by-byte output is bd b2 3d bc 20 e2 ad 90, which matches the declaration const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90".
There's more. The %q verb escapes any non-printable byte sequence in a string, so the output "\xbd\xb2=\xbc ⭐" is clear. In this output we find one ASCII equals sign, one ASCII space, and one Unicode yellow star.
Note: the yellow star ⭐ has the Unicode code point U+2B50, which UTF-8 encodes as the bytes e2 ad 90.
The %+q verb escapes not only non-printable sequences but also any non-ASCII bytes, so it exposes the Unicode values of properly formatted UTF-8: "\xbd\xb2=\xbc \u2b50".
So we have seen that a string holds arbitrary bytes. It is only a bunch of bytes, which means that random bytes will most likely form an invalid UTF-8 sequence, as in the example above.
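The verbs discussed in this section can be tried side by side. The following is a minimal sketch, not the original listing: the helper name formatAll and the label layout are illustrative choices, and only the sample constant is taken from the text.

```go
package main

import "fmt"

// sample is the string from the text: byte-level escapes that do not
// form valid UTF-8 as a whole, followed by the UTF-8 bytes of a star.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\xad\x90"

// formatAll returns s rendered with each verb discussed in this section.
// The function name and the order of results are illustrative.
func formatAll(s string) []string {
	return []string{
		fmt.Sprintf("%s", s),  // bytes as-is; invalid ones usually display as U+FFFD
		fmt.Sprintf("% x", s), // hex dump with a space between bytes
		fmt.Sprintf("%q", s),  // escapes non-printable and invalid bytes
		fmt.Sprintf("%+q", s), // additionally escapes non-ASCII runes
	}
}

func main() {
	labels := []string{"%s", "% x", "%q", "%+q"}
	for i, out := range formatAll(sample) {
		fmt.Printf("%-4s -> %s\n", labels[i], out)
	}
}
```

The space in "% x" is a fmt flag that puts a space between the hex bytes, which makes the dump easier to compare with the \xNN escapes in the declaration.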
String literals always hold a valid UTF-8 sequence
A string literal, without byte-level escapes, always holds a valid UTF-8 sequence. We know that when we store a character in a string, it is stored as a sequence of bytes. Let's see what happens with an example:
package main
Output:
plain string: ⭐
Explanation:
The Unicode value of the character ⭐ is \u2b50, which is represented in UTF-8 as the bytes e2 ad 90. Because Go source code is encoded as UTF-8, when the source code is written, the text editor (VS Code, Typora, ...) places the UTF-8 encoding of the symbol ⭐ into the source text.
In short, Go source code is UTF-8, so *the source code for the string literal is UTF-8 text*. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.
To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.