Go_Learning_Fundamental_String

Reference: blog1, blog2, blog_3, blog_4

Definition

A string in Go is an immutable sequence of bytes; it behaves much like a read-only byte slice.

string is the set of all strings of 8-bit bytes, conventionally but not necessarily representing UTF-8-encoded text. A string may be empty, but not nil. Values of string type are immutable.
-- from the Go builtin package documentation

Base Data Structure of String in Go

Internally, the Go runtime represents a string as a small struct that wraps a byte sequence: a pointer to the bytes plus a length. In practice, we can view a string as an immutable byte slice.

type _string struct {
    elements *byte // underlying bytes
    len      int   // number of bytes
}
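
One way to observe this two-field layout is to check the size of a string value. The sketch below assumes the standard Go compiler, where a string value occupies two machine words regardless of its content; headerWords is a hypothetical helper name used only for illustration.

```go
package main

import (
    "fmt"
    "unsafe"
)

// headerWords reports how many machine words a string value occupies.
// It is a hypothetical helper, not part of any library.
func headerWords() uintptr {
    var s string
    return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
    short := "hi"
    long := "a much longer string than the first one"
    // The size of the string value itself does not depend on its content.
    fmt.Println(unsafe.Sizeof(short) == unsafe.Sizeof(long)) // true
    fmt.Println(headerWords())                               // 2
}
```

The bytes themselves live elsewhere in memory; only the pointer and the length are part of the string value.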

Literal String | Raw String | Multi Line String

Go supports two styles of string literals: the double-quote style (interpreted string literals) and the back-quote style (raw string literals).

Literal String: created with double quotes; escape sequences such as \n and \r are interpreted.

Raw String: created with back-quotes; escape sequences are not interpreted and are kept as written.

Multi Line String: a raw string (created with back-quotes) that spans multiple lines; escape sequences are not interpreted.

package main

import "fmt"

func main() {
    literalString := "Literal string is created with double quotes \nand interprets escape sequences;"
    rawString := `Raw string is created with back-quotes \n and does not interpret escape sequences;`
    multiLineString := `
MultiLine string is created with back-quotes
and can span multiple lines
`

    fmt.Println(literalString)
    fmt.Println(rawString)
    fmt.Println(multiLineString)
}
// Literal string is created with double quotes
// and interprets escape sequences;
// Raw string is created with back-quotes \n and does not interpret escape sequences;
//
// MultiLine string is created with back-quotes
// and can span multiple lines
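
The difference between the two literal styles also shows up in the byte length. As a minimal sketch (the constant names interpreted and raw are made up for this example), the same characters typed between double quotes and between back-quotes produce strings of different lengths:

```go
package main

import "fmt"

const (
    interpreted = "a\tb\n" // \t and \n each become a single byte
    raw         = `a\tb\n` // backslashes are kept as written
)

func main() {
    fmt.Println(len(interpreted)) // 4
    fmt.Println(len(raw))         // 6
    // An interpreted literal needs doubled backslashes to equal the raw one.
    fmt.Println(raw == "a\\tb\\n") // true
}
```

This is why raw string literals are convenient for text that contains many backslashes, such as regular expressions or Windows paths.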

Some Facts About Strings

  • String values can be used as constants (along with boolean and all kinds of numeric values).
  • Go supports two styles of string literals, the double-quote style (or interpreted literals) and the back-quote style (or raw string literals).
  • The zero values of string types are blank strings, which can be represented with "" or `` in literals.
  • Strings can be concatenated with + and += operators.
  • String types are all comparable (by using the == and != operators). And like integer and floating-point values, two values of the same string type can also be compared with >, <, >= and <= operators. When comparing two strings, their underlying bytes will be compared, one byte by one byte. If one string is a prefix of the other one and the other one is longer, then the other one will be viewed as the larger one.
package main

import "fmt"

func main() {
    const hello = "hello"
    var world = `World!`
    helloWorld := hello + " " + world
    fmt.Println(helloWorld)         // hello World!
    fmt.Println("hello" == hello)   // true
    fmt.Println(hello > helloWorld) // false: hello is a prefix of helloWorld
}
  • Like Java, the contents (underlying bytes) of string values are immutable. The lengths of string values also can’t be modified separately. An addressable string value can only be overwritten as a whole by assigning another string value to it.
  • The built-in string type has no methods (just like most other built-in types in Go), but we can
    • use functions provided in the strings standard package to do all kinds of string manipulations.
    • call the built-in len function to get the length of a string (number of bytes stored in the string).
    • use the element access syntax aString[i] to get the i-th byte value stored in aString. The expression aString[i] is not addressable. In other words, the value aString[i] can’t be modified.
    • use the subslice syntax aString[start:end] to get a substring of aString. Here, start and end are both indexes of bytes stored in aString.
  • For the standard Go compiler, the destination string variable and source string value in a string assignment will share the same underlying byte sequence in memory. The result of a substring expression aString[start:end] also shares the same underlying byte sequence with the base string aString in memory.
package main

import "fmt"

func main() {
    const hello = "hello"
    var world = `World!`
    helloWorld := hello + " " + world
    fmt.Println(len(helloWorld)) // 12
    fmt.Println(helloWorld[0])   // 104 is the byte value of 'h'
    // helloWorld[0] = 'H' // compile error: the bytes of a string are immutable
    // _ = &helloWorld[0]  // compile error: helloWorld[0] is not addressable
}

Operation With String

Compare String in Go

As mentioned above, when two strings are compared, their underlying bytes are compared one by one. If one string is a prefix of the other, the longer one is viewed as the larger one. The standard Go compiler makes the following optimizations for string comparisons.

  • For == and != comparisons, if the lengths of the two strings are not equal, then the two strings must be not equal, so their bytes need not be compared.
  • If their underlying byte sequence pointers are equal, then the comparison result is the same as comparing the lengths of the two strings.

So for two equal strings, the time complexity of comparing them depends on whether or not their underlying byte sequence pointers are equal. If the pointers are equal, the time complexity is O(1); otherwise, it is O(n), where n is the length of the strings.

As mentioned above, for the standard Go compiler, in a string value assignment the destination string and the source string share the same underlying byte sequence in memory. So the cost of comparing two such strings is very small.

package main

import (
    "fmt"
    "time"
)

func main() {
    bs := make([]byte, 1<<26)
    // s0, s1 and s2 are three equal strings.
    // s0 and s1 are each a deep copy of bs,
    // so they have different memory addresses.
    // As mentioned, s2 and s1 share the same memory.
    s0 := string(bs)
    s1 := string(bs)
    s2 := s1
    startTime := time.Now()
    _ = s0 == s1
    duration := time.Since(startTime)
    fmt.Println("Duration for (s0==s1):", duration)

    startTime = time.Now()
    _ = s2 == s1
    duration = time.Since(startTime)
    fmt.Println("Duration for (s1==s2):", duration)
}
// Duration for (s0==s1): 15.190255ms
// Duration for (s1==s2): 94ns

1ms is 1000000ns! So please try to avoid comparing two long strings if they don’t share the same underlying byte sequence.

Loop Over String

  • Classic for: loops over bytes.
  • For range: loops over runes.
package main

import (
    "fmt"
)

func main() {
    name := "Sun清志"
    // for-range decodes one rune per iteration; i is the byte index of the rune.
    for i, char := range name {
        fmt.Printf("Index of char %c is %v \n", char, i)
    }
    fmt.Print("--------------- \n")
    // A classic for loop visits every byte, so multi-byte runes are split up.
    for i := 0; i < len(name); i++ {
        fmt.Printf("index %v val of name is %c \n", i, name[i])
    }
}
// Index of char S is 0
// Index of char u is 1
// Index of char n is 2
// Index of char 清 is 3
// Index of char 志 is 6
// ---------------
// index 0 val of name is S
// index 1 val of name is u
// index 2 val of name is n
// index 3 val of name is æ
// index 4 val of name is ¸
// index 5 val of name is  (byte 0x85, an invisible control character)
// index 6 val of name is å
// index 7 val of name is ¿
// index 8 val of name is  (byte 0x97, an invisible control character)

String Join | String Split

Besides using the + operator to concatenate strings, we can also use the following ways to concatenate strings.

  • The Sprintf/Sprint/Sprintln functions in the fmt standard package can be used to concatenate values of any types, including string types.
  • Use the Join function in the strings standard package.
  • The Buffer type in the bytes standard package (or the built-in copy function) can be used to build byte slices, which afterwards can be converted to string values.
  • Since Go 1.10, the Builder type in the strings standard package can be used to build strings. Compared with the bytes.Buffer way, this way avoids making an unnecessary duplicated copy of the underlying bytes for the result string.

The standard Go compiler makes optimizations for string concatenations by using the + operator. So generally, using + operator to concatenate strings is convenient and efficient if the number of the concatenated strings is known at compile time.

String Conversion With Byte And Rune

  • A string value can be explicitly converted to a byte slice, and vice versa.
  • A string value can be explicitly converted to a rune slice, and vice versa.

In a conversion from a rune slice to a string, each slice element (a rune value) is UTF-8 encoded as one to four bytes and stored in the result string. If a rune element is outside the range of valid Unicode code points, it is treated as 0xFFFD, the code point of the Unicode replacement character. 0xFFFD is UTF-8 encoded as three bytes (0xef 0xbf 0xbd).

When a string is converted to a rune slice, the bytes stored in the string will be viewed as successive UTF-8 encoding byte sequence representations of many Unicode code points. Bad UTF-8 encoding representations will be converted to a rune value 0xFFFD.

When a string is converted to a byte slice, the result byte slice is just a deep copy of the underlying byte sequence of the string. When a byte slice is converted to a string, the underlying byte sequence of the result string is also just a deep copy of the byte slice. A memory allocation is needed to store the deep copy in each of such conversions. A deep copy is essential because slice elements are mutable but the bytes stored in strings are immutable, so a byte slice and a string can’t share byte elements.

Please note, for conversions between strings and byte slices,

  • illegal UTF-8 encoded bytes are allowed and will keep unchanged.
  • the standard Go compiler makes some optimizations for some special cases of such conversions, so that the deep copies are not made.
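
The conversion rules above can be checked with a few small experiments. The following is a minimal sketch; roundTrip is a hypothetical helper, and the input values are chosen only for illustration.

```go
package main

import "fmt"

// roundTrip converts s to a byte slice and back without touching the bytes.
// Hypothetical helper for illustration.
func roundTrip(s string) string {
    return string([]byte(s))
}

func main() {
    // An out-of-range rune becomes U+FFFD, encoded as ef bf bd.
    bad := string([]rune{'a', 0x110000})
    fmt.Printf("% x\n", bad) // 61 ef bf bd

    // An invalid UTF-8 byte becomes the rune U+FFFD.
    fmt.Printf("%U\n", []rune("a\xffb")) // [U+0061 U+FFFD U+0062]

    // The byte slice is a copy: changing it leaves the string intact.
    s := "hello"
    bs := []byte(s)
    bs[0] = 'H'
    fmt.Println(s, string(bs)) // hello Hello

    // Invalid bytes survive string <-> []byte conversions unchanged.
    fmt.Println(roundTrip("a\xffb") == "a\xffb") // true
}
```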

Conversion Between Byte And Rune

Conversions between byte slices and rune slices are not supported directly in Go. We can use the following ways to achieve this goal:

  • use string values as a hop. This way is convenient but not very efficient, for two deep copies are needed in the process.
  • use the functions in unicode/utf8 standard package.
  • use the Runes function in the bytes standard package to convert a []byte value to a []rune value. There is not a function in this package to convert a rune slice to byte slice.
package main

import (
    "bytes"
    "unicode/utf8"
)

// Runes2Bytes UTF-8 encodes rs into a newly allocated byte slice.
func Runes2Bytes(rs []rune) []byte {
    // First pass: compute the total number of bytes needed.
    n := 0
    for _, r := range rs {
        n += utf8.RuneLen(r)
    }
    // make uses the old value of n; afterwards n is reset to 0
    // and reused as the write offset.
    n, bs := 0, make([]byte, n)
    for _, r := range rs {
        n += utf8.EncodeRune(bs[n:], r)
    }
    return bs
}

func main() {
    s := "Color Infection is a fun game."
    bs := []byte(s)      // string -> []byte
    s = string(bs)       // []byte -> string
    rs := []rune(s)      // string -> []rune
    s = string(rs)       // []rune -> string
    rs = bytes.Runes(bs) // []byte -> []rune
    bs = Runes2Bytes(rs) // []rune -> []byte
}

Compiler Optimizations for Conversions Between Strings and Byte Slices

As mentioned above, the underlying bytes are copied in conversions between strings and byte slices. The standard Go compiler makes some optimizations, which are proven to still work in Go Toolchain 1.18, to avoid the copies in some special scenarios. These scenarios include:

  • a conversion (from string to byte slice) which follows the range keyword in a for-range loop.
  • a conversion (from byte slice to string) which is used as a map key in map element retrieval indexing syntax.
  • a conversion (from byte slice to string) which is used in a comparison.
  • a conversion (from byte slice to string) which is used in a string concatenation, and at least one of concatenated string values is a non-blank string constant.
package main

import "fmt"

func main() {
    var str = "world"
    // Here, the []byte(str) conversion will
    // not copy the underlying bytes of str.
    for i, b := range []byte(str) {
        fmt.Println(i, ":", b)
    }

    key := []byte{'k', 'e', 'y'}
    m := map[string]string{}
    // The string(key) conversion copies the bytes in key.
    m[string(key)] = "value"
    // Here, this string(key) conversion doesn't copy
    // the bytes in key. The optimization will be still
    // made, even if key is a package-level variable.
    fmt.Println(m[string(key)]) // value (very possible)
}

Note, the last line might not output value if there are data races in evaluating string(key). However, such data races will never cause panics.
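
The map lookup case can be measured with testing.AllocsPerRun. The sketch below assumes the standard Go compiler; the variable names and the 1024-byte key are made up for this example, and the reported counts are what that compiler is expected to produce rather than something the language guarantees.

```go
package main

import (
    "fmt"
    "testing"
)

var (
    key     = make([]byte, 1024) // hypothetical large key
    table   = map[string]int{string(key): 1}
    sinkInt int
    sinkStr string
)

// lookupAllocs measures allocations of a map lookup keyed by string(key).
func lookupAllocs() float64 {
    return testing.AllocsPerRun(100, func() {
        sinkInt = table[string(key)]
    })
}

// copyAllocs measures allocations of a plain conversion whose result is kept.
func copyAllocs() float64 {
    return testing.AllocsPerRun(100, func() {
        sinkStr = string(key)
    })
}

func main() {
    fmt.Println(lookupAllocs()) // 0 expected with the standard compiler
    fmt.Println(copyAllocs())   // 1 expected: the bytes are copied
}
```

The results are stored in package-level variables so that the compiler cannot drop the conversions as unused.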

package main

import (
    "fmt"
    "testing"
)

var s string
var x = []byte{1023: 'x'}
var y = []byte{1023: 'y'}

func fc() {
    // None of the below 4 conversions will
    // copy the underlying bytes of x and y.
    // Surely, the underlying bytes of x and y will
    // be still copied in the string concatenation.
    if string(x) != string(y) {
        s = (" " + string(x) + string(y))[1:]
    }
}

func fd() {
    // Only the two conversions in the comparison
    // will not copy the underlying bytes of x and y.
    if string(x) != string(y) {
        // Please note the difference between the
        // following concatenation and the one in fc.
        s = string(x) + string(y)
    }
}

func main() {
    fmt.Println(testing.AllocsPerRun(1, fc)) // 1
    fmt.Println(testing.AllocsPerRun(1, fd)) // 3
}