46.2 Sequential and Parallel Benchmarks

Depending on whether they run in parallel, Go benchmarks fall into two categories: sequential benchmarks and parallel benchmarks.

1. Sequential benchmarks

A sequential benchmark is written as follows:

func BenchmarkXxx(b *testing.B) {
    // ...
    for i := 0; i < b.N; i++ {
        // code that exercises the code under test
    }
}

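The benchmarks of the various string concatenation methods shown earlier belong to this category. The following example illustrates how a sequential benchmark is executed: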

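// chapter8/sources/benchmark-impl/sequential_test.go
import (
    "fmt"
    "sync"
    "sync/atomic"
    "testing"

    // Assumption: the listing omits its imports. tls is taken to be a
    // third-party helper whose ID() returns the current goroutine's ID.
    tls "github.com/huandu/go-tls"
)
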
var (
    m     map[int64]struct{} = make(map[int64]struct{}, 10)
    mu    sync.Mutex
    round int64 = 1
)

func BenchmarkSequential(b *testing.B) {
    fmt.Printf("\ngoroutine[%d] enter BenchmarkSequential: round[%d], b.N[%d]\n",
           tls.ID(), atomic.LoadInt64(&round), b.N)
    defer func() {
        atomic.AddInt64(&round, 1)
    }()

    for i := 0; i < b.N; i++ {
        mu.Lock()
        _, ok := m[round]
        if !ok {
            m[round] = struct{}{}
            fmt.Printf("goroutine[%d] enter loop in BenchmarkSequential: round[%d], b.N[%d]\n",
                tls.ID(), atomic.LoadInt64(&round), b.N)
        }
        mu.Unlock()
    }
    fmt.Printf("goroutine[%d] exit BenchmarkSequential: round[%d], b.N[%d]\n",
           tls.ID(), atomic.LoadInt64(&round), b.N)
}

Run the example:

$go test -bench . sequential_test.go

goroutine[1] enter BenchmarkSequential: round[1], b.N[1]
goroutine[1] enter loop in BenchmarkSequential: round[1], b.N[1]
goroutine[1] exit BenchmarkSequential: round[1], b.N[1]
goos: darwin
goarch: amd64
BenchmarkSequential-8
goroutine[2] enter BenchmarkSequential: round[2], b.N[100]
goroutine[2] enter loop in BenchmarkSequential: round[2], b.N[100]
goroutine[2] exit BenchmarkSequential: round[2], b.N[100]
goroutine[2] enter BenchmarkSequential: round[3], b.N[10000]
goroutine[2] enter loop in BenchmarkSequential: round[3], b.N[10000]
goroutine[2] exit BenchmarkSequential: round[3], b.N[10000]

goroutine[2] enter BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter loop in BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] exit BenchmarkSequential: round[4], b.N[1000000]

goroutine[2] enter BenchmarkSequential: round[5], b.N[65666582]
goroutine[2] enter loop in BenchmarkSequential: round[5], b.N[65666582]
goroutine[2] exit BenchmarkSequential: round[5], b.N[65666582]
65666582           20.6 ns/op
PASS
ok         command-line-arguments 1.381s

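We can see that:

  • BenchmarkSequential was executed for several rounds (see the round values in the output);
  • the b.N used by the for loop differs in every round: 1, 100, 10000, 1000000, and 65666582 in turn;
  • apart from the first round, where b.N is 1, all rounds ran sequentially in the same goroutine (goroutine[2]).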

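By default, each benchmark function (such as BenchmarkSequential) runs for 1 second. If a round takes less than 1 second, go test increases b.N, moving to a nearby larger value in a sequence such as 1, 2, 3, 5, 10, 20, 30, 50, 100 (the exact sequence depends on the Go version). If the benchmark finishes very quickly while b.N is small, the go test benchmark framework skips the intermediate values and picks a larger one, as in this run, where b.N jumps directly from 1 to 100. After choosing the new b.N, the framework starts another round of the benchmark function, and it repeats this until a round takes longer than 1 second.

In the example above, b.N in the final round is 65666582. This value was presumably computed by go test from the average time per iteration measured in the previous round. go test found that multiplying that average by an N another 100 times larger would make the next round run far longer than 1 second, so it instead used the 1-second target together with the previous round's average time per iteration to estimate an iteration count, which gave 65666582.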

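If a benchmark runs for only 1 second and completes only 10 iterations in that second, the resulting average is likely to have a high standard deviation. If the benchmark runs millions or billions of iterations, the average is likely to be much closer to the true value. To increase the number of iterations, use the -benchtime command-line option to lengthen the benchmark's execution time.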

In the following example, the -benchtime command-line option of go test changes the benchmark execution time from the default 1 second to 2 seconds:

$go test -bench . sequential_test.go -benchtime 2s
...

goroutine[2] enter BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] enter loop in BenchmarkSequential: round[4], b.N[1000000]
goroutine[2] exit BenchmarkSequential: round[4], b.N[1000000]

goroutine[2] enter BenchmarkSequential: round[5], b.N[100000000]
goroutine[2] enter loop in BenchmarkSequential: round[5], b.N[100000000]
goroutine[2] exit BenchmarkSequential: round[5], b.N[100000000]
100000000          20.5 ns/op
PASS
ok         command-line-arguments 2.075s

With the execution time raised to 2 seconds, b.N in the final round grows to 100000000.

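-benchtime can also set b.N directly by using a value of the form Nx (5x in the command below). go test then uses the given N as the iteration count of the final round: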

$go test -v -benchtime 5x -bench . sequential_test.go
goos: darwin
goarch: amd64
BenchmarkSequential

goroutine[1] enter BenchmarkSequential: round[1], b.N[1]
goroutine[1] enter loop in BenchmarkSequential: round[1], b.N[1]
goroutine[1] exit BenchmarkSequential: round[1], b.N[1]

goroutine[2] enter BenchmarkSequential: round[2], b.N[5]
goroutine[2] enter loop in BenchmarkSequential: round[2], b.N[5]
goroutine[2] exit BenchmarkSequential: round[2], b.N[5]
BenchmarkSequential-8            5             5470 ns/op
PASS
ok        command-line-arguments 0.006s

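Although each benchmark function above (such as BenchmarkSequential) actually ran for several rounds, that still counts as a single execution. Because the data from a single execution may not be representative, we may explicitly ask go test to execute the benchmark several times, collect the data from every execution, and take the statistically processed result as the final result. The -count command-line option specifies how many times each benchmark function is executed: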

$go test -v -count 2 -bench . benchmark_intro_test.go
goos: darwin
goarch: amd64
BenchmarkConcatStringByOperator
BenchmarkConcatStringByOperator-8       12665250            89.8 ns/op
BenchmarkConcatStringByOperator-8       13099075            89.7 ns/op
BenchmarkConcatStringBySprintf
BenchmarkConcatStringBySprintf-8         2781075             433 ns/op
BenchmarkConcatStringBySprintf-8         2662507             433 ns/op
BenchmarkConcatStringByJoin
BenchmarkConcatStringByJoin-8           23679480            49.1 ns/op
BenchmarkConcatStringByJoin-8           24135014            49.6 ns/op
PASS
ok         command-line-arguments 8.225s

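In the example above, each benchmark function was executed twice (each execution still consists of several rounds with different b.N values), so two results were reported for each function.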

2. Parallel benchmarks

A parallel benchmark is written as follows:

func BenchmarkXxx(b *testing.B) {
    // ...
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            // code that exercises the code under test
        }
    })
}

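Parallel benchmarks are mainly used to establish performance baselines for code under test that contains multi-goroutine synchronization facilities (such as mutexes, read-write locks, and atomic operations). Compared with a sequential benchmark, a parallel benchmark reflects more faithfully what the code under test actually spends on goroutine synchronization when many goroutines are involved. Consider the following example: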

// chapter8/sources/benchmark_paralell_demo_test.go
import (
    "sync"
    "sync/atomic"
    "testing"
)

var n1 int64

func addSyncByAtomic(delta int64) int64 {
    return atomic.AddInt64(&n1, delta)
}

func readSyncByAtomic() int64 {
    return atomic.LoadInt64(&n1)
}

var n2 int64
var rwmu sync.RWMutex

func addSyncByMutex(delta int64) {
    rwmu.Lock()
    n2 += delta
    rwmu.Unlock()
}

func readSyncByMutex() int64 {
    var n int64
    rwmu.RLock()
    n = n2
    rwmu.RUnlock()
    return n
}

func BenchmarkAddSyncByAtomic(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            addSyncByAtomic(1)
        }
    })
}

func BenchmarkReadSyncByAtomic(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            readSyncByAtomic()
        }
    })
}

func BenchmarkAddSyncByMutex(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            addSyncByMutex(1)
        }
    })
}

func BenchmarkReadSyncByMutex(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            readSyncByMutex()
        }
    })
}

Run the benchmark:

$go test -v -bench . benchmark_paralell_demo_test.go -cpu 2,4,8
goos: darwin
goarch: amd64
BenchmarkAddSyncByAtomic
BenchmarkAddSyncByAtomic-2        75208119              15.3 ns/op
BenchmarkAddSyncByAtomic-4        70117809              17.0 ns/op
BenchmarkAddSyncByAtomic-8        68664270              15.9 ns/op
BenchmarkReadSyncByAtomic
BenchmarkReadSyncByAtomic-2       1000000000           0.744 ns/op
BenchmarkReadSyncByAtomic-4       1000000000           0.384 ns/op
BenchmarkReadSyncByAtomic-8       1000000000           0.240 ns/op
BenchmarkAddSyncByMutex
BenchmarkAddSyncByMutex-2         37533390              31.4 ns/op
BenchmarkAddSyncByMutex-4         21660948              57.5 ns/op
BenchmarkAddSyncByMutex-8         16808721              72.6 ns/op
BenchmarkReadSyncByMutex
BenchmarkReadSyncByMutex-2        35535615              32.3 ns/op
BenchmarkReadSyncByMutex-4        29839219              39.6 ns/op
BenchmarkReadSyncByMutex-8        29936805              39.8 ns/op
PASS
ok         command-line-arguments 12.454s

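In the example above, the -cpu 2,4,8 command-line option tells go test to run each benchmark function once with GOMAXPROCS set to 2, to 4, and to 8. From the output it is easy to see how the performance of each function under test changes as GOMAXPROCS increases.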

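Unlike a sequential benchmark, a parallel benchmark starts multiple goroutines that execute the loop in the benchmark function in parallel. Again, an example illustrates the execution flow: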

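// chapter8/sources/benchmark-impl/paralell_test.go
import (
    "fmt"
    "sync"
    "sync/atomic"
    "testing"

    // Assumption: the listing omits its imports. tls is taken to be a
    // third-party helper whose ID() returns the current goroutine's ID.
    tls "github.com/huandu/go-tls"
)
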
var (
    m     map[int64]int = make(map[int64]int, 20)
    mu    sync.Mutex
    round int64 = 1
)

func BenchmarkParalell(b *testing.B) {
    fmt.Printf("\ngoroutine[%d] enter BenchmarkParalell: round[%d], b.N[%d]\n",
           tls.ID(), atomic.LoadInt64(&round), b.N)
    defer func() {
        atomic.AddInt64(&round, 1)
    }()

    b.RunParallel(func(pb *testing.PB) {
        id := tls.ID()
        fmt.Printf("goroutine[%d] enter loop func in BenchmarkParalell: round[%d], b.N[%d]\n", tls.ID(), atomic.LoadInt64(&round), b.N)
        for pb.Next() {
            mu.Lock()
            _, ok := m[id]
            if !ok {
                m[id] = 1
            } else {
                m[id] = m[id] + 1
            }
            mu.Unlock()
        }

        mu.Lock()
        count := m[id]
        mu.Unlock()

        fmt.Printf("goroutine[%d] exit loop func in BenchmarkParalell: round[%d], loop[%d]\n", tls.ID(), atomic.LoadInt64(&round), count)
    })

    fmt.Printf("goroutine[%d] exit BenchmarkParalell: round[%d], b.N[%d]\n",
        tls.ID(), atomic.LoadInt64(&round), b.N)
}

Run the example with -cpu=2:

$go test -v  -bench . paralell_test.go -cpu=2
goos: darwin
goarch: amd64
BenchmarkParalell

goroutine[1] enter BenchmarkParalell: round[1], b.N[1]
goroutine[2] enter loop func in BenchmarkParalell: round[1], b.N[1]
goroutine[2] exit loop func in BenchmarkParalell: round[1], loop[1]
goroutine[3] enter loop func in BenchmarkParalell: round[1], b.N[1]
goroutine[3] exit loop func in BenchmarkParalell: round[1], loop[0]
goroutine[1] exit BenchmarkParalell: round[1], b.N[1]

goroutine[4] enter BenchmarkParalell: round[2], b.N[100]
goroutine[5] enter loop func in BenchmarkParalell: round[2], b.N[100]
goroutine[5] exit loop func in BenchmarkParalell: round[2], loop[100]
goroutine[6] enter loop func in BenchmarkParalell: round[2], b.N[100]
goroutine[6] exit loop func in BenchmarkParalell: round[2], loop[0]
goroutine[4] exit BenchmarkParalell: round[2], b.N[100]

goroutine[4] enter BenchmarkParalell: round[3], b.N[10000]
goroutine[7] enter loop func in BenchmarkParalell: round[3], b.N[10000]
goroutine[8] enter loop func in BenchmarkParalell: round[3], b.N[10000]
goroutine[8] exit loop func in BenchmarkParalell: round[3], loop[4576]
goroutine[7] exit loop func in BenchmarkParalell: round[3], loop[5424]
goroutine[4] exit BenchmarkParalell: round[3], b.N[10000]

goroutine[4] enter BenchmarkParalell: round[4], b.N[1000000]
goroutine[9] enter loop func in BenchmarkParalell: round[4], b.N[1000000]
goroutine[10] enter loop func in BenchmarkParalell: round[4], b.N[1000000]
goroutine[9] exit loop func in BenchmarkParalell: round[4], loop[478750]
goroutine[10] exit loop func in BenchmarkParalell: round[4], loop[521250]
goroutine[4] exit BenchmarkParalell: round[4], b.N[1000000]

goroutine[4] enter BenchmarkParalell: round[5], b.N[25717561]
goroutine[11] enter loop func in BenchmarkParalell: round[5], b.N[25717561]
goroutine[12] enter loop func in BenchmarkParalell: round[5], b.N[25717561]
goroutine[12] exit loop func in BenchmarkParalell: round[5], loop[11651491]
goroutine[11] exit loop func in BenchmarkParalell: round[5], loop[14066070]
goroutine[4] exit BenchmarkParalell: round[5], b.N[25717561]
BenchmarkParalell-2       25717561               43.6 ns/op
PASS
ok         command-line-arguments 1.176s

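We can see that for every round of BenchmarkParalell, go test starts GOMAXPROCS new goroutines (two here, because of -cpu=2). These goroutines jointly execute the b.N loop iterations. When b.N is large, each goroutine takes a roughly balanced share of the iterations; in the first two rounds, where b.N is small, a single goroutine ends up running all of them.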