Keeping your caches hot

First of all, don't thrash your caches, because main memory is very slow. This means that reading data scattered all over memory (also known as pointer chasing) isn't the best of ideas. Modern processors reward programs that read consecutive chunks of memory in a predictable manner, as this allows them to leverage hardware-level prefetching. The keyword here is data locality.

A negative example here, alas, is our old and trusty linked list! Traversing it is a real pointer-chasing feast, because all of its nodes are allocated dynamically and can be placed anywhere in memory. However, we can remedy this with the already-mentioned techniques of preallocation and custom memory allocators. In that case, all of the nodes of a list would be located in adjacent elements of the preallocated buffer, making the list cache friendly again. Of course, we could just use an array from the start, but this is an example of how to improve data locality in similar cases.
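As a minimal sketch of this idea (the Node type, its contents, and the vector-based pool are assumptions for illustration), we can preallocate all the nodes in one contiguous buffer and link them up front, so that traversal walks through adjacent memory:

#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

// All nodes live in one contiguous buffer; consecutive list nodes are
// neighbors in memory, so traversal no longer jumps around the heap.
// Note: the pool must not be resized afterwards, or the next pointers
// would be invalidated.
std::vector<Node> pool;

Node* build_list(std::size_t count) {
    if (count == 0) return nullptr;
    pool.assign(count, Node{});
    for (std::size_t i = 0; i + 1 < count; ++i)
        pool[i] = Node{static_cast<int>(i), &pool[i + 1]};
    pool[count - 1] = Node{static_cast<int>(count - 1), nullptr};
    return pool.data();
}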

The second classic mistake is placing two often-used integers, x and y, far apart in a data structure, so that when we want to use both, we needlessly load two cache lines instead of one. What's used together should stay together; don't let your hot fields straddle a cache line boundary!
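For example (a hypothetical Particle type, purely for illustration), this is the difference between a layout that wastes a cache line and one that doesn't:

// Bad: x and y are used together, but the cold name buffer between
// them pushes y onto a different cache line (typically 64 bytes).
struct ParticleBad {
    int  x;
    char name[120];  // cold data, rarely touched
    int  y;
};

// Good: the hot fields x and y sit next to each other and arrive
// with a single cache line fill.
struct ParticleGood {
    int  x;
    int  y;
    char name[120];
};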

Another widely known optimization is replacing an array of structures (AoS) with a structure of arrays (SoA). This is beneficial when loading data for SIMD instructions and in other situations where we read data in parallel.
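As a small sketch of the difference (the Point fields are an assumption for illustration):

#include <vector>

// Array of structures (AoS): x, y, and z of one point are adjacent,
// but a loop over all x values skips over y and z on every step.
struct PointAoS { float x, y, z; };
std::vector<PointAoS> points_aos;

// Structure of arrays (SoA): all x values are contiguous, so a loop
// over them (or a SIMD load of several consecutive x values) streams
// through a dense array and wastes no cache-line space.
struct PointsSoA {
    std::vector<float> x, y, z;
};

With SoA, summing all x coordinates, for instance, touches only the cache lines that actually contain x values.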

Another often-cited optimization trick is improving the performance of matrix multiplication by changing the data-access pattern. Instead of a straightforward triple i, j, k loop, we change it to the i, k, j loop:

for (size_t i = 0; i < N; ++i)
    for (size_t k = 0; k < P; ++k)
        for (size_t j = 0; j < M; ++j)
            res[i][j] += m1[i][k] * m2[k][j];

Here, just switching the order of traversal to be more cache friendly, so that the innermost loop reads the matrix rows in a consecutive manner, dramatically improves performance (in some measurements, by up to 94%)!
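For the curious, here is a minimal self-contained sketch for timing both loop orders (the matrix sizes and the float element type are arbitrary choices; exact numbers will vary by machine):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<float>>;
const std::size_t N = 512, M = 512, P = 512;  // arbitrary sizes

// Naive i, j, k order: the innermost loop walks down a column of m2.
void multiply_naive(const Matrix& m1, const Matrix& m2, Matrix& res) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < M; ++j)
            for (std::size_t k = 0; k < P; ++k)
                res[i][j] += m1[i][k] * m2[k][j];
}

// Reordered i, k, j: the innermost loop reads res and m2 row by row.
void multiply_reordered(const Matrix& m1, const Matrix& m2, Matrix& res) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < P; ++k)
            for (std::size_t j = 0; j < M; ++j)
                res[i][j] += m1[i][k] * m2[k][j];
}

int main() {
    Matrix m1(N, std::vector<float>(P, 1.0f));
    Matrix m2(P, std::vector<float>(M, 2.0f));
    Matrix res(N, std::vector<float>(M, 0.0f));

    auto bench = [&](auto&& multiply, const char* label) {
        for (auto& row : res) std::fill(row.begin(), row.end(), 0.0f);
        auto start = std::chrono::steady_clock::now();
        multiply(m1, m2, res);
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::printf("%s: %.3f s\n", label, elapsed.count());
    };

    bench(multiply_naive, "naive (i, j, k)");
    bench(multiply_reordered, "reordered (i, k, j)");
}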

What we learn here is that the layout and structure of our data, as well as its access patterns, can have a big impact on our program's performance on modern processors!

The second type of cache, that is, the instruction cache, needs some love as well. Jumping back and forth through code is just as bad as it is with your data. This code locality is the next important notion. One way to influence it is to place your normal-case code before the error handling, like this:

if (ok) {
    do_work();
} else {
    printf("ERROR!");
    return;
}

This avoids a jump in the normal case, hence improving code locality. We'll discuss this in more detail in Chapter 3, Deep Dive into C++ and Performance, where we'll look at the role of the compiler in program optimization.
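Looking ahead a little: where the error path has to stay where it is, C++20 lets us state the expected case explicitly with the [[likely]] and [[unlikely]] attributes, so the compiler can lay out the hot path accordingly (a small sketch; do_work and ok are just placeholders, as above):

#include <cstdio>

void do_work() { /* the common, hot path */ }

void process(bool ok) {
    if (ok) [[likely]] {  // hint: this branch is taken almost always
        do_work();
    } else {
        std::printf("ERROR!\n");
    }
}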