Baseline

Full source

Our first version will be little more than three simple, nested for-loops. This serves as an initial starting point, on top of we will gradually add more complexity, which should greatly improve the performance of our program.

C++ copy-paste

Let's start by implementing the single-threaded version of the algorithm. Recall how in the previous chapter we defined the C interface function step that wraps input pointers into slices and passes those slices to a Rust function called _step. One low-effort approach to implement _step is converting the C++ reference solution line by line into valid Rust syntax:

fn _step(r: &mut [f32], d: &[f32], n: usize) { for i in 0..n { for j in 0..n { let mut v = std::f32::INFINITY; for k in 0..n { let x = d[n*i + k]; let y = d[n*k + j]; let z = x + y; v = v.min(z); } r[n*i + j] = v; } } }

In addition to being very inefficient, this implementation has several Rust-specific problems that we will address in the upcoming chapters. But first, let's assume this really is our best idea so far and think about how to parallelize this. In the C++ reference solution, each iteration of the outermost for-loop is distributed into parallel threads by using a #pragma omp parallel for compile time macro from the OpenMP library. We don't have such macros in Rust, and even if we would start implementing some kind of thread pool with standard library threads or use some ready-made data parallelism solution, our problem will always be variable r. Since mutable references cannot be aliased, only one mutable reference to r can ever exist, which makes our current idea inherently sequential and unusable.

Borrowing

Before continuing, let's talk a bit about reference borrowing, which is a fundamental part of how Rust implements thread safety. When we pass r into _step from the extern wrapper function, we have to tell the compiler we are about to transfer a mutable reference r into the scope of _step from the scope of step:

_step(&mut r, d, n as usize);

In Rust this is called a mutable borrow. Mutable borrows cannot be aliased, which means it is not possible to have more than one mutable reference to r within one scope at a time. Immutable borrows, on the other hand, may be aliased. Therefore, we can have an arbitrary amount of immutable references to slice d in concurrently executing threads, but it is not possible to do the same for slice r. While this effectively eliminates the possibility of data races already at compile time, we need to think a bit more about how to properly distribute the mutable data of r into concurrent threads.

A parallelizable approach

We will solve this problem by partitioning r into non-overlapping, mutable subslices, and give ownership of each subslice to the thread that will write its results into that particular piece of memory. To encapsulate one unit of work for one thread, we replace the outermost for-loop by a function which captures all immutable state, slice d, by reference from the enclosing scope, and accepts a single, mutable row of r as an argument:

// Function: for some row i and every column j in d, // compute n results into r (r_row) let step_row = |(i, r_row): (usize, &mut [f32])| { for (j, res) in r_row.iter_mut().enumerate() { let mut v = std::f32::INFINITY; for k in 0..n { let x = d[n*i + k]; let y = d[n*k + j]; let z = x + y; v = v.min(z); } *res = v; } };

Note how res will always be equal to r[n*i + j].

In order to use this function on the result slice r, we must first partition r into rows of length n. Rust slices have a builtin method chunks_mut, which will partition the slice into non-overlapping, mutable subslices of a given length. If we want to partition r into mutable rows, each containing n elements, we can get an iterator over such mutable, row chunks with:

r.chunks_mut(n)

If we enumerate the iterator, we will get the original row indexes from 0 to n-1, and all that remains is to apply step_row on each (index, row_chunk) pair:

r.chunks_mut(n) .enumerate() .for_each(step_row);

The reason why we took this approach is that by explicitly partitioning r into new, mutable subslices, the compiler can pass ownership of these subslices to other scopes, without affecting the validity of other subslices. This allows us e.g. to implement a thread pool that executes step_row on each r_row subslice in parallel. Fortunately, there's already a crate for that. All we have to do is to replace chunks_mut with its parallel counterpart par_chunks_mut, which creates concurrent threads that can be used to apply step_row to each row chunk in parallel, in a work-stealing manner, until all rows have been processed:

r.par_chunks_mut(n) .enumerate() .for_each(step_row);

Benchmark

Let's run some benchmarks. We'll be using randomly generated input of size n = 6000 and run the step function with 4 threads on 4 cores for a single iteration. We measure the total running time in seconds and instructions per cycle (IPC). Here is a more detailed specification of the benchmark parameters and CPU. The C++ reference implementation will be compiled with Clang and GCC, so we'll be running 3 benchmarks in total. Here are the results:

ImplementationCompilerTime (s)IPC
C++ v0gcc 7.4.0-1ubuntu12890.39
C++ v0clang 6.0.0-1ubuntu22970.28
Rust v0rustc 1.38.0-nightly2850.78

All step functions take almost 300 seconds to complete when n = 6000. There seems to be some differences in the amount of instructions executed at each cycle. To find answers, we need to take a look at what the compilers produced for the innermost loop of the step function.

Assembly

gcc

Minimal loop that corresponds to a for loop in the source code, iterating one element at a time. See here for a detailed explanation on how it relates to the C++ code.

LOOP: vmovss xmm0,DWORD PTR [rdx+rax*1] vaddss xmm0,xmm0,DWORD PTR [rcx+rax*1] add rax,0x4 vminss xmm1,xmm0,xmm1 cmp rax,rsi jne LOOP

clang

Same as the gcc single element loop but it is unrolled for 4 iterations. Note how the loop register r8 is incremented by 4 after each iteration, and that the memory addresses from where we are loading 32-bit values are offset by r8*4 minus 12, 8, 4, and 0.

LOOP: vmovss xmm2,DWORD PTR [rdi+r8*4-0xc] vmovss xmm3,DWORD PTR [rdi+r8*4-0x8] vaddss xmm2,xmm2,DWORD PTR [r15+r8*4-0xc] vaddss xmm3,xmm3,DWORD PTR [r15+r8*4-0x8] vminss xmm1,xmm2,xmm1 vminss xmm1,xmm3,xmm1 vmovss xmm2,DWORD PTR [rdi+r8*4-0x4] vaddss xmm2,xmm2,DWORD PTR [r15+r8*4-0x4] vminss xmm1,xmm2,xmm1 vmovss xmm2,DWORD PTR [rdi+r8*4] vaddss xmm2,xmm2,DWORD PTR [r15+r8*4] vminss xmm1,xmm2,xmm1 add r8,0x4 cmp rbp,r8 jne LOOP

rustc

This looks like the gcc single element loop, but there is something extra going on. What we see here is array bounds checking before loading values from memory and a NaN check before updating the intermediate result (mutable variable v in the code).

LOOP: cmp rsi,rdx jae 137 cmp rax,rdx jae 146 mov rdi,QWORD PTR [rbx] vmovss xmm2,DWORD PTR [rdi+rsi*4] vaddss xmm2,xmm2,DWORD PTR [rdi+rax*4] vminss xmm3,xmm2,xmm1 vcmpunordss xmm1,xmm1,xmm1 vblendvps xmm1,xmm3,xmm2,xmm1 add rax,r8 inc rsi dec rbp jne LOOP

Let's look at it in smaller chunks.

Here we do bounds checking for rsi and rax, jumping out of the loop and starting a panic in case they have reached the threshold specified in rdx. We can also see that rdi is loaded from memory at each iteration even though it stays constant in this loop. The register is used when loading two f32 values from memory, so it is probably also related to bounds checking in some way.

cmp rsi,rdx jae 137 cmp rax,rdx jae 146 mov rdi,QWORD PTR [rbx]

Here is the useful stuff we want to do, load two f32s, add them, and update the current minimum.

vmovss xmm2,DWORD PTR [rdi+rsi*4] vaddss xmm2,xmm2,DWORD PTR [rdi+rax*4] vminss xmm3,xmm2,xmm1

However, instead of keeping the current minimum always in xmm1, the compiler uses a temporary register xmm3 for checking that the computed value is not NaN before writing it into xmm1. It seems that f32::min enforces a NaN-check (x < y || y != y) to comply with IEEE standards, which might be causing these extra instructions:

vcmpunordss xmm1,xmm1,xmm1 vblendvps xmm1,xmm3,xmm2,xmm1

The reason why these extra instructions did not affect the running time, despite leading to an increased amount of instructions per cycle, is probably because the CPU was sitting idle most of the time, waiting for memory accesses to complete. We are currently using a very poor memory access pattern by reading d column-wise. That's what we're going to fix in the next chapter.