Baseline
Our first version will be little more than three simple, nested for-loops.
This serves as a starting point, on top of which we will gradually add more complexity that should greatly improve the performance of our program.
C++ copy-paste
Let's start by implementing the single-threaded version of the algorithm.
Recall how in the previous chapter we defined the C interface function step, which wraps the input pointers into slices and passes those slices to a Rust function called _step.
One low-effort way to implement _step is to convert the C++ reference solution line by line into valid Rust syntax:
fn _step(r: &mut [f32], d: &[f32], n: usize) {
    for i in 0..n {
        for j in 0..n {
            let mut v = std::f32::INFINITY;
            for k in 0..n {
                let x = d[n*i + k];
                let y = d[n*k + j];
                let z = x + y;
                v = v.min(z);
            }
            r[n*i + j] = v;
        }
    }
}
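As a quick sanity check of what this computes (each r[n*i + j] ends up as the minimum over k of d[n*i + k] + d[n*k + j]), here is a hypothetical toy invocation; the input values are made up purely for illustration:
fn main() {
    // 2x2 row-major "distance" matrix, values chosen only for this example.
    let d = vec![
        0.0f32, 8.0,
        2.0,    0.0,
    ];
    let mut r = vec![0.0f32; 4];
    _step(&mut r, &d, 2);
    // e.g. r[0*2 + 1] = min(d[0] + d[1], d[1] + d[3]) = min(8.0, 8.0) = 8.0
    assert_eq!(r, vec![0.0, 8.0, 2.0, 0.0]);
}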
In addition to being very inefficient, this implementation has several Rust-specific problems that we will address in the upcoming chapters.
But first, let's assume this really is our best idea so far and think about how to parallelize this.
In the C++ reference solution, each iteration of the outermost for-loop is distributed into parallel threads by using the #pragma omp parallel for compiler directive from OpenMP.
We don't have such directives in Rust, and even if we were to implement some kind of thread pool with standard library threads, or use some ready-made data parallelism solution, our problem would still be the variable r.
Since mutable references cannot be aliased, only one mutable reference to r can ever exist, which makes our current idea inherently sequential and unusable.
Borrowing
Before continuing, let's talk a bit about reference borrowing, which is a fundamental part of how Rust implements thread safety.
When we pass r into _step from the extern wrapper function, we have to tell the compiler that we are about to transfer a mutable reference to r from the scope of step into the scope of _step:
_step(&mut r, d, n as usize);
In Rust this is called a mutable borrow.
Mutable borrows cannot be aliased, which means it is not possible to have more than one mutable reference to r within one scope at a time.
Immutable borrows, on the other hand, may be aliased.
Therefore, we can have an arbitrary number of immutable references to slice d in concurrently executing threads, but it is not possible to do the same for slice r.
While this effectively eliminates the possibility of data races already at compile time, we need to think a bit more about how to properly distribute the mutable data of r into concurrent threads.
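As a small illustration (this snippet is not part of the step program), the compiler rejects a second simultaneous mutable borrow, while any number of shared borrows may coexist:
fn main() {
    let mut r = vec![0.0f32; 4];
    let d = vec![1.0f32; 4];

    let r1 = &mut r;
    // let r2 = &mut r; // error[E0499]: cannot borrow `r` as mutable more than once at a time
    r1[0] = d[0];

    // Aliasing immutable borrows is fine: both views of d may live at the same time.
    let d1 = &d;
    let d2 = &d;
    assert_eq!(d1[0], d2[0]);
}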
A parallelizable approach
We will solve this problem by partitioning r into non-overlapping, mutable subslices and giving ownership of each subslice to the thread that will write its results into that particular piece of memory.
To encapsulate one unit of work for one thread, we replace the outermost for-loop with a function that captures all immutable state (the slice d) by reference from the enclosing scope and accepts a single, mutable row of r as an argument:
// Function: for some row i and every column j in d,
// compute n results into r (r_row)
let step_row = |(i, r_row): (usize, &mut [f32])| {
    for (j, res) in r_row.iter_mut().enumerate() {
        let mut v = std::f32::INFINITY;
        for k in 0..n {
            let x = d[n*i + k];
            let y = d[n*k + j];
            let z = x + y;
            v = v.min(z);
        }
        *res = v;
    }
};
Note how res always refers to the element r[n*i + j].
In order to use this function on the result slice r, we must first partition r into rows of length n.
Rust slices have a built-in method chunks_mut, which will partition the slice into non-overlapping, mutable subslices of a given length.
If we want to partition r into mutable rows, each containing n elements, we can get an iterator over such mutable row chunks with:
r.chunks_mut(n)
If we enumerate the iterator, we will get the original row indexes from 0 to n-1, and all that remains is to apply step_row on each (index, row_chunk) pair:
r.chunks_mut(n)
    .enumerate()
    .for_each(step_row);
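To see what this does on a tiny input, here is an illustration-only snippet (not part of the actual program) that fills each row of a 3x3 buffer with its own row index:
fn main() {
    let n = 3;
    let mut r = vec![0.0f32; n * n];
    // Each chunk is a disjoint &mut [f32] of length n, i.e. one row of r.
    r.chunks_mut(n)
        .enumerate()
        .for_each(|(i, row)| {
            for x in row.iter_mut() {
                *x = i as f32;
            }
        });
    assert_eq!(r, vec![0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]);
}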
The reason we took this approach is that by explicitly partitioning r into new, mutable subslices, the compiler can pass ownership of these subslices to other scopes without affecting the validity of the other subslices.
This allows us, for example, to implement a thread pool that executes step_row on each r_row subslice in parallel.
Fortunately, there's already a crate for that.
All we have to do is replace chunks_mut with its parallel counterpart par_chunks_mut, which creates concurrent threads that apply step_row to each row chunk in parallel, in a work-stealing manner, until all rows have been processed:
r.par_chunks_mut(n)
    .enumerate()
    .for_each(step_row);
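For completeness, here is a sketch of how the pieces fit together, assuming the crate in question is rayon, whose prelude provides par_chunks_mut:
use rayon::prelude::*;

fn _step(r: &mut [f32], d: &[f32], n: usize) {
    // One unit of work: compute row i of the result into r_row.
    let step_row = |(i, r_row): (usize, &mut [f32])| {
        for (j, res) in r_row.iter_mut().enumerate() {
            let mut v = std::f32::INFINITY;
            for k in 0..n {
                v = v.min(d[n*i + k] + d[n*k + j]);
            }
            *res = v;
        }
    };
    // Disjoint mutable rows are handed out to a work-stealing thread pool.
    r.par_chunks_mut(n)
        .enumerate()
        .for_each(step_row);
}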
Benchmark
Let's run some benchmarks.
We'll be using randomly generated input of size n = 6000 and run the step function with 4 threads on 4 cores for a single iteration.
We measure the total running time in seconds and instructions per cycle (IPC).
Here is a more detailed specification of the benchmark parameters and CPU.
The C++ reference implementation will be compiled with Clang and GCC, so we'll be running 3 benchmarks in total.
Here are the results:
| Implementation | Compiler | Time (s) | IPC |
| --- | --- | --- | --- |
| C++ v0 | gcc 7.4.0-1ubuntu1 | 289 | 0.39 |
| C++ v0 | clang 6.0.0-1ubuntu2 | 297 | 0.28 |
| Rust v0 | rustc 1.38.0-nightly | 285 | 0.78 |
All step functions take almost 300 seconds to complete when n = 6000.
There seem to be some differences in the number of instructions executed each cycle, though.
To find answers, we need to take a look at what the compilers produced for the innermost loop of the step function.
Assembly
gcc
A minimal loop that corresponds to a for-loop in the source code, iterating one element at a time.
See here for a detailed explanation of how it relates to the C++ code.
LOOP:
vmovss xmm0,DWORD PTR [rdx+rax*1]
vaddss xmm0,xmm0,DWORD PTR [rcx+rax*1]
add rax,0x4
vminss xmm1,xmm0,xmm1
cmp rax,rsi
jne LOOP
clang
Same as the gcc single-element loop, but unrolled for 4 iterations.
Note how the loop register r8 is incremented by 4 after each iteration, and that the memory addresses from which we load the 32-bit values are offset by r8*4 minus 12, 8, 4, and 0.
LOOP:
vmovss xmm2,DWORD PTR [rdi+r8*4-0xc]
vmovss xmm3,DWORD PTR [rdi+r8*4-0x8]
vaddss xmm2,xmm2,DWORD PTR [r15+r8*4-0xc]
vaddss xmm3,xmm3,DWORD PTR [r15+r8*4-0x8]
vminss xmm1,xmm2,xmm1
vminss xmm1,xmm3,xmm1
vmovss xmm2,DWORD PTR [rdi+r8*4-0x4]
vaddss xmm2,xmm2,DWORD PTR [r15+r8*4-0x4]
vminss xmm1,xmm2,xmm1
vmovss xmm2,DWORD PTR [rdi+r8*4]
vaddss xmm2,xmm2,DWORD PTR [r15+r8*4]
vminss xmm1,xmm2,xmm1
add r8,0x4
cmp rbp,r8
jne LOOP
rustc
This looks like the gcc single-element loop, but there is something extra going on.
What we see here is array bounds checking before loading values from memory, and a NaN check before updating the intermediate result (the mutable variable v in the code).
LOOP:
cmp rsi,rdx
jae 137
cmp rax,rdx
jae 146
mov rdi,QWORD PTR [rbx]
vmovss xmm2,DWORD PTR [rdi+rsi*4]
vaddss xmm2,xmm2,DWORD PTR [rdi+rax*4]
vminss xmm3,xmm2,xmm1
vcmpunordss xmm1,xmm1,xmm1
vblendvps xmm1,xmm3,xmm2,xmm1
add rax,r8
inc rsi
dec rbp
jne LOOP
Let's look at it in smaller chunks.
Here we do bounds checking for rsi and rax, jumping out of the loop and starting a panic in case either of them has reached the threshold specified in rdx.
We can also see that rdi is loaded from memory at each iteration even though it stays constant within this loop.
The register is used when loading the two f32 values from memory, so it is probably also related to bounds checking in some way.
cmp rsi,rdx
jae 137
cmp rax,rdx
jae 146
mov rdi,QWORD PTR [rbx]
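In source-level terms, each indexing expression like d[n*i + k] behaves roughly like the following sketch, which is where the cmp/jae pairs come from:
// Sketch only: what a checked slice access d[idx] roughly expands to.
// The jae targets above jump to out-of-line code that starts the panic.
fn checked_load(d: &[f32], idx: usize) -> f32 {
    match d.get(idx) {
        Some(&x) => x,
        None => panic!("index out of bounds"),
    }
}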
Here is the useful work we want to do: load two f32s, add them, and update the current minimum.
vmovss xmm2,DWORD PTR [rdi+rsi*4]
vaddss xmm2,xmm2,DWORD PTR [rdi+rax*4]
vminss xmm3,xmm2,xmm1
However, instead of always keeping the current minimum in xmm1, the compiler uses a temporary register xmm3 and checks whether the current value of xmm1 is NaN before deciding what to write back into xmm1.
It seems that f32::min enforces a NaN check (x < y || y != y) to comply with IEEE 754 semantics, which might be causing these extra instructions:
vcmpunordss xmm1,xmm1,xmm1
vblendvps xmm1,xmm3,xmm2,xmm1
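For reference, here is a rough source-level sketch of the semantics v.min(z) must preserve (if either operand is NaN, the other one wins), which is presumably what the compare and blend implement:
// Sketch only: f32::min returns the other operand when one of them is NaN,
// so a bare vminss alone would not give the right answer in that case.
fn min_f32(v: f32, z: f32) -> f32 {
    if v < z || z.is_nan() { v } else { z }
}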
The reason these extra instructions did not affect the running time, despite leading to a higher number of instructions per cycle, is probably that the CPU was sitting idle most of the time, waiting for memory accesses to complete.
We are currently using a very poor memory access pattern by reading d column-wise.
That's what we're going to fix in the next chapter.