Linear reading
To enable a linear memory access pattern, the reference solution introduces a Θ(n²) preprocessing step that allocates additional space for storing the transpose of d in row-major order. This allows us to read the columns of d linearly, using fully packed cache lines on each read.
The easiest way of allocating memory on the heap for contiguous elements is probably to create a vector, which is a struct containing a pointer, capacity, and length. We use the compile-time macro std::vec! to create a mutable vector of length n * n, with all elements initialized to the value 0.0, and then fill it with the transpose of d. Note that there is no need to annotate the type of the vector, since f32 is inferred from context:
// Transpose of d
let mut t = std::vec![0.0; n * n];
// Function: for some column j in d,
// copy all elements of that column into row j in t (t_row)
let transpose_column = |(j, t_row): (usize, &mut [f32])| {
    for (i, x) in t_row.iter_mut().enumerate() {
        *x = d[n*i + j];
    }
};
// Copy all columns of d into rows of t in parallel
t.par_chunks_mut(n)
    .enumerate()
    .for_each(transpose_column);
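The snippets in this chapter assume that rayon's parallel iterator traits are in scope (use rayon::prelude::*), presumably imported once at the top of the file as in the earlier versions. As a quick sanity check, which is not part of the reference solution, the transpose invariant can be asserted for small inputs:
// Hypothetical check: t must mirror d across the diagonal
for i in 0..n {
    for j in 0..n {
        assert_eq!(t[n*i + j], d[n*j + i]);
    }
}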
Now all columns of d have been stored as rows in t, and all we have to do is to iterate over all row pair combinations of d and t. As previously, we partition r into n non-overlapping, mutable rows such that each thread is working on one row at a time:
// Function: for some row i in d and all rows of t,
// compute n results into row i in r (r_row)
let step_row = |(i, r_row): (usize, &mut [f32])| {
    for (j, res) in r_row.iter_mut().enumerate() {
        let mut v = std::f32::INFINITY;
        for k in 0..n {
            let x = d[n*i + k];
            let y = t[n*j + k];
            let z = x + y;
            v = v.min(z);
        }
        *res = v;
    }
};
// Partition r into rows containing n elements,
// and apply step_row on all rows in parallel
r.par_chunks_mut(n)
    .enumerate()
    .for_each(step_row);
Benchmark
We'll use the same settings as in v0.
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v1 | gcc 7.4.0-1ubuntu1 | 60.5 | 1.54 |
C++ v1 | clang 6.0.0-1ubuntu2 | 60.5 | 1.00 |
Rust v1 | rustc 1.38.0-nightly | 114.6 | 2.11 |
The linear memory access pattern helps a lot here, compared to what we had in the previous version. However, the Rust program is struggling to keep up, executing twice as many instructions per cycle as the C++ program while being almost two times slower. In the previous chapter, we talked about array bounds checking and NaN checks not affecting the running time due to a bad memory access pattern. We fixed the memory access pattern, but now the extra instructions are starting to slow us down. Let's look at the most recent output from rustc to see these extra instructions. This time, we skip gcc and clang, because they produced almost the same output as in v0.
rustc
Not much has changed from v0, except that there are even more registers involved in the bounds checking.
LOOP:
cmp rax,rdx
jae 13e
mov rcx,QWORD PTR [rbx+0x10]
cmp rcx,rsi
jbe 150
mov rcx,QWORD PTR [rbx]
mov r10,QWORD PTR [r15]
vmovss xmm2,DWORD PTR [r10+rax*4]
vaddss xmm2,xmm2,DWORD PTR [rcx+rsi*4]
vminss xmm3,xmm2,xmm1
vcmpunordss xmm1,xmm1,xmm1
vblendvps xmm1,xmm3,xmm2,xmm1
inc rsi
inc rax
dec rdi
jne LOOP
Running the Rust program benchmark with perf-record suggests that a significant amount of the running time is spent doing NaN checks with vcmpunordss and vblendvps.
Dealing with the NaN check
Let's remove the NaN checks by replacing f32::min in the inner loop with a simple if-else expression:
for k in 0..n {
    let x = d[n*i + k];
    let y = t[n*j + k];
    let z = x + y;
    v = if v < z { v } else { z };
}
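The checks were there in the first place because of the NaN semantics of f32::min: it is documented to return the other argument if one of the two is NaN, which the plain comparison cannot express. A small illustration of the difference, separate from the benchmark code:
fn main() {
    let v = 1.0f32;
    let z = std::f32::NAN;
    // f32::min returns the non-NaN argument, so this prints 1
    println!("{}", v.min(z));
    // 1.0 < NaN is false, so the if-else picks z and prints NaN
    println!("{}", if v < z { v } else { z });
}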
Compiling and checking the output, we see that the NaN checks are gone from our loop:
LOOP:
cmp rax,rdx
jae 133
mov rcx,QWORD PTR [rbx+0x10]
cmp rcx,rsi
jbe 145
mov rcx,QWORD PTR [rbx]
mov r10,QWORD PTR [r15]
vmovss xmm2,DWORD PTR [r10+rax*4]
vaddss xmm2,xmm2,DWORD PTR [rcx+rsi*4]
vminss xmm1,xmm1,xmm2
inc rsi
inc rax
dec rdi
jne LOOP
Benchmarking the Rust program shows that the running time also improved quite a lot:
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v1 | gcc 7.4.0-1ubuntu1 | 60.5 | 1.54 |
C++ v1 | clang 6.0.0-1ubuntu2 | 60.5 | 1.00 |
Rust v1 | rustc 1.38.0-nightly | 60.8 | 3.43 |
What about the array bounds checks? Our mid-range CPU seems to be handling them without any problems, even in the most performance-critical loop. However, the bounds checks are certainly not free, as we can see from the IPC numbers. The C++ implementation of v1 is proof that it is possible to solve the problem with significantly fewer instructions. On the other hand, we don't want to remove the bounds checks completely, since we'd prefer to use as little unsafe Rust as possible.
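For comparison, removing the checks by hand would mean reaching for unchecked indexing. A sketch of what we are trying to avoid, using the slice method get_unchecked on the same inner loop:
for k in 0..n {
    // Caller must guarantee n*i + k and n*j + k are in bounds,
    // otherwise this is undefined behavior
    let x = unsafe { *d.get_unchecked(n*i + k) };
    let y = unsafe { *t.get_unchecked(n*j + k) };
    let z = x + y;
    v = if v < z { v } else { z };
}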
Dealing with the bounds checks
Our solution is similar to the preprocessing step of computing the transpose of d: we will perform a bit of extra work outside the loop to remove a lot of work from inside the loop. If we extract one row of d and one row of t as subslices before the inner loop starts, the compiler will have a chance to assert that the starting and ending indexes of the subslices are within the bounds of the slices we extract the subslices from:
let step_row = |(i, r_row): (usize, &mut [f32])| {
    // Get a view of row i of d as a subslice
    let d_row = &d[n*i..n*(i+1)];
    for (j, res) in r_row.iter_mut().enumerate() {
        // Same for row j in t
        let t_row = &t[n*j..n*(j+1)];
        let mut v = std::f32::INFINITY;
        for k in 0..n {
            let x = d_row[k];
            let y = t_row[k];
            let z = x + y;
            v = if v < z { v } else { z };
        }
        *res = v;
    }
};
After compiling the program, we can see that the compiler still wants to check that k is in bounds. Since rsi is incremented by 1 after each iteration, and it is used to load two f32s, it is very likely equal to our k.
LOOP:
cmp r10,rsi
je 194
vmovss xmm2,DWORD PTR [rdx+rsi*4]
vaddss xmm2,xmm2,DWORD PTR [rax+rsi*4]
inc rsi
vminss xmm1,xmm1,xmm2
cmp rcx,rsi
jne LOOP
Benchmarks show that the amount of IPC was reduced significantly:
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v1 | gcc 7.4.0-1ubuntu1 | 60.5 | 1.54 |
C++ v1 | clang 6.0.0-1ubuntu2 | 60.5 | 1.00 |
Rust v1 | rustc 1.38.0-nightly | 60.6 | 2.02 |
Let's get all bounds checking out of the loop. We are currently using k only for accessing every element of d_row and t_row between 0..n, so we might as well use iterators over both subslices. If we zip them together, there's no need for k anymore.
for (&x, &y) in d_row.iter().zip(t_row.iter()) {
    let z = x + y;
    v = if v < z { v } else { z };
}
After compiling the program, we can see that not only did the compiler remove the bounds checks but it also unrolled 8 iterations of the loop:
LOOP:
vmovss xmm2,DWORD PTR [r9+r15*4-0x1c]
vmovss xmm3,DWORD PTR [r9+r15*4-0x18]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x1c]
vminss xmm1,xmm1,xmm2
vaddss xmm2,xmm3,DWORD PTR [r13+r15*4-0x18]
vmovss xmm3,DWORD PTR [r9+r15*4-0x14]
vaddss xmm3,xmm3,DWORD PTR [r13+r15*4-0x14]
vminss xmm1,xmm1,xmm2
vminss xmm1,xmm1,xmm3
vmovss xmm2,DWORD PTR [r9+r15*4-0x10]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x10]
vminss xmm1,xmm1,xmm2
vmovss xmm2,DWORD PTR [r9+r15*4-0xc]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0xc]
vmovss xmm3,DWORD PTR [r9+r15*4-0x8]
vaddss xmm3,xmm3,DWORD PTR [r13+r15*4-0x8]
vminss xmm1,xmm1,xmm2
vminss xmm1,xmm1,xmm3
vmovss xmm2,DWORD PTR [r9+r15*4-0x4]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x4]
vminss xmm1,xmm1,xmm2
vmovss xmm2,DWORD PTR [r9+r15*4]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4+0x0]
add r15,0x8
vminss xmm1,xmm1,xmm2
cmp rax,r15
jne LOOP
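One way to see why the compiler is allowed to drop the checks: zip never reads past the end of either slice, it simply stops at the shorter one, so every access it produces is in bounds by construction. A small, standalone illustration:
fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0];
    let b = [10.0f32, 20.0];
    // zip yields only two pairs, bounded by the shorter slice
    let sums: Vec<f32> = a.iter().zip(b.iter()).map(|(&x, &y)| x + y).collect();
    assert_eq!(sums, vec![11.0, 22.0]);
}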
Recall how clang unrolled the loop in v0 in exactly the same way. Since our program is still memory bottlenecked, the unrolling does not affect the running time. However, it does reduce the total amount of IPC:
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v1 | gcc 7.4.0-1ubuntu1 | 60.5 | 1.54 |
C++ v1 | clang 6.0.0-1ubuntu2 | 60.5 | 1.00 |
Rust v1 | rustc 1.38.0-nightly | 60.6 | 0.92 |
The reason for this is that we now have more instructions doing useful work (loading memory with vmovss, adding with vaddss, and computing minimums with vminss) than loop-related instructions such as comparisons and jumps. Compare this to the single-element loop gcc produced in v0.
iter all the things
If we succeeded in eliminating k from the innermost loop by using iterators, can we remove all loop variables with iterators? We are using chunks_mut to divide r into rows of length n, so why not do something similar with d and t, but with immutable chunks instead? Our function computes n results for row i in d into row i in r. We can make i redundant by chunking d into rows at the same time as r, zipping the row iterators into pairs, and applying step_row in parallel on all (r_row, d_row) pairs.
Inside step_row, we loop over all columns j of d, i.e. all rows j of t. If we chunk up t into n rows of length n inside step_row, we can zip up that iterator with row i of r, and we have made index j redundant.
Finally, we wrap our if-else minimum into a function and put it into our toolbox:
#[inline(always)]
pub fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}
Here's the final v1 version of step_row:
// Function: for some row i in d (d_row) and all rows of t (t_rows),
// compute n results into a row in r (r_row)
let step_row = |(r_row, d_row): (&mut [f32], &[f32])| {
    let t_rows = t.chunks_exact(n);
    for (res, t_row) in r_row.iter_mut().zip(t_rows) {
        *res = d_row.iter()
            .zip(t_row)
            .fold(std::f32::INFINITY, |v, (&x, &y)| min(v, x + y));
    }
};
// Partition r and d into slices, each containing a single row of r and d,
// and apply the function on the row pairs
r.par_chunks_mut(n)
    .zip(d.par_chunks(n))
    .for_each(step_row);
The compiler output and benchmark results are unchanged.
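For completeness, here is a self-contained sketch that puts the v1 pieces together. The step function name and signature, the rayon dependency, and the tiny main driver are illustrative assumptions for this sketch, not necessarily how the reference code is organized:
use rayon::prelude::*;

#[inline(always)]
pub fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}

// Hypothetical wrapper; assumes r and d both contain n * n elements in row-major order
pub fn step(r: &mut [f32], d: &[f32], n: usize) {
    // Preprocessing: store the transpose of d in row-major order
    let mut t = vec![0.0f32; n * n];
    t.par_chunks_mut(n).enumerate().for_each(|(j, t_row)| {
        for (i, x) in t_row.iter_mut().enumerate() {
            *x = d[n*i + j];
        }
    });
    // The final, iterator-based step_row from above
    let step_row = |(r_row, d_row): (&mut [f32], &[f32])| {
        let t_rows = t.chunks_exact(n);
        for (res, t_row) in r_row.iter_mut().zip(t_rows) {
            *res = d_row.iter()
                .zip(t_row)
                .fold(std::f32::INFINITY, |v, (&x, &y)| min(v, x + y));
        }
    };
    r.par_chunks_mut(n)
        .zip(d.par_chunks(n))
        .for_each(step_row);
}

fn main() {
    // Tiny 3x3 example input, flattened in row-major order
    let n = 3;
    let d: Vec<f32> = vec![
        0.0, 8.0, 2.0,
        1.0, 0.0, 9.0,
        4.0, 5.0, 0.0,
    ];
    let mut r: Vec<f32> = vec![0.0; n * n];
    step(&mut r, &d, n);
    println!("{:?}", r);
}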
It's nice to see functional code that performs as well as a C++ program. However, as we start pushing the CPU towards its limits, we will eventually have to trade away some "functional prettiness" for raw performance, e.g. by unrolling loops and using hard-coded numbers of variables.