Instruction level parallelism (ILP)
Our program does not take advantage of the fact that modern CPUs are superscalar processors, capable of executing several independent instructions simultaneously.
The problem in our v1 implementation is that each step is dependent on the previous step, caused by this part:
let z = x + y;
v = min(v, z);
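For context, this is roughly what the v1-style inner loop looks like for a single result element. This is only a sketch, not the exact reference code, and the min function here is a stand-in with the same intent as the helper used throughout the series. Every iteration reads the v written by the previous one, so the additions and minimums form one long serial dependency chain:
// A stand-in for the f32 minimum helper used throughout the series
fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}

// Sketch of a v1-style inner loop for result element (i, j),
// where d is the row-major input of width n and t is its transpose
fn v1_element(d: &[f32], t: &[f32], n: usize, i: usize, j: usize) -> f32 {
    let mut v = std::f32::INFINITY;
    for k in 0..n {
        let z = d[n*i + k] + t[n*j + k];
        // Each iteration has to wait for the v produced by the previous one
        v = min(v, z);
    }
    v
}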
We will solve this by using a simple idea from the reference solution: accumulate into 4 independent intermediate results and merge them only after processing the whole row.
Suppose we have some row of d, containing the elements x0, x1, x2, x3, ..., xn, and some column of d (i.e. row of t), containing the elements y0, y1, y2, y3, ..., yn. Then, we compute results for all rows by accumulating intermediate results into 4 variables v0, v1, v2, v3 as follows:
// iteration 1
v0 = min(v0, x0 + y0);
v1 = min(v1, x1 + y1);
v2 = min(v2, x2 + y2);
v3 = min(v3, x3 + y3);
// iteration 2
v0 = min(v0, x4 + y4);
v1 = min(v1, x5 + y5);
v2 = min(v2, x6 + y6);
v3 = min(v3, x7 + y7);
// iteration 3
v0 = min(v0, x8 + y8);
v1 = min(v1, x9 + y9);
v2 = min(v2, x10 + y10);
v3 = min(v3, x11 + y11);
// etc ...
This should allow the CPU to keep each of the 4 intermediate results in its own register and update them independently of each other.
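Written out as plain Rust, the accumulation pattern could look something like the sketch below. The function name is hypothetical, the two slices are assumed to be equally long with a length that is already a multiple of 4, and the min stand-in from the earlier sketch is assumed to be in scope; the real step_row below expresses the same pattern with an array of 4 accumulators:
// Sketch: minimum of x[k] + y[k] over two equally long, padded rows,
// accumulated into 4 independent intermediate results
fn min_sum_4way(xs: &[f32], ys: &[f32]) -> f32 {
    let (mut v0, mut v1, mut v2, mut v3) = (
        std::f32::INFINITY,
        std::f32::INFINITY,
        std::f32::INFINITY,
        std::f32::INFINITY,
    );
    for (x, y) in xs.chunks_exact(4).zip(ys.chunks_exact(4)) {
        // These 4 updates do not depend on each other,
        // so the CPU can keep all 4 min-chains in flight at the same time
        v0 = min(v0, x[0] + y[0]);
        v1 = min(v1, x[1] + y[1]);
        v2 = min(v2, x[2] + y[2]);
        v3 = min(v3, x[3] + y[3]);
    }
    // Merge the intermediate results only once, after the whole row
    min(min(v0, v1), min(v2, v3))
}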
Before we can update the step_row function, we need to make sure the number of elements on each row is always a multiple of 4, to keep the performance-critical loop free of messy, unnecessary branching. As previously, we transpose d to allow linear reading of its columns, but we also have to make sure the row length of the transpose is divisible by 4. The preprocessing looks a bit more complicated, but is essentially the same as doing the transpose in v1, except that we also copy the values of d into vd, which is padded with std::f32::INFINITY values to make its rows divisible by 4:
const BLOCK_SIZE: usize = 4;
let blocks_per_row = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
let n_padded = blocks_per_row * BLOCK_SIZE;
// d and transpose of d with extra room at the end of each row,
// both initially filled with f32::INFINITY
let mut vd = std::vec![std::f32::INFINITY; n_padded * n];
let mut vt = std::vec![std::f32::INFINITY; n_padded * n];
// Function: for one row of vd and vt,
// copy a row at 'i' of d into vd and column at 'i' of d into vt
let preprocess_row = |(i, (vd_row, vt_row)): (usize, (&mut [f32], &mut [f32]))| {
    for (j, (x, y)) in vd_row.iter_mut().zip(vt_row.iter_mut()).enumerate() {
        if i < n && j < n {
            *x = d[n*i + j];
            *y = d[n*j + i];
        }
    }
};
// Partition vd and vt into rows, apply preprocessing in parallel for each row pair
vd.par_chunks_mut(n_padded)
    .zip(vt.par_chunks_mut(n_padded))
    .enumerate()
    .for_each(preprocess_row);
Now vd contains the original d and vt contains the transpose of d, but both have been padded with extra columns to the right, filled with f32::INFINITY, to ensure the width of vd and vt is always divisible by 4.
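To make the padded layout concrete, here is a small self-contained example that builds vd and vt for a 2x2 input. It is sequential, without Rayon, and only meant to illustrate the layout, not to be part of the actual implementation:
fn main() {
    const BLOCK_SIZE: usize = 4;
    let n = 2;
    // A 2x2 input in row-major order
    let d = vec![0.0_f32, 1.0,
                 2.0, 3.0];
    let blocks_per_row = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    let n_padded = blocks_per_row * BLOCK_SIZE;
    let inf = std::f32::INFINITY;
    let mut vd = vec![inf; n_padded * n];
    let mut vt = vec![inf; n_padded * n];
    for i in 0..n {
        for j in 0..n {
            vd[n_padded*i + j] = d[n*i + j]; // row i of d
            vt[n_padded*i + j] = d[n*j + i]; // column i of d
        }
    }
    // Each padded row is the original row (or column), followed by infinities
    assert_eq!(vd, vec![0.0, 1.0, inf, inf,
                        2.0, 3.0, inf, inf]);
    assert_eq!(vt, vec![0.0, 2.0, inf, inf,
                        1.0, 3.0, inf, inf]);
}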
Then, we partition r and vd into row chunks, zip them into row chunk pairs, and apply step_row in parallel for each row of vd, writing the results into its paired result row chunk. Each thread will compute results over all rows of vt.
// Function: for some row in vd (vd_row) and all rows in vt (vt_rows),
// compute all results for a row in r (r_row), corresponding to the row index of vd_row.
let step_row = |(r_row, vd_row): (&mut [f32], &[f32])| {
    let vt_rows = vt.chunks_exact(n_padded);
    // Length of a zipped iterator is the length of the shorter iterator in the zip pair, so this never exceeds n
    for (res, vt_row) in r_row.iter_mut().zip(vt_rows) {
        // Partition both rows into chunks of size 4
        // (x0, x1, x2, x3), (x4, x5, x6, x7), ...
        let vd_blocks = vd_row.chunks_exact(BLOCK_SIZE);
        // (y0, y1, y2, y3), (y4, y5, y6, y7), ...
        let vt_blocks = vt_row.chunks_exact(BLOCK_SIZE);
        // Using an array here is a bit more convenient than 4 different variables, e.g. v0, v1, v2, v3
        let mut block = [std::f32::INFINITY; BLOCK_SIZE];
        // Accumulate all results as in v1, but 4 elements at a time
        for (vd_block, vt_block) in vd_blocks.zip(vt_blocks) {
            for (b, (&x, &y)) in block.iter_mut().zip(vd_block.iter().zip(vt_block)) {
                *b = min(*b, x + y);
            }
        }
        // Fold 4 intermediate values into a single minimum and assign to final result
        *res = block.iter().fold(std::f32::INFINITY, |acc, &x| min(acc, x));
    }
};
r.par_chunks_mut(n)
    .zip(vd.par_chunks(n_padded))
    .for_each(step_row);
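One thing the snippets above do not show: par_chunks, par_chunks_mut and the other parallel iterator methods are not inherent slice methods but come from Rayon's traits, so the surrounding module needs Rayon as a dependency and its prelude imported, as sketched below. Likewise, min is the f32 minimum helper from v1; the stand-in defined in the earlier sketch works if you want to run these snippets on their own.
// Brings ParallelSlice, ParallelSliceMut and the parallel iterator traits
// into scope, which provide par_chunks, par_chunks_mut, zip and enumerate
use rayon::prelude::*;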
Benchmark
We'll now compare the Rust implementation to the reference C++ version, which will be compiled with both Clang and GCC. If we run the benchmark program for a single iteration with the same parameters as previously, we get:
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v2 | gcc 7.4.0-1ubuntu1 | 20.8 | 2.88 |
C++ v2 | clang 6.0.0-1ubuntu2 | 44.6 | 3.23 |
Rust v2 | rustc 1.38.0-nightly | 17.0 | 2.43 |
Two interesting questions arise:

- Why is rustc outperforming gcc?
- What on earth is clang doing?
Let's compare the disassembly of all 3 versions.
rustc
I omitted a portion of code above LOOP, up until the label 1f0, since perf-record placed most CPU cycles between LOOP and the jb instruction that jumps to LOOP.
It looks like the compiler outsmarted us by ignoring our attempt at writing code that utilizes ILP and instead auto-vectorized our loop, which now does all the work with two 128-bit SIMD registers:
LOOP:
mov rbp,r14
add rbp,rbx
je 1f0 ; about 20 lines above LOOP
inc rcx
vmovups xmm3,XMMWORD PTR [r14+rbx*1]
vaddps xmm3,xmm3,XMMWORD PTR [r10+rbx*1]
vpermilps xmm3,xmm3,0x1b
vminps xmm2,xmm2,xmm3
add rbx,0x10
cmp rcx,rax
jb LOOP
We'll be rewriting most of our code with 256-bit vector types and instructions in v3, but let's take a look at what the compiler managed to generate here.
We load 4 consecutive f32 values from vd_row into a 128-bit vector register xmm3:
vmovups xmm3,XMMWORD PTR [r14+rbx*1]
Then we load 4 consecutive f32 values from vt_row, add those to the 4 values in xmm3 using a single SIMD add-instruction, and store the result in xmm3:
vaddps xmm3,xmm3,XMMWORD PTR [r10+rbx*1]
Using vpermilps with shuffle control 0x1b = 0b00_01_10_11 will reverse the order of the 4 elements in xmm3, but I don't know why the compiler wants to use it here, especially inside the loop. However, we are going to use this kind of SIMD register permutation ourselves later in v5 to significantly lower the total amount of memory accesses.
vpermilps xmm3,xmm3,0x1b
We use a single SIMD min-instruction for the 4 f32 result values in xmm2 and the 4 sums in xmm3 we got from the previous step, and store the result in xmm2:
vminps xmm2,xmm2,xmm3
We increment the data offset in rbx by 16 bytes, which moves it past the 4 f32s we just processed, and start over:
add rbx,0x10
cmp rcx,rax
jb LOOP
clang
I did not try to figure out what happens here, but it looks like a failed auto-vectorization attempt:
LOOP:
; other half with similar lines omitted
lea edx,[rax+r14*1+0x2]
movsxd rdx,edx
lea esi,[r15+r14*1+0x2]
movsxd rsi,esi
lea edi,[rax+r14*1+0x3]
movsxd rdi,edi
lea ebx,[r15+r14*1+0x3]
movsxd rbx,ebx
vmovss xmm0,DWORD PTR [r8+rdi*4]
vinsertps xmm0,xmm0,DWORD PTR [r8+rdx*4],0x10
vmovss xmm3,DWORD PTR [rbp+rbx*4+0x0]
vinsertps xmm3,xmm3,DWORD PTR [rbp+rsi*4+0x0],0x10
vaddps xmm0,xmm0,xmm3
vpmovzxdq xmm3,xmm0
vcmpltps xmm0,xmm0,xmm4
vunpcklps xmm0,xmm2,xmm0
vblendvpd xmm6,xmm6,xmm3,xmm0
vpermilps xmm7,xmm5,0xe8
vpermilps xmm4,xmm6,0xe8
add r14d,0x4
add rcx,0xffffffffffffffff
jne LOOP
gcc
GCC did not auto-vectorize anything but produced a good example of ILP:
LOOP:
lea rcx,[r10+rcx*4]
lea r8,[r8+r9*1+0x10]
nop WORD PTR cs:[rax+rax*1+0x0]
vmovss xmm0,DWORD PTR [rcx]
vaddss xmm0,xmm0,DWORD PTR [rax]
add rax,0x10
add rcx,0x10
vminss xmm1,xmm0,xmm1
vmovss xmm0,DWORD PTR [rcx-0xc]
vaddss xmm0,xmm0,DWORD PTR [rax-0xc]
vminss xmm4,xmm0,xmm4
vmovss xmm0,DWORD PTR [rcx-0x8]
vaddss xmm0,xmm0,DWORD PTR [rax-0x8]
vminss xmm3,xmm0,xmm3
vmovss xmm0,DWORD PTR [rcx-0x4]
vaddss xmm0,xmm0,DWORD PTR [rax-0x4]
vminss xmm2,xmm0,xmm2
cmp r8,rax
jne LOOP
This is what we were trying to achieve: 4 independent registers for updating the minimums. You can read more about it here.
We are not going to twist our Rust code just to get a good ILP example out of it, since the auto-vectorization already produced code that is more efficient than the gcc ILP example above. However, this was just an example, and we'll be needing ILP extensively later in v4.
First, let's rewrite our code using SIMD instructions.