# Instruction level parallelism (ILP)

Full source

Our program does not take advantage of the fact that modern CPUs are superscalar processors, capable of executing several independent instructions simultaneously. The problem in our `v1` implementation is that each step depends on the result of the previous one, caused by this part:

``````
let z = x + y;
v = min(v, z);
``````
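For context, the `v1` accumulation is essentially the following (a simplified, self-contained sketch, not the exact `v1` source): every iteration both reads and writes `v`, so iteration `k + 1` cannot start its `min` before iteration `k` has finished.

``````
// Illustrative reconstruction of the v1 inner loop: compute the minimum of
// element-wise sums over one row of d and one row of t.
fn v1_result_for(d_row: &[f32], t_row: &[f32]) -> f32 {
    let mut v = std::f32::INFINITY;
    for (&x, &y) in d_row.iter().zip(t_row) {
        let z = x + y;
        // The value of v from the previous iteration is needed here,
        // so consecutive iterations cannot be executed independently.
        v = v.min(z);
    }
    v
}
``````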

We will solve this by using a simple idea from the reference solution: accumulate results into 4 independent intermediate values and merge them only after processing the whole row.

Suppose we have some row of `d`, containing the elements `x0, x1, x2, x3, ..., xn`, and some column of `d` (i.e. row of `t`), containing the elements `y0, y1, y2, y3, ..., yn`. Then, we compute the result for this pair of rows by accumulating intermediate results into 4 variables `v0, v1, v2, v3` as follows:

``````
// iteration 1
v0 = min(v0, x0 + y0);
v1 = min(v1, x1 + y1);
v2 = min(v2, x2 + y2);
v3 = min(v3, x3 + y3);
// iteration 2
v0 = min(v0, x4 + y4);
v1 = min(v1, x5 + y5);
v2 = min(v2, x6 + y6);
v3 = min(v3, x7 + y7);
// iteration 3
v0 = min(v0, x8 + y8);
v1 = min(v1, x9 + y9);
v2 = min(v2, x10 + y10);
v3 = min(v3, x11 + y11);
// etc ...
``````

This should allow the CPU to keep each intermediate result in its own register and update the 4 accumulators independently of each other.
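As a minimal, self-contained sketch of the same idea (the function name and the plain-slice setting are illustrative, not the actual `step_row` implementation shown later), the accumulation and the final merge could look like this:

``````
// Illustrative only: accumulate minima of x[k] + y[k] into 4 independent
// accumulators, then merge them once at the end.
fn min4_sum(x: &[f32], y: &[f32]) -> f32 {
    // Assumes x.len() == y.len() and that the length is a multiple of 4,
    // e.g. after padding with f32::INFINITY.
    let mut v = [std::f32::INFINITY; 4];
    for (xs, ys) in x.chunks_exact(4).zip(y.chunks_exact(4)) {
        for i in 0..4 {
            // Each v[i] depends only on its own previous value,
            // so the 4 updates are independent of each other.
            v[i] = v[i].min(xs[i] + ys[i]);
        }
    }
    // Merge the 4 intermediate results into the final minimum.
    v.iter().fold(std::f32::INFINITY, |acc, &w| acc.min(w))
}
``````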

Before we can update the `step_row` function, we need to make sure the number of elements on each row is always a multiple of 4, which keeps the performance-critical loop free of messy, unnecessary branching. As before, we transpose `d` to allow linear reading of its columns, but now we also have to make sure the row length of the transpose is divisible by 4. The preprocessing looks a bit more complicated, but is essentially the same as the transpose in `v1`, except that we also copy the values of `d` into `vd`, and both `vd` and `vt` are padded with `std::f32::INFINITY` values to make their row lengths divisible by 4:

``````
const BLOCK_SIZE: usize = 4;
let blocks_per_row = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
let n_padded = blocks_per_row * BLOCK_SIZE;
// d and transpose of d with extra room at the end of each row,
// both initially filled with f32::INFINITY
let mut vd = vec![std::f32::INFINITY; n_padded * n];
let mut vt = vec![std::f32::INFINITY; n_padded * n];
// Function: for one row of vd and vt,
// copy a row at 'i' of d into vd and column at 'i' of d into vt
let preprocess_row = |(i, (vd_row, vt_row)): (usize, (&mut [f32], &mut [f32]))| {
    for (j, (x, y)) in vd_row.iter_mut().zip(vt_row.iter_mut()).enumerate() {
        if i < n && j < n {
            *x = d[n*i + j];
            *y = d[n*j + i];
        }
    }
};
// Partition vd and vt into rows, apply preprocessing in parallel for each row pair
vd.par_chunks_mut(n_padded)
    .zip(vt.par_chunks_mut(n_padded))
    .enumerate()
    .for_each(preprocess_row);
``````
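To make the padding arithmetic concrete, consider a hypothetical `n = 6`: then `blocks_per_row = (6 + 3) / 4 = 2` and `n_padded = 8`, so each row of `vd` and `vt` carries two trailing `f32::INFINITY` values, which can never become the minimum and therefore do not affect the results.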

Now `vd` contains the original `d` and `vt` contains the transpose of `d`, but both have been padded with extra columns to the right containing `f32::INFINITY`s to ensure the width of `vd` and `vt` is always divisible by 4. Then, we partition `r` and `vd` into row chunks, zip them into row chunk pairs and apply `step_row` in parallel for each row of `vd`, writing the results into its paired result row chunk. Each thread will compute results over all rows of `vt`.

``````
// Function: for some row in vd (vd_row) and all rows in vt (vt_rows),
// compute all results for a row in r (r_row), corresponding to the row index of vd_row.
let step_row = |(r_row, vd_row): (&mut [f32], &[f32])| {
    // All rows of the transpose, i.e. columns of d
    let vt_rows = vt.chunks_exact(n_padded);
    // Length of a zipped iterator is the length of the shorter iterator in the zip pair so this never exceeds n
    for (res, vt_row) in r_row.iter_mut().zip(vt_rows) {
        // Partition both rows into chunks of size 4
        // (x0, x1, x2, x3), (x4, x5, x6, x7), ...
        let vd_blocks = vd_row.chunks_exact(BLOCK_SIZE);
        // (y0, y1, y2, y3), (y4, y5, y6, y7), ...
        let vt_blocks = vt_row.chunks_exact(BLOCK_SIZE);
        // Using an array here is a bit more convenient than 4 different variables, e.g. v0, v1, v2, v3
        let mut block = [std::f32::INFINITY; BLOCK_SIZE];
        // Accumulate all results as in v1, but 4 elements at a time
        for (vd_block, vt_block) in vd_blocks.zip(vt_blocks) {
            for (b, (&x, &y)) in block.iter_mut().zip(vd_block.iter().zip(vt_block)) {
                *b = min(*b, x + y);
            }
        }
        // Fold 4 intermediate values into a single minimum and assign to final result
        *res = block.iter().fold(std::f32::INFINITY, |acc, &x| min(acc, x));
    }
};
// Partition r and vd into rows, apply step_row in parallel for each row pair
r.par_chunks_mut(n)
    .zip(vd.par_chunks(n_padded))
    .for_each(step_row);
``````
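Note that `min` here is not `std::cmp::min`, which requires `Ord` and therefore does not work on `f32`. The snippets in this section assume a small helper along these lines (the exact definition is not shown here):

``````
// Assumed helper: minimum of two f32 values.
// f32 does not implement Ord, so std::cmp::min cannot be used directly.
#[inline]
fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}
``````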

## Benchmark

We'll now compare the Rust implementation to the reference C++ version, which will be compiled with both Clang and GCC. If we run the benchmark program for a single iteration with the same parameters as previously, we get:

| Implementation | Compiler | Time (s) | IPC |
| :--- | :--- | ---: | ---: |
| C++ `v2` | `gcc 7.4.0-1ubuntu1` | 20.8 | 2.88 |
| C++ `v2` | `clang 6.0.0-1ubuntu2` | 44.6 | 3.23 |
| Rust `v2` | `rustc 1.38.0-nightly` | 17.0 | 2.43 |

Two interesting questions arise:

- Why is `rustc` outperforming `gcc`?
- What on earth is `clang` doing?

Let's compare the disassembly of all 3 versions.

### `rustc`

I omitted a portion of the code above `LOOP`, up until label `1f0`, since `perf-record` placed most CPU cycles between `LOOP` and the `jb` instruction that jumps back to `LOOP`.

It looks like the compiler outsmarted us by ignoring our attempt at writing code that utilizes ILP, and instead auto-vectorized our loop, which now does all the work with two 128-bit SIMD registers:

``````
LOOP:
mov       rbp,r14
je        1f0 ; about 20 lines above LOOP
inc       rcx
vmovups   xmm3,XMMWORD PTR [r14+rbx*1]
vaddps    xmm3,xmm3,XMMWORD PTR [r10+rbx*1]
vpermilps xmm3,xmm3,0x1b
vminps    xmm2,xmm2,xmm3
add       rbx,0x10
cmp       rcx,rax
jb        LOOP
``````

We'll be rewriting most of our code with 256-bit vector types and instructions in `v3`, but let's take a look at what the compiler managed to generate here.

We load 4 consecutive `f32` values from `vd_row` into a 128-bit vector register `xmm3`:

``````
vmovups   xmm3,XMMWORD PTR [r14+rbx*1]
``````

Then we load 4 consecutive `f32` values from `vt_row`, add those to the 4 values in `xmm3` using a single SIMD add-instruction, and store the result in `xmm3`:

``````
vaddps    xmm3,xmm3,XMMWORD PTR [r10+rbx*1]
``````

Using `vpermilps` with shuffle control `0x1b = 0b00_01_10_11` will reverse the order of the 4 elements in `xmm3`. I don't know why the compiler wants to do this here, especially inside the loop. However, we are going to use this kind of SIMD register permutation ourselves later in `v5` to significantly lower the total number of memory accesses.

``````
vpermilps xmm3,xmm3,0x1b
``````
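To make the shuffle control concrete, here is a small scalar model of `vpermilps` with an immediate operand (an illustration written for this text, not compiler output): each 2-bit field of the control byte selects which source lane feeds the corresponding destination lane, so `0x1b` picks lanes 3, 2, 1, 0.

``````
// Scalar model of `vpermilps dst, src, imm8` on a 4-lane f32 vector.
fn permilps(src: [f32; 4], control: u8) -> [f32; 4] {
    let mut dst = [0.0f32; 4];
    for i in 0..4 {
        // Destination lane i is selected by bits [2*i+1 : 2*i] of the control byte.
        let lane = ((control >> (2 * i)) & 0b11) as usize;
        dst[i] = src[lane];
    }
    dst
}

// With control 0x1b = 0b00_01_10_11 the element order is reversed:
// permilps([x0, x1, x2, x3], 0x1b) == [x3, x2, x1, x0]
``````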

We use a single SIMD min-instruction on the 4 `f32` result values in `xmm2` and the 4 sums in `xmm3` from the previous step, and store the result in `xmm2`:

``````
vminps    xmm2,xmm2,xmm3
``````

We increment the offset register by 16 bytes, which moves past the 4 `f32` values we just processed (4 × 4 bytes), and start over:

``````
add       rbx,0x10
cmp       rcx,rax
jb        LOOP
``````

### `clang`

I did not try to figure out what happens here, but it looks like a failed auto-vectorization attempt:

``````
LOOP:
; other half with similar lines omitted
lea       edx,[rax+r14*1+0x2]
movsxd    rdx,edx
lea       esi,[r15+r14*1+0x2]
movsxd    rsi,esi
lea       edi,[rax+r14*1+0x3]
movsxd    rdi,edi
lea       ebx,[r15+r14*1+0x3]
movsxd    rbx,ebx
vmovss    xmm0,DWORD PTR [r8+rdi*4]
vinsertps xmm0,xmm0,DWORD PTR [r8+rdx*4],0x10
vmovss    xmm3,DWORD PTR [rbp+rbx*4+0x0]
vinsertps xmm3,xmm3,DWORD PTR [rbp+rsi*4+0x0],0x10
vpmovzxdq xmm3,xmm0
vcmpltps  xmm0,xmm0,xmm4
vunpcklps xmm0,xmm2,xmm0
vblendvpd xmm6,xmm6,xmm3,xmm0
vpermilps xmm7,xmm5,0xe8
vpermilps xmm4,xmm6,0xe8
jne       LOOP

``````

### `gcc`

GCC did not auto-vectorize anything but produced a good example of ILP:

``````
LOOP:
lea    rcx,[r10+rcx*4]
lea    r8,[r8+r9*1+0x10]
nop    WORD PTR cs:[rax+rax*1+0x0]
vmovss xmm0,DWORD PTR [rcx]
vminss xmm1,xmm0,xmm1
vmovss xmm0,DWORD PTR [rcx-0xc]
vminss xmm4,xmm0,xmm4
vmovss xmm0,DWORD PTR [rcx-0x8]
; rest of the unrolled loop omitted
``````

We are not going to twist our Rust code so we can get a good ILP example out of it; the auto-vectorization already produced code that was more efficient than the `gcc` ILP example above. However, this was just an example, and we'll be needing ILP extensively later in `v4`. First, let's rewrite our code using SIMD instructions.