# Instruction level parallelism (ILP)

Our program does not take advantage of the fact that modern CPUs are superscalar processors, capable of executing several independent instructions simultaneously.
The problem in our `v1` implementation is that each step depends on the previous step, caused by this part:

```
let z = x + y;
v = min(v, z);
```

We will solve this by using a simple idea from the reference solution: accumulate the results into 4 independent intermediate values and merge them only after processing the whole row.

Suppose we have some row of `d`, containing the elements `x0, x1, x2, x3, ..., xn`, and some column of `d` (i.e. row of `t`), containing the elements `y0, y1, y2, y3, ..., yn`.
Then, we compute the result for this row and column pair by accumulating intermediate results into 4 variables `v0, v1, v2, v3` as follows:

```
// iteration 1
v0 = min(v0, x0 + y0);
v1 = min(v1, x1 + y1);
v2 = min(v2, x2 + y2);
v3 = min(v3, x3 + y3);
// iteration 2
v0 = min(v0, x4 + y4);
v1 = min(v1, x5 + y5);
v2 = min(v2, x6 + y6);
v3 = min(v3, x7 + y7);
// iteration 3
v0 = min(v0, x8 + y8);
v1 = min(v1, x9 + y9);
v2 = min(v2, x10 + y10);
v3 = min(v3, x11 + y11);
// etc ...
```

This should allow the CPU to keep each of the 4 intermediate results in its own register and update them independently of each other.
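
To make the idea concrete before touching `step_row`, here is a minimal plain-Rust sketch of the same accumulation pattern; `min_sum` is a hypothetical helper for illustration only, and it uses the standard `f32::min` method instead of the `min` helper from the earlier parts:

```
// Hypothetical illustration: compute the minimum over all x[i] + y[i]
// using 4 independent accumulators to break the dependency chain.
fn min_sum(xs: &[f32], ys: &[f32]) -> f32 {
    let mut v = [std::f32::INFINITY; 4];
    for (i, (&x, &y)) in xs.iter().zip(ys.iter()).enumerate() {
        // element i only updates accumulator i % 4, so consecutive
        // iterations write to different registers
        v[i % 4] = v[i % 4].min(x + y);
    }
    // merge the 4 intermediate results only at the end
    v.iter().fold(std::f32::INFINITY, |acc, &x| acc.min(x))
}
```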

Before we can update the `step_row` function, we need to make sure the number of elements on each row is always a multiple of 4 to keep the performance-critical loop free of messy, unnecessary branching.
As previously, we transpose `d` to allow linear reading of its columns, but we have to make sure the row length of the transpose is also divisible by 4.
The preprocessing looks a bit more complicated, but is essentially the same as doing the transpose in `v1`, except that we also copy the values of `d` into `vd`, which is padded with `std::f32::INFINITY` values to make its rows divisible by 4:

```
const BLOCK_SIZE: usize = 4;
let blocks_per_row = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
let n_padded = blocks_per_row * BLOCK_SIZE;
// d and transpose of d with extra room at the end of each row,
// both initially filled with f32::INFINITY
let mut vd = std::vec![std::f32::INFINITY; n_padded * n];
let mut vt = std::vec![std::f32::INFINITY; n_padded * n];
// Function: for one row of vd and vt,
// copy a row at 'i' of d into vd and column at 'i' of d into vt
let preprocess_row = |(i, (vd_row, vt_row)): (usize, (&mut [f32], &mut [f32]))| {
    for (j, (x, y)) in vd_row.iter_mut().zip(vt_row.iter_mut()).enumerate() {
        if i < n && j < n {
            *x = d[n * i + j];
            *y = d[n * j + i];
        }
    }
};
// Partition vd and vt into rows, apply preprocessing in parallel for each row pair
vd.par_chunks_mut(n_padded)
    .zip(vt.par_chunks_mut(n_padded))
    .enumerate()
    .for_each(preprocess_row);
```
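
As a quick sanity check of the padding arithmetic, consider a small made-up example (the value of `n` here is illustrative, not from the benchmark):

```
// Illustration only: with n = 10, each row is padded
// from 10 elements up to the next multiple of BLOCK_SIZE
let n = 10;
let blocks_per_row = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
let n_padded = blocks_per_row * BLOCK_SIZE;
assert_eq!(blocks_per_row, 3);
assert_eq!(n_padded, 12); // 2 columns of f32::INFINITY padding per row
```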

Now `vd` contains the original `d` and `vt` contains the transpose of `d`, but both have been padded with extra columns to the right containing `f32::INFINITY` values to ensure the width of `vd` and `vt` is always divisible by 4.
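
It is worth convincing ourselves that the padding cannot change any result: sums involving padded elements are always infinite, and an infinite value never survives a `min` against a finite candidate. A tiny sketch of this reasoning:

```
// Padding elements produce infinite sums, which are neutral under min
let pad = std::f32::INFINITY;
assert_eq!(pad + pad, std::f32::INFINITY);
assert_eq!(pad.min(3.0), 3.0);
assert_eq!((1.0f32 + pad).min(3.0), 3.0);
```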
Then, we partition `r` and `vd` into row chunks, zip them into row chunk pairs and apply `step_row` in parallel for each row of `vd`, writing the results into its paired result row chunk.
Each thread will compute results over all rows of `vt`.

```
// Function: for some row in vd (vd_row) and all rows in vt (vt_rows),
// compute all results for a row in r (r_row), corresponding to the row index of vd_row.
let step_row = |(r_row, vd_row): (&mut [f32], &[f32])| {
    let vt_rows = vt.chunks_exact(n_padded);
    // The length of a zipped iterator is the length of the shorter iterator
    // in the zip pair, so this never exceeds n
    for (res, vt_row) in r_row.iter_mut().zip(vt_rows) {
        // Partition both rows into chunks of size 4
        // (x0, x1, x2, x3), (x4, x5, x6, x7), ...
        let vd_blocks = vd_row.chunks_exact(BLOCK_SIZE);
        // (y0, y1, y2, y3), (y4, y5, y6, y7), ...
        let vt_blocks = vt_row.chunks_exact(BLOCK_SIZE);
        // Using an array here is a bit more convenient than 4 separate variables, e.g. v0, v1, v2, v3
        let mut block = [std::f32::INFINITY; BLOCK_SIZE];
        // Accumulate all results as in v1, but 4 elements at a time
        for (vd_block, vt_block) in vd_blocks.zip(vt_blocks) {
            for (b, (&x, &y)) in block.iter_mut().zip(vd_block.iter().zip(vt_block)) {
                *b = min(*b, x + y);
            }
        }
        // Fold the 4 intermediate values into a single minimum and assign to the final result
        *res = block.iter().fold(std::f32::INFINITY, |acc, &x| min(acc, x));
    }
};
r.par_chunks_mut(n)
    .zip(vd.par_chunks(n_padded))
    .for_each(step_row);
```
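
One detail that keeps the hot loop branch-free: because every row has exactly `n_padded` elements and `n_padded` is a multiple of `BLOCK_SIZE`, `chunks_exact` yields only full blocks and never a remainder. A small illustration with made-up values:

```
// chunks_exact never yields a partial chunk; any remainder is dropped.
// Since our rows are padded to a multiple of BLOCK_SIZE, nothing is dropped.
let row = std::vec![0.0f32; 12]; // e.g. n_padded = 12, BLOCK_SIZE = 4
let blocks = row.chunks_exact(4);
assert_eq!(blocks.len(), 3);
assert!(blocks.remainder().is_empty());
```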

## Benchmark

We'll now compare the Rust implementation to the reference C++ version, which will be compiled with both Clang and GCC. If we run the benchmark program for a single iteration with the same parameters as previously, we get:

Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ `v2` | `gcc 7.4.0-1ubuntu1` | 20.8 | 2.88 |
C++ `v2` | `clang 6.0.0-1ubuntu2` | 44.6 | 3.23 |
Rust `v2` | `rustc 1.38.0-nightly` | 17.0 | 2.43 |

Two interesting questions arise:

- Why is `rustc` outperforming `gcc`?
- What on earth is `clang` doing?

Let's compare the disassembly of all 3 versions.

### `rustc`

I omitted a portion of code above `LOOP`, up until the label `1f0`, since `perf-record` placed most CPU cycles between `LOOP` and the `jb` instruction that jumps back to `LOOP`.

It looks like the compiler outsmarted us by ignoring our attempt at writing code that utilizes ILP, and instead auto-vectorized our loop, which now does all the work with two 128-bit SIMD registers:

```
LOOP:
mov rbp,r14
add rbp,rbx
je 1f0 ; about 20 lines above LOOP
inc rcx
vmovups xmm3,XMMWORD PTR [r14+rbx*1]
vaddps xmm3,xmm3,XMMWORD PTR [r10+rbx*1]
vpermilps xmm3,xmm3,0x1b
vminps xmm2,xmm2,xmm3
add rbx,0x10
cmp rcx,rax
jb LOOP
```

We'll be rewriting most of our code with 256-bit vector types and instructions in `v3`, but let's take a look at what the compiler managed to generate here.

We load 4 consecutive `f32` values from `vd_row` into a 128-bit vector register `xmm3`:

```
vmovups xmm3,XMMWORD PTR [r14+rbx*1]
```

Then we load 4 consecutive `f32` values from `vt_row`, add those to the 4 values in `xmm3` using a single SIMD add instruction, and store the result in `xmm3`:

```
vaddps xmm3,xmm3,XMMWORD PTR [r10+rbx*1]
```

Using `vpermilps` with the shuffle control `0x1b = 0b00_01_10_11` will reverse the order of the 4 elements in `xmm3`, but I don't know why the compiler wants to use it here, especially inside the loop.
However, we are going to use this kind of SIMD register permutation ourselves later in `v5` to significantly lower the total amount of memory accesses.

```
vpermilps xmm3,xmm3,0x1b
```
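
To see why the control byte `0x1b` reverses the element order, here is a plain-Rust model of the permutation (my own illustration, not an actual intrinsic): result lane `i` is taken from source lane `(0x1b >> (2 * i)) & 0b11`.

```
// Plain Rust model of vpermilps with control byte 0x1b = 0b00_01_10_11;
// each 2-bit field selects the source lane for one result lane
fn permute_0x1b(src: [f32; 4]) -> [f32; 4] {
    let ctrl: u8 = 0x1b;
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        let j = ((ctrl >> (2 * i)) & 0b11) as usize;
        out[i] = src[j];
    }
    out
}
// permute_0x1b([x0, x1, x2, x3]) yields [x3, x2, x1, x0]
```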

We use a single SIMD min instruction for the 4 `f32` result values in `xmm2` and the 4 sums in `xmm3` we got from the previous step, and store the result in `xmm2`:

```
vminps xmm2,xmm2,xmm3
```

We increment the loop variable by 16 bytes, skipping over the 4 `f32` values (4 × 4 bytes) we just processed, and start over:

```
add rbx,0x10
cmp rcx,rax
jb LOOP
```

### `clang`

I did not try to figure out what happens here, but it looks like a failed auto-vectorization attempt:

```
LOOP:
; other half with similar lines omitted
lea edx,[rax+r14*1+0x2]
movsxd rdx,edx
lea esi,[r15+r14*1+0x2]
movsxd rsi,esi
lea edi,[rax+r14*1+0x3]
movsxd rdi,edi
lea ebx,[r15+r14*1+0x3]
movsxd rbx,ebx
vmovss xmm0,DWORD PTR [r8+rdi*4]
vinsertps xmm0,xmm0,DWORD PTR [r8+rdx*4],0x10
vmovss xmm3,DWORD PTR [rbp+rbx*4+0x0]
vinsertps xmm3,xmm3,DWORD PTR [rbp+rsi*4+0x0],0x10
vaddps xmm0,xmm0,xmm3
vpmovzxdq xmm3,xmm0
vcmpltps xmm0,xmm0,xmm4
vunpcklps xmm0,xmm2,xmm0
vblendvpd xmm6,xmm6,xmm3,xmm0
vpermilps xmm7,xmm5,0xe8
vpermilps xmm4,xmm6,0xe8
add r14d,0x4
add rcx,0xffffffffffffffff
jne LOOP
```

### `gcc`

GCC did not auto-vectorize anything but produced a good example of ILP:

```
LOOP:
lea rcx,[r10+rcx*4]
lea r8,[r8+r9*1+0x10]
nop WORD PTR cs:[rax+rax*1+0x0]
vmovss xmm0,DWORD PTR [rcx]
vaddss xmm0,xmm0,DWORD PTR [rax]
add rax,0x10
add rcx,0x10
vminss xmm1,xmm0,xmm1
vmovss xmm0,DWORD PTR [rcx-0xc]
vaddss xmm0,xmm0,DWORD PTR [rax-0xc]
vminss xmm4,xmm0,xmm4
vmovss xmm0,DWORD PTR [rcx-0x8]
vaddss xmm0,xmm0,DWORD PTR [rax-0x8]
vminss xmm3,xmm0,xmm3
vmovss xmm0,DWORD PTR [rcx-0x4]
vaddss xmm0,xmm0,DWORD PTR [rax-0x4]
vminss xmm2,xmm0,xmm2
cmp r8,rax
jne LOOP
```

This is what we were trying to achieve: 4 independent registers (`xmm1`, `xmm2`, `xmm3`, and `xmm4` in the listing above) for updating the minimums.

We are not going to twist our Rust code just to get a good ILP example out of it, since the auto-vectorization already produced code that is more efficient than the `gcc` ILP example above.
However, this was just an example, and we'll be needing ILP extensively later in `v4`.
First, let's rewrite our code using SIMD instructions.