# Linear reading

To enable a linear memory access pattern, the reference solution introduces a Θ(n²) preprocessing step that allocates additional space for storing the transpose of `d` in row-major order. This allows us to read the columns of `d` linearly, using fully packed cache lines on each read.

The easiest way of allocating memory on the heap for contiguous elements is probably to create a vector, which is a struct containing a pointer, capacity, and length. We use the `std::vec!` macro to create a mutable vector of length `n * n`, with all elements initialized to the value `0.0`, and then fill it with the transpose of `d`. Note that there is no need to annotate the type of the vector, since `f32` is inferred from the context:

```
// Transpose of d
let mut t = std::vec![0.0; n * n];
// Function: for some column j in d,
// copy all elements of that column into row j in t (t_row)
let transpose_column = |(j, t_row): (usize, &mut [f32])| {
    for (i, x) in t_row.iter_mut().enumerate() {
        *x = d[n*i + j];
    }
};
// Copy all columns of d into rows of t in parallel
t.par_chunks_mut(n)
    .enumerate()
    .for_each(transpose_column);
```

Now all columns of `d` have been stored as rows in `t`, and all we have to do is iterate over all row pairs of `d` and `t`. As previously, we partition `r` into `n` non-overlapping, mutable rows such that each thread is working on one row at a time:

```
// Function: for some row i in d and all rows of t,
// compute n results into row i in r (r_row)
let step_row = |(i, r_row): (usize, &mut [f32])| {
    for (j, res) in r_row.iter_mut().enumerate() {
        let mut v = std::f32::INFINITY;
        for k in 0..n {
            let x = d[n*i + k];
            let y = t[n*j + k];
            let z = x + y;
            v = v.min(z);
        }
        *res = v;
    }
};
// Partition r into rows containing n elements,
// and apply step_row on all rows in parallel
r.par_chunks_mut(n)
    .enumerate()
    .for_each(step_row);
```

## Benchmark

We'll use the same settings as in `v0`.

Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ `v1` | `gcc 7.4.0-1ubuntu1` | 60.5 | 1.54 |
C++ `v1` | `clang 6.0.0-1ubuntu2` | 60.5 | 1.00 |
Rust `v1` | `rustc 1.38.0-nightly` | 114.6 | 2.11 |

The linear memory access pattern helps a lot here, compared to the previous version. However, the Rust program is struggling to keep up: it executes twice as many instructions per cycle as the C++ program, yet is almost two times slower. In the previous chapter, we noted that the array bounds checks and `NaN` checks did not affect the running time, because the bad memory access pattern dominated. Now that the memory access pattern is fixed, the extra instructions are starting to slow us down.

Let's look at the most recent output from `rustc` to see these extra instructions. This time, we skip `gcc` and `clang`, because they produced almost the same output as in `v0`.

`rustc`

Not much has changed from `v0`, except that there are even more registers involved in doing the bounds checking.

```
LOOP:
cmp rax,rdx
jae 13e
mov rcx,QWORD PTR [rbx+0x10]
cmp rcx,rsi
jbe 150
mov rcx,QWORD PTR [rbx]
mov r10,QWORD PTR [r15]
vmovss xmm2,DWORD PTR [r10+rax*4]
vaddss xmm2,xmm2,DWORD PTR [rcx+rsi*4]
vminss xmm3,xmm2,xmm1
vcmpunordss xmm1,xmm1,xmm1
vblendvps xmm1,xmm3,xmm2,xmm1
inc rsi
inc rax
dec rdi
jne LOOP
```

Running the Rust program benchmark with `perf-record` suggests that a significant amount of the running time is spent doing `NaN` checks with `vcmpunordss` and `vblendvps`.

### Dealing with the `NaN` check

Let's remove the `NaN` checks by replacing `f32::min` in the inner loop with a simple `if-else` expression:

```
for k in 0..n {
    let x = d[n*i + k];
    let y = t[n*j + k];
    let z = x + y;
    v = if v < z { v } else { z };
}
```

Compiling and checking the output, we see that the `NaN` checks are gone from our loop:

```
LOOP:
cmp rax,rdx
jae 133
mov rcx,QWORD PTR [rbx+0x10]
cmp rcx,rsi
jbe 145
mov rcx,QWORD PTR [rbx]
mov r10,QWORD PTR [r15]
vmovss xmm2,DWORD PTR [r10+rax*4]
vaddss xmm2,xmm2,DWORD PTR [rcx+rsi*4]
vminss xmm1,xmm1,xmm2
inc rsi
inc rax
dec rdi
jne LOOP
```

Benchmarking the Rust program shows that the running time also improved quite a lot:

Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ `v1` | `gcc 7.4.0-1ubuntu1` | 60.5 | 1.54 |
C++ `v1` | `clang 6.0.0-1ubuntu2` | 60.5 | 1.00 |
Rust `v1` | `rustc 1.38.0-nightly` | 60.8 | 3.43 |

What about the array bounds checks? Our mid-range CPU seems to be handling them without any problems, even in the most performance-critical loop. However, the bounds checks are certainly not free, as we can see from the IPC figures: the C++ implementation of `v1` is proof that it is possible to solve the problem with significantly fewer instructions. On the other hand, we don't want to remove the bounds checks completely, since we'd prefer to use as little `unsafe` Rust as possible.

### Dealing with the bounds checks

Our solution is similar to the preprocessing step of computing the transpose of `d`: we will do a bit of extra work outside the loop to remove a lot of work from inside the loop. If we extract one row of `d` and one row of `t` as subslices before the inner loop starts, the compiler gets a chance to assert, once per subslice, that the starting and ending indexes of the subslice are within the bounds of the slice it was taken from:

```
let step_row = |(i, r_row): (usize, &mut [f32])| {
    // Get a view of row i of d as a subslice
    let d_row = &d[n*i..n*(i+1)];
    for (j, res) in r_row.iter_mut().enumerate() {
        // Same for row j of t
        let t_row = &t[n*j..n*(j+1)];
        let mut v = std::f32::INFINITY;
        for k in 0..n {
            let x = d_row[k];
            let y = t_row[k];
            let z = x + y;
            v = if v < z { v } else { z };
        }
        *res = v;
    }
};
```

After compiling the program, we can see that the compiler still wants to check that `k` is in bounds. Since `rsi` is incremented by 1 after each iteration, and it is used to load two `f32`s, it is very likely equal to our `k`.

```
LOOP:
cmp r10,rsi
je 194
vmovss xmm2,DWORD PTR [rdx+rsi*4]
vaddss xmm2,xmm2,DWORD PTR [rax+rsi*4]
inc rsi
vminss xmm1,xmm1,xmm2
cmp rcx,rsi
jne LOOP
```

Benchmarks show that the IPC decreased significantly:

Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ `v1` | `gcc 7.4.0-1ubuntu1` | 60.5 | 1.54 |
C++ `v1` | `clang 6.0.0-1ubuntu2` | 60.5 | 1.00 |
Rust `v1` | `rustc 1.38.0-nightly` | 60.6 | 2.02 |

Let's get all bounds checking out of the loop. We are currently using `k` only for accessing every element of `d_row` and `t_row` in `0..n`, so we might as well use iterators over both subslices. If we zip them together, there's no need for `k` anymore:

```
for (&x, &y) in d_row.iter().zip(t_row.iter()) {
    let z = x + y;
    v = if v < z { v } else { z };
}
```

After compiling the program, we can see that not only did the compiler remove the bounds checks, but it also unrolled 8 iterations of the loop:

```
LOOP:
vmovss xmm2,DWORD PTR [r9+r15*4-0x1c]
vmovss xmm3,DWORD PTR [r9+r15*4-0x18]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x1c]
vminss xmm1,xmm1,xmm2
vaddss xmm2,xmm3,DWORD PTR [r13+r15*4-0x18]
vmovss xmm3,DWORD PTR [r9+r15*4-0x14]
vaddss xmm3,xmm3,DWORD PTR [r13+r15*4-0x14]
vminss xmm1,xmm1,xmm2
vminss xmm1,xmm1,xmm3
vmovss xmm2,DWORD PTR [r9+r15*4-0x10]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x10]
vminss xmm1,xmm1,xmm2
vmovss xmm2,DWORD PTR [r9+r15*4-0xc]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0xc]
vmovss xmm3,DWORD PTR [r9+r15*4-0x8]
vaddss xmm3,xmm3,DWORD PTR [r13+r15*4-0x8]
vminss xmm1,xmm1,xmm2
vminss xmm1,xmm1,xmm3
vmovss xmm2,DWORD PTR [r9+r15*4-0x4]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x4]
vminss xmm1,xmm1,xmm2
vmovss xmm2,DWORD PTR [r9+r15*4]
vaddss xmm2,xmm2,DWORD PTR [r13+r15*4+0x0]
add r15,0x8
vminss xmm1,xmm1,xmm2
cmp rax,r15
jne LOOP
```

Recall how `clang` unrolled the loop in `v0` in exactly the same way. Since our program is still memory-bottlenecked, the unrolling does not affect the running time. However, it does reduce the IPC:

Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ `v1` | `gcc 7.4.0-1ubuntu1` | 60.5 | 1.54 |
C++ `v1` | `clang 6.0.0-1ubuntu2` | 60.5 | 1.00 |
Rust `v1` | `rustc 1.38.0-nightly` | 60.6 | 0.92 |

The reason for this is that a larger share of the executed instructions now do useful work (loading memory with `vmovss`, adding with `vaddss`, and computing minimums with `vminss`), while comparatively few are loop-related instructions such as comparisons and jumps. Compare this to the single-element loop `gcc` produced for `v0`.

### `iter` all the things

If we succeeded in eliminating `k` from the innermost loop by using iterators, can we remove all loop variables with iterators? We are using `chunks_mut` to divide `r` into rows of length `n`, so why not do something similar with `d` and `t`, but with immutable chunks instead?

Our function computes `n` results for row `i` in `d` into row `i` in `r`. We can make `i` redundant by chunking `d` into rows at the same time as `r`, zipping the two row iterators into pairs, and applying `step_row` in parallel on all `(r_row, d_row)` pairs. Inside `step_row`, we loop over all columns `j` of `d`, i.e. all rows `j` of `t`. If we chunk up `t` into `n` rows of length `n` inside `step_row`, we can zip that iterator with row `i` of `r`, and index `j` becomes redundant as well.

Finally, we wrap our `if-else` minimum into a function and put it into our toolbox:

```
#[inline(always)]
pub fn min(x: f32, y: f32) -> f32 {
    if x < y { x } else { y }
}
```

Here's the final `v1` version of `step_row`:

```
// Function: for some row i in d (d_row) and all rows of t (t_rows),
// compute n results into a row in r (r_row)
let step_row = |(r_row, d_row): (&mut [f32], &[f32])| {
    let t_rows = t.chunks_exact(n);
    for (res, t_row) in r_row.iter_mut().zip(t_rows) {
        *res = d_row.iter()
            .zip(t_row)
            .fold(std::f32::INFINITY, |v, (&x, &y)| min(v, x + y));
    }
};
// Partition r and d into slices, each containing a single row of r and d,
// and apply the function on the row pairs
r.par_chunks_mut(n)
    .zip(d.par_chunks(n))
    .for_each(step_row);
```

The compiler output and benchmark results remain unchanged.

It's nice to see functional code that performs as well as a C++ program. However, as we start pushing the CPU towards its limits, we will eventually have to trade away some of this functional prettiness for raw performance, e.g. by unrolling loops and using a hard-coded number of variables.