Linear reading

Full source

To enable a linear memory access pattern, the reference solution introduces a Θ(n²) preprocessing step that allocates additional space for storing the transpose of d in row-major order. This allows us to read the columns of d linearly, using fully packed cache lines on each read: column j of d lives at the scattered indexes j, j + n, j + 2n, and so on, while row j of the transpose stores the same values contiguously.

The easiest way of allocating contiguous memory on the heap is probably to create a vector, which is a struct containing a pointer, a capacity, and a length. We use the std::vec macro to create a mutable vector of length n * n, with all elements initialized to the value 0.0, and then fill it with the transpose of d. Note that there is no need to annotate the element type of the vector, since f32 is inferred from the context:

    // Transpose of d
    let mut t = std::vec![0.0; n * n];
    // Function: for some column j in d,
    // copy all elements of that column into row j in t (t_row)
    let transpose_column = |(j, t_row): (usize, &mut [f32])| {
        for (i, x) in t_row.iter_mut().enumerate() {
            *x = d[n*i + j];
        }
    };
    // Copy all columns of d into rows of t in parallel
    t.par_chunks_mut(n)
        .enumerate()
        .for_each(transpose_column);

Now all columns of d have been stored as rows in t, and all we have to do is iterate over every row pair of d and t. As before, we partition r into n non-overlapping, mutable rows so that each thread works on one row at a time:

    // Function: for some row i in d and all rows of t,
    // compute n results into row i in r (r_row)
    let step_row = |(i, r_row): (usize, &mut [f32])| {
        for (j, res) in r_row.iter_mut().enumerate() {
            let mut v = std::f32::INFINITY;
            for k in 0..n {
                let x = d[n*i + k];
                let y = t[n*j + k];
                let z = x + y;
                v = v.min(z);
            }
            *res = v;
        }
    };
    // Partition r into rows containing n elements,
    // and apply step_row on all rows in parallel
    r.par_chunks_mut(n)
        .enumerate()
        .for_each(step_row);
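
For completeness, here is a minimal, self-contained sketch of how these two snippets might fit together into one function. The step signature, the function name, and the rayon import are assumptions based on the surrounding chapters, since they are not shown in the snippets above:

    use rayon::prelude::*;

    // Hypothetical wrapper: r and d are n*n matrices in row-major order,
    // and the result of the step is written into r
    pub fn step(r: &mut [f32], d: &[f32], n: usize) {
        // Transpose of d (the preprocessing step from above)
        let mut t = vec![0.0; n * n];
        t.par_chunks_mut(n).enumerate().for_each(|(j, t_row)| {
            for (i, x) in t_row.iter_mut().enumerate() {
                *x = d[n*i + j];
            }
        });
        // The main loop from above
        r.par_chunks_mut(n).enumerate().for_each(|(i, r_row)| {
            for (j, res) in r_row.iter_mut().enumerate() {
                let mut v = std::f32::INFINITY;
                for k in 0..n {
                    v = v.min(d[n*i + k] + t[n*j + k]);
                }
                *res = v;
            }
        });
    }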

Benchmark

We'll use the same settings as in v0.

Implementation  Compiler              Time (s)  IPC
C++ v1          gcc 7.4.0-1ubuntu1    60.5      1.54
C++ v1          clang 6.0.0-1ubuntu2  60.5      1.00
Rust v1         rustc 1.38.0-nightly  114.6     2.11

The linear memory access pattern helps a lot here, compared to what we had in the previous version. However, the Rust program is struggling to keep up: it executes twice as many instructions per cycle as the clang-compiled C++ program, yet takes almost twice as long. In the previous chapter, we noted that the array bounds checks and NaN checks did not affect the running time, because the memory access pattern was so bad. Now that the memory access pattern is fixed, the extra instructions are starting to slow us down.

Let's look at the most recent output from rustc to see these extra instructions. This time, we skip gcc and clang, because they produced almost the same output as in v0.

rustc

Not much has changed from v0, except that there are now even more registers involved in the bounds checking.

LOOP:
    cmp         rax,rdx
    jae         13e
    mov         rcx,QWORD PTR [rbx+0x10]
    cmp         rcx,rsi
    jbe         150
    mov         rcx,QWORD PTR [rbx]
    mov         r10,QWORD PTR [r15]
    vmovss      xmm2,DWORD PTR [r10+rax*4]
    vaddss      xmm2,xmm2,DWORD PTR [rcx+rsi*4]
    vminss      xmm3,xmm2,xmm1
    vcmpunordss xmm1,xmm1,xmm1
    vblendvps   xmm1,xmm3,xmm2,xmm1
    inc         rsi
    inc         rax
    dec         rdi
    jne         LOOP

Running the Rust benchmark under perf record suggests that a significant amount of the running time is spent doing NaN checks with vcmpunordss and vblendvps.

Dealing with the NaN check

Let's remove the NaN checks by replacing f32::min in the inner loop with a simple if-else expression:

    for k in 0..n {
        let x = d[n*i + k];
        let y = t[n*j + k];
        let z = x + y;
        v = if v < z { v } else { z };
    }
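
The two formulations are not equivalent in general: f32::min must handle NaN operands, which is exactly what vcmpunordss and vblendvps were doing. A small sketch of the difference, using hypothetical values:

    let v = 1.0f32;
    let z = std::f32::NAN;
    // f32::min returns the non-NaN operand if exactly one argument is NaN
    assert_eq!(v.min(z), 1.0);
    // The if-else takes the else branch whenever `v < z` is false,
    // so here it returns NaN
    let w = if v < z { v } else { z };
    assert!(w.is_nan());

As long as the input contains no NaNs, both versions compute the same result, so the replacement is safe for our purposes.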

Compiling and checking the output, we see that the NaN checks are gone from our loop:

LOOP:
    cmp    rax,rdx
    jae    133
    mov    rcx,QWORD PTR [rbx+0x10]
    cmp    rcx,rsi
    jbe    145
    mov    rcx,QWORD PTR [rbx]
    mov    r10,QWORD PTR [r15]
    vmovss xmm2,DWORD PTR [r10+rax*4]
    vaddss xmm2,xmm2,DWORD PTR [rcx+rsi*4]
    vminss xmm1,xmm1,xmm2
    inc    rsi
    inc    rax
    dec    rdi
    jne    LOOP

Benchmarking the Rust program shows that the running time also improved quite a lot:

Implementation  Compiler              Time (s)  IPC
C++ v1          gcc 7.4.0-1ubuntu1    60.5      1.54
C++ v1          clang 6.0.0-1ubuntu2  60.5      1.00
Rust v1         rustc 1.38.0-nightly  60.8      3.43

What about the array bounds checks? Our mid-range CPU seems to be handling them without much trouble, even in the most performance-critical loop. However, the bounds checks are certainly not free, as we can see from the IPC value. The C++ implementation of v1 is proof that the problem can be solved with significantly fewer instructions. On the other hand, we don't want to remove the bounds checks completely, since we'd prefer to use as little unsafe Rust as possible.
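
For reference, the fully unsafe alternative we are avoiding would replace the indexing in the inner loop with unchecked accesses, along these lines:

    for k in 0..n {
        // get_unchecked skips the bounds check entirely, but the caller
        // becomes responsible for guaranteeing that every index is valid
        let x = unsafe { *d.get_unchecked(n*i + k) };
        let y = unsafe { *t.get_unchecked(n*j + k) };
        let z = x + y;
        v = if v < z { v } else { z };
    }

Instead of going down that road, let's help the compiler prove that the checks are unnecessary.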

Dealing with the bounds checks

Our solution is similar to the preprocessing step of computing the transpose of d: we do a bit of extra work outside the loop to remove a lot of work from inside the loop. If we extract one row of d and one row of t as subslices before the inner loop starts, the compiler gets a chance to check, once per subslice, that the start and end indexes are within the bounds of the slice we are slicing from:

    let step_row = |(i, r_row): (usize, &mut [f32])| {
        // Get a view of row i of d as a subslice
        let d_row = &d[n*i..n*(i+1)];
        for (j, res) in r_row.iter_mut().enumerate() {
            // Same for row j in t
            let t_row = &t[n*j..n*(j+1)];
            let mut v = std::f32::INFINITY;
            for k in 0..n {
                let x = d_row[k];
                let y = t_row[k];
                let z = x + y;
                v = if v < z { v } else { z };
            }
            *res = v;
        }
    };

After compiling the program, we can see that the compiler still wants to check that k is in bounds. Since rsi is incremented by 1 after each iteration, and it is used to load two f32s, it is very likely equal to our k.

LOOP:
    cmp    r10,rsi
    je     194
    vmovss xmm2,DWORD PTR [rdx+rsi*4]
    vaddss xmm2,xmm2,DWORD PTR [rax+rsi*4]
    inc    rsi
    vminss xmm1,xmm1,xmm2
    cmp    rcx,rsi
    jne    LOOP

Benchmarks show that the IPC decreased significantly:

Implementation  Compiler              Time (s)  IPC
C++ v1          gcc 7.4.0-1ubuntu1    60.5      1.54
C++ v1          clang 6.0.0-1ubuntu2  60.5      1.00
Rust v1         rustc 1.38.0-nightly  60.6      2.02

Let's get all bounds checking out of the loop. We are currently using k only to index every element of d_row and t_row in 0..n, so we might as well iterate over both subslices directly. If we zip them together, there's no need for k anymore, and since the iterators access the elements without indexing, there is nothing left to bounds check:

    for (&x, &y) in d_row.iter().zip(t_row.iter()) {
        let z = x + y;
        v = if v < z { v } else { z };
    }

After compiling the program, we can see that not only did the compiler remove the bounds checks, but it also unrolled the loop by 8 iterations:

LOOP:
    vmovss xmm2,DWORD PTR [r9+r15*4-0x1c]
    vmovss xmm3,DWORD PTR [r9+r15*4-0x18]
    vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x1c]
    vminss xmm1,xmm1,xmm2
    vaddss xmm2,xmm3,DWORD PTR [r13+r15*4-0x18]
    vmovss xmm3,DWORD PTR [r9+r15*4-0x14]
    vaddss xmm3,xmm3,DWORD PTR [r13+r15*4-0x14]
    vminss xmm1,xmm1,xmm2
    vminss xmm1,xmm1,xmm3
    vmovss xmm2,DWORD PTR [r9+r15*4-0x10]
    vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x10]
    vminss xmm1,xmm1,xmm2
    vmovss xmm2,DWORD PTR [r9+r15*4-0xc]
    vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0xc]
    vmovss xmm3,DWORD PTR [r9+r15*4-0x8]
    vaddss xmm3,xmm3,DWORD PTR [r13+r15*4-0x8]
    vminss xmm1,xmm1,xmm2
    vminss xmm1,xmm1,xmm3
    vmovss xmm2,DWORD PTR [r9+r15*4-0x4]
    vaddss xmm2,xmm2,DWORD PTR [r13+r15*4-0x4]
    vminss xmm1,xmm1,xmm2
    vmovss xmm2,DWORD PTR [r9+r15*4]
    vaddss xmm2,xmm2,DWORD PTR [r13+r15*4+0x0]
    add    r15,0x8
    vminss xmm1,xmm1,xmm2
    cmp    rax,r15
    jne    LOOP

Recall how clang unrolled the v0 loop in exactly the same way. Since our program is still bottlenecked by memory access, the unrolling does not affect the running time. However, it does reduce the IPC considerably:

Implementation  Compiler              Time (s)  IPC
C++ v1          gcc 7.4.0-1ubuntu1    60.5      1.54
C++ v1          clang 6.0.0-1ubuntu2  60.5      1.00
Rust v1         rustc 1.38.0-nightly  60.6      0.92

The reason is that the unrolled loop gets more work done per executed instruction: most of the instructions in the loop body above do useful work (loading memory with vmovss, addition with vaddss, and computing minimums with vminss), while only 3 of its 27 instructions (add, cmp, and jne) are loop overhead. Compare this to the single-element loop gcc produced in v0.

iter all the things

If we succeeded in eliminating k from the innermost loop by using iterators, can we remove all the other loop variables with iterators, too? We are using par_chunks_mut to divide r into rows of length n, so why not do something similar with d and t, but with immutable chunks instead?

Our function computes n results for row i of d into row i of r. We can make i redundant by chunking d into rows at the same time as r, zipping the two row iterators into pairs, and applying step_row in parallel on all (r_row, d_row) pairs. Inside step_row, we loop over all columns j of d, i.e. all rows j of t. If we chunk t into n rows of length n inside step_row, we can zip that iterator with r_row, which makes index j redundant as well.

Finally, we wrap our if-else minimum into a function and put it into our toolbox:

    #[inline(always)]
    pub fn min(x: f32, y: f32) -> f32 {
        if x < y { x } else { y }
    }
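
A quick sanity check of its semantics; note that, unlike f32::min, it propagates a NaN given as the second argument:

    assert_eq!(min(1.0, 2.0), 1.0);
    // 1.0 < NaN is false, so the else branch returns the NaN
    assert!(min(1.0, std::f32::NAN).is_nan());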

Here's the final v1 version of step_row:

    // Function: for some row in d (d_row) and all rows of t (t_rows),
    // compute n results into the corresponding row in r (r_row)
    let step_row = |(r_row, d_row): (&mut [f32], &[f32])| {
        let t_rows = t.chunks_exact(n);
        for (res, t_row) in r_row.iter_mut().zip(t_rows) {
            *res = d_row.iter()
                        .zip(t_row)
                        .fold(std::f32::INFINITY, |v, (&x, &y)| min(v, x + y));
        }
    };
    // Partition r and d into slices, each containing a single row of r and d,
    // and apply the function on the row pairs
    r.par_chunks_mut(n)
        .zip(d.par_chunks(n))
        .for_each(step_row);
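
If the fold looks opaque, it is just the earlier inner loop in disguise: the accumulator starts at infinity and each zipped pair of elements is folded into it with min. A roughly equivalent imperative form:

    let mut v = std::f32::INFINITY;
    for (&x, &y) in d_row.iter().zip(t_row) {
        v = min(v, x + y);
    }
    *res = v;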

The compiler output and the benchmark results remain unchanged.

It's nice to see functional code that performs as well as a C++ program. However, as we start pushing the CPU towards its limits, we will eventually have to trade away some of this "functional prettiness" for raw performance, e.g. by unrolling loops by hand and using hard-coded numbers of variables.