# Cache reuse

In our final version, we will attempt to improve cache locality also for the data from `vt`, by reading `f32x8` row pairs from `vd` and `vt` using a Z-order curve iteration pattern.
If you look at this animation, we will implement the last pattern, shown on the right.
Please see the reference materials for a detailed explanation.

## Implementation

This version will be an extension of `v5`, and we won't be using the prefetching hints seen in `v6`.
There won't be any changes to the performance-critical loop or result extraction.
However, we need to rewrite most of the code to support the Z-order iteration pattern.
Our approach will be the same as in the reference implementation:

- Create a 2-dimensional Z-order index pattern by sorting the interleaved bits of row index `i` and column index `j`.
- Compute partial results in vertical stripes of 500 columns.
- Extract final results from the partial results.

### Preparation

We start by defining some constants. We'll fix the width of all vertical stripes to 500 columns.

```
// How many adjacent columns to process during one pass
// Smaller numbers improve cache locality but add overhead
// from having to merge partial results
const COLS_PER_STRIPE: usize = 500;
let vecs_per_col = (n + simd::f32x8_LENGTH - 1) / simd::f32x8_LENGTH;
```
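Here `vecs_per_col` is the ceiling of `n / 8`; for example, if `n = 6000`, each column of the input is packed into 750 `f32x8` vectors. When `n` is not a multiple of 8, the final vector of each column will be padded with infinities during packing, just like in `v5`.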

Then we create the 2-dimensional Z-order pattern for pairs of `i` and `j`.
We'll use the same trick as in the reference implementation, which is to use the parallel deposit intrinsic function to scatter the bits of `i` into odd-indexed bits and the bits of `j` into even-indexed bits, and to `OR` the two results together.
We wrap this into a function `z_encode` and put it into our toolbox:

```
#[inline]
pub fn z_encode(x: u32, y: u32) -> u32 {
    // _pdep_u32 requires the BMI2 instruction set
    use std::arch::x86_64::_pdep_u32;
    let odd_bits = 0x55555555;
    let even_bits = 0xAAAAAAAA;
    unsafe { _pdep_u32(x, odd_bits) | _pdep_u32(y, even_bits) }
}
```
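As a quick sanity check, here is what `z_encode` produces for one pair of small inputs (a hypothetical snippet, assuming a CPU with BMI2 support for `_pdep_u32`):

```
// The bits of x = 0b11 are deposited into positions 0 and 2 (0b101),
// the bits of y = 0b10 are deposited into positions 1 and 3 (0b1000),
// and OR-ing them together yields 0b1101 = 13
assert_eq!(z_encode(0b11, 0b10), 0b1101);
```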

If `n` were always a power of 2, there would be no need to handle edge cases, since `z_encode` would always return the correct `z`-index.
However, when `n` is not a power of 2, we must make sure to skip all `z`-indexes that are out of bounds.
We use the same approach as in the reference solution, which is to create a vector `row_pairs` containing 3-tuples `(z_encode(i, j), i, j)` and sort it by the `z`-index.
When we enumerate the sorted `row_pairs`, we get correct `z`-indexes that do not include out-of-bounds row and column indexes.

```
// Build a Z-order curve iteration pattern of pairs (i, j)
// by using the interleaved bits of i and j as a sort key
let mut row_pairs = std::vec![(0, 0, 0); vecs_per_col * vecs_per_col];
// Define a function that interleaves one row of indexes
let interleave_row = |(i, row): (usize, &mut [(usize, usize, usize)])| {
    for (j, x) in row.iter_mut().enumerate() {
        let z = z_encode(i as u32, j as u32);
        *x = (z as usize, i, j);
    }
};
// Apply the function independently on all rows and sort by the z-index
row_pairs
    .par_chunks_mut(vecs_per_col)
    .enumerate()
    .for_each(interleave_row);
// We don't need a stable sort since there are no duplicate keys
row_pairs.par_sort_unstable();
```
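To make the resulting iteration order concrete, consider a hypothetical 3-by-3 grid of index pairs, i.e. `vecs_per_col = 3`. After sorting, `row_pairs` would contain:

```
// (z, i, j), sorted by z
// (0, 0, 0), (1, 1, 0), (2, 0, 1), (3, 1, 1), (4, 2, 0),
// (6, 2, 1), (8, 0, 2), (9, 1, 2), (12, 2, 2)
```

Note the gaps at `z` = 5, 7, 10, and 11: these encode pairs with `i` = 3 or `j` = 3, which are out of bounds and therefore absent, yet enumerating the sorted vector still yields a contiguous range of positions.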

Recall how we used an 8-by-8 `tmp` block in previous versions to store partial results.
In this version, we'll store a `tmp` block for every Z-order index pair `(i, j)` into `partial_results`.
By storing `tmp` blocks into `partial_results` for every index pair, we can fairly easily load and write into the correct `tmp` block when we process each vertical stripe of data.

```
// We'll be processing the input one stripe at a time
let mut vd = std::vec![simd::f32x8_infty(); COLS_PER_STRIPE * vecs_per_col];
let mut vt = std::vec![simd::f32x8_infty(); COLS_PER_STRIPE * vecs_per_col];
// Non-overlapping working memory for threads to update their results
// When enumerated in 8 element chunks, indexes the Z-order curve keys
let mut partial_results =
    std::vec![simd::f32x8_infty(); vecs_per_col * vecs_per_col * simd::f32x8_LENGTH];
```
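Note that `partial_results` holds a full 8-by-8 block for every index pair, so its size grows quadratically in `n`: for example, if `n = 6000`, it contains `750 * 750 * 8` vectors of type `f32x8`, or about 144 MB. This is the price we pay for being able to accumulate results across stripes without extracting them in between.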

### Computing results in vertical stripes

Now, we will compute all the results.
Note that we haven't initialized the values of `vd` and `vt` yet.
We'll do it inside the loop, one stripe at a time.
Here's a brief overview of what happens during one pass over one stripe:

```
// Process vd and vt in Z-order one vertical stripe at a time,
// writing partial results in parallel
let num_vertical_stripes = (n + COLS_PER_STRIPE - 1) / COLS_PER_STRIPE;
for stripe in 0..num_vertical_stripes {
    let col_begin = stripe * COLS_PER_STRIPE;
    let col_end = n.min((stripe + 1) * COLS_PER_STRIPE);
    // ...
    // pack one stripe of vd and vt from d
    // ...
    // 1. load results from previous stripe
    // 2. compute results for this stripe
    // 3. save results for next stripe
}
```
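The packing step is elided above; here is a minimal sketch of one way it could look inside the stripe loop. This is not the reference code: it assumes a helper `simd::from_slice` (hypothetical here) that builds an `f32x8` from 8 consecutive `f32` values, and it pads out-of-bounds elements with infinity, just like the packing in `v5`:

```
// Pack rows i*8..(i+1)*8 of the current stripe of d into vd,
// and the corresponding rows of the transpose of d into vt;
// out-of-bounds elements become +inf, overwriting the previous stripe
vd.par_chunks_mut(COLS_PER_STRIPE)
    .zip(vt.par_chunks_mut(COLS_PER_STRIPE))
    .enumerate()
    .for_each(|(i, (vd_row, vt_row))| {
        for (col, (vx, vy)) in vd_row.iter_mut().zip(vt_row.iter_mut()).enumerate() {
            let j = col_begin + col;
            let mut d_tmp = [std::f32::INFINITY; simd::f32x8_LENGTH];
            let mut t_tmp = [std::f32::INFINITY; simd::f32x8_LENGTH];
            for (k, (x, y)) in d_tmp.iter_mut().zip(t_tmp.iter_mut()).enumerate() {
                let row = i * simd::f32x8_LENGTH + k;
                // col_end = min(n, ...) bounds both the stripe and the matrix
                if row < n && j < col_end {
                    *x = d[n * row + j];
                    *y = d[n * j + row];
                }
            }
            *vx = simd::from_slice(&d_tmp);
            *vy = simd::from_slice(&t_tmp);
        }
    });
```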

The actual computation is not very different from `v5`, except that we are processing `vd` and `vt` in stripes.
Also, we cannot extract results before we have processed all stripes, so each thread will load and save a `tmp` block from `partial_results` for every pair of indexes `i` and `j`.
After loading one stripe of `vd` and `vt` from `d`, we process them in Z-order using index pairs `(i, j)` from `row_pairs`.
If we enumerate `row_pairs`, we also get the index of each `tmp` block in `partial_results`, so we might as well zip `row_pairs` with `partial_results` to avoid using the `z`-indexes directly.
We apply `step_partial_block` in parallel such that each thread computes results for one `tmp` block at index `z` in `partial_results` and index pair `(i, j)` at index `z` in `row_pairs`:

```
// Function: for an f32x8 block of partial results and indexes row i, col j,
// 1. Load tmp from partial results
// 2. Accumulate results for row i and column j into tmp
// 3. Write tmp into the original partial results block
let step_partial_block = |(prev_tmp, &(_, i, j)): (&mut [f32x8], &(usize, usize, usize))| {
    // Copy results from previous pass over previous stripe
    let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
    tmp.copy_from_slice(&prev_tmp);
    // Get slices over current stripes of row i and column j
    let vd_row = &vd[(COLS_PER_STRIPE * i)..(COLS_PER_STRIPE * (i + 1))];
    let vt_row = &vt[(COLS_PER_STRIPE * j)..(COLS_PER_STRIPE * (j + 1))];
    for (&d0, &t0) in vd_row.iter().zip(vt_row) {
        let d2 = simd::swap(d0, 2);
        let d4 = simd::swap(d0, 4);
        let d6 = simd::swap(d4, 2);
        let t1 = simd::swap(t0, 1);
        tmp[0] = simd::min(tmp[0], simd::add(d0, t0));
        tmp[1] = simd::min(tmp[1], simd::add(d0, t1));
        tmp[2] = simd::min(tmp[2], simd::add(d2, t0));
        tmp[3] = simd::min(tmp[3], simd::add(d2, t1));
        tmp[4] = simd::min(tmp[4], simd::add(d4, t0));
        tmp[5] = simd::min(tmp[5], simd::add(d4, t1));
        tmp[6] = simd::min(tmp[6], simd::add(d6, t0));
        tmp[7] = simd::min(tmp[7], simd::add(d6, t1));
    }
    // Store partial results (8 vecs of type f32x8) to global memory
    // for processing the next stripe
    prev_tmp.copy_from_slice(&tmp);
};
// Process current stripe in parallel, each thread filling one `tmp` block
partial_results
    .par_chunks_mut(simd::f32x8_LENGTH)
    .zip(row_pairs.par_iter())
    .for_each(step_partial_block);
```
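The zip here is also what makes the parallel mutation safe: `par_chunks_mut` hands each thread a disjoint 8-vector chunk of `partial_results`, paired with exactly one `(z, i, j)` tuple, so no two threads ever write to the same `tmp` block and no locking is needed.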

### Extracting results

After accumulating partial results over every vertical stripe, we still need to extract the final results, which are currently in Z-order.

First, let's replace the `z`-indexes in `row_pairs` with a linear index and sort `row_pairs` by `(i, j)` in order to get a mapping from each `(i, j)` pair to the correct partial result.
This allows us to chunk `r` into rows indexed by `i` and write all results to each row element at `j` by looking up the corresponding block in `partial_results`.

```
// Replace the z-order sorting key with a linear index to get a mapping
// to partial_results, then sort row_pairs by (i, j)
let replace_z_index_row = |(z_row, index_row): (usize, &mut [(usize, usize, usize)])| {
    for (z, idx) in index_row.iter_mut().enumerate() {
        let (_, i, j) = *idx;
        *idx = (z_row * vecs_per_col + z, i, j);
    }
};
let key_ij = |&idx: &(usize, usize, usize)| (idx.1, idx.2);
row_pairs
    .par_chunks_mut(vecs_per_col)
    .enumerate()
    .for_each(replace_z_index_row);
row_pairs.par_sort_unstable_by_key(key_ij);
```
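Continuing the hypothetical 3-by-3 example from above, `row_pairs` now reads:

```
// (linear index into partial_results, i, j), sorted by (i, j)
// (0, 0, 0), (2, 0, 1), (6, 0, 2),
// (1, 1, 0), (3, 1, 1), (7, 1, 2),
// (4, 2, 0), (5, 2, 1), (8, 2, 2)
```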

Now `row_pairs` is ordered linearly, first by `i` and then by `j`, such that the first element of each tuple in `row_pairs` gives the position of the corresponding 8-by-8 `tmp` block in `partial_results`.

We chunk `r` into 8-row blocks, read the `tmp` result blocks from `partial_results`, and extract the 64 `f32` results exactly as in `v5`.

```
// Function: for 8 rows in r starting at row i*8,
// read the partial results corresponding to row i and each column j,
// and write them to r
let set_z_order_result_block = |(i, r_row_block): (usize, &mut [f32])| {
    for j in 0..vecs_per_col {
        // Get the starting index of the tmp block for row i and column j
        let z = row_pairs[i * vecs_per_col + j].0 * simd::f32x8_LENGTH;
        // Load tmp from the z-order partial results of this (i, j) pair
        let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
        tmp.copy_from_slice(&partial_results[z..z + simd::f32x8_LENGTH]);
        // Continue exactly as in v5
        for k in (1..simd::f32x8_LENGTH).step_by(2) {
            tmp[k] = simd::swap(tmp[k], 1);
        }
        for (tmp_i, r_row) in r_row_block.chunks_exact_mut(n).enumerate() {
            for tmp_j in 0..simd::f32x8_LENGTH {
                let res_j = j * simd::f32x8_LENGTH + tmp_j;
                if res_j < n {
                    let v = tmp[tmp_i ^ tmp_j];
                    let vi = tmp_j as u8;
                    r_row[res_j] = simd::extract(v, vi);
                }
            }
        }
    }
};
r.par_chunks_mut(simd::f32x8_LENGTH * n)
    .enumerate()
    .for_each(set_z_order_result_block);
```

## Benchmark

| Implementation | Compiler | Time (s) | IPC |
| --- | --- | --- | --- |
| C++ `v7` | `gcc 7.4.0-1ubuntu1` | 2.04 | 2.94 |
| C++ `v7` | `clang 6.0.0-1ubuntu2` | 2.16 | 2.20 |
| Rust `v7` | `rustc 1.38.0-nightly` | 2.25 | 2.79 |

We managed to get a small improvement over the Rust program from `v5`, but not as large a one as in the C++ versions.
The performance-critical loop is the same as in `v5`, which means we cannot look for answers in the assembly code, or at least not as easily as before.
One possible performance bottleneck could be that we sort the Z-order indexes twice in the Rust program, while it is done only once in the C++ version.
Using a better approach for Z-order encoding and decoding might help reduce the running time.
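For example, the second sort could in principle be avoided by decoding `(i, j)` directly from a `z`-index with the parallel extract intrinsic, the inverse of `_pdep_u32`. A minimal sketch, assuming BMI2 support (this is not part of the benchmarked implementation):

```
#[inline]
pub fn z_decode(z: u32) -> (u32, u32) {
    use std::arch::x86_64::_pext_u32;
    // Gather the odd-indexed bits back into x and the
    // even-indexed bits back into y, inverting z_encode
    let x = unsafe { _pext_u32(z, 0x55555555) };
    let y = unsafe { _pext_u32(z, 0xAAAAAAAA) };
    (x, y)
}
```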