Cache reuse
In our final version, we will attempt to increase cache locality also for the data from `vt`, by reading `f32x8` row pairs from `vd` and `vt` using a Z-order curve iteration pattern.
Of the patterns shown in the animation in the reference materials, we will implement the last one, on the right; please see the reference materials for a detailed explanation.
Implementation
This version will be an extension to `v5`, and we won't be using the prefetching hints seen in `v6`.
There won't be any changes to the performance-critical loop or result extraction.
However, we need to rewrite most of the code to support the Z-order iteration pattern.
Our approach will be the same as in the reference implementation:

- Create a 2-dimensional Z-order index pattern by sorting the interleaved bits of row index `i` and column index `j`.
- Compute partial results in vertical stripes of 500 columns.
- Extract final results from the partial results.
Preparation
We start by defining some constants. We'll fix the width of all vertical stripes to 500 columns.
```rust
// How many adjacent columns to process during one pass
// Smaller numbers improve cache locality but add overhead
// from having to merge partial results
const COLS_PER_STRIPE: usize = 500;
// Number of f32x8 vectors needed to cover one column of length n,
// i.e. the ceiling of n / 8
let vecs_per_col = (n + simd::f32x8_LENGTH - 1) / simd::f32x8_LENGTH;
```
Then we create the 2-dimensional Z-order pattern for pairs of `i` and `j`.
We'll use the same trick as in the reference implementation, which is to use the parallel deposit intrinsic function for scattering the bits of `i` into odd indexed bits and the bits of `j` into even indexed bits, and `OR`ing the results together.
We wrap it into a function `z_encode` and put it into our toolbox:
```rust
use std::arch::x86_64::_pdep_u32;

// Interleave the bits of x and y into a Z-order index
// Note: _pdep_u32 requires a CPU with BMI2 support
#[inline]
pub fn z_encode(x: u32, y: u32) -> u32 {
    let odd_bits = 0x55555555;
    let even_bits = 0xAAAAAAAA;
    unsafe { _pdep_u32(x, odd_bits) | _pdep_u32(y, even_bits) }
}
```
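As a quick sanity check of the bit layout, here are a few hypothetical assertions (not part of the original toolbox; they assume a CPU with BMI2 support). The bits of `x` land in positions 0, 2, 4, … and the bits of `y` in positions 1, 3, 5, …, so incrementing the first argument moves along the curve fastest:

```rust
// Hypothetical sanity checks for z_encode (assumes BMI2)
assert_eq!(z_encode(0, 0), 0);
assert_eq!(z_encode(1, 0), 1); // x deposited into bit 0
assert_eq!(z_encode(0, 1), 2); // y deposited into bit 1
assert_eq!(z_encode(1, 1), 3);
assert_eq!(z_encode(2, 0), 4); // x deposited into bits 0 and 2
assert_eq!(z_encode(0, 2), 8); // y deposited into bits 1 and 3
assert_eq!(z_encode(3, 3), 15);
```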
If `n` were always a power of 2, there would be no need to handle edge cases, since `z_encode` would always return the correct `z`-index.
However, when `n` is not a power of 2, we must make sure to skip all `z`-indexes that are out of bounds.
We use the same approach as in the reference solution, which is to create a vector `row_pairs` containing 3-tuples `(z_encode(i, j), i, j)` and sort it by the `z`-index.
When we enumerate the sorted `row_pairs`, we get correct `z`-indexes that do not include out-of-bounds row and column indexes.
```rust
// Build a Z-order curve iteration pattern of pairs (i, j)
// by using the interleaved bits of i and j as a sort key
let mut row_pairs = std::vec![(0, 0, 0); vecs_per_col * vecs_per_col];
// Define a function that interleaves one row of indexes
let interleave_row = |(i, row): (usize, &mut [(usize, usize, usize)])| {
    for (j, x) in row.iter_mut().enumerate() {
        let z = z_encode(i as u32, j as u32);
        *x = (z as usize, i, j);
    }
};
// Apply the function independently on all rows and sort by the z-index
row_pairs
    .par_chunks_mut(vecs_per_col)
    .enumerate()
    .for_each(interleave_row);
// We don't need a stable sort since there are no duplicate keys
row_pairs.par_sort_unstable();
```
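To make the resulting iteration order concrete, here is a hypothetical illustration (for exposition only, not from the reference materials) of the beginning of the sorted `row_pairs` if there were 4 row blocks and 4 column blocks; note how the traversal finishes one 2-by-2 block of `(i, j)` pairs before moving on:

```rust
// Hypothetical example with vecs_per_col = 4: the first 8 sorted entries
assert_eq!(
    row_pairs[..8],
    [(0, 0, 0), (1, 1, 0), (2, 0, 1), (3, 1, 1),
     (4, 2, 0), (5, 3, 0), (6, 2, 1), (7, 3, 1)]
);
```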
Recall how we used an 8-by-8 `tmp` block in previous versions to store partial results.
In this version, we'll store a `tmp` block for every Z-order index pair `(i, j)` into `partial_results`.
By storing the `tmp` blocks into `partial_results` for every index pair, we can fairly easily load and write the correct `tmp` block when we process each vertical stripe of data.
```rust
// We'll be processing the input one stripe at a time
let mut vd = std::vec![simd::f32x8_infty(); COLS_PER_STRIPE * vecs_per_col];
let mut vt = std::vec![simd::f32x8_infty(); COLS_PER_STRIPE * vecs_per_col];
// Non-overlapping working memory for threads to update their results
// When enumerated in 8-element chunks, indexes the Z-order curve keys
let mut partial_results = std::vec![simd::f32x8_infty(); vecs_per_col * vecs_per_col * simd::f32x8_LENGTH];
```
Computing results in vertical stripes
Now, we will compute all the results.
Note that we haven't initialized the values of `vd` and `vt` yet; we'll do it inside the loop, one stripe at a time.
Here's a brief overview of what happens during one pass over one stripe:
```rust
// Process vd and vt in Z-order, one vertical stripe at a time,
// writing partial results in parallel
let num_vertical_stripes = (n + COLS_PER_STRIPE - 1) / COLS_PER_STRIPE;
for stripe in 0..num_vertical_stripes {
    let col_begin = stripe * COLS_PER_STRIPE;
    let col_end = n.min((stripe + 1) * COLS_PER_STRIPE);
    // ...
    // pack one stripe of vd and vt from d
    // ...
    // 1. load results from previous stripe
    // 2. compute results for this stripe
    // 3. save results for next stripe
}
```
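The packing step is elided above. As a rough sketch of one way it could be done, the snippet below fills one stripe, padding out-of-bounds lanes with infinity; `simd::f32x8_from_slice` is a hypothetical helper for building an `f32x8` from 8 scalars, and the actual packing code may differ:

```rust
// Sketch only: pack columns [col_begin, col_end) of d into vd and vt
vd.par_chunks_mut(COLS_PER_STRIPE)
    .zip(vt.par_chunks_mut(COLS_PER_STRIPE))
    .enumerate()
    .for_each(|(i, (vd_row, vt_row))| {
        for c in 0..COLS_PER_STRIPE {
            let col = col_begin + c;
            let mut d_block = [std::f32::INFINITY; simd::f32x8_LENGTH];
            let mut t_block = [std::f32::INFINITY; simd::f32x8_LENGTH];
            if col < col_end {
                for k in 0..simd::f32x8_LENGTH {
                    let row = i * simd::f32x8_LENGTH + k;
                    if row < n {
                        // vd packs 8 consecutive rows of column col,
                        // vt packs the transpose
                        d_block[k] = d[n * row + col];
                        t_block[k] = d[n * col + row];
                    }
                }
            }
            // Hypothetical helper: build an f32x8 from 8 scalars
            vd_row[c] = simd::f32x8_from_slice(&d_block);
            vt_row[c] = simd::f32x8_from_slice(&t_block);
        }
    });
```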
The actual computation is not very different from `v5`, except that we are processing `vd` and `vt` in stripes.
Also, we cannot extract results before we have processed all stripes, so each thread will load and save a `tmp` block from `partial_results` for every pair of indexes `i` and `j`.

After loading one stripe of `vd` and `vt` from `d`, we process them in Z-order using index pairs `(i, j)` from `row_pairs`.
If we enumerate `row_pairs`, we also get the index of each `tmp` block in `partial_results`, so we might as well zip `row_pairs` with `partial_results` to avoid using the `z`-indexes directly.
We apply `step_partial_block` in parallel such that each thread computes results for one `tmp` block at index `z` in `partial_results` and index pair `(i, j)` at index `z` in `row_pairs`:
```rust
// Function: for an f32x8 block of partial results and indexes row i, col j,
// 1. Load tmp from the partial results
// 2. Accumulate results for row i and column j into tmp
// 3. Write tmp back into the original partial results block
let step_partial_block = |(prev_tmp, &(_, i, j)): (&mut [f32x8], &(usize, usize, usize))| {
    // Copy results from the previous pass over the previous stripe
    let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
    tmp.copy_from_slice(prev_tmp);
    // Get slices over the current stripes of row i and column j
    let vd_row = &vd[(COLS_PER_STRIPE * i)..(COLS_PER_STRIPE * (i + 1))];
    let vt_row = &vt[(COLS_PER_STRIPE * j)..(COLS_PER_STRIPE * (j + 1))];
    for (&d0, &t0) in vd_row.iter().zip(vt_row) {
        let d2 = simd::swap(d0, 2);
        let d4 = simd::swap(d0, 4);
        let d6 = simd::swap(d4, 2);
        let t1 = simd::swap(t0, 1);
        tmp[0] = simd::min(tmp[0], simd::add(d0, t0));
        tmp[1] = simd::min(tmp[1], simd::add(d0, t1));
        tmp[2] = simd::min(tmp[2], simd::add(d2, t0));
        tmp[3] = simd::min(tmp[3], simd::add(d2, t1));
        tmp[4] = simd::min(tmp[4], simd::add(d4, t0));
        tmp[5] = simd::min(tmp[5], simd::add(d4, t1));
        tmp[6] = simd::min(tmp[6], simd::add(d6, t0));
        tmp[7] = simd::min(tmp[7], simd::add(d6, t1));
    }
    // Store the partial results (8 vecs of type f32x8) to global memory
    // for processing the next stripe
    prev_tmp.copy_from_slice(&tmp);
};
// Process the current stripe in parallel, each thread filling one `tmp` block
partial_results
    .par_chunks_mut(simd::f32x8_LENGTH)
    .zip(row_pairs.par_iter())
    .for_each(step_partial_block);
```
Extracting results
After accumulating results over every vertical stripe, we need to extract all results from the partial results, which are in Z-order.
First, let's replace the `z`-indexes in `row_pairs` with a linear index and sort `row_pairs` by `(i, j)`, in order to get a mapping from each `(i, j)` pair to the correct partial result.
This allows us to chunk `r` into rows indexed by `i`, and write all results to each row element at `j` by reading `partial_results` linearly.
```rust
// Replace the z-order sorting key by a linear index into partial_results,
// then sort row_pairs by (i, j)
let replace_z_index_row = |(z_row, index_row): (usize, &mut [(usize, usize, usize)])| {
    for (z, idx) in index_row.iter_mut().enumerate() {
        let (_, i, j) = *idx;
        *idx = (z_row * vecs_per_col + z, i, j);
    }
};
let key_ij = |&idx: &(usize, usize, usize)| (idx.1, idx.2);
row_pairs
    .par_chunks_mut(vecs_per_col)
    .enumerate()
    .for_each(replace_z_index_row);
row_pairs.par_sort_unstable_by_key(key_ij);
```
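Continuing the hypothetical 4-by-4 example from above, after replacing the keys and re-sorting, the first element of each tuple now points to the `partial_results` chunk that holds the `tmp` block of that `(i, j)` pair:

```rust
// Hypothetical example with vecs_per_col = 4: the first 8 re-sorted entries
assert_eq!(
    row_pairs[..8],
    [(0, 0, 0), (2, 0, 1), (8, 0, 2), (10, 0, 3),
     (1, 1, 0), (3, 1, 1), (9, 1, 2), (11, 1, 3)]
);
```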
Now, `row_pairs` is ordered linearly, first by `i` then by `j`, such that the first element of each tuple in `row_pairs` tells us which 8-element chunk of `partial_results` holds the corresponding 8-by-8 `tmp` block.
We chunk `r` into 8-row blocks, read the `tmp` result blocks from `partial_results`, and extract the 64 `f32` results exactly as in `v5`.
```rust
// Function: for 8 rows in r starting at row i*8,
// read the partial results block corresponding to row i and column j
// and write its contents into r
let set_z_order_result_block = |(i, r_row_block): (usize, &mut [f32])| {
    for j in 0..vecs_per_col {
        // Get the starting index of the partial results block for row i and column j
        let z = row_pairs[i * vecs_per_col + j].0 * simd::f32x8_LENGTH;
        // Load tmp from the Z-order partial results for this (i, j) pair
        let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
        tmp.copy_from_slice(&partial_results[z..z + simd::f32x8_LENGTH]);
        // Continue exactly as in v5
        for k in (1..simd::f32x8_LENGTH).step_by(2) {
            tmp[k] = simd::swap(tmp[k], 1);
        }
        for (tmp_i, r_row) in r_row_block.chunks_exact_mut(n).enumerate() {
            for tmp_j in 0..simd::f32x8_LENGTH {
                let res_j = j * simd::f32x8_LENGTH + tmp_j;
                if res_j < n {
                    let v = tmp[tmp_i ^ tmp_j];
                    let vi = tmp_j as u8;
                    r_row[res_j] = simd::extract(v, vi);
                }
            }
        }
    }
};
r.par_chunks_mut(simd::f32x8_LENGTH * n)
    .enumerate()
    .for_each(set_z_order_result_block);
```
Benchmark
| Implementation | Compiler | Time (s) | IPC |
| --- | --- | --- | --- |
| C++ v7 | gcc 7.4.0-1ubuntu1 | 2.04 | 2.94 |
| C++ v7 | clang 6.0.0-1ubuntu2 | 2.16 | 2.20 |
| Rust v7 | rustc 1.38.0-nightly | 2.25 | 2.79 |
We managed to get a small improvement compared to the Rust program from `v5`, but not as much as in the C++ versions.
The performance-critical loop is the same as in `v5`, which means we cannot search for answers in the assembly code, or at least not as easily as before.
One possible performance bottleneck could be that we sort the Z-order indexes twice in the Rust program, while it is done only once in the C++ version.
Using a better approach for Z-order encoding and decoding might help reduce the running times.
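One direction worth exploring (a sketch only, not benchmarked here) would be to decode Z-order indexes directly with the parallel extract intrinsic `_pext_u32`, the inverse of `_pdep_u32`, which could remove the need to build and sort `row_pairs` at all when the number of blocks is a power of two:

```rust
use std::arch::x86_64::_pext_u32;

// Inverse of z_encode: gather the bits of x and y back out of z
// (requires BMI2, like z_encode)
#[inline]
pub fn z_decode(z: u32) -> (u32, u32) {
    let x_bits = 0x55555555;
    let y_bits = 0xAAAAAAAA;
    unsafe { (_pext_u32(z, x_bits), _pext_u32(z, y_bits)) }
}
```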