Cache reuse

Full source

In our final version, we will attempt to increase cache locality also for the data from vt by reading f32x8 row pairs from vd and vt in a Z-order curve iteration pattern. Of the patterns shown in this animation, we will implement the last one on the right. Please see the reference materials for a detailed explanation.

Implementation

This version will be an extension to v5, and we won't be using the prefetching hints seen in v6. There won't be any changes to the performance critical loop or result extraction. However, we need to rewrite most of the code to support the Z-order iteration pattern. Our approach will be the same as in the reference implementation:

  1. Create a 2-dimensional Z-order index pattern by sorting index pairs by the interleaved bits of row index i and column index j.
  2. Compute partial results in vertical stripes of 500 columns.
  3. Extract final results from partial results.

Preparation

We start by defining some constants. We'll fix the width of all vertical stripes to 500 columns.

    // How many adjacent columns to process during one pass
    // Smaller numbers improve cache locality but add overhead
    // from having to merge partial results
    const COLS_PER_STRIPE: usize = 500;
    let vecs_per_col = (n + simd::f32x8_LENGTH - 1) / simd::f32x8_LENGTH;
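
For example, if n = 999, the rounded-up division gives vecs_per_col = (999 + 7) / 8 = 125 f32x8 rows, with one unused lane of padding in the last vector.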

Then we create the 2-dimensional Z-order pattern for pairs of i and j. We'll use the same trick as in the reference implementation, which is to use the parallel deposit intrinsic function for scattering the bits of i into odd indexed bits and the bits of j into even indexed bits, and then ORing the results. We wrap this into a function z_encode and put it into our toolbox:

    use std::arch::x86_64::_pdep_u32;

    #[inline]
    pub fn z_encode(x: u32, y: u32) -> u32 {
        // Scatter the bits of x into the positions marked by odd_bits,
        // the bits of y into the positions marked by even_bits,
        // and combine them into a single interleaved index
        let odd_bits = 0x55555555;
        let even_bits = 0xAAAAAAAA;
        unsafe { _pdep_u32(x, odd_bits) | _pdep_u32(y, even_bits) }
    }
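
To make the bit layout concrete, here is a rough sketch of a portable, loop-based encoder together with a test comparing it against z_encode. The name z_encode_portable is ours and not part of the toolbox, and running the test assumes the binary is built for a CPU with BMI2 support:

    // Hypothetical helper, only for illustration: interleave the low 16 bits
    // of x and y one bit at a time, x into positions 0, 2, 4, ... and
    // y into positions 1, 3, 5, ...
    pub fn z_encode_portable(x: u32, y: u32) -> u32 {
        let mut z: u32 = 0;
        for bit in 0..16 {
            z |= ((x >> bit) & 1) << (2 * bit);
            z |= ((y >> bit) & 1) << (2 * bit + 1);
        }
        z
    }

    #[test]
    fn z_encode_agrees_with_portable_version() {
        // Requires a CPU with BMI2, since z_encode uses _pdep_u32
        for x in 0..64 {
            for y in 0..64 {
                assert_eq!(z_encode(x, y), z_encode_portable(x, y));
            }
        }
    }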

If n were always a power of 2, there would be no need to handle edge cases, since z_encode would always return a valid z-index. However, when n is not a power of 2, we must make sure to skip all z-indexes that are out of bounds. We use the same approach as in the reference solution, which is to create a vector row_pairs containing 3-tuples (z_encode(i, j), i, j) and sort it by the z-index. When we enumerate the sorted row_pairs, we get correct z-indexes that do not include out-of-bounds row and column indexes.

    // Build a Z-order curve iteration pattern of pairs (i, j)
    // by using interleaved bits of i and j as a sort key
    let mut row_pairs = std::vec![(0, 0, 0); vecs_per_col * vecs_per_col];
    // Define a function that interleaves one row of indexes
    let interleave_row = |(i, row): (usize, &mut [(usize, usize, usize)])| {
        for (j, x) in row.iter_mut().enumerate() {
            let z = z_encode(i as u32, j as u32);
            *x = (z as usize, i, j);
        }
    };
    // Apply the function independently on all rows, then sort by the z-index
    row_pairs
        .par_chunks_mut(vecs_per_col)
        .enumerate()
        .for_each(interleave_row);
    // We don't need stable sort since there are no duplicate keys
    row_pairs.par_sort_unstable();
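
As a small, standalone illustration (not part of the implementation, and assuming z_encode from the toolbox is in scope): with vecs_per_col = 3, the sorted keys skip the values 5, 7, 10 and 11, since those would decode to a row or column index of 3, which is out of bounds here.

    fn main() {
        let vecs_per_col: usize = 3;
        let mut row_pairs: Vec<(usize, usize, usize)> = Vec::new();
        for i in 0..vecs_per_col {
            for j in 0..vecs_per_col {
                row_pairs.push((z_encode(i as u32, j as u32) as usize, i, j));
            }
        }
        row_pairs.sort_unstable();
        // Enumerating this vector gives a gapless Z-order traversal of the
        // 3-by-3 grid of index pairs
        assert_eq!(
            row_pairs,
            vec![
                (0, 0, 0), (1, 1, 0), (2, 0, 1), (3, 1, 1), (4, 2, 0),
                (6, 2, 1), (8, 0, 2), (9, 1, 2), (12, 2, 2),
            ]
        );
    }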

Recall how we used an 8-by-8 tmp block in previous versions to store partial results. In this version, we'll store a tmp block for every Z-order index pair (i, j) into partial_results. By storing tmp blocks into partial_results for every index pair, we can fairly easily load and write into the correct tmp block when we process each vertical stripe of data.

    // We'll be processing the input one stripe at a time
    let mut vd = std::vec![simd::f32x8_infty(); COLS_PER_STRIPE * vecs_per_col];
    let mut vt = std::vec![simd::f32x8_infty(); COLS_PER_STRIPE * vecs_per_col];
    // Non-overlapping working memory for threads to update their results
    // When enumerated in 8 element chunks, indexes the Z-order curve keys
    let mut partial_results = std::vec![simd::f32x8_infty(); vecs_per_col * vecs_per_col * simd::f32x8_LENGTH];
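
Note that partial_results can be fairly large: with n = 6000, for example, it would hold 750 * 750 * 8 = 4.5 million f32x8 vectors, i.e. 144 MB of partial results that are carried from one stripe to the next.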

Computing results in vertical stripes

Now, we will compute all the results. Note that we haven't initialized the values for vd and vt yet. We'll do it inside the loop, one stripe at a time. Here's a brief overview of what happens during one pass over one stripe:

    // Process vd and vt in Z-order one vertical stripe at a time, writing partial results in parallel
    let num_vertical_stripes = (n + COLS_PER_STRIPE - 1) / COLS_PER_STRIPE;
    for stripe in 0..num_vertical_stripes {
        let col_begin = stripe * COLS_PER_STRIPE;
        let col_end = n.min((stripe + 1) * COLS_PER_STRIPE);
        // ...
        // pack one stripe of vd and vt from d
        // ...
        // 1. load results from previous stripe
        // 2. compute results for this stripe
        // 3. save results for next stripe
    }
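
The packing step is elided above, but a rough sketch of what it might look like inside the loop is shown below. It assumes, as in the earlier versions, that d is the row-major n-by-n input and that vt holds the transpose of d; simd::from_slice stands in for whatever constructor the toolbox provides for building an f32x8 from eight f32 values. Lanes and columns that fall outside the matrix are left at infinity so they never win the min-operations:

        // Sketch of packing one stripe of vd (rows of d) and vt (columns of d);
        // block index b covers output rows/columns 8*b..8*b+7
        let pack_stripe = |(b, (vd_row, vt_row)): (usize, (&mut [f32x8], &mut [f32x8]))| {
            for (c, (vx, vy)) in vd_row.iter_mut().zip(vt_row.iter_mut()).enumerate() {
                let col = col_begin + c;
                let mut vx_tmp = [std::f32::INFINITY; simd::f32x8_LENGTH];
                let mut vy_tmp = [std::f32::INFINITY; simd::f32x8_LENGTH];
                if col < col_end {
                    for (k, (x, y)) in vx_tmp.iter_mut().zip(vy_tmp.iter_mut()).enumerate() {
                        let row = b * simd::f32x8_LENGTH + k;
                        if row < n {
                            *x = d[n * row + col];
                            *y = d[n * col + row];
                        }
                    }
                }
                *vx = simd::from_slice(&vx_tmp);
                *vy = simd::from_slice(&vy_tmp);
            }
        };
        vd.par_chunks_mut(COLS_PER_STRIPE)
            .zip(vt.par_chunks_mut(COLS_PER_STRIPE))
            .enumerate()
            .for_each(pack_stripe);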

The actual computation is not very different from v5, except that we are processing vd and vt in stripes. Also, we cannot extract results before we have processed all stripes, so each thread will load and save a tmp block from partial_results for every pair of indexes i and j. After loading one stripe of vd and vt from d, we process them in Z-order using index pairs (i, j) from row_pairs. If we enumerate row_pairs, we also get the index of each tmp block in partial_results, so we might as well zip row_pairs with partial_results to avoid using the z-indexes directly. We apply step_partial_block in parallel such that each thread computes results for one tmp block at index z in partial_results and index pair (i, j) at index z in row_pairs:

        // Function: for a f32x8 block of partial results and indexes row i col j,
        // 1. Load tmp from partial results
        // 2. Accumulate results for row i and column j into tmp
        // 3. Write tmp into the original partial results block
        let step_partial_block = |(prev_tmp, &(_, i, j)): (&mut [f32x8], &(usize, usize, usize))| {
            // Copy results from previous pass over previous stripe
            let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
            tmp.copy_from_slice(&prev_tmp);
            // Get slices over current stripes of row i and column j
            let vd_row = &vd[(COLS_PER_STRIPE * i)..(COLS_PER_STRIPE * (i + 1))];
            let vt_row = &vt[(COLS_PER_STRIPE * j)..(COLS_PER_STRIPE * (j + 1))];
            for (&d0, &t0) in vd_row.iter().zip(vt_row) {
                let d2 = simd::swap(d0, 2);
                let d4 = simd::swap(d0, 4);
                let d6 = simd::swap(d4, 2);
                let t1 = simd::swap(t0, 1);
                tmp[0] = simd::min(tmp[0], simd::add(d0, t0));
                tmp[1] = simd::min(tmp[1], simd::add(d0, t1));
                tmp[2] = simd::min(tmp[2], simd::add(d2, t0));
                tmp[3] = simd::min(tmp[3], simd::add(d2, t1));
                tmp[4] = simd::min(tmp[4], simd::add(d4, t0));
                tmp[5] = simd::min(tmp[5], simd::add(d4, t1));
                tmp[6] = simd::min(tmp[6], simd::add(d6, t0));
                tmp[7] = simd::min(tmp[7], simd::add(d6, t1));
            }
            // Store partial results (8 vecs of type f32x8) to global memory
            // for processing next stripe
            prev_tmp.copy_from_slice(&tmp);
        };
        // Process current stripe in parallel, each thread filling one `tmp` block
        partial_results
            .par_chunks_mut(simd::f32x8_LENGTH)
            .zip(row_pairs.par_iter())
            .for_each(step_partial_block);

Extracting results

After accumulating results over each vertical stripe, we need to extract all results from the partial results that are in Z-order.

First, let's replace the z-index key in row_pairs with a linear index, i.e. each element's position in the Z-sorted order, and then sort row_pairs by (i, j). This gives us, for every index pair (i, j), the position of its tmp block in partial_results, which allows us to chunk r into blocks of 8 rows indexed by i and, for each column block j, look up the correct tmp block in partial_results.

    // Replace ij sorting key by linear index to get a mapping to partial_results,
    // then sort row_pairs by (i, j)
    let replace_z_index_row = |(z_row, index_row): (usize, &mut [(usize, usize, usize)])| {
        for (z, idx) in index_row.iter_mut().enumerate() {
            let (_, i, j) = *idx;
            *idx = (z_row * vecs_per_col + z, i, j);
        }
    };
    let key_ij = |&idx: &(usize, usize, usize)| { (idx.1, idx.2) };
    row_pairs
        .par_chunks_mut(vecs_per_col)
        .enumerate()
        .for_each(replace_z_index_row);
    row_pairs.par_sort_unstable_by_key(key_ij);

Now, row_pairs is ordered linearly, first by i and then by j, and the first element of each tuple gives the position of the corresponding 8-by-8 tmp block in partial_results.
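
Continuing the small vecs_per_col = 3 illustration from earlier, the re-keyed and re-sorted row_pairs would be [(0, 0, 0), (2, 0, 1), (6, 0, 2), (1, 1, 0), (3, 1, 1), (7, 1, 2), (4, 2, 0), (5, 2, 1), (8, 2, 2)], so row_pairs[i * vecs_per_col + j].0 gives the position of the tmp block belonging to pair (i, j).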

We chunk r into blocks of 8 rows, read the corresponding tmp blocks from partial_results, and extract the 64 f32 results from each block exactly as in v5.

    // Function: for 8 rows in r starting at row i*8,
    // read partial results at z-index corresponding to each row i and column j
    // and write them to r
    let set_z_order_result_block = |(i, r_row_block): (usize, &mut [f32])| {
        for j in 0..vecs_per_col {
            // Get the offset of the partial results block for row block i and column block j
            let z = row_pairs[i * vecs_per_col + j].0 * simd::f32x8_LENGTH;
            // Load tmp from z-order partial results for this i, j pair
            let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
            tmp.copy_from_slice(&partial_results[z..z + simd::f32x8_LENGTH]);
            // Continue exactly as in v5
            for k in (1..simd::f32x8_LENGTH).step_by(2) {
                tmp[k] = simd::swap(tmp[k], 1);
            }
            for (tmp_i, r_row) in r_row_block.chunks_exact_mut(n).enumerate() {
                for tmp_j in 0..simd::f32x8_LENGTH {
                    let res_j = j * simd::f32x8_LENGTH + tmp_j;
                    if res_j < n {
                        let v = tmp[tmp_i ^ tmp_j];
                        let vi = tmp_j as u8;
                        r_row[res_j] = simd::extract(v, vi);
                    }
                }
            }
        }
    };
    r.par_chunks_mut(simd::f32x8_LENGTH * n)
        .enumerate()
        .for_each(set_z_order_result_block);
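
Since the result extraction is now considerably more involved than in v5, it can be worth checking the whole pipeline against a naive reference on small inputs. A minimal sketch, assuming the public entry point is a step function that fills r from d and that the recurrence is r[i][j] = min over k of d[i][k] + d[k][j] as in the earlier parts of this series:

    fn step_naive(r: &mut [f32], d: &[f32], n: usize) {
        // Straightforward O(n^3) reference implementation
        for i in 0..n {
            for j in 0..n {
                let mut best = std::f32::INFINITY;
                for k in 0..n {
                    best = best.min(d[n * i + k] + d[n * k + j]);
                }
                r[n * i + j] = best;
            }
        }
    }
    // Compare its output against the Z-order version, e.g. for n in 1..=65,
    // using random non-negative inputs and an approximate float comparison.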

Benchmark

Implementation   Compiler                Time (s)   IPC
C++ v7           gcc 7.4.0-1ubuntu1      2.04       2.94
C++ v7           clang 6.0.0-1ubuntu2    2.16       2.20
Rust v7          rustc 1.38.0-nightly    2.25       2.79

We managed to get a small improvement compared to the Rust program from v5, but not as large as in the C++ versions. The performance critical loop is the same as in v5, which means we cannot look for explanations in its assembly code, or at least not as easily as before. One possible performance bottleneck is that we sort the Z-order indexes twice in the Rust program, while it is done only once in the C++ version. Using a better approach for Z-order encoding and decoding might help reduce the running time.
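
For instance, the inverse mapping can be computed with the parallel bit extract intrinsic, which would make it possible to enumerate z-indexes directly instead of sorting index pairs twice; out-of-bounds pairs would still have to be skipped when n is not a power of 2. A minimal sketch of such a decoder, not used in the current implementation:

    use std::arch::x86_64::_pext_u32;

    // Inverse of z_encode: gather the bits of z back into the original
    // row and column indexes (requires BMI2, like z_encode)
    #[inline]
    pub fn z_decode(z: u32) -> (u32, u32) {
        let odd_bits = 0x55555555;
        let even_bits = 0xAAAAAAAA;
        unsafe { (_pext_u32(z, odd_bits), _pext_u32(z, even_bits)) }
    }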