Software prefetching
In this version we will attempt to take advantage of vacant CPU execution ports by inserting prefetch instructions into the performance-critical loop, reducing the average memory access latency.
The motivation behind this idea is explained in the reference materials.
Note that vpermpd and vpermilpd use the same execution ports as vperm2f128 and vpermilps, so the reasoning also holds for clang and rustc.
Implementation
We won't be making many changes from v5, since we only want to insert two prefetcht0 instructions in the innermost loop. prefetcht0 uses the strongest locality hint, T0, which requests the data to be loaded into all cache levels. The instruction is provided by the same Intel intrinsics crate we have been using for inserting SIMD instructions, where it is defined as _mm_prefetch.
Since we will be using it only for prefetching addresses containing f32x8s, we might as well wrap it into a helper function and put it in our SIMD helper module:
#[inline]
pub fn prefetch(p: *const f32x8, offset: isize) {
    // Hint the CPU to fetch the cache line containing the address `p + offset`
    // into all cache levels (locality hint T0)
    unsafe { _mm_prefetch(p.offset(offset) as *const i8, _MM_HINT_T0) }
}
The function takes as arguments the memory address of an f32x8 and an offset, and requests a cache line fetch for the resulting address using locality hint T0. In C, p.offset(offset) would basically be equal to p + offset. We need the unsafe expression both for calling the _mm_prefetch intrinsic and for p.offset, but we shouldn't have to worry about memory safety too much here: we only compute the offset address, and the pointer is never dereferenced.
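To make the offset arithmetic concrete, here is a small sketch (not part of the v6 implementation, and the function name is made up purely for illustration) showing that pointer offsets are scaled by the size of the pointee, which is why an offset of 20 f32x8s shows up as a 0x280-byte displacement in the assembly further down:
// Illustration only: offsets on a *const f32x8 are scaled by the element size.
// Each f32x8 is 32 bytes, so stepping 20 elements ahead moves the address by
// 20 * 32 = 640 = 0x280 bytes.
fn offset_scaling_demo(vd_row: &[f32x8]) {
    let base: *const f32x8 = vd_row.as_ptr();
    // wrapping_offset computes the same address as `base + 20` would in C,
    // and is always safe to evaluate, even if it points past the end of the slice
    let ahead = base.wrapping_offset(20);
    assert_eq!(ahead as usize - base as usize, 20 * std::mem::size_of::<f32x8>());
}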
Now that we have our prefetch helper, we can add it to our v5 implementation. First, we get a pair of f32x8 pointers to the current row pair vd_row and vt_row:
// Everything is exactly as in v5, but we add some prefetch instructions in the innermost loop
let step_row_block = |(r_row_block, vd_row): (&mut [f32], &[f32x8])| {
// Create const raw pointers for specifying addresses to prefetch
let vd_row_ptr = vd_row.as_ptr();
const PREFETCH_LENGTH: usize = 20;
for (j, vt_row) in vt.chunks_exact(n).enumerate() {
let vt_row_ptr = vt_row.as_ptr();
PREFETCH_LENGTH = 20 is the number of f32x8 addresses we want to look ahead, and it was chosen empirically in the reference implementation. We'll insert two prefetch hints for the addresses 20 elements ahead of d0 and t0 at the beginning of the innermost loop:
let mut tmp = [simd::f32x8_infty(); simd::f32x8_LENGTH];
for (col, (&d0, &t0)) in vd_row.iter().zip(vt_row).enumerate() {
// Insert prefetch hints for fetching the cache line containing
// the memory address 20 addresses ahead of the current column
simd::prefetch(vd_row_ptr, (col + PREFETCH_LENGTH) as isize);
simd::prefetch(vt_row_ptr, (col + PREFETCH_LENGTH) as isize);
let d2 = simd::swap(d0, 2);
let d4 = simd::swap(d0, 4);
let d6 = simd::swap(d4, 2);
let t1 = simd::swap(t0, 1);
tmp[0] = simd::min(tmp[0], simd::add(d0, t0));
tmp[1] = simd::min(tmp[1], simd::add(d0, t1));
tmp[2] = simd::min(tmp[2], simd::add(d2, t0));
tmp[3] = simd::min(tmp[3], simd::add(d2, t1));
tmp[4] = simd::min(tmp[4], simd::add(d4, t0));
tmp[5] = simd::min(tmp[5], simd::add(d4, t1));
tmp[6] = simd::min(tmp[6], simd::add(d6, t0));
tmp[7] = simd::min(tmp[7], simd::add(d6, t1));
}
That's about it; let's run the benchmarks. The C++ version is available here.
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v6 | gcc 7.4.0-1ubuntu1 | 2.10 | 3.20 |
C++ v6 | clang 6.0.0-1ubuntu2 | 2.33 | 2.25 |
Rust v6 | rustc 1.38.0-nightly | 2.67 | 2.77 |
Something is not right: the Rust implementation became slower compared to the previous version.
Let's look at the assembly.
gcc
LOOP:
vmovaps ymm2,YMMWORD PTR [rdx-0x280]
vmovaps ymm3,YMMWORD PTR [rax-0x280]
prefetcht0 BYTE PTR [rax]
add rax,0x20
prefetcht0 BYTE PTR [rdx]
add rdx,0x20
vpermilps ymm0,ymm2,0xb1
vperm2f128 ymm13,ymm3,ymm3,0x1
vpermilps ymm14,ymm3,0x4e
vaddps ymm15,ymm3,ymm2
vaddps ymm3,ymm3,ymm0
vpermilps ymm1,ymm13,0x4e
vminps ymm7,ymm7,ymm3
vaddps ymm3,ymm2,ymm14
vaddps ymm14,ymm0,ymm14
vminps ymm11,ymm11,ymm15
vminps ymm10,ymm10,ymm3
vaddps ymm3,ymm2,ymm13
vaddps ymm13,ymm0,ymm13
vaddps ymm2,ymm2,ymm1
vaddps ymm0,ymm0,ymm1
vminps ymm6,ymm6,ymm14
vminps ymm9,ymm9,ymm3
vminps ymm5,ymm5,ymm13
vminps ymm8,ymm8,ymm2
vminps ymm4,ymm4,ymm0
cmp rax,rcx
jne LOOP
There are two prefetcht0 hints, targeting addresses 0x280 bytes ahead of the addresses currently being loaded through registers rdx and rax. This equals 20 f32x8 vectors, because each f32x8 is 32 bytes and 0x280/32 = 20, just as we wanted.
clang
LOOP:
prefetcht0 BYTE PTR [rcx+rdi*1]
prefetcht0 BYTE PTR [rax+rdi*1]
vmovapd ymm9,YMMWORD PTR [rcx+rdi*1-0x280]
vmovaps ymm10,YMMWORD PTR [rax+rdi*1-0x280]
vpermpd ymm11,ymm9,0x4e
vpermilpd ymm12,ymm9,0x5
vpermilpd ymm13,ymm11,0x5
vpermilps ymm14,ymm10,0xb1
vaddps ymm15,ymm9,ymm10
vminps ymm8,ymm8,ymm15
vaddps ymm9,ymm9,ymm14
vminps ymm4,ymm4,ymm9
vaddps ymm9,ymm12,ymm10
vminps ymm7,ymm7,ymm9
vaddps ymm9,ymm12,ymm14
vminps ymm3,ymm3,ymm9
vaddps ymm9,ymm11,ymm10
vminps ymm6,ymm6,ymm9
vaddps ymm9,ymm11,ymm14
vminps ymm2,ymm2,ymm9
vaddps ymm9,ymm10,ymm13
vminps ymm5,ymm5,ymm9
vaddps ymm9,ymm13,ymm14
vminps ymm1,ymm1,ymm9
add rdi,0x20
add rdx,0xffffffffffffffff
jne LOOP
rustc
LOOP:
inc rbx
vmovapd ymm9,YMMWORD PTR [r11+rsi*1-0x280]
vmovaps ymm10,YMMWORD PTR [rcx+rsi*1]
prefetcht0 BYTE PTR [r11+rsi*1]
prefetcht0 BYTE PTR [rdx+rsi*1]
vpermilpd ymm11,ymm9,0x5
vpermpd ymm12,ymm9,0x4e
vpermpd ymm13,ymm9,0x1b
vpermilps ymm14,ymm10,0xb1
vaddps ymm15,ymm9,ymm10
vminps ymm8,ymm8,ymm15
vmovaps YMMWORD PTR [rsp+0xc0],ymm8
vaddps ymm9,ymm9,ymm14
vminps ymm7,ymm7,ymm9
vmovaps YMMWORD PTR [rsp+0xe0],ymm7
vaddps ymm9,ymm11,ymm10
vminps ymm6,ymm6,ymm9
vmovaps YMMWORD PTR [rsp+0x100],ymm6
vaddps ymm9,ymm11,ymm14
vminps ymm5,ymm5,ymm9
vmovaps YMMWORD PTR [rsp+0x120],ymm5
vaddps ymm9,ymm12,ymm10
vminps ymm4,ymm4,ymm9
vmovaps YMMWORD PTR [rsp+0x140],ymm4
vaddps ymm9,ymm12,ymm14
vminps ymm3,ymm3,ymm9
vmovaps YMMWORD PTR [rsp+0x160],ymm3
vaddps ymm9,ymm10,ymm13
vminps ymm2,ymm2,ymm9
vmovaps YMMWORD PTR [rsp+0x180],ymm2
vaddps ymm9,ymm13,ymm14
vminps ymm1,ymm1,ymm9
vmovaps YMMWORD PTR [rsp+0x1a0],ymm1
add rsi,0x20
cmp rbx,rax
jb LOOP
We can see two prefetch instructions with locality hint T0, but for some reason there is also quite a lot of register spilling. This behaviour seems a bit odd, since the only thing we changed in the inner loop compared to v5 was to add the two prefetch instructions. Also, we can see that after a register is written into memory, it is not used anywhere else in the loop during that iteration. Recall how we faced the same issue in v4, which we solved by unrolling the tmp results array into separate, mutable variables. This seemed to encourage the compiler to keep the temporary results in registers for the duration of the loop, so let's do the same here as well.
Full step_row_block implementation
// Everything is mostly as in v5,
// but we add some prefetch instructions in the innermost loop,
// and unroll the tmp results array to avoid register spilling
let step_row_block = |(r_row_block, vd_row): (&mut [f32], &[f32x8])| {
// Create const raw pointers for specifying addresses to prefetch
let vd_row_ptr = vd_row.as_ptr();
const PREFETCH_LENGTH: usize = 20;
for (j, vt_row) in vt.chunks_exact(n).enumerate() {
let vt_row_ptr = vt_row.as_ptr();
let mut tmp0 = simd::f32x8_infty();
let mut tmp1 = simd::f32x8_infty();
let mut tmp2 = simd::f32x8_infty();
let mut tmp3 = simd::f32x8_infty();
let mut tmp4 = simd::f32x8_infty();
let mut tmp5 = simd::f32x8_infty();
let mut tmp6 = simd::f32x8_infty();
let mut tmp7 = simd::f32x8_infty();
for (col, (&d0, &t0)) in vd_row.iter().zip(vt_row).enumerate() {
// Insert prefetch hints for fetching the cache line containing
// the memory address 20 addresses ahead of the current column
simd::prefetch(vd_row_ptr, (col + PREFETCH_LENGTH) as isize);
simd::prefetch(vt_row_ptr, (col + PREFETCH_LENGTH) as isize);
let d2 = simd::swap(d0, 2);
let d4 = simd::swap(d0, 4);
let d6 = simd::swap(d4, 2);
let t1 = simd::swap(t0, 1);
tmp0 = simd::min(tmp0, simd::add(d0, t0));
tmp1 = simd::min(tmp1, simd::add(d0, t1));
tmp2 = simd::min(tmp2, simd::add(d2, t0));
tmp3 = simd::min(tmp3, simd::add(d2, t1));
tmp4 = simd::min(tmp4, simd::add(d4, t0));
tmp5 = simd::min(tmp5, simd::add(d4, t1));
tmp6 = simd::min(tmp6, simd::add(d6, t0));
tmp7 = simd::min(tmp7, simd::add(d6, t1));
}
let tmp = [
tmp0, simd::swap(tmp1, 1),
tmp2, simd::swap(tmp3, 1),
tmp4, simd::swap(tmp5, 1),
tmp6, simd::swap(tmp7, 1),
];
for (tmp_i, r_row) in r_row_block.chunks_exact_mut(n).enumerate() {
for tmp_j in 0..simd::f32x8_LENGTH {
let res_j = j * simd::f32x8_LENGTH + tmp_j;
if res_j < n {
let v = tmp[tmp_i ^ tmp_j];
let vi = tmp_j as u8;
r_row[res_j] = simd::extract(v, vi);
}
}
}
}
};
r.par_chunks_mut(simd::f32x8_LENGTH * n)
.zip(vd.par_chunks(n))
.for_each(step_row_block);
rustc without spilling
LOOP:
inc rbx
vmovapd ymm9,YMMWORD PTR [r11+rsi*1-0x280]
vmovaps ymm10,YMMWORD PTR [rcx+rsi*1]
prefetcht0 BYTE PTR [r11+rsi*1]
prefetcht0 BYTE PTR [rdx+rsi*1]
vpermilpd ymm11,ymm9,0x5
vpermpd ymm12,ymm9,0x4e
vpermpd ymm13,ymm9,0x1b
vpermilps ymm14,ymm10,0xb1
vaddps ymm15,ymm9,ymm10
vminps ymm4,ymm4,ymm15
vaddps ymm9,ymm9,ymm14
vminps ymm8,ymm8,ymm9
vaddps ymm9,ymm11,ymm10
vminps ymm3,ymm3,ymm9
vaddps ymm9,ymm11,ymm14
vminps ymm7,ymm7,ymm9
vaddps ymm9,ymm12,ymm10
vminps ymm2,ymm2,ymm9
vaddps ymm9,ymm12,ymm14
vminps ymm6,ymm6,ymm9
vaddps ymm9,ymm10,ymm13
vminps ymm1,ymm1,ymm9
vaddps ymm9,ymm13,ymm14
vminps ymm5,ymm5,ymm9
add rsi,0x20
cmp rbx,rax
jb LOOP
Benchmark
Implementation | Compiler | Time (s) | IPC |
---|---|---|---|
C++ v6 | gcc 7.4.0-1ubuntu1 | 2.10 | 3.20 |
C++ v6 | clang 6.0.0-1ubuntu2 | 2.33 | 2.25 |
Rust v6 | rustc 1.38.0-nightly | 2.16 | 3.23 |