How Far Can We Push Optimization?

In this blog post, we'll take a simple 2x nearest-neighbour resize and try to optimize it to be as fast as possible. We'll touch upon CPU cache, SIMD, and some light parallelization. Here's a sneak peek for the benchmarks;

5.63ms → Naive (column-access) resize

1.24ms → Row-access resize

0.88ms → memcpy

0.88ms → unaligned SIMD

0.72ms → aligned, non-temporal SIMD

0.60ms → parallelization with std::for_each

The Problem

In Ichi, there's a function that allocates a surface for drawing. For convenience, the function re-allocates the surface at each resize instead of allocating a big chunk up front. However, for HiDPI situations (i.e. when RGFW reports that monitor-pixelRatio == 2) we need to allocate double the window size for the surface. This occurs when we set UI scaling to 200% on Windows, and on Macs with Retina screens this is default behaviour. When this happens, our surface becomes double the size of our internal buffer. We can't just memcpy our internal buffer to window surface; we need to double it.

Specifically, we need to 2x nearest-neighbour resize our internal buffer to target surface. This, of course, can have a huge performance impact since we are not doing any blind byte-by-byte buffer copy anymore. Fortunately for us, the problem is well-defined and edge cases are almost non-existent. So optimizing this operation should be a relatively straightforward (and fun!) task.

Working with 200% UI scaling just to trigger the resize path is a bit pointless, so I'm manually setting monitor-pixelRatio = 2 in GetSurface() to trigger the resize while keeping my screen the same.

Benchmarking

Before any sort of work, we should have a clear benchmarking strategy. For this, I'm using a simple header file that I've written for quick benchmarking. We can use it like this;

BENCHMARK_START(100)
renderer.Present(surface->data, surface->w, surface->h);
BENCHMARK_END()

The code between BENCHMARK_START(x) and BENCHMARK_END() is looped x times and mean time is reported. Also, before each run FlushCache() is called to evict any relevant data to the benchmarked code to simulate cold cache by copying 128 megabytes of dummy data around. This is important; if we do not do this our benchmark iterations will be faster than our real-world runs since benchmark data (source buffer contents etc.) will still be hot and in CPU cache. As a result successive iterations will get faster and faster while benchmarking.

Our internal buffer size will be 1280x960 and target surface will be the double of that, so 2560x1920.

My CPU is Ryzen 9 7950X with PBO, to 5.1ghz max clock speed. I'm running my DRAM 3600mhz in double channel. DRAM speeds will be relevant later.

I've 2 different brands of ram sticks in my system so I'm just running these in standard JEDEC speeds to avoid instabilities ˙◠˙

Compiler Options

Compilers are pretty smart and optimizing ones are excellent at the task. We'll test our implementations on "Release" builds which implies -O3 optimization level and aggressive inlining. Let's also turn on "Whole Program Optimization" using cmake, so that more optimizations can be done at link time.

Our goal is not to beat the compiler (which is fool's errand anyway, as we'll see later), rather letting it do the best it can so that we can focus on the algorithm itself.

I'm using clang-cl, which is clang but compatible with msvc frontend, i.e. accepts msvc-style arguments. clang is great!

A Naive Implementation

Okay, lets start with a simple implementation. Nearest neighbour resize means that we take a pixel and duplicate it to four pixels; essentially doubling it at each axis. Given a pixel at source (x, y), we place it at target (2x, 2y), (2x +1, 2y), (2x, 2y+1), (2x +1, 2y +1);

static size_t CalculateIndex(size_t x, size_t y, uint32_t width)
{
    return x + (y * width);
}

for (size_t x = 0; x < sourceWidth; x++)
{
    for (size_t y = 0; y < sourceHeight; y++)
    {
        size_t index = CalculateIndex(x, y, sourceWidth);
        Pixel pix = sourceBuffer[index];

        size_t targetX_1 = x * 2;
        size_t targetX_2 = targetX_1 + 1;
        size_t targetY_1 = y * 2;
        size_t targetY_2 = targetY_1 + 1;

        size_t idx1 = CalculateIndex(targetX_1, targetY_1, targetWidth);
        size_t idx2 = CalculateIndex(targetX_2, targetY_1, targetWidth);
        size_t idx3 = CalculateIndex(targetX_1, targetY_2, targetWidth);
        size_t idx4 = CalculateIndex(targetX_2, targetY_2, targetWidth);

        targetBuffer[idx1] = pix;
        targetBuffer[idx2] = pix;
        targetBuffer[idx3] = pix;
        targetBuffer[idx4] = pix;
    }
}

In my system that block of code takes about 5.63ms mean. That's slow, real slow. If we target 16.6ms, which is 60fps, that's 3/1 of frametime just being spent to buffer copy. We have a long way to go.

A Simple Edit

Currently, we iterate over x first, then we iterate over y. Writing the for loop for x first then y tickles my brain and makes me happy but I'm not the one executing the code, the CPU does. Let's think about what that means.

That block of code is accessing pixels in columns. Meaning, it takes the first pixel of the first row, then it takes the first pixel of the second row etc. Then, it jumps into the second column and repeats. So where's the problem? Problem is how's the data is laid out in memory.

If we take a look at CalculateIndex() we can see that our data is laid out in rows across the memory. We store the first row, and then we store the second row etc. When we iterate over columns instead of rows, CPU jumps across a whole width (1920x4 bytes in our case) pixels of data each time.

If our buffers were somewhat morton z ordered, this would not be a problem though!

Modern CPU's have lots of clever tricks that makes them go blazingly fast. One of them is prefetching. Accessing DRAM is costly, so the hardware prefetcher analyzes our access patterns and tries to load up the data we need ahead of time into CPU cache, so that instead of waiting several hundred cycles for the data to arrive we can just fetch it from the fast cache. Hardware prefetcher is as smart as it gets but it still expects linear access most of the time. If we jump around, we introduce cache misses and instead of reading from the cache, we hit DRAM and wait for about 10 times more on average for data access.

Okay, that's a lot of words. So what's the fix?

We'll just swap the loop order, so that we loop over y first, then x. That'll make it so that we loop in rows, instead of columns;

for (size_t y = 0; y < sourceHeight; y++)
{
    for (size_t x = 0; x < sourceWidth; x++)
    {
    ...

With that, we are down to 1.24ms mean! Wait, what?!

A Deeper Look

With simple loop order edit, we end up with 5x performance boost. But can that be simply explained with cache hits? We can go a little bit deeper and take a look at the compiler generated asm code.

I've set a breakpoint at targetBuffer[idx1] = pix; in Visual Studio and opened up the disassembly window. Let's take a look at column (worst) and row (best) cases:

column-access:
     mov         r15d,dword ptr [rdi]  
     mov         dword ptr [rbx-4],r15d  
     mov         dword ptr [rbx],r15d  
     mov         dword ptr [rbx+r8*4-4],r15d  
     mov         dword ptr [rbx+r8*4],r15d  
     add         rbx,r9  
     add         rdi,r10  
     dec         r14 # decrement loop counter
     jne         Renderer::Present+0F0h (07FF7CDEE4A90h) # loop back to top

Column access asm code is straightforward. The code stores a pixel in 4 different places, then decrements the counter and repeats. Nothing weird about that.

 row-access:
     mov         esi,dword ptr [r12+rdx*4-4]  
     mov         dword ptr [r15+rdx*8-0Ch],esi  
     mov         dword ptr [r15+rdx*8-8],esi  
     mov         dword ptr [rbx+rdx*8-0Ch],esi  
     mov         dword ptr [rbx+rdx*8-8],esi  
     mov         esi,dword ptr [r12+rdx*4]  
     mov         dword ptr [r15+rdx*8-4],esi  
     mov         dword ptr [r15+rdx*8],esi  
     mov         dword ptr [rbx+rdx*8-4],esi  
     mov         dword ptr [rbx+rdx*8],esi  
     add         rdx,2  # jump over 2 pixels
     cmp         r14,rdx  
     jne         Renderer::Present+4B0h (07FF6B3D64E50h) # loop back to top

Row access asm code however is unrolled 2x, meaning it loops half as much and does 2 operations per loop. When we made our buffer access linear, compiler took that opportunity to unroll the loop to reduce bookkeeping overhead. While it's helping here, it seems that main benefit is still coming from cache hits and compiler is not doing something exotic behind our backs. Nice!

Helping the Hardware Further

So we are down to 1.24ms. Can we do better? I think we can. A simple observation about the resize we are doing that the target buffer's rows are duplicates. Meaning row[0] == row[1] and row[2] == row[3] and so on. So instead of copying the same pixels over and over and jumping to a row below while writing, let's just write to a single row, then copy that row to one below. We decompose the algorithm into two separate tasks, but the tasks become linear, so we might get a performance boost there.

Apparently this is called loop fission. Neat!

Simplest way to do that is memcpy'ng the row below;

for (size_t y = 0; y < sourceHeight; y++)
{
    size_t targetY_1 = y * 2;
    size_t targetY_2 = targetY_1 + 1;

    for (size_t x = 0; x < sourceWidth; x++)
    {
        size_t index = CalculateIndex(x, y, sourceWidth);
        Pixel pix = sourceBuffer[index];

        size_t targetX_1 = x * 2;
        size_t targetX_2 = targetX_1 + 1;

        size_t idx1 = CalculateIndex(targetX_1, targetY_1, targetWidth);
        size_t idx2 = CalculateIndex(targetX_2, targetY_1, targetWidth);

        targetBuffer[idx1] = pix;
        targetBuffer[idx2] = pix;
    }

    size_t from = CalculateIndex(0, targetY_1, targetWidth);
    size_t to = CalculateIndex(0, targetY_1 + 1, targetWidth);

    Pixel* src = targetBuffer + from;
    Pixel* dst = targetBuffer + to;

    std::memcpy((void*)(dst), (void*)(src), sizeof(Pixel) * targetWidth);
}

With that, we are down to 0.88ms mean!

Trying to Beat memcpy and Failing Miserably

memcpy routines are generally pretty optimized, but the compiler still emits call memcpy instruction. This is a function call, which means storing the stack frame and jumping to another routine each row. We can simply do memcpy inline here to eliminate those function calls. We can write a pretty straightforward buffer copy like so;

size_t from = CalculateIndex(0, targetY_1, targetWidth);
size_t to = CalculateIndex(0, targetY_1 + 1, targetWidth);

Pixel* src = targetBuffer + from;
Pixel* dst = targetBuffer + to;

for (size_t i = 0; i < targetWidth; i++)
{
    *dst = *src;

    dst++;
    src++;
}

And the result is... 0.88ms mean. Huh. And the compiler even vectorized the loop, which iterates over 128 bytes and the result is still the same;

manual-memcpy:
vmovups     ymm0,ymmword ptr [rsi+rbx]  
vmovups     ymm1,ymmword ptr [rsi+rbx+20h]  
vmovups     ymm2,ymmword ptr [rsi+rbx+40h]  
vmovups     ymm3,ymmword ptr [rsi+rbx+60h]  
vmovups     ymmword ptr [rdx+rbx-60h],ymm0  
vmovups     ymmword ptr [rdx+rbx-40h],ymm1  
vmovups     ymmword ptr [rdx+rbx-20h],ymm2  
vmovups     ymmword ptr [rdx+rbx],ymm3  
sub         rbx,0FFFFFFFFFFFFFF80h # advance over 128 bytes!
cmp         r15,rbx  
jne         Renderer::Present+3B0h (07FF6A18F4D50h)

ymm registers hold 256 bits of data and are part of the AVX instruction set. Here, the CPU is doing 4 256-bit loads and 4 256-bit stores. It loas 8 pixels at the same time, 4 times, and then stores 8 pixels at the same time, 4 times per loop. Even with that, we can't beat memcpy at its own game.

So let's change the game.

SIMD!

Taking a page out of compilers book, let's learn some siliconese (¬‿¬) and write SIMD to accelerate our loop even further.

SIMD stands for "single instruction, multiple data". Like it's name, it exposes some asm instructions to operate on multiple elements of data at the same time. Imagine adding a constant acceleration to an array of velocities; instead of loading and modifying one element, we can load 4 and modify in a single pass which improves performance greatly.

But we are not doing arithmetic here. We need to duplicate the channels of a pixels so that RGBA -> RRGGBBAA, so what can we do?

Fortunately each SIMD flavor (x86 has SSE and AVX while ARM has NEON) contains many operations and are not limited to arithmetic, since permuting the data is so common in signal and image processing. There's a high chance that there's an operation that we can use to double our pixel channels across multiple elements.

Before that, how can we even begin to write SIMD code anyway? There's two options; We can either write asm directly (ehh) or use intrinsics (yay!).

There's also a third option; use a portable SIMD library like xsimd or highway but where's the fun in that?

Fortunately, using intrinsics are straightforward as they are functions that compiler maps to asm directly. First, we'll start with AVX since I'm on Windows right now. We can write NEON code for Apple Silicon later.

While I tried to explain as much as I could, specifics of using SIMD is not the focus of this blog post. I'm also not an expert in this subject. I highly recommend this source as a starter though.

Let's start by going over the load instructions on Intel Intrinsics Guide. We'll Target AVX. We'll load 128-bits of data and duplicate it to 256-bits, then we'll write that to the target buffer.

The intrinsics that we need for that operations took some time (a full day lol) to find, but here they are;

_mm_loadu_si128: Given an unaligned (we'll touch upon this later) address, load 128 bits of data. Easy enough. We can load 4 pixels at a time with this.
_mm_unpacklo_epi32: This intrinsic takes 2 arguments, uses the lower 64-bit parts and interleaves them in 32-bit elements, then returns the resulting 128-bit value. Given abcd - efgh it returns aebf.
_mm_unpackhi_epi32: Same as it's sibling, it returns cgdh.
_mm256_set_m128: We use this to combine the two 128-bit values that we constructed from unpacking.
_mm256_storeu_si256: and finally, we store the resulting 256-bit value.

We use the same loaded data for the unpacklo and unpackhi. Meaning, given 4 pixels of 128-bit data [P0, P1, P2, P3] unpacklo returns [P0, P0, P1, P1] and unpackhi returns [P2, P2, P3, P3].

Let's also split the loop so that we iterate over rows linearly. Let's ignore the tail end of the loop and pretend that our width is a multiple of 4. With all that said, the code is;

#include <immintrin.h>
...
for (size_t y = 0; y < sourceHeight; y++)
{
    size_t targetY_1 = y * 2;
    size_t targetY_2 = targetY_1 + 1;

    size_t fromIndex = CalculateIndex(0, y, sourceWidth);
    size_t toIndex = CalculateIndex(0, targetY_1, targetWidth);

    constexpr int simdElements = 4;
    for (size_t x = 0; x < sourceWidth; x += simdElements)
    {
        __m128i* source = reinterpret_cast<__m128i*>(sourceBuffer + fromIndex);
        __m256i* target = reinterpret_cast<__m256i*>(targetBuffer + toIndex);

        __m128i pix = _mm_loadu_si128(source);
        __m128i p0p0p1p1 = _mm_unpacklo_epi32(pix, pix);
        __m128i p2p2p3p3 = _mm_unpackhi_epi32(pix, pix);
        __m256i interleaved = _mm256_set_m128(p2p2p3p3, p0p0p1p1);
        _mm256_storeu_si256(target, interleaved);

        fromIndex += simdElements;
        toIndex += simdElements * 2;
    }

    fromIndex = CalculateIndex(0, y, sourceWidth);
    toIndex = CalculateIndex(0, targetY_2, targetWidth);

    for (size_t x = 0; x < sourceWidth; x += simdElements)
    {
        __m128i* source = reinterpret_cast<__m128i*>(sourceBuffer + fromIndex);
        __m256i* targetBelow = reinterpret_cast<__m256i*>(targetBuffer + toIndex);

        __m128i pix = _mm_loadu_si128(source);
        __m128i p0p0p1p1 = _mm_unpacklo_epi32(pix, pix);
        __m128i p2p2p3p3 = _mm_unpackhi_epi32(pix, pix);
        __m256i interleaved = _mm256_set_m128(p2p2p3p3, p0p0p1p1);
        _mm256_storeu_si256(targetBelow, interleaved);

        fromIndex += simdElements;
        toIndex += simdElements * 2;
    }
}

And the result is... 0.88ms mean! Come ON!! (╯°□°)╯︵ ┻━┻

Limits of the Machine

It is clear that we are hitting some kind of hardware limit.

I've 1800mhz but dual channel (effective 3600mhz) DDR5 installed, so that means this rig should be able to transfer 3600 million 64-bit chunks. Since we are single threaded let's take the single channel case, so 1800 million chunks of 64-bits (more like 2x32-bits but I digress) per second it is.

A quick math later, we arrive at 28.8gb/sec throughput. If we were using dual channel, we would be saturating the DRAM bus at about 57.6gb/sec. Our target buffer is 2560x1920x4 bytes large, which equals to 18.75 mb's of data. We are currently transferring it in about 0.88ms, that means 20.8gb/sec. We are below single channel by about %30. While we can't really reach peak theoretical throughput, we can expect to reach about %90 of the way there.

We are missing 6gb/sec throughput, if we assume 90% real-world utilization at 26-27gb/sec. The missing %20 percent of our performance is being wasted somewhere and it is clear that it is not compute. The SIMD ops we are doing are trivial, as we are just moving bytes around.

Two things are still relevant here. First, our accesses are unaligned, which hurts performance but not much. However, the real bottleneck is our old friend, the CPU cache. Let's take a look.

Loading: Dropping the U

For SIMD, alignment matters. Generally, the data must be aligned to 16 bytes of boundaries so that CPU can issue loads efficiently. We can actually see this remark in the _mm_load_si128 intrinsic's description (notice the lack of u at the end of load).

But for store, we are going to do something else.

Storing: Saying Goodbye to Cache

We were designing our algorithm around the cache access up to this point. But now, it kinda gets in our way. When we write something, it first goes into cache, then it is committed to DRAM. We don't touch this data after we write it anyway, so we don't need cache writes. If we can bypass the cache, we might gain some throughput here.

Fortunately, smart people already thought about that and implemented non-temporal access, in our case, streaming writes.

I believe memcpy can't use non-temporal write optimization since user might want to use the copied buffer immediately. This is one little case that we might have upper hand against it, but I haven't verified it. In any case, it should be absurdly optimized anyway.

We are going to use _mm256_stream_si256, so we need 32-byte alignment, which implies 16-byte alignment too for our loads.

Alignment, But Not the D&D Kind

So, we need to do a couple of things. Our source and target buffers are needed to be aligned. We are using std::vector for automatic memory management, so we need to use a custom allocator for it to make it allocate aligned memory. I'm not going to roll my own, so let's just yoink one from this link and dump it in the project.

Lets change a few declarations;

static RGFW_surface* GetSurface(RGFW_monitor* monitor, int width, int height)
{
    static std::vector<uint32_t, AlignedAllocator<uint32_t, 32>> buffer{};
    ...

class Renderer
{
...
private:
    std::vector<Pixel, AlignedAllocator<Pixel, 32>> m_buffer{};
};

We need to ensure width is multiple of 8 (32 bytes) so that when we proceed to next row, the start of it is aligned too;

RGFW_window_getSize(window, &width, &height);
width += (8 - width % 8) % 8;

We hint the compiler that the pointers are aligned;

Pixel* sourceBuffer = std::assume_aligned<32>(m_buffer.data());
Pixel* targetBuffer = std::assume_aligned<32>(reinterpret_cast<Pixel*>(target));

Bringing It All Home

Finally, we change the loadu to load and storeu to stream variants. Let's not forget to put _mm_sfence at the end of the function, which is a requirement for using streaming writes;

for (size_t y = 0; y < sourceHeight; y++)
{
    size_t targetY_1 = y * 2;
    size_t targetY_2 = targetY_1 + 1;

    size_t fromIndex = CalculateIndex(0, y, sourceWidth);
    size_t toIndex = CalculateIndex(0, targetY_1, targetWidth);

    constexpr int simdElements = 4;
    for (size_t x = 0; x < sourceWidth; x += simdElements)
    {
        __m128i* source = reinterpret_cast<__m128i*>(sourceBuffer + fromIndex);
        __m256i* target = reinterpret_cast<__m256i*>(targetBuffer + toIndex);

        __m128i pix = _mm_load_si128(source);
        __m128i p0p0p1p1 = _mm_unpacklo_epi32(pix, pix);
        __m128i p2p2p3p3 = _mm_unpackhi_epi32(pix, pix);
        __m256i interleaved = _mm256_set_m128(p2p2p3p3, p0p0p1p1);
        _mm256_stream_si256(target, interleaved);

        fromIndex += simdElements;
        toIndex += simdElements * 2;
    }

    fromIndex = CalculateIndex(0, y, sourceWidth);
    toIndex = CalculateIndex(0, targetY_2, targetWidth);

    for (size_t x = 0; x < sourceWidth; x += simdElements)
    {
        __m128i* source = reinterpret_cast<__m128i*>(sourceBuffer + fromIndex);
        __m256i* targetBelow = reinterpret_cast<__m256i*>(targetBuffer + toIndex);

        __m128i pix = _mm_load_si128(source);
        __m128i p0p0p1p1 = _mm_unpacklo_epi32(pix, pix);
        __m128i p2p2p3p3 = _mm_unpackhi_epi32(pix, pix);
        __m256i interleaved = _mm256_set_m128(p2p2p3p3, p0p0p1p1);
        _mm256_stream_si256(targetBelow, interleaved);

        fromIndex += simdElements;
        toIndex += simdElements * 2;
    }
}

_mm_sfence(); // make sure streamed writes are visible after simd ops

And with that, we are down to 0.72ms mean! Bypassing the cache while writing netted us about %20 improvement. Right now, we are writing about 26gb/sec, about %90 for single channel throughput

Whew. What a ride. We learned to speak siliconese, the ancient language of the thinking rocks, to finally beat that pesky memcpy and the compiler.

Or did we? Let's take a look at the disassembly.

vmovaps     xmm1,xmmword ptr [rdx+rdi*4] # load 
vpermps     ymm1,ymm0,ymm1 # permute
vmovntps    ymmword ptr [rsi+rdi*8],ymm1 # store

That's odd. We used load, unpacklo, unpackhi, set, and stream. These should roughly map to;

vmovaps # load
punpckldq # unpack low
punpckhdq # unpack high
vinsertf128 # merge, but may be done in-place so optional
vmovntps # store

But we only see 3 instructions in the disassembly. Intrinsics should match the asm, but compiler recognized our access pattern and rewrote our loop body to load/permute/store triplet. It seems that doing 2 unpacks were inefficient after all and compiler knew better.

It was really a fool's errand to fight with the compiler. But we at least managed to give it a recognizable pattern to optimize upon, and that's a great consolation prize.

Going Wide

We went deep. To improve this even further, we need to go wide. And that means multiple cores!

We are going to keep this simple. Multithreading is its own beast, but there are ways to dip our toes in it. Specifically, we can use std::for_each with execution policies. This setup is great, we can just give it a lambda and say "run this in multiple threads please." Since our rows are independent each other and do not overlap, we can parallelize across rows.

One single caveat is that execution policies are not supported on MacOS, sadly. I want to test the threaded version on Apple Silicon but that's a problem for another time.

But first, we need to generate row indexes. That's easy, we can create a vector while resizing our buffer;

static std::vector<size_t> s_rowIterator{};
void Renderer::Resize(const uint32_t w, const uint32_t h)
{
...
    s_rowIterator.resize(height);
    for (size_t i = 0; i < height; i++)
    {
        s_rowIterator[i] = i;
    }
...
}

And then, we can replace the outer for loop with std::for_each;

    std::for_each(std::execution::par_unseq, 
	    s_rowIterator.begin(), s_rowIterator.end(), 
	    [sourceWidth, targetWidth, sourceBuffer, targetBuffer](size_t y) 
    {
        size_t targetY_1 = y * 2;
        size_t targetY_2 = targetY_1 + 1;

        size_t fromIndex = CalculateIndex(0, y, sourceWidth);
        size_t toIndex = CalculateIndex(0, targetY_1, targetWidth);

        constexpr int simdElements = 4;
        for (size_t x = 0; x < sourceWidth; x += simdElements)
        {
            __m128i* source = reinterpret_cast<__m128i*>(sourceBuffer + fromIndex);
            __m256i* target = reinterpret_cast<__m256i*>(targetBuffer + toIndex);

            __m128i pix = _mm_load_si128(source);
            __m128i p0p0p1p1 = _mm_unpacklo_epi32(pix, pix);
            __m128i p2p2p3p3 = _mm_unpackhi_epi32(pix, pix);
            __m256i interleaved = _mm256_set_m128(p2p2p3p3, p0p0p1p1);
            _mm256_stream_si256(target, interleaved);

            fromIndex += simdElements;
            toIndex += simdElements * 2;
        }

        fromIndex = CalculateIndex(0, y, sourceWidth);
        toIndex = CalculateIndex(0, targetY_2, targetWidth);

        for (size_t x = 0; x < sourceWidth; x += simdElements)
        {
            __m128i* source = reinterpret_cast<__m128i*>(sourceBuffer + fromIndex);
            __m256i* targetBelow = reinterpret_cast<__m256i*>(targetBuffer + toIndex);

            __m128i pix = _mm_load_si128(source);
            __m128i p0p0p1p1 = _mm_unpacklo_epi32(pix, pix);
            __m128i p2p2p3p3 = _mm_unpackhi_epi32(pix, pix);
            __m256i interleaved = _mm256_set_m128(p2p2p3p3, p0p0p1p1);
            _mm256_stream_si256(targetBelow, interleaved);

            fromIndex += simdElements;
            toIndex += simdElements * 2;
        }
    });

    // make sure streamed writes are visible after simd ops
    _mm_sfence();

std::execution::par_unseq here means that this block of code can be parallelized and vectorized. Since we already vectorized it, we won't benefit from it, but it doesn't hurt to hint the compiler that our row accesses are independent of each other.

And with that, we are down to 0.60ms, which is about 32gb/sec throughput. Above single channel, but still well below dual channel at 57gb/sec. To hit maximum theoretical bandwidth, we possibly need more bespoke threading, like dividing the rows into adjacent buckets, core pinning etc.

But still, we managed to come down to well below 1 ms on default screen size. On maximized window at 5120x2738, I get 1.55ms or so, about 53.5mb of data and a 35gb/sec throughput. So it seems that our work scales well with our buffer, which is nice!

Conclusion

I did not expect this post to be this long but it was worth it. A well defined problem makes a fun tinkering exercise. I wanted to hand-write some SIMD code for a while now, and this made perfect excuse for it. Great performance gains are bonus. Learned a lot along the way too.

I can't really comment on ARM architecture, but I went along and added NEON intrinsics. Just another dialect of siliconese after all ( ¬ᴗ¬). The availability of zip intrinsic makes that codepath more simplified, and the performance is great on a M1 Macbook Pro, well under 1ms on a maximized window.

I hope you too learned something valuable dear viewer and until next time.