
assembly - Reverse byte order in XMM or YMM register?

Let's say I want to reverse the byte order of a very large byte array. I can do this the slow way with the general-purpose registers, but I would like to speed it up using the XMM or YMM registers.

Is there a way to reverse the byte order in an XMM or YMM register?


1 Answer


Yes, use SSSE3 _mm_shuffle_epi8 or AVX2 _mm256_shuffle_epi8 to shuffle bytes within a 16-byte vector, or within each 16-byte lane of a YMM vector. Depending on the shuffle control vector, you can swap pairs of bytes, reverse 4-byte units, reverse 8-byte units, or reverse all 16 bytes.
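
For example, here is a minimal sketch (SSE intrinsics, assuming SSSE3; function names are mine) of two such control vectors. Note that _mm_set_epi8 takes its arguments from the highest element down to the lowest:

    #include <immintrin.h>

    // Reverse all 16 bytes: destination byte i takes source byte 15-i.
    __m128i reverse16(__m128i v) {
        const __m128i rev = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
        return _mm_shuffle_epi8(v, rev);
    }

    // Byte-swap each 4-byte element (endianness flip of uint32_t lanes).
    __m128i bswap32x4(__m128i v) {
        const __m128i ctrl = _mm_set_epi8(12,13,14,15, 8,9,10,11,
                                          4,5,6,7, 0,1,2,3);
        return _mm_shuffle_epi8(v, ctrl);
    }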

But vpshufb isn't lane-crossing: vpshufb ymm does two independent 16-byte shuffles within the two 128-bit lanes of the YMM vector, so you can't reverse all 32 bytes with one instruction until AVX512VBMI vpermb.

So if you're byte-reversing an entire array, rather than the endianness / byte-order of individual elements in an array, you have 3 options:

  • Stick to 128-bit vectors (simple and portable, and probably not slower on current CPUs), which also only needs 16-byte alignment for best performance.
  • Load with vmovdqu / vinserti128, then vpshufb, then a 32-byte store. (Or do 32-byte loads and split 16-byte stores, but that's probably not as good.) Vectorize random init and print for BigInt with decimal digit array, with AVX2? includes a cache-blocked byte-array reverse into a tmp buffer to feed fwrite in 8kiB chunks.
  • Use vpermq to lane-swap before or after vpshufb (see the sketch after this list). That's not great on AMD, and bottlenecks on the 1-per-clock shuffle throughput of current Intel, but it's potentially very good on Ice Lake (2 shuffle ports).
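
A sketch of that third option with intrinsics (the function name is mine; assumes AVX2): vpshufb reverses within each 128-bit lane, then vpermq swaps the two lanes:

    #include <immintrin.h>

    // Reverse all 32 bytes of a YMM vector.
    __m256i reverse32(__m256i v) {
        const __m256i rev_in_lane = _mm256_set_epi8(
            0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,    // control for high lane
            0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);   // control for low lane
        v = _mm256_shuffle_epi8(v, rev_in_lane);       // vpshufb: reverse within each lane
        return _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));  // vpermq: swap lanes
    }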

vpshufb ymm is a single-uop instruction on Intel (2 uops on AMD) and processes 32 bytes of data at once.

For very large inputs, it's probably worth reaching a 32- or 64-byte alignment boundary before your vectorized loop, so none of the loads/stores cross cache-line boundaries. (For small inputs the minor benefit isn't worth the extra prologue/epilogue code and branching.)
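
A hypothetical prologue sketch for an in-place per-uint32_t swap (the pointer arithmetic is the point; assumes the buffer is at least 4-byte aligned):

    #include <stdint.h>
    #include <stddef.h>

    // Scalar-swap elements until p hits a 32-byte boundary; the main
    // vector loop can then use aligned 32-byte loads/stores at p + head.
    static size_t swap_head_to_32B(uint32_t *p, size_t n) {
        size_t head = (-(uintptr_t)p & 31) / 4;   // elements to the boundary
        if (head > n) head = n;
        for (size_t i = 0; i < head; i++)
            p[i] = __builtin_bswap32(p[i]);       // GCC/clang builtin
        return head;
    }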


But potentially even better is to only swap a 16kiB chunk before you use it, so it's still hot in L1d cache when the next step reads it. This is called cache blocking. Or maybe use 128kiB chunks to block for L2 cache size.

You might swap in chunks as you read the data from a file, e.g. do read() system calls in chunks of 64k or 128k and swap the result while it's still hot in cache, right after the kernel copied the data from the pagecache into your user-space buffer. Or use mmap to memory-map the file and run a copy-and-swap loop over it. (Or swap in place in a private mapping, but that triggers copy-on-write anyway so there's not much benefit. And file-backed mmap on Linux can't use transparent hugepages.)
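
A sketch of the read()-in-chunks idea (chunk size and function name are illustrative; tail/partial-read element handling is omitted for brevity):

    #include <unistd.h>
    #include <immintrin.h>
    #include <stddef.h>

    enum { CHUNK = 64 * 1024 };   // illustrative chunk size

    // Read up to 'total' bytes into buf, byte-swapping each uint32_t of a
    // chunk while it's still hot in cache.
    ssize_t read_and_bswap32(int fd, unsigned char *buf, size_t total) {
        const __m128i ctrl = _mm_set_epi8(12,13,14,15, 8,9,10,11,
                                          4,5,6,7, 0,1,2,3);
        size_t done = 0;
        while (done < total) {
            size_t want = total - done < CHUNK ? total - done : (size_t)CHUNK;
            ssize_t got = read(fd, buf + done, want);
            if (got <= 0) break;                  // EOF or error
            for (ssize_t i = 0; i + 16 <= got; i += 16) {
                __m128i v = _mm_loadu_si128((const __m128i *)(buf + done + i));
                _mm_storeu_si128((__m128i *)(buf + done + i),
                                 _mm_shuffle_epi8(v, ctrl));
            }
            done += (size_t)got;
        }
        return (ssize_t)done;
    }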

Another option is to simply swap on the fly if you only read the data a couple of times: if the later passes are still memory-bound, or have room for a shuffle uop without bottlenecking, shuffling on the fly probably won't slow them down.
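
For instance, a sketch of folding the swap into a consuming pass (summing big-endian uint32_t values; assumes SSSE3 plus SSE4.1 for the extracts, and the function name is mine):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    uint32_t sum_be32(const uint32_t *p, size_t n) {
        const __m128i ctrl = _mm_set_epi8(12,13,14,15, 8,9,10,11,
                                          4,5,6,7, 0,1,2,3);
        __m128i acc = _mm_setzero_si128();
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
            acc = _mm_add_epi32(acc, _mm_shuffle_epi8(v, ctrl));  // swap on the fly
        }
        uint32_t sum = (uint32_t)_mm_extract_epi32(acc, 0)        // horizontal sum
                     + (uint32_t)_mm_extract_epi32(acc, 1)
                     + (uint32_t)_mm_extract_epi32(acc, 2)
                     + (uint32_t)_mm_extract_epi32(acc, 3);
        for (; i < n; i++)
            sum += __builtin_bswap32(p[i]);                       // scalar tail
        return sum;
    }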

A pass that touches all your data and only byte-swaps it has very poor computational intensity; you want to do more with your data while it's in registers, or at least while it's hot in cache. But if you byte-swap once and then read the data many times, or in a random access pattern, or from another language like Python or JavaScript that can't efficiently swap on the fly, then sure, do a swap pass.

Or a swap pass is helpful if you will make multiple passes over it that aren't memory-bound, and an extra shuffle would slow down each later pass. In that case you do want to cache-block the swapping so the later pass's input is hot in cache.


The scalar option, bswap, is limited to at best 8 bytes per clock cycle, and every 8 bytes needs a separate load and store instruction. (movbe to load from memory with byte-swapping saves an instruction, but on mainstream CPUs doesn't micro-fuse into a single load+swap uop. On Silvermont it's single-uop, though.) Also, Intel bswap r64 is 2 uops, so it's not great.
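
For comparison, a scalar baseline might look like this (the builtin compiles to bswap, or movbe where the compiler finds it profitable):

    #include <stdint.h>
    #include <stddef.h>

    // One 8-byte swap per iteration, each needing its own load and store.
    void bswap64_array(uint64_t *a, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = __builtin_bswap64(a[i]);
    }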

This might saturate single-threaded memory bandwidth on modern CPUs with some loop unrolling, but SIMD, with fewer total uops to process the same data, lets out-of-order execution "see" farther ahead and start processing TLB misses for upcoming pages sooner, for example. HW data prefetch and TLB prefetch do help a lot, but wider loads/stores are typically still at least slightly better for memcpy-like loops.

(vpshufb is cheap enough that this will still basically perform like memcpy. Or better if rewriting in place.)

And of course if you ever have any cache hits, even just L3 cache, SIMD will really shine.

