How I learned to stop worrying and love byte ordering
In the last post I announced musli-zerocopy
. In this one I’ll discuss one
of the concerns you’ll need to mitigate to build and distribute archives:
endianness.
There’s broadly two distinct ways that we organize numerical types in memory. We either store the least significant byte first in memory, known as “little endian”, or we store them last calling this “big endian”. This is known as the endianness or byte order of a value in memory.
0x42 0x00 0x00 0x00 // 0x42u32 little endian
0x00 0x00 0x00 0x42 // 0x42u32 big endian
Your system has a native byte order. This is the way it expects data to be organized in memory, so when you’re accessing a value like:
let value = 42u32;
dbg!(value);
The value 42
above is stored on the stack using one of the above layouts. Most
likely little endian.
The first version of musli-zerocopy
ignores byte ordering. You’d be using the
byte order that is native to the platform you’re on. This is not much of a
problem for simple structures, since primitive fields in simple structs like
u32
natively can represent data in any byte order. Rust’s std
library comes
with primitives for handling conversion, which in byte-order parlance is known
as “swapping”.
#[repr(C)]
struct Archive {
field: u32,
field2: u64,
}
let archive = Archive {
field: 42u32.to_be(),
field2: 42u64.to_le(),
}
A struct living in memory which quite happily uses a mixed byte order across fields.
Now if you don’t know the byte order of the fields in the archive. Reading it would be surprising:
dbg!(archive.field); // 704643072
dbg!(archive.field2); // 42
We get this value (0x2a000000
, since 42
is 0x2a
) because of the mismatch
between the native byte order, and the byte order of the the field. To read it
correctly, we need to swap the bytes back. All of these achieve that result on a
little-endian system:
dbg!(u32::from_be(archive.field)); // 42
dbg!(archive.field.swap_bytes()); // 42
dbg!(archive.field.to_be()); // 42
It’s worth pondering why
u32::to_be
works just as well. There’s no inherent byte order in the value being swapped. All a method liketo_be
does is convert to a no-op in case the native byte order is already big-endian. Otherwise it performs the same swapping operation.
I hope this makes it apparent why a byte-order naive application can’t process archives correctly unless the byte order used in the archive matches the native byte order. In order to interpret the values correctly we somehow need extra information and careful processing.
The dictionary I built this library for demands that archives are distributed.
So one way or another it has to be portable with byte ordering in mind. In this
post I’ll detail high level strategies to deal with byte ordering, and introduce
the tools I’ve added to musli-zerocopy
to make it - if not easy - manageable
to work with.
Overview:
- Why even bother?
- Option #1: Don’t bother
- Option #2: Care a lot
- Option #3: Why not both?
- Option #4: Squint hard enough and everything looks like an optimization
- Detour: A missed enum optimization
Why even bother?
Let me make a prediction: Your system uses little endian. This is a good
prediction, because little endian has essentially won. One reason it’s a great
choice is because converting between wide to narrow types is easy because their
arrangement overlaps. To go from a u32
to a u16
you simply just read the
first two bytes of the u32
easy peasy.
0x42 0x00 0x00 0x00 // 0x42u32
0x42 0x00 ---- ---- // 0x42u16
None of the platforms I intend to distribute my dictionary too are big-endian. Yet it’s still important to make sure that it works:
- You don’t want to end up relying on abstractions or designs which end up being a dead end or at worst cause bit rot down the line. Just in case.
- Some formats impose arbitrary ordering demands. I’d like to be able to support these.
- Big-endian for reasons ended up being the de-facto byte order used in network protocols.
But why is it good for values to be natively byte ordered?
Rust can generate significantly improved code when things align well with the
current platform. The following is a difference between two functions - one
which loads an [u32; 8]
array using u32
’s in its native byte order
(effectively copying it), and the other converted using a non-native byte order:
use musli_zerocopy::endian;
pub fn convert_le(values: [u32; 8]) -> [u32; 8] {
endian::from_le(values)
}
pub fn convert_be(values: [u32; 8]) -> [u32; 8] {
endian::from_be(values)
}
The two functions result in the following difference in assembly:
mov rax, rdi
- movups xmm0, xmmword, ptr, [rsi]
- movups xmm1, xmmword, ptr, [rsi, +, 16]
- movups xmmword, ptr, [rdi, +, 16], xmm1
- movups xmmword, ptr, [rdi], xmm0
+ movdqu xmm0, xmmword, ptr, [rsi]
+ movdqu xmm1, xmmword, ptr, [rsi, +, 16]
+ pxor xmm2, xmm2
+ movdqa xmm3, xmm0
+ punpckhbw xmm3, xmm2
+ pshuflw xmm3, xmm3, 27
+ pshufhw xmm3, xmm3, 27
+ punpcklbw xmm0, xmm2
+ pshuflw xmm0, xmm0, 27
+ pshufhw xmm0, xmm0, 27
+ packuswb xmm0, xmm3
+ movdqa xmm3, xmm1
+ punpckhbw xmm3, xmm2
+ pshuflw xmm3, xmm3, 27
+ pshufhw xmm3, xmm3, 27
+ punpcklbw xmm1, xmm2
+ pshuflw xmm1, xmm1, 27
+ pshufhw xmm1, xmm1, 27
+ packuswb xmm1, xmm3
+ movdqu xmmword, ptr, [rdi, +, 16], xmm1
+ movdqu xmmword, ptr, [rdi], xmm0
ret
Despite some neat vectorization, there’s still a significant difference in code size and execution complexity. Work is work, and work that can be avoided means faster applications that use less power.
So how do we accomplish this? In the next four sections I’ll cover some of the available options.
Option #1: Don’t bother
The documentation in musli-zerocopy
(Reading data) mentions that there are some pieces of
data which are needed to transmit an archive from one system to another, we call
it its DNA (borrowed from Blender), these are:
- The alignment of the buffer.
- The byte order of the system which produced the buffer.
- If pointer-sized types are used, the size of a pointer.
- The structure of the type being read.
One option is to include a portable header which indicates its format and if
there is a mismatch we refuse to read the archive and bail!
with an error.
We want to avoid the possibility of the application “accidentally” working since
this might cause unexpected behaviors. Unexpected behaviors at best causes an
error later down the line, at worst a logic bug leading to a security issue.
If security is an application concern, additional safeguards should be put into place, such as integrity checking. But this is not unique to zero-copy serialization.
use anyhow::bail;
use musli_zerocopy::ZeroCopy;
use musli_zerocopy::endian;
/// A marker type for the archive indicating its endianness.
///
/// Since this is a single byte, it is already portable.
#[derive(ZeroCopy)]
#[repr(u8)]
enum Endianness {
Big,
Little,
}
/// The header of the archive.
#[derive(ZeroCopy)]
#[repr(C)]
struct Header {
endianness: Endianness,
pointer_size: u32,
alignment: u32,
}
let buf: &Buf = /* .. */;
let header = buf.load_at::<Header>(0)?;
if header.endianness != endian::pick!(Endianness::Big, Endianness::Little) {
bail!("Archive has the wrong endianness");
}
The pick!
macro is a helper provided by musli-zerocopy
. It will evaluate
to the first argument on big endian systems, and the second on little endian
systems. We’ll make more use of this below.
This is a nice alternative in case your application can download a platform
suitable archive directly. Its DNA can then be represented as a target string,
like dictionary-jp-64-be.bin
.
Option #2: Care a lot
This option is a bit more intrusive, but allows for an archive to be used even if there is a DNA mismatch. We can interact with the archive, but systems with a mismatching byte order will simply have to take the performance hit.
In musli-zerocopy
this is done by explicitly setting endianness and using
wrapping primitives like Endian<T, E>
:
use musli_zerocopy::{Ref, Endian};
use musli_zerocopy::endian::Little;
#[derive(ZeroCopy)]
#[repr(C)]
struct Data
where
E: ByteOrder,
{
name: Ref<str, Little>,
age: Endian<u32, Little>,
}
Reading the field can only be done through accessors such as Endian::to_ne
.
Calling the method will perform the conversion in-place.
use musli_zerocopy::Buf;
let buf = &Buf = /* .. */;
let data: &Data = buf.load_at(0)?;
assert_eq!(data.age.to_ne(), 35);
It isn’t strictly necessary to make use of endian-aware primitives in
musli-zerocopy
, since most numerical types can inhabit all reflexive bit
patterns. It could simply be declared like this and force the bit pattern using
one of the provided conversion methods such as endian::from_le
.
use musli_zerocopy::{Buf, Ref, Endian};
use musli_zerocopy::endian;
#[derive(Clone, Copy, ZeroCopy)]
#[repr(C)]
struct Data
where
E: ByteOrder,
{
name: Ref<str>,
age: u32,
}
let buf = &Buf = /* .. */;
let data: &Data = buf.load_at(0)?;
assert_eq!(endian::from_le(data.age), 35);
Option #3: Why not both?
Maybe the most interesting alternative of them all. Store the archive in multiple orderings and read the one which matches your platform.
🤯
In musli we can do this by storing two references in the header, and make use of
the pick!
macro to decide which one to use. Due to the default value of
Data<E = Native>
the correct byte order is used intrinsically after it’s been
identified since Native
is simply an alias to the system ordering.
use musli_zerocopy::{ByteOrder, Endian, Ref, ZeroCopy};
use musli_zerocopy::endian::{Big, Little, Native};
#[derive(ZeroCopy)]
#[repr(C)]
struct Header {
big: Ref<Data<Big>, Big>,
little: Ref<Data<Little>, Little>,
}
#[derive(ZeroCopy)]
#[repr(C)]
struct Data<E = Native>
where
E: ByteOrder,
{
name: Ref<str, E>,
age: Endian<u32, E>,
}
let buf = &Buf = /* .. */;
let data: &Header = buf.load_at(0)?;
let data: Ref<Data> = endian::pick!(header.big, header.little);
println!("Name: {}", buf.load(data.name)?);
println!("Age: {}", *data.age);
On the up, we get to interact with natively ordered data. On the down, we have to store two versions of every byte-order sensitive data structure in the archive. I took great care to ensure that this can be done generically in Müsli so you at least only need to write the code which populates the archive once.
use musli_zerocopy::{ByteOrder, Endian, OwnedBuf};
fn build_archive<E: ByteOrder>(buf: &mut OwnedBuf) -> Data<E> {
let name = buf.store_unsized("John Doe");
Data {
name,
age: Endian::new(35),
}
}
let big = build_archive(&mut buf);
let little = build_archive(&mut buf);
let header = Header {
big,
little,
};
The keen eyed might note that this is a very typical time-memory trade off. We provide the perfect representation for every platform at the cost of increasing the space used.
In theory we could provide one per pointer-size as well, but the number of permutations start to become … unwieldy.
This is a similar approach to multi-architecture binaries which are supported on Apple.
Option #4: Squint hard enough and everything looks like an optimization
This strategy involves converting a stable well-defined archive into an optimized zero-copy archive that can be loaded on subsequent invocations of your applications. It does mean “taking the hit” if the local archive needs to be constructed, but once it has been you’ve got the best of all worlds.
This is how we’ll go about it:
- Use one of Müsli’s portable formats to store the original archive.
- First time the archive is opened, convert it into a zero-copy archive with a native optimized structure and store a local copy.
- Load the zero-copy archive.
This is the final use-case I intend to improve support for in
musli-zerocopy
. But even without explicit support it is rather straight
forward to implement.
use anyhow::Context;
use std::fs;
use std::path::Path;
use musli::Decode;
use musli_zerocopy::{Ref, Buf, ZeroCopy};
use musli_zerocopy::swiss;
#[derive(Decode)]
struct Archive {
map: HashMap<String, u32>,
}
#[derive(ZeroCopy)]
#[repr(C)]
struct ArchiveNative {
map: swiss::MapRef<Ref<str>, u32>,
}
let archive = Path::new("archive.bin");
let archive_native = Path::new("archive-native.bin");
if !archive_native.is_file() || is_outdated(archive, archive_native) {
tracing::trace!("Missing or outdated {}, building {}", archive.display(), archive_native.display());
let archive = fs::read(archive).context("reading archive")?;
let archive: Archive = musli_storage::from_slice(&archive).context("decoding archive")?;
write_native_archive(&archive, archive_native).context("building native archive")?;
}
let buf: &Buf = Buf::new(mmap_archive(archive_native));
One might be tempted to call this the best of all worlds, as long as you’re OK with spending the disk space for it.
This can even be combined with Option #2, but we only bother converting and re-saving the archive if it is the wrong byte order.
Detour: A missed enum optimization
The goal with these APIs are to ensure that any endian-aware operations impose zero cost if the demanded byte order matches the current one. We have all the necessary tools to muck about with the bytes of a type. Any native-to-native swapping be they generic or not should result in a simple move, like this:
use musli_zerocopy::{endian, ZeroCopy};
#[derive(ZeroCopy)]
#[repr(C)]
pub struct Struct {
bits32: u32,
bits64: u64,
bits128: u128,
array: [u8; 16],
}
#[inline(never)]
pub fn ensure_struct_le(st: Struct) -> Struct {
st.swap_bytes::<endian::Little>()
}
#[inline(never)]
pub fn ensure_struct_be(st: Struct) -> Struct {
st.swap_bytes::<endian::Big>()
}
The generated assembly for each function follows:
ensure_struct_le:
mov rax, rdi
movups xmm0, xmmword, ptr, [rsi]
movups xmm1, xmmword, ptr, [rsi, +, 16]
movups xmm2, xmmword, ptr, [rsi, +, 32]
movups xmmword, ptr, [rdi, +, 32], xmm2
movups xmmword, ptr, [rdi, +, 16], xmm1
movups xmmword, ptr, [rdi], xmm0
ret
In little-endian swapping a struct amounts to copying it. This is in fact the same code that we’d generate if we just passed the struct through.
ensure_struct_be:
mov rax, rdi
mov ecx, dword, ptr, [rsi]
mov rdx, qword, ptr, [rsi, +, 8]
mov rdi, qword, ptr, [rsi, +, 24]
mov r8, qword, ptr, [rsi, +, 16]
bswap r8
bswap rdi
bswap rdx
movups xmm0, xmmword, ptr, [rsi, +, 32]
bswap ecx
mov dword, ptr, [rax], ecx
mov qword, ptr, [rax, +, 8], rdx
mov qword, ptr, [rax, +, 16], rdi
mov qword, ptr, [rax, +, 24], r8
movups xmmword, ptr, [rax, +, 32], xmm0
ret
On a big-endian systems we have to do the work of swapping each field while copying it.
Overall this process has proven to be quite smooth. With the exception for one instance. Consider the following function:
pub enum Enum {
Empty,
One(u32),
Two(u32, u64),
}
pub fn move_enum(en: Enum) -> Enum {
match en {
Enum::Empty => Enum::Empty,
Enum::One(first) => Enum::One(first),
Enum::Two(first, second) => Enum::Two(first, second),
}
}
This should be possible to translate to a simple move, where the bits of the enum are copied. Essentially what we expect to see is this:
move_enum:
mov rax, rdi
movups xmm0, xmmword, ptr, [rsi]
movups xmmword, ptr, [rdi], xmm0
ret
Rather worryingly what we get instead is this:
move_enum:
mov rax, rdi
mov ecx, dword, ptr, [rsi]
test rcx, rcx
je .LBB0_1
cmp ecx, 1
jne .LBB0_4
mov ecx, dword, ptr, [rsi, +, 4]
mov dword, ptr, [rax, +, 4], ecx
mov ecx, 1
mov dword, ptr, [rax], ecx
ret
.LBB0_1:
xor ecx, ecx
mov dword, ptr, [rax], ecx
ret
.LBB0_4:
mov ecx, dword, ptr, [rsi, +, 4]
mov rdx, qword, ptr, [rsi, +, 8]
mov dword, ptr, [rax, +, 4], ecx
mov qword, ptr, [rax, +, 8], rdx
mov ecx, 2
mov dword, ptr, [rax], ecx
ret
Rust decides to check the discriminant and generate a branch for each unique variant. It then proceeds to carefully just copy the data for that variant.
This is a unfortunate for us since we use this pattern when byte swapping enums. If we did nothing, it would mean that native-to-native swapping wouldn’t be a no-op:
use musli_zerocopy::{ByteOrder, ZeroCopy};
pub enum Enum {
Empty,
One(u32),
Two(u32, u64),
}
impl ZeroCopy for Enum {
fn swap_bytes<E: ByteOrder>(en: Enum) -> Enum {
match en {
Enum::Empty => Enum::Empty,
Enum::One(first) => Enum::One(first.swap_bytes::<E>()),
Enum::Two(first, second) => Enum::Two(first.swap_bytes::<E>(), second.swap_bytes::<E>()),
}
}
}
Why this happens is a rather unfortunate case of a missed optimization. It never got merged due to fears of a potential performance regression and wasn’t picked up. Hopefully these types of MIR-based optimizations can be revived again, because without them we still might worry about reducing “LLVM bloat” to improve compile times over writing idiomatic Rust.
// Avoid `Option::map` because it bloats LLVM IR.
match self.remove_entry(k) {
Some((_, v)) => Some(v),
None => None,
}
Example form
rust-lang/hashbrown
.
Once the issue was identified the fix was simple enough. I’ve introduced a
method on ByteOrder
which optionally maps a value in case the byte order
differs from Native
ensuring that native-to-native swapping doesn’t do the
extra work dealing with each unique variant individually.
Conclusion
Congratulations on reading this far. I hope you’ve enjoying my explorations into the zero-copy space.
If there’s anything you want to ask or comment on, feel free to post below or on Reddit. Thank you!