In the last post I announced musli-zerocopy. In this one I’ll discuss one
of the concerns you’ll need to mitigate to build and distribute archives:
endianness.
There are broadly two distinct ways that we organize numerical types in memory. We either store the least significant byte first, known as “little endian”, or we store it last, known as “big endian”. This is called the endianness or byte order of a value in memory.
0x42 0x00 0x00 0x00 // 0x42u32 little endian
0x00 0x00 0x00 0x42 // 0x42u32 big endian
Your system has a native byte order. This is the way it expects data to be
organized in memory, so when you’re accessing a value like:
let value = 42u32;
dbg!(value);
The value 42 above is stored on the stack using one of the above layouts. Most
likely little endian.
The first version of musli-zerocopy ignores byte ordering. You’d be using the byte order that is native to the platform you’re on. This is not much of a problem for simple structures, since primitive fields like u32 can natively represent data in any byte order. Rust’s std library comes with primitives for handling the conversion, which in byte-order parlance is known as “swapping”.
We get this value (0x2a000000, since 42 is 0x2a) because of the mismatch between the native byte order and the byte order of the field. To read it correctly, we need to swap the bytes back. All of these achieve that result on a little-endian system:
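For example, assuming value currently holds the mismatched 0x2a000000:

let a = u32::from_be(value); // interpret the bits as big-endian
let b = value.swap_bytes();  // unconditionally swap the bytes
let c = value.to_be();       // the same swap, by symmetry

// On a little-endian system, all three recover 42.
assert_eq!(a, 42);
assert_eq!(b, 42);
assert_eq!(c, 42);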
It’s worth pondering why u32::to_be works just as well. There’s no
inherent byte order in the value being swapped. All a method like to_be
does is convert to a no-op in case the native byte order is already
big-endian. Otherwise it performs the same swapping operation.
I hope this makes it apparent why a byte-order naive application can’t process
archives correctly unless the byte order used in the archive matches the native
byte order. In order to interpret the values correctly we somehow need extra
information and careful processing.
The dictionary I built this library for demands that archives are distributed.
So one way or another it has to be portable with byte ordering in mind. In this
post I’ll detail high level strategies to deal with byte ordering, and introduce
the tools I’ve added to musli-zerocopy to make it - if not easy - manageable
to work with.
Let me make a prediction: your system uses little endian. This is a good prediction, because little endian has essentially won. One reason it’s a great choice is that converting from wide to narrow types is easy, since their arrangements overlap. To go from a u32 to a u16 you simply read the first two bytes of the u32. Easy peasy.
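A quick sketch of that overlap:

let wide = 0x1234u32;
let bytes = wide.to_le_bytes(); // [0x34, 0x12, 0x00, 0x00]

// The first two little-endian bytes are exactly the little-endian
// encoding of the equivalent u16.
let narrow = u16::from_le_bytes([bytes[0], bytes[1]]);
assert_eq!(narrow, 0x1234u16);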
None of the platforms I intend to distribute my dictionary to are big-endian. Yet it’s still important to make sure that it works:
You don’t want to rely on abstractions or designs which turn out to be a dead end, or at worst cause bit rot down the line. Just in case.
Some formats impose arbitrary ordering demands. I’d like to be able to support
these.
Big-endian, for historical reasons, ended up being the de-facto byte order used in network protocols.
But why is it good for values to be natively byte ordered?
Rust can generate significantly improved code when things align well with the
current platform. The following is a difference between two functions - one
which loads an [u32; 8] array using u32’s in its native byte order
(effectively copying it), and the other converted using a non-native byte order:
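The comparison itself is compiler output; the two functions being compared look roughly like this sketch, assuming a little-endian target:

// Native byte order: the conversion is a no-op, so this is effectively a copy.
pub fn convert_native(values: [u32; 8]) -> [u32; 8] {
    values.map(u32::from_le)
}

// Non-native byte order: every element has to be byte-swapped.
pub fn convert_swapped(values: [u32; 8]) -> [u32; 8] {
    values.map(u32::from_be)
}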
Despite some neat vectorization, there’s still a significant difference in code
size and execution complexity. Work is work, and work that can be avoided means
faster applications that use less power.
So how do we accomplish this? In the next four sections I’ll cover some of the
available options.
Option #1: Don’t bother
The documentation in musli-zerocopy (Reading data) mentions that there are some pieces of data which are needed to transmit an archive from one system to another. We call this its DNA (a term borrowed from Blender). These are:
The alignment of the buffer.
The byte order of the system which produced the buffer.
If pointer-sized types are used, the size of a pointer.
The structure of the type being read.
One option is to include a portable header which indicates the archive’s format, and if there is a mismatch we refuse to read the archive and bail! with an error.
We want to avoid the possibility of the application “accidentally” working, since this might cause unexpected behaviors. Unexpected behaviors at best cause an error later down the line, at worst a logic bug leading to a security issue.
If security is an application concern, additional safeguards should be put
into place, such as integrity checking. But this is not unique to zero-copy
serialization.
use anyhow::bail;
use musli_zerocopy::{Buf, ZeroCopy};
use musli_zerocopy::endian;

/// A marker type for the archive indicating its endianness.
///
/// Since this is a single byte, it is already portable.
#[derive(ZeroCopy)]
#[repr(u8)]
enum Endianness {
    Big,
    Little,
}

/// The header of the archive.
#[derive(ZeroCopy)]
#[repr(C)]
struct Header {
    endianness: Endianness,
    pointer_size: u32,
    alignment: u32,
}

let buf: &Buf = /* .. */;

let header = buf.load_at::<Header>(0)?;

if header.endianness != endian::pick!(Endianness::Big, Endianness::Little) {
    bail!("Archive has the wrong endianness");
}
The pick! macro is a helper provided by musli-zerocopy. It will evaluate
to the first argument on big endian systems, and the second on little endian
systems. We’ll make more use of this below.
This is a nice alternative in case your application can download a platform
suitable archive directly. Its DNA can then be represented as a target string,
like dictionary-jp-64-be.bin.
Option #2: Care a lot
This option is a bit more intrusive, but allows for an archive to be used even
if there is a DNA mismatch. We can interact with the archive, but systems with a
mismatching byte order will simply have to take the performance hit.
In musli-zerocopy this is done by explicitly setting endianness and using
wrapping primitives like Endian<T, E>:
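A sketch of how that looks; the field set here is illustrative:

use musli_zerocopy::ZeroCopy;
use musli_zerocopy::endian::{Big, Endian};

#[derive(ZeroCopy)]
#[repr(C)]
struct Data {
    // Always stored as big-endian, regardless of platform; converted to the
    // native representation on access.
    field: Endian<u32, Big>,
}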
It isn’t strictly necessary to make use of endian-aware primitives in
musli-zerocopy, since most numerical types can inhabit all reflexive bit
patterns. It could simply be declared like this, forcing the bit pattern using one of the provided conversion methods such as endian::from_le:
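A sketch of that alternative:

use musli_zerocopy::{endian, ZeroCopy};

#[derive(ZeroCopy)]
#[repr(C)]
struct Data {
    // A plain u32 can inhabit any bit pattern, so the field itself is
    // byte-order agnostic.
    field: u32,
}

// Force the little-endian interpretation on access.
let field = endian::from_le(data.field);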
Option #3: Store both byte orders
Maybe the most interesting alternative of them all: store the archive in multiple byte orders, and read the one which matches your platform.
🤯
In musli we can do this by storing two references in the header, and making use of the pick! macro to decide which one to use. Due to the default value of Data<E = Native>, the correct byte order is used intrinsically once it’s been identified, since Native is simply an alias for the system ordering.
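A sketch of the layout, with Data<E> standing in for some byte-order sensitive structure:

use musli_zerocopy::endian::{self, Big, Little, Native};
use musli_zerocopy::{Ref, ZeroCopy};

#[derive(ZeroCopy)]
#[repr(C)]
struct Header {
    big: Ref<Data<Big>>,
    little: Ref<Data<Little>>,
}

// On a big-endian system this picks `big`, on a little-endian system
// `little`. Either way, the chosen reference matches `Data<Native>`.
let data: Ref<Data<Native>> = endian::pick!(header.big, header.little);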
On the up, we get to interact with natively ordered data. On the down, we have
to store two versions of every byte-order sensitive data structure in the
archive. I took great care to ensure that this can be done generically in Müsli
so you at least only need to write the code which populates the archive once.
The keen-eyed might note that this is a very typical time-memory trade-off. We
provide the perfect representation for every platform at the cost of increasing
the space used.
In theory we could provide one per pointer size as well, but the number of permutations starts to become … unwieldy.
Option #4: Squint hard enough and everything looks like an optimization
This strategy involves converting a stable well-defined archive into an
optimized zero-copy archive that can be loaded on subsequent invocations of your
applications. It does mean “taking the hit” if the local archive needs to be
constructed, but once it has been you’ve got the best of all worlds.
use anyhow::Context;
use std::collections::HashMap;
use std::fs;
use std::path::Path;
use musli::Decode;
use musli_zerocopy::{Ref, Buf, ZeroCopy};
use musli_zerocopy::swiss;

#[derive(Decode)]
struct Archive {
    map: HashMap<String, u32>,
}

#[derive(ZeroCopy)]
#[repr(C)]
struct ArchiveNative {
    map: swiss::MapRef<Ref<str>, u32>,
}

let archive = Path::new("archive.bin");
let archive_native = Path::new("archive-native.bin");

if !archive_native.is_file() || is_outdated(archive, archive_native) {
    tracing::trace!(
        "Missing or outdated {}, building {}",
        archive.display(),
        archive_native.display()
    );

    let archive = fs::read(archive).context("reading archive")?;
    let archive: Archive = musli_storage::from_slice(&archive).context("decoding archive")?;
    write_native_archive(&archive, archive_native).context("building native archive")?;
}

let buf: &Buf = Buf::new(mmap_archive(archive_native));
One might be tempted to call this the best of all worlds, as long as you’re OK
with spending the disk space for it.
This can even be combined with Option #2, where we only bother converting and re-saving the archive if it has the wrong byte order.
Detour: A missed enum optimization
The goal with these APIs is to ensure that any endian-aware operations impose zero cost when the demanded byte order matches the current one. We have all the necessary tools to muck about with the bytes of a type. Any native-to-native swap, generic or not, should result in a simple move, like this:
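In source form, the pattern looks something like this sketch; with a native byte order every field conversion is a no-op, so ideally the whole function compiles down to a move:

enum Enum {
    Variant1(u32),
    Variant2(u64),
}

fn swap_native(value: Enum) -> Enum {
    // `from_ne_bytes(to_ne_bytes(..))` is the identity, standing in for a
    // native-to-native byte swap.
    match value {
        Enum::Variant1(a) => Enum::Variant1(u32::from_ne_bytes(a.to_ne_bytes())),
        Enum::Variant2(b) => Enum::Variant2(u64::from_ne_bytes(b.to_ne_bytes())),
    }
}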
Rust decides to check the discriminant and generate a branch for each unique
variant. It then proceeds to carefully just copy the data for that variant.
This is unfortunate for us, since we use this pattern when byte swapping enums.
If we did nothing, it would mean that native-to-native swapping wouldn’t be a
no-op:
Why this happens is a rather unfortunate case of a missed optimization. It never got merged due to fears of a potential performance regression, and wasn’t picked up since. Hopefully these types of MIR-based optimizations can be revived, because without them we might keep worrying about reducing “LLVM bloat” to improve compile times instead of writing idiomatic Rust.
// Avoid `Option::map` because it bloats LLVM IR.
match self.remove_entry(k) {
    Some((_, v)) => Some(v),
    None => None,
}
Once the issue was identified, the fix was simple enough. I’ve introduced a method on ByteOrder which optionally maps a value in case the byte order differs from Native, ensuring that native-to-native swapping doesn’t do the extra work of dealing with each unique variant individually.
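A hypothetical shape for such a method - the actual API in musli-zerocopy may differ:

trait ByteOrder {
    /// Only apply `map` when `Self` differs from the native byte order;
    /// otherwise hand back `value` unchanged, so a native-to-native swap
    /// stays a plain move.
    fn try_map<T, F>(value: T, map: F) -> T
    where
        F: FnOnce(T) -> T;
}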
Conclusion
Congratulations on reading this far. I hope you’ve enjoyed my explorations into the zero-copy space.
If there’s anything you want to ask or comment on, feel free to post below or
on Reddit. Thank you!
It’s been a few months since I released Müsli to the world. An experimental
binary serialization framework. I’ve used it quite happily in a few projects.
But there was one area which it didn’t exactly fit.
Say you have a database, a dictionary with millions of text entries and cross
references. And you want to perform sporadic but fast word lookups on the
command line. How do you store the database?
We might opt to reach for some neat binary serialization. But even the fastest framework takes seconds to load the data set, which just isn’t that great of a user experience for a CLI tool. This is understandable, because the job of a deserializer is broadly to take a byte array and turn it into a data structure that the Rust application can use. Typically this is done through a painstakingly long walk through the data, converting it into boxes or lists or other things that Rust programmers can reason about.
For millions of text entries - those boxes, lists, and other things containing
millions of entries that need to be allocated, decoded, arranged, cuddled …
work that inevitably takes up resources we’d rather not spend.
This is where zero-copy serialization might come in handy. Instead of loading
and processing the entire contents of the file, we might be able to map the file
as-is into the application and treat it as Rust data structures directly. In
essence a &[u8] to &T translation layer. Take a slice, return a reference to
some type we can work with.
Now I can see the look on your face, isn’t that wildly unsafe?
Well yes. Of course it is. Super duper unsafe1. But it doesn’t have to be unsound.
In this article I’ll discuss what it takes to soundly map data directly onto
structures which are defined in Rust. And while doing so introduce a library I
spent the last couple of weeks working on to make that easier. Namely
musli-zerocopy.
There are two directions I want to discuss: soundly converting a data structure into a slice, and soundly converting a slice into a reference to a data structure.
As a basis there are a few things we should know about &T.
The reference has to point to data which is aligned with respect to T;
All bytes in T except for padding (we’ll talk about this later) have to be
initialized to some value, and;
The bit pattern stored at &T has to be valid for T.
So let’s take it in order. To convert a Rust type into a slice we will have to cover the topics of The #[repr(Rust)] representation and alignment and padding. After that, when we discuss converting a slice back into a Rust type, we also need to cover the notion of legal bit patterns.
So… onwards!
The #[repr(Rust)] representation
Rust provides very few guarantees about what the layout of a data structure should be.
While this might seem like a nuisance at first, it’s actually very helpful.
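Take a struct like this one; its exact shape is implied by the #[repr(Rust)] rewrite further down:

#[repr(C)]
struct Yoke {
    a: u8,
    number: u32,
    b: u8,
    c: u8,
    d: u8,
}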
This is because #[repr(C)] is required to order the fields per their definition order and to respect each field’s alignment. An alignment is the multiple of a memory address that a specific type wants to be located on, and for u32 that is 4. To achieve this Rust moves the location of the number field, which causes some extra unused space in the struct called padding.
Note An alignment is required to be a power of two, and Yoke itself inherits the largest alignment of its fields, which ensures that as long as Yoke is aligned in memory, so too will number be.
#[repr(Rust)] on the other hand is free to re-arrange the fields. Yoke might
be rewritten to this:
struct Yoke {
    number: u32,
    a: u8,
    b: u8,
    c: u8,
    d: u8,
}
Which neatly means that the alignment of all the fields is respected, and we
save space.
In order to ensure that a data structure is valid, we can’t have Rust move
things around without telling us. So we’re obligated to use #[repr(C)]. To do
this correctly we need to delve deeper into the next topic. A tour into
alignment and padding.
So if we copy Yoke byte-by-byte, we will catch the value of _pad1 as well. This causes a problem, because these values are not initialized: padding bytes are considered uninitialized.
This is illustrated perfectly by our good friend Miri with this snippet:
use std::slice;
use std::mem::size_of;

#[repr(C)]
struct Yoke {
    a: u8,
    number: u32,
}

fn main() {
    let link = Yoke {
        a: 0x01,
        number: 0x02030405,
    };

    let bytes = unsafe {
        let ptr = (&link as *const Yoke).cast::<u8>();
        slice::from_raw_parts(ptr, size_of::<Yoke>())
    };

    // Touch the first byte in the padding between `a` and `number`.
    let b1 = bytes[1];
}
error: Undefined Behavior: using uninitialized data, but this
operation requires initialized memory
--> src/main.rs:21:14
|
21 | let b1 = bytes[1];
| ^^^^^^^^ using uninitialized data, but this
operation requires initialized memory
While Miri strictly speaking isn’t authoritative on all matters, it is right
here. So does this mean that there’s nothing we can do?
Well yes, we can initialize the padding ourselves. For example, directly where
Yoke is stored:
use std::slice;
use std::mem::size_of;

#[repr(C)]
struct Yoke {
    a: u8,
    number: u32,
}

fn main() {
    let mut link = Yoke {
        a: 0x01,
        number: 0x02030405,
    };

    unsafe {
        let ptr = (&mut link as *mut Yoke).cast::<u8>();
        // Zero out three bytes of padding at offset 1
        ptr.add(1).write_bytes(0, 3);
    }

    let bytes = unsafe {
        let ptr = (&link as *const Yoke).cast::<u8>();
        slice::from_raw_parts(ptr, size_of::<Yoke>())
    };

    dbg!(&bytes);
}
If that seems like too much, musli-zerocopy has a shorthand for the above called init_padding, so the above becomes:
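A sketch of how that looks, assuming init_padding is made available through the ZeroCopy derive:

use musli_zerocopy::ZeroCopy;

#[derive(ZeroCopy)]
#[repr(C)]
struct Yoke {
    a: u8,
    number: u32,
}

let mut link = Yoke {
    a: 0x01,
    number: 0x02030405,
};

// Zeroes the padding bytes in `link`, making a byte-wise copy fully initialized.
link.init_padding();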
Deserialization seems easy in comparison. As long as we know that the buffer we’re using is initialized:
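Something like this, which casts an initialized byte array directly:

use std::mem::size_of;

#[derive(Debug)]
#[repr(C)]
struct Yoke {
    a: u8,
    number: u32,
}

fn main() {
    let bytes: [u8; size_of::<Yoke>()] = [1, 0, 0, 0, 5, 4, 3, 2];
    // Cast the byte array directly into a reference to `Yoke`.
    let link = unsafe { &*bytes.as_ptr().cast::<Yoke>() };
    dbg!(link);
}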
But this is subtly wrong. As mentioned in the beginning &T has to reference a
T that is stored at an aligned memory location:
error: Undefined Behavior: accessing memory with alignment 2,
but alignment 4 is required
--> src/main.rs:12:25
|
12 | let link = unsafe { &*(&bytes as *const u8).cast::<Yoke>() };
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| accessing memory with alignment 2,
| but alignment 4 is required
The nasty part here is that if we’re unlucky, we might’ve ended up loading
Yoke from an aligned memory address on the stack and wouldn’t notice this
issue with Miri.
So for deserialization we have at least the problem of alignment. Luckily we
have tools for that. Including this hack:
use std::mem::size_of;

#[derive(Debug)]
#[repr(C)]
struct Yoke {
    a: u8,
    next: u32,
}

// Align `T` with respect to `Yoke`.
#[repr(C)]
struct Align<T>([Yoke; 0], T);

fn main() {
    let bytes: &Align<[u8; size_of::<Yoke>()]> = &Align([], [1, 0, 0, 0, 5, 4, 3, 2]);
    let link = unsafe { &*bytes.1.as_ptr().cast::<Yoke>() };
    dbg!(&link);
}
This stems from the fact that the alignment of an array matches the largest alignment of its elements. So even if the array inside of Align is zero-sized, it still forces the whole structure to be aligned by Yoke. Since it’s zero-sized, we must conclude that the subsequent T is stored at the same location, which luckily for us now is aligned by Yoke.
This is a neat hack. And I do use it in some places to store the output of include_bytes! in an aligned buffer. But as a tool it is rather clunky. We have to decide on the alignment up front at compile time.
If only there was a data structure that understood how to align its memory…
Ah well, that’s for later. First we need to cover another problem with deserialization. What exactly constitutes a legal bit pattern for a structure?
Legal bit patterns
Let’s revisit Yoke, but change the second field to be a char instead of a
u32. Of course we’ll also rename it, because otherwise it would be confusing.
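That gives us something like this:

#[derive(Debug)]
#[repr(C)]
struct Yoke {
    a: u8,
    character: char,
}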
If we try to cast a byte array into this structure, we need to make sure that
all bit patterns are inhabitable by it. But what bit patterns can a char
inhabit?
Reading the Rust reference and char::from_u32 we can intuit that a char is 4 bytes wide, but cannot inhabit all possible bit patterns that those 4 bytes can represent.
error: Undefined Behavior: constructing invalid value: encountered
0x00110000, but expected a valid unicode scalar value
  |
5 |     let c = unsafe { *(values.1.as_ptr() as *const u8).cast::<char>() };
  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |              constructing invalid value: encountered 0x00110000,
  |              but expected a valid unicode scalar value
  |              (in `0..=0x10FFFF` but not in `0xD800..=0xDFFF`)
This is how we’d define the legal bit pattern for any type T with a
#[repr(C)]:
Each field must be a legal bit pattern at their corresponding offsets.
If the type is an enum, there is a first secret field called the discriminant
which determines which variant is used. Then each field of the variant has
to have a legal bit pattern at their corresponding offsets.
Padding regions are ignored.
So let’s try to write a validator for Yoke which respects all these
requirements:
use std::mem::{align_of, size_of};

#[derive(Debug)]
#[repr(C)]
struct Yoke {
    a: u8,
    character: char,
}

#[derive(Debug)]
enum Error {
    NotAligned,
    BufferUnderflow,
    BadChar,
}

fn validate(bytes: &[u8]) -> Result<&Yoke, Error> {
    // Check that the buffer is aligned.
    if (bytes.as_ptr() as usize) % align_of::<Yoke>() != 0 {
        return Err(Error::NotAligned);
    }

    // Check that the buffer has the right number of bytes.
    if bytes.len() < size_of::<Yoke>() {
        return Err(Error::BufferUnderflow);
    }

    // Since the first field is an u8, it can inhabit *any* bit pattern.

    // Access the character field, at offset 4.
    let character_ptr: *const u32 = unsafe { bytes.as_ptr().add(4).cast::<u32>() };

    // Access the raw representation of character, which is a u32.
    let character_raw = unsafe { character_ptr.read_unaligned() };

    // Test that the raw representation is a valid character.
    if char::from_u32(character_raw).is_none() {
        return Err(Error::BadChar);
    }

    // We've inspected all fields and they've passed, so now we're good to cast the reference.
    Ok(unsafe { &*bytes.as_ptr().cast() })
}

struct Align<T>([Yoke; 0], T);

fn main() {
    let data = Align([], [42, 0, 0, 0, 80, 0, 0, 0]);
    let link = validate(&data.1);
    dbg!(link);
}
That is… a lot. And we still have to align the buffer ourselves.
Making it easier
The goal of musli-zerocopy is making the above work easier while providing
high level tooling like references, slices or even other collections to build
and interact with complex data structures with ease.
This is what it looks like:
use musli_zerocopy::buf;
use musli_zerocopy::{Ref, ZeroCopy};

#[derive(ZeroCopy)]
#[repr(C)]
struct Person {
    age: u8,
    name: Ref<str>,
}

let buf = buf::aligned_buf::<Person>(include_bytes!("author.bin"));
let person = Person::from_bytes(&buf[..])?;

assert_eq!(person.age, 35);
// References are incrementally validated.
assert_eq!(buf.load(person.name)?, "John-John");
And this is the code that wrote author.bin:
use std::fs;

use musli_zerocopy::{OwnedBuf, Ref, ZeroCopy};

fn main() -> Result<()> {
    let mut buf = OwnedBuf::new();

    // Insert an uninitialized value to ensure that the person data is
    // stored at offset 0.
    let person = buf.store_uninit::<Person>();

    let value = Person {
        age: 35,
        name: buf.store_unsized("John-John"),
    };

    buf.load_uninit_mut(person).write(&value);
    fs::write("person.bin", &buf[..])?;
    Ok(())
}
This takes care of padding, alignment, and validation for most Rust types which
have a stable well-defined layout. We provide tools to easily build correct,
rich applications leveraging zero-copy data structures. It is really fast,
and should be memory safe (assuming there are no bugs or false assumptions on my part).
I also want to emphasize the focus on incremental rather than up front
validation, which ensures that startup is fast and doesn’t needlessly consume
system resources. Incremental validation means that instead of validating the
whole structure in one go, we only validate the elements which are accessed,
like with buf.load(person.name)? above.
Tackling the dictionary
To loop back to what started me down this path, this is what the index of my dictionary
currently looks like using musli-zerocopy:
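In simplified form it looks something like this; the real definition carries more fields, and Id stands in for the entry identifier used below:

use musli_zerocopy::{swiss, Ref, ZeroCopy};

#[derive(ZeroCopy)]
#[repr(C)]
struct Index {
    // Maps a lookup string to the entries it resolves to.
    lookup: swiss::MapRef<Ref<str>, Ref<[Id]>>,
}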
This project is also what motivated the map support, so I’ve ported phf and hashbrown (the latter named swiss) to the project. The above uses swiss maps.
The index stores tens of millions of entries. Loading the same index using other
Müsli formats and validated rkyv takes a second or two while pinning a CPU
core to 100%, which actually isn’t fast enough for a CLI tool to feel
responsive.
The dictionary itself also uses a neat trick to perform string de-duplication, by referencing subsets of already stored strings if they are present. This in itself saves about 1G of space, and I believe showcases what you can accomplish when you have access to granular tooling:
let input: Vec<(String, Id)> = /* .. */;
let mut buf: OwnedBuf = /* .. */;

let mut lookup = HashMap::<String, Vec<Id>>::new();
let mut existing = HashMap::<_, usize>::new();
let mut reuse = 0usize;
let mut total = 0usize;

for (index, (key, id)) in input.into_iter().enumerate() {
    if index % 100000 == 0 {
        tracing::info!("Building strings: {}: {key}", index);
    }

    total += 1;

    let (unsize, substring) = if let Some(existing) = existing.get(key.as_ref()) {
        reuse += 1;
        (Ref::with_metadata(*existing, key.len()), true)
    } else {
        let unsize = buf.store_unsized(key.as_ref());
        (unsize, false)
    };

    lookup.entry(unsize).or_default().push(id);

    if !substring {
        // Generate string prefixes and insert into existing map.
        for (n, _) in key.char_indices() {
            let mut s = String::new();

            for c in key[n..].chars() {
                s.push(c);

                if !existing.contains_key(&s) {
                    existing.insert(s.clone(), unsize.offset() + n);
                }
            }
        }

        existing.insert(key.into_owned(), unsize.offset());
    }
}

tracing::info!("Reused {} string(s) (out of {})", reuse, total);
I should note that construction of the dictionary is fairly time consuming,
but that doesn’t matter since it’s always done ahead of time.
With this and thanks to incremental validation in musli-zerocopy, querying
through the CLI is both instantaneous and fully sound:
So what about this other crate?
Since people are bound to ask how this crate relates to those. musli-zerocopy
actually occupies a similar niche to zerocopy, but is fairly dissimilar from
rkyv.
This is what I write in the crate documentation:
zerocopy doesn’t support padded types2,
bytes to reference conversions, or conversions which does not permit decoding
types unless all bit patterns can be inhabited by zeroes3. We still
want to provide more of a complete toolkit that you’d need to build and
interact with complex data structures like we get through the phf and
swiss modules. This crate might indeed at some point make use of
zerocopy’s traits.
rkyv operates on #[repr(Rust)] types and from this
derives an optimized Archived variation for you. This library lets you build
the equivalent of the Archived variant directly and the way you interact
with the data model doesn’t incur the cost of validation up front. With rkyv
it took my computer 100% of a CPU core and about half a second to load 12
million dictionary entries4, which is a cost that is simply not
incurred by incrementally validating. Not validating is not an option since
that would be wildly unsound - your application would be vulnerable to
malicious dictionary files.
zerocopy in particular deserves a massive shout out. joshlf has done
meticulous work documenting and upstreaming (!) layout guarantees which
benefits everyone in the ecosystem.
So I think, and I hope you agree that there is some space to carve out for a
more holistic toolkit which operates directly over #[repr(C)] types. If you
do, I invite you to take the project for a spin.
Recently I released Müsli to the world. An experimental binary serialization
framework. And the response so far has been great. Clearly there’s an interest
in improving the space.
Today I intend to dive into a specific topic I didn’t manage to complete before
release. And it is a big one. Error handling and diagnostics.
By the end of this post, I’ll show you a design pattern which can give you rich
diagnostics like this:
.field["hello"].vector[0]: should be smaller than 10, but was 42 (at bytes 34-36)
Which works for any Müsli format, has small and easily optimized Result::Err values, can be disabled with zero overhead, and provides full functionality in no_std environments without an allocator.
Error handling seems fairly simple and a given, right? Rust has Result<T, E>, which neatly implements the (unstable) Try trait so we can conveniently use try expressions to propagate errors.
#[derive(Decode)]
struct Object {
    #[musli(decode_with = decode_field)]
    field: u32,
}

fn decode_field<D>(decoder: D) -> Result<u32, D::Error>
where
    D: Decoder,
{
    let value = decoder.decode_u32()?;

    if value >= 10 {
        return Err(D::Error::message(format_args!(
            "should be smaller than 10, but was {value}"
        )));
    }

    Ok(value)
}
There are some caveats to this approach:
The return type gets bigger, and is something that needs to be threaded by the
Rust compiler all the way to the error handling facility.
Where is the message stored? This implies that D::Error stores it somewhere
(sometimes cleverly). If it’s stored in a Box<T>, it needs to allocate. If
it’s stored inline in the type using arrayvec all return signatures now
become larger by default.
How do we know that the error was caused by Object::field? Should the error
store the field name too? The name of the struct?
How do we know what input caused the error? Do we also have to store a data
location?
These are all workable problems. In fact, most serialization libraries enrich the diagnostics produced somehow so that they’re more actionable. Unfortunately I’ve found that you have to compromise between tight and complex return values by putting everything in a box, or large error variants which have a nasty habit of getting in the way of the compiler making things efficient.
Now say we want no_std without an allocator. Defining a rich error type is
starting to genuinely hurt - not to mention that it’s pretty much unworkable due
to its size.
struct ErrorImpl {
    #[cfg(feature = "alloc")]
    message: String,
    #[cfg(not(feature = "alloc"))]
    message: ArrayString<64>,
    // The index where the error happened.
    index: usize,
}

#[cfg(feature = "alloc")]
use alloc::boxed::Box as MaybeBox;

#[cfg(not(feature = "alloc"))]
struct MaybeBox<T> {
    value: T,
}

struct Error {
    err: MaybeBox<ErrorImpl>,
}
So in no_std environments without alloc you tend to compromise. Let’s look at how other libraries provide diagnostics.
postcard is a no_std binary serialization format for serde, and
they’ve opted to reduce actionable diagnostics by only including error
variants. This means no location diagnostics, but it also means smaller error
types.
Well, I’ve had it with compromising. I want to try and fly a bit closer to the
sun and I’m bringing no_std with me.
The plan
In my Müsli announcement I tried to put emphasis on the experimental nature of
the framework. And the change I’m about to propose breaks the mold a little bit.
Maybe even a bit too much.
What I’ll propose is a model for:
Emitting dynamic errors and messages. Such as ones produced through format_args!().
Errors can track complex structural diagnostics. Such as which byte offset
caused the error or even which field of a struct or index in an array caused
it.
All features are compile-time optional. Not by using features or --cfg
flags, but through generics. In a single project you can freely mix and match
which uses what. And you only pay for what you use.
The pattern works in no_std environments without an allocator, such as microcontrollers.
Sounds exciting right? To start things off I briefly want to touch on more
generally what it means to build abstractions in Rust that can melt away.
Abstractions that can melt away
Let’s imagine for a second we write a function like this to download a
collection of URLs:
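Roughly this shape, with Website and Error as illustrative stand-ins:

fn download_all(urls: &[&str]) -> Result<Vec<Website>, Error> {
    let mut results = Vec::new();

    for url in urls {
        // One `Website` is produced per provided url.
        results.push(Website::download(url)?);
    }

    Ok(results)
}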
Assuming that urls always contains some items, is there some way we can make
this function conditionally become a no-op depending on the implementation of
D? That is to say, Rust could correctly decide to remove most of the work.
Right now that doesn’t seem to be the case - we are obligated to iterate over
the input and somehow produce one Website instance for each url
provided1.
But with some minor changes we can build an abstraction that can be optimized
away, or in this case to the bare minimum of consuming the iterator2. We
make more elements of the function generic over the abstraction. Like this:
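A sketch under the same stand-in types as above: the function becomes generic over a Downloader, which controls both how a website is downloaded and how results are accumulated.

trait Downloader {
    type Output;

    fn download(&mut self, url: &str) -> Result<(), Error>;

    fn finish(self) -> Self::Output;
}

fn download_all<D>(mut downloader: D, urls: &[&str]) -> Result<D::Output, Error>
where
    D: Downloader,
{
    for url in urls {
        downloader.download(url)?;
    }

    Ok(downloader.finish())
}

// A downloader which demonstrably does very little: it consumes the input,
// but performs no downloads and allocates no vector.
struct NoopDownloader;

impl Downloader for NoopDownloader {
    type Output = ();

    fn download(&mut self, _: &str) -> Result<(), Error> {
        Ok(())
    }

    fn finish(self) -> Self::Output {}
}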
There’s obviously more than one way to do this, but with this particular trait
we can easily build a Downloader which demonstrably does very little. It
doesn’t even populate the vector. There is not even a vector.
What I hope to demonstrate is that with a bit of inversion of control we can
make it easier for Rust to prove that nothing is going on and simply remove the
code. Here by moving implementation details into the Downloader trait.3
Capturing errors
Let’s try and apply this idea to error handling by bringing back our friendly
decode_field function from above.
fn decode_field<D>(decoder: D) -> Result<u32, D::Error>
where
    D: Decoder,
{
    let value = decoder.decode_u32()?;

    if value >= 10 {
        return Err(D::Error::message(format_args!(
            "should be smaller than 10, but was {value}"
        )));
    }

    Ok(value)
}
So what if we want to collect the error message without allocating space for the
string directly in the error being returned?
To collect it we’d first need somewhere to put it, so let’s introduce a trait
for it called Context.
trait Context {
    /// Report a message.
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display;
}
And as is appropriate, we now have to actually pass in an instance of a
Context that we can give our message to:
struct Error;

fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, Error>
where
    C: Context,
    D: Decoder,
{
    let value = decoder.decode_u32()?;

    if value >= 10 {
        cx.message(format_args!("should be smaller than 10, but was {value}"));
        return Err(Error);
    }

    Ok(value)
}
We’re already starting to get somewhere interesting. The error variant is now a
zero sized type, which reduces the return size as much as possible. But what
exactly is a Context?
Since it’s a trait the caller is responsible for providing an implementation. We
need it to somehow capture the reported message. One way to do it is to pack it
into an Option<String>.
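A minimal sketch of such a context, implementing the Context trait from above:

use std::fmt;

#[derive(Default)]
struct Capture {
    message: Option<String>,
}

impl Context for Capture {
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display,
    {
        // Abduct the reported message into the context.
        self.message = Some(message.to_string());
    }
}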
Do you see what’s going on? All of our error handling and diagnostics - regardless of what they are - can be passed out through a pointer to the context. This is why I call the pattern “abductive diagnostics”: the context argument effectively abducts the error from the function.
But it’s not for free. The cost we’ve imposed on our project is that the context
variable needs to be threaded through every fallible function which needs to
use it (something an effect system in Rust might someday remedy).
To improve on the cost / benefit of this pattern, let’s add more information to
the context:
trait Context {
    /// Indicate that n bytes of input has been processed.
    fn advance(&mut self, n: usize);
}
And with that extend our Capture to keep track of this:
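One way the extended Capture might look:

#[derive(Default)]
struct Capture {
    message: Option<(String, usize)>,
    index: usize,
}

impl Context for Capture {
    fn advance(&mut self, n: usize) {
        self.index = self.index.wrapping_add(n);
    }

    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display,
    {
        // Associate the current byte index with the reported message.
        self.message = Some((message.to_string(), self.index));
    }
}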
Now we can associate byte indexes to diagnostics. We’re really starting to build
out our capabilities!4 The neat part here is that we added some really
powerful capabilities to our system, while keeping the returned error a zero
sized type.
Next let’s see how this pattern can help us to capture errors without
allocating on the heap.
I was promised no allocations
Right now there’s clearly a String there, which uses allocations! It’s even
in the alloc crate.
This is true, but the neat part about having the context abduct our errors is
that we gain the necessary control to store them wherever we want.
So let’s build a context which stores errors on the stack instead using
arrayvec.5
One which stores errors and diagnostics on the stack, in NoStdContext:
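A sketch, using a fixed-capacity ArrayString for the message:

use core::fmt;
use core::fmt::Write;

use arrayvec::ArrayString;

#[derive(Default)]
struct NoStdContext {
    message: ArrayString<64>,
    index: usize,
}

impl Context for NoStdContext {
    fn advance(&mut self, n: usize) {
        self.index = self.index.wrapping_add(n);
    }

    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display,
    {
        self.message.clear();
        // Anything past the fixed capacity is simply dropped.
        let _ = write!(self.message, "{message}");
    }
}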
But if you intend to integrate it into a strange environment you would very much
be encouraged to implement your own Context.
Restoring what was lost
If we pay attention to the method we refactored above, we might note that while
we gained the ability to abduct errors through the context, we lost two things
as well.
Regular errors which contain their own diagnostics can no longer be returned if we wanted to; and
The method doesn’t guarantee that an error has been reported to the context. Oops.
The latter is no good. When using regular error types we’re forced to somehow
produce an Error through some kind of constructor. Here we can just return the
marker type and in the worst case forget to provide a message.
So let’s address this by modifying Context further:
trait Context {
    /// The error type produced by the context.
    type Error;

    /// Add ways to construct a `Self::Error`.
    fn message<T>(&mut self, message: T) -> Self::Error
    where
        T: fmt::Display;
}
And the corresponding changes to decode_field looks like this:
fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, C::Error>
where
    C: Context,
    D: Decoder,
{
    let value = decoder.decode_u32()?;

    if value >= 10 {
        return Err(cx.message(format_args!(
            "should be smaller than 10, but was {value}"
        )));
    }

    Ok(value)
}
Now the only way we can return from decode_field is by either producing
Ok(u32), or Err(C::Error). And the Context is the only one which can
produce C::Error’s, so we don’t accidentally return a blank error marker
without providing diagnostics.
In addition, do you remember that the Decoder also produces an error? The call
to decode_u32 doesn’t actually compile. We have to handle that somehow as
well. To do this, we extend our context further:
trait Context {
    type Input;
    type Error;

    /// Add the ability to report an error that can be converted to `Input`.
    fn report<T>(&mut self, input: T) -> Self::Error
    where
        Self::Input: From<T>;
}
We can now specify the Input type as the deserializer error that the context
can abduct:
fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, C::Error>
where
    C: Context<Input = D::Error>,
    D: Decoder,
{
    let value = decoder.decode_u32().map_err(|err| cx.report(err))?;

    if value >= 10 {
        return Err(cx.message(format_args!(
            "should be smaller than 10, but was {value}"
        )));
    }

    Ok(value)
}
Of course in the real implementation we just pass along the cx variable to
decode_u32. But this showcases how the pattern can be gradually introduced
into existing code which was of great help during refactoring.
What exactly the report implementation looks like I leave as an exercise to
the reader, but with these additions there are now two more interesting contexts
that Müsli provides:
Same which produces the same error (Context::Error) that it consumes
(Context::Input) providing full backwards compatibility.
Ignore which simply records that an error has happened, but returns a zero
sized marker type like before.
// Behaves just like the original version without `Context` support.
let mut cx = Same::default();
let value = decode_field(&mut cx, decoder)?;
// The error returned is a ZST.
let mut cx = Ignore::default();

let Ok(value) = decode_field(&mut cx, decoder) else {
    // deal with the fact that something went wrong.
};
Diagnostics
Error messages by themselves are cool and all, but what we really want is more
diagnostics. While it can be useful to know that something should be smaller
than 10, but was 42 this only helps us in the simplest cases to troubleshoot
issues. In most cases we don’t just need to know what went wrong, but where
it went wrong.
To this end we can add more hooks to Context. And this is where it starts
getting really interesting. In Müsli I’ve opted to add support for keeping track
of the byte index and fully tracing the type hierarchy being decoded.
So using one of the contexts (like AllocContext) provided above, we can
get error messages like this:
.field["hello"].vector[0]: should be smaller than 10, but was 42 (at byte 36)
Or even this, which includes the names of the Rust types which were involved:
Struct { .field["hello"] = Enum::Variant2 { .vector[0] } }: should be smaller than 10, but was 42 (at byte 36)
Or nothing at all! If you don’t use it, you don’t pay for it:
decoding / encoding failed
Conveniently our Encode and Decode derives fill out all these context calls for you. So tracing is something you get for free6.
The cost of abstractions
I’m keenly aware that threading the context variable through the entire
framework can be a bit messy. It took me almost two days to refactor all the
code in Müsli. In the long run the ergonomics of it might simply pan out to not
be worth it, or we’ll try to change something to make it easier. For now I don’t
know what.
Luckily most users will not interact with the plumbing of the framework. They
should primarily focus on using the high level derives available. A smaller
number of users will end up writing hooks and that can be harder because there’s
a lot of things going on:
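For instance, a hook ends up with a signature of roughly this shape (a sketch; the exact bounds in Müsli differ, with M as the mode, C the context, and D the decoder):

fn decode_field<'de, 'buf, M, C, D>(cx: &mut C, decoder: D) -> Result<u32, C::Error>
where
    C: Context<'buf, Input = D::Error>,
    D: Decoder<'de>,
{
    /* .. */
}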
Two lifetimes, and three generics by default, and a strange Input parameter
for Context! Did I mention that Müsli is still an experimental project?
Still, I’m hoping to discover a better way of doing this from an ergonomics
perspective. If you have suggestions, I’d love to hear them!
Conclusion
Thank you for reading!
I can confidently say that with this, Müsli has the ability to produce state of the art diagnostics with almost no effort from the user. I think most people who’ve worked with serialization know that without good diagnostics it can be very hard to figure out what’s going on.
I will round off by displaying this error I’m currently plagued by, which is
caused by a binary format using serde. It’s actually the reason why I started
to pursue this topic:
Error: ../../changes.gz
Caused by:
invalid value: integer `335544320`, expected variant index 0 <= i < 11
I don’t know which field, in which struct, being fed what data, caused that
error to happen. Needless to say I’ll be switching that project to use Müsli.
The above certainly can be improved, and we can even add tracing to serde. But
not without compromises. And I want everything.
Website could support mocking. But maybe that’s not viable either. ↩
Since that is a legitimate effect we should preserve of calling the function. ↩
This isn’t something that comes for free, we now have to live with the
cognitive load of dealing with a weird interface. I think that most of the
time it might not even be worth it. ↩
The exact behavior of writing to an ArrayString at capacity is a bit fuzzy, but at the very least anything that is outside of its capacity will be dropped. The exact boundary of which could be improved. ↩
I recently spent some effort trying to make reproto run in a browser.
Here I want to outline the problems I encountered and how I worked around them.
I will also provide a number of suggestions for how things might be improved for future porters.
A big chunk of the wasm story on Rust currently relies on stdweb.
Needless to say, this project is incredible.
stdweb makes it smooth to build Rust applications that integrates with a browser.
There’s a ton of things that can be said about that project, but first I want to focus on the porting efforts of reproto.
Getting up to speed and shaking the tree
The best way to port a library is to compile it.
For Rust it’s as simple as installing cargo-web and setting up a project with all your dependencies.
All it really needs is a Cargo.toml declaring your dependencies, and a src/main.rs with extern
declarations for everything you want compiled.
For reproto, that process looks like this:
# Cargo.toml
[package]
# skipped
[dependencies]
reproto-core = {path = "../lib/core", version = "0.3"}
reproto-compile = {path = "../lib/compile", version = "0.3"}
reproto-derive = {path = "../lib/derive", version = "0.3"}
reproto-manifest = {path = "../lib/manifest", version = "0.3"}
reproto-parser = {path = "../lib/parser", version = "0.3"}
reproto-backend-java = {path = "../lib/backend-java", version = "0.3"}
reproto-backend-js = {path = "../lib/backend-js", version = "0.3"}
reproto-backend-json = {path = "../lib/backend-json", version = "0.3"}
reproto-backend-python = {path = "../lib/backend-python", version = "0.3"}
reproto-backend-rust = {path = "../lib/backend-rust", version = "0.3"}
reproto-backend-reproto = {path = "../lib/backend-reproto", version = "0.3"}
stdweb = "0.3"
serde = "1"
serde_json = "1"
serde_derive = "1"
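And a src/main.rs along these lines, with crate names taken from the manifest above:

// src/main.rs — extern declarations for everything we want compiled.
extern crate reproto_backend_java;
extern crate reproto_backend_js;
extern crate reproto_backend_json;
extern crate reproto_backend_python;
extern crate reproto_backend_reproto;
extern crate reproto_backend_rust;
extern crate reproto_compile;
extern crate reproto_core;
extern crate reproto_derive;
extern crate reproto_manifest;
extern crate reproto_parser;
extern crate stdweb;

fn main() {
}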
Interestingly enough, this is something I’ve touched on in a previous post as a portability concern.
In rustc-serialize, paths are not portable because not all platforms have serializers defined for them!
impls being missing for a specific platform wouldn’t be a big deal if platform-specific bits were better tucked away.
As it stands here, we end up with an incomplete impl Decodable which is a compiler error.
If the entire impl block was conditional it would simply be overlooked as another missing implementation. In practice this would mean that you wouldn’t be able to serialize PathBuf easily, but this can be worked around by simply not using it.
Due to the current state of affairs, the easiest way to deal with it was simply to disable the default features for the num-* crates.
A bit tedious to add everywhere, but not a big deal:
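Something like this, with crate names and versions illustrative:

[dependencies.num-bigint]
version = "0.1"
default-features = false

[dependencies.num-integer]
version = "0.1"
default-features = false

[dependencies.num-traits]
version = "0.1"
default-features = false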
Taking this for a spin would result in an unfortunate runtime error:
To work around this I introduced a layer of indirection.
My very own fs.rs.
I then ported all code to use this so that I can swap out the implementation at runtime.
This wasn’t particularly hard, seeing as I already pass around a Context to collect errors.
Now it just needed to learn a new trick.
Finally I ported all code that used Path to use relative-path instead.
This guarantees that I won’t be tempted to hit any of those platform-specific APIs like canonicalize, which requires access to a filesystem.
With this in place I can now capture the files written to my filesystem abstraction directly into memory!
Ruining Christmas with native libraries
Anything involving native libraries will ruin your day in one way or another.
My repository component uses ring for calculating Sha256 checksums.
The first realization is that repositories won’t work the same - if at all - on the web.
We don’t have a filesystem!
At some point it might be possible to plug in a solution that communicates with a service to fetch dependencies.
But that is currently not the goal.
This realization made the solution obvious: web users don’t need a repository.
I moved the necessary trait (Resolver) from repository to core, and provided a no-op implementation for it.
The result is that I no longer depend on the repository crate to have a working system, sidestepping the native dependency entirely in the browser.
Neat!
Revenge of the Path
I thought I had seen the last of Path. But url decided to blow up in my face like this:
error[E0425]: cannot find function `path_to_file_url_segments` in this scope
--> <registry>/src/github.com-1ecc6299db9ec823/url-1.6.0/src/lib.rs:1934:32
|
1934 | let (host_end, host) = path_to_file_url_segments(path.as_ref(), &mut serialization)?;
| ^^^^^^^^^^^^^^^^^^^^^^^^^ did you mean `path_to_file_url_segments_windows`?
error[E0425]: cannot find function `file_url_segments_to_pathbuf` in this scope
--> <registry>/src/github.com-1ecc6299db9ec823/url-1.6.0/src/lib.rs:2057:20
|
2057 | return file_url_segments_to_pathbuf(host, segments);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ did you mean `file_url_segments_to_pathbuf_windows`?
error: aborting due to 2 previous errors
error: Could not compile `url`.
Yet again. Paths aren’t portable.
The url crate wants to translate file segments into real paths for you.
It does this by hiding implementations of file_url_segments_to_pathbuf behind platform-specific gates.
Obviously there are no wasm implementations for this.
An alternative here is to use something like hyper::Uri, but that would currently mean pulling in all of hyper and its problematic dependencies.
I settled for just adding more indirection and isolating the components that needed HTTP and URLs into their own modules.
"All problems in computer science can be solved by another level of indirection"
— David Wheeler
What’s the time again?
chrono is another amazing library, I used it in my derive component to detect when a string looks like a datetime.
error[E0432]: unresolved import `self::inner`
--> <registry>/src/github.com-1ecc6299db9ec823/time-0.1.39/src/sys.rs:3:15
|
3 | pub use self::inner::*;
| ^^^^^ Could not find `inner` in `self`
error[E0433]: failed to resolve. Could not find `SteadyTime` in `sys`
--> <registry>/src/github.com-1ecc6299db9ec823/time-0.1.39/src/lib.rs:247:25
|
/// snip...
error: aborting due to 10 previous errors
error: Could not compile `time`.
Because the derive feature was such a central component in what I wanted to port, I started looking for alternatives instead of isolating it.
My first attempt was the iso8601 crate, a project using nom to parse ISO 8601 timestamps.
Perfect!
error[E0432]: unresolved import `libc::c_void`
--> <registry>/src/github.com-1ecc6299db9ec823/memchr-1.0.2/src/lib.rs:19:5
|
19 | use libc::c_void;
| ^^^^^^^^^^^^ no `c_void` in the root
error[E0432]: unresolved import `libc::c_int`
--> <registry>/src/github.com-1ecc6299db9ec823/memchr-1.0.2/src/lib.rs:21:12
|
/// ...
error: build failed
So nom depends on memchr, an interface to the memchr libc function.
That makes sense. nom wants to scan for characters as quickly as possible.
Unfortunately that makes nom and everything depending on it unusable in wasm right now.
In the following sections I try to summarize how we can improve the experience for future porters.
Make platform detection a first class feature of Rust
If you look over the error messages encountered above, you can see that they have one thing in common:
The are all unique.
This is unfortunate, since they all relate to the same problem: there is a component X that is hidden behind a platform gate.
When that component is no longer provided, the project fails to compile.
Wouldn’t it be better if the compiler error looked like this:
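Something along these lines, as a mock-up:

error: no platform implementation of `path_to_file_url_segments` matched
the target `wasm32-unknown-unknown`
--> <registry>/src/github.com-1ecc6299db9ec823/url-1.6.0/src/lib.rs:1934:32
  |
  = note: implementations exist for `unix` and `windows`
  = help: the crate needs a fallback or a gate matching this platform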
It works by detecting when your platform has a configuration which does not match any existing gates, providing you with contextual information about why it failed. This means that the matching would either have to be exhaustive (e.g. provide a default fallback), or fail where the matching actually occurs.
This is much better than the arbitrary number of compiler errors caused by missing elements.
Transitive feature flags
This is the first exercise I’ve had in explicitly disabling default features.
Sometimes it can be hard.
Dependency topologies are complex, and mostly out of your hand.
This suggestion is heavily influenced by use flags in Gentoo, and would be a controversial change.
The gist here is that I can enable a feature for a crate and all its dependencies.
Not just directly on that crate.
This way it doesn’t matter that a middle layer forgot to forward a feature flag, you can still disable it without too much hassle.
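As a purely hypothetical sketch of what that could look like in a manifest:

# Hypothetical syntax, not supported by Cargo today: force default features
# off for `num-traits` everywhere in the dependency graph, even when it is
# pulled in by intermediate crates.
[transitive-dependencies.num-traits]
default-features = false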
Different crates might use the same feature flag for different things. But it begs the question: are we more interested in patching all crates to forward feature flags correctly, than we are in patching crates which use the wrong flags?
Conclusion
This was a lot of work. But much less than I initially expected.
The wasm ecosystem in Rust is really on fire right now, and things are improving rapidly!
I’m actually really excited over how well it works.
Apart from a small number of expected, and an even smaller number of unexpected, problem dependencies, things just work.
In summary:
Avoid native dependencies.
When something breaks, it’s probably because of a gate.
Abstract the filesystem.
Avoid using Path/PathBuf and APIs which have platform-specific behaviors. Consider
relative-path as an alternative.
So to leave you, feel free to try out reproto in your browser using the new eval app:
UPDATE #1:
Here is the relevant PR in chrono to make time an optional dependency.
Please make your voice heard if this is something you want!
For nom, memchr support for wasm was merged in December; unfortunately that leaves existing libraries behind until they are upgraded.