Abductive diagnostics for Müsli
Recently I released Müsli to the world, an experimental binary serialization framework. The response so far has been great, and clearly there’s an interest in improving the space.
Today I intend to dive into a specific topic I didn’t manage to complete before release. And it is a big one. Error handling and diagnostics.
By the end of this post, I’ll show you a design pattern which can give you rich diagnostics like this:
.field["hello"].vector[0]: should be smaller than 10, but was 42 (at bytes 34-36)
Which works for any Müsli format, has small and easily optimized Result::Err values, can be disabled with zero overhead, and provides full functionality in no_std environments without an allocator.
If you find this topic interesting, please join the discussion on Reddit.
Table of contents:
- The problem
- The plan
- Abstractions that can melt away
- Capturing errors
- I was promised no allocations
- Restoring what was lost
- Diagnostics
- The cost of abstractions
- Conclusion
The problem
Error handling seems fairly simple and a given, right? Rust has a Result<T, E> which neatly implements the (unstable) Try trait, so we can conveniently use try expressions (the ? operator) to propagate errors.
#[derive(Decode)]
struct Object {
    #[musli(decode_with = decode_field)]
    field: u32,
}
fn decode_field<D>(decoder: D) -> Result<u32, D::Error>
where
    D: Decoder
{
    let value = decoder.decode_u32()?;
    if value >= 10 {
        return Err(D::Error::message(format_args!("should be smaller than 10, but was {value}")));
    }
    Ok(value)
}
There are some caveats to this approach:
- The return type gets bigger, and is something that needs to be threaded by the Rust compiler all the way to the error handling facility.
- Where is the message stored? This implies that D::Error stores it somewhere (sometimes cleverly). If it’s stored in a Box<T>, it needs to allocate. If it’s stored inline in the type using arrayvec, all return signatures now become larger by default.
- How do we know that the error was caused by Object::field? Should the error store the field name too? The name of the struct?
- How do we know what input caused the error? Do we also have to store a data location?
These are all workable problems. In fact, most serialization libraries enrich the diagnostics they produce somehow so that they’re more actionable. Unfortunately I’ve found that you have to compromise: either you get tight but complex return values by putting everything in a box, or you get large error variants which have a nasty habit of getting in the way of the compiler making things efficient.
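To make the size trade-off concrete, here is a minimal sketch that compares the size of Result<u32, E> for three kinds of error. The types InlineError, BoxedError and MarkerError are made up for this illustration and are not part of Müsli or any other library.

```rust
use std::mem::size_of;

/// Hypothetical error that stores its message inline (no allocator needed).
#[allow(dead_code)]
struct InlineError {
    message: [u8; 64],
    len: usize,
    index: usize,
}

/// Hypothetical error that hides the same payload behind a pointer.
#[allow(dead_code)]
struct BoxedError(Box<InlineError>);

/// Hypothetical zero-sized marker: the diagnostics would live somewhere else.
#[allow(dead_code)]
struct MarkerError;

/// Size in bytes of `Result<u32, E>`, which is what every fallible call
/// has to hand back to its caller.
fn result_size<E>() -> usize {
    size_of::<Result<u32, E>>()
}
```

Calling result_size::<InlineError>(), result_size::<BoxedError>() and result_size::<MarkerError>() shows the spread. The exact numbers depend on the target, but the inline variant has to be at least as large as its 64-byte buffer, while the boxed one is pointer-sized plus a discriminant and pays with an allocation instead.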
Now say we want no_std without an allocator. Defining a rich error type is starting to genuinely hurt - not to mention that it’s pretty much unworkable due to its size.
struct ErrorImpl {
    #[cfg(feature = "alloc")]
    message: String,
    #[cfg(not(feature = "alloc"))]
    message: ArrayString<64>,
    // The index where the error happened.
    index: usize,
}
#[cfg(feature = "alloc")]
use alloc::boxed::Box as MaybeBox;
#[cfg(not(feature = "alloc"))]
struct MaybeBox<T> {
    value: T,
}
struct Error {
    err: MaybeBox<ErrorImpl>,
}
So in no_std environments without alloc you tend to compromise. Let’s look at how a few libraries provide diagnostics.
- serde_json tracks its own line and column, and stores them by boxing the error.
- prost also provides diagnostics behind a box, alongside the error description.
- postcard is a no_std binary serialization format for serde, and they’ve opted to reduce actionable diagnostics by only including error variants. This means no location diagnostics, but it also means smaller error types.
Well, I’ve had it with compromising. I want to try and fly a bit closer to the sun and I’m bringing no_std with me.
The plan
In my Müsli announcement I tried to put emphasis on the experimental nature of the framework. And the change I’m about to propose breaks the mold a little bit. Maybe even a bit too much.
What I’ll propose is a model for:
- Emitting dynamic errors and messages, such as ones produced through format_args!().
- Errors can track complex structural diagnostics, such as which byte offset caused the error or even which field of a struct or index in an array caused it.
- All features are compile-time optional. Not by using features or --cfg flags, but through generics. In a single project you can freely mix and match which uses what. And you only pay for what you use.
- The pattern works in no_std environments without an allocator, such as microcontrollers.
Sounds exciting, right? To start things off I want to briefly touch on what it means, more generally, to build abstractions in Rust that can melt away.
Abstractions that can melt away
Let’s imagine for a second we write a function like this to download a collection of URLs:
fn download_urls<D, I>(downloader: D, urls: I) -> Result<Vec<Website>, Error>
where
    D: Downloader,
    I: IntoIterator<Item = Url>,
{
    let mut vec = Vec::new();
    for url in urls {
        vec.push(downloader.download(url)?);
    }
    Ok(vec)
}
Assuming that urls always contains some items, is there some way we can make this function conditionally become a no-op depending on the implementation of D? That is to say, Rust could correctly decide to remove most of the work.
Right now that doesn’t seem to be the case - we are obligated to iterate over the input and somehow produce one Website instance for each url provided[1].
But with some minor changes we can build an abstraction that can be optimized away, or in this case to the bare minimum of consuming the iterator[2]. We make more elements of the function generic over the abstraction. Like this:
trait Downloader {
    type Vec;
    fn new_vec() -> Self::Vec;
    fn download_into(url: Url, output: &mut Self::Vec) -> Result<(), Error>;
}
fn download_urls<D, I>(_downloader: D, urls: I) -> Result<D::Vec, Error>
where
    D: Downloader,
    I: IntoIterator<Item = Url>,
{
    let mut vec = D::new_vec();
    for url in urls {
        // The trait methods take no `self`, so they are called through `D`.
        D::download_into(url, &mut vec)?;
    }
    Ok(vec)
}
There’s obviously more than one way to do this, but with this particular trait we can easily build a Downloader which demonstrably does very little. It doesn’t even populate the vector. There is not even a vector.
struct DoNothing;
impl Downloader for DoNothing {
    type Vec = ();
    fn new_vec() {}
    fn download_into(url: Url, output: &mut ()) -> Result<(), Error> {
        Ok(())
    }
}
What I hope to demonstrate is that with a bit of inversion of control we can make it easier for Rust to prove that nothing is going on and simply remove the code. Here by moving implementation details into the Downloader trait.[3]
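For anyone who wants to poke at this, here is a self-contained sketch of the same idea. Url, Website and Error are stand-ins I made up so that the example compiles on its own, and Fake is a hypothetical implementation that "downloads" by wrapping the URL. Nothing here touches the network.

```rust
// Stand-in types for illustration only.
type Url = String;

#[derive(Debug)]
struct Error;

#[derive(Debug, PartialEq)]
struct Website(String);

trait Downloader {
    type Vec;
    fn new_vec() -> Self::Vec;
    fn download_into(url: Url, output: &mut Self::Vec) -> Result<(), Error>;
}

fn download_urls<D, I>(_downloader: D, urls: I) -> Result<D::Vec, Error>
where
    D: Downloader,
    I: IntoIterator<Item = Url>,
{
    let mut vec = D::new_vec();
    for url in urls {
        D::download_into(url, &mut vec)?;
    }
    Ok(vec)
}

/// Hypothetical implementation that does real work: it collects results.
struct Fake;

impl Downloader for Fake {
    type Vec = Vec<Website>;

    fn new_vec() -> Self::Vec {
        Vec::new()
    }

    fn download_into(url: Url, output: &mut Self::Vec) -> Result<(), Error> {
        output.push(Website(url));
        Ok(())
    }
}

/// Implementation that does nothing: its "vector" is the unit type.
struct DoNothing;

impl Downloader for DoNothing {
    type Vec = ();

    fn new_vec() -> Self::Vec {}

    fn download_into(_url: Url, _output: &mut ()) -> Result<(), Error> {
        Ok(())
    }
}
```

With Fake the function returns a populated Vec<Website>. With DoNothing the return value is a zero-sized (), and all that is left for the optimizer to keep is the iteration itself.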
Capturing errors
Let’s try and apply this idea to error handling by bringing back our friendly decode_field function from above.
fn decode_field<D>(decoder: D) -> Result<u32, D::Error>
where
    D: Decoder
{
    let value = decoder.decode_u32()?;
    if value >= 10 {
        return Err(D::Error::message(format_args!("should be smaller than 10, but was {value}")));
    }
    Ok(value)
}
So what if we want to collect the error message without allocating space for the string directly in the error being returned?
To collect it we’d first need somewhere to put it, so let’s introduce a trait for it called Context.
trait Context {
    /// Report a message.
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display;
}
And as is appropriate, we now have to actually pass in an instance of a Context that we can give our message to:
struct Error;
fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, Error>
where
    C: Context,
    D: Decoder
{
    let value = decoder.decode_u32()?;
    if value >= 10 {
        cx.message(format_args!("should be smaller than 10, but was {value}"));
        return Err(Error);
    }
    Ok(value)
}
We’re already starting to get somewhere interesting. The error variant is now a zero sized type, which reduces the return size as much as possible. But what exactly is a Context?
Since it’s a trait the caller is responsible for providing an implementation. We need it to somehow capture the reported message. One way to do it is to pack it into an Option<String>.
#[derive(Default)]
struct Capture {
    message: Option<String>,
}
impl Context for Capture {
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display
    {
        self.message = Some(message.to_string());
    }
}
Converting a captured message back into the original error using this context implementation is pretty straightforward:
let decoder = /* .. */;
let mut cx = Capture::default();
let Ok(value) = decode_field(&mut cx, decoder) else {
    return Err(D::Error::message(cx.message.unwrap()));
};
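Put together, the pieces so far can be tried out with a small self-contained sketch. SliceDecoder is a toy decoder I invented for this example (it is not Müsli's Decoder trait), and unlike the snippets above it reports its own failure through the context so that everything compiles with the zero sized Error.

```rust
use std::fmt;

trait Context {
    /// Report a message.
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display;
}

#[derive(Debug, PartialEq)]
struct Error;

#[derive(Default)]
struct Capture {
    message: Option<String>,
}

impl Context for Capture {
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display,
    {
        self.message = Some(message.to_string());
    }
}

/// Hypothetical stand-in for a decoder: reads little-endian `u32`s from a slice.
struct SliceDecoder<'a> {
    data: &'a [u8],
}

impl SliceDecoder<'_> {
    fn decode_u32<C: Context>(&mut self, cx: &mut C) -> Result<u32, Error> {
        if self.data.len() < 4 {
            cx.message("unexpected end of input");
            return Err(Error);
        }
        let (head, tail) = self.data.split_at(4);
        self.data = tail;
        Ok(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
    }
}

fn decode_field<C>(cx: &mut C, decoder: &mut SliceDecoder<'_>) -> Result<u32, Error>
where
    C: Context,
{
    let value = decoder.decode_u32(cx)?;
    if value >= 10 {
        // The message goes to the context, the return value stays a ZST.
        cx.message(format_args!("should be smaller than 10, but was {value}"));
        return Err(Error);
    }
    Ok(value)
}
```

Feeding it the four bytes of 42u32.to_le_bytes() returns Err(Error), and the text of the failure is waiting in cx.message afterwards.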
Do you see what’s going on? All of our error handling and diagnostics, regardless of what they are, can be passed out through a pointer to the context. This is why I call the pattern “abductive diagnostics”, because the context argument effectively abducts the error from the function.
But it’s not for free. The cost we’ve imposed on our project is that the context variable needs to be threaded through every fallible function which needs to use it (something an effect system in Rust might someday remedy).
To improve on the cost / benefit of this pattern, let’s add more information to the context:
trait Context {
    /// Indicate that n bytes of input have been processed.
    fn advance(&mut self, n: usize);
}
And with that extend our Capture to keep track of this:
#[derive(Default)]
struct Capture {
    message: Option<(usize, String)>,
    at: usize,
}
impl Context for Capture {
    fn advance(&mut self, n: usize) {
        self.at = self.at.wrapping_add(n);
    }
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display
    {
        self.message = Some((self.at, message.to_string()));
    }
}
Now we can associate byte indexes to diagnostics. We’re really starting to build out our capabilities![4] The neat part here is that we added some really powerful capabilities to our system, while keeping the returned error a zero sized type.
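Here is a rough sketch of how the offset tracking plays out end to end. The functions read_u32 and decode_all are hypothetical helpers written for this example. The important part is that whatever consumes input also calls advance.

```rust
use std::fmt;

trait Context {
    /// Indicate that `n` bytes of input have been processed.
    fn advance(&mut self, n: usize);
    /// Report a message at the current position.
    fn message<T: fmt::Display>(&mut self, message: T);
}

#[derive(Default)]
struct Capture {
    message: Option<(usize, String)>,
    at: usize,
}

impl Context for Capture {
    fn advance(&mut self, n: usize) {
        self.at = self.at.wrapping_add(n);
    }

    fn message<T: fmt::Display>(&mut self, message: T) {
        self.message = Some((self.at, message.to_string()));
    }
}

struct Error;

/// Hypothetical reader: takes one little-endian `u32` off the front of `input`.
fn read_u32<C: Context>(cx: &mut C, input: &mut &[u8]) -> Result<u32, Error> {
    if input.len() < 4 {
        cx.message("unexpected end of input");
        return Err(Error);
    }
    let (head, tail) = input.split_at(4);
    *input = tail;
    // Every procedure that consumes input has to tell the context about it.
    cx.advance(4);
    Ok(u32::from_le_bytes([head[0], head[1], head[2], head[3]]))
}

/// Hypothetical decoder: reads `count` values and rejects any that are 10 or larger.
fn decode_all<C: Context>(cx: &mut C, mut input: &[u8], count: usize) -> Result<Vec<u32>, Error> {
    let mut values = Vec::with_capacity(count);
    for _ in 0..count {
        let value = read_u32(cx, &mut input)?;
        if value >= 10 {
            cx.message(format_args!("should be smaller than 10, but was {value}"));
            return Err(Error);
        }
        values.push(value);
    }
    Ok(values)
}
```

Decoding the eight bytes of 1u32 followed by 42u32 fails on the second value, and since both reads have already been accounted for the message is recorded together with offset 8.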
Next let’s see how this pattern can help us to capture errors without allocating on the heap.
I was promised no allocations
Right now there’s clearly a String there, which uses allocations! It’s even in the alloc crate.
This is true, but the neat part about having the context abduct our errors is that we gain the necessary control to store them wherever we want.
So let’s build a context which stores errors on the stack instead using arrayvec.[5]
use std::fmt::{self, Write};
use arrayvec::ArrayString;
#[derive(Default)]
struct Capture<const N: usize> {
    message: ArrayString<N>,
    message_at: usize,
    at: usize,
}
impl<const N: usize> Context for Capture<N> {
    fn advance(&mut self, n: usize) {
        self.at = self.at.wrapping_add(n);
    }
    fn message<T>(&mut self, message: T)
    where
        T: fmt::Display
    {
        self.message_at = self.at;
        self.message.clear();
        // Running out of capacity is not fatal, so the result is ignored.
        let _ = write!(&mut self.message, "{}", message);
    }
}
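If you want to see the truncation behavior without pulling in a dependency, the following sketch swaps arrayvec for a tiny hand-rolled buffer. StackString is a hypothetical stand-in I wrote for this example. It is not arrayvec's ArrayString and its behavior at capacity (keep whole characters that fit, drop the rest) is my own choice.

```rust
use core::fmt::{self, Write};

/// Hypothetical stand-in for a fixed-capacity string living on the stack.
struct StackString<const N: usize> {
    buf: [u8; N],
    len: usize,
}

impl<const N: usize> StackString<N> {
    const fn new() -> Self {
        Self { buf: [0; N], len: 0 }
    }

    fn clear(&mut self) {
        self.len = 0;
    }

    fn as_str(&self) -> &str {
        // Only whole characters are ever copied in, so this is valid UTF-8.
        core::str::from_utf8(&self.buf[..self.len]).unwrap_or("")
    }
}

impl<const N: usize> Write for StackString<N> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        for c in s.chars() {
            let mut tmp = [0u8; 4];
            let bytes = c.encode_utf8(&mut tmp).as_bytes();
            if self.len + bytes.len() > N {
                // Out of room: keep what fits and signal truncation.
                return Err(fmt::Error);
            }
            self.buf[self.len..self.len + bytes.len()].copy_from_slice(bytes);
            self.len += bytes.len();
        }
        Ok(())
    }
}

trait Context {
    fn advance(&mut self, n: usize);
    fn message<T: fmt::Display>(&mut self, message: T);
}

struct Capture<const N: usize> {
    message: StackString<N>,
    message_at: usize,
    at: usize,
}

impl<const N: usize> Capture<N> {
    const fn new() -> Self {
        Self { message: StackString::new(), message_at: 0, at: 0 }
    }
}

impl<const N: usize> Context for Capture<N> {
    fn advance(&mut self, n: usize) {
        self.at = self.at.wrapping_add(n);
    }

    fn message<T: fmt::Display>(&mut self, message: T) {
        self.message_at = self.at;
        self.message.clear();
        // A full buffer truncates the message; that error is deliberately ignored.
        let _ = write!(self.message, "{message}");
    }
}
```

A Capture::<64> holds the whole example message, while a Capture::<16> keeps only its first 16 bytes. Either way nothing is allocated on the heap, and only core is needed.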
We can imagine all kinds of ways of storing errors. Müsli comes out of the box with two that use different strategies:
- One which allocates, in AllocContext.
- One which stores errors and diagnostics on the stack, in NoStdContext.
But if you intend to integrate it into a strange environment you would very much be encouraged to implement your own Context.
Restoring what was lost
If we pay attention to the method we refactored above, we might note that while we gained the ability to abduct errors through the context, we lost two things as well.
struct Error;
fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, Error>
where
    C: Context,
    D: Decoder
{
    let value = decoder.decode_u32()?;
    if value >= 10 {
        return Err(Error);
    }
    Ok(value)
}
Do you see it?
- Regular errors which contain their own diagnostics can no longer be returned if we wanted to; and
- The method doesn’t guarantee that an error has been reported to the context. Oops.
The latter is no good. When using regular error types we’re forced to somehow produce an Error through some kind of constructor. Here we can just return the marker type and in the worst case forget to provide a message.
So let’s address this by modifying Context further:
trait Context {
    /// The error type produced by the context.
    type Error;
    /// Add ways to construct a `Self::Error`.
    fn message<T>(&mut self, message: T) -> Self::Error
    where
        T: fmt::Display;
}
And the corresponding changes to decode_field look like this:
fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, C::Error>
where
    C: Context,
    D: Decoder
{
    let value = decoder.decode_u32()?;
    if value >= 10 {
        return Err(cx.message(format_args!("should be smaller than 10, but was {value}")));
    }
    Ok(value)
}
Now the only way we can return from decode_field is by either producing Ok(u32), or Err(C::Error). And the Context is the only one which can produce a C::Error, so we don’t accidentally return a blank error marker without providing diagnostics.
In addition, do you remember that the Decoder also produces an error? The call to decode_u32 doesn’t actually compile. We have to handle that somehow as well. To do this, we extend our context further:
trait Context {
    type Input;
    type Error;
    /// Add the ability to report an error that can be converted to `Input`.
    fn report<T>(&mut self, input: T) -> Self::Error
    where
        Self::Input: From<T>;
}
We can now specify the Input type as the deserializer error that the context can abduct:
fn decode_field<C, D>(cx: &mut C, decoder: D) -> Result<u32, C::Error>
where
    C: Context<Input = D::Error>,
    D: Decoder
{
    let value = decoder.decode_u32().map_err(|err| cx.report(err))?;
    if value >= 10 {
        return Err(cx.message(format_args!("should be smaller than 10, but was {value}")));
    }
    Ok(value)
}
Of course in the real implementation we just pass along the cx variable to decode_u32. But this showcases how the pattern can be gradually introduced into existing code, which was of great help during refactoring.
What exactly the report implementation looks like I leave as an exercise to the reader, but with these additions there are now two more interesting contexts that Müsli provides:
- Same, which produces the same error (Context::Error) that it consumes (Context::Input), providing full backwards compatibility.
- Ignore, which simply records that an error has happened, but returns a zero sized marker type like before.
// Behaves just like the original version without `Context` support.
let mut cx = Same::default();
let value = decode_field(&mut cx, decoder)?;
// The error returned is a ZST.
let mut cx = Ignore::default();
let Ok(value) = decode_field(&mut cx, decoder) else {
    // deal with the fact that something went wrong.
};
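For the curious, here is one possible answer to that exercise, as a sketch rather than how Müsli actually does it. The Same and Ignore below are simplified re-implementations made for this example. FromMessage, DecodeError, Marker and the free function decode_u32 are all hypothetical names.

```rust
use std::fmt;
use std::marker::PhantomData;

trait Context {
    type Input;
    type Error;

    fn report<T>(&mut self, input: T) -> Self::Error
    where
        Self::Input: From<T>;

    fn message<T>(&mut self, message: T) -> Self::Error
    where
        T: fmt::Display;
}

/// Hypothetical helper trait: an error that can be built from a message.
trait FromMessage {
    fn from_message<T: fmt::Display>(message: T) -> Self;
}

/// Sketch of a context that hands back the same error type it is given.
struct Same<E>(PhantomData<E>);

impl<E> Default for Same<E> {
    fn default() -> Self {
        Same(PhantomData)
    }
}

impl<E: FromMessage> Context for Same<E> {
    type Input = E;
    type Error = E;

    fn report<T>(&mut self, input: T) -> Self::Error
    where
        Self::Input: From<T>,
    {
        E::from(input)
    }

    fn message<T>(&mut self, message: T) -> Self::Error
    where
        T: fmt::Display,
    {
        E::from_message(message)
    }
}

/// Zero sized error marker.
#[derive(Debug, PartialEq)]
struct Marker;

/// Sketch of a context that only remembers that something failed.
struct Ignore<E> {
    failed: bool,
    _input: PhantomData<E>,
}

impl<E> Default for Ignore<E> {
    fn default() -> Self {
        Ignore { failed: false, _input: PhantomData }
    }
}

impl<E> Context for Ignore<E> {
    type Input = E;
    type Error = Marker;

    fn report<T>(&mut self, _input: T) -> Self::Error
    where
        Self::Input: From<T>,
    {
        self.failed = true;
        Marker
    }

    fn message<T>(&mut self, _message: T) -> Self::Error
    where
        T: fmt::Display,
    {
        self.failed = true;
        Marker
    }
}

/// Hypothetical decoder error that carries its own message.
#[derive(Debug, PartialEq)]
struct DecodeError(String);

impl FromMessage for DecodeError {
    fn from_message<T: fmt::Display>(message: T) -> Self {
        DecodeError(message.to_string())
    }
}

/// Toy decoder: reads one little-endian `u32` from the start of `input`.
fn decode_u32(input: &[u8]) -> Result<u32, DecodeError> {
    match input {
        [a, b, c, d, ..] => Ok(u32::from_le_bytes([*a, *b, *c, *d])),
        _ => Err(DecodeError("unexpected end of input".to_string())),
    }
}

fn decode_field<C>(cx: &mut C, input: &[u8]) -> Result<u32, C::Error>
where
    C: Context<Input = DecodeError>,
{
    let value = decode_u32(input).map_err(|err| cx.report(err))?;
    if value >= 10 {
        return Err(cx.message(format_args!("should be smaller than 10, but was {value}")));
    }
    Ok(value)
}
```

The same decode_field now serves both worlds: with Same::<DecodeError>::default() the caller gets a full DecodeError back, and with Ignore::<DecodeError>::default() the caller gets a zero sized Marker and can check the failed flag.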
Diagnostics
Error messages by themselves are cool and all, but what we really want is more diagnostics. While it can be useful to know that something "should be smaller than 10, but was 42", this only helps us troubleshoot issues in the simplest cases. In most cases we don’t just need to know what went wrong, but where it went wrong.
To this end we can add more hooks to Context. And this is where it starts getting really interesting. In Müsli I’ve opted to add support for keeping track of the byte index and fully tracing the type hierarchy being decoded.
trait Context {
    fn advance(&mut self, n: usize);
    fn enter_struct(&mut self, name: &'static str);
    fn leave_struct(&mut self);
    fn enter_enum(&mut self, name: &'static str);
    fn leave_enum(&mut self);
    fn enter_named_field<T>(&mut self, name: &'static str, tag: T)
    where
        T: fmt::Display;
    fn enter_unnamed_field<T>(&mut self, index: u32, tag: T)
    where
        T: fmt::Display;
    fn leave_field(&mut self);
    fn enter_variant<T>(&mut self, name: &'static str, tag: T)
    where
        T: fmt::Display;
    fn leave_variant(&mut self);
    fn enter_map_key<T>(&mut self, field: T)
    where
        T: fmt::Display;
    fn leave_map_key(&mut self);
    fn enter_sequence_index(&mut self, index: usize);
    fn leave_sequence_index(&mut self);
}
So using one of the contexts (like AllocContext) provided above, we can get error messages like this:
.field["hello"].vector[0]: should be smaller than 10, but was 42 (at byte 36)
Or even this, which includes the names of the Rust types which were involved:
Struct { .field["hello"] = Enum::Variant2 { .vector[0] } }: should be smaller than 10, but was 42 (at byte 36)
Or nothing at all! If you don’t use it, you don’t pay for it:
decoding / encoding failed
Conveniently our Encode and Decode derives fill out all these context calls for you. So tracing is something you get for free.[6]
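To give a feel for what such a tracing context does with those calls, here is a heavily simplified sketch. Tracing and Step are hypothetical, only a handful of the hooks are covered, and a single leave method stands in for the various leave_* calls. The real implementation has more going on.

```rust
use std::fmt::{self, Write};

/// One step in the path to the value being decoded (simplified, hypothetical).
enum Step {
    Field(&'static str),
    Key(String),
    Index(usize),
}

#[derive(Default)]
struct Tracing {
    path: Vec<Step>,
    at: usize,
    report: Option<String>,
}

impl Tracing {
    fn advance(&mut self, n: usize) {
        self.at = self.at.wrapping_add(n);
    }

    fn enter_named_field(&mut self, name: &'static str) {
        self.path.push(Step::Field(name));
    }

    fn enter_map_key<T: fmt::Display>(&mut self, key: T) {
        self.path.push(Step::Key(key.to_string()));
    }

    fn enter_sequence_index(&mut self, index: usize) {
        self.path.push(Step::Index(index));
    }

    /// Stands in for `leave_field`, `leave_map_key` and friends.
    fn leave(&mut self) {
        self.path.pop();
    }

    fn message<T: fmt::Display>(&mut self, message: T) {
        let mut out = String::new();
        for step in &self.path {
            // Writing to a `String` cannot fail, so the results are ignored.
            let _ = match step {
                Step::Field(name) => write!(out, ".{name}"),
                Step::Key(key) => write!(out, "[{key:?}]"),
                Step::Index(index) => write!(out, "[{index}]"),
            };
        }
        let _ = write!(out, ": {message} (at byte {})", self.at);
        self.report = Some(out);
    }
}
```

Entering the field "field", the map key "hello", the field "vector" and index 0, advancing 36 bytes and then reporting the usual message produces the first trace shown above. Each leave call pops the innermost step again, so sibling fields don't pile up in the path.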
The cost of abstractions
I’m keenly aware that threading the context variable through the entire framework can be a bit messy. It took me almost two days to refactor all the code in Müsli. In the long run the ergonomics of it might simply pan out to not be worth it, or we’ll try to change something to make it easier. For now I don’t know what.
Luckily most users will not interact with the plumbing of the framework. They should primarily focus on using the high level derives available. A smaller number of users will end up writing hooks and that can be harder because there are a lot of things going on:
use musli::{Context, Mode, Encoder, Decoder};
struct MyType {
    /* .. */
}
fn encode<'buf, M, E, C>(my_type: &MyType, cx: &mut C, encoder: E) -> Result<E::Ok, C::Error>
where
    M: Mode,
    C: Context<'buf, Input = E::Error>,
    E: Encoder,
{
    todo!()
}
fn decode<'de, 'buf, M, C, D>(cx: &mut C, decoder: D) -> Result<MyType, C::Error>
where
    M: Mode,
    C: Context<'buf, Input = D::Error>,
    D: Decoder<'de>,
{
    todo!()
}
Two lifetimes, and three generics by default, and a strange Input parameter for Context! Did I mention that Müsli is still an experimental project?
Still, I’m hoping to discover a better way of doing this from an ergonomics perspective. If you have suggestions, I’d love to hear them!
Conclusion
Thank you for reading!
I can confidently say that with this, Müsli has the ability to produce state of the art diagnostics with almost no effort from the user. I think most people who’ve worked with serialization know that without good diagnostics it can be very hard to figure out what’s going on.
I will round off by displaying this error I’m currently plagued by, which is caused by a binary format using serde. It’s actually the reason why I started to pursue this topic:
Error: ../../changes.gz
Caused by:
invalid value: integer `335544320`, expected variant index 0 <= i < 11
I don’t know which field, in which struct, being fed what data, caused that error to happen. Needless to say I’ll be switching that project to use Müsli.
The above certainly can be improved, and we can even add tracing to serde. But not without compromises. And I want everything.
[1] Website could support mocking. But maybe that’s not viable either.
[2] Since that is a legitimate effect we should preserve of calling the function.
[3] This isn’t something that comes for free, we now have to live with the cognitive load of dealing with a weird interface. I think that most of the time it might not even be worth it.
[4] We do need to make sure to call advance where appropriate, like for every procedure which reads input.
[5] The exact behavior of writing to an ArrayString at capacity is a bit fuzzy, but at the very least anything that is outside of its capacity will be dropped. The exact boundary of which could be improved.
[6] There are more interesting details in the full implementation of Context.