Custom allocators in Rust

Lately I have been looking into custom allocators in the Rust. Recent internet chatter about the Zig programming language has brought allocators back to my attention, but custom allocators are an old topic. I remember playing with that a long while back in my C++ days after watching a talk by Andrei Alexandrescu. I found it a satisfying thing to hack on because you can start off with simple things that work quickly, build them out of composable parts and of course the rabbit hole is deep enough that after spending a lot of time you get to feel great about pulling advanced stunts. That said, I'm already digressing. I am not going to write about building memory allocators today, just about writing data structures that can accept different allocators. This isn't about replacing the global allocator either, mind you, but making it possible for different parts of a program to chose their memory allocation strategies. I'll first superficially ramble a bit about the the motivation and state of affairs in the Rust ecosystem, then I'll go over an experiment I made to integrate custom allocators in lyon's tessellator in a few different ways.

The ubiquitous global allocator

I've long had a mild dislike for the "one and only global allocator" strategy. It's a pretty simple and decent default, to a point that a lot of programmers forgot that there are alternative ways to work with memory. I look at performance profiles a lot at work and allocation/deallocation is at the same time always prominent and also kind of elusive. It is hard to distill down simple benchmarks for allocation overhead, because of much of that overhead comes from cache misses and contention between multiple threads using the same allocator.

To mitigate allocation performance issues, it is very common to build algorithms in ways that allow keeping the data structures that hold intermediate state around. This way subsequent runs can reuse the allocations from the previous ones. lyon's tessellator is written like that for example. In WebRender we also try to keep temporary allocations around. I'll refer to this approach as "recycling allocations" and it is basically about accepting that interacting with allocators is slow and making contraptions to avoid it.

Another option is to take a step back and think a bit about whether memory allocations really need to be slow. What are the different allocation patterns? Are they all best served by a general purpose allocator? Bump allocators are very good at providing very fast short-lived allocations. They also can work very well with groups of allocations with a common and well defined scope, for example a frame in a video game or a graphics rendering engine. Most allocations stay in the thread they are created with. Using a simpler thread-local allocator wherever it makes sense can provides some wins. Of course there are also long-lived allocations with mostly unpredictable lifetimes backing resources that may or may not move to different threads. These are well served by a global allocator, are they the majority of allocations in majority of programs? I don't think so and I'm not the only one. Some languages (Zig, Jai and others) have made that it a pillar of their library design to make it possible and convenient to pick the right allocation strategy everywhere allocation needs to happen.

Using custom allocators to avoid paying the cost of memory allocations for temporary data sounds like more work than recycling for about the same benefits, but the benefits can go beyond merely skipping the allocator overhead. When keeping around an allocation for later use, we take memory that is nice and warm in the cache hierarchy after its use and enforce that only future uses of this specific code will be able to use it again. Something else that runs immediately after could have benefited from that memory for its own temporary allocations, but it will have to reach for other pieces of memory that may not be as readily available. When running on a tight memory budgets, hoarding temporary data structures can also occupy precious resources.

Of course all of these advantages require understanding allocation lifetime patterns and picking the right strategy for the each part of the code. It's not like slapping a bump allocator under an algorithm will always make it faster. That algorithm could be made slower if the underlying allocator is fast but fails to reuse memory that is warm in CPU caches. It's an extra mental load for the programmer for sure and it may not be the right amount of effort for everyone. Whether or not this option is good for everyone, I think that we will be in a good place when that option is easily reachable.

Custom allocators and Rust

Rust, like most languages chose to start with the stance that most people should not have to think about how memory allocation is happening. All allocations go through a default general-purpose global allocator. The good news is that the standard library provisioned for evolving into eventually letting you opt into custom allocators.

There are two proposals for adding/exposing support for custom memory allocators. The one I am most interested in is the allocator-api feature which can be used with nightly versions of rustc today. The allocator-api2 crate can be used as a polyfill on stable Rust while waiting for the feature to be stabilized. There is also the storage proposal which aims to provide, a superset of allocator-api's features, mainly to be able to express things like SmallVec directly in the Vec type using a custom storage.

In a nutshell, the main conceptual differences between the two proposals are:

allocator-api adds a fairly straightforward Allocator trait with methods for allocating, deallocating and reallocating memory. You ask that trait to allocate memory and it gives you a pointer.
The storage proposal instead wants to replace the pointer, so that it can express inline allocations (that can be moved along with the data structure).

In my humble opinion I prefer the allocator-api proposal, because it is much simpler while covering the majority of the important features. I understand the appeal of a more general solution, but the added features don't quite pull the weight of the extra complexity for me. SmallVec (and equivalents for other data structures) can be a separate implementation, it does not need to be unified into Vec. Of course, that's a very personal view, I don't think I'm going to sway anyone that's already emotionally invested in the debate with this argument, and that's not the aim of this blog post.

Allowing custom allocators in a Rust data structure.

Apologies for the overlong preamble, here comes the part that this post is really about.

There are a few different ways to go about making data structures accept custom allocators in rust. In this post I am focusing on the particular case of temporary data structures. The stuff that is tied to the internal needs of an algorithm and typically have very short lifetimes. Keep this in mind because the mentioned trade-offs would be different for the longer lived data structures that represent the interesting state of your programs. First I am going to simplify the problem down to an imaginary Tessellator struct containing a single vector so that the code snippets remain approachable. I leave extrapolating to a more realistic/complex case as an exercise to the reader (or if you feel adventurous, go through the branches in lyon's repository that are linked later in the post). Then I'll briefly expand on how various approaches played out with lyon's fill tessellator.

Hard coded custom allocator

struct Tessellator {
    data: Vec<u32, CustomAllocator>,
}

This is as simple as it gets. Not very flexible, sure, but flexibility isn't always the goal. When creating small modular crates it can be hard to pick this approach, but for code that belongs to an application this is typically the most efficient and least amount of work.

Generic parameter on the data structure

struct Tessellator<A: Allocator = Global> {
    data: Vec<u32, A>,
}

This version is the most flexible: it abstracts over whether the data structure is owned or borrowed, sendable, etc. It takes advantage of the common case where the Allocator type is zero-sized (like Global, for example).

I would default to this approach with data structures that are already generic and when I don't want to enforce constraints about things like Send and Sync.

Adding a generic parameter to a complex structure that otherwise did not have any can be a bummer, though.

It adds noise in the code and documentation.
For something like custom memory allocators that are likely not going to be used by the majority of users of this API, it isn't very satisfying (to me at least) to put this advanced parameter under everyone's nose by making it show at the type level.
For things that contain a large amount of code, it is easy to bloat the binary size by (accidentally) causing multiple instantiations of the entire thing. Not that in this case the power is still in the user's hands, they can decide to always use the same allocator or always use a &dyn Allocator as the allocator type.

On the performance side the pros and cons are:

pro: sometimes static dispatch produces faster code than dynamic dispatch (the compiler has more information work with after all).
con: sometimes static dispatch produces slower code than dynamic dispatch. Yes. Slower code. That's actually the case with lyon's tessellator and the impact in this case is hard to ignore. More on that at the end of the post.

Borrowed trait object

Static dispatch is not the only way to interact with allocators. Lyon's fill tessellator already treats allocation and deallocation as slow operations, so the overhead of dynamic dispatch should not matter for performance. In this context, solutions that don't involve generics can be appealing.

pub struct Tessellator<'a> {
    data: Vec<u32, &'a dyn Allocator>,
}

impl<'a> Tessellator<'a> {
    pub fn new_in(allocator: &'a dyn Allocator) -> Tessellator<'a> {
        Tessellator { data: Vec::new_in(allocator) }
    }
}

impl Tessellator<'static> {
    // A good default. The static lifetime is functionally equivalent to not having
    // a lifetime constraint which means the borrow checker will be off our back for
    // objects coming out of this constructor.
    pub fn new() -> Tessellator<'static> {
        Tessellator::new_in(&Global as &'static dyn Allocator)
    }
}

// An alias for the common case can also help making things look a bit simpler for the
// common case.
type SimpleTessellator = Tessellator<'static>;

// This one can move around freely.
let mut a = Tessellator::new();

// This one is tied to the allocator handle on the stack.
let allocator = SomeBumpAllocator::new(); 
let mut b = Tessellator::new_in(&allocator);

Again, adding a lifetime parameter for purpose of custom allocators may or may not rub you the wrong way, just like the generic parameter of the previous solution. It does bother me a bit in this case, even with the type alias. For the most part I can live with the extra noise of the lifetime parameters in the code, but I care deeply about how public facing APIs are presented to the users, in particular in the generated documentation. Conceptually I would like to have something that looks like Tessellator and TessellatorWithAllocator<'alloc> where the simple type signature describes the simple common case while the more complicated signature describes a more advanced usage. I could have the complicated TessellatorWithAllocator<'a> be the real structure and the simpler Tessellator be the alias, but as far as I know rustdoc can only show the methods on the actual struct, so the advanced parameter is forced on the user in a way that I am unhappy with.

Note that since we can default to 'static (and that works with Global), the lifetime does not mean we suddenly made our type difficult to move around, I think that it's pretty neat.

An added constraint of this scheme is that the data structure must decide up front whether it needs to be Send. For example the one in the snipped above is not Send.

Another important thing to note is that when using a trait object in the allocator field, types like Vec and Box (that no doubt your complex data structure uses in multiple places) put on some weight (a trait object contains two pointers so 16 bytes on most things Rust runs on these days) compared to Global's zero bytes. Whether that's an issue depends on how many of these find their way into your data structures and how much the performance is sensitive to the size of the structures. It did not make a measurable difference in the context of lyon's fill tessellator. This potential bloat issue assumes your data structure is built upon types like the standard containers which store the allocator, but nothing prevents you from using equivalent containers that do not store the allocator. More on this soon, but first I want to address the extra lifetime parameter issue.

Using two structures

The documentation issue I had with the generic and borrowed trait object approaches can be solved by using two structs, one being a thin wrapper around the other. This gives full control over what the doc looks like at the cost of more boilerplate.

struct Tessellator {
    inner: TessellatorWithAllocator<'static>,
}
// or
struct Tessellator {
    inner: TessellatorWithAllocator<Global>,
}

Static trait object

Dynamic dispatch was fine but for some reason you do not want to make the structure parameterized over a lifetime. No problem, you can just force the lifetime to be static:

struct Tessellator {
    data: Vec<u32, &'static dyn Allocator>
}

impl Tessellator {
    // A good default.
    pub fn new() -> Tessellator {
        Tessellator::new_in(&Global as &'static dyn Allocator)
    }

    /// Using a custom allocator.
    ///
    /// Note: Users of this type can decide at their own risk to cast away the static
    /// lifetime. This will not cause problem with this type as long as the provided
    /// allocator outlives the data structure.
    pub fn new_in(allocator: &'static dyn Allocator) -> Tessellator {
        Tessellator { data: Vec::new_in(allocator) }
    }
}

This version is actually a subset of the borrowed trait object restricted to the static lifetime and with nicer syntax (no <'a> everywhere).

This one is probably closest to what one would write in languages like C++. The data structure just assumes the allocator will outlive it, and it is up to the user to either use an static allocator, or pretend to using unsafe code to cast away the lifetime while making sure that the allocator outlives the data structure without the compiler's support.

I know many in the Rust community will frown upon this approach, but to be honest I don't think that it is a terrible solution in the context of an advanced feature like custom allocation strategies. Tessellator does not expose an unsafe API, but it documents that if its users were to break the rules they'd simply have to make sure the allocator outlives the data structure. In any other language with this kind of control over memory management, this contract would have to be manually upheld by users of the API and it is considered normal.

Trait objects à la Zig

I wrote earlier about how the extra space occupied by the allocator field in the vectors, boxes and maps that make your data structure may or may not be an issue. What can we do about it? Well, we could use a similar container types that do not store their own allocator. Matklad wrote recently about how Zig provides a vector container that takes the allocator as a parameter of all of the methods that may need to allocate or deallocate memory. There are tradeoffs to this approach. In Rust, almost all of these allocating methods have to be marked unsafe since memory safety depends on the user correctly passing the same allocator the container was created with. In the context of a larger data structure that's a pretty trivial contract to uphold, the outer data structure would probably just store the allocator at its root and pass it to the internal containers. Let's have a look:

struct Tessellator<'a> {
    allocator: &'a dyn Allocator,
    data: RawVector<u32>,
}

impl Tessellator<'static> {
    // A good default.
    pub fn new() -> Tessellator<'static> {
        Tessellator::new_in(&Global as &'static dyn Allocator)
    }
}

impl<'a> Tessellator<'a> {
    pub fn new_in(allocator: &'a dyn Allocator) -> Tessellator<'a> {
        Tessellator {
           data: RawVector::with_capacity(128, allocator),
           allocator,
        }
    }
}

impl<'a> Drop for Tessellator<'a> {
    fn drop(&mut self) {
        // Don't forget to free the memory!
        unsafe {
            self.data.deallocate(self.allocator);
        }
    }
}

If you would like to play with an unsafe manually allocated vector like this, I wrote one in the shared_vector crate. This wasn't the initial goal of that crate, but it contains a vector type mostly similar to the standard library's and it was easy for me to refactor it into a RawVector<T> that contains most of the code and takes the allocator explicitly on top of which the Vector<T, Allocator> type is implemented as a thin wrapper. That code is rather fresh so there may be some bugs but fuzzing isn't finding issues so far.

Results

I wanted to experiment with custom allocators in a non-committal way. To do so I tried Some of the approaches I described earlier in lyon's fill tessellator. It is the perfect size for this experiment: big enough to show how things would play out in a non-trivial chunk of code, but small enough to fit in a manageable amount of time while trying a couple of approaches.

The fill tessellator implements an algorithm that generates a triangle mesh from a vector path (the kind of paths you would find in a canvas-style API or an SVG document). Internally it needs to allocate memory for a lot of different things, which I won't bore you with in this post. As a mentioned earlier, the tessellator is a structure instead of a pure function so that it can retain memory allocations in subsequent uses.

The code for the experiment can be found in the following branches: - Generic parameter: alloc-generic branch - Borrowed trait object: alloc-trait-object branch - Zig-style manually managed trait object: alloc-raw branch

The most adventurous readers might find the courage to read and compare all of that code, I wouldn't blame you if you don't. Here are my thoughts after this little experiment:

Adding support for custom allocators can be a bit tedious, but it is rather easy.
I believe that the yet-to-be-stabilized Allocator trait is the right abstraction.
I had been waiting patiently for the nightly-only allocator-api feature to stabilize, but since that's just a library thing anyone can use the stable polyfill from the allocator-api2 crate today in stable rust. I'm happy I stumbled upon it just in time for this experiment.
The fill tessellator is not just one struct like in the simplified example from earlier there are a number of structs only visible internally. Adding a generic or lifetime parameter to all of them is tedious and adds noise that I am not very pleased with. In an ideal world, adding support for custom allocators to an algorithm is so simple that it becomes an idiomatic thing that crate implementors just do even when they don't necessarily need it, just like implementing the Default trait. In this ideal world there may even be a clippy lint that nudges you towards using custom allocators in some places.
Generalizing to a whole program, If everything that allocates memory needs to have generic or lifetime parameters, it may be too tedious as well, or perhaps I have to get used to reading <things> everywhere.
At the API level, I am not super fond of adding a generic or lifetime parameter either for an advanced knob that isn't at the forefront of the tessellator's functionality.

Performance

I mentioned somewhat surprising performance results for the generic version of the tessellator's allocator parameter. What's up with that?

The tessellator recycles its allocations, so using a faster or slower allocator should only significantly affect the performance the first time the tessellator runs. The repository contains a few benchmarks which reuse the same allocator many times so it would make sense for this experiment to not affect these benchmarks. Yet, I have been bitten in the past by surprising performance difference between static and dynamic dispatch so I made sure to run the benchmarks again as a sanity check. In particular I there was a significant performance increase a few years ago when the tessellator's output parameter was converted from a generic parameter to a trait object (making the tessellator entire free of generic parameters). I observed the same effect when comparing the different approaches: Adding a generic parameter for the allocator regresses the benchmarks by a solid 5 to 8 percents. Every other approach performed roughly the same (as expected since allocator performance should not be a factor of these benchmarks).

It's difficult to tell what's going on here. There if a fair amount of code involved, so eyeballing the generated code is too much hassle for me to commit to right now. Hopefully there exists some tools for looking at how compiler optimizations played out at a large-ish scale. I can't make particularly educated guesses here, all I know is that:

The benchmarks use a single instance of the generic parameter, so if bloat is to blame it comes from poor inlining decisions and not multiple generic instantiations.
It doesn't seem to matter what parameter is made generic. Any generic parameter on the tessellator appears to throw a wrench into some heuristics and snowballs into slower code for this particular code.
I don't have reasons to believe that this generalizes to other code. All I would gather from that is that contrary to popular belief, static dispatch does not always produce faster code than dynamic dispatch (especially at a larger scale than simple code snippets). If performance matters, don't make assumption, try things, measure and then decide.

If you are interested, here are a couple of performance profiles (the link will probably go stale after a while):

Generic allocator parameter: (profile link)

test fill_events_01_logo                ... bench:      36,856 ns/iter (+/- 6,452)
test fill_events_02_logo_pre_flattened  ... bench:      22,195 ns/iter (+/- 503)
test fill_events_03_logo_with_tess      ... bench:      87,390 ns/iter (+/- 1,464)
test fill_tess_01_logo                  ... bench:      71,749 ns/iter (+/- 1,279)
test fill_tess_03_logo_no_intersections ... bench:      82,140 ns/iter (+/- 379)
test fill_tess_05_logo_no_curve         ... bench:      34,081 ns/iter (+/- 857)
test fill_tess_06_logo_with_ids         ... bench:      70,721 ns/iter (+/- 1,166)```

Non-generic: (profile link)

test fill_events_01_logo                ... bench:      29,859 ns/iter (+/- 835)
test fill_events_02_logo_pre_flattened  ... bench:      14,986 ns/iter (+/- 437)
test fill_events_03_logo_with_tess      ... bench:      72,470 ns/iter (+/- 5,095)
test fill_tess_01_logo                  ... bench:      56,008 ns/iter (+/- 1,347)
test fill_tess_03_logo_no_intersections ... bench:      66,380 ns/iter (+/- 2,144)
test fill_tess_05_logo_no_curve         ... bench:      28,297 ns/iter (+/- 171)
test fill_tess_06_logo_with_ids         ... bench:      54,785 ns/iter (+/- 2,835)

Conclusion

Fast software is written with memory allocations in mind. Recycling allocations is an easy way to paper over allocator overhead but picking the right allocator for specific parts of the code is another option which can be even more effective. Ideally developers would have the possibility to easily reach for both and do what makes most sense in context.

I think that custom allocators are a pretty cool feature for a performance-focused library to add. Very few crates do today, probably in part because the standard library hasn't stabilized custom allocators yet, but also because it is very easy and convenient to not think about it at all. There is certainly a niche to fill in the ecosystem for specialized custom allocators. There is blink-alloc and probably a few others. As this niche fills up and the the allocator API stabilizes, I hope that more people will join the fun. Will I integrate custom allocators in my own crates? I think so. I still need to ponder some more on the different ways to do it. I don't think that one way is strictly superior to the others. For some things a generic parameter is the obvious choice while for others I find that making an entire chunks of a program generic for the sake of supporting custom allocators is not sustainable. Dynamic dispatch for custom allocators can work well in many cases. In lyon I am leaning towards using a generic parameter in the Path data structure and dynamic dispatch in the tessellator.

And there is this thorny generics performance situation, but that's more anecdotal. The point is that making code generic over something can have somewhat unexpected consequence (good or bad) the only right thing to do is to validate assumptions. I still hope that static dispatch can produce faster code more often than not.

Comments/reactions/discussions about this post on reddit

Post scriptum

Here are few notes following some of the reddit discussions.

A discussion with one of the storage proposal authors provided me with more context and put the proposal in a better light:

The storage API does not need to be as complicated as the code I read in one of its versions.
Even though there is an unresolved issue with Box coercion, that problem only affects some specific cases. The proposal could be stabilized without necessarily resolving it.
Ideally the Storage and Allocator traits would both exist and some Storage implementations could leverage Allocator ones. Storage would mainly allow distinguishing between inline and allocator provided memory.

I was also reminded of the bumpallo crate, which also implements support for the unstable Allocator trait. The crate does not rely on allocator-api2 for polyfilling the Allocator trait. This underlines the need for stabilizing the Allocator trait to let a healthy ecosystem develop on common foundations.

Another interesting highlight: okaoka allows switching between allocators using a hook on the global allocator. It does not provide the type of granularity that I was after, it allows injection a different allocator for third party code that does not provide options. None of the approaches discussed in this post let you do that.

Posted on: Thu 06 April 2023

Category: rust