In this post I'll write about a piece of the low level details of a hypothetical rust 2d graphics crate built on top of gfx-hal. Gfx provides a vulkan-like interface implemented on top of vulkan, d3d12, metal or flavors of OpenGL. Just like the previous post, this is in the context of recent discussions about a 2d graphics crate in rust.
I won't actually write much about 2d graphics specific things this time, because a lot of these low level concerns are agnostic to whether the rendered content is in two or three dimensions. I'll mostly focus on memory management and command submission.
The low level piece I am going to talk about isn't a 2d graphics API that most users would play with, but rather a base component on top of which various rendering techniques could be implemented (for example using lyon, pathfinder or some other approach), while staying independent of these rendering techniques.
Organizing the input of drawing commands on the GPU
Let's have a look at WebRender. Most drawing primitives in WebRender use very simple geometry (axis-aligned rectangles) with shaders of varying complexity that compute directly on the GPU what happens to the pixels covered by the geometry. If you've ever done serious work with GPUs, one of your primary concerns was most likely to batch rendering commands and avoid state changes. To render a few thousand rectangles, it would be terribly inefficient to have a loop on the CPU that sets some parameters and kicks off a drawing command for each rectangle. So what we do is write a lot of parameters into buffers which are sent to the GPU, and kick off a few drawing commands that will each render maybe thousands of rectangles in one go using instancing. The drawing parameters we write into the buffer contain information such as the position and size of the rectangle in layout space, a transformation to go from layout space to screen space, some flags about whether some anti-aliasing must be done, the positions and sizes of some source images in a texture atlas if need be (for example to apply a texture or a mask), the z-index of the primitive to write into the depth buffer, etc. The vertex shader uses an instance id to find the right information in a per-instance parameter buffer, transforms the geometry (a unit quad) into the right rectangle on screen, and forwards some data to the fragment shader, which executes the per-pixel logic (read from the source textures, apply some effect, write the output color, etc.).
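To make this a bit more concrete, here is a minimal sketch of what one of those per-instance parameter records could look like. This is a hypothetical layout of my own, not WebRender's actual structures, and the field choices are just examples:

```rust
// A hypothetical per-instance record for an axis-aligned rectangle primitive.
// The CPU writes one of these per rectangle into a buffer that is sent to the
// GPU, and the vertex shader uses the instance id to fetch the record,
// stretch a unit quad into the right screen rectangle and forward the rest
// to the fragment shader.
#[repr(C)]
#[derive(Copy, Clone)]
struct RectInstance {
    // Position and size of the rectangle in layout space.
    rect_origin: [f32; 2],
    rect_size: [f32; 2],
    // Index of a layout-to-screen transform stored in another buffer.
    transform_id: u32,
    // Source image rectangle in the texture atlas (texture or mask), in texels.
    src_rect: [u16; 4],
    // Z-index written to the depth buffer, plus flags (e.g. anti-aliasing).
    z_index: u16,
    flags: u16,
}
```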
This approach is fairly generic, not particular to WebRender (a lot of games do this sort of thing), and works quite well. What's interesting here is that if you get to write the shaders yourself, you have a lot of flexibility in how the input data is organized in GPU memory. You can put all of the parameters for a given instance contiguously, add levels of indirection, share some common parameters among many instances, or even devise your own compression scheme for your data. All that matters is for the CPU code to know how to write the bytes and for the shader to know how to find the data in GPU memory and interpret it the right way; it is then up to you to decide what trade-offs to make between simplicity, memory usage and data locality.
I like to think of the problem of submitting commands to the GPU as some sort of encoding/decoding problem: there is data on one side, some binary representation of it in GPU buffers, and shaders that need to read it back correctly on the other. As such, submitting commands is very tied to GPU memory management since it amounts to writing the correct information in the right place. For the purposes of a 2d graphics library, what kind of representation would we want? Well, we don't know, because the shape of the data written into GPU memory may depend on the details of the rendering technique itself, and I mentioned earlier that I want this low level component to be agnostic to that. Some other layer above will deal with it, and for now we can focus on providing a way to write data into these GPU buffers.
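One way to stay agnostic is to only hand out correctly placed, writable byte ranges and let the layer above decide how its drawing parameters are encoded into them (and how its shaders decode them). A rough sketch of what that interface could look like, with names that are entirely hypothetical:

```rust
// Hypothetical handle to a sub-range of a GPU buffer, expressed in bytes.
pub struct BufferRange {
    pub buffer_id: u32,
    pub offset: usize,
    pub size: usize,
}

// The low level layer only deals in bytes; the rendering code above owns the
// encoding scheme and its shaders own the matching decoding logic.
pub trait GpuBufferWriter {
    /// Reserve `size` bytes with the given alignment and return the range.
    fn allocate(&mut self, size: usize, align: usize) -> BufferRange;
    /// Borrow the CPU-visible (staging) memory backing the range for writing.
    fn write(&mut self, range: &BufferRange) -> &mut [u8];
}
```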
Dealing with GPU memory allocation
Let's first have a look at some properties I want from a low level GPU memory management system:
- Be able to write commands from several threads in parallel.
- Be able to separate memory that is rarely updated from memory that we update often in order to minimize transfer and choose the right kind of memory heap.
- The system should be embeddable in another application that uses gfx.
- Ideally the shader should not have to care about whether a particular property is static or dynamic.
Legacy graphics APIs had a lot of memory management magic done in the driver. Modern APIs such as vulkan move this responsibility to the user of the API (us). Simply allocating a lot of small memory segments has a lot of overhead. AMD's recommendation is to allocate large chunks of 256 MiB and manage sub-allocations within these coarse chunks, others talk about 128 MiB blocks, and nVidia's recommendation is to manage all allocations out of as few large contiguous memory chunks as possible.
So we can decide to have these large memory blocks with various allocation schemes within them, from a simple bump allocator to a general purpose fully-featured allocator.
For per-frame data which is overwritten each time, a bump allocator looks like a good fit: it is simple, fast, and we can easily write a lock-free one that is usable in parallel. For completely static (as in very long-lived) allocations, bump allocators can also work well since we don't need to worry about fragmentation. The area in the middle is a bit trickier. I don't have a precise, definitive allocation strategy to provide here, other than that, in my opinion, nesting simple and efficient allocators is the way to go. Group memory allocations into blocks of the same lifetime that get deallocated in one go, keep the complicated parts of memory management for fewer, coarser chunks, and try to group allocations by update frequency. In other words, use specific, simple and deterministic allocation strategies rather than relying on a single complicated one-size-fits-all allocation interface, as if you had jemalloc for your GPU. Making such a general purpose allocator is very hard, and ensuring good and deterministic performance characteristics is even harder.
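As an illustration, here is a minimal sketch of a lock-free bump allocator over a fixed-size block, the kind of thing that works well for per-frame data written from several threads. This is just an example of the idea, not code from an existing crate:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A lock-free bump allocator over one fixed-size block of GPU memory.
// Offsets only move forward; the whole block is recycled at once (for example
// at the end of the frame), so there is no per-allocation free.
struct BumpAllocator {
    cursor: AtomicUsize,
    size: usize,
}

impl BumpAllocator {
    fn new(size: usize) -> Self {
        BumpAllocator { cursor: AtomicUsize::new(0), size }
    }

    /// Returns the offset of the allocation within the block, or None if the
    /// block is full. `align` is assumed to be a power of two.
    fn allocate(&self, size: usize, align: usize) -> Option<usize> {
        let mut current = self.cursor.load(Ordering::Relaxed);
        loop {
            // Round the cursor up to the requested alignment.
            let offset = (current + align - 1) & !(align - 1);
            let end = offset.checked_add(size)?;
            if end > self.size {
                return None;
            }
            // Try to move the cursor forward; retry if another thread won.
            match self.cursor.compare_exchange_weak(
                current, end, Ordering::Relaxed, Ordering::Relaxed,
            ) {
                Ok(_) => return Some(offset),
                Err(actual) => current = actual,
            }
        }
    }

    /// Reset the block for reuse, once the GPU is done reading from it.
    fn reset(&mut self) {
        *self.cursor.get_mut() = 0;
    }
}
```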
This idea of nested allocators is a good fit for something that can be embedded into another graphics engine. For example the coarse allocations could be requested from the embedder and the 2d renderer would manage its own allocations inside of them. For multi-threading this is also good: some thread-safe allocator can allocate coarse chunks in parallel, and each chunk can then be managed independently on its own thread by a nested allocator that would be hard to make thread-safe (and doesn't need to be).
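In terms of interfaces, the nesting could look something like the sketch below: the embedder (or a default implementation) only hands out coarse chunks, and each worker thread sub-allocates within its own chunk with a simple single-threaded scheme. All names here are hypothetical:

```rust
#[derive(Copy, Clone)]
struct ChunkId(u32);

// Hypothetical interface exposed by the embedder (or a default implementation):
// it only deals in large chunks, so it is the only part that must be thread-safe.
trait CoarseAllocator {
    /// Allocate a coarse chunk of `size` bytes and return an opaque id for it.
    fn allocate_chunk(&self, size: usize) -> ChunkId;
    fn free_chunk(&self, chunk: ChunkId);
}

// Each worker thread owns one chunk and sub-allocates within it using a plain
// single-threaded bump pointer; it only goes back to the CoarseAllocator when
// its chunk is exhausted.
struct ThreadLocalAllocator {
    chunk: ChunkId,
    cursor: usize,
    size: usize,
}
```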
Some allocations are best managed manually (explicit allocation and deallocation), while for some other things, using a cache can be more interesting: each frame, the CPU side requests that the allocations that are used within the cache stay alive, and if an allocation isn't used for a long time, the cache implicitly deallocates it. The next time the allocation is requested, the cache will politely tell the source of the request that the allocation doesn't exist anymore, a new allocation is made and the data is transferred again. A lot of the GPU memory management in WebRender is done that way. Having the right heuristics about when to expire cache entries is not always simple.
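To illustrate the request/expire cycle, here is a hand-wavy sketch of such a cache keyed by a frame stamp. It is not WebRender's actual texture cache, just the general shape of the idea:

```rust
use std::collections::HashMap;

// A hypothetical cache of GPU allocations keyed by some item id. Entries that
// haven't been requested for a while are evicted; the caller is then told to
// allocate again and re-upload the data on the next request.
struct AllocationCache {
    entries: HashMap<u64, CacheEntry>,
    current_frame: u64,
    max_unused_frames: u64,
}

struct CacheEntry {
    offset: usize,
    size: usize,
    last_used_frame: u64,
}

enum CacheResult {
    /// The allocation is still alive, nothing to transfer.
    Hit { offset: usize },
    /// The allocation expired (or never existed); the caller must allocate
    /// again and transfer the data.
    Miss,
}

impl AllocationCache {
    fn request(&mut self, id: u64) -> CacheResult {
        match self.entries.get_mut(&id) {
            Some(entry) => {
                entry.last_used_frame = self.current_frame;
                CacheResult::Hit { offset: entry.offset }
            }
            None => CacheResult::Miss,
        }
    }

    /// Called once per frame: expire entries that haven't been used recently.
    fn end_frame(&mut self) {
        let current = self.current_frame;
        let max_unused = self.max_unused_frames;
        self.entries
            .retain(|_, e| current - e.last_used_frame <= max_unused);
        self.current_frame += 1;
    }
}
```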
So far I talked about allocations in the sense of figuring out how many bytes at which offset to reserve in GPU memory for this and that, but we also want to write into these allocations. Oftentimes people think of allocating and writing into memory as a single thing. When writing `let five = Box::new(5)` you both ask the allocator to figure out where the value will be and write the value in memory. But sending data to the GPU isn't that simple. In general you can't assume that the memory you are writing to on the CPU is the one that gets read in the shaders. There are several memory heaps with different characteristics (CPU-visible, GPU-visible, fast/slow to read/write on the CPU/GPU, etc.). In practice this means that for a lot of things the data is first written into a staging buffer that is CPU-visible, then copied from there into memory that is fast to read from the shaders.
One strategy could be to work with a large GPU buffer for the shader to read, and generally smaller staging buffers into which we write only the parts that have changed. In this scheme the shader gets to read from a single contiguous buffer containing static and dynamic data alike and doesn't have to be aware of the distinction. This comes at the cost of an extra copy out of the staging buffer, even for data that we know will only be read for a single frame. If a lot of data needs to be updated and read only once each frame, we can instead use another type of GPU memory heap that is accessible to both the CPU and the GPU; it is slower to read in the shader, but depending on how the data is read it might still be faster than paying for the staging buffer copy. An example given for this type of memory heap is particle positions in a game, where by definition we know all of it will change each frame. The downside is that the shader can't pretend it is in a unified address space with the rest, so it isn't agnostic to what is animated and what is not. This might not be a good fit for cases where, say, some of the positions are animated while others are not and it all goes through the same code path in the shader. Another thing to be careful about is that some types of memory heaps use write-combined memory, which is ideal for filling the staging buffer but can perform poorly if we don't pay attention to how we write into it. So we have to write full, aligned, multiple-of-cache-line sized chunks and avoid random access.
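Here is a hand-wavy sketch of that update flow with hypothetical helper types (the actual gfx-hal code for mapping the staging memory and recording the copy is more verbose):

```rust
// Hypothetical description of a region to update in the large device-local
// buffer the shaders read from.
struct BufferUpdate {
    // Byte offset in the CPU-visible staging buffer where the new data lives.
    staging_offset: usize,
    // Byte offset in the device-local buffer the shader reads from.
    dest_offset: usize,
    size: usize,
}

// Write some changed data into mapped staging memory and remember the region
// that needs to be copied. With write-combined memory we want full, aligned,
// forward-only writes and no read-back.
fn stage_update(
    staging: &mut [u8],
    updates: &mut Vec<BufferUpdate>,
    staging_offset: usize,
    dest_offset: usize,
    data: &[u8],
) {
    staging[staging_offset..staging_offset + data.len()].copy_from_slice(data);
    updates.push(BufferUpdate {
        staging_offset,
        dest_offset,
        size: data.len(),
    });
}

// When recording the frame's command buffer, each BufferUpdate becomes one
// region of a buffer-to-buffer copy command, so the shader keeps reading from
// a single contiguous buffer and never sees the staging side.
```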
What's next?
Phew. That was a lot of words just for pushing some bytes to the GPU! I merely presented some challenges, proposed certain directions from very far away, and haven't even really talked about rendering (the fancy tricks that go into those shaders, generating the geometry and all).
But I believe this is an important part of the foundation. It is built out of mostly independent pieces, and that's already some code that needs writing. Fortunately, this weekend was the first in a while that I had time to sit down and do that, so I took one of the gfx-hal examples and started writing a few things around it. So far I only have a simple bump allocator, some glue to use it and write into memory from multiple threads, a simple and dumb retained allocator for coarse allocations, some re-exports of the gfx types without generics all over the place (since the backend is selected with a feature flag), and various small utilities. That's not much, I'm still spending time getting familiar with gfx, but it's a humble start.
I'll end this post with a hand-wavy overview of how I see this stuff fitting into the bigger picture:
From bottom to top there is:
- A gfx-device that is owned by an embedder.
- The embedder can be the default one or some glue to integrate with a specific engine or app.
- The low level components contain anything that is mostly independent of the rendering technique, for example a lot of the memory management code that I wrote about in this post. The goal of the low level components is to make it easy to write efficient rendering components without having each of them reinvent the common pieces.
- Rendering components do the more interesting stuff: they register passes and shaders, and implement fun rendering techniques. An example of a rendering component could be a glyph renderer using pathfinder. Another one could be a tessellated polygon renderer that uses lyon, and another could be something optimized for rendering many axis-aligned rectangles, as is common for UIs. Rendering components inform the command submission system which draw calls are order-independent, as well as some global requirements such as depth/stencil behavior, etc. This lets the low level pieces figure out an efficient batching strategy (a rough sketch of this boundary follows this list).
- You see on top of that several "API" boxes. These take care of high level logic and provide user-facing APIs. Each API would have to make different tradeoffs and focus on different use cases, picking whatever rendering components it wants to use. As much as possible I would like to keep compromises inside the "API" boxes, which will potentially be high level abstractions, rather than in the layers below.
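As a rough illustration of the boundary between rendering components and the command submission system mentioned above, here is a sketch of what a rendering component might declare. Everything here is hypothetical; the real interface would have to grow out of actual code:

```rust
// Hypothetical batching hints a rendering component hands to the command
// submission system so it can merge and reorder draw calls.
struct BatchingInfo {
    // Whether draw calls emitted by this component can be reordered freely.
    order_independent: bool,
    // Global requirements, e.g. depth/stencil behavior.
    needs_depth: bool,
    needs_stencil: bool,
}

// A rendering component registers its passes and shaders up front, then each
// frame fills GPU buffers (through the low level allocators) and emits draw
// calls that the submission system is free to batch according to the hints.
trait RenderingComponent {
    fn register(&mut self /* passes, shaders, ... */);
    fn batching_info(&self) -> BatchingInfo;
}
```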
I'm intentionally describing this in very broad terms. The idea behind separating things this way is to be able to experiment with various rendering techniques without starting from scratch each time, and also to aim for an extensible architecture if things work out. There are a lot of pieces missing and the boundaries will move as real code gets written, but this sets a general direction I am happy with for a low level, GPU-based 2d graphics library.