2018-04-03

Batching draw calls to improve performance

In graphics programming, one of the classic ways to improve performance is to reduce the number of draw calls in a frame.

Quake 3's BSP map format was not designed for this need. When I originally wrote a Quake 3 map viewer in OpenGL, my algorithm was to simply iterate through all of the visible "faces" of the map, bind their decal and lightmap textures, and draw. Here's an example of the draw calls made while rendering my first map, seen through a RenderDoc capture of the following frame:

What's important to notice is that I'm making lots of draw calls, each one drawing about 3 or 6 indices, or 1 to 2 triangles. Can something be done to reduce the number of calls and have each one be more meaningful (e.g. draw more)? The key is understanding the data you are rendering.

Before making a draw call in my map rendering, I need a face's decal texture, lightmap texture, number of vertices to draw, and an offset into the map's vertex buffer from which to start drawing. If I encounter any kind of texture change while iterating through faces, I need to bind a new texture (which can be expensive), and make another draw call. Unfortunately, BSP files don't sort their faces according to their textures, but rather by the BSP clusters they reside in (see the next paragraph), which usually leads to a lot of texture binds and draw calls.

There are, of course, ways to avoid drawing every face in the map. The appeal of BSP is being able to partition a 3D space into leaf "clusters" that can be used to trivially compute whether one can see another at runtime. If a cluster cannot be viewed by the one the "camera" is in, all faces in that cluster can be disabled from rendering. Additionally, clusters have axis-aligned bounding boxes, making it easy to disable their faces if they lie outside the camera's view frustum.
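The cluster-to-cluster visibility check can be sketched as a single bit lookup. This is a minimal sketch, not my viewer's actual code: the struct and function names are hypothetical, and the bit layout (one bit vector of `sz_vecs` bytes per cluster) is an assumption based on the commonly circulated unofficial description of the Quake 3 visdata lump.

```c
#include <stddef.h>

/* Hypothetical mirror of the visdata lump: n_vecs bit vectors,
 * each sz_vecs bytes long, stored back to back in vecs. */
typedef struct {
    int n_vecs;
    int sz_vecs;
    const unsigned char *vecs;
} bsp_visdata_t;

/* Returns nonzero if test_cluster is potentially visible from cam_cluster. */
int cluster_visible(const bsp_visdata_t *vis, int cam_cluster, int test_cluster)
{
    /* No vis data, or the camera is outside any valid cluster:
     * fall back to treating everything as visible. */
    if (vis == NULL || vis->vecs == NULL || cam_cluster < 0)
        return 1;

    /* A negative or out-of-range cluster is never drawn. */
    if (test_cluster < 0 || cam_cluster >= vis->n_vecs ||
        (test_cluster >> 3) >= vis->sz_vecs)
        return 0;

    /* Pick the camera cluster's bit vector, then test the bit
     * that belongs to the cluster being checked. */
    unsigned char byte = vis->vecs[cam_cluster * vis->sz_vecs + (test_cluster >> 3)];
    return (byte & (1 << (test_cluster & 7))) != 0;
}
```

A face would then be skipped whenever `cluster_visible` returns 0 for the cluster it belongs to.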

Despite these optimizations, it was still possible to reduce the number of calls even further. It simply required me to abandon a lot of said optimizations and group my faces into what I call "map render blocks". I started by creating a new data structure.

typedef struct
{
	int texture;
	int lm_index;
	int vertex;
	int n_vertexes;
	int meshvert;
	int n_meshverts;
	bboxf bounding_box;
	int visible;
	int start_range;
	int end_range;
} bsp_render_block_t;

The important details to notice are the texture indices and number of vertices to draw (meshverts are indices into the index buffer). BSP faces also have this data, but my aim is to have way fewer blocks than faces.

Since I need textures and a vertex range to make a draw call, I started by sorting my faces according to their texture and lightmap indices. After this, I allocated enough render blocks to match the number of texture changes found while iterating through the sorted list. I then iterated through the list a second time, this time keeping count of the number of vertices used by each face. Once I hit a texture change, I set the number of vertices and the offset for the current block, then started counting vertices for the next one. This required lots of debugging, where I encountered and handled edge cases like render blocks that consist of one face, or a texture change on the very last face of the sorted list.
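The two passes described above can be sketched roughly as follows. This is a simplified illustration rather than my real code: `face_t`, `block_t` and `build_blocks` are hypothetical names, each face carries only an index count, and it assumes the index buffer is rebuilt in sorted face order so that every block covers one contiguous range.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down face and block records. */
typedef struct { int texture; int lm_index; int n_indices; } face_t;
typedef struct { int texture; int lm_index; int first_index; int n_indices; } block_t;

/* Order faces by texture first, then by lightmap. */
static int compare_faces(const void *a, const void *b)
{
    const face_t *fa = a, *fb = b;
    if (fa->texture != fb->texture)
        return (fa->texture > fb->texture) - (fa->texture < fb->texture);
    return (fa->lm_index > fb->lm_index) - (fa->lm_index < fb->lm_index);
}

/* Sorts faces in place and returns a malloc'd array of blocks.
 * The block count is written to *n_blocks. Returns NULL (count 0)
 * when there are no faces or the allocation fails. */
block_t *build_blocks(face_t *faces, int n_faces, int *n_blocks)
{
    assert(n_blocks != NULL);
    *n_blocks = 0;
    if (faces == NULL || n_faces <= 0)
        return NULL;

    qsort(faces, (size_t)n_faces, sizeof *faces, compare_faces);

    /* Pass 1: one block per run of identical (texture, lightmap) pairs. */
    int count = 1;
    for (int i = 1; i < n_faces; i++)
        if (compare_faces(&faces[i - 1], &faces[i]) != 0)
            count++;

    block_t *blocks = malloc((size_t)count * sizeof *blocks);
    if (blocks == NULL)
        return NULL;

    /* Pass 2: accumulate index counts, opening a new block on each change. */
    int b = 0, offset = 0;
    blocks[0] = (block_t){ faces[0].texture, faces[0].lm_index, 0, 0 };
    for (int i = 0; i < n_faces; i++) {
        if (i > 0 && compare_faces(&faces[i - 1], &faces[i]) != 0) {
            b++;
            blocks[b] = (block_t){ faces[i].texture, faces[i].lm_index, offset, 0 };
        }
        blocks[b].n_indices += faces[i].n_indices;
        offset += faces[i].n_indices;
    }

    *n_blocks = count;
    return blocks;
}
```

Opening a block when the change is seen, instead of closing one afterwards, means a single-face block and a change on the very last face need no special handling.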

I then made a new map rendering function to easily compare to my old one. In the new function, I iterate through all of my map blocks and render them the same way I do my faces - bind the block's textures, and set the appropriate vertex range. I then used RenderDoc to confirm whether it was working. I captured the same frame as the one above. Indeed, my maps were looking correct, and being drawn differently - now, each draw call was rendering more.

You can see this better in the following gifs - first is the traditional "iterate through faces" method and second is the "iterate through blocks" method. Both were recorded from the preview window of RenderDoc while I was moving through my draw calls. In the first gif, I had my finger on the down key while recording. I couldn't do this in the second gif as it would have ended too quickly!

RenderDoc also has a "mesh output" tab where you can see exactly what is being drawn in a call. In the following gif, you can see that faces from all across the map can be drawn in individual calls.

Of course, this meant that frustum culling was less useful here - the bounding boxes of blocks could be so large that the player was almost always looking at them!
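To illustrate why block bounding boxes grow so large, here is a small sketch of merging face bounds into a block's bounds. The `aabb_t` type and `aabb_union` function are hypothetical stand-ins. The fields of the `bboxf` type in my struct are not shown in this post, so the min/max layout here is an assumption.

```c
/* Hypothetical axis-aligned bounding box: minimum and maximum corner. */
typedef struct {
    float mins[3];
    float maxs[3];
} aabb_t;

/* Returns the smallest box that contains both a and b. */
aabb_t aabb_union(aabb_t a, aabb_t b)
{
    aabb_t out;
    for (int i = 0; i < 3; i++) {
        out.mins[i] = a.mins[i] < b.mins[i] ? a.mins[i] : b.mins[i];
        out.maxs[i] = a.maxs[i] > b.maxs[i] ? a.maxs[i] : b.maxs[i];
    }
    return out;
}
```

Two small faces that share a texture but sit at opposite ends of the map produce a box spanning everything in between, so a frustum test against that box almost always passes.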

Did all this work actually result in better performance? Interestingly, I was only able to measure this in my unfinished Vulkan renderer on my laptop, which was the only configuration that showed an uncapped framerate in RenderDoc's text overlay. Two maps were tested, each with both "single" and "batched" rendering.

A roughly 30-60 FPS increase. Not noticeable on my machine, but this could potentially be a valuable boost on some maps, on some machines. I'm happy to have done it.