std::deque 4K Bytes high memory cost on iOS, C++

Preface

My iOS build of my game Dragons High was using about 150MB of live memory.

Using the Allocations instrument, I found more than 10,000 allocations of 4 KB each.

All of those allocations came from the same class: std::deque<unsigned int>.

In Dragons High there is a terrain of the world. The terrain has collision geometry data.

Basically I test the dragon or other moving characters against the triangles of the terrain geometry so they won’t go through it.

To avoid going over all the triangles, I keep a grid of lists (std::deque) holding indices to the triangles. If a character is inside a certain cell of the grid, I test only against the triangles in that cell rather than against every triangle in the mesh.
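The grid lookup described above can be sketched roughly as follows. This is only an illustration under assumptions: the names (TriangleGrid, IndexList, toCell, addTriangle, trianglesAt) are hypothetical, and the grid is assumed to be a uniform square grid on the X/Z plane, which the post does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// The container the post started with; swapping it is a one-line change.
using IndexList = std::deque<unsigned int>;

// Hypothetical uniform grid over the X/Z plane of the terrain.
struct TriangleGrid {
    float originX, originZ;        // world position of the grid's minimum corner
    float cellSize;                // edge length of one square cell
    std::size_t cellsPerSide;      // e.g. 128 for a 128x128 grid
    std::vector<IndexList> cells;  // row-major, one index list per cell

    TriangleGrid(float ox, float oz, float size, std::size_t n)
        : originX(ox), originZ(oz), cellSize(size), cellsPerSide(n), cells(n * n) {
        assert(n > 0 && size > 0.0f);
    }

    // Convert one world coordinate to a cell coordinate, clamped to the grid.
    std::size_t toCell(float world, float origin) const {
        const float rel = (world - origin) / cellSize;
        if (rel <= 0.0f) return 0;
        if (rel >= static_cast<float>(cellsPerSide)) return cellsPerSide - 1;
        return static_cast<std::size_t>(rel);
    }

    // Register a triangle in every cell its X/Z bounding box overlaps.
    void addTriangle(float minX, float minZ, float maxX, float maxZ,
                     unsigned int triangleIndex) {
        for (std::size_t z = toCell(minZ, originZ); z <= toCell(maxZ, originZ); ++z)
            for (std::size_t x = toCell(minX, originX); x <= toCell(maxX, originX); ++x)
                cells[z * cellsPerSide + x].push_back(triangleIndex);
    }

    // Candidate triangles for a character standing at (x, z).
    const IndexList& trianglesAt(float x, float z) const {
        return cells[toCell(z, originZ) * cellsPerSide + toCell(x, originX)];
    }
};
```

A collision test would then loop only over trianglesAt(character.x, character.z) instead of the whole mesh.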

std::deque Cost

It turns out that every std::deque in my std::vector<std::deque<unsigned int>> allocated 4 KB, even when it held fewer than 20 items.

My grid has 128×128 cells, so as a result this std::vector alone used more than 50MB.

In other words, each std::deque<unsigned int> in the std::vector cost a full 4 KB in my iOS game, no matter how few elements it held.
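As a quick sanity check on these numbers, here is the arithmetic written out as compile-time C++. The constant and function names are made up for illustration, and the 4096-byte block size is simply the allocation size seen in Instruments.

```cpp
#include <cstddef>

// Allocation size reported by the Allocations instrument (4K bytes).
constexpr std::size_t kBlockBytes = 4096;

// Upper bound: every cell of an N x N grid holds one 4K block.
constexpr std::size_t worstCaseBytes(std::size_t cellsPerSide) {
    return cellsPerSide * cellsPerSide * kBlockBytes;
}

// Number of 4K blocks needed to reach a given total.
constexpr std::size_t blocksFor(std::size_t totalBytes) {
    return totalBytes / kBlockBytes;
}

constexpr std::size_t kMiB = 1024u * 1024u;

// 128 x 128 = 16,384 cells, so at most 64 MiB if every cell has a block.
static_assert(worstCaseBytes(128) == 64 * kMiB, "128x128 cells at 4K each");
// About 12,800 blocks are enough to reach 50 MiB.
static_assert(blocksFor(50 * kMiB) == 12800, "blocks in 50 MiB");
```

So a figure above 50MB fits with somewhat more than 12,000 of the 16,384 cells holding a block, which matches the "more than 10,000 allocations" seen in the profiler.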

My solution was to write my own custom data structure, which stores the indices in a dynamically allocated unsigned int array (a plain unsigned int * pointer).

This way I was able to shave off about 50MB of memory usage.

EDIT:

A better alternative to making your own custom class with a pointer is to use std::vector.

std::vector takes care of allocations for you, so it’s safer, and it does not have a noticeable memory footprint (for a small number of items).
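One way to check the footprint of each container on a given platform is a small counting allocator. The sketch below is an assumption-laden illustration, not what the game does: CountingAllocator, g_liveBytes and liveBytesAfterFill are hypothetical helpers. The numbers it reports depend on the standard library. To my knowledge libc++ (the library used on iOS) sizes deque blocks at 4096 bytes for small element types, while other implementations use different block sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <new>
#include <vector>

// Bytes currently allocated through CountingAllocator (hypothetical helper).
static std::size_t g_liveBytes = 0;

template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        g_liveBytes += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        g_liveBytes -= n * sizeof(T);
        ::operator delete(p);
    }
};

// All instances share one global counter, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using CountedDeque = std::deque<unsigned int, CountingAllocator<unsigned int>>;
using CountedVector = std::vector<unsigned int, CountingAllocator<unsigned int>>;

// Fill a container with `count` indices and report the heap bytes it holds
// while it is still alive.
template <typename Container>
std::size_t liveBytesAfterFill(unsigned int count) {
    g_liveBytes = 0;
    Container c;
    for (unsigned int i = 0; i < count; ++i) c.push_back(i);
    return g_liveBytes;  // read before `c` is destroyed
}
```

Printing liveBytesAfterFill<CountedDeque>(20) next to liveBytesAfterFill<CountedVector>(20) on the target device shows the per-cell cost of each container directly.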

Boosting the FPS: GC (Part 1)

I was getting annoyingly long stutters in the render rate of my 3D Android RPG game Heroes Of Honesty (http://www.PompiPompi.net).

This seemed weird to me, because most frame times were below 33ms, which should have given me at least a solid 30 FPS, as can be seen in this diagram:

Frames below 33ms

After a few more measurements and some profiling, I found one of the issues responsible for this stutter. It turns out the Garbage Collector implementation for Java on Android is not very good.

You can see, in the following diagram, that there are gaps in the GLThread (the render thread) whenever the GC kicks in:

GC Stutters

(Notice the GC kicks in about every 300ms in the diagram above)

GC saves programmers the hassle of managing memory allocation and release. However, the more memory you allocate at run time, the harder the GC has to work to clean up after you. So, to keep the GC from working hard, we need to minimize the number of allocations we make.

We do this by “recycling” memory: we use a memory pool, that is, a chunk or an array of preallocated memory that we allocate once and reuse many times for different data.

Let’s say you need to set a Projection matrix on an OpenGL ES 2.0 shader. You might allocate the matrix each time before you send it, possibly even inside a render loop that runs over many different 3D objects.

Instead of allocating the matrix for each object, we can allocate it once outside the loop and reuse the same memory chunk for all the 3D objects in the render list.

After minimizing some of the memory allocations in the render thread, I got the following results:

Better performance of GC with memory recycling.

Notice the GC occurs once every 800 ms.

There is another issue. What I have shown up until now was a scene with static resources. I couldn’t load the whole map of Heroes Of Honesty into memory, so I had to dynamically load map patches only when they are required.

This means that in some frames I spend extra effort loading resources into VBOs. I also happen to allocate a lot of memory while generating the index and vertex data before sending them to the GPU.

The following diagram shows what happens when we are in the middle of dynamically loading resources:

Dynamic resource loading and GC

On the left side of the diagram you can see a big gap between two GC runs. On the right side, however, the GC is called many times.

Notice that not all of the stalls are the GC’s fault; some are caused by the extra work needed to actually generate the dynamic resource (more on that in the second part).

The following diagram shows the performance after using memory pools and recycling memory for the index and vertex buffers that are needed while loading a dynamic resource:

GC performance is now better with dynamic resources

In the diagram above you can see that the GC doesn’t work any harder when dynamic resources are being created. You can see where resources are being created because the GetMesh method is marked with little wedges.

Notice that there are still gaps in the render thread and we will still get stutters even though we solved the GC issue.

For the sake of completeness, here is a closer look at the frame timelines while loading dynamic resources:

Single thread resource loading.

In the diagram, a “normal” render frame would consist of a mostly black part and a pink part. The black part is mostly the OpenGLES draw calls, and the pink part is the game logic update. Currently they are both done in the same render thread.

Some frames have a green part, which is CPU time “wasted” on loading the resource. As you can see, the many green parts make the frame time longer, and thus the frame rate drops once in a while.

You may also notice that the GC doesn’t do any work during this whole time. In the next part I will explain how I improved the render thread’s rate and the overall performance (with multithreading).