Simple yet important tip for passing std::string by reference.

Using C++ I quickly learned that I should pass large std::vector by reference.

Passing a std::vector by reference (using &) as a function parameter saves you from copying the std::vector into a local copy.

For my next game I was getting bad performance parsing JSON files.

It turns out that nearly all of the CPU time spent parsing the JSON was due to passing std::string by value in internal functions.

Passing std::string by (const) reference saves you that copy of the string.

OpenGL Texture Size and performance? Texture Cache? (Android/iOS GLES 2)

I am working on my dragon simulation mobile game and I am at the stage of adding a terrain to the game.

I optimized the terrain render for my iOS devices but on my Android device I was getting bad performance.

Understanding texture size and performance is kind of elusive. I was told many times that big textures are bad, but I was never able to find the actual correlation between texture size and performance.

I am still not sure what the correlation between texture size, bandwidth, and performance on mobile devices is, and I am sure it's a complex one.

However, I did find out something else.

As I mentioned, my first attempt at copying the OpenGLES shaders from my iOS code to my Android code gave me poor results. The same scene that ran at 60 FPS on my iPod was running at 25 FPS on my Android phone.

This is how the scene looked on my Android phone:

Slow Terrain Render

Scene rendered at 25 FPS (40 ms per frame)

For the terrain I am using two 2048×2048 ETC1 compressed textures. One for the grass and one for the rocky mountain.

Maybe my phone's performance really isn't as good as my iPod's? But then, something was missing.

On my iPod I was already using mipmapped textures while on the first attempt of the Android version I didn’t use mipmapped textures.

Mipmapped textures are textures that contain not only the base image but also all (or some) of the smaller versions of the same texture image.

If you have a texture of size 16×16 pixels then a mipmapped texture will contain both the 16×16 image but also the 8×8, 4×4, 2×2 and 1×1 resolutions of the same image.

This is useful because it's hard to scale down a texture on the GPU without losing detail. The mipmapped images are precalculated offline and can use the best algorithms to reduce the image.

When rendering with mipmapped textures the GPU selects the mipmapped image that is the most suitable for the current scaling in the scene.

But apart from looking better, there is another advantage. Performance.

The same scene using mipmapped versions of the 2048×2048 textures runs a lot faster than before. I could get the scene to render at about 50 to 60 FPS.

The reason for that is that texture fetches go through a cache that exploits 2D spatial locality.

In this scene the mountain and grass textures are scaled down considerably. This in turn makes the GPU sample the textures at texels (texture pixels) that are far apart from each other, making no use of the cache.

In order to make use of the cache the sampling of the texture must have spatial proximity.

When using the mipmapped version of the texture, a much smaller mip level of the 2048×2048 texture is sampled, and thus it is possible to make use of the cache for this specific image.

For the sake of completeness, here is the scene with the mipmapped textures:

Runs at about 50-60 FPS (17-20 ms)


resize performance for std::vector in C++

I am working on an iOS port of my Android game Concussion Boxing.

Inside my rendering function I have a certain loop that uses the same struct for each mesh that is drawn. This struct contains std::vector variables of various kinds.

Whenever I want to use this struct to pass parameters to a function, I resize its std::vector variables.

Allocating memory is a CPU-demanding operation. A vector that already has enough capacity can be resized without allocating, which is why resize can be used to save on these allocation operations.

A good idea would be to declare the struct outside of the loop so we can reuse the memory that was allocated in the previous iteration of the loop.

However, there is a case in which resize won’t work so well. If you have a std::vector<std::vector<float>> then resize will have a performance hit.

I don't know what goes on inside resize in this case, but I found that simply having a pointer to float, or a struct with a constant number of floats (in an array), is a lot more efficient than using a vector of vectors.

For reference, I improved my game's frame CPU time from 27 ms to 22 ms by using fewer vectors of vectors.

Boosting the FPS: GC (Part 1)

I was getting annoyingly long stutters in the render rate of my 3D Android RPG game Heroes Of Honesty.

It was weird to me because most frame times I measured were below 33 ms, which means I should be getting at least a solid 30 FPS, as can be seen in this diagram:

Frames below 33 ms

After a few more measurements and profiling, I realized what was one of the issues responsible for this stutter. It turns out the Garbage Collector implementation on Android for Java is not very good.

You can see, in the following diagram, there are gaps in the GLThread (the render thread) whenever the GC kicks in:

GC Stutters

(Notice the GC kicks in about every 300ms in the diagram above)

GC saves programmers the hassle of managing memory allocation and release. However, the more you allocate memory at run time the harder the GC will have to work to clean after you. So in order to prevent the GC from working hard we need to minimize the amount of allocations we make.

We do this by “recycling” memory. We use a memory pool: a chunk of pre-allocated memory, or an array of pre-allocated objects, that we allocate once and reuse many times for different data.

Let's say you need to set a projection matrix on an OpenGLES 2.0 shader. You might allocate the matrix each time before you send it, maybe even inside a render loop that handles many different 3D objects.

Instead of allocating the matrix for each object, we can allocate it once outside the loop and reuse the same memory chunk for all the 3D objects in the render list.

After minimizing some of the memory allocations in the render thread, I got the following results:

Better performance of GC with memory recycling.

Notice the GC occurs once every 800 ms.

There is another issue. What I have shown up until now was a scene with static resources. I couldn’t load the whole map of Heroes Of Honesty into memory, so I had to dynamically load map patches only when they are required.

This means that in some of the frames I put extra effort into loading resources into VBOs. I also happened to allocate a lot of memory in the process of generating the index and vertex data before sending them to the GPU.

The following diagram shows what happens when we are in the middle of dynamically loading resources:

Dynamic resource loading and GC

On the left side of the diagram you can see there is a big gap between the two times the GC gets to work. However, on the right side the GC is being called many times.

Notice that not all of the stalls are the GC's fault; some of them are caused by the extra work we need to do to actually generate the dynamic resource (more on that in the second part).

The following diagram shows the performance after using memory pools and recycling memory for the index and vertex buffers that need to be dealt with while loading a dynamic resource:

GC performance is now better with dynamic resources

In the diagram above you can see the GC doesn't work any harder when dynamic resources are being created. You can see where resources are being created because the GetMesh method is marked with little wedges.

Notice that there are still gaps in the render thread and we will still get stutters even though we solved the GC issue.

For the sake of completeness, here is a closer look at the frame timelines while loading dynamic resources:

Single thread resource loading.

In the diagram, a “normal” render frame would consist of a mostly black part and a pink part. The black part is mostly the OpenGLES draw calls, and the pink part is the game logic update. Currently they are both done in the same render thread.

Some frames have a green part, the green part is CPU time “wasted” on loading the resource. As you can see the many green parts make the frame time longer and thus the frame rate will drop once in a while.

You may also notice that the GC doesn't do any work during this whole time. In the next part I will explain how I improved the render thread's rate and the overall performance (with multithreading).

Improving Animations’ Key Frame Performance on the CPU

For my 3D Android RPG game Heroes Of Honesty I was making 3D animated characters.

There are numerous ways to animate a 3D character, and one of them is to have an artist set animation key frames.

A key frame is a state of an object or bone of the character at a specific time in the animation timeline. By setting many key frames you can animate the character into all sorts of pre-made behaviors.

When drawing an animated character in a 3D game, you select a specific time in the animation timeline and calculate the posture of the character at that specific time. The posture is calculated from the nearest key frames in the timeline, interpolating between the two adjacent key frames if there is no key frame at that exact time.

How would you find which key frame is the closest to the frame time?

Linear search

Since the key frames are ordered from earliest to latest, we can go from the beginning to the end, one step at a time, and stop at the first key frame whose time is greater than our current frame time.

This sort of search has O(n) complexity and it is pretty slow when we have a lot of key frames.

We might have many key frames if our animation software bakes the keyframes for things such as Inverse Kinematics.

Binary search

We can do better than linear search.

Since our key frames are ordered in the time line from small to big, we can do a binary search.

The binary search algorithm repeatedly discards half of the relevant key frames. If we know a specific key frame's time is greater than our frame time, then all the key frames after it are irrelevant to us. We then continue checking the relevant half, then the relevant quarter, and so on.

The following code is a binary search in C++:

	unsigned int
	Mesh::AnimationNode::CoreFindPosition (aiNodeAnim * n, unsigned int First, unsigned int Last, double t)
	{
		if (First+1>=Last)
			return First;
		unsigned int Index = (Last+First)/2;
		if (n->mPositionKeys[Index].mTime==t)
			return Index;
		if (n->mPositionKeys[Index].mTime<t)
			return CoreFindPosition(n, Index, Last, t);
		return CoreFindPosition(n, First, Index, t);
	}

For the binary search we get a complexity of O(log(n)).

Baked Search

We can do even better. Our game render rate is limited. Most games run at no more than 60 Frames Per Second. This means that at the very least the animation time steps would be 16.67 ms.

We can create an array whose size is the animation timeline length divided by the rendering frame time (16.67 ms). Each entry of this array holds the index of the key frame nearest to the frame time that entry represents.

Now in order to find the closest key frames we simply access one array cell that corresponds to the animation’s frame time and get the index of the closest key frame.

This algorithm is of O(1) complexity.

Full baking?

A character usually consists of a hierarchy of objects that compose all the parts of the character. In order to animate the character, we need to find each object's relative-position key frame, and then calculate its absolute position up the hierarchy to the root object.

Instead of baking indices of key frames in the baked search, we can bake the absolute position of the object in that specific time. Since the rendering FPS is limited, we can bake the absolute position dense enough so we won’t need to interpolate.

This will save us all the hierarchical calculation of the animation tree. However, this way we won't be able to mix this specific animation with other animations, or with dynamic elements such as rag-doll animation.