What is “Performance is limited by CPU vertex processing”?

I was working on a different OpenGL ES 2 particle technique to draw fire for my new (WIP) game Dragons High.

What I noticed is that whenever the particle effect is enabled, the CPU stalls for 14 ms on the glDrawElements call for that specific particle VBO.

In this particle technique I am rendering the triangles of a certain mesh instanced 500 times.

Each instance has its own time and a few other parameters.

The vertex structure I used was:

typedef struct _ParticleVertex
{
    float3 Pos;
//    float Alpha;
    float3 Velocity;
    float Time;
    float4 color;
} ParticleVertex;

Velocity is the particle’s initial velocity.
Time is the time the particle first spawns, and color is the particle’s color tint and transparency (in the alpha component).
Instead of color I can use Alpha, which holds only the alpha component of color.

What I noticed is that if I used Alpha instead of color, the CPU no longer stalled on the particle draw call, and the command took less than 1 ms to complete on the CPU side.

Both methods rendered the particles correctly:



However, when I analyzed a frame of the color version in Xcode, I found the following warning:

“Performance is limited by CPU vertex processing”

If you know how GLES2 works, it sounds odd that vertices would be processed on the CPU. The vertex shader in GLES2 runs on the GPU, and the analyzer cannot know whether I updated the VBO with vertex data that was preprocessed on the CPU.

When running the OpenGL ES Analyzer instrument in the Xcode profiler I found the following error:

“GL Error: Invalid Value”

This error pointed to glVertexAttribPointer with all the parameters set to 0.

It turns out that when creating the VBO there was a boolean parameter (mUseColor) that I used to signify that the vertex struct has color data.

When creating the VAO for this vertex struct I was adding 4 to the attribute array length when mUseColor==true, instead of 1.

This resulted in the VAO array having 3 more entries than it should. Since those entries were never initialized, they were left with 0 values.
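The counting bug can be sketched as follows; this is my reconstruction of the logic, not the article’s actual code (attrib_count and the buggy flag are illustrative names):

```c
#include <assert.h>

/* Reconstruction of the bug (illustrative, not the article's code):
 * the vertex has 3 base attributes (Pos, Velocity, Time) plus one
 * more when mUseColor is set. The buggy path added 4 -- the number
 * of color *components* -- instead of 1 attribute, leaving 3 extra
 * uninitialized entries that reached glVertexAttribPointer with all
 * parameters zeroed. */
static int attrib_count(int mUseColor, int buggy)
{
    int count = 3;                  /* Pos, Velocity, Time */
    if (mUseColor)
        count += buggy ? 4 : 1;     /* 4 components vs 1 attribute */
    return count;
}
```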

Why were the particles rendered correctly even though the VAO was incorrect? And why did the rendering cost so much CPU time? It’s hard to tell, but it’s possible the iOS driver did something to compensate for the structural error in the VAO.

I guess a good practice is to clean up all the errors before you profile… but it would be interesting to understand what was going on under the hood and what the CPU vertex processing Xcode warned about actually was.


Mipmaps and GL_REPEAT Artifacts in OpenGL ES 2.

I am working on a new racing game and I encountered some odd artifacts in rendering the textures of the track.

I verified that the artifacts were not ZBuffer fighting related, but I couldn’t tell what caused them.

The track has a repeating texture, which means it uses one texture and repeats it along the track segments.

Track Start Artifacts


Guard Rail Artifact


In the first image notice the patch of asphalt just before the car. It has artifacts while the patch further away does not.

In the second image look at the guard rails.

This phenomenon happened with either a raw mipmapped texture or a compressed mipmapped texture (PVR on iOS).

However, it only happened on textures that used the GL_REPEAT wrap mode, and not on other mipmapped textures.

So what was it?

It turned out that there was an issue with calculating the level of the mipmap.

Notice in the guard rail image that the artifacts happen at a specific depth band, which is the transition from one mipmap level to a lower one.

Look at the UV mapping of the track on the bottom right window:

Long UV mapping


The UV mapping for this track stretches far beyond the [0..1] range of the U coordinate.

While UV coordinates normally repeat over the range [0..1]x[0..1], the mapping on this object reaches a U coordinate of about 50.

This is done to make use of the repeating texture, but it also interferes with OpenGL ES 2’s internal mipmap level calculation.

With large U coordinates I am guessing there is a floating point accuracy issue in the mipmap calculation. I don’t know exactly how the mipmap calculations are done, but I assume this is the cause of the artifacts.

Notice that in the first artifact screenshots the car is at the beginning of the track, which means the first patch you see in the screenshot is actually the last patch of the track model, as the track is cyclic.

The artifacts get worse the further you go along the UV mapping, as the values get bigger and cause more floating point accuracy issues.

Since the texture is repeating, it doesn’t matter if the UV mapping reuses the same UV area.

I made a more compact UV mapping version of the same model:

Compact UV mapping


Using this version of the mesh made all the artifacts disappear!

Start Fixed


Guard Rail Fixed


In conclusion:

If you verified that you don’t have ZBuffer issues.

If you see depth related artifacts on a specific range band of a texture with mipmaps.

And if the artifacts do not appear at small UV coordinates but become more severe the further the UV coordinates are from [0..1]x[0..1], then there is a good chance you have a UV mapping related mipmap artifact.

Solving this issue might be as simple as remapping the UVs to values closer to [0..1].

There is more to say about why this happens and what exactly the floating point inaccuracies in OpenGL’s mipmap calculation are, but that is beyond the scope of this article.

I hope you find this article useful.


glDepthRangef and Depth Clipping on OpenGL ES 2.

I am working on a new racing game for mobile devices.

In this game I have a ground plane which stretches to the horizon and a spherical sky panorama.

As you may know, you can’t really render to infinity in OpenGL, and even if you could, rasterization and resolution limits would still make the horizon look jagged and aliased.


In order to make the ground plane blend with the sky sphere, I needed the ground pixels near the far edge of the render frustum to blend with the background.

However, most of the ground does not need to blend with the background; only a thin strip near the far end of the view does.

We can render the ground in two passes: one with regular shading, where there is no fading, and one from where the fading begins.

The fading is dependent on depth so we need to split the rendering based on depth.

Notice: splitting the render based on depth can be useful performance wise, or can be useful to simplify shaders. In my case there wasn’t a big difference in performance between rendering the ground in two pieces (with and without blending).

So what we need is Depth Clipping.

Our frustum box already does clipping. It clips whatever we project into it that falls outside the box [-1..1]x[-1..1]x[-1..1].

How would we clip based on depth that is smaller than 1?

When rendering the scene in the racing game we are using a perspective projection matrix which is created with the following parameters:

Field of View angle(on Y axis), Aspect ratio, Near plane and Far plane.

For instance we can have a perspective camera with a field of view of 45 degrees, an aspect ratio of 4:3, a near plane of 1 and a far plane of 500.

Let’s say we want to render an object in the scene but clip it at a depth of 450 instead of 500.

We can create a separate projection matrix with the exact same parameters as in the example but with 450 in the far parameter instead of 500.

This will render the object clipped to 450.

However, notice that we said the frustum only clips at -1 and 1 on the depth axis.

With the original projection matrix we projected 500 onto 1. With the new matrix we project x and y the same (into [-1..1]x[-1..1]), but in depth it is 450 that is projected onto 1.

While the object rendered with the 450-far matrix appears correct on screen, its depth values are inconsistent with the depth values produced by the original matrix (with far set to 500).

This will cause the Z Buffer to behave incorrectly.


In order to fix this, we change the range of the depth values written into the ZBuffer from [-1..1] to [a..b], where we choose the new a and b.

We do this by using the function glDepthRangef.

glDepthRangef accepts values between 0 and 1, while our frustum box depth values are between -1 and 1.

This means that for glDepthRangef, 0 corresponds to a frustum depth of -1 and 1 corresponds to a frustum depth of 1.

How do we choose the range for glDepthRangef so that the depth values produced with the 450-far projection will match all the other objects rendered with the original projection?

We use the original matrix to calculate where 450 is projected into the frustum. Like so:


vector4 p = OriginalProjection.MulPosition(vector3(0, 0, 450));
float newFar = p.z / p.w;         // projected depth in [-1..1]
newFar = 0.5f * (newFar + 1.0f);  // remap to glDepthRangef's [0..1] range
glDepthRangef(0.0f, newFar);




Another thing to remember:

glDepthRangef changes the depth values written into the Z Buffer.

This means the vertex shader still outputs clip space positions whose depth falls in [-1..1] after the perspective divide.

Note, however, that gl_FragCoord.z in the fragment shader holds the window space depth, i.e. the value after the glDepthRangef mapping has been applied.
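Assuming a standard GL-style right-handed perspective matrix (the article’s vector4 and MulPosition come from its own math library), the depth-range computation above can be sketched in plain C; only the z and w rows of the matrix matter here:

```c
#include <assert.h>
#include <math.h>

/* Sketch under the assumption of a standard GL perspective matrix
 * (right-handed, eye looking down -Z). Maps an eye-space distance
 * to the [0..1] value glDepthRangef expects. */
static float clip_depth_to_01(float n, float f, float eyeDist)
{
    /* z row: z_clip = A*z_eye + B, and w_clip = -z_eye */
    float A = -(f + n) / (f - n);
    float B = -2.0f * f * n / (f - n);
    float z_eye  = -eyeDist;
    float z_clip = A * z_eye + B;
    float w_clip = -z_eye;
    float ndc    = z_clip / w_clip;    /* frustum depth in [-1..1] */
    return 0.5f * (ndc + 1.0f);        /* [0..1] for glDepthRangef */
}
```

For near = 1, far = 500 and a clip depth of 450 this yields roughly 0.9998, which would then be passed as the second argument of glDepthRangef.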

OpenGL Texture Size and performance? Texture Cache? (Android\iOS GLES 2)

I am working on my dragon simulation mobile game and I am at the stage of adding a terrain to the game.

I optimized the terrain render for my iOS devices but on my Android device I was getting bad performance.

Understanding texture size and performance is kind of elusive. I was told many times that big textures are bad, but I was never able to find the correlation between texture size and performance.

I am still not sure what the correlation between texture size, bandwidth and performance on mobile devices is, and I am sure it’s a complex one.

However, I did find out something else.

As I mentioned, my first attempt at copying the OpenGL ES shaders from my iOS code to my Android code gave me poor results. The same scene that ran at 60 FPS on my iPod ran at 25 FPS on my Android phone.

This is how the scene looked on my Android phone:


Scene rendered at 25 FPS (40 ms per frame)

For the terrain I am using two 2048×2048 ETC1 compressed textures. One for the grass and one for the rocky mountain.

Maybe my phone’s performance really is not as good as my iPod’s? But then, something was missing.

On my iPod I was already using mipmapped textures, while in the first attempt at the Android version I didn’t use mipmapped textures.

Mipmapped textures are textures which contain not only the texture image itself but also all (or some) of the smaller versions of the same image.

If you have a texture of size 16×16 pixels, then a mipmapped texture will contain not only the 16×16 image but also the 8×8, 4×4, 2×2 and 1×1 resolutions of the same image.

This is useful because it’s hard to scale down a texture on the GPU without losing detail. The mipmap images are precalculated offline and may use the best algorithms to reduce the image.

When rendering with mipmapped textures the GPU selects the mipmapped image that is the most suitable for the current scaling in the scene.

But apart from looking better, there is another advantage. Performance.

The same scene using mipmapped versions of the 2048×2048 textures ran a lot faster than before. I could get the scene to render at about 50 to 60 FPS.

The reason for that is that textures have a 2D spatial cache.

In this scene the mountain and grass textures are scaled down considerably. This in turn makes the GPU sample the textures at texels (texture pixels) that are far from each other, making no use of the cache.

In order to make use of the cache the sampling of the texture must have spatial proximity.

When using the mipmapped version of the texture, a much smaller mip level of the 2048×2048 texture was sampled, and thus it was possible to make use of the cache for this specific image.

For the sake of completeness, here is the scene with the mipmapped textures:

Runs at about 50-60 FPS (17-20 ms)


Location 0? location -1? glGetUniformLocation, C++ and bugs. (Android\iOS GLES 2)

In OpenGLES 2 glGetUniformLocation receives the program id and a string as parameters. It then attempts to return a location int that can be used to set uniform GLSL shader variables.

If the variable is found it will return 0 or a positive value. If it fails to find the uniform variable it will return -1.

In C++ we should initialize the location ints in the constructor. If we don’t initialize the locations, we might have garbage values when in Release mode.

Using the locations with garbage values might overwrite uniform variables with values we did not intend them to have.

So what should we initialize the locations with? One might think that 0 is a good value, but it is not.

Remember! 0 is a valid shader uniform variable location. If we set all the locations to 0 we might overwrite the uniform variable at location 0.

We should initialize the location ints with -1.

We should do this because -1 is the value returned when the uniform variable is not found, and setting a value at location -1 is silently ignored.
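A minimal sketch of the convention (ShaderLocations and its field names are illustrative, not from the article’s code):

```c
#include <assert.h>

/* -1 is GL's "not found" location, and glUniform* calls with
 * location -1 are ignored, so it is the only safe default.
 * Initializing to 0 would silently target a real uniform. */
typedef struct {
    int mvpLoc;
    int colorLoc;
    int timeLoc;
} ShaderLocations;

static void init_locations(ShaderLocations *locs)
{
    locs->mvpLoc = locs->colorLoc = locs->timeLoc = -1;
}
```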

glDepthFunc and GL_LEQUAL for background drawing in an OpenGL ES 2 Android\iOS game.

For the (semi secret) game I am currently working on (http://www.PompiPompi.net/) I had to implement the background, or backdrop, of the 3D world.

A simple method to draw the background is having a textured sphere mesh surround the camera.

The issue with this method is that the sphere is not perfect, so there are distortions, and it doesn’t completely fit the viewing frustum, which means some geometry that is inside our frustum will be occluded by the sphere.

A different method is to draw a screen space quad at the exact back end of the frustum. This quad will have coordinates in screen space and won’t require transformation.

The back end of the frustum in screen space depth coordinates is 1.

You could disable writing into the ZBuffer with glDepthMask(false) and draw the background first. All the geometry that renders afterwards will overwrite the background.

However, what if we want to draw the background last and save on fill rate by using the ZBuffer?

Just drawing the background last should have done that but instead it might not render at all.

We usually clear the depth part of the render buffer to 1.0, which is the highest value for the depth buffer. But our background screen mesh is also rendered at depth 1!

It turns out the default depth test function in OpenGL ES is GL_LESS. Since the ZBuffer already holds 1.0, our background screen mesh won’t render.

What we can do is, before rendering the screen mesh, set the depth test function to less-or-equal by calling: glDepthFunc(GL_LEQUAL);

This way our background quad will not draw on pixels where we have already drawn geometry, but will draw on the rest of the “blank” pixels.
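The difference between GL_LESS and GL_LEQUAL against a buffer cleared to 1.0 can be simulated for a single pixel:

```c
#include <assert.h>

/* Simulation of the fixed-function depth test for one pixel.
 * The depth buffer clears to 1.0; a background quad drawn at
 * depth 1.0 fails GL_LESS (1.0 < 1.0 is false) but passes
 * GL_LEQUAL wherever no closer geometry has been drawn. */
enum DepthFunc { DEPTH_LESS, DEPTH_LEQUAL };

static int depth_test_passes(enum DepthFunc func,
                             float fragDepth, float bufferDepth)
{
    return func == DEPTH_LESS ? fragDepth <  bufferDepth
                              : fragDepth <= bufferDepth;
}
```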

GLES 2 and unsigned byte attribute (ivec4?)

I was trying to have a VBO with bone indices in my Android GLES2 3D game.

The issue is that most of the vertex buffer contains floats, while the indices only require unsigned bytes.

At first I just used floats as indices for the bones, but I wanted to do better, so I made a vertex buffer that holds both floats and unsigned bytes.

In order to tell GLES 2.0 that you want 4 indices of unsigned bytes, you need a call along the following lines (the attribute location, stride and offset here are placeholders for your own values):

GLES20.glVertexAttribPointer(boneIndicesLoc, 4, GLES20.GL_UNSIGNED_BYTE, false, stride, indicesOffset);
You will need to make sure you pass a ByteBuffer with the correct format to GLES2 with GLES20.glBufferData.

Lastly, and this is the tricky part and the reason I wrote this article: how do you refer to unsigned byte attributes inside the GLSL shader code?

Well, it turns out GLES 2.0 does not have ivec4, which is a vector of 4 ints. However, in GLES 2.0 you can use “attribute vec4” for unsigned bytes, even though the bytes are not 32 bit floats.

This will result in a float data type, but you can convert it to an int using int(YourVar).
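A small sketch of what that conversion amounts to, mimicking in C what the GPU and GLSL’s int() do (this is an illustration, not GL code):

```c
#include <assert.h>

/* Mimics what the GPU does with a GL_UNSIGNED_BYTE attribute when
 * normalized == false: each byte arrives in the vec4 as a plain
 * float (3 becomes 3.0), so GLSL's int(v.x) recovers the bone index
 * exactly -- every byte value 0..255 is representable in a float. */
static float attrib_ubyte_to_float(unsigned char v)
{
    return (float)v;   /* non-normalized conversion */
}

static int bone_index_from_attrib(float v)
{
    return (int)v;     /* what int(YourVar) does in GLSL */
}
```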