What is “Performance is limited by CPU vertex processing”?

I was working on a different OpenGLES2 technique of particles to draw fire for my new(WIP) game Dragons High.

The thing I noticed is that whenever the particles effect is enabled I get stalled for 14 ms on glDrawElements of that specific particles VBO.

In this particle technique I am rendering the triangles of a certain mesh instanced 500 times.

Each instance has it’s own time and a few other parameters.

The vertex structure I used was:

typedef struct _ParticleVertex
    float3 Pos;
//    float Alpha;
    float3 Velocity;
    float Time;
    float4 color;
} ParticleVertex;

Velocity is the particle’s initial velocity.
Time is the time the particle first spawns, and color is the particle’s color tint and transparency(with the alpha component).
Instead of color I can use Alpha which only has the alpha component of color.

The thing I noticed is that if I used Alpha instead of color, the CPU will no longer stall on the particle’s draw call and it would take less than 1 ms to complete the command on the CPU side.

Both methods rendered the particles correctly:



However, when I analyzed a frame with color in xCode, I found the following warning:

“Performance is limited by CPU vertex processing”

If you know how GLES2 works it would sound weird that the vertices are being processed on the CPU. The vertex shader in GLES2 runs on the GPU and the analyzer cannot know if I updated the VBO with vertex data that was preprocessed on the CPU.

When running the OpenGL ES Analyzer instrument in the xCode profiler I found the following error:

“GL Error:Invalid Value”

This error pointed to glVertexAttribPointer with all the parameters set to 0.

It turns out that when creating the VBO there was a boolean parameter(mUseColor) that I used to signify that the vertex struct has a color data.

When creating the VAO for this vertex struct I was adding 4 to the array length for mUseColor==true instead of 1.

This resulted in the VAO array have 3 more entires than it should. Since those entries were not initialized they were left with 0 values.

Why were the particles rendered correctly eventhough the VAO was incorrect? And why the rendering cost so much CPU? It’s hard to tell, but it’s possible the iOS driver did something to compensate on the structural error of the VAO.

I guess a good practice is to clean all the errors before you profile… but it could be interesting to understand what was going under the hood and what was the CPU vertex processing xCode warned about.