What is “Performance is limited by CPU vertex processing”?

I was working on a different OpenGLES2 technique of particles to draw fire for my new(WIP) game Dragons High.

The thing I noticed is that whenever the particles effect is enabled I get stalled for 14 ms on glDrawElements of that specific particles VBO.

In this particle technique I am rendering the triangles of a certain mesh instanced 500 times.

Each instance has it’s own time and a few other parameters.

The vertex structure I used was:

typedef struct _ParticleVertex
    float3 Pos;
//    float Alpha;
    float3 Velocity;
    float Time;
    float4 color;
} ParticleVertex;

Velocity is the particle’s initial velocity.
Time is the time the particle first spawns, and color is the particle’s color tint and transparency(with the alpha component).
Instead of color I can use Alpha which only has the alpha component of color.

The thing I noticed is that if I used Alpha instead of color, the CPU will no longer stall on the particle’s draw call and it would take less than 1 ms to complete the command on the CPU side.

Both methods rendered the particles correctly:



However, when I analyzed a frame with color in xCode, I found the following warning:

“Performance is limited by CPU vertex processing”

If you know how GLES2 works it would sound weird that the vertices are being processed on the CPU. The vertex shader in GLES2 runs on the GPU and the analyzer cannot know if I updated the VBO with vertex data that was preprocessed on the CPU.

When running the OpenGL ES Analyzer instrument in the xCode profiler I found the following error:

“GL Error:Invalid Value”

This error pointed to glVertexAttribPointer with all the parameters set to 0.

It turns out that when creating the VBO there was a boolean parameter(mUseColor) that I used to signify that the vertex struct has a color data.

When creating the VAO for this vertex struct I was adding 4 to the array length for mUseColor==true instead of 1.

This resulted in the VAO array have 3 more entires than it should. Since those entries were not initialized they were left with 0 values.

Why were the particles rendered correctly eventhough the VAO was incorrect? And why the rendering cost so much CPU? It’s hard to tell, but it’s possible the iOS driver did something to compensate on the structural error of the VAO.

I guess a good practice is to clean all the errors before you profile… but it could be interesting to understand what was going under the hood and what was the CPU vertex processing xCode warned about.


std::deque 4K Bytes high memory cost on iOS, C++


My iOS build of my game Dragons High was using about 150MB of live memory.

I used the Allocations instrument to find out that I had more than 10,000 allocations of the size of 4K Bytes.

All those allocations were of the same class, std::deque<unsigned int>

In Dragons High there is a terrain of the world. The terrain has collision geometry data.

Basically I test the dragon or other moving characters against the triangles of the terrain geometry so they won’t go through it.

In order to not go over all the triangles I have a grid of lists(std::deque) with indices to the triangles, so if the character is inside a certain cell in the grid I just test against the triangles in that cell and not against all the triangles in the mesh.

std::deque Cost

It turns out that for every item in std::vector<std::deque<unsigned int>>, std::deque was allocating 4K Bytes even if it had less than 20 items.

For my grid I allocated 128×128 cells, so as a result I had more than 50MB usage just for this std::vector.

For every std::deque<unsigned int> in the std::vector in my iOS game I had 4K allocated even if there were very few elements in the std::deque.

My solution was to use my own custom Data Structure which used dynamically allocated unsigned int * pointer.

This way I was able to shave about 50MB of memory usage.


A better alternative to making your own custom class with a pointer is to use std::vector.

std::vector takes care of allocations for you, so it’s safer and it does not have a noticeable memory footprint(for a small amount of items).

Pitfalls and Errors When Using AVAssetWriter to Save the OpenGL Backbuffer Into a Video.

For my current game I need to record the OpenGL back buffer of GLKView into a video using AVAssetWriter.

I will not go over all the code to do this(which is not that much).

However, I will go quickly over some of the pitfalls that might make you go scratching your head when you try to create a video on iOS.

The first thing to consider is that AVAssetWriter might fail without throwing an exception and without returning any value pointing that out.

To test if there was an error within AVAssetWriter methods you need to read the status and error properties of AVAssetWriter.

You can print the status and error like so:

NSLog(@"%ld %@", (long)videoWriter.status, videoWriter.error);

The first pitfall I encountered is that after startWritting my AVAssetWriter status change to AVAssetWriterStatusFailed.

(Or the number 3)

The reason I was getting this error is because I didn’t delete the file I tried to write into from the previous run session.

That was an easy error but after that I was getting an error when trying to append the first(and following) CVPixelBuffer.

I got AVAssetWriterStatusFailed again but the error value was Unknown Error or the OSStatus was -12902.

After a while I have found out that nothing was wrong with my code except for the fact that something was wrong with the resolution.

Since I was running my app on an iPad3 the resolution was 2048×1536. I don’t know why but the codec or AVAssetWriter couldn’t handle this resolution.

Simply reducing the resolution to something smaller allowed me to write the video without getting any error.


For the sake of completion here is part of the code I was using:


- (void)glkView:(GLKView *)view drawInRect:(CGRect)rect
    @synchronized(self) {
        if (Recording)
 //           sample = [self sampleBufferFromCGImage:[self snapshotRenderBuffer]];
            GLubyte * data = [self snapshotRenderBufferToData];
            CVPixelBufferRef pixelBuffer = [self pixelBufferFromBytes: &data withSize:CGSizeMake(videoWidth, videoHeight) withSourceSize: CGSizeMake(screenWidth, screenHeight)];
            if (adaptor.assetWriterInput.readyForMoreMediaData)
                [adaptor appendPixelBuffer:pixelBuffer withPresentationTime:CMTimeMake(RecordFrame*10, 600)];
            //                [writerInput appendSampleBuffer: sample];
            free (data);
            NSLog(@"%ld %@", (long)videoWriter.status, videoWriter.error);
            //            NSLog(@"%ld", (long)videoWriter.status);
    if (DoneRecord)
    @synchronized(self) {
        NSArray *paths = NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES);
        NSString *documentsDirectory = [paths objectAtIndex:0];
        NSString *sourcePath = [documentsDirectory stringByAppendingPathComponent:@"video.mov"];
        if (Recording)
            [writerInput markAsFinished];
            NSLog(@"%ld", (long)videoWriter.status);
            [videoWriter finishWritingWithCompletionHandler:^{
            DoneRecord = YES;
            videoWidth = screenWidth/2;
            videoHeight = screenHeight/2;
            NSError * error = nil;
            NSFileManager *fileManager = [NSFileManager defaultManager];
            BOOL success = [fileManager removeItemAtPath:sourcePath error:&error];
            if (success) {

                NSLog(@"Could not delete file -:%@ ",[error localizedDescription]);
            RecordFrame = 0;
            videoWriter = [[AVAssetWriter alloc] initWithURL:
                                          [NSURL fileURLWithPath:sourcePath] fileType:AVFileTypeQuickTimeMovie
            NSLog(@"%ld", (long)videoWriter.status);

            NSDictionary *videoSettings = [NSDictionary dictionaryWithObjectsAndKeys:
                                           AVVideoCodecH264, AVVideoCodecKey,
                                           [NSNumber numberWithInt:videoWidth], AVVideoWidthKey,
                                           [NSNumber numberWithInt:videoHeight], AVVideoHeightKey,
            writerInput = [AVAssetWriterInput
                                                outputSettings:videoSettings]; //retain should be removed if ARC

            NSParameterAssert([videoWriter canAddInput:writerInput]);
            writerInput.expectsMediaDataInRealTime = YES;
            [videoWriter addInput:writerInput];

            adaptor = [AVAssetWriterInputPixelBufferAdaptor assetWriterInputPixelBufferAdaptorWithAssetWriterInput:writerInput sourcePixelBufferAttributes:nil];

            [videoWriter startWriting];
//            NSLog(@"%ld %@", (long)videoWriter.status, videoWriter.error);
            [videoWriter startSessionAtSourceTime:kCMTimeZero];
        Recording = !Recording;

Getting wrong screen resolution on iPod touch or iPhone 5? Launch image woes… (iOS)

I was working on improving my existing game Concussion Boxing and adding a new block move.

Part of it was replacing the splash screen and icon into a better looking ones.

Newer generations of iPhones and iPods have a 4 inch display and a non retina resolution of 320×568 while older devices with a 3.5 inch screen have only 320×480.

In AppDelegate I would load the correct xib file to match the screen resolution.

I would use: [[UIScreen mainScreen] bounds].size.height to figure out which screen resolution my device has.

However, for my iPod touch 5 with iOS 7 on it I would get the wrong resolution(320×480) from the bounds property instead of the 4 inch resolution.

It turns out that not having the correct Launch images set up for the application could prevent your app from running with the correct resolution for the device.

Setting the launch images is a bit tricky, you need to set the exact correct file names in order for this to work.

This post show you the correct naming for the launch images:


Even after naming correctly your launch images you may get a black splash screen instead of the image you have set.

This is due to the fact that PNG files have several types of formats and not all of them are displayed correctly.

In my case a PNG image with transparency(32 bit pixels) was not  displayed while saving the image in 24 bit with Paint.Net worked and displayed correctly.

OpenGL Texture Size and performance? Texture Cache? (Android\iOS GLES 2)

I am working on my dragon simulation mobile game and I am at the stage of adding a terrain to the game.

I optimized the terrain render for my iOS devices but on my Android device I was getting bad performance.

Understanding texture size and performance is kind of elusive. I was being told many times that big textures are bad but I was never able to find the correlation between texture size and performance.

I am still not sure what is the correlation between texture size, bandwidth and performance on mobile devices and I am sure it’s a complex one.

However, I did find out something else.

As I mentioned, my first attempt to copy the OpenGLES shaders from my iOS code to my Android code gave me poor results. The same scene that ran at 60 FPS on my iPod was running on 25 FPS on my Android phone.

This is how the scene looked like on my Android phone:

Slow Terrain Render

Scene rendered at 25 FPS(40 ms per frame)

For the terrain I am using two 2048×2048 ETC1 compressed textures. One for the grass and one for the rocky mountain.

Maybe my phone’s performance is really not as good as my iPod? But then, something was missing.

On my iPod I was already using mipmapped textures while on the first attempt of the Android version I didn’t use mipmapped textures.

Mipmapped textures are texture which not only contain the texture itself but also all (or some) of the smaller versions of the same texture image.

If you have a texture of size 16×16 pixels then a mipmapped texture will contain both the 16×16 image but also the 8×8, 4×4, 2×2 and 1×1 resolutions of the same image.

This is useful because it’s hard to scale down a texture on the GPU without losing details. The mipmapped images are precalculated offline and may use the best algorithms to reduce the image.

When rendering with mipmapped textures the GPU selects the mipmapped image that is the most suitable for the current scaling in the scene.

But apart from looking better, there is another advantage. Performance.

The same scene using mipmapped version of the 2048×2048 textures runs a lot faster than before. I could get a scene render at about 50 to 60 FPS.

The reason for that is that textures have a 2D spatial cache.

In this scene the mountain and grass textures are scaled down considerabley. This in turn makes the GPU sample the textures in texel(texture pixels) that are far from each other making no use of the cache.

In order to make use of the cache the sampling of the texture must have spatial proximity.

When using the mipmapped version of the texture a much smaller layer of the 2048×2048 texture was sampled and thus it was possible to make use of the cache for this specific image.

For the sake of completion, here is the scene with the mipmapped textures:

Runs at about 50-60 FPS(17-20 ms)

Runs at about 50-60 FPS(17-20 ms)

Location 0? location -1? glGetUniformLocation, C++ and bugs. (Android\iOS GLES 2)

In OpenGLES 2 glGetUniformLocation receives the program id and a string as parameters. It then attempts to return a location int that can be used to set uniform GLSL shader variables.

If the variable is found it will return a 0 or positive value. If it fails to find the uniform variable it will return -1.

In C++ we should initialize the location ints in the ctr. If we don’t initialize the locations we might have garbage values when in Release mode.

Using the locations with garbage values might overwrite uniform variables with values we did not intend them to have.

So what we should initialize the locations with? One might think that 0 is a good value to initialize but it is not.

Remember! 0 is a valid shader uniform variable location. If we set all the locations to 0 we might overwrite the uniform variable at location 0.

We should initialize the location ints with -1.

We should do this because -1  is the value that is returned in case the uniform variable was not found and setting a value at location -1 will be ignored.

glDepthFunc and GL_LEQUAL for background drawing in an OpenGL ES 2 Android\iOS game.

For the (semi secret)game I am currently working on(http://www.PompiPompi.net/) I had to implement the background or backdrop of the 3D world.

A simple method to draw the background is having a textured sphere mesh surround the camera.

The issue with this method is that the sphere is not perfect so there are distortions and it doesn’t completely fit the viewing frustum. Which means some geometry that is inside our frustum will be occluded by the sphere.

A different method is to draw a screen space quad at the exact back end of the frustum. This quad will have coordinates in screen space and won’t require transformation.

The back of the plane in screen space coordinates is 1.

You could disable writing into the ZBuffer with glDepthMask(false) and draw the background first. All the geometry that renders afterwards will overwrite the background.

However, what if we want to draw the background last and save on fill rate by using the ZBuffer?

Just drawing the background last should have done that but instead it might not render at all.

We usually clear the depth buffer part of the render buffer into 1.0 which is the highest value for the depth buffer. But our background screen mesh is also rendered into 1!

It turns out the default depth test function in OpenGLES is GL_LESS. Since the ZBuffer is already 1.0 our background screen mesh won’t render.

What we can do is before rendering the screen mesh, we can set the depth test function  into less or equal by calling: glDepthFunc(GL_LEQUAL);

This way our background buffer will not draw at pixels that we have drawn geometry before but will draw on the rest of the “blank” pixels.