GPUImageKuwaharaFilter blocking the main thread?

8 replies
mabene
Joined: 10/10/2012

Hi there,

I have a question about the GPUImageKuwaharaFilter filter.
The documentation says that it is "extremely computationally expensive [...] it can take seconds to render a frame".
I'm fine with this, as long as I can offload that heavy computation to a background thread and keep the UI responsive while the filter processes its input.

Unfortunately, even if I dispatch all the operations related to this filter (creation, setup, use) asynchronously to a background queue, the main thread somehow gets blocked and the UI freezes. (I have used background queues in several other cases and they work just fine.)
In a test I ran, creating 16 GPUImageKuwaharaFilter instances (differing only in their parameter values) and applying them to the same "small" 480x480 image completely freezes the UI for 20 seconds on an iPhone 4S.

Attached is a screenshot of an Instruments profiling session covering those initial 20 seconds, in which several "CPU blocked by GPU" errors are reported, though I don't think this is the actual issue.

How can this be?

You can observe this UI-blocking effect in the FilterShowcase application bundled with the framework: just pick the GPUImageKuwaharaFilter and watch the transition stutter and the UI become unresponsive for a couple of seconds.

Attachment: GPUImage-trace.png (396.48 KB)
Brad Larson
Joined: 05/14/2008

It's probably not directly stopping the main thread, but instead causing the entire system to grind to a halt. As I mention in the notes there, it is an incredibly inefficient shader and will use the entire GPU when it runs. Because these operations are atomic, nothing else will be able to run on the GPU while it is doing this, including UI compositing and animation. This then can cause a ripple effect downward, halting user interactions or anything else on the main thread because the system is still waiting for the GPU to be freed up.

It can even get so bad that the OpenGL ES driver's watchdog timer will kick in and kill the Kuwahara shader before it has finished processing the entire image. This will lead to odd block artifacts on slower devices or for larger images.

There is no real solution for this, short of processing the image in smaller tiles and blending them, or reworking the Kuwahara shader to be much more efficient. The latter is the better way to approach this, but it will take a lot of cleanup work.
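The tiling approach hinges on giving each tile a halo: as long as the filter for a tile can still read `radius` extra pixels of source data around its output region, the tile boundaries produce no seams at all. Here is a minimal CPU-side sketch of that idea (using a box mean as a stand-in for the expensive filter; all function names are hypothetical, not GPUImage API):

```c
#include <stdlib.h>

/* Clamp-to-edge addressing, like a shader's texture sampling. */
static unsigned char px(const unsigned char *img, int w, int h, int x, int y)
{
    if (x < 0) x = 0;
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return img[y * w + x];
}

/* Stand-in for the expensive filter: a (2r+1) x (2r+1) box mean. */
static unsigned char box_mean(const unsigned char *img, int w, int h,
                              int cx, int cy, int r)
{
    int sum = 0, n = 0;
    for (int dy = -r; dy <= r; dy++)
        for (int dx = -r; dx <= r; dx++) {
            sum += px(img, w, h, cx + dx, cy + dy);
            n++;
        }
    return (unsigned char)(sum / n);
}

/* Filter only the output rectangle [x0,x1) x [y0,y1), but read from the
 * full source image: every tile implicitly carries the radius-r halo of
 * input pixels it needs, so tile borders introduce no seams. */
static void filter_rect(const unsigned char *src, unsigned char *dst,
                        int w, int h, int r,
                        int x0, int y0, int x1, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            dst[y * w + x] = box_mean(src, w, h, x, y, r);
}

/* Returns 1 when whole-image filtering and 2x2 tiled filtering agree
 * on every pixel. */
int tiled_matches_whole(int w, int h, int r)
{
    unsigned char *src = malloc((size_t)(w * h));
    unsigned char *whole = malloc((size_t)(w * h));
    unsigned char *tiled = malloc((size_t)(w * h));
    int same = 1;

    for (int i = 0; i < w * h; i++)
        src[i] = (unsigned char)((i * 31) & 0xFF);

    filter_rect(src, whole, w, h, r, 0, 0, w, h);

    int hw = w / 2, hh = h / 2;
    filter_rect(src, tiled, w, h, r, 0,  0,  hw, hh);
    filter_rect(src, tiled, w, h, r, hw, 0,  w,  hh);
    filter_rect(src, tiled, w, h, r, 0,  hh, hw, h);
    filter_rect(src, tiled, w, h, r, hw, hh, w,  h);

    for (int i = 0; i < w * h; i++)
        if (whole[i] != tiled[i]) same = 0;

    free(src); free(whole); free(tiled);
    return same;
}
```

On the GPU, the equivalent would be uploading each tile together with its radius-wide input border and writing out only the interior pixels. With a full halo no blending step is needed; overlap-and-feather blending is only required when the halo has to be truncated.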

The "CPU blocked by GPU" warnings are due to how long this operation takes to finish. I need to use a glFinish() call to wait for the GPU to finish processing before reading pixels back (otherwise pixels would be read back before the GPU has finished rendering), but that call will be running on a background thread if that's where you're capturing the image from the filter.

mabene

I see.
So I guess all the other filters have the same "issue"... but they complete fast enough not to block the UI perceptibly? I wonder whether CIFilters are also computed atomically.

Suppose I decide to "process the image in smaller tiles and blend them"; how exactly would I do that?
Say I split the input image into four tiles (2x2 sub-images), process them separately, and then merge the results.
I guess that won't give a seamless result, given the way shaders compute their output images...

I'll take a look at the Kuwahara shader code to see if there's something I can do, but I'm no expert in OpenGL.

BTW, are the "atomic operations" you mention the application of the filter, the compilation of the shader, or both?

mabene

I just noticed an interesting thing.

The very same code we are talking about does *not* block the UI at all when run in the iPhone Simulator (as opposed to on an actual device).
I guess that in this case OpenGL falls back to the CPU because, for some reason, it cannot use my laptop's GPU. That would mean the operations are no longer atomic, so the background GPUImage queue does what it is supposed to do and the UI stays interactive and fluid.

It seems the time it takes to compute a Kuwahara filter on my laptop's CPU is roughly comparable to the time the same filter takes on my iPhone's GPU.

That said, is there a way in the GPUImage framework to force some (or all) filters to run on the CPU instead of the GPU on actual devices?

I realize this goes against the very nature of GPUImage, but in my case the rationale is that I'm more interested in not blocking the UI than in fast filter computation.
That is: 2 seconds with a blocked UI for a GPU-accelerated filter is worse than 4 seconds with a responsive UI for the same filter computed on the CPU.

Brad Larson

The Kuwahara shader is orders of magnitude slower than the others, so the rest don't monopolize the GPU to the point where they start interfering with other things. What I meant by an atomic operation is a single shading pass within a filter. Even multipass filters get broken up into a few distinct operations, with room in between for something else to run.

Yes, the Simulator emulates many shader functions in software and leaves others on the GPU. This makes it significantly slower than the newer iOS devices in many cases, even on my new Retina MacBook Pro. It makes sense that the UI updates would still be GPU-accelerated while many of the shader functions are emulated on the CPU, so I can see how the UI would remain responsive on a multicore Mac.

Unfortunately, there is no easy way to run any of this CPU-side on the device, because all of my code is simply for OpenGL ES. It would need to be completely rewritten with a parallel path using Accelerate or low-level NEON operations to run on the CPU. I'm not going to dedicate the weeks or months to do this. In any case, it's probably best to try to fix the core problem of a slow shader first.

For improvements, I'd first look at locking in the different filter radii as distinct shaders, removing the for loops, and finding and eliminating the redundant texture reads within the Kuwahara shader. Someone on the PowerVR forums did this and reported much faster filter times as a result, but I don't have a link to it.
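A GLSL-style sketch of the first suggestion (this is not GPUImage's actual shader source; the per-radius variant scheme and the `texelSize` uniform are assumptions): baking the radius in as a compile-time constant lets the compiler unroll the loops instead of branching on a uniform, and a texel-size uniform replaces the hardcoded `src_size` constant.

```glsl
// Hypothetical per-radius shader variant. RADIUS is a compile-time constant
// (e.g. a #define prepended when the program for that radius is built), so
// the GLSL compiler can fully unroll the loops instead of looping on a
// "uniform int radius" every frame.
#define RADIUS 3

varying highp vec2 textureCoordinate;
uniform sampler2D inputImageTexture;
uniform highp vec2 texelSize;   // 1.0 / actual image dimensions

void main()
{
    highp vec3 mean = vec3(0.0);
    highp vec3 squared = vec3(0.0);

    // One of the four Kuwahara sectors, shown as an example; the full
    // filter repeats this for all four sectors and keeps the mean of the
    // sector with the smallest variance.
    for (int j = -RADIUS; j <= 0; ++j) {
        for (int i = -RADIUS; i <= 0; ++i) {
            highp vec3 c = texture2D(inputImageTexture,
                textureCoordinate + vec2(float(i), float(j)) * texelSize).rgb;
            mean += c;
            squared += c * c;
        }
    }
    mean /= float((RADIUS + 1) * (RADIUS + 1));
    // ... variance comparison over the four sectors elided ...
    gl_FragColor = vec4(mean, 1.0);
}
```

The host code would then pick the compiled program whose baked-in radius matches the filter's radius setting, rather than uploading the radius as a uniform.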

mabene

Thank you Brad for all the answers.

mabene

Anyway, just for the record, here are the times it takes to apply different filters to the same still 480x480 image on an iPhone 4S:

"GPUImageSepiaFilter", step "filtering" completed in 0.05s
"GPUImageSoftEleganceFilter", step "filtering" completed in 0.09s
"GPUImageAmatorkaFilter", step "filtering" completed in 0.04s
"GPUImageMissEtikateFilter", step "filtering" completed in 0.03s
"GPUImagePosterizeFilter", step "filtering" completed in 0.03s
"GPUImageSmoothToonFilter", step "filtering" completed in 0.07s
"GPUImageKuwaharaFilter - radius 3", step "filtering" completed in 0.39s
"GPUImageKuwaharaFilter - radius 5", step "filtering" completed in 0.85s
"GPUImageKuwaharaFilter - radius 6", step "filtering" completed in 1.07s
"GPUImagePixellateFilter", step "filtering" completed in 0.03s
"GPUImagePolarPixellateFilter", step "filtering" completed in 0.04s
"GPUImagePolkaDotFilter", step "filtering" completed in 0.03s
"GPUImageHalftoneFilter", step "filtering" completed in 0.03s
"GPUImageCrosshatchFilter", step "filtering" completed in 0.03s
"GPUImageSketchFilter", step "filtering" completed in 0.04s

This does not include the setup time (alloc/init/set parameters), which can be quite significant in some cases.

mabene

Hi Brad,

I reimplemented this filter without using the GPU.
It works fine, but do you have any idea why I have to use half the radius I pass to your filter to obtain the same visual result?

E.g., my filter with radius=3 produces the same effect as your filter with radius=6.
I'm working on a 480x480 image (UIImage with scale=1, so pixels and points coincide).

Could it have anything to do with the hardcoded "const vec2 src_size = vec2 (768.0, 1024.0);" in the shader code?

Many thanks

mabene

Hi Brad,

I've open-sourced my CPU-based implementation of the Kuwahara filter (which I needed in order to work around the blocking effect of the GPU-based one):

https://github.com/mabene/kuwahara

I've yet to benchmark my implementation properly, but it looks fast: at least 2x faster than GPUImage for any given image size and radius. And it is fully non-blocking with respect to the main thread. :)

The code may seem quite long (UIImage+kuwahara.m is the only interesting bit), but most of it is not specific to Kuwahara; I just wanted to make the category self-contained.
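For the record, the classic Kuwahara criterion behind such a CPU port can be sketched as follows (a minimal grayscale version with hypothetical names, not the code from the repository): each output pixel takes the mean of whichever of its four corner windows has the lowest variance, which is what preserves edges.

```c
/* Mean and variance of the (r+1) x (r+1) window with top-left corner
 * (x0, y0), clamping reads to the image edges. */
static double window_stats(const unsigned char *img, int w, int h,
                           int x0, int y0, int r, double *variance)
{
    double sum = 0.0, sum_sq = 0.0;
    int n = 0;
    for (int y = y0; y <= y0 + r; y++)
        for (int x = x0; x <= x0 + r; x++) {
            int cx = x < 0 ? 0 : (x >= w ? w - 1 : x);  /* clamp to edge */
            int cy = y < 0 ? 0 : (y >= h ? h - 1 : y);
            double v = (double)img[cy * w + cx];
            sum += v;
            sum_sq += v * v;
            n++;
        }
    double mean = sum / n;
    *variance = sum_sq / n - mean * mean;
    return mean;
}

/* One pixel of the classic Kuwahara filter: examine the four (r+1) x (r+1)
 * windows that meet at (cx, cy) and output the mean of the one with the
 * lowest variance. */
unsigned char kuwahara_pixel(const unsigned char *img, int w, int h,
                             int cx, int cy, int r)
{
    /* Top-left, top-right, bottom-left, bottom-right windows; each one
     * contains (cx, cy) as a corner. */
    int ox[4] = { cx - r, cx, cx - r, cx };
    int oy[4] = { cy - r, cy - r, cy, cy };
    double best_var = -1.0, best_mean = 0.0;

    for (int k = 0; k < 4; k++) {
        double var, mean = window_stats(img, w, h, ox[k], oy[k], r, &var);
        if (best_var < 0.0 || var < best_var) {
            best_var = var;
            best_mean = mean;
        }
    }
    return (unsigned char)(best_mean + 0.5);
}

/* Returns 1 if a hard vertical edge survives the filter without bleeding:
 * pixels on each side of the edge keep their side's value. */
int edge_preserved(void)
{
    unsigned char img[64];
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            img[y * 8 + x] = (unsigned char)(x < 4 ? 0 : 200);
    return kuwahara_pixel(img, 8, 8, 3, 4, 2) == 0
        && kuwahara_pixel(img, 8, 8, 4, 4, 2) == 200;
}
```

Since every output pixel depends only on the source image, the per-pixel work parallelizes trivially across rows on a background queue, which is what keeps a CPU implementation from ever touching the main thread or the GPU.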

best,

m
