Branching is one of the most expensive operations in CUDA (and it can be costly in general as well).
For example, compare the CPU code and GPU code below.

CPU code:
// Assumes input holds at least length + 1 elements,
// since the loop reads input[ i + 1 ] at i = length - 1.
void CPUCode( int* input, int* output, int length )
{
    for ( int i = 0; i < length; ++i ) {
        output[ i ] = input[ i ] + 2 * input[ i + 1 ];
    }
}
GPU code:
__global__ void GPUCode( int* input, int* output, int length )
{
    // __umul24 is a fast 24-bit multiply; fine as long as
    // blockDim.x * blockIdx.x fits in 24 bits
    int idx = __umul24( blockDim.x, blockIdx.x ) + threadIdx.x;
    if ( idx < length ) {
        output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
    }
}
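To see where the out-of-range threads come from, here is a minimal host-side launch sketch (the helper name LaunchGPUCode and the block size are illustrative, not part of the original):

void LaunchGPUCode( int* d_input, int* d_output, int length )
{
    int threadsPerBlock = 256;  // a common choice
    // Round up so every element gets a thread; the last block
    // may contain threads with idx >= length, hence the guard.
    int numBlocks = ( length + threadsPerBlock - 1 ) / threadsPerBlock;
    GPUCode<<< numBlocks, threadsPerBlock >>>( d_input, d_output, length );
}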
In the GPU code above, every thread executes the if condition. GPU threads run in groups called warps, and a warp is fastest when all of its threads execute the same instruction at the same time; when threads of one warp take different branches, the hardware runs each path in turn while the other threads wait. Ideally, then, the kernel code (the __global__ function body) should be straight-line, with no branching inside it.
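As an illustration, here is a hypothetical kernel (not from the original text) in which even and odd threads of the same warp take different paths, so the two paths are serialized:

__global__ void DivergentCode( int* data )
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if ( idx % 2 == 0 ) {
        data[ idx ] *= 2;   // even threads run this branch ...
    } else {
        data[ idx ] += 1;   // ... while odd threads wait, then run this one
    }
}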
Now look at the modified GPU code:
__global__ void GPUCode( int* input, int* output, int length )
{
    int idx = __umul24( blockDim.x, blockIdx.x ) + threadIdx.x;
    idx = max( idx, 0 );
    // Clamp to the last valid element ( length - 1, not length,
    // to stay in bounds ) instead of branching.
    idx = min( idx, length - 1 );
    output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
}
There is no branching in the modified GPU code above: the kernel is straight-line, and every thread executes the same instruction at the same time, which is the pattern the GPU runs fastest. Note that the clamp makes out-of-range threads redundantly recompute the last element rather than sit idle behind a branch; they all write the same value to the same slot, so the result is unchanged. The same technique is applicable to general CPU code as well.
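For example, here is a sketch of the same clamping idea on the CPU (the function name CPUCodeNoBranch is illustrative). Clamping the neighbour index with std::min removes the per-element bounds branch a safe version of the loop would otherwise need, which also makes the loop easier for the compiler to vectorize; at the boundary the last element is simply reused:

#include <algorithm>

void CPUCodeNoBranch( int* input, int* output, int length )
{
    for ( int i = 0; i < length; ++i ) {
        int next = std::min( i + 1, length - 1 );  // clamp instead of if ( i + 1 < length )
        output[ i ] = input[ i ] + 2 * input[ next ];
    }
}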