Avoiding if conditions in CUDA

Branching is one of the most time-consuming operations in CUDA (and it is costly in general as well).

For example, compare the CPU code and GPU code below.

void CPUCode( int* input, int* output, int length )
{
    // Note: reads input[ i + 1 ], so input must hold at least length + 1 elements.
    for ( int i = 0; i < length; ++i )
    {
        output[ i ] = input[ i ] + 2 * input[ i + 1 ];
    }
}

GPU code:

__global__
void GPUCode( int* input, int* output, int length )
{
    // __umul24() is a fast 24-bit integer multiply intrinsic.
    int idx = __umul24( blockDim.x, blockIdx.x ) + threadIdx.x;

    // Guard against threads whose index is past the end of the array.
    if ( idx < length )
    {
        output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
    }
}

In the above GPU code, there is an if condition that every thread executes. A GPU runs fastest when all the threads of a warp execute the same instruction at the same time; when threads take different paths through a branch, the paths are serialized and performance drops. Ideally the kernel code (the __global__ function) should be a single straight-line path with no divergent branching inside it.
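The reason some threads see idx >= length in the first place is that the grid size is rounded up at launch. Here is a minimal host-side launch sketch of that rounding; the block size of 256 and the function name LaunchGPUCode are assumptions for illustration, not part of the original code:

void LaunchGPUCode( int* d_input, int* d_output, int length )
{
    const int blockSize = 256;                                       // assumed block size
    const int gridSize  = ( length + blockSize - 1 ) / blockSize;    // ceil( length / blockSize )

    // The last block usually contains threads whose idx is past the
    // end of the array, which is why the kernel needs the bounds check.
    GPUCode<<< gridSize, blockSize >>>( d_input, d_output, length );
}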

Now look at the modified GPU version of the code.

__global__
void GPUCode( int* input, int* output, int length )
{
    int idx = __umul24( blockDim.x, blockIdx.x ) + threadIdx.x;

    // Clamp instead of branching: out-of-range threads simply
    // recompute the last element rather than taking a different path.
    idx = max( idx, 0 );
    idx = min( idx, length - 1 );   // clamp to length - 1, or output[ length ] would be written out of bounds

    output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
}

There is no branching in the modified GPU version. Every thread follows the same straight-line path and executes the same instruction at the same time, so this kind of code executes very fast on the GPU.

The same technique can be applied to ordinary CPU code as well.
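As a rough illustration of the same idea on the CPU, an index clamp can replace a per-iteration bounds check. This sketch assumes input holds only length elements (a slightly different precondition from the example above), and the name CPUCodeBranchless is hypothetical:

#include <algorithm>

void CPUCodeBranchless( int* input, int* output, int length )
{
    for ( int i = 0; i < length; ++i )
    {
        // Clamp the neighbour index instead of branching on "i + 1 < length";
        // the last iteration simply re-reads the final element.
        int next = std::min( i + 1, length - 1 );
        output[ i ] = input[ i ] + 2 * input[ next ];
    }
}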
