Branching is one of the most time-consuming operations in CUDA (and on CPUs in general as well). For example, compare the CPU code and GPU code below.
CPU code:
void CPUCode( int* input, int* output, int length )
{
    // Assumes input holds at least length + 1 elements,
    // since the loop reads input[ i + 1 ].
    for ( int i = 0; i < length; ++i )
    {
        output[ i ] = input[ i ] + 2 * input[ i + 1 ];
    }
}
GPU code:
__global__
void GPUCode( int* input, int* output, int length )
{
    // __umul24 is a 24-bit integer multiply intrinsic, a legacy
    // optimization; on modern GPUs a plain multiply is just as fast.
    int idx = __umul24( blockDim.x, blockIdx.x ) + threadIdx.x;
    // Bounds check: threads past the end of the array do nothing.
    if ( idx < length )
    {
        output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
    }
}
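For reference, a minimal host-side launch for this kernel might look like the sketch below. The array size N, the block size of 256, and the allocations are illustrative assumptions, not part of the original.
int N = 1024;                                        // hypothetical element count
int *d_input, *d_output;
cudaMalloc( &d_input,  ( N + 1 ) * sizeof( int ) ); // + 1: kernel reads input[ idx + 1 ]
cudaMalloc( &d_output, N * sizeof( int ) );
int blockSize = 256;                                 // hypothetical block size
int gridSize  = ( N + blockSize - 1 ) / blockSize;   // round up so every element is covered
GPUCode<<<gridSize, blockSize>>>( d_input, d_output, N );
cudaDeviceSynchronize();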
In the GPU kernel above, there is an if condition that every thread executes. Threads on a GPU run in groups (warps) that execute the same instruction in lockstep; when the threads of a warp take different paths at a branch, the hardware serializes the paths and performance drops. Execution is fastest when every thread executes the same instruction at the same time, i.e., when the kernel code (the __global__ function body) is straight-line, with no branching inside it.
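As an illustrative sketch (this DivergentCode kernel is not from the original), the code below forces even and odd threads of the same warp down different paths, which the hardware must execute one after the other:
__global__
void DivergentCode( int* output )
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    // Even and odd lanes of a warp take different paths here, so the
    // warp executes both branches serially with some lanes masked off.
    if ( ( idx & 1 ) == 0 )
    {
        output[ idx ] = idx * 2;
    }
    else
    {
        output[ idx ] = idx + 1;
    }
}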
Now look at the modified GPU version:
__global__
void GPUCode( int* input, int* output, int length )
{
    int idx = __umul24( blockDim.x, blockIdx.x ) + threadIdx.x;
    // Clamp the index instead of branching: out-of-range threads
    // redundantly recompute the last element rather than diverging.
    idx = max( idx, 0 );
    idx = min( idx, length - 1 ); // length - 1, so output[ idx ] stays in bounds
    output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
}
There is no branching in the modified GPU version above. The kernel code is straight-line: every thread executes the same instruction at a time, and this kind of code executes very fast on the GPU. The same technique is applicable to general CPU code as well, as the sketch below shows.
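As a CPU-side sketch (the ClampValues function and its parameters are illustrative, not from the original), std::min and std::max typically compile to conditional-move instructions, so the loop body contains no data-dependent jump for the branch predictor to mispredict:
#include <algorithm>

// Branch-free clamp loop on the CPU: the bounds check becomes
// arithmetic, with no if/else inside the hot loop.
void ClampValues( const int* input, int* output, int length, int lo, int hi )
{
    for ( int i = 0; i < length; ++i )
    {
        output[ i ] = std::min( std::max( input[ i ], lo ), hi );
    }
}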