CUDA is well suited to image processing in parallel threads: the pixels of an image can be mapped directly to threads. For efficient image processing with CUDA, one needs to understand the location and scope of the pixel data, along with its access pattern.
CUDA provides several kinds of memory, i.e. shared, constant, texture, local and global memory, along with a large number of registers. Each of these memories has its own advantages and limitations. Shared memory is a high-speed on-chip memory that is well suited to image tiling for fast image processing on the GPU, but it has a limited size (16 KB per multiprocessor on the Quadro CX).
Texture memory, on the other hand, allows arbitrary gather (read) access to image data through a dedicated cache, which is faster than uncached global or local memory for 2D-local access patterns.
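As a minimal sketch of such cached reads (kernel and variable names here are illustrative, using the texture-object API rather than the older texture references), a kernel can fetch pixels with tex2D, which routes the read through the texture cache:

```cuda
// Illustrative sketch: reading image pixels through a texture object.
// tex2D<> fetches go through the texture cache, which is optimized for
// 2D-local access patterns.
__global__ void copyViaTexture(cudaTextureObject_t texIn,
                               unsigned char *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<unsigned char>(texIn, x, y); // cached 2D read
}
```

The texture object itself would be created on the host with cudaCreateTextureObject over a CUDA array holding the image.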
In some cases, each thread can be used to write more than one pixel.
Between reads and writes of shared memory by the threads within a block, it is good practice to use barrier synchronization (__syncthreads()), which ensures that each read or write happens only after all of the corresponding preceding writes or reads have completed.
The major differences between CPU and CUDA threads are that CUDA threads are extremely lightweight, with very little creation overhead and nearly instant switching. CUDA uses thousands of threads, whereas multi-core CPUs can run only a few.
Within a thread, CUDA is capable of issuing floating-point operations concurrently with int or char operations.
Memory bandwidth between host and device (in either direction) is much lower than within device memory, so it is good practice to read the image data into device memory as a one-time effort.
Similarly, after processing is complete, the output data should be copied back from device to host in a single transfer.
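This one-time transfer pattern can be sketched as follows (function and variable names here are illustrative; error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Sketch of the one-time transfer pattern: copy the image to the device
// once, run the kernel(s) entirely in device memory, and copy the result
// back once at the end.
void processOnDevice(const unsigned char *hostImage,
                     unsigned char *hostResult, int srcW, int srcH)
{
    unsigned char *dIn = 0, *dOut = 0;
    size_t bytes = (size_t)srcW * srcH * 4; // RGBA, one byte per channel

    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hostImage, bytes, cudaMemcpyHostToDevice); // host->device, once

    // ... launch one or more kernels that operate only on device memory ...

    cudaMemcpy(hostResult, dOut, bytes, cudaMemcpyDeviceToHost); // device->host, once
    cudaFree(dIn);
    cudaFree(dOut);
}
```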
Once the image is in device memory (global memory), it is helpful to load the image tile to be processed by the current block of threads from global into shared memory.
After this read, the threads within the block must be synchronized to ensure that all threads have copied the required image data into the tile.
The threads then operate on the pixels in shared memory in parallel and, after completion, write the tile back from shared memory to the output in global memory.
In image processing, each output pixel generally needs access to the neighboring pixels within a certain radius (e.g. a matrix of neighboring pixels). This means the tiles in shared memory must be expanded with an apron (border) that contains those neighboring pixels.
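A sketch of such an expanded tile, assuming 16x16 thread blocks, a radius-1 neighborhood, and a single-channel image (the 3x3 box average here is just a stand-in neighborhood operation):

```cuda
#define RADIUS 1  // assumed neighborhood radius for this sketch

// Loads a 16x16 tile together with its one-pixel apron of neighbors into
// shared memory, then applies a 3x3 box average as an example operation.
__global__ void tileWithApron(const unsigned char *in,
                              unsigned char *out, int w, int h)
{
    __shared__ unsigned char tile[16 + 2 * RADIUS][16 + 2 * RADIUS];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The expanded tile has more entries than there are threads, so each
    // thread cooperatively fills one or more entries, clamping coordinates
    // at the image border.
    for (int ty = threadIdx.y; ty < 16 + 2 * RADIUS; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < 16 + 2 * RADIUS; tx += blockDim.x) {
            int gx = min(max((int)(blockIdx.x * blockDim.x) + tx - RADIUS, 0), w - 1);
            int gy = min(max((int)(blockIdx.y * blockDim.y) + ty - RADIUS, 0), h - 1);
            tile[ty][tx] = in[gy * w + gx];
        }
    __syncthreads(); // the whole tile, apron included, is now in shared memory

    if (x < w && y < h) {
        int sum = 0;
        for (int dy = -RADIUS; dy <= RADIUS; ++dy)
            for (int dx = -RADIUS; dx <= RADIUS; ++dx)
                sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
        out[y * w + x] = (unsigned char)(sum / 9);
    }
}
```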
Below is the kernel code for RGB-to-grayscale conversion:
// Converts an RGBA image to grayscale (the gray value is replicated into
// all four channels), using shared-memory tiles to store the image data
// for the current block.
__global__ void RGBTOGrayScale(unsigned char *inRGBImage, unsigned char *outGrayImage, int srcW, int srcH)
{
    // Threads per block is 16*16 and each pixel has 4 attributes: R, G, B, A.
    __shared__ unsigned char shInImage[16 * 16 * 4];
    __shared__ unsigned char shOutImage[16 * 16];

    int x = blockIdx.x * blockDim.x + threadIdx.x; // index along srcW in the source image
    int y = blockIdx.y * blockDim.y + threadIdx.y; // index along srcH in the source image
    int tidx = threadIdx.x; // thread index along x within the tile
    int tidy = threadIdx.y; // thread index along y within the tile
    int shIndex  = (tidy * 16 + tidx) * 4;
    int srcIndex = (y * srcW + x) * 4;

    bool inside = (x < srcW) && (y < srcH); // guard against threads outside the image

    // Copy the active portion of the image from global to shared memory.
    if (inside)
    {
        shInImage[shIndex + 0] = inRGBImage[srcIndex + 0];
        shInImage[shIndex + 1] = inRGBImage[srcIndex + 1];
        shInImage[shIndex + 2] = inRGBImage[srcIndex + 2];
        shInImage[shIndex + 3] = inRGBImage[srcIndex + 3];
    }
    __syncthreads(); // barrier synchronization: all threads have finished reading

    if (inside)
    {
        // Standard luminance weights for R, G and B.
        shOutImage[tidy * 16 + tidx] =
            (unsigned char)(shInImage[shIndex + 0] * 0.3f +
                            shInImage[shIndex + 1] * 0.59f +
                            shInImage[shIndex + 2] * 0.11f);

        // Copy the result back from shared to global memory, replicating
        // the gray value into all four channels.
        outGrayImage[srcIndex + 0] = shOutImage[tidy * 16 + tidx];
        outGrayImage[srcIndex + 1] = shOutImage[tidy * 16 + tidx];
        outGrayImage[srcIndex + 2] = shOutImage[tidy * 16 + tidx];
        outGrayImage[srcIndex + 3] = shOutImage[tidy * 16 + tidx];
    }
}
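On the host side, the kernel above would be launched with 16x16 thread blocks, rounding the grid size up so that every pixel is covered (a sketch, assuming dIn and dOut are device pointers already populated as described earlier):

```cuda
// Host-side launch sketch for the grayscale kernel: the grid dimensions
// are rounded up so blocks cover the whole image even when srcW or srcH
// is not a multiple of 16.
dim3 block(16, 16);
dim3 grid((srcW + block.x - 1) / block.x,
          (srcH + block.y - 1) / block.y);
RGBTOGrayScale<<<grid, block>>>(dIn, dOut, srcW, srcH);
cudaDeviceSynchronize(); // wait for completion before copying results back
```

The out-of-range threads in the partial edge blocks are exactly the ones filtered out by the bounds check inside the kernel.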