OpenCL vs. CUDA

Today's processors have undergone a huge transformation from those of just 10 years ago.  CPU manufacturers Intel and AMD (and up and coming CPU designer ARM) have increased processor speed via greater emphasis on superscalar execution, deeper pipelining, branch prediction, out of order execution, and fast multi-level caches.  This design philosophy has resulted in faster response time for single tasks executing on a processor, but at the expense of increased circuit complexity, high power consumption, and a small number of cores on the die.  On the other hand, GPU manufacturers NVIDIA and ATI have focused their designs on processors with many simple cores that implement SIMD parallelism, which hides latency of instruction execution [1].

While GPUs have been in existence for about 10 years, the software support for these processor have taken years to catch up.  Software developers are still sifting through solutions for programming these processors.  OpenCL and CUDA are frameworks for GPGPU computing.  Each framework comprises a language for expressing kernel code (instructions that run on a GPU), and an API for calling kernels (from the CPU).  While the frameworks are similar, there are some important differences.

CUDA is a proprietary framework. It is not open source, and all changes to the language and API are made by NVIDIA. But, some third-party tools have been built around the framework and it does seem to have a large following in academia.  Unfortunately, CUDA only runs on NVIDIA devices.  While it should be possible to run CUDA code on other platforms using Ocelot, this only works on Linux systems.
 
OpenCL is a standardized framework, and is starting to gain popularity.  Similar to NVIDIA's CUDA C++, OpenCL allows programmers to use the massive parallel computing power of GPU's for general purpose computing.  Unlike CUDA, OpenCL works on any supported GPU or CPU, including Intel, AMD, NVIDIA, IBM, and ARM processors. 
 
Does OpenCL make programming multiple platforms easier?  Is it as fast as CUDA, or does it sacrafice speed for diverse platform support?

To answer these questions, I decided to solve a problem using CUDA (both the Driver API and the Runtime API) and OpenCL, and then compare the solutions with regards to the language, API, development tools, and the frequency and stability of the release of the SDK's.

Methods

The problem involves estimating the value of π using numerical integration of a well-known formula you probably have seen in your college calculus course:

Figure 1. Definition of π.

 

π = 0 1 4 1 + t 2 d t

This formula can be solved a number of ways, including composite trapezoid rule, Runge-Kutta (where it is used to approximate Simpson's Rule), and Monte Carlo.  However, for this problem, I used the Composite Simpson's Rule:

Figure 2. Composite Simpson's Rule.

 

a b f ( x ) dx h / 3 [ f ( x 0 ) + 2 i = 1 n / 2 1 f ( x 2i ) + 4 i = 1 n / 2 f ( x 2i 1 ) + f ( x n ) ] ∫_a^b f(x)dx≈h/3 [f(x_0 )+2∑_(i=1)^(n/2-1) f(x_2i ) +4∑_(i=1)^(n/2) f(x_(2i-1) ) f(x_n )]

Implementation of the CUDA and OpenCL solutions of the Composite Simpson's Rule is not hard.   The computaton of the r.h.s. of the equation in Figure 2 can be done in three steps, each of which can be parallelized and computed on a GPU.

  1. Initialize an array of floating points a[i], where i = 0..n.  a[i] = h * i where h = (1 – 0)/n = 1/n.  The array a[i] is the value of xi in the r.h.s. of the equation in Figure 2, from 0 to 1.
  2. Compute array of floating points fa[i], where i = 0..n.  fa[i] = f(a[i]) where f(p) = c(i) * 1/(1+p2). The coefficients c(i) = [1, 4, 2, 4, 2, 4, …, 1].  This computes the values of each term within the large braces in Figure 2.
  3. Compute s = ∑ fa[i], where i = 0..n. This computes the sum of the terms in Figure 2.

After computing Step 3, the approximate value of π is s * h / 3.

Let's see how the calculation would work for n = 7.  In Step 1, x[] = [ 0, 0.142857, 0.285714, 0.428571, 0.571429, 0.714286, 0.857143, 1].  In Step 2, fa[] = [4, 15.68, 7.396226, 13.51724, 6.030769, 10.59459, 4.611765, 2]. Finally, in Step 3, the sum of fa[] is 63.8306.  π ≈ h/3 * 63.8306 = 63.8306 / 3 / 7 = 3.039552 ≈ 3.141592653589793…

Steps 1 and 2 are MAP operations. For each implementation the thread and block identification (or local and group identification in OpenCL terminology), are converted into a contiguous range of integers within 0 .. n.  

Step 3 is a REDUCE operation, where a binary operator is applied to an array.  Reduce(op, fa[]) is a well-known problem, and a naive PRAM-like solution for the GPU is shown below (Figure 3).  This algorithm uses global memory, modifying the input array while computing the sum.  The result of Reduce(+, fa[i]) is contained in fa[0] at the end of the computation. The size of the array, which is n + 1, should be a power of 2.  If is is not, then it can be padded with zeros to a size that is.

 

Figure 3. GPU Pseudo code for Reduce, using global memory

Reduce(⊕, fa[0..m-1]) =

levels = log2 m

for (level = 1; level  levels; level += 1)

            step = 2level

            linear_to_tile(/ step, &tiles, &tile_size)

            parallel tile for (d = tilest = tile_size)

                        int p = tile_to_linear(dt)

                        int i = step * p

                        fa[i] = fa[i] ⊕ a[i + step / 2]

return [0]

 

The solutions for the problem using CUDA Runtime API, CUDA Driver API, and OpenCL are shown in Figure 4, 5, and 6, respectively.

Figure 4. Implementation of Composite Simpson's Rule for π using the CUDA Runtime API.

#include <stdio.h>
#include <stdlib.h>
#include <sys/timeb.h>
#include <time.h>
#include <iostream>
#include <iomanip>

void make_results(int * blocks, int * threads, int max_dimensionality, dim3 * tile_size, dim3 * tiles)
{
    tiles->x = blocks[0];
    tiles->y = blocks[1];
    tiles->z = blocks[2];
    tile_size->x = threads[0];
    tile_size->y = threads[1];
    tile_size->z = threads[2];
}

void l2t(int size, dim3 * tile_size, dim3 * tiles)
{
    int max_dimensionality = 3;
    int blocks[10];
    for (int j = 0; j < max_dimensionality; ++j)
        blocks[j] = 1;
    int max_threads[] = { 512, 64, 64};
    int max_blocks[] = { 65535, 65535, 65535};
    int threads[10];
    for (int j = 0; j < max_dimensionality; ++j)
        threads[j] = 1;

    int b = size / (max_threads[0] * max_blocks[0]);
    if (b == 0)
    {
        int b = size / max_threads[0];
        if (size % max_threads[0] != 0)
            b++;

        if (b == 1)
            max_threads[0] = size;

        // done. return the result.
        blocks[0] = b;
        threads[0] = max_threads[0];
        make_results(blocks, threads, max_dimensionality, tile_size, tiles);
        return;

    }

    int sqrt_size = sqrt((float)size / max_threads[0]);
    sqrt_size++;

    int b2 = sqrt_size / max_blocks[1];
    if (b2 == 0)
    {
        int b = sqrt_size;

        // done. return the result.
        blocks[0] = blocks[1] = b;
        threads[0] = max_threads[0];
        make_results(blocks, threads, max_dimensionality, tile_size, tiles);
        return;
    }
    throw;
}

__device__ int t2l()
{
    int i = (threadIdx.x
         + blockDim.x * blockIdx.x
         + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
         + blockDim.x * gridDim.x * threadIdx.y);
    return i;
}

template<class T>
__global__ void init(T * v, int n)
{
    int i = t2l();

    if (i > n) // we want v[n] set!
        return;

    if (i < 0)
        return;

    T w = 1.0 / n;
    v[i] = i * w;
}

template<class T>
T func(T * a, int i, int n)
{
    T vv = a[i];
    T w = 1.0L / (1 + vv * vv);
    if (i == 0)
        ;
    else if (i == n)
        ;
    else if (i % 2 == 0)
    {
        w = w * 2;
    }
    else
    {
        w = w * 4;
    }
    return w;
}

template<class T>
__global__ void map(T * v, int n)
{
    int i = t2l();

    if (i > n) // we want v[n] set!
        return;

    if (i < 0)
        return;

    T vv = v[i];

    T w = 1.0 / (1 + vv * vv);
    if (i == 0)
        ;
    else if (i == n)
        ;
    else if (i % 2 == 0)
    {
        w = w * 2;
    }
    else
    {
        w = w * 4;
    }
    v[i] = w;
}

template<class T>
__global__ void kernel_reduce_a(T * a, int n, int s)
{
    int i = t2l();
    i *= s;

    if (i >= n)
        return;

    if (i < 0)
        return;

    a[i] += a[i + s/2];
}

// From http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
int flog2(int v)
{
    int x = 0;
    while ((v = (v >> 1)) != 0)
    {
        x++;
    }
    return (int)x;
}

// Compute the 2 ** v.
int pow2(int v)
{
    int value = 1;
    for ( ; v > 0; v--)
        value <<= 1;
    return value;
}

void CHECK(cudaError_t x)
{
    if (x != 0)
    {
        std::cout << "CUDA error " << x << "\n";
        exit(1);
    }
}

int main()
{
    int np1 = pow2(24);
    int n = np1 - 1;

#define TYPE float
    {
        struct _timeb  t1;
        struct _timeb  t2;
        std::cout << "Starting tests...\n";
        _ftime_s(&t1);

        TYPE * da = 0;
        cudaError_t r1 = cudaMalloc(&da, np1 * sizeof(TYPE));
        CHECK(r1);
        {
            dim3 tile;
            dim3 tiles;

            l2t(np1, &tile, &tiles);

            // set up xi's.
            init<TYPE><<<tiles, tile>>>(da, n);

            // Map.
            map<TYPE><<<tiles, tile>>>(da, n);
        }

        // Reduce.
        int levels = flog2(np1);
        for (int level = 0; level < levels; ++level)
        {
            int step = pow2(level+1);
            int threads = np1 / step;

            dim3 tile;
            dim3 tiles;

            l2t(threads, &tile, &tiles);

            kernel_reduce_a<TYPE><<<tiles, tile>>>(da, n, step);
        }
        TYPE a;
        int rv3 = cudaMemcpy(&a, da, sizeof(TYPE) * 1, cudaMemcpyDeviceToHost);
        cudaFree(da);
        TYPE w = 4.0L / 3 / n;

        _ftime(&t2);
        std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
        std::cout << n << " " << std::setprecision(44) << a * w;
        std::cout << "\n";

    }
    return 0;
}

Figure 5(a). Implementation of Composite Simpson's Rule for π using CUDA Driver API.

#include <stdio.h>
#include <string.h>
#include <cuda.h>
#include <stdlib.h>
#include <sys/timeb.h>
#include <time.h>
#include <iostream>
#include <iomanip>

void linear_to_tile(int size, int max_dimensionality, size_t * threads, size_t * blocks)
{
    for (int j = 0; j < max_dimensionality; ++j)
        blocks[j] = 1;
    //int max_threads[] = { 512, 512, 64};
    int max_threads[] = { 512, 64, 64};
    int max_blocks[] = { 65535, 65535, 65535};
    for (int j = 0; j < max_dimensionality; ++j)
        threads[j] = 1;

    int b = size / (max_threads[0] * max_blocks[0]);
    if (b == 0)
    {
        int b = size / max_threads[0];
        if (size % max_threads[0] != 0)
            b++;

        if (b == 1)
            max_threads[0] = size;

        // done. return the result.
        blocks[0] = b;
        threads[0] = max_threads[0];
        return;
    }

    int sqrt_size = sqrt((float)size / max_threads[0]);
    sqrt_size++;

    int b2 = sqrt_size / max_blocks[1];
    if (b2 == 0)
    {
        int b = sqrt_size;

        // done. return the result.
        blocks[0] = blocks[1] = b;
        threads[0] = max_threads[0];
        return;
    }
    throw;
}

void CpuInit(float * v, int n)
{
    float w = 1.0 / n;
    for (int i = 0; i <= n; ++i)
        v[i] = i * w;
}

void CpuMap(float * v, int n)
{
    for (int i = 0; i <= n; ++i)
    {
        float vv = v[i];

        float w = 1.0 / (1 + vv * vv);
        if (i == 0)
            ;
        else if (i == n)
            ;
        else if (i % 2 == 0)
        {
            w = w * 2;
        }
        else
        {
            w = w * 4;
        }
        v[i] = w;
    }
}

void cpu_reduce(float * a, int n)
{
    // note: use double here because of accumulation of truncation error due to adding small numbers to a large number.
    double total = 0;
    for (int i = 0; i <= n; ++i)
        total += a[i];
    a[0] = total;
}

// From http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
int flog2(int v)
{
    int x = 0;
    while ((v = (v >> 1)) != 0)
    {
        x++;
    }
    return (int)x;
}

// Compute the 2 ** v.
int pow2(int v)
{
    int value = 1;
    for ( ; v > 0; v--)
        value <<= 1;
    return value;
}

void CHECK(CUresult x)
{
    if (x != 0)
    {
        std::cout << "Error " << x << "\n";
        throw new std::string("Error " + x);
    }
}

void CpuPi(int np1)
{
    int n = np1 - 1;
    struct _timeb  t1;
    struct _timeb  t2;
    std::cout << "Starting tests...\n";
    _ftime_s(&t1);
    float * a = (float*)malloc(np1 * sizeof(float));
    CpuInit(a, n);
    CpuMap(a, n);
    cpu_reduce(a, n);
    float w = *a * 4.0 / 3 / n;
    free(a);
    _ftime(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";

    std::cout << n << " " << std::setprecision(44) << w;
    std::cout << "\n";
}

void TryPlatform(int np1)
{
    try {

        int n = np1 - 1;
        CUresult err;

        cuInit(0);

        int num = 0;
        err = cuDeviceGetCount(&num);
        CHECK(err);

        for (int d = 0; d < num; ++d)
        {
            CUdevice device;
            err = cuDeviceGet(&device, d);
            CHECK(err);

            CUcontext context;
            err = cuCtxCreate(&context, 0, device);
            CHECK(err);

            // create the program
            CUmodule program;
            err = cuModuleLoad(&program, "pi-driver-kernels.ptx");
            CHECK(err);

            struct _timeb  t1;
            struct _timeb  t2;
            std::cout << "Starting tests...\n";
            _ftime_s(&t1);

            // for (int c = 0; c < 7; ++c)
            {
                CUdeviceptr da;
                err = cuMemAlloc(&da, sizeof(float)*np1);
                CHECK(err);

                {
                    size_t tile[3];
                    size_t tiles[3];
                    int max_dimensionality = 3;

                    linear_to_tile(np1, max_dimensionality, tile, tiles);

                    {
                        // execute the kernel
                        CUfunction kernel_init;
                        err = cuModuleGetFunction(&kernel_init, program, "_Z4initIfEvPT_i");
                        CHECK(err);

                        void* args[] = { &da, &n };
                        err = cuLaunchKernel(kernel_init,
                            tiles[0], tiles[1], tiles[2], tile[0], tile[1], tile[2],
                            0, 0, args, 0);
                        CHECK(err);
                    }

                    {
                        // execute the kernel
                        CUfunction kernel_map;
                        err = cuModuleGetFunction(&kernel_map, program, "_Z3mapIfEvPT_i");
                        CHECK(err);

                        void* args[] = { &da, &n };
                        err = cuLaunchKernel(kernel_map,
                            tiles[0], tiles[1], tiles[2], tile[0], tile[1], tile[2],
                            0, 0, args, 0);
                        CHECK(err);
                    }
                }

                // Reduce.
                int levels = flog2(np1);
                for (int level = 0; level < levels; ++level)
                {
                    int step = pow2(level+1);
                    int threads = np1 / step;

                    size_t tile[3];
                    size_t tiles[3];
                    int max_dimensionality = 3;

                    linear_to_tile(threads, max_dimensionality, tile, tiles);

                    // execute the kernel
                    CUfunction kernel_reduce;
                    err = cuModuleGetFunction(&kernel_reduce, program, "_Z15kernel_reduce_aIfEvPT_ii");
                    CHECK(err);

                    void* args[] = { &da, &n, &step };
                    err = cuLaunchKernel(kernel_reduce, tiles[0], tiles[1], tiles[2], tile[0], tile[1], tile[2], 0, 0, args, 0);
                    CHECK(err);
                }

                float a[100];
                int sss = 1;

                // read output array
                err = cuMemcpyDtoH(a, da, sss*sizeof(float));
                CHECK(err);
                err = cuMemFree(da);
                CHECK(err);

                float w = *a * 4.0 / 3 / n;

                _ftime(&t2);
                std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
                std::cout << n << " " << std::setprecision(44) << w;
                std::cout << "\n";

                np1 *= 2;
                n = np1 - 1;
            }
        }
    }
    catch(...)
    {
        std::cout << "Whoops!\n";
    }
}

int main()
{
    int np1 = pow2(24);
    int n = np1 - 1;

    printf("np1 = %d\n", np1);

    CpuPi(np1);

    // Only one platform -- CUDA!
    TryPlatform(np1);

    return 0;
}

Figure 5(b). Kernel for the CUDA Driver API solution.

__device__ int t2l()
{
    int i = (threadIdx.x
         + blockDim.x * blockIdx.x
         + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
         + blockDim.x * gridDim.x * threadIdx.y);
    return i;
}

template<class T>
__global__ void init(T * v, int n)
{
    int i = t2l();

    if (i > n) // we want v[n] set!
        return;

    if (i < 0)
        return;

    T w = 1.0 / n;
    v[i] = i * w;
}

template<class T>
T func(T * a, int i, int n)
{
    T vv = a[i];
    T w = 1.0L / (1 + vv * vv);
    if (i == 0)
        ;
    else if (i == n)
        ;
    else if (i % 2 == 0)
    {
        w = w * 2;
    }
    else
    {
        w = w * 4;
    }
    return w;
}

template<class T>
__global__ void map(T * v, int n)
{
    int i = t2l();

    if (i > n) // we want v[n] set!
        return;

    if (i < 0)
        return;

    T vv = v[i];

    T w = 1.0 / (1 + vv * vv);
    if (i == 0)
        ;
    else if (i == n)
        ;
    else if (i % 2 == 0)
    {
        w = w * 2;
    }
    else
    {
        w = w * 4;
    }
    v[i] = w;
}

template<class T>
__global__ void kernel_reduce_a(T * a, int n, int s)
{
    int i = t2l();
    i *= s;

    if (i >= n)
        return;

    if (i < 0)
        return;

    a[i] += a[i + s/2];
}

void fun1()
{
    float * a;
    int n;
    init<float><<<1,1>>>(a, n);
    map<float><<<1,1>>>(a, n);
    kernel_reduce_a<float><<<1,1>>>(a, n, n);
}

Figure 6. Implementation of Composite Simpson's Rule for π using OpenCL (and a sequential CPU implementation).

#include <stdio.h>
#include <string.h>
#include <cl/cl.h>
#include <stdlib.h>
#include <sys/timeb.h>
#include <time.h>
#include <iostream>
#include <iomanip>

char * kernels = "size_t t2l()\
{\
    size_t gsx = get_global_size(0);\
    size_t gsy = get_global_size(1);\
    size_t gsz = get_global_size(2);\
    size_t i = \
        get_global_id(0)\
        + get_global_id(1) * gsx \
        + get_global_id(2) * gsx * gsy\
        ;\
    return i;\
}\
\
__kernel void init(__global float * v, int n)\
{\
    size_t i = t2l();\
    if (i > n)\
        return;\
        \
    if (i < 0)\
        return;\
        \
    float w = 1.0 / n;\
    v[i] = i * w;\
}\
\
__kernel void map(__global float * v, int n)\
{\
    size_t i = t2l();\
         \
    if (i > n)\
        return;\
        \
    if (i < 0)\
        return;\
        \
    float vv = v[i];\
    \
    float w = 1.0 / (1 + vv * vv);\
    if (i == 0)\
        ;\
    else if (i == n)\
        ;\
    else if (i % 2 == 0)\
    {\
        w = w * 2;\
    }\
    else\
    {\
        w = w * 4;\
    }\
    v[i] = w;\
}\
\
__kernel void kernel_reduce_a(__global float * a, int n, int s)\
{\
    size_t i = t2l();\
    \
    i = i * s;\
         \
    if (i >= n)\
        return;\
        \
    if (i < 0)\
        return;\
        \
    a[i] = a[i] + a[i + s/2];\
    if (i == 0)\
       a[1] = s/2;\
}\
";


void get_device_info(cl_device_id device_id, cl_device_info device_info, std::string* value, cl_int * err)
{
    size_t size = 0;

    //  Get all params for the given platform id, first query their size, then get the actual data
    *err = clGetDeviceInfo(device_id, device_info, 0, NULL, &size);
    value->resize(size);
    *err = clGetDeviceInfo(device_id, device_info, size, &((*value)[0]), NULL);
}


void get_platform_info(cl_platform_id platform_id, cl_platform_info platform_info, std::string* value, cl_int * err)
{
    ::size_t size = 0;

    //  Get all params for the given platform id, first query their size, then get the actual data
    *err = clGetPlatformInfo(platform_id, platform_info, 0, NULL, &size);
    value->resize(size);
    *err = clGetPlatformInfo(platform_id, platform_info, size, &((*value)[0]), NULL);
}


cl_program load_binary(cl_context context, cl_platform_id platform, cl_device_id device, char * file_name, cl_int * err)
{
    size_t len;

    // for now, compile inlined source.
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&kernels, 0, err);
    char **binaries = 0;
    if (*err) 
    {
        return 0;
    }
    *err = clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);
    //char log[5000];
    //cl_int err2 = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
    //printf("log = %s\n", log);
    return program;
}

int the_blocksize = 256;

void l2t(int size, int max_dimensionality, size_t * tile_size, size_t * tiles)
{
    for (int j = 0; j < max_dimensionality; ++j)
        tiles[j] = 1;
    int max_threads[] = { the_blocksize, 64, 64};
    int max_blocks[] = { 65535, 65535, 65535};
    for (int j = 0; j < max_dimensionality; ++j)
        tile_size[j] = 1;

    int b = size / (max_threads[0] * max_blocks[0]);
    if (b == 0)
    {
        int b = size / max_threads[0];
        if (size % max_threads[0] != 0)
            b++;

        if (b == 1)
            max_threads[0] = size;

        // done. return the result.
        tiles[0] = b;
        tile_size[0] = max_threads[0];

        // OpenCL uses multiples of tile_size.
        tiles[0] *= tile_size[0];
        return;

    }

    int sqrt_size = sqrt((float)size / max_threads[0]);
    sqrt_size++;

    int b2 = sqrt_size / max_blocks[1];
    if (b2 == 0)
    {
        int b = sqrt_size;

        // done. return the result.
        tiles[0] = tiles[1] = b;
        tile_size[0] = max_threads[0];

        // OpenCL uses multiples of tile_size.
        tiles[0] *= tile_size[0];
//        tiles[1] *= tile_size[1];
        return;
    }
    throw;
}

void CpuInit(float * v, int n)
{
    float w = 1.0 / n;
    for (int i = 0; i <= n; ++i)
        v[i] = i * w;
}

void CpuMap(float * v, int n)
{
    for (int i = 0; i <= n; ++i)
    {
        float vv = v[i];
    
        float w = 1.0 / (1 + vv * vv);
        if (i == 0)
            ;
        else if (i == n)
            ;
        else if (i % 2 == 0)
        {
            w = w * 2;
        }
        else
        {
            w = w * 4;
        }
        v[i] = w;
    }
}


void cpu_reduce(float * a, int n)
{
    // note: use double here because of accumulation of truncation error due to adding small numbers to a large number.
    double total = 0;
    for (int i = 0; i <= n; ++i)
        total += a[i];
    a[0] = total;
}


// From http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
int flog2(int v)
{
    int x = 0;
    while ((v = (v >> 1)) != 0)
    {
        x++;
    }
    return (int)x;
}

// Compute the 2 ** v.
int pow2(int v)
{
    int value = 1;
    for ( ; v > 0; v--)
        value <<= 1;
    return value;
}

void CHECK(cl_int x)
{
    if (x != 0)
    {
        std::cout << "Error " << x << "\n";
        throw new std::string("Error " + x);
    }
}

void CpuPi(int np1)
{
    int n = np1 - 1;
    struct _timeb  t1;
    struct _timeb  t2;
    std::cout << "Starting tests...\n";
    _ftime_s(&t1);
    float * a = (float*)malloc(np1 * sizeof(float));
    CpuInit(a, n);
    CpuMap(a, n);
    cpu_reduce(a, n);
    float w = *a * 4.0 / 3 / n;
    free(a);
    _ftime(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";

    std::cout << n << " " << std::setprecision(44) << w;
    std::cout << "\n";
}

void TryPlatform(cl_platform_id platform, int np1)
{
    try {

        int n = np1 - 1;

        // Choose the appropriate platform for this linked DLL.
        cl_device_id dev_id[10];
        cl_uint num;
        cl_int err;
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 10, dev_id, &num);
        CHECK(err);

        printf("devices = %d\n", num);

        for (int d = 0; d < num; ++d)
        {
            char tbuf[500];
            size_t sz;

            std::cout << "            Device [" << d << "]" << std::endl;
            std::cout << "                type                          = ";
            {
                cl_device_type type;
                err = clGetDeviceInfo(dev_id[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
                CHECK(err);

                if (type & CL_DEVICE_TYPE_DEFAULT       ) std::cout << "CL_DEVICE_TYPE_DEFAULT "    ;
                if (type & CL_DEVICE_TYPE_CPU           ) std::cout << "CL_DEVICE_TYPE_CPU "        ;
                if (type & CL_DEVICE_TYPE_GPU           ) std::cout << "CL_DEVICE_TYPE_GPU "        ;
                if (type & CL_DEVICE_TYPE_ACCELERATOR   ) std::cout << "CL_DEVICE_TYPE_ACCELERATOR ";

                std::cout << std::endl;
            }

            err = clGetDeviceInfo(dev_id[d], CL_DEVICE_NAME, sizeof(tbuf), tbuf, NULL);
            CHECK(err);
            std::cout << "                name                          = " << tbuf << std::endl;

            // Choose device.
            cl_device_id device = dev_id[d];

            // create the OpenCL context on a GPU device
            cl_context_properties props[4];
            props[0] = CL_CONTEXT_PLATFORM;
            props[1] = (cl_context_properties)platform;
            props[2] = 0;
            cl_context context = clCreateContext(props, 1, &device, NULL, NULL, &err);
            CHECK(err);
    
            // create the program
            cl_program program = load_binary(context, platform, device, "pi-opencl-kernels.o", &err);
            CHECK(err);

            struct _timeb  t1;
            struct _timeb  t2;
            std::cout << "Starting tests...\n";
            _ftime_s(&t1);

            // for (int c = 0; c < 7; ++c)
            {
                cl_int r1;
                cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float)*np1, NULL, &r1);
                CHECK(r1);

                size_t asize = sizeof(float);
                float * a = (float*) malloc(asize);
                {
                    size_t tile[3];
                    size_t tiles[3];
                    int max_dimensionality = 3;

                    l2t(np1, max_dimensionality, tile, tiles);

                    {
                        // execute the kernel
                        cl_kernel kernel_init = clCreateKernel(program, "init", &err);
                        CHECK(err);
                        err = clSetKernelArg(kernel_init, 0, sizeof(cl_mem), (void *) &da);
                        CHECK(err);
                        err = clSetKernelArg(kernel_init, 1, sizeof(int), (void *)&n);
                        CHECK(err);
                        // create a command-queue
                        cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
                        CHECK(err);
                        cl_event my_event;
                        err = clEnqueueNDRangeKernel(cmd_queue, kernel_init, max_dimensionality, NULL, tiles, tile, 0, NULL, &my_event);
                        CHECK(err);
                        err = clWaitForEvents(1, &my_event);
                        CHECK(err);
                        err = clReleaseCommandQueue(cmd_queue);
                        CHECK(err);
                        err = clReleaseKernel(kernel_init);
                        CHECK(err);
                    }

                    {
                        // execute the kernel
                        cl_kernel kernel_map = clCreateKernel(program, "map", &err);
                        CHECK(err);
                        err = clSetKernelArg(kernel_map, 0, sizeof(cl_mem), (void *) &da);
                        CHECK(err);
                        err = clSetKernelArg(kernel_map, 1, sizeof(int), (void *)&n);
                        CHECK(err);
                        cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
                        CHECK(err);
                        cl_event my_event;
                        err = clEnqueueNDRangeKernel(cmd_queue, kernel_map, max_dimensionality, NULL, tiles, tile, 0, NULL, &my_event);
                        CHECK(err);
                        err = clWaitForEvents(1, &my_event);
                        CHECK(err);
                        err = clReleaseCommandQueue(cmd_queue);
                        CHECK(err);
                        err = clReleaseKernel(kernel_map);
                        CHECK(err);
                    }
                }

                // Reduce.
                int levels = flog2(np1);
                for (int level = 0; level < levels; ++level)
                {
                    int step = pow2(level+1);
                    int threads = np1 / step;

                    size_t tile[3];
                    size_t tiles[3];
                    int max_dimensionality = 3;

                    l2t(threads, max_dimensionality, tile, tiles);

                    // execute the kernel
                    cl_kernel kernel_reduce = clCreateKernel(program, "kernel_reduce_a", &err);
                    CHECK(err);
                    err = clSetKernelArg(kernel_reduce, 0, sizeof(cl_mem), (void *) &da);
                    CHECK(err);
                    err = clSetKernelArg(kernel_reduce, 1, sizeof(int), (void *)&n);
                    CHECK(err);
                    err = clSetKernelArg(kernel_reduce, 2, sizeof(int), (void *)&step);
                    CHECK(err);
                    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
                    CHECK(err);
                    cl_event my_event;
                    err = clEnqueueNDRangeKernel(cmd_queue, kernel_reduce, max_dimensionality, NULL, tiles, tile, 0, NULL, &my_event);
                    CHECK(err);
                    err = clWaitForEvents(1, &my_event);
                    CHECK(err);
                    err = clReleaseCommandQueue(cmd_queue);
                    CHECK(err);
                    err = clReleaseKernel(kernel_reduce);
                    CHECK(err);
                }

                // read output array
                cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
                CHECK(err);
                err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, sizeof(float), a, 0, NULL, NULL);
                CHECK(err);
                err = clReleaseCommandQueue(cmd_queue);
                CHECK(err);
                err = clReleaseMemObject(da);
                CHECK(err);

                float a1 = *a;
                float a2 = a1 * 4.0;
                float a3 = a2 / 3;
                float a4 = a3 / n;
                float w = ((*a * 4.0) / 3) / n;

                _ftime(&t2);
                std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
                std::cout << n << " " << std::setprecision(44) << w;
                std::cout << "\n";

            }
            err = clReleaseContext(context);
            CHECK(err);
            err = clReleaseProgram(program);
            CHECK(err);
        }
    }
    catch(...)
    {
        std::cout << "Whoops!\n";
    }
}

int main(int argc, char *argv[])
{
    // input block size
    argc--; argv++;
    if (argc)
    {
        the_blocksize = atoi(*argv);
    }

//    int np1 = pow2(24);
    int np1 = pow2(24);
    int n = np1 - 1;

    printf("np1 = %d\n", np1);

    CpuPi(np1);

    cl_int err = NULL;
    cl_platform_id plat_id[20];
    cl_uint num;

    err = clGetPlatformIDs(20, plat_id, &num);

    if (err != NULL)
        return err;

    printf("Number of platforms = %d\n", num);

    cl_uint i = 0;
    for (i = 0; i < num; ++i)
    {
        char buf[500];
        size_t sz;
        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_PROFILE, 500, buf, &sz);
        if (err != NULL)
            return err;
        printf("Platform profile: %s\n", buf);
        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_VERSION, 500, buf, &sz);
        if (err != NULL)
            return err;
        printf("Platform version: %s\n", buf);
        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_NAME, 500, buf, &sz);
        if (err != NULL)
            return err;
        printf("Platform name: %s\n", buf);
        char vendor[500];
        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_VENDOR, 500, vendor, &sz);
        if (err != NULL)
            return err;
        printf("Platform vendor: %s\n", vendor);
        err = clGetPlatformInfo(plat_id[i], CL_PLATFORM_EXTENSIONS, 500, buf, &sz);
        if (err != NULL)
            return err;
        printf("Platform extensions: %s\n", buf);

        TryPlatform(plat_id[i], np1);
    }
    return 0;
}

Discussion

Three parallel solutions for estimating the value of π, along with a sequential solution for comparison, were implemented.  The programs were tested on an NVIDIA GeForce GTX 470, an ATI Radeon HD 6450, and an Intel Q6600 @ 2.51 Ghz (overclocked), 4 G RAM, run on the Windows 7 64-bit OS.  All solutions produced the same results, but the capabilites of these processors couldn't be more different.  The NVIDIA GPU processor has 448 cores.  Threads are executed in groups of 32 called a warp.  The AMD GPU processor has 160 cores.  Threads are executed in groups called wavefronts.  It also uses Very Long Instruction Word (VLIW) instructions to simplify the hardware and decrease instruction latency.  The Intel CPU processor has 4 independent cores.  In this experiment, the CPU is used for sequential computation and OpenCL through the AMD AMP SDK.  For the purpose of the comparison, the block (work group) size was the same for the CUDA and OpenCL solutions. With the given hardware and software (problem size n = 224 – 1, programs built as 32-bit Windows applications in Release configuration), the runtimes are show below (Figure 7). 

Notes: (1) The CPU could run OpenCL through the AMD APP SDK v2.  The Intel OpenCL SDK does not work with this CPU because it does not implement SSE4. (2) The ATI and NVIDIA graphics cards could not co-exist properly because the motherboard does not implement hybid SLI/Crossfire multi-GPU.  In order to get it to work, one board would have to be disabled or removed. Newer boards that contain the Lucid Hydra bypass this problem.

Figure 7. Runtime of the estimation of π for various platforms and implementations

System/Program Time (s)
Sequential implementation, Intel Q6600 @ 2.51 Ghz, 4 G RAM 0.3
CUDA Runtime API with NVIDIA GTX 470 0.08
CUDA Driver API with NVIDIA GTX 470 0.016
OpenCL with NVIDIA GTX 470 0.027
OpenCL with ATI Radeon HD 6450 2.15
OpenCL with Intel Q6600 @ 2.51 Ghz, 4 G RAM 1.05

Not to much of a surprise, the CUDA and OpenCL implementations on the NVIDIA GPU run faster that the sequential implementation on the CPU–for the given problem size.  But, what is a surprise is that the OpenCL implementation on the ATI device is much slower than the CPU sequential implementation.  A larger problem size for n+1 could reverse the standing, but OpenCL runtime errors appear for larger problem sizes. The runtimes for the three parallel solutions vary quite a bit, even for the same device (i.e., NVIDIA GTX 470), with the CUDA runtime API more than twice as slow as the CUDA Driver API and OpenCL solutions.

On the surface, the languages for specifying kernels appear similar.  However, if features and capabilites are compared item by item, the differences are pronounce (Figure 8).

The language for the CUDA framework is a C++ dialect.  It is translated by a stand-alone compiler (nvcc, developed by NVIDIA) into standard C++.  The C++ code is then compiled into an executable using the native C++ compiler (e.g., gcc for Linux, and MSVC++ for Windows).  An advantage of this method is that kernel code can intermingle with host code.  The kernel code is  translated into PTX assembly code, the lingua franca of NVIDIA devices.  PTX code is used directly by the CUDA Driver API.  Kernel calls can be specified using a chevron syntax (e.g., lines 208 of Figure 4), which is translated into CUDA Runtime API calls.  (There is no provision to translate the code into CUDA Driver API calls.)  The chevron syntax reduces the verboseness of kernel calls that one would find with the CUDA Driver framework (lines 191 through 198 of Figure 5(a)) or the OpenCL framework (lines 200 through 216 of Figure 6(a)).  Kernels can be extended in a number of ways, used in templates (e.g., line 65 of Figure 4), use classes, call recursive functions, etc.  However, kernel code in C++ classes is only partially supported.  Classes can contain device code, declared __device__, that kernels can call, but kernels themselves cannot be class members.

The OpenCL language is similar to the CUDA C++, but very restricted.  OpenCL is based on the C99 standard, not C++.  Developers normally write the kernels as a "char *" string.  This code is the compiled using calls to clCreateProgramWithSource and clBuildProgram.  Multiline string literals to specify kernel code is a hassle in C++.  The code can be written as adjacent string literals (each in double quotes), one for each line of text for the kernel.  Alternatively, a trailing backslash (\) at the end of each line can be employed.  Either way, it is quite annoying.  

Unfortunately, there is no stand-alone compiler for OpenCL.  A deverloper can certain write one (see here), but that introduces new problems where you must remember to use the same compiler options when loading the binary object code (see line 130 of Figure 6(b)).  The OpenCL compilation (i.e., clCreateProgramWithSource and clBuildProgram) results in binary code that can be saved to a file, then reloaded for execution later, using clGetProgramInfo().  For NVIDIA, the binary code is PTX, which is essentially text.  For ATI, the code is an ELF binary file.

There are many differences between OpenCL, CUDA Runtime API, and CUDA Driver API.  Item by item, the functions often do not correspond at all to each other.  However, groups of API calls have similar functionality.  The following table is a summary of the differences, similarities, and equivalences (Figure 8).

Figure 8. Comparison of features between OpenCL, CUDA Runtime API, and CUDA Driver API

Term feature OpenCL CUDA Runtime CUDA Driver
Processor hardware Device GPU
Processor hardware Compute unit Multiprocessor
Processor hardware Processing element Scalar core
Memory hardware Global memory Global memory
Memory hardware Local memory Shared memory
Memory hardware Private memory Local memory
Sofware Kernel (or program) Kernel
Sofware Work item Thread
Sofware Work group Block
Sofware NDRange Grid
Language Features OpenCL CUDA Runtime CUDA Driver
Language base C99 C++03
Format of kernel string, e.g., a literal or constructed by reading a file C++ function with __kernel__ qualifier
Qualifier and type of kernel __kernel void __kernel__ void
Qualifier of non-kernel device functions Qualifier cannot appear __device__
Qualifier of host functions Host functions cannot appear in OpenCL code __host__
Qualifiers for kernel formal parameters Address qualifiers must be used (global, local, constant) Address qualifiers CANNOT be used
Global memory variables __global __device__
Shared memory variables __local __shared__
Constant memory variables __constant __constant__
Private (per task) memory variables __private or unqualified Qualifiers cannot appear
Kernel call syntax API call in host code, via clEnqueueNDRangeKernel() Chevron syntax for launch API call in host code, via cuLaunchGrid(), using PTX assembly code in a string variable
Templates available? No Yes
Device memory type in host code Handle Pointer, but cannot be dereferenced Handle
Device memory type in device code Pointer
Double floating point Yes (with #pragma only) Yes
Thread identification get_local_id threadIdx
Block identification get_group_id blockIdx
Block dimension get_local_size blockDim
Grid dimension get_num_groups gridDim
atomic add/sub/… AMD, yes with #pragma; No, NVIDIA Yes
Thread synchronization within a block barrier() __syncthreads()
Thread synchronization within a block No equivalence with CUDA __threadfence()
Thread synchronization within a block mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE) __threadfence_block()
Thread synchronization within a block read_mem_fence() No equivalence with OpenCL
Thread synchronization within a block write_mem_fence() No equivalence with OpenCL
API Features OpenCL CUDA Runtime CUDA Driver
Initialization     cuInit()
Kernel creation clGetPlatformIDs(); clGetDeviceIDs(); clCreateProgramWithSource(); clBuildProgram(); clCreateKernel(); (Generated by CUDA compiler: cudaRegisterFatBinary(); cudaRegisterFunction(); cuDeviceGet(); cuCtxCreate(); cuModuleLoad(); cuModuleGetFunction();
Global memory allocation clCreateBuffer() cudaMalloc() cuMemAlloc()
Global memory deallocation clReleaseMemObj() cudaFree() cuMemFree()
CPU to GPU memory copy clEnqueueWriteBuffer() cudaMemcpy cuMemcpyHtoD()
Kernel call from host clSetKernelArg(); clCreateCommandQueue(); clEnqueueNDRangeKernel(); Chevron syntax for launch, generates cudaConfigureCall(); cudaSetupArgument(); cudaLaunch(); cuLaunchKernel();
GPU to CPU memory copy clEnqueueReadBuffer() cudaMemcpy cuMemcpyDtoH()
Device information clGetContextInfo(); clGetDeviceInfo(); cudaGetDeviceProperties() cuDeviceGetProperties()
General capabilities and tools OpenCL CUDA Runtime CUDA Driver
Official, stand-alone compiler No (but can with clcc) Yes Yes
JIT compilation Yes, from OpenCL C Yes, from PTX Yes, from PTX
Current version 1.1 AMD; 1.0 NVIDIA (still on 1.0 after one year since 1.1 spec has been out); 1.1 Intel 4.0
 

Conclusions

OpenCL is a standardized framework for programming heterogenous processors; CUDA is a proprietary framework.  If you want cross-platform support, the choice is clear, use OpenCL.  While some studies show that the OpenCL framework is slower than the CUDA framework [2, 3, 4], this study found that the OpenCL is faster than the CUDA Runtime API.  Further studies should be performed  to determine whether these runtime speeds are consistent.

While CUDA C++ as a language is more advanced than OpenCL, both are quite primitive relative to the proposed C++ Accelerated Massive Parallelism language (C++ AMP) by Microsoft.  In CUDA and OpenCL, kernels must be abstracted into functions.  In C++ AMP, code will look much cleaner because parallel for-loops operate on inlined lambda functions, and the memory model relatively hidden.

The microprocessor industry is at the cusp of a major change.  While discrete GPUs will still exist for several years to come, most manufacturers are releasing integrated CPU/GPU chips.  These processors share a common bus to memory.  So, much of the functionality of CUDA and OpenCL, designed to copy memory from the CPU to/from the GPU, will be obsolete. 

Code

Code for this example is here:

MSVC++ 2010 OpenCL solution

MSVC++ 2010 CUDA Driver API solution

MSVC++ 2010 CUDA Runtime API solution

References

1. Lee, V. W., C. Kim, et al. (2010). Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. Proceedings of the 37th annual international symposium on Computer architecture. Saint-Malo, France, ACM: 451-460.

2. Karimi, K., N. G. Dickson, et al. (2010). "A performance comparison of CUDA and OpenCL." Arxiv preprint arXiv:1005.2581.

3. Danalis, A., G. Marin, et al. (2010). The Scalable Heterogeneous Computing (SHOC) benchmark suite. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. Pittsburgh, Pennsylvania, ACM: 63-74.

4. Komatsu, K., K. Sato, et al. (2010). Evaluating performance and portability of OpenCL programs.