Performance comparison of CUDA, OpenCL, and C++ AMP

Trying to get information of the underlying design of a GPGPU programming language environment and hardware can be difficult.  Companies will not publish design information because they do not want you or other companies to copy the technology.  But, sometimes you need to know details of a technology that are just not published in order to use it effectively.  If they won’t tell you how the technology works, the only recourse to gain an understanding is experimentation [1, 2].  What is the performance of OpenCL, CUDA, and C++ AMP?  What can we learn from this information?


Performance of OpenCL, CUDA, and C++ AMP is not easy to compare because there is no one-to-one correspondence between the APIs.  For example, there is no direct analog of the OpenCL function clSetKernelArg within CUDA.  In fact, clSetKernelArg can be called once and the kernel called multiple times.  In CUDA, this is impossible because the kernel argument is set when the kernel is called: a kernel call is translated into internal functions cudaSetupArgument and cudaLaunch.

Moreover, each GPGPU programming language environment performs initialization when a function of the API called, which is order dependent. For example, the first call to the CUDA API could be a cudaMalloc or cudaSetDevice depending on whether the programmer accepts the default device.  However, when either function is called, a long wait follows because CUDA is initializing.  But it is not hopeless if the developer implements the solution carefully.

As it turns out, a block of code in a program written in one language can usually be translated into equivalent code in another GPGPU programming language.  For example, code to find and select a device in CUDA has an equivalent block of code in OpenCL.  These blocks of code are called phases.  What are the phases of a program?  This depends, but one possible definition follows:

  1. Setup resources on the GPU for problem solving, e.g., pick the GPU, allocate memory on the GPU, and compile kernel code for the GPU.
  2. Copy data structures from the CPU to the GPU.
  3. Call kernel: Run the kernel code to solve the problem on the GPU.
  4. Copy data structures from the GPU to the CPU.
  5. Release resources on the GPU.
The functions of each language can be grouped into these phases:
Setup cudaGetDeviceCount
allocate vector<
allocate vector::iterator
allocate array<>
allocate extent<>
allocate grid<>
Copy to GPU cudaMemcpy clEnqueueWriteBuffer copy
Call kernel kernel<<<…>>>(…)
Copy to CPU cudaMemcpy clEnqueueReadBuffer copy
Release cudaFree clReleaseMemObject
allocate grid<>
allocate extent<>
allocate array<>
deallocate accelerator_view
deallocate vector<accelerator>
deallocate vector::iterator

Gathering run-time data

How do we collect the run-times for each phase?  There are a few ways to do this:

  • Use a black-box test to gather the run-time for the process.  To do this, run a program within the Bourne shell using the command “time” to display the run-time.  Unfortunately, this will not show the run-time for each phase.
  • Use a white-box test using a profiler to gather the run-time for phases.  But, the programmer must partition the phases into functions because the profiler is phase unaware.
  • Use a white-box test, implemented with a “self-instrumented” program (i.e., a program in which the programmer inserted calls to a clock function) to gather the run-time for the phases.

To make a fair comparison between environments, the problem and the solution must be the same, using the same algorithm, grid/tile size, memory in the GPU, etc.

Let’s take a simple problem:

input: A = array[0..n] of 1’s; integer u; integer i;

output: O = array[0..n] of integers, where O[k] = u * i when(k % u) = 0, otherwise O[k] = 0.

In other words, sum the number of 1’s in an array of n integers (all 1’s), i times, with a block size of u, and place the sum at the beginning of the block.  For example, given an array of 16 integers (n = 16), u = 4, and i = 4, O[0] = O[4] = O[8] = O[12] = 16, and O[1] = O[2] = O[3] = O[5] = … = 0 (for all other indices).  Other parameters to tweak the problem are: x iterations of solving the entire problem; stride s distance between elements summed.

The solution of this problem is here.  The defaults for the problem are: “-u 4 -i 1 -x 1000 -u 8388608 -s 1”.  The grid size is 256 threads.

Test environment

This program was executed on two machines:

  • Asus P5N-D motherboard, Intel Q6600 @ 2.51 GHz (overclocked), 4 GB DDR2-1066 @ 838 MHz, Windows 7 64-bit OS, NVIDIA GeForce GTX 470, an ATI Radeon HD 6450. A list of drivers and versions is here.
  • Gigabyte GA-A75-UD4H, AMD A8-3850 @ 2.9 GHz, 8 GB DDR3-1600 @ 800 MHz, Windows 8 64-bit OS (Developer Preview).


Setup time before copy to GPU

The time associated to compile an OpenCL program accounts for the large setup time for GPUs in the OpenCL environment.  Other environments do not have this problem because the code is pre-compiled to an intermediate language.

38.7 ± 0.4 424 ± 12 76.8 ± 1.9

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)

Copying data between GPU and CPU

The run-time for copying data from the CPU to GPU are similar for CUDA, OpenCL, and C++ AMP.  However, copying data from the GPU to CPU seems to be different, and slower on all platforms, with C++ AMP much slower.

To 47.1 ± 1.4 58.2 ± 5.7 64.7 ± 1.1
From 72.5 ± 1.9 74.2 ± 2.7 121 ± 2

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)

Kernel Overhead

On both machines, the run-time for a loop of a varying number of kernel calls in CUDA, OpenCL, and C++ AMP fairly similar, with C++ AMP slightly slower than the other two.  However, for small problems, C++ AMP shows an overhead around 70 ms associated with the first call to the kernel, whereas the overhead is negligible for the other two environments.

-i 1 -x 1 0.936 ± 0.005 0.944 ± 0.017 71.6 ± 3.7
-i 1 -x 10 6.34 ± 0.01 6.46 ± 0.01 76.35 ± 0.61
-i 1 -x 100 60.5 ± 0.1 60.5 ± 0.1 145 ± 1
-i 1 -x 1000 602 ± 0.1 601 ± 0.3 713 ± 1
-i 1 -x 10000 6022 ± 0.4 6005 ± 1 6378 ± 3

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)

Object code overhead

On both machines, the run-time for one kernel call containing a loop of a varying number of iterations in CUDA, OpenCL, and C++ AMP, are again fairly similar.  CUDA and OpenCL perform the kernel call with a loop of 10000 iterations around 2.2 s, while C++ AMP is slightly slower than the other two at 3.7 s.  These data suggest that performance of the object code for C++ AMP is not as good as for OpenCL or CUDA.

-i 1 -x 1 0.918 ± 0.010 0.946 ± 0.010 71.6 ± 3.7
-i 10 -x 10 2.87 ± 0.01 2.87 ± 0.02 70.7 ± 0.5
-i 100 -x 100 22.5 ± 0.02 22.3 ± 0.03 103 ± 0.3
-i 1000 -x 1000 220 ± 0.1 219 ± 0.1 428 ± 1
-i 10000 -x 10000 2195 ± 0.4 2188 ± 0.1 3671 ± 1

(Run-times on Intel machine, mean ± S.E. of sample for 5 runs.)

Performance comparison between NVIDIA 470 and AMD Llano APU

The relative performance for the GPU’s is NVIDIA 470 > AMD Llano APU >> ATI Radeon HD 6450.  There is no surprise there, considering their cost.

NVIDIA 470 (S) AMD Llano GPU (S) ATI Radeon HD 6450 (S)
1.41 ± 0.01 8.62 ± 0.02 37.1 ± 0.01

(Run-times on Intel and AMD machines.  Program ran with standard parameters, and “-l opencl -q”.  Mean ± S.E. of sample for 5 runs.)

Discussion and conclusion

In order to program efficient code in GPU programming environments, developers need to understand the API. Program design requires information that may not be published, or at best, in developer forums (e.g., and Experimentation helps identify the information needed by the programmer for good design.


1) Wong, H., M. Papadopoulou, et al. (2010). Demystifying GPU microarchitecture through microbenchmarking. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010, IEEE.

2) Zhang, Y., L. Peng, et al. “Architecture Comparisons between Nvidia and ATI GPUs: Computation Parallelism and Data Communications.”