In CUDA, OpenCL, and C++ AMP, a group is a collection of threads that execute in parallel; in CUDA it is called a block, in OpenCL a work-group, and in C++ AMP a tile. The purpose of a group is to let the threads within it communicate with each other through synchronization barriers and/or shared memory. The size of a thread group is set by the programmer, but hardware constraints limit the maximum size, typically to 512 or 1024 threads. While programmers usually need to tailor algorithms to be aware of thread groups, there are a few tricks that can make programming easier.
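As an illustration of the two mechanisms mentioned above, the following CUDA sketch shows threads in one block cooperating through shared memory and a barrier to sum the block's input values. The kernel name `blockSum` and the launch parameters are hypothetical, assuming a power-of-two block size:

```cuda
// Hypothetical example: a block-level sum reduction. All threads in one
// block cooperate via shared memory and __syncthreads() barriers.
__global__ void blockSum(const float *in, float *out)
{
    extern __shared__ float partial[];      // shared by every thread in this block

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = in[i];
    __syncthreads();                        // make all loads visible block-wide

    // Tree reduction: the active half of the threads adds in each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();                    // wait before reading updated values
    }

    if (tid == 0)
        out[blockIdx.x] = partial[0];       // one result per block
}

// Example launch with 256 threads per block (within the 512/1024 limit):
// blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_out);
```

The barrier is essential: without `__syncthreads()`, a thread could read a neighbor's partial sum before that neighbor has written it. OpenCL (`barrier(CLK_LOCAL_MEM_FENCE)` with `__local` memory) and C++ AMP (`tile_barrier` with `tile_static` memory) express the same pattern.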
Obtaining information about the underlying design of a GPGPU programming environment and its hardware can be difficult. Companies do not publish design details because they do not want you or their competitors to copy the technology. But sometimes you need to know unpublished details of a technology in order to use it effectively. If the vendors won't tell you how the technology works, the only recourse is experimentation [1, 2]. What is the performance of OpenCL, CUDA, and C++ AMP? What can we learn from this information?