At our company we have a very effective image processing framework built on the pipes and filters design pattern. Our framework, the Capra Image Processing Platform, lets you write your own image processing filters and combine them with built-in real-time filters such as “Video Motion Detection”, “Feature Tracking”, “Object Recognition”, “People Counting”, “Automatic Number Plate Recognition (ANPR)” and many more video analytics algorithms. This way software developers can build custom solutions for medical imaging, video analytics, machine vision, video surveillance and similar applications. Capra uses both task level parallelism (for the pipeline architecture) and data level parallelism (for the image processing algorithms), and it has many installations worldwide owing to its optimized real-time performance.

During the design and implementation period, which took our team years, we always had an eye on GPGPU (General Purpose GPU) programming. We even had several implementations on the CUDA architecture from NVidia, but we were reluctant to adopt it in our framework because it worked only on NVidia hardware while other vendors were working on different architectures and frameworks. Had we adopted CUDA, we would eventually have spent all our time adapting to other vendors’ systems. In the beginning we decided to wait for Larrabee from Intel to see its effect on the market, but with the Larrabee project years behind schedule this wait looked as if it might never end. During this waiting period :) Intel suddenly gave up that work and focused on ArBB (Array Building Blocks) to make use of the technology gained from its acquisition of RapidMind.

When I first analyzed RapidMind’s architecture I was very happy to see that someone had come up with a solution for using both CPU and GPU power. A long time after the acquisition by Intel, the team finally came up with ArBB, whose revolutionary idea is a virtual machine. This abstraction is very important and makes ArBB a significant player in the field. However, it is still unclear where this framework stands. If it supports only IA (Intel Architecture) and the related GPU structures, it will be hard for us to depend on it. Intel claims the opposite and promises hardware independence, but I prefer to wait and see examples. Choosing performance libraries like Intel IPP, TBB and MKL over their competitors in the CPU and multi-core domain was not very problematic for my company, but when it comes to adding GPU power to our framework we need a common ground across vendors.

It is hard for me to say much about ArBB right now, but once it is ready and mature enough, with the concerns above answered, I’ll spend plenty of time evaluating this beautiful framework. From an engineering point of view it’s a marvelous framework, but our first concern in the company is to ensure that we can work with all available GPU structures. Another issue is that we’re a little tired of architectures and frameworks changing all the time. We adjusted our framework to use TBB (Threading Building Blocks) efficiently, and it’s hard for us to switch to a new architecture just because its makers claim it is good; maturity is very important for us. Lastly, if we decide to use ArBB in the future, its most important feature will be the determinism that other GPGPU structures lack. I’m planning to write an in-depth post on ArBB after the official version is released.

Besides its use in the company infrastructure, the same question came up in my PhD thesis. I’m working in the neuroimaging field, looking for ways to store and query its abundant data efficiently. To make something more efficient than the known methods you should either propose new algorithms or implement the solutions on new architectures such as many-core hardware (CPU, GPU etc.). The following study is the result of my research on choosing the foundation for a framework that will use parallel programming models internally.

The neuroimaging field produces some of the most extensive datasets in medical imaging. It is very hard to store this dense data for later queries that compare patients’ functional abilities. Even in compression and data reduction scenarios, spatial indexing is necessary for efficient retrieval and querying. Spatial indexing focuses on the location and neighborhood relations of the data in addition to its volume. As new 3D neuroimaging data is inserted for an individual over successive periods, queries and retrieval through spatial indexing methods get slower. Since the neuroimaging field has not been involved in clinical applications, speed in data storage and retrieval was never an issue. With such an abundance of 3D data it is clear that, if parallelism is not considered, sequential methods will fail over time and so will clinical use.
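
To make the idea of spatial indexing a bit more concrete, here is a minimal Python sketch of a uniform grid index over 3D points with a box query. It is only a simple stand-in for illustration, not an R-Tree and not the structure used in my study; the names `GridIndex`, `query_box` and `linear_scan`, as well as the cell size, are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import product


class GridIndex:
    """Uniform grid over 3D points (hypothetical, illustration only)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        # Maps an integer cell coordinate (i, j, k) to the points stored in it.
        self.cells = defaultdict(list)

    def _cell(self, point):
        # Floor division keeps negative coordinates in the correct cell.
        return tuple(int(c // self.cell_size) for c in point)

    def insert(self, point, payload):
        self.cells[self._cell(point)].append((point, payload))

    def query_box(self, lo, hi):
        """Return payloads of all points inside the axis-aligned box [lo, hi]."""
        lo_cell, hi_cell = self._cell(lo), self._cell(hi)
        ranges = [range(a, b + 1) for a, b in zip(lo_cell, hi_cell)]
        hits = []
        # Only cells overlapping the box are visited, not the whole dataset.
        for cell in product(*ranges):
            for point, payload in self.cells.get(cell, ()):
                if all(l <= c <= h for l, c, h in zip(lo, point, hi)):
                    hits.append(payload)
        return hits


def linear_scan(points, lo, hi):
    """Baseline without an index: test every point against the box."""
    return [i for i, p in enumerate(points)
            if all(l <= c <= h for l, c, h in zip(lo, p, hi))]
```

The point of the sketch is that the indexed query only touches the cells that overlap the query box, while the baseline has to test every stored point. Each cell can also be examined independently of the others, which is what makes this kind of query a candidate for parallel execution.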

The framework used for the parallel implementation of R-Tree-like spatial indexing structures therefore becomes important. A clinical system that has to work on different hardware should be independent of processor architecture. Today the most important high performance computing and parallel programming models for individual machines (standard desktop computers) rely on multi-core CPU programming or GPGPU (General Purpose GPU) programming. Depending on how well an algorithm lends itself to parallelism, one can reach speedup factors in the hundreds on a GPU compared to sequential algorithms on a CPU. The most important parallelism levels today are:

  • Data level parallelism (data / loop),
  • Task level parallelism,
  • Pipeline level parallelism (a variant of task level parallelism, but with higher complexity).
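
The three levels listed above can be sketched in a few lines of Python using only the standard library. This is a toy illustration, not code from Capra: the function names (`brighten`, `data_parallel`, `task_parallel`, `pipeline`) and the pixel values are assumptions invented for the example, and because of CPython's global interpreter lock the threads here show the structure of each level rather than a real speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel):
    """Hypothetical per-pixel operation: add 10 and clamp to 255."""
    return min(pixel + 10, 255)


def data_parallel(pixels, workers=4):
    # Data level: the SAME operation is applied to every element.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten, pixels))


def task_parallel(pixels):
    # Task level: DIFFERENT operations run concurrently on the same input.
    with ThreadPoolExecutor(max_workers=2) as pool:
        lowest = pool.submit(min, pixels)
        highest = pool.submit(max, pixels)
        return lowest.result(), highest.result()


_DONE = object()  # sentinel that tells a stage to shut down


def _stage(func, inbox, outbox):
    # One pipeline stage: read, process, pass on, until the sentinel arrives.
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def pipeline(frames, stages):
    # Pipeline level: stages run concurrently, connected by FIFO queues,
    # so stage 2 can work on frame 1 while stage 1 works on frame 2.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for frame in frames:
        queues[0].put(frame)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The pipeline variant is the one closest to a pipes and filters design: each filter is a stage, and the queues between them are the pipes. It is also the most complex of the three, since it needs explicit coordination (here a sentinel object) to shut the stages down cleanly.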

Parallelism levels other than these are beyond the scope of multi-core and GPGPU programming models. Even though GPGPU speeds up many data level parallel algorithms, this advantage is lost in applications with heavy data transfer or messaging. Whether it is a CPU or a GPU, every processing unit has its own memory (cache) and a shared memory (DRAM). Since moving data between these is expensive, one should avoid complex data structures and messaging in parallel algorithms. As can be seen in Figure 1, a GPU has many ALUs (Arithmetic Logic Units) but comparatively little cache. From these structural differences we can deduce that a parallel algorithm may still perform better on a multi-core CPU when it cannot be parallelized efficiently on a GPU. Studies on making spatial indexing structures work in parallel have sometimes led to similar results, so a hybrid structure is favored over a pure GPGPU model.
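
The trade-off between raw GPU speedup and data transfer can be expressed as a back-of-the-envelope cost model. The Python sketch below is a deliberately crude assumption of mine, not a measurement: it treats transfer time as bytes divided by bandwidth, ignores latency and overlap of transfer with computation, and the function names and all numbers used with it are hypothetical.

```python
def estimated_gpu_seconds(cpu_seconds, speedup, bytes_in, bytes_out, bandwidth):
    """Rough GPU time: kernel time plus host<->device transfer time.

    cpu_seconds -- time of the sequential CPU version
    speedup     -- assumed kernel speedup on the GPU (e.g. 100.0)
    bytes_in    -- bytes copied to the device before the kernel runs
    bytes_out   -- bytes copied back to the host afterwards
    bandwidth   -- assumed transfer bandwidth in bytes per second
    """
    if speedup <= 0 or bandwidth <= 0:
        raise ValueError("speedup and bandwidth must be positive")
    kernel = cpu_seconds / speedup
    transfer = (bytes_in + bytes_out) / bandwidth
    return kernel + transfer


def choose_device(cpu_seconds, speedup, bytes_in, bytes_out, bandwidth):
    """Pick the GPU only if it wins after paying for the transfers."""
    gpu = estimated_gpu_seconds(cpu_seconds, speedup, bytes_in, bytes_out, bandwidth)
    return "gpu" if gpu < cpu_seconds else "cpu"
```

With made-up numbers, a job that takes 1.0 s on the CPU, has a 100x kernel speedup and moves 1 GB each way over a 5 GB/s link comes out at about 0.41 s on the GPU, so the GPU wins. If the same data volume belongs to a job that takes only 0.2 s on the CPU, the 0.4 s of transfer alone already exceeds the CPU time, and the multi-core CPU is the better choice despite the 100x kernel speedup.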

Figure 1: Schematic comparison of CPU and GPU structures. (1)

After NVidia released CUDA™, many software systems upgraded their infrastructure to use this revolutionary GPGPU framework. The biggest problem with this development was the danger of an NVidia monopoly over GPGPU. Years of discussion about “cross platform programming” and “cross CPU programming” were suddenly abandoned, and everyone started adapting their infrastructure to GPGPU through NVidia’s CUDA framework. This was very similar to the DirectX monopoly that was part of Microsoft’s strategy. Back then, many hardware and operating system manufacturers supported OpenGL (Open Graphics Library) against DirectX for the sake of cross platform compatibility. A similar alliance, started by Apple and followed by others such as IBM, AMD and lastly Intel, formed around an open standard, OpenCL (Open Computing Language), to keep up with CUDA. I also believe Intel plans to support OpenCL in Larrabee and ArBB, so that a lot of GPGPU code will be ready to run on Larrabee after its years of delay. A nice discussion on Intel’s OpenCL strategy reveals how they look at this approach.

Hardware manufacturers have started supplying the necessary OpenCL interfaces for their multi-core processors (both CPU and GPU). Owing to this support, an implementation written once with OpenCL will work on both multi-core CPU systems and GPGPU-enabled systems, whereas CUDA works only on NVidia hardware. ATI was one of the first supporters of OpenCL and released its OpenCL libraries after Apple opened up the OpenCL structure. NVidia also released its OpenCL libraries shortly after OpenCL was accepted as a standard, and with this support it rebutted the monopoly claims.

My view on this monopoly issue is fairly pragmatic. NVidia was the first to bring huge power to GPGPU, and even though it had the chance to support only CUDA, it tries to support every new structure by using CUDA in the infrastructure layer. It at least supports newcomers and doesn’t force the community to use only its own products, as Microsoft once did. IBM (Cell processors), AMD and ARM have all released their support libraries, and the last microprocessor company to release OpenCL support (an alpha, in November 2010) was Intel, on its whatif platform.

Performance comparison studies have shown that CUDA performs roughly 10% better than OpenCL on the same NVidia card, a difference that can be neglected given the freedom of working across processors. One should also not forget that this comparison can only be made on NVidia cards with NVidia’s OpenCL libraries, which use CUDA at the back end, so the slight performance difference is easy to understand. One of the most important assets of OpenCL is that an implementation which cannot gain performance on a GPU can easily be run on a multi-core CPU, so the effort spent on a parallel implementation is not lost to the hardware limitations of the GPU (high data transfer times, heavy messaging). NVidia disputes these points and claims that CUDA code can easily be converted to OpenCL code, but in the end this still means extra effort and time. In projects where many man-months have been spent it is not easy to convert from one architecture to another, and project managers usually stick to the first architecture the system was implemented on (“If it works don’t touch!”).

The software architecture and framework that I plan to implement during my PhD study is expected to work on different hardware, such as desktops and laptops, because it will be used in clinical applications. There is no way we can push clinicians to work with specific hardware.

One should see that CUDA is an architecture whereas OpenCL is a framework for hardware abstraction. This is why OpenCL has support from all the available hardware vendors in the many-core field. Yes, CUDA has huge support; yes, you can convert CUDA to OpenCL when you need to; yes, CUDA is slightly more efficient than OpenCL; but in the end it is vendor specific. As a software engineer I would never like to be tied to one hardware architecture unless it gives huge software infrastructure advantages. And as a software company whose ambition is to let its customers exploit all hardware architectures for real-time image processing, we believe we should offer as many hardware options as work efficiently with our real-time signal processing framework. This war is between hardware vendors, and as a software company we must offer the widest coverage we can in both hardware and OS support.

For all the reasons stated above, I have selected OpenCL as the parallel programming framework for my PhD study and for my company’s GPGPU infrastructure.



PS. Larrabee is like a child of mine. I’ve spent lots of time trying to understand it, starting from the days when Larrabee was only a myth about which no official explanation was allowed. I still believe that every CPU core surrounded by lots of GPU cores will change the game, even though Intel is still not clear about it :) If Larrabee lacks OpenCL support I’ll adapt to whatever Intel suggests, but I believe Larrabee, ArBB and OpenCL will be bundled together to take advantage of the existing software framework community.

12 Responses to “GPGPU : OpenCL vs Cuda vs ArBB”

  1. Jeff Finckel says:

    Here are some other nice benchmarks of OpenCL vs CUDA: http://blog.accelereyes.com/blog/2010/05/10/nvidia-fermi-cuda-and-opencl/
