CUDA Parallel Computing

Based on Benny11's thread "Most Efficient CUDA Processing per Dollar!"

Contents:

Post 1
.. 1.1 Introduction
.. 1.2 What is CUDA?
.. 1.3 But isn't parallel computing difficult?
.. 1.4 How much faster is it? The GPU vs. CPU Architecture....
.. 1.5 Before Benchmarking - A little theory
.. 1.6 Benchmarking Methods
...... 1.6.1 Requirements
...... 1.6.2 Method 1
...... 1.6.3 Method 2
...... 1.6.4 Benchmark Submission Post Format
...... 1.6.5 Finding your Driver Version
.. 1.7 Conclusion
.. 1.8 Notes and Improvements to Thread
.. 1.9 References

Post 2
.. 1.1 How results are calculated
.. 1.2 Results

Post 3
.. Example


1.1 Introduction

For the past couple of months I have been trying to find the most cost-effective CUDA solution. Currently I'm looking at creating a small Beowulf cluster for CUDA algorithm processing on Linux, but after looking around the net I could never find a comprehensive benchmark list of different GPUs using CUDA. I'm not sure what GPU to buy, and I don't think many people who are interested in the scene know either. Sure, NVIDIA made a GPU just for CUDA, but who wants to pay $1400 for it? So I thought I would create this thread to help everyone get interested in CUDA and to find some cost-effective CUDA solutions.


1.2 What is CUDA?

NVIDIA CUDA (Compute Unified Device Architecture) technology is the world's only C language environment that enables programmers and developers to write software to solve complex computational problems in a fraction of the time by tapping into the many-core parallel processing power of GPUs. With millions of CUDA-capable GPUs already deployed, thousands of software programmers are already using the free CUDA software tools to accelerate applications, from video and audio encoding to oil and gas exploration, product design, medical imaging, and scientific research. Basically, CUDA is a software and GPU architecture that makes it possible to use the many processor cores (and eventually thousands of cores) in a GPU to perform general-purpose mathematical calculations.


1.3 But isn't parallel computing difficult?

Parallel programming has a reputation for being difficult because it has typically meant making many CPUs work together (as in a cluster). Desktop applications have been slow to take advantage of multi-core CPUs because of the difficulty of splitting a single program into one that works across multiple threads. These difficulties arise from the fact that a CPU is inherently a serial processor, and coordinating multiple CPUs requires complex software to manage them.
CUDA removes much of the burden of manually managing parallelism. A program written in CUDA is essentially a serial program called a kernel; the GPU makes it parallel by launching thousands of instances of that kernel. Since CUDA is an extension of C, it's often trivial to port programs to CUDA. It can be as simple as converting a loop into a CUDA kernel call, as sketched below.
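
To make that concrete, here is a minimal sketch (my own illustration, not from the original thread; all names are invented) of a serial C loop and its CUDA equivalent:

  // Serial C version: one loop, n iterations, one element per step.
  void add_serial(const float *a, const float *b, float *c, int n)
  {
      for (int i = 0; i < n; ++i)
          c[i] = a[i] + b[i];
  }

  // CUDA version: the loop body becomes a kernel, and the GPU launches
  // one thread per element instead of iterating.
  __global__ void add_kernel(const float *a, const float *b, float *c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                   // guard: the launch rounds up to whole blocks
          c[i] = a[i] + b[i];
  }

  // Launch enough 256-thread blocks to cover n elements:
  // add_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

The loop structure disappears entirely; the hardware supplies the iteration by scheduling threads.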

The key features of CUDA are:

  • Shared memory: every multiprocessor in a CUDA-capable GPU contains 16 KB of shared memory, which lets threads communicate with each other and share data. Shared memory can be thought of as a software-managed cache, and it provides large speedups by conserving bandwidth to main memory. This benefits a number of common applications such as linear algebra, fast Fourier transforms, and image-processing filters (a small sketch follows this list).
  • Random read and write (i.e. gather and scatter): whereas fragment programs in the graphics APIs are limited to outputting 32 floats (RGBA × 8 render targets) at a pre-specified memory location, CUDA supports scattered writes, i.e. an unlimited number of stores to any memory address. This enables many new algorithms that are not feasible through a graphics API.
  • Arrays and integer addressing: graphics APIs force the user to store data as textures, which requires packing long arrays into 2D textures. This is cumbersome and imposes extra addressing math. CUDA allows data to be stored in standard arrays and can perform loads from any address.
  • Texturing support: CUDA provides optimized texture access with automatic caching, free filtering, and integer addressing.
  • Coalesced memory loads and stores: CUDA groups multiple memory load or store requests together, effectively reading or writing memory in chunks, which allows near-peak use of memory bandwidth.
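
To show what the shared-memory feature looks like in practice, here is a minimal sketch (my own example, not from the thread) of a block-level sum reduction that stages data in shared memory so that each input element is read from main memory only once:

  // Each 256-thread block sums 256 elements in on-chip shared memory,
  // then writes a single partial sum back to global memory.
  __global__ void block_sum(const float *in, float *partial, int n)
  {
      __shared__ float cache[256];          // per-block software-managed cache

      int i   = blockIdx.x * blockDim.x + threadIdx.x;
      int tid = threadIdx.x;

      cache[tid] = (i < n) ? in[i] : 0.0f;  // one global load per thread
      __syncthreads();

      // Tree reduction entirely in shared memory: no further global traffic.
      for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
          if (tid < stride)
              cache[tid] += cache[tid + stride];
          __syncthreads();
      }

      if (tid == 0)
          partial[blockIdx.x] = cache[0];   // one global store per block
  }

Without shared memory, each of the eight reduction passes would have to read and write global memory; here all of that traffic stays on-chip.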


1.4 How much faster is it? The GPU vs. CPU Architecture....

Suppose we have two arrays of 1,000 elements and want to find the element-wise sum. A CPU program would iteratively step through the two arrays, computing one sum at each step, so for 1,000 elements it takes 1,000 iterations to execute.
On a GPU, the program is defined as a sum operation over the two arrays. When the GPU executes the program, it generates an instance of the sum program for every element, creating and launching 1,000 "sum threads." A GeForce GTX 280 has 240 cores, allowing 240 threads to execute per clock, so in this idealized model the GeForce GTX 280 finishes execution in five cycles (1,000 ÷ 240, rounded up to 5).
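
As a hedged sketch of how those 1,000 threads actually get launched from the host (my own illustrative code, reusing the add_kernel from the 1.3 sketch; it is repeated here so the program is self-contained):

  #include <cuda_runtime.h>
  #include <stdio.h>

  // Same kernel as the 1.3 sketch: one thread per element.
  __global__ void add_kernel(const float *a, const float *b, float *c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          c[i] = a[i] + b[i];
  }

  int main(void)
  {
      const int n = 1000;
      const size_t bytes = n * sizeof(float);
      float h_a[1000], h_b[1000], h_c[1000];
      for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

      // Allocate device arrays and copy the inputs over.
      float *d_a, *d_b, *d_c;
      cudaMalloc(&d_a, bytes);
      cudaMalloc(&d_b, bytes);
      cudaMalloc(&d_c, bytes);
      cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

      // 4 blocks of 256 threads = 1,024 threads; the kernel's guard
      // ignores the 24 extras beyond n = 1000.
      add_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

      cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
      printf("c[999] = %.1f\n", h_c[999]);   // expect 999 + 1998 = 2997.0

      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
      return 0;
  }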


1.5 Before Benchmarking - A little theory:

In computing, floating point describes a system for representing numbers that would be too large or too small to be represented as integers. Numbers are in general represented approximately, to a fixed number of significant digits, and scaled using an exponent.
The term floating point refers to the fact that the radix point (decimal point, or, more commonly in computers, binary point) can "float": it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, so floating-point representation can be thought of as a computer realization of scientific notation.

The advantage of floating-point representation over fixed-point (and integer) representation is that it supports a much wider range of values. For example, a fixed-point representation with seven decimal digits and the decimal point assumed to sit after the fifth digit can represent numbers like 12345.67, 8765.43, and 123.00, whereas a floating-point representation with seven decimal digits can in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision. Historically, different bases have been used to represent floating-point numbers, with base 2 (binary) being the most common, followed by base 10 (decimal) and less common varieties such as base 16 (hexadecimal).
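
To make that range-versus-precision trade-off concrete, here is a small C example of my own (not from the thread). A 32-bit float carries about 7 significant decimal digits, so beyond 2^24 it can no longer represent every integer:

  #include <stdio.h>

  int main(void)
  {
      // 2^24 = 16777216 is the last point at which a 32-bit float
      // can still represent every integer exactly.
      float big = 16777216.0f;
      printf("%.1f\n", big + 1.0f);   // prints 16777216.0: the +1 is lost
      printf("%.1f\n", big + 2.0f);   // prints 16777218.0: spacing is now 2.0

      // Range, by contrast, is enormous: the same 32 bits happily hold 1e30.
      printf("%e\n", 1.0e30f);
      return 0;
  }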

The speed of floating-point operations is measured in FLOPS. FLOPS (or flops, or flop/s) is an acronym for FLoating point Operations Per Second. It is a measure of a computer's performance, especially in fields of scientific computing that make heavy use of floating-point calculations.
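
As a rough worked example for the GTX 280 mentioned in section 1.4 (hedged: the 3 FLOPs per core per clock figure is NVIDIA's quoted dual-issue peak, and real workloads land well below it):

  240 cores × 1.296 GHz shader clock × 3 FLOPs per core per clock ≈ 933 GFLOPS theoretical single-precision peak

Measured numbers, such as those CUDA-Z reports in section 1.6, will come in far below this kind of theoretical figure; that gap is exactly why collecting real benchmarks is worthwhile.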

Just to help put this into context (and an interesting fact):
The world's fastest supercomputer as of November 2009 is the Cray XT5 known as Jaguar, which beat Roadrunner, the holder of the number-one position for the previous 18 months. Jaguar recently upgraded its quad-core CPUs to six-core Opteron processors, giving it a theoretical peak of 2.3 petaFLOPS (across nearly a quarter of a million cores) and 1.75 petaFLOPS as measured by the Linpack benchmark. For comparison, a hand-held calculator needs to perform relatively few FLOPS: each calculation request, such as adding or subtracting two numbers, requires only a single operation, and a response time below about 0.1 second is perceived as instantaneous by a human operator, so a simple calculator needs only about 10 FLOPS to be functional.


1.6 Benchmarking Methods:

After hours of searching for ways to benchmark CUDA on a specific setup, I came to no definitive method. For the purpose of this test I have therefore chosen two simple methods (one a little more involved than the other, but still simple) for benchmarking CUDA.

1.6.1 Requirements:

  • CUDA-enabled GPU (http://www.nvidia.com/object/cuda_gpus.html)
  • CUDA-supported NVIDIA drivers

1.6.2 Method 1:

The first and easiest method is to use a simple program called CUDA-Z (download from http://sourceforge.net/projects/cuda-z/files/cuda-z/0.5/CUDA-Z-0.5.95.exe/download). Run the program, navigate to the "Performance" tab, and click Export > to text. Copy the information into your post. MAKE SURE YOU DO NOT HAVE ANY GPU-INTENSIVE PROGRAMS ALREADY RUNNING. IT IS BEST TO CLOSE ALL PROGRAMS BEFORE EXPORTING, AS GPU-INTENSIVE PROGRAMS WILL CHANGE THE RESULTS!

1.6.3 Method 2:

Being re-thought... it's too buggy.

1.6.4 Benchmark Submission Post Format:

I will give an example below if you are unsure!

Operating System:
CPU:
Driver Version: (This is a must, please! See 1.6.5 if you are unsure how to find it.)
Core/Shader/Memory [C/S/M]: (PLEASE! If you're unsure, download GPU-Z.)
Current Average Price of GPU: (if you can find it)
Method 1 output:
Method 2 output: (when updated)

And of course, this is OCAU, so please specify any overclocks on your CPU or GPU. If you can and are willing to, overclock your GPU, run the tests again, and repost your results including the specs of your overclock.

1.6.5 Finding your Driver Version

Method 1: Click Start > Run, type dxdiag, and click OK. When the popup appears, click the Display tab; the driver version is listed on the right under Drivers.

Method 2: If Method 1 did not work, open Device Manager (search for it in Vista, or go to Control Panel, switch to advanced view, and double-click Device Manager). Click the + next to Display adapters, right-click your GPU, and click Properties. In the Properties window, click the Driver tab and note the Driver Version.
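
Method 3 (for the Linux boxes this thread is partly aimed at; my own hedged addition, not in the original thread): ask the CUDA runtime directly. cudaDriverGetVersion and cudaRuntimeGetVersion are standard CUDA runtime API calls. Note they report the CUDA version the driver supports, not the NVIDIA driver build number asked for in 1.6.4; on Linux that build number is normally visible in /proc/driver/nvidia/version.

  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(void)
  {
      int driver = 0, runtime = 0;

      // Both calls are part of the standard CUDA runtime API.
      cudaDriverGetVersion(&driver);
      cudaRuntimeGetVersion(&runtime);

      // Versions are encoded as 1000*major + 10*minor, e.g. 3000 = CUDA 3.0.
      printf("CUDA driver version:  %d.%d\n", driver / 1000, (driver % 100) / 10);
      printf("CUDA runtime version: %d.%d\n", runtime / 1000, (runtime % 100) / 10);
      return 0;
  }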

