BrookGPU

Developer(s)	Stanford University
Stable release	v0.5 Beta 1 / 2007
Repository	svn.code.sf.net/p/brook/code/
Operating system	Linux, Windows
Type	Compiler/runtime
License	BSD license (parts are under the GPL)
Website	http://graphics.stanford.edu/projects/brookgpu/

The Brook programming language and its implementation, BrookGPU, were early and influential attempts to enable general-purpose computing on graphics processing units.[1][2] Brook, developed by the Stanford University graphics group, was a compiler and runtime implementation of a stream programming language targeting the highly parallel GPUs of the time, such as those found on ATI or Nvidia graphics cards.

BrookGPU compiled programs written using the Brook stream programming language, which is a variant of ANSI C. It could target OpenGL v1.3+, DirectX v9+ or AMD's Close to Metal for the computational backend and ran on both Microsoft Windows and Linux. For debugging, BrookGPU could also simulate a virtual graphics card on the CPU.
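The stream programming model behind Brook can be illustrated with a minimal sketch. This is plain Python, not Brook syntax, and the names `run_kernel` and `saxpy_kernel` are hypothetical; the sketch only shows the idea of a kernel that is applied independently to every element position of its input streams, which is what lets a GPU backend run those positions in parallel.

```python
def run_kernel(kernel, *streams):
    """Apply `kernel` independently at each element position of the streams.

    Illustrative stand-in for a stream runtime; a GPU backend would
    evaluate the positions in parallel rather than in a Python loop.
    """
    lengths = {len(s) for s in streams}
    if len(lengths) != 1:
        raise ValueError("all input streams must have the same length")
    return [kernel(*elements) for elements in zip(*streams)]


def saxpy_kernel(a):
    """Build a kernel computing a * x + y for one pair of elements."""
    def kernel(x, y):
        return a * x + y
    return kernel


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
result = run_kernel(saxpy_kernel(2.0), x, y)
print(result)
```

Because the kernel reads only the elements at its own position and has no side effects, the order of evaluation does not matter, which is the property a stream compiler relies on.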

Status

The last major beta release (v0.4) was in October 2004. Development later resumed, then stopped again in November 2007 with the v0.5 beta 1 release.

New features in v0.5 included a substantially upgraded and faster OpenGL backend that used framebuffer objects instead of PBuffers and was harmonised around standard OpenGL interfaces rather than proprietary vendor extensions. GLSL support was added, bringing to OpenGL the functionality (complex branching and loops) previously supported only by the DirectX 9 backend. In particular, this made Brook just as capable on Linux as on Windows.

Other improvements in the v0.5 series included multi-backend usage, whereby different threads could run different Brook programs concurrently (thus maximising use of a multi-GPU setup), and SSE and OpenMP support for the CPU backend, which allowed near-maximal usage of the multi-core CPUs of the time.

Performance comparison

A like-for-like comparison between desktop CPUs and GPGPUs is problematic because of algorithmic and structural differences.

For example, a 2.66 GHz Intel Core 2 Duo can perform at most 25 GFLOPS (25 billion single-precision floating-point operations per second) if it makes optimal use of SSE and streams its memory accesses so that the prefetcher works perfectly. Traditionally, however, because of shader program length limits, most GPGPU kernels perform relatively small amounts of work on large amounts of data in parallel. The main problem with executing GPGPU algorithms directly on desktop CPUs is therefore their vastly lower memory bandwidth: generally speaking, the CPU spends most of its time waiting on RAM. As an example, dual-channel PC2-6400 DDR2 RAM can sustain about 11 GB/s, or roughly 3 billion single-precision values per second at 4 bytes each. Because each operation must both read and write, this corresponds to around 1.5 GFLOPS at most. As a result, when constrained by memory bandwidth, Brook's CPU backend will not exceed 2 GFLOPS. In practice it is lower still, especially for anything other than float4, which is the only data type that can be SSE accelerated.
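The memory-bound estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes the roughly 11 figure is in gigabytes per second (the reading consistent with a total of about 3 billion values per second), 4 bytes per single-precision value, and one read plus one write per operation.

```python
def bandwidth_bound_gflops(bandwidth_gb_per_s, bytes_per_value=4, accesses_per_op=2):
    """Upper bound on GFLOPS when every operation must touch main memory.

    Assumes each operation performs `accesses_per_op` memory accesses
    (here one read and one write) of `bytes_per_value` bytes each.
    """
    values_per_s = bandwidth_gb_per_s / bytes_per_value  # billions of values/s
    return values_per_s / accesses_per_op


total_values = 11.0 / 4  # about 3 billion values per second in total
estimate = bandwidth_bound_gflops(11.0)
print(f"total: {total_values:.3f} Gvalues/s, bound: {estimate:.3f} GFLOPS")
```

Under these assumptions the bound comes out slightly below the rounded 1.5 GFLOPS quoted in the text, and well under the 25 GFLOPS compute peak, which is why the CPU backend is limited by memory rather than arithmetic.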

On an ATI HD 2900 XT (740 MHz core, 1000 MHz memory), Brook can perform a maximum of 410 GFLOPS via its DirectX 9 backend. At the time of the v0.5 release, driver and Cg compiler limitations made OpenGL much less efficient as a GPGPU backend on that GPU, so Brook could manage only 210 GFLOPS when using OpenGL. On paper the GPU looks around twenty times faster than the CPU but, as explained above, the comparison is not that simple. GPUs of that generation carried major penalties for branching and for read/write access, so a reasonable expectation for real-world code is about one third of the peak. That still leaves the ATI card at around 125 GFLOPS, some five times faster than the Intel Core 2 Duo.
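The ratios in this comparison follow directly from the quoted figures. The short sketch below recomputes them; the constant names are hypothetical and the numbers are simply those given in the text. It shows that the exact paper ratio is somewhat under twenty, and that a strict one-third derating gives a little more than the rounded 125 GFLOPS figure.

```python
CPU_PEAK_GFLOPS = 25.0          # 2.66 GHz Core 2 Duo, as quoted
GPU_PEAK_D3D9_GFLOPS = 410.0    # HD 2900 XT, DirectX 9 backend, as quoted
GPU_PEAK_OPENGL_GFLOPS = 210.0  # HD 2900 XT, OpenGL backend, as quoted
GPU_REALISTIC_GFLOPS = 125.0    # rounded real-world figure from the text

paper_speedup = GPU_PEAK_D3D9_GFLOPS / CPU_PEAK_GFLOPS
one_third_of_peak = GPU_PEAK_D3D9_GFLOPS / 3
realistic_speedup = GPU_REALISTIC_GFLOPS / CPU_PEAK_GFLOPS
opengl_fraction = GPU_PEAK_OPENGL_GFLOPS / GPU_PEAK_D3D9_GFLOPS

print(f"paper speedup:      {paper_speedup:.1f}x")
print(f"one third of peak:  {one_third_of_peak:.1f} GFLOPS")
print(f"realistic speedup:  {realistic_speedup:.1f}x")
print(f"OpenGL vs DirectX:  {opengl_fraction:.2f}")
```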

However, this discounts the significant cost of transferring the data to be processed to and from the GPU. Over a PCI Express 1.0 x8 interface, the memory of an ATI HD 2900 XT can be written to at about 730 Mbit/s and read from at about 311 Mbit/s, which is significantly slower than normal PC memory. For large datasets, this can greatly diminish the speed advantage of a GPU over a well-tuned CPU implementation. As GPUs become faster more quickly than CPUs and the PCI Express interface improves, it will make more sense to offload large processing jobs to GPUs.
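The effect of transfer cost can be sketched with a simple model: total GPU time is upload plus compute plus download. Everything about the workload below is a hypothetical assumption chosen for illustration (100 MB of data, 2.5 billion operations), the transfer rates are taken at face value in the units quoted above, and the throughput figures are the 125 GFLOPS and 25 GFLOPS values from the preceding paragraphs.

```python
def mbit_to_bytes_per_s(mbit_per_s):
    """Convert megabits per second to bytes per second."""
    return mbit_per_s * 1e6 / 8


def gpu_offload_seconds(data_bytes, flops, gpu_gflops, write_mbit_s, read_mbit_s):
    """Return (upload, compute, download) times in seconds for one offload."""
    upload = data_bytes / mbit_to_bytes_per_s(write_mbit_s)
    compute = flops / (gpu_gflops * 1e9)
    download = data_bytes / mbit_to_bytes_per_s(read_mbit_s)
    return upload, compute, download


DATA_BYTES = 100e6  # hypothetical dataset size
FLOPS = 2.5e9       # hypothetical amount of work

upload, compute, download = gpu_offload_seconds(
    DATA_BYTES, FLOPS, gpu_gflops=125.0, write_mbit_s=730.0, read_mbit_s=311.0
)
gpu_total = upload + compute + download
cpu_total = FLOPS / (25.0 * 1e9)  # CPU at its 25 GFLOPS peak, no transfer

print(f"GPU: upload {upload:.2f}s + compute {compute:.2f}s + download {download:.2f}s")
print(f"GPU total {gpu_total:.2f}s vs CPU {cpu_total:.2f}s")
```

In this hypothetical case the GPU finishes the arithmetic faster than the CPU, but the transfers dominate and the offload as a whole is slower. The balance shifts as the amount of work per transferred byte grows.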

Applications and games that use BrookGPU

Folding@home

References

[1] Tarditi, David; Puri, Sidd; Oglesby, Jose (2006). "Accelerator: using data parallelism to program GPUs for general-purpose uses" (PDF). ACM SIGARCH Computer Architecture News. 34 (5). doi:10.1145/1168919.1168898.
[2] Che, Shuai; Boyer, Michael; Meng, Jiayuan; Tarjan, D.; Sheaffer, Jeremy W.; Skadron, Kevin (2008). "A performance study of general-purpose applications on graphics processors using CUDA". J. Parallel and Distributed Computing. 68 (10): 1370–1380. doi:10.1016/j.jpdc.2008.05.014.

External links

Official website, Stanford University
