Friday, August 27, 2010

OpenCL™ Optimization Case Study: Simple Reductions

This case study shows that a naive reduction works even on the GPU, but that a more careful scheme makes much better use of the hardware.

The article is clearly written, with an illustration for each idea used to keep the whole SIMD machine busy. It explains how to implement a parallel reduction on the GPU, that is, folding an array with a given binary operator to find a min, a max, and so on.
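As a baseline for what a reduction computes, here is a minimal serial sketch in plain C. The names `reduce_op`, `op_min`, `op_max`, `op_sum` and `reduce_serial` are made up for this illustration and do not come from the AMD article.

```c
#include <math.h>
#include <stddef.h>

/* Binary operator to fold with. It should be associative so that the
   result does not depend on how the elements are grouped. */
typedef float (*reduce_op)(float, float);

static float op_min(float a, float b) { return a < b ? a : b; }
static float op_max(float a, float b) { return a > b ? a : b; }
static float op_sum(float a, float b) { return a + b; }

/* Serial reference reduction: folds `op` over data[0..n),
   starting from the operator's identity element. */
static float reduce_serial(const float *data, size_t n,
                           float identity, reduce_op op) {
    float acc = identity;
    for (size_t i = 0; i < n; ++i)
        acc = op(acc, data[i]);
    return acc;
}
```

The identity element is what makes an empty input well defined: `INFINITY` for min, `-INFINITY` for max, `0.0f` for a sum.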

Associative Reduction Tree and SIMD Mapping
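A rough way to picture the associative tree is to emulate one work-group on the host in plain C. This is only a sketch under assumptions: `tree_reduce_min_associative` is a hypothetical name, `scratch` stands in for the work-group's local memory, and `local_size` is assumed to be a power of two (at least 1). The outer loop plays the role of the barrier-separated steps, the inner loop plays the work-items.

```c
#include <stddef.h>

/* Emulates one work-group reducing `scratch` in place with min.
   Neighbouring elements are combined first (offset 1, then 2, 4, ...),
   so operand order is preserved and associativity alone is enough. */
static float tree_reduce_min_associative(float *scratch, size_t local_size) {
    for (size_t offset = 1; offset < local_size; offset <<= 1) {
        size_t mask = (offset << 1) - 1;
        for (size_t local_index = 0; local_index < local_size; ++local_index) {
            /* Only every (2 * offset)-th work-item is active in this step. */
            if ((local_index & mask) == 0) {
                float mine = scratch[local_index];
                float other = scratch[local_index + offset];
                scratch[local_index] = (mine < other) ? mine : other;
            }
        }
        /* In a kernel, a barrier(CLK_LOCAL_MEM_FENCE) would separate steps. */
    }
    return scratch[0];
}
```

Note that the active work-items are spread out across the group (0, 2, 4, ... then 0, 4, 8, ...), so on SIMD hardware many lanes sit idle inside vectors that are still being executed.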


Commutative Reduction and SIMD Mapping
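When the operator is also commutative, the elements can be paired across the two halves of the array instead of with their neighbours. The sketch below emulates that on the host in plain C, under the same assumptions as before: `tree_reduce_min_commutative` is a hypothetical name, `scratch` stands in for local memory, and `local_size` is a power of two (at least 1).

```c
#include <stddef.h>

/* Emulates one work-group reducing `scratch` in place with min.
   Element i is combined with element i + offset, so operands are
   reordered: the operator must be commutative as well as associative. */
static float tree_reduce_min_commutative(float *scratch, size_t local_size) {
    for (size_t offset = local_size / 2; offset > 0; offset >>= 1) {
        /* Active work-items are the contiguous range [0, offset). */
        for (size_t local_index = 0; local_index < offset; ++local_index) {
            float mine = scratch[local_index];
            float other = scratch[local_index + offset];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        /* In a kernel, a barrier(CLK_LOCAL_MEM_FENCE) would separate steps. */
    }
    return scratch[0];
}
```

Because the active work-items form one contiguous range, the idle ones are grouped together as well, which maps better onto SIMD hardware than scattered active lanes.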
Two-stage Reduction
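One way to read the two-stage idea: first every work-item serially folds a strided slice of the input, then each work-group combines its work-items' partial results with a parallel tree, and the few per-group results are finished off on the host. The plain C sketch below emulates this on the CPU. `LOCAL_SIZE`, `min_f` and `two_stage_min` are hypothetical names chosen for this illustration, and the exact split of work is an assumption, not code from the AMD article.

```c
#include <math.h>
#include <stddef.h>

#define LOCAL_SIZE 4  /* hypothetical work-group size, must be a power of two */

static float min_f(float a, float b) { return a < b ? a : b; }

/* Host-side emulation of a two-stage min reduction over buffer[0..length)
   using `num_groups` work-groups of LOCAL_SIZE work-items each. */
static float two_stage_min(const float *buffer, size_t length,
                           size_t num_groups) {
    size_t global_size = num_groups * LOCAL_SIZE;
    float total = INFINITY;
    for (size_t group = 0; group < num_groups; ++group) {
        float scratch[LOCAL_SIZE];  /* stands in for local memory */
        /* Stage 1: each work-item serially reduces a strided slice. */
        for (size_t local_index = 0; local_index < LOCAL_SIZE; ++local_index) {
            float acc = INFINITY;
            for (size_t i = group * LOCAL_SIZE + local_index;
                 i < length; i += global_size)
                acc = min_f(acc, buffer[i]);
            scratch[local_index] = acc;
        }
        /* Stage 2: tree reduction inside the work-group. */
        for (size_t offset = LOCAL_SIZE / 2; offset > 0; offset >>= 1)
            for (size_t local_index = 0; local_index < offset; ++local_index)
                scratch[local_index] =
                    min_f(scratch[local_index], scratch[local_index + offset]);
        /* Final pass over the per-group results, done here on the host. */
        total = min_f(total, scratch[0]);
    }
    return total;
}
```

The point of the serial first stage is that the input length no longer has to match the number of work-items, and most of the work happens without any synchronization.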
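Final code:

__kernel
void reduce(__global float* buffer,
            const int block,
            const int length,
            __global float* result) {

  // Each work-item reduces its own contiguous chunk of `block` elements.
  int global_index = get_global_id(0) * block;
  float accumulator = INFINITY;  // identity element for min
  int upper_bound = (get_global_id(0) + 1) * block;
  if (upper_bound > length) upper_bound = length;  // clamp the last chunk
  while (global_index < upper_bound) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index++;
  }
  // One result slot per work-group: with more than one work-item per
  // group, the work-items would overwrite each other's partial minimum.
  result[get_group_id(0)] = accumulator;
}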

Source: AMD
