Small ideas for a big world: OpenCL™ Optimization Case Study: Simple Reductions

Friday, August 27, 2010

OpenCL™ Optimization Case Study: Simple Reductions

This study shows that the naive implementation works even on the GPU, but that a cleverer scheme can make the approach considerably more efficient.

The article is well explained, with illustrations showing each idea used to exploit the full SIMD machine. It explains how to implement a parallel reduction on the GPU (a reduce operation with a given binary operator, e.g. finding a min or a max).

Associative Reduction Tree and SIMD Mapping

Commutative Reduction and SIMD Mapping

Two-stage Reduction
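As a rough illustration of the two-stage idea, here is a minimal sequential C sketch, not the article's kernel. It assumes a hypothetical fixed number of emulated work-items (`NUM_ITEMS`, a power of two) and uses min as the operator. In stage 1 each work-item serially accumulates a strided slice of the input; in stage 2 the partial results are combined with a reduction tree, which on a real GPU would run in local memory with a barrier between passes.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical number of emulated work-items; must be a power of two. */
#define NUM_ITEMS 8

static float min_f(float a, float b) { return (a < b) ? a : b; }

/* Sequential emulation of a two-stage min reduction (illustrative only). */
float two_stage_min(const float *buffer, size_t length) {
    float scratch[NUM_ITEMS];
    assert((NUM_ITEMS & (NUM_ITEMS - 1)) == 0);

    /* Stage 1: each emulated work-item walks the input with a stride of
       NUM_ITEMS and keeps a private running minimum. */
    for (size_t item = 0; item < NUM_ITEMS; item++) {
        float accumulator = INFINITY;
        for (size_t i = item; i < length; i += NUM_ITEMS) {
            accumulator = min_f(accumulator, buffer[i]);
        }
        scratch[item] = accumulator;
    }

    /* Stage 2: reduction tree over the partial results. Each pass halves
       the number of active slots; a GPU would place a barrier between passes. */
    for (size_t offset = NUM_ITEMS / 2; offset > 0; offset /= 2) {
        for (size_t item = 0; item < offset; item++) {
            scratch[item] = min_f(scratch[item], scratch[item + offset]);
        }
    }
    return scratch[0];
}
```

Because min is both associative and commutative, the order in which elements are paired does not change the result, which is what allows the strided access in stage 1.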

Final code:

__kernel
void reduce(__global float* buffer,
            const int block,
            const int length,
            __global float* result) {

  // Each work-item reduces its own contiguous block of the input.
  int global_index = get_global_id(0) * block;
  float accumulator = INFINITY;
  int upper_bound = (get_global_id(0) + 1) * block;
  if (upper_bound > length) upper_bound = length;  // clamp the last block
  while (global_index < upper_bound) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;  // running min
    global_index++;
  }
  // One partial result per work-group; this assumes one work-item per group.
  result[get_group_id(0)] = accumulator;
}
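The kernel above only produces one partial minimum per block, so a final pass over `result` is still needed. The following plain C sketch emulates that flow sequentially under stated assumptions: `reduce_block` mirrors the kernel body for one work-item, while `reduce_min` and its `num_items` parameter are hypothetical stand-ins for the host-side launch and final pass, which the post does not show.

```c
#include <assert.h>
#include <math.h>

/* Mirrors the kernel body for one emulated work-item: min over its block. */
static float reduce_block(const float *buffer, int block, int length, int id) {
    int index = id * block;
    int upper_bound = (id + 1) * block;
    if (upper_bound > length) upper_bound = length;  /* clamp the last block */
    float accumulator = INFINITY;
    for (; index < upper_bound; index++) {
        float element = buffer[index];
        accumulator = (accumulator < element) ? accumulator : element;
    }
    return accumulator;
}

/* Hypothetical host-side driver: splits the input into num_items blocks,
   reduces each block, then folds the partial results into one minimum. */
float reduce_min(const float *buffer, int length, int num_items) {
    assert(num_items > 0);
    int block = (length + num_items - 1) / num_items;  /* ceil(length / num_items) */
    float result = INFINITY;
    for (int id = 0; id < num_items; id++) {
        float partial = reduce_block(buffer, block, length, id);
        result = (result < partial) ? result : partial;
    }
    return result;
}
```

For example, `reduce_min(data, 7, 3)` splits seven elements into blocks of three, three and one. Work-items whose block starts past the end of the input simply contribute `INFINITY`, the identity for min.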

Source: AMD
