We show that efficient algorithms can be implemented on GPUs using either CUDA or OpenCL. This allows us to exploit the bandwidth of distributed processing, enabling very efficient parallel computation. We compare the CUDA approach with an OpenCL implementation designed for cluster environments via MPI. A performance comparison with optimized sequential implementations is presented, and the parallel speedup is evaluated. Because the parallel computation on graphics hardware is significantly faster than sequential execution, this approach is suitable for real-time processing applications. For applications that require high throughput and low latency, such as computer vision, efficient parallel implementations are essential. In this work we demonstrate that careful algorithm design, optimized data structures, and efficient use of the available GPU resources yield substantial performance improvements without complex low-level optimizations (we report speedups of more than a factor of 200 over Matlab implementations). Given recent advances in GPU computing, parallel algorithms are among the most promising approaches to a variety of challenging computer vision problems.