A while ago, I was going through the pics from my last vacation, casually editing using the built-in filters in the Photos app, and I never really paid attention before that the whole
Looking at the convolution algorithm, it's easy to see that it can be parallelized effectively, as each output element can be computed independently. The most basic implementation would involve assigning each
The kernel function from part 1 (shown in Figure 2.1) has a big problem (at least theoretically). It is accessing all of the required data from global memory! I've discussed
So far I have not been able get any meaningful improvement over the naive GPU implementation of 2D convolution (table below shows runtime). I will first start with the detailed analysis of the