gpgpu - Count the number of cycles in a CUDA kernel
How can I count the number of cycles performed by a function like the following? Should I simply count the number of additions, multiplications, and divisions? Where can I check how many cycles an addition takes in CUDA?
__global__ void mandelbrotset_per_element(grayscale *image){
    // Bounds of the region of the complex plane being rendered
    float minr = -2.0f, maxr = 1.0f;
    float mini = -1.2f, maxi = mini + (maxr-minr) * c_rows / c_cols;
    float realfactor = (maxr - minr) / (c_cols-1);
    float imagfactor = (maxi - mini) / (c_rows-1);
    bool isinset;
    float c_real, c_imag, z_real, z_imag;
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    // Grid-stride loops over rows and columns
    while (y < c_rows){
        while (x < c_cols) {
            c_real = minr + x * realfactor;
            c_imag = maxi - y * imagfactor;
            z_real = c_real;
            z_imag = c_imag;
            isinset = true;
            for (int k = 0; k < c_iterations; k++){
                float z_real2 = z_real * z_real;
                float z_imag2 = z_imag * z_imag;
                if (z_real2 + z_imag2 > 4){
                    isinset = false;
                    break;
                }
                z_imag = 2 * z_real * z_imag + c_imag;
                z_real = z_real2 - z_imag2 + c_real;
            }
            if (isinset) image[y*c_cols+x] = 255;
            else image[y*c_cols+x] = 0;
            x += blockDim.x * gridDim.x;
        }
        // Reset the column index before moving on to the next row
        x = blockDim.x * blockIdx.x + threadIdx.x;
        y += blockDim.y * gridDim.y;
    }
}
Instruction throughput is described in the CUDA programming guide.
You can try measuring a sequence of instructions using the native clock() function, which is also described in the programming guide.
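As a rough illustration of that approach, here is a minimal sketch that times a chain of dependent float additions on a single thread. The kernel name `time_adds`, its parameters, and the iteration count are hypothetical and not part of the original question. It uses `clock64()`, the 64-bit variant of `clock()`, to avoid counter wrap-around. The reported figure includes loop overhead, and the compiler is free to unroll or reschedule the loop, so treat the number as an estimate rather than an exact per-instruction latency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical micro-benchmark: one thread runs n dependent additions and
// records how many clock cycles elapsed on its multiprocessor.
__global__ void time_adds(float seed, int n, long long *cycles, float *sink)
{
    float acc = seed;
    long long start = clock64();      // read the per-SM cycle counter
    for (int i = 0; i < n; ++i) {
        acc += seed;                  // each add depends on the previous one
    }
    long long stop = clock64();
    *cycles = stop - start;
    *sink = acc;                      // store the result so the loop is not removed
}

int main()
{
    const int n = 1024;
    long long *d_cycles = nullptr;
    float *d_sink = nullptr;
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMalloc(&d_sink, sizeof(float));

    // A single thread keeps the measurement free of scheduling effects
    // from other warps.
    time_adds<<<1, 1>>>(1.0f, n, d_cycles, d_sink);

    long long h_cycles = 0;
    cudaError_t err = cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%d dependent adds: %lld cycles (about %.2f per add, incl. loop overhead)\n",
           n, h_cycles, (double)h_cycles / n);

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

Passing `seed` as a kernel argument, rather than hard-coding it, keeps the compiler from folding the whole loop into a constant at compile time.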
The compiler tends to obscure the actual counts of operations at the source code level (increasing or possibly decreasing the apparent arithmetic intensity). If you want to identify exactly what the machine is doing, you may want to inspect the PTX (nvcc -ptx ...) or possibly the machine assembly level code, called SASS, which you can extract from the executable using the cuobjdump utility.
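To make the point about source-level counts concrete, here is a small sketch (the file name `fma_demo.cu` and the kernel `scale_and_shift` are hypothetical). At source level each element costs one multiply and one add, but with default nvcc settings the compiler will typically contract the pair into a single fused multiply-add, which you can check by looking at the generated PTX or SASS.

```cuda
// fma_demo.cu (hypothetical file name)
#include <cstdio>
#include <cuda_runtime.h>

// Source level: one multiplication and one addition per element.
__global__ void scale_and_shift(const float *in, float *out,
                                float a, float b, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i] + b;   // usually emitted as a single FMA instruction
}

int main()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_and_shift<<<1, n>>>(d_in, d_out, 2.0f, 1.0f, n);

    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("out[3] = %f\n", h_out[3]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

// To inspect what the compiler actually generated:
//   nvcc -ptx fma_demo.cu -o fma_demo.ptx      (intermediate PTX)
//   nvcc fma_demo.cu -o fma_demo
//   cuobjdump -sass fma_demo                   (machine-level SASS)
```

Counting instructions in the SASS listing for the inner loop of your own kernel gives a far more reliable operation count than counting operators in the source.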