CUDA/PyCUDA - Large matrix traversal and block/grid size
A problem I'm working on has highlighted the fact that I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix that I want to traverse, filling in each element with a value. The kernel is this:
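    __global__ void myfun(float *vals, float *out, int m, int n)
    {
        // Global row and column handled by this thread
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int index = row * n + col;  // row-major offset

        // Threads that fall outside the matrix do nothing
        if ((row < m) && (col < n))
        {
            out[index] = index;
        }
    }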
where m = 1000 and n = 10. I don't know how to slice this up so that I cover every element in the matrix. I need coverage of 1000*10 = 10,000 elements and, given the limit on the number of threads per block, I can't use a block size of (10,1000,1). Using PyCUDA, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?
Fix the block size and keep the grid size dynamic. That way, the kernel will cover every element of the matrix no matter what the values of m and n are.
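    block = (8, 8, 1)
    grid = ((n + 7) // 8, (m + 7) // 8)

The divisions are integer divisions: adding 7 before dividing by 8 rounds up, so a partial block is still launched at each edge, and the bounds check in the kernel discards the extra threads.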
Launch the kernel with this grid and block configuration. You may change the block size if desired, as long as it stays within the limits of the device.
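To see why this covers the whole matrix, here is a small sketch in plain Python (no GPU needed) that emulates the kernel's index arithmetic on the CPU. The helper names grid_for and covered_elements are made up for illustration and are not part of PyCUDA; the 3-tuple block and 2-tuple grid simply mirror the shapes used in the question.

```python
def grid_for(m, n, block=(8, 8, 1)):
    """Round the matrix extents up to a whole number of blocks.

    x runs over columns (n), y runs over rows (m), matching the kernel.
    """
    bx, by, _ = block
    return ((n + bx - 1) // bx, (m + by - 1) // by)


def covered_elements(m, n, block, grid):
    """Emulate the kernel's index math and collect every index it would write."""
    bx, by, _ = block
    gx, gy = grid
    written = set()
    for block_y in range(gy):
        for block_x in range(gx):
            for thread_y in range(by):
                for thread_x in range(bx):
                    row = block_y * by + thread_y
                    col = block_x * bx + thread_x
                    # Same bounds check as in the kernel
                    if row < m and col < n:
                        written.add(row * n + col)
    return written


m, n = 1000, 10
block = (8, 8, 1)
grid = grid_for(m, n, block)
written = covered_elements(m, n, block, grid)
print(grid, len(written))  # (2, 125) 10000
```

For m = 1000 and n = 10 this launches 2 x 125 blocks of 64 threads, 16,000 threads in total, and the bounds check leaves the 6,000 surplus threads idle.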