CUDA/PyCUDA - Large matrix traversal and block/grid size


Something I am working on has highlighted the fact that I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix that I want to traverse, filling in each element with a value. My kernel is this:

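__global__ void myfun(float *vals, float *out, int m, int n)
{
    // Global row and column handled by this thread.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Row-major offset into the m x n matrix.
    int index = row * n + col;

    // Threads that fall outside the matrix do nothing.
    if ((row < m) && (col < n)) {
        out[index] = index;
    }
}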

where m = 1000 and n = 10. I don't know how to slice this up so that I cover every element in the matrix. I need coverage of 1000*10 = 10,000 elements and, given the limit on the number of threads per block, I can't use a block size of (10,1000,1). Using PyCUDA, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?

Fix the block size and keep the grid size dynamic. That way, the kernel will cover each element of the matrix no matter what the values of m and n are.

block = (8, 8)
grid = ((n + 7) / 8, (m + 7) / 8)

Launch the kernel with this grid and block configuration. You may change the block size if desired, as long as it stays within the limits of the device.
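To see why rounding the grid size up gives full coverage, here is a minimal CPU-only sketch in plain Python that emulates the kernel's index arithmetic. The helper names `grid_for` and `covered_indices` are made up for illustration and are not part of PyCUDA; the sketch only checks the arithmetic and does not launch anything on the GPU.

```python
def grid_for(m, n, block=(8, 8)):
    """Return (grid_x, grid_y) so that grid * block spans n columns and m rows."""
    bx, by = block
    # Ceiling division: round up so a partially filled tile still gets a block.
    return ((n + bx - 1) // bx, (m + by - 1) // by)


def covered_indices(m, n, block, grid):
    """Emulate the kernel's indexing on the CPU and collect every index written."""
    bx, by = block
    gx, gy = grid
    written = set()
    for block_y in range(gy):
        for block_x in range(gx):
            for thread_y in range(by):
                for thread_x in range(bx):
                    row = block_y * by + thread_y
                    col = block_x * bx + thread_x
                    if row < m and col < n:  # same bounds check as the kernel
                        written.add(row * n + col)
    return written


m, n = 1000, 10
block = (8, 8)
grid = grid_for(m, n, block)
written = covered_indices(m, n, block, grid)
print(grid, len(written) == m * n)
```

For m = 1000 and n = 10 this gives a grid of (2, 125), so 16 x 1000 threads are launched and the bounds check in the kernel discards the ones that fall past column 9. Note that under Python 3 the grid expression needs `//` (integer division), since `/` would produce floats.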

