CUDA/PyCUDA - Large matrix traversal and block/grid size

May 15, 2015

The problem I am working on has highlighted the fact that I don't have a firm grasp of how blocks and grids work in CUDA. I have a 1000x10 matrix that I need to traverse, filling in each element with a value. My kernel looks like this:

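    __global__ void myfun(float *vals, float *out, int m, int n)
    {
        // Global row and column handled by this thread
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int index = row * n + col;  // row-major offset into out

        // Threads that fall outside the m x n matrix do nothing
        if ((row < m) && (col < n)) {
            out[index] = index;
        }
    }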

where m = 1000 and n = 10. I don't know how to slice this up so that I cover every element in the matrix. Since I need coverage for 1000*10 = 10,000 elements, and given the limitations on the number of threads per block, I can't use a block size of (10,1000,1). Using PyCUDA, I've tried things like block = (10,100,1), grid = (1,10), but I never get full coverage of the matrix elements. What's the right way to do this?
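To make the thread-count arithmetic in the question concrete, here is a small host-side Python sketch. The helper names `threads_per_block` and `fits_in_block` are hypothetical, and the default limit of 1024 is an assumption: it is the per-block maximum on devices of compute capability 2.0 and later, while older 1.x devices allow only 512. The real value should be queried from the device in use.

```python
def threads_per_block(block):
    """Total number of threads in a block of shape (x, y, z)."""
    x, y, z = block
    return x * y * z


def fits_in_block(block, max_threads=1024):
    # max_threads is an assumed limit; query the actual device for the real value.
    return threads_per_block(block) <= max_threads


# A single block spanning the whole 1000x10 matrix would need 10,000 threads.
whole_matrix = (10, 1000, 1)
# The block shape tried in the question uses 1,000 threads.
tried = (10, 100, 1)

print(threads_per_block(whole_matrix), fits_in_block(whole_matrix))
print(threads_per_block(tried), fits_in_block(tried),
      fits_in_block(tried, max_threads=512))
```

Under these assumptions, a 1,000-thread block fits a 1024-thread limit but not a 512-thread one, so whether such a launch succeeds depends on the device.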

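Fix the block size and keep the grid size dynamic. That way, the kernel will cover each element of the matrix no matter what the values of m and n are. The grid dimensions are rounded up, so some threads may fall outside the matrix; the bounds check in the kernel makes those threads do nothing.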

    block = (8, 8, 1)
    grid = ((n + 7) // 8, (m + 7) // 8)
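The rounding-up in the grid formula can be sketched as a small helper. The names `ceil_div` and `grid_for` are hypothetical and not part of PyCUDA; this is plain Python that only computes the launch dimensions.

```python
def ceil_div(a, b):
    """Integer division of a by b, rounded up (for positive integers)."""
    return (a + b - 1) // b


def grid_for(m, n, block=(8, 8, 1)):
    """Grid (x, y) with enough blocks to cover an m-row, n-column matrix."""
    bx, by, _ = block
    # x indexes columns and y indexes rows, matching the kernel's indexing.
    return (ceil_div(n, bx), ceil_div(m, by))


grid = grid_for(1000, 10)
print(grid)  # (2, 125)
```

For m = 1000 and n = 10 this launches 2 x 125 blocks of 64 threads, 16,000 threads in total, and the kernel's bounds check discards the 6,000 that land outside the matrix.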

Launch the kernel with this grid and block configuration. As long as you stay within the limits of the device, you may change the block size if desired.
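One way to convince yourself that a given configuration covers the matrix is to emulate the kernel's index arithmetic on the host. The sketch below is pure Python, not a PyCUDA launch, and the function name `simulate_launch` is hypothetical. It loops over every block and thread, applies the same bounds check as the kernel, and counts how many times each element would be written.

```python
def simulate_launch(m, n, block, grid):
    """Emulate the kernel's indexing on the host; count writes per element."""
    writes = [0] * (m * n)
    bdx, bdy, _ = block   # blockDim.x, blockDim.y
    gdx, gdy = grid       # gridDim.x, gridDim.y
    for by in range(gdy):
        for bx in range(gdx):
            for ty in range(bdy):
                for tx in range(bdx):
                    row = by * bdy + ty
                    col = bx * bdx + tx
                    if row < m and col < n:  # same bounds check as the kernel
                        writes[row * n + col] += 1
    return writes


m, n = 1000, 10
block = (8, 8, 1)
grid = ((n + 7) // 8, (m + 7) // 8)
writes = simulate_launch(m, n, block, grid)
print(all(w == 1 for w in writes))  # True
```

Every element is written exactly once. The same check can be repeated for another block shape, provided the grid is recomputed from that shape.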
