CUDA - What's the difference between a thread in a block and a warp (32 threads)?


I have written a string matching test program to compare GPU performance against the CPU.

When I launch the kernel with <<<1,1>>> (1 block containing 1 thread), the execution time is 430 ms. When I use 1 block of 2 threads, <<<1,2>>>, the execution time is 303 ms. Lastly, when I launch the kernel with <<<2,1>>> (2 blocks of 1 thread each), the time is half of 430 ms (that is, 215 ms).

What is the difference between a thread in a block and a warp? What makes 1 block containing 2 threads slower than 2 blocks of 1 thread each?

The first point to make is that a GPU requires hundreds or thousands of active threads to hide the architecture's inherent high latency and to fully utilise the available arithmetic capacity and memory bandwidth. Benchmarking code with 1 or 2 threads in 1 or 2 blocks is a waste of time.

The second point is that there is no such thing as a "thread in a block". Threads are fundamentally executed in warps of 32 threads. Blocks are composed of 1 or more warps, and a grid is composed of 1 or more blocks.
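As a rough illustration of that hierarchy, the sketch below maps a thread's index within a one-dimensional block to the warp and lane it would fall in. The names `kWarpSize`, `WarpSlot`, `locate_thread` and `warps_in_block` are hypothetical helpers made up for this example, not CUDA APIs, and the warp size of 32 is hard-coded as an assumption (in device code CUDA exposes it as the built-in `warpSize`).

```cpp
// Assumption: 32 threads per warp, as stated in the answer above.
constexpr int kWarpSize = 32;

// Hypothetical helper type: which warp of the block a thread belongs to,
// and which of that warp's 32 lanes it occupies.
struct WarpSlot {
    int warp;
    int lane;
};

// Map a thread's linear index within a 1D block to its warp and lane.
inline WarpSlot locate_thread(int thread_in_block) {
    return WarpSlot{thread_in_block / kWarpSize, thread_in_block % kWarpSize};
}

// Number of warps needed to hold a block: a partially filled warp
// still counts as a whole warp, so round up.
inline int warps_in_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}
```

Under these assumptions, thread 33 of a block sits in warp 1, lane 1, and a block of 33 threads occupies 2 warps even though the second warp has only one live thread.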

When you launch a grid containing a single block with 1 thread, you launch 1 warp. That warp contains 31 "dummy" threads which are masked off, and a single live thread. If you launch a single block with 2 threads, you still launch 1 warp, but now that single warp contains 2 active threads.

When you launch 2 blocks containing a single thread each, it results in 2 warps, each of which contains 1 active thread. Because all scheduling and execution is done on a per-warp basis, you now have 2 separate entities (warps) which the hardware can schedule and execute independently. This allows more latency hiding and fewer instruction pipeline stalls, and your code runs faster as a result.
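The warp counts for the three launches in the question can be worked out with a few lines of host-side arithmetic. This is only a sketch: `LaunchSummary` and `summarize_launch` are hypothetical names invented here, the warp size of 32 is an assumption taken from the answer, and the code counts warps and idle lanes only. It makes no attempt to predict execution time.

```cpp
// Assumption: 32 threads per warp, as stated in the answer above.
constexpr int kLanesPerWarp = 32;

// Hypothetical summary of what a <<<blocks, threads_per_block>>> launch
// hands to the hardware scheduler.
struct LaunchSummary {
    int warps;         // independently schedulable warps
    int live_threads;  // threads that do real work
    int idle_lanes;    // masked-off "dummy" lanes
};

inline LaunchSummary summarize_launch(int blocks, int threads_per_block) {
    // Warps never span blocks, so round up per block, then multiply.
    const int warps_per_block =
        (threads_per_block + kLanesPerWarp - 1) / kLanesPerWarp;
    const int warps = blocks * warps_per_block;
    const int live = blocks * threads_per_block;
    return LaunchSummary{warps, live, warps * kLanesPerWarp - live};
}
```

With these definitions, `summarize_launch(1, 1)` gives 1 warp with 31 idle lanes, `summarize_launch(1, 2)` gives 1 warp with 30 idle lanes, and `summarize_launch(2, 1)` gives 2 warps with 62 idle lanes in total. Only the last case gives the scheduler two warps to interleave.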

So the tl;dr answer is 1 block = 1 warp, 2 blocks = 2 warps, the latter being less sub-optimal than the former.

