CUDA - What's the difference between a thread in a block and a warp (32 threads)?
I have written a string-matching test program to compare GPU performance against the CPU.
When I call the kernel with <<<1,1>>> (1 block containing 1 thread), the execution time is 430 ms. When I use 1 block of 2 threads, <<<1,2>>>, the execution time is 303 ms. Lastly, when I call the kernel with <<<2,1>>> (2 blocks of 1 thread each), the time is half of 430 ms (that is, 215 ms).
What is the difference between a thread in a block and a warp? What makes 1 block containing 2 threads slower than 2 blocks of 1 thread each?
The first point to make is that the GPU requires hundreds or thousands of active threads to hide the architecture's inherent high latency and to fully utilise the available arithmetic capacity and memory bandwidth. Benchmarking code with 1 or 2 threads in 1 or 2 blocks is a waste of time.
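As a rough sketch of what a launch with "hundreds or thousands of threads" looks like, the usual pattern is one thread per work item, with the block count rounded up. This is plain host-side C++ (it also compiles as CUDA host code). The helper name `blocks_for` and the figure of 256 threads per block are assumptions chosen for illustration, not values from the question.

```cpp
// Hypothetical helper, not part of the CUDA API.
// Returns how many blocks are needed so that blocks * threads_per_block
// covers n work items (ceiling division).
constexpr int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Assumed example: 1,000,000 items at 256 threads per block.
// 3906 blocks cover 999,936 items, so one extra block is needed.
static_assert(blocks_for(1000000, 256) == 3907, "rounds up to cover the tail");

// Exact multiples need no extra block.
static_assert(blocks_for(512, 256) == 2, "no padding block for exact fit");
```

The result would be passed as the first launch parameter, for example `kernel<<<blocks_for(n, 256), 256>>>(...)`, with the kernel itself checking that its global index is below `n`.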
The second point is that there is no such thing as a "thread in a block". Threads are fundamentally executed in warps of 32 threads. Blocks are composed of 1 or more warps, and a grid is composed of 1 or more blocks.
When you launch a grid containing a single block with 1 thread, you launch 1 warp. That warp contains 31 "dummy" threads which are masked off and a single live thread. If you launch a single block with 2 threads, you still launch 1 warp, but now that single warp contains 2 active threads.
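The warp arithmetic described here can be sketched in plain host-side C++ (it also compiles as CUDA host code). The helper names `warps_per_block` and `masked_lanes` are made up for this illustration, and the warp size of 32 is hard-coded as an assumption; device code can read the real value from the built-in `warpSize`.

```cpp
// Hypothetical helpers, not part of the CUDA API.
constexpr int kWarpSize = 32;  // assumption: warp size is 32 threads

// Number of warps allocated for one block: the thread count is
// rounded up to a whole number of warps.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes that are allocated but masked off in the block's last warp.
constexpr int masked_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// <<<1,1>>>: one warp, 1 live thread, 31 masked lanes.
static_assert(warps_per_block(1) == 1 && masked_lanes(1) == 31, "1 thread");
// <<<1,2>>>: still one warp, 2 live threads, 30 masked lanes.
static_assert(warps_per_block(2) == 1 && masked_lanes(2) == 30, "2 threads");
```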
When you launch 2 blocks containing a single thread each, this results in 2 warps, each of which contains 1 active thread. Because all scheduling and execution is done on a per-warp basis, you now have 2 separate entities (warps) that the hardware can schedule and execute independently. This allows more latency hiding and fewer instruction pipeline stalls, and your code runs faster as a result.
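To make the comparison concrete, here is a small host-side C++ sketch (also valid as CUDA host code) that counts the independently schedulable warps for the three launches in the question. The `Launch` struct and `total_warps` function are hypothetical names for this illustration, and a warp size of 32 is assumed. It only models the counting argument, not actual timings.

```cpp
// Hypothetical model of a launch configuration, not part of the CUDA API.
constexpr int kWarpSize = 32;  // assumption: warp size is 32 threads

struct Launch {
    int blocks;             // first launch parameter in <<<blocks, threads>>>
    int threads_per_block;  // second launch parameter
};

// Warps never span blocks, so each block contributes its own
// rounded-up number of warps.
constexpr int total_warps(Launch l) {
    return l.blocks * ((l.threads_per_block + kWarpSize - 1) / kWarpSize);
}

static_assert(total_warps(Launch{1, 1}) == 1, "<<<1,1>>> gives 1 warp");
static_assert(total_warps(Launch{1, 2}) == 1, "<<<1,2>>> gives 1 warp");
static_assert(total_warps(Launch{2, 1}) == 2, "<<<2,1>>> gives 2 warps");
```

Both `<<<1,2>>>` and `<<<2,1>>>` run 2 live threads, but only the second gives the scheduler 2 warps to interleave.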
So the tl;dr answer is that 1 block = 1 warp and 2 blocks = 2 warps, the latter being less sub-optimal than the former.