visual c++ - Memory layout mismatching between CPU and GPU code with CUDA
I'm experiencing a weird situation. I have these template structures:
    #ifdef __CUDACC__
    #define __hostdevice __host__ __device__
    #else
    #define __hostdevice
    #endif

    template <typename t>
    struct matrix
    {
        t* ptr;
        int col_size, row_size;
        int stride;
        // host & device methods
    };

    struct dummy1 {};
    struct dummy2 : dummy1 {};

    template <typename t>
    struct a_functor : dummy2
    {
        matrix<t> help_m;
        matrix<t> x, y;
        t *x_ptr, *y_ptr;
        int bsx, ind_thr;

        __hostdevice void operator()(t* __x, t* __y)
        {
            // functor code
        }
    };
I've structured my code into separate .cpp and .cu files: the a_functor object is created in a .cpp file and used in a kernel function. The problem is that, when executing operator() inside a kernel, I found random behaviour that I couldn't explain just by looking at the code. It was as if my structs were somehow corrupted. So, calling sizeof() on an a_functor object, I found:
CPU code (.cpp and .cu outside the kernel): 64 bytes
GPU code (inside the kernel): 68 bytes
There was obviously some kind of mismatch that ruined the whole thing. Going further, I tracked the distance between each struct member's address and the address of the struct itself, to inspect the produced memory layout, and here's what I found:
    a_functor foo;

    // CPU
    (char*)(&foo.help_m)  - (char*)(&foo) = 0
    (char*)(&foo.x)       - (char*)(&foo) = 16
    (char*)(&foo.y)       - (char*)(&foo) = 32
    (char*)(&foo.x_ptr)   - (char*)(&foo) = 48
    (char*)(&foo.y_ptr)   - (char*)(&foo) = 52
    (char*)(&foo.bsx)     - (char*)(&foo) = 56
    (char*)(&foo.ind_thr) - (char*)(&foo) = 60

    // GPU - inside a_functor::operator(), in-kernel
    (char*)(&this->help_m)  - (char*)(this) = 4
    (char*)(&this->x)       - (char*)(this) = 20
    (char*)(&this->y)       - (char*)(this) = 36
    (char*)(&this->x_ptr)   - (char*)(this) = 52
    (char*)(&this->y_ptr)   - (char*)(this) = 56
    (char*)(&this->bsx)     - (char*)(this) = 60
    (char*)(&this->ind_thr) - (char*)(this) = 64
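The host-side half of this measurement can also be collected with `offsetof` instead of pointer subtraction. What follows is only a minimal sketch, assuming a C++11 host compiler and simplified non-template stand-ins for the types above: `matrix_f`, `a_functor_f`, `layout_report`, `host_layout` and `print_layout` are hypothetical names that do not appear in the original code.

```cpp
#include <cassert>   // assert
#include <cstddef>   // offsetof, std::size_t
#include <cstdio>    // std::printf

// Hypothetical, non-template stand-ins for the question's matrix<t> and
// a_functor<t>, fixed to float so the example is self-contained.
struct matrix_f {
    float* ptr;
    int col_size, row_size;
    int stride;
};

struct dummy1 {};
struct dummy2 : dummy1 {};

struct a_functor_f : dummy2 {
    matrix_f help_m;
    matrix_f x, y;
    float *x_ptr, *y_ptr;
    int bsx, ind_thr;
};

// Size of the struct plus the byte offset of every data member.
struct layout_report {
    std::size_t size;
    std::size_t help_m, x, y, x_ptr, y_ptr, bsx, ind_thr;
};

// Collect the layout as seen by the compiler building this translation unit.
layout_report host_layout() {
    layout_report r;
    r.size    = sizeof(a_functor_f);
    r.help_m  = offsetof(a_functor_f, help_m);
    r.x       = offsetof(a_functor_f, x);
    r.y       = offsetof(a_functor_f, y);
    r.x_ptr   = offsetof(a_functor_f, x_ptr);
    r.y_ptr   = offsetof(a_functor_f, y_ptr);
    r.bsx     = offsetof(a_functor_f, bsx);
    r.ind_thr = offsetof(a_functor_f, ind_thr);
    // Sanity check: members appear in declaration order.
    assert(r.help_m < r.x && r.x < r.y && r.y < r.x_ptr);
    return r;
}

// Print the report in a form that is easy to diff against another run.
void print_layout(const layout_report& r) {
    std::printf("sizeof  = %zu\n", r.size);
    std::printf("help_m  @ %zu\n", r.help_m);
    std::printf("x       @ %zu\n", r.x);
    std::printf("y       @ %zu\n", r.y);
    std::printf("x_ptr   @ %zu\n", r.x_ptr);
    std::printf("y_ptr   @ %zu\n", r.y_ptr);
    std::printf("bsx     @ %zu\n", r.bsx);
    std::printf("ind_thr @ %zu\n", r.ind_thr);
}
```

Calling `print_layout(host_layout())` from host code prints the host view only. The device view would still have to be gathered inside a kernel, as in the listing above, and copied back for comparison.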
I really can't understand why nvcc generated this memory layout for my struct (what are those 4 bytes supposed to be or do?). I thought it was an alignment problem and tried to explicitly align a_functor, but I can't, because it is passed by value to the kernel:
template <typename t, typename str> __global__ void mykernel(str foo, t* src, t*dst);
and when I try to compile I get:
error: cannot pass a parameter with a too large explicit alignment to a __global__ routine on win32 platforms
So, to solve this strange situation (...and I do think it's an nvcc bug), what should I do? The only thing I can think of is playing with alignment and passing my struct to the kernel by pointer to avoid the aforementioned error. However, I'm really wondering: why is the memory layout mismatched?! It makes no sense...
Further information: I'm using Visual Studio 2008, compiling with MSVC on a Windows XP 32-bit platform. I installed the latest CUDA Toolkit, 5.0.35. My card is a GeForce GTX 570 (compute capability 2.0).
From the comments it appears there may be differences between the code you're running and the code you've posted, so it's difficult to give more than vague answers without being able to reproduce the problem. That said, on Windows there are cases where the layout and size of a struct can differ between the CPU and the GPU, and these are documented in the programming guide:
On Windows, the CUDA compiler may produce a different memory layout, compared to the host Microsoft compiler, for a C++ object of class type T that satisfies any of the following conditions:
- T has virtual functions or derives from a direct or indirect base class that has virtual functions;
- T has a direct or indirect virtual base class;
- T has multiple inheritance with more than one direct or indirect empty base class.
The size for such an object may also be different in host and device code. As long as type T is used exclusively in host or device code, the program should work correctly. Do not pass objects of type T between host and device code (e.g., as arguments to __global__ functions or through cudaMemcpy*() calls).
The third case may apply in your case, since you have an empty base class. Do you have multiple inheritance in the real code?
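One way to stay clear of all three conditions is to pass the kernel a flat struct with no base classes at all and to check its layout at compile time. This is only a sketch under stated assumptions, not a confirmed fix for the problem above: `matrix_f`, `a_functor_args`, `layout_is_flat` and `check_layout_or_abort` are hypothetical names, the element type is fixed to float, and `static_assert` and `<type_traits>` need a C++11 compiler, which is newer than Visual Studio 2008. The size check also assumes a platform where the members need no padding beyond what `matrix_f` already contains.

```cpp
#include <cassert>      // assert
#include <cstddef>      // offsetof
#include <type_traits>  // std::is_standard_layout, std::is_trivially_copyable

// Hypothetical stand-in for matrix<t>, fixed to float.
struct matrix_f {
    float* ptr;
    int col_size, row_size;
    int stride;
};

// Hypothetical flat argument block: the same data members as a_functor,
// but with no base classes, so none of the three conditions can apply.
struct a_functor_args {
    matrix_f help_m;
    matrix_f x, y;
    float *x_ptr, *y_ptr;
    int bsx, ind_thr;
};

// Compile-time checks, evaluated by whichever compiler builds this file.
static_assert(std::is_standard_layout<a_functor_args>::value,
              "a_functor_args must be standard-layout");
static_assert(std::is_trivially_copyable<a_functor_args>::value,
              "a_functor_args must be safe to copy byte-wise");
static_assert(offsetof(a_functor_args, help_m) == 0,
              "first member must start at offset 0");

// Runtime counterpart: true when the struct is exactly as large as the sum
// of its members (assumption: no extra padding on the target platform).
bool layout_is_flat() {
    return offsetof(a_functor_args, help_m) == 0 &&
           sizeof(a_functor_args) ==
               3 * sizeof(matrix_f) + 2 * sizeof(float*) + 2 * sizeof(int);
}

// Abort early (in debug builds) if the layout is not the expected one.
void check_layout_or_abort() {
    assert(layout_is_flat() && "a_functor_args has an unexpected layout");
}
```

The functor's `operator()` could then take such a struct as a parameter, or the functor could hold one as its only member. Either way, these checks only describe the layout seen by the compiler that evaluates them, so the in-kernel `sizeof` comparison from the question remains the real test.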