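I am trying to compute a histogram using some shared memory to improve performance. However, I have run into a problem I cannot figure out. Below is the kernel code I am having trouble with. I am sure I am missing something silly, but I cannot find it.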
__global__
void histogram_kernel_shared(const unsigned int* const d_vals,
                             unsigned int* d_histo,
                             const unsigned int numElems) {
    unsigned int gid = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int lid = threadIdx.x;
    unsigned int bin = d_vals[gid];
    __syncthreads();
    __shared__ unsigned int local_bin[1024];
    local_bin[lid] = d_histo[lid];
    __syncthreads();
    if(local_bin[lid] != d_histo[lid])
        printf("After copy to local. block = %u, lid = %u, local_bin = %u, d_histo = %u \n", blockIdx.x, lid, local_bin[lid], d_histo[lid]);
    __syncthreads();
    // If I comment out this line everything works fine.
    d_histo[lid] = local_bin[lid];
    // Even this leads to some wrong answers. Printouts on the next printf.
    // d_histo[lid] = d_histo[lid];
    __syncthreads();
    if(local_bin[lid] != d_histo[lid])
        printf("copy back. block = %u, lid = %u, local_bin = %u, d_histo = %u \n", blockIdx.x, lid, local_bin[lid], d_histo[lid]);
    __syncthreads();
    atomicAdd(&d_histo[bin], static_cast<unsigned int>(1));
    __syncthreads();
    // atomicAdd(&local_bin[bin], static_cast<unsigned int>(1));
    __syncthreads();
}
The kernel is launched as follows:
threads = 1024;
blocks = numElems/threads;
histogram_kernel_shared<<<blocks, threads>>>(d_vals, d_histo, numElems);
The number of elements is 10,240,000.
The number of bins is 1024.
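What puzzles me is why the assignment d_histo[lid] = local_bin[lid]; has any effect here. Without it, the code runs fine. It should not change anything, since I have only just copied the values with local_bin[lid] = d_histo[lid];. Even more puzzling is why local_bin[lid] = d_histo[lid]; would also give garbage values.
I guess something is going wrong elsewhere and causing some strange UB, but where?
Thanks for your help.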
Answered 2013-12-16 17:22:58
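You are launching 10,000 blocks:
blocks = numElems/threads;
Each block writes to the first 1024 locations of d_histo:
d_histo[lid] = local_bin[lid];
Because you have 10,000 blocks, all writing to the same locations, they step on and overwrite one another. __syncthreads() only synchronizes the threads within a single block, so one block can write back a stale copy after another block's atomicAdd has already updated d_histo, and those increments are lost. Since the order of block execution is undefined, you are guaranteed to get undefined behavior.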
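For illustration, here is a minimal sketch of one common way to use shared memory for a histogram without blocks overwriting each other. It is not taken from the question. The kernel name histogram_shared_sketch, the NUM_BINS constant, and the test data in main are assumptions made for this example, and it assumes every input value is already a valid bin index below 1024. Each block starts from a zeroed private histogram in shared memory rather than copying d_histo, and merges into global memory only with atomicAdd, so no block ever stores a stale copy over another block's result.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumption: 1024 bins, as in the question; inputs must be < NUM_BINS.
constexpr unsigned int NUM_BINS = 1024;

// Hypothetical kernel name; a sketch, not the asker's original kernel.
__global__ void histogram_shared_sketch(const unsigned int* d_vals,
                                        unsigned int* d_histo,
                                        unsigned int numElems) {
    __shared__ unsigned int local_bin[NUM_BINS];

    // Zero the block-private histogram instead of copying d_histo into it.
    for (unsigned int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_bin[i] = 0;
    __syncthreads();

    // Count this block's elements in shared memory.
    unsigned int gid = threadIdx.x + blockDim.x * blockIdx.x;
    if (gid < numElems)
        atomicAdd(&local_bin[d_vals[gid]], 1u);
    __syncthreads();

    // Merge into global memory with atomics only; never a plain store.
    for (unsigned int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local_bin[i] != 0)
            atomicAdd(&d_histo[i], local_bin[i]);
}

int main() {
    const unsigned int numElems = 10240000;
    const unsigned int threads = 1024;
    const unsigned int blocks = (numElems + threads - 1) / threads;

    // Made-up test data: every bin receives the same number of elements.
    std::vector<unsigned int> h_vals(numElems);
    for (unsigned int i = 0; i < numElems; ++i) h_vals[i] = i % NUM_BINS;

    unsigned int* d_vals = nullptr;
    unsigned int* d_histo = nullptr;
    cudaMalloc(&d_vals, numElems * sizeof(unsigned int));
    cudaMalloc(&d_histo, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_vals, h_vals.data(), numElems * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_histo, 0, NUM_BINS * sizeof(unsigned int));

    histogram_shared_sketch<<<blocks, threads>>>(d_vals, d_histo, numElems);
    cudaDeviceSynchronize();

    std::vector<unsigned int> h_histo(NUM_BINS);
    cudaMemcpy(h_histo.data(), d_histo, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    // Compare every bin against the count implied by the test data.
    bool ok = true;
    for (unsigned int i = 0; i < NUM_BINS; ++i)
        if (h_histo[i] != numElems / NUM_BINS) ok = false;
    printf(ok ? "histogram OK\n" : "histogram mismatch\n");

    cudaFree(d_vals);
    cudaFree(d_histo);
    return ok ? 0 : 1;
}
```

The host code zeroes d_histo once with cudaMemset before the launch, so the kernel never needs to read the global histogram back into shared memory.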
https://stackoverflow.com/questions/20616203