Is synchronization always required when a CUDA kernel's launch parameters depend on a previous kernel?


Samuel Martin

This pattern comes up a lot in my code. Essentially it amounts to a first kernel that filters a large dataset, where the selected entries returned will be very sparse, followed by a second kernel that performs more involved computation on the greatly reduced dataset.

It seems like the cudaStreamSynchronize should be avoidable, but I don't see a way around it.

  • Are there alternative patterns that avoid synchronizing between the two kernels?
  • Would CUDA dynamic parallelism help here in any way?

Sample code:

/* Pseudocode. Won't compile. */
/* Please ignore silly mistakes/syntax and inefficient/incorrect simplifications. */

__global__ void bar( const float * dataIn, float * dataOut, unsigned int * counter_ptr ) 
{
   < do some computation > 
   if (bConditionalComputedAboveIsTrue)
   { 
      const unsigned int ind = atomicInc(counter_ptr, (unsigned int)(-1));
      dataOut[ ind ] = resultOfAboveComputation;
   } 
}

int foo( float * d_datain, float* d_tempbuffer, float* d_output, cudaStream_t stream  ){    
   /* Initialize a counter that will be updated by the bar kernel */ 
   unsigned int * counter_ptr;
   cudaMalloc( &counter_ptr, sizeof( unsigned int) ); //< Create a Counter
   cudaMemsetAsync(counter_ptr, 0, sizeof(unsigned int), stream); //<Initially Set the Counter to 0
   dim3 threadsInit(16,16,1);
   dim3 gridInit(256, 1, 1);
   /* Launch the Filtering Kernel. This will update the value in counter_ptr*/
   bar<<< gridInit, threadsInit, 0, stream >>>( d_datain, d_tempbuffer, counter_ptr );
   /* Download the count and synchronize the stream */ 
   unsigned int count;
   cudaMemcpyAsync(&count, counter_ptr, sizeof(unsigned int), cudaMemcpyDeviceToHost, stream);
   cudaStreamSynchronize( stream ); //< Is there any way around this synchronize? 
   /* Compute the grid parameters and launch a second kernel */
   dim3 bazThreads(128,1,1);
   dim3 bazGrid( count/128 + 1, 1, 1); //< Here I use the counter modified in the prior kernel to set the grid parameters
   baz<<< bazGrid, bazThreads, 0, stream >>>( d_tempbuffer, d_output );
   /* cleanup */
   cudaFree(counter_ptr);
   return 0;
}
Terah

Instead of varying the number of blocks for the second kernel, you can use a fixed number of blocks and have each block adapt to the amount of work it has to do.

For example, launch more blocks than needed and have them exit early when there is no work left for them, or launch just enough blocks to fill the device and have each block loop over the work. Grid-stride loops are a good way to do the latter; see the sketch below.
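As a rough illustration (my own sketch, not code from the question: note that baz's signature is changed here so the kernel can read the counter itself, and the per-element computation is a placeholder), the second kernel can be launched with a fixed grid while each thread strides over however many entries the filter actually produced:

__global__ void baz( const float * dataIn, float * dataOut, const unsigned int * counter_ptr )
{
   /* Read the count written by bar. Stream ordering guarantees bar has
      finished before baz starts, so the value is final. */
   const unsigned int count = *counter_ptr;
   /* Grid-stride loop: a fixed grid covers any count. */
   for ( unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += gridDim.x * blockDim.x )
   {
      dataOut[i] = dataIn[i]; /* placeholder for the real computation */
   }
}

/* Host side: size the grid to fill the device, independent of count. */
int numSMs = 0;
cudaDeviceGetAttribute( &numSMs, cudaDevAttrMultiProcessorCount, 0 );
dim3 bazGrid( numSMs * 4, 1, 1 ); /* a few blocks per SM; heuristic, tune as needed */
baz<<< bazGrid, 128, 0, stream >>>( d_tempbuffer, d_output, counter_ptr );
/* No cudaMemcpyAsync of the count, and no cudaStreamSynchronize. */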

There is also the option of using dynamic parallelism to move the kernel launch itself (and the grid-size decision) onto the device.
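A minimal sketch of that variant (again my own illustration, assuming a device of compute capability 3.5 or newer and compilation with -rdc=true, linking cudadevrt; baz is assumed here to take the count so it can bounds-check its last block): a single-thread launcher kernel reads the counter on the device and issues the baz launch from there, so the host never needs the count at all.

__global__ void launch_baz( const float * dataIn, float * dataOut, const unsigned int * counter_ptr )
{
   const unsigned int count = *counter_ptr;
   if ( count > 0 )
   {
      /* Size the grid on the device and launch baz as a child kernel. */
      dim3 bazGrid( (count + 127) / 128, 1, 1 );
      baz<<< bazGrid, 128 >>>( dataIn, dataOut, count );
   }
}

/* Host side: this one launch replaces the memcpy, the synchronize,
   and the host-side baz launch. */
launch_baz<<< 1, 1, 0, stream >>>( d_tempbuffer, d_output, counter_ptr );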
