By kar


2011-02-18 12:33:15 8 Comments

What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced? Which is correct and why?

4 Answers

@ramino 2014-02-14 20:39:39

Memory coalescing is a technique which allows optimal usage of the global memory bandwidth: when parallel threads running the same instruction access consecutive locations in global memory, the hardware can combine them into the most favorable access pattern.

[Figure: (a) linear and (b) coalesced storage of n vectors of length m]

The example in the figure above helps explain the coalesced arrangement:

In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v_ji. Each thread in the GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks, and every thread on the GPU has a unique id which can be defined as indx = bd*bx + tx, where bd represents the block dimension, bx denotes the block index, and tx is the thread index within its block.
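In CUDA C terms, this id is the usual global thread index (a minimal sketch; the kernel name is invented here):

__global__ void per_vector_kernel(const float* v, int n, int m)
{
    // indx = bd*bx + tx, with bd = blockDim.x, bx = blockIdx.x, tx = threadIdx.x
    int indx = blockDim.x * blockIdx.x + threadIdx.x;
    if (indx >= n) return;   // one thread per m-length vector
    // ... work on the m elements of vector indx ...
}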

Vertical arrows show the case in which parallel threads access the first component of each vector, i.e. addresses 0, m, 2m, ... of the memory. As shown in Fig. (a), this memory access is not consecutive. By zeroing the gap between these addresses (the red arrows in the figure above), the memory access becomes coalesced.

However, the problem gets slightly tricky here, since the number of threads resident per GPU block is limited to bd. Therefore a coalesced data arrangement can be achieved by storing the first elements of the first bd vectors in consecutive order, followed by the first elements of the second bd vectors, and so on. The remaining vector elements are stored in a similar fashion, as shown in Fig. (b). If n (the number of vectors) is not a multiple of bd, the remaining slots in the last block need to be padded with some trivial value, e.g. 0.

In the linear data storage of Fig. (a), component i (0 ≤ i < m) of vector indx (0 ≤ indx < n) is addressed by m × indx + i. The same component in the coalesced storage pattern of Fig. (b) is addressed as

(m × bd) × ixC + bd × ixB + ixA,

where ixC = floor[(m × indx + i)/(m × bd)] = bx, ixB = i, and ixA = mod(indx, bd) = tx.

In summary, in the example of storing a number of vectors with size m, linear indexing is mapped to coalesced indexing according to:

m × indx + i  →  (m × bd) × bx + i × bd + tx

This data rearrangement can lead to significantly higher effective bandwidth of GPU global memory.
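As an illustration of this mapping (a sketch only; the function names are invented, and n is assumed to be already padded to a multiple of bd as described above):

// Host side: repack n vectors of length m from the linear layout of
// Fig. (a) into the coalesced layout of Fig. (b).
void pack_coalesced(const float* lin, float* coal, int n, int m, int bd)
{
    for (int indx = 0; indx < n; ++indx)
        for (int i = 0; i < m; ++i) {
            int bx = indx / bd;   // ixC
            int tx = indx % bd;   // ixA
            coal[(m * bd) * bx + i * bd + tx] = lin[m * indx + i];
        }
}

// Device side: for each component i, consecutive threads now read
// consecutive addresses, so every load is coalesced.
__global__ void consume_coalesced(const float* coal, int m)
{
    int base = (m * blockDim.x) * blockIdx.x + threadIdx.x;
    for (int i = 0; i < m; ++i) {
        float vi = coal[base + i * blockDim.x];  // element i of this thread's vector
        // ... compute with vi ...
    }
}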


Source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International Journal for Numerical Methods in Biomedical Engineering (2013).

@ArchaeaSoftware 2011-04-23 18:12:09

The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must be accessing the memory in sequence, and the words being accessed should be at least 32 bits wide. Additionally, the base address being accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64-, and 128-bit accesses, respectively.
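To make the criteria concrete, here is a hedged sketch (not from the guide itself): with 32-bit floats, the first kernel gives each warp a sequential, aligned access, while the stride-two read in the second does not:

// Coalesced: each warp reads 32 consecutive 32-bit words; pointers
// from cudaMalloc are aligned well beyond the 64-byte requirement.
__global__ void copy_coalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Not coalesced: threads in a warp skip every other word.
__global__ void copy_strided(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[2 * i];
}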

Tesla2 and Fermi hardware do an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.

Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)

@penmatsa 2011-02-18 18:08:31

If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request (i.e. coalesced) by the hardware. In the matrix example, matrix elements in a row are arranged linearly, followed by the next row, and so on. For example, with a 2x2 matrix and 2 threads in a block, memory locations are arranged as:

(0,0) (0,1) (1,0) (1,1)

In row access, thread1 accesses (0,0) and (1,0), which cannot be coalesced. In column access, thread1 accesses (0,0) and (0,1), which can be coalesced because they are adjacent.

@jmilloy 2011-02-18 18:16:24

Nice and concise, but remember that coalescing is not about two serial accesses by thread1, but a simultaneous access by thread1 and thread2 in parallel. In your row access example, if thread1 accesses (0,0) and (1,0), then I assume thread2 is accessing (0,1) and (1,1). Thus, the first parallel access is 1:(0,0) and 2:(0,1) --> coalesced!

@jmilloy 2011-02-18 17:20:28

It's likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.

Also, this logic can be applied to shared memory to avoid bank conflicts.
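As a quick illustration of the shared-memory analogue (a standard padding sketch, not something from this thread; the kernel body is illustrative only):

__global__ void bank_conflict_free(void)
{
    // Pad each row by one word: row k starts at offset 33*k, and since
    // 33 mod 32 == 1, a column access tile[threadIdx.x][c] spreads the
    // 32 threads of a warp across 32 different banks.
    __shared__ float tile[32][32 + 1];
    tile[threadIdx.y][threadIdx.x] = 0.0f;   // row access: conflict-free
    __syncthreads();
    float x = tile[threadIdx.x][0];          // column access: also conflict-free
    (void)x;
}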


A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is oversimplified, but the correct way to do it is to have consecutive threads access consecutive memory addresses.

So, if threads 0, 1, 2, and 3 read global memory at 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
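In kernel form, that pattern is simply (a minimal sketch; the kernel name is invented):

__global__ void read_coalesced(const float* in, float* out)
{
    // thread 0 reads byte offset 0x0, thread 1 reads 0x4, thread 2
    // reads 0x8, thread 3 reads 0xc: one coalesced transaction
    out[threadIdx.x] = in[threadIdx.x];
}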

In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below

0 1 2 3
4 5 6 7
8 9 a b

could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)

0 1 2 3 4 5 6 7 8 9 a b

Suppose you need to access each element once, and say you have four threads. Which threads will be used for which element? Probably either

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

or

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

Which is better? Which will result in coalesced reads, and which will not?

Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. In the second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
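As hedged sketches (the kernel names are invented and the per-thread sums are just a stand-in workload), the two options look like this for the 3x4 example with four threads:

// Option 1: thread t reads elements 3t, 3t+1, 3t+2.
// First simultaneous access: 0, 3, 6, 9 -- strided, not coalesced.
__global__ void option1(const float* mat, float* out)
{
    float s = 0.0f;
    for (int k = 0; k < 3; ++k)
        s += mat[3 * threadIdx.x + k];
    out[threadIdx.x] = s;
}

// Option 2: thread t reads elements t, t+4, t+8.
// First simultaneous access: 0, 1, 2, 3 -- consecutive, coalesced.
__global__ void option2(const float* mat, float* out)
{
    float s = 0.0f;
    for (int k = 0; k < 3; ++k)
        s += mat[threadIdx.x + 4 * k];
    out[threadIdx.x] = s;
}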

The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.

@tim 2011-05-19 09:04:31

Thanks for the explanation of which thread accesses which element. Currently I have the first option (thread 0: 0, 1, 2, etc.), so I'm looking for a better option now :-)

@muradin 2013-12-12 20:33:31

@jmilloy - I want to ask how to profile a kernel to see non-coalesced global loads and stores.

@jmilloy 2013-12-12 21:01:55

@muradin Can you use the Visual Profiler? developer.nvidia.com/nvidia-visual-profiler

@muradin 2013-12-12 22:03:49

@jmilloy - Since I work in a non-graphical environment, I searched and found nvprof in command-line mode, but when I tried to run it there was an error: nvprof couldn't load libcuda.so.1: no such file or directory! Do you know why?

@George 2014-09-29 12:57:36

@jmilloy: Hello, very nice example! Thanks! I wanted to ask: when you say you can run the profiler to see whether access is coalesced or not, how can you do it? For example, by running nvprof --metrics gld_efficiency? And the higher the better?

@jmilloy 2014-10-01 10:47:49

@George I was using the visual profiler. nvprof seems like a powerful tool that will work for you as well. I want to emphasize that the metrics that are important depend on your device compute capability and CUDA version, and nvprof should allow you to monitor any of them. Get your kernel working first, and then optimize with any one of the available profilers.

@George 2014-10-01 10:49:19

@jmilloy: OK, thanks, I just wanted to know if "gld_efficiency" is the right metric for this.

@filtfilt 2018-01-25 12:56:13

@jmilloy A relatively stupid question, but what is the issue if the memory access is non-coalesced (Option 1 in your example)? The threads still access the data and there are no race conditions.

@jmilloy 2018-01-26 22:34:18

@filtfilt sequential (instead of simultaneous) reads, so inefficiency.

@Tommy 2018-10-04 13:30:28

btw I think in 2018 (after compute capability 2.x) "half warp" should be changed to "warp" in this answer.
