[SOLVED] Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?

After conducting some experiments on square matrices of different sizes, a pattern came up. Invariably, transposing a matrix of size `2^n` is slower than transposing one of size `2^n+1`. For small values of `n`, the difference is not major.

Big differences occur however over a value of 512. (at least for me)

Disclaimer: I know the function doesn't actually transpose the matrix because of the double swap of elements, but it makes no difference.

Follows the code:

``````#define SAMPLES 1000
#define MATSIZE 512

#include <time.h>
#include <iostream>
int mat[MATSIZE][MATSIZE];

void transpose()
{
for ( int i = 0 ; i < MATSIZE ; i++ )
for ( int j = 0 ; j < MATSIZE ; j++ )
{
int aux = mat[i][j];
mat[i][j] = mat[j][i];
mat[j][i] = aux;
}
}

int main()
{
//initialize matrix
for ( int i = 0 ; i < MATSIZE ; i++ )
for ( int j = 0 ; j < MATSIZE ; j++ )
mat[i][j] = i+j;

int t = clock();
for ( int i = 0 ; i < SAMPLES ; i++ )
transpose();
int elapsed = clock() - t;

std::cout << "Average for a matrix of " << MATSIZE << ": " << elapsed / SAMPLES;
}
``````

Changing `MATSIZE` lets us alter the size (duh!). I posted two versions on ideone:

In my environment (MSVS 2010, full optimizations), the difference is similar :

• size 512 - average 2.19 ms
• size 513 - average 0.57 ms

Why is this happening?

@Ruslan 2017-12-29 22:34:59

As an illustration to the explanation in Luchian Grigore's answer, here's what the matrix cache presence looks like for the two cases of 64x64 and 65x65 matrices (see the link above for details on numbers).

Colors in the animations below mean the following:

• – not in cache,
• – in cache,
• – cache hit,
• – just read from RAM,
• – cache miss.

The 64x64 case:

Notice how almost every access to a new row results in a cache miss. And now how it looks for the normal case, a 65x65 matrix:

Here you can see that most of the accesses after the initial warming-up are cache hits. This is how CPU cache is intended to work in general.

The code that generated frames for the above animations can be seen here.

@Josiah Yoder 2018-09-21 00:16:14

Why are the vertical scanning cache hits not saved in the first case, but they are in the second case? It seems like a given block is accessed exactly once for most blocks in both examples.

@Josiah Yoder 2018-09-21 00:51:22

I can see from @LuchianGrigore's answer that it is because all of the lines in the column belong to the same set.

@kelalaka 2018-09-21 07:50:54

Yes, great illustration. I see that they are at the same speed. But actually, they are not, aren't they?

@Ruslan 2018-09-21 09:27:46

@kelalaka yes, animation FPS is the same. I didn't simulate slowdown, only colors are important here.

@Josiah Yoder 2018-09-21 13:09:03

It would be interesting to have two static images illustrating the different cache sets.

@ken 2020-04-04 06:26:35

@Ruslan What tool did you use to draw these GIF images?

@Ruslan 2020-04-04 07:48:14

@ken the simulation frames were rendered by my own program into PNG (via Qt's `QImage`), and then the sequence was transformed to GIF with ImageMagick's `convert` tool.

@ken 2020-04-04 12:04:00

@Ruslan Do you mind send me that code? Thanks

@Ruslan 2020-04-04 12:49:46

@ken I've added the source for everyone to see.

@Luchian Grigore 2012-07-10 13:00:21

The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the wiki entry on caching, I'm gonna narrow it down here.

A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.

For a particular memory address, we can calculate which set should mirror it with the formula:

``````set = ( address / lineSize ) % numberOfsets
``````

This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).

It's clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I'll try to somewhat follow the example from Agner:

Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address `0x2710`, which goes in set `28`. And then we also attempt to read addresses `0x2F00`, `0x3700`, `0x3F00` and `0x4700`. All of these belong to the same set. Before reading `0x4700`, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding `0x2710`. The problem lies in the fact that we read addresses that are (for this example) `0x800` apart. This is the critical stride (again, for this example).

The critical stride can also be calculated:

``````criticalStride = numberOfSets * lineSize
``````

Variables spaced `criticalStride` or a multiple apart contend for the same cache lines.

This is the theory part. Next, the explanation (also Agner, I'm following it closely to avoid making mistakes):

Assume a matrix of 64x64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit `int`).

The critical stride would be 2048 bytes, which correspond to 4 rows of the matrix (which is continuous in memory).

Assume we're processing row 28. We're attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they'll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).

When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we're no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.

Another disclaimer - I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I'm waiting for a response (or confirmation) from Mysticial. :)

@Mysticial 2012-07-10 13:39:40

Oh and next time. Just ping me directly through the Lounge. I don't find every instance of name on SO. :) I only saw this through the periodic email notifications.

@Hongxu Chen 2012-09-27 01:58:17

@Mysticial @Luchian Grigore One of my friends tells me that his `Intel core i3` pc running on `Ubuntu 11.04 i386`demonstrates almost the same performance with gcc 4.6.And so is the same for my computer `Intel Core 2 Duo` with mingw gcc4.4,who's running on `windows 7(32)`.It does show a big difference when I compile this segment with a little older pc `intel centrino` with gcc 4.6,who's running on `ubuntu 12.04 i386`.

@Peter Cordes 2016-03-18 01:52:44

Also note that memory access where the addresses differ by a a multiple of 4096 have a false dependency on Intel SnB-family CPUs. (i.e. same offset within a page). This can reduce throughput when some of the operations are stores, esp. a mix of loads and stores.

@Ruslan 2016-04-02 18:04:09

`which goes in set 24` did you mean "in set 28" instead? And do you assume 32 sets?

@Luchian Grigore 2016-04-04 15:00:31

You are correct, it's 28. :) I also double-checked the linked paper, for the original explanation you can navigate to 9.2 Cache organization

@Josiah Yoder 2018-09-21 00:50:23

I feel like something is missing. What is the numberOfSets for the first example? It seems to be 0x80 = 128 from the way the numbers work out, but I don't know why it should be.

@Voo 2012-07-10 13:26:41

Luchian gives an explanation of why this behavior happens, but I thought it'd be a nice idea to show one possible solution to this problem and at the same time show a bit about cache oblivious algorithms.

``````for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
A[j][i] = A[i][j];
``````

which is just horrible for a modern CPU. One solution is to know the details about your cache system and tweak the algorithm to avoid those problems. Works great as long as you know those details.. not especially portable.

Can we do better than that? Yes we can: A general approach to this problem are cache oblivious algorithms that as the name says avoids being dependent on specific cache sizes [1]

The solution would look like this:

``````void recursiveTranspose(int i0, int i1, int j0, int j1) {
int di = i1 - i0, dj = j1 - j0;
const int LEAFSIZE = 32; // well ok caching still affects this one here
if (di >= dj && di > LEAFSIZE) {
int im = (i0 + i1) / 2;
recursiveTranspose(i0, im, j0, j1);
recursiveTranspose(im, i1, j0, j1);
} else if (dj > LEAFSIZE) {
int jm = (j0 + j1) / 2;
recursiveTranspose(i0, i1, j0, jm);
recursiveTranspose(i0, i1, jm, j1);
} else {
for (int i = i0; i < i1; i++ )
for (int j = j0; j < j1; j++ )
mat[j][i] = mat[i][j];
}
}
``````

Slightly more complex, but a short test shows something quite interesting on my ancient e8400 with VS2010 x64 release, testcode for `MATSIZE 8192`

``````int main() {
LARGE_INTEGER start, end, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);
recursiveTranspose(0, MATSIZE, 0, MATSIZE);
QueryPerformanceCounter(&end);

QueryPerformanceCounter(&start);
transpose();
QueryPerformanceCounter(&end);
return 0;
}

results:
recursive: 480.58ms
iterative: 3678.46ms
``````

Edit: About the influence of size: It is much less pronounced although still noticeable to some degree, that's because we're using the iterative solution as a leaf node instead of recursing down to 1 (the usual optimization for recursive algorithms). If we set LEAFSIZE = 1, the cache has no influence for me [`8193: 1214.06; 8192: 1171.62ms, 8191: 1351.07ms` - that's inside the margin of error, the fluctuations are in the 100ms area; this "benchmark" isn't something that I'd be too comfortable with if we wanted completely accurate values])

[1] Sources for this stuff: Well if you can't get a lecture from someone that worked with Leiserson and co on this.. I assume their papers a good starting point. Those algorithms are still quite rarely described - CLR has a single footnote about them. Still it's a great way to surprise people.

Edit (note: I'm not the one who posted this answer; I just wanted to add this):
Here's a complete C++ version of the above code:

``````template<class InIt, class OutIt>
void transpose(InIt const input, OutIt const output,
size_t const rows, size_t const columns,
size_t const r1 = 0, size_t const c1 = 0,
size_t r2 = ~(size_t) 0, size_t c2 = ~(size_t) 0,
size_t const leaf = 0x20)
{
if (!~c2) { c2 = columns - c1; }
if (!~r2) { r2 = rows - r1; }
size_t const di = r2 - r1, dj = c2 - c1;
if (di >= dj && di > leaf)
{
transpose(input, output, rows, columns, r1, c1, (r1 + r2) / 2, c2);
transpose(input, output, rows, columns, (r1 + r2) / 2, c1, r2, c2);
}
else if (dj > leaf)
{
transpose(input, output, rows, columns, r1, c1, r2, (c1 + c2) / 2);
transpose(input, output, rows, columns, r1, (c1 + c2) / 2, r2, c2);
}
else
{
for (ptrdiff_t i1 = (ptrdiff_t) r1, i2 = (ptrdiff_t) (i1 * columns);
i1 < (ptrdiff_t) r2; ++i1, i2 += (ptrdiff_t) columns)
{
for (ptrdiff_t j1 = (ptrdiff_t) c1, j2 = (ptrdiff_t) (j1 * rows);
j1 < (ptrdiff_t) c2; ++j1, j2 += (ptrdiff_t) rows)
{
output[j2 + i1] = input[i2 + j1];
}
}
}
}
``````

@Luchian Grigore 2012-07-10 13:28:45

This would be relevant if you compared the times between matrices of different sizes, not recursive and iterative. Try the recursive solution on a matrix of the sizes specified.

@Voo 2012-07-10 13:32:52

@Luchian Since you already explained why he's seeing the behavior I thought it quite interesting to introduce one solution to this problem in general.

@Luchian Grigore 2012-07-10 13:34:33

Because, I'm questioning why a larger matrix takes a shorter time to process, not looking for a speedier algorithm...

@Voo 2012-07-10 13:36:23

@Luchian The differences between 16383 and 16384 are.. 28 vs 27ms for me here, or a about 3.5% - not really significant. And I'd be surprised if it were.

@Voo 2012-07-10 13:47:55

@Luchian Why wouldn't 16383 count? 16384 is a multiple of 2 and 512, 16383 should mess the critical stride up, shouldn't it? With the iterative version I get a much lower result for 511 vs. 512 as well and that's way more than the bit less work it has to do.

@Luchian Grigore 2012-07-10 13:51:56

Ah, right about that. What about if you run 16383 and 16384 with the iterative solution? Is the difference much greater?

@Voo 2012-07-10 14:00:50

@Luchian 4541.32ms vs. 15010.51ms as expected. I mean clearly using the `LEAFSIZE` constant means my algorithm isn't completely cache oblivious (but obviously faster than recursing down to 1) so I'd expect some small effect as well. Increasing the leafsize should make that effect more dramatic though, since in the end we'd implement the iterative solution. I just thought an algorithm that would be not/much less influenced by caching would be a good fit for this question as an addendum to your answer.

@Matthieu M. 2012-07-10 14:52:46

It might be interesting to explain what the `recursiveTranspose` does, ie that it does not fill up the cache as much by operating on small tiles (of `LEAFSIZE x LEAFSIZE` dimension).

@Luchian Grigore 2012-07-10 15:02:34

Also, what happens if `LEAFSIZE` is a critical stride.