By Johannes Gerer

2011-12-17 20:40:52 8 Comments

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}

for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}

Compiled on MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on an Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: why is one loop so much slower than two loops? (Please refer to my rephrased question at the bottom.)

PS: I am not sure if this helps:

Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

Each loop of the double loop example produces this code (the following block is repeated about three times):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

PPS: Here is the full code. It uses TBB Tick_Count for higher resolution timing, which can be disabled by not defining the TBB_TIMING Macro:

#include <iostream>
#include <iomanip>
#include <cmath>
#include <string>
#include <cstdio>
#include <algorithm>

//#define TBB_TIMING

#ifdef TBB_TIMING
#include <tbb/tick_count.h>
using tbb::tick_count;
#else
#include <time.h>
#endif

using namespace std;

//#define preallocate_memory new_cont

enum { new_cont, new_sep };

double *a1, *b1, *c1, *d1;

void allo(int cont, int n)
{
    switch (cont) {
      case new_cont:
        a1 = new double[n*4];
        b1 = a1 + n;
        c1 = b1 + n;
        d1 = c1 + n;
        break;
      case new_sep:
        a1 = new double[n];
        b1 = new double[n];
        c1 = new double[n];
        d1 = new double[n];
        break;
    }

    for (int i = 0; i < n; i++) {
        a1[i] = 1.0;
        d1[i] = 1.0;
        c1[i] = 1.0;
        b1[i] = 1.0;
    }
}

void ff(int cont)
{
    switch (cont) {
      case new_sep:
        delete[] b1;
        delete[] c1;
        delete[] d1;
      case new_cont:
        delete[] a1;
    }
}

double plain(int n, int m, int cont, int loops)
{
#ifndef preallocate_memory
    allo(cont, n);
#endif

#ifdef TBB_TIMING
    tick_count t0 = tick_count::now();
#else
    clock_t start = clock();
#endif

    if (loops == 1) {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
                c1[j] += d1[j];
            }
        }
    } else {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                a1[j] += b1[j];
            }
            for (int j = 0; j < n; j++) {
                c1[j] += d1[j];
            }
        }
    }

    double ret;

#ifdef TBB_TIMING
    tick_count t1 = tick_count::now();
    ret = 2.0*double(n)*double(m)/(t1-t0).seconds();
#else
    clock_t end = clock();
    ret = 2.0*double(n)*double(m)/(double)(end - start) *double(CLOCKS_PER_SEC);
#endif

#ifndef preallocate_memory
    ff(cont);
#endif

    return ret;
}

int main()
{
    freopen("C:\\test.csv", "w", stdout);

    const char *s = " ";

    string na[2] = {"new_cont", "new_sep"};

    cout << "n";

    for (int j = 0; j < 2; j++)
        for (int i = 1; i <= 2; i++)
#ifdef preallocate_memory
            cout << s << i << "_loops_" << na[preallocate_memory];
#else
            cout << s << i << "_loops_" << na[j];
#endif

    cout << endl;

    long long nmax = 1000000;

#ifdef preallocate_memory
    allo(preallocate_memory, nmax);
#endif

    for (long long n = 1L; n < nmax; n = max(n+1, (long long)(n*1.2)))
    {
        const long long m = 10000000/n;
        cout << n;

        for (int j = 0; j < 2; j++)
            for (int i = 1; i <= 2; i++)
                cout << s << plain(n, m, j, i);
        cout << endl;
    }

    return 0;
}

(It shows FLOP/s for different values of n.)

[Graph: FLOP/s for different values of n, showing five distinct regions]


@Francis Cugler 2017-01-30 14:00:48

The Original Question

Why is one loop so much slower than two loops?


Case 1 is a classic interleaving problem, and an inefficient one at that. I also think that this was one of the leading reasons why many machine architectures and developers ended up building and designing multi-core systems with the ability to run multi-threaded applications as well as parallel programming.

Looking at it from this kind of approach, without involving how the Hardware, OS, and Compiler(s) work together to do heap allocations that involve working with RAM, Cache, Page Files, etc., the mathematics at the foundation of these algorithms shows us which of these two is the better solution.

We can use an analogy of a Boss being a Summation that represents a For Loop that has to travel between workers A & B.

We can easily see that Case 2 is roughly twice as fast as Case 1, if not a little more, due to the difference in the distance that needs to be traveled and the time taken between the workers. This math lines up almost perfectly with both the benchmark times as well as the difference in the number of assembly instructions.

I will now begin to explain how all of this works below.

Assessing The Problem

The OP's code:

const int n = 100000;

for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}


for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}

The Consideration

Considering the OP's original question about the two variants of the for loops, and his amended question about the behavior of caches, along with many of the other excellent answers and useful comments, I'd like to try to do something different here by taking a different approach to this situation and problem.

The Approach

Considering the two loops and all of the discussion about cache and page filing, I'd like to take another approach and look at this from a different perspective. One that doesn't involve the cache and page files, nor the executions to allocate memory; in fact, this approach doesn't even concern the actual hardware or the software at all.

The Perspective

After looking at the code for a while it became quite apparent what the problem is and what is generating it. Let's break this down into an algorithmic problem and look at it from the perspective of using mathematical notations then apply an analogy to the math problems as well as to the algorithms.

What We Do Know

What we do know is that this loop will run 100,000 times. We also know that a1, b1, c1 & d1 are pointers on a 64-bit architecture. Within C++ on a 32-bit machine, all pointers are 4 bytes, and on a 64-bit machine they are 8 bytes, since pointers are of a fixed length.

We know that we have 32 bytes to work with in both cases on each iteration. The only difference is that in the first case all 32 bytes (two pairs of 8-byte doubles) are touched on each iteration, whereas in the second case each of the two independent loops touches 16 bytes per iteration.

Both loops still equal 32 bytes in total allocations. With this information let's now go ahead and show the general math, algorithms, and analogy of these concepts.

We do know the number of times that the same set or group of operations will have to be performed in both cases. We do know the amount of memory that needs to be allocated in both cases. We can assess that the overall workload of the allocations in both cases will be approximately the same.

What We Don't Know

We do not know how long each case will take unless we set a counter and run a benchmark test. However, benchmarks were already included in the original question and in some of the answers and comments as well; we can see a significant difference between the two, and that is the whole reasoning for this proposal.

Let's Investigate

It is already apparent that many have done this by looking at the heap allocations, benchmark tests, RAM, Cache, and Page Files. Looking at specific data points and specific iteration indices was also included, and the various conversations about this specific problem have many people starting to question other related things about it. So how do we begin to look at this problem using mathematical algorithms and applying an analogy to it? We start off by making a couple of assertions! Then we build out our algorithm from there.

Our Assertions:

  • We will let our loop and its iterations be a Summation that starts at 1 and ends at 100000, instead of starting with 0 as in the loops, since we don't need to worry about the 0-indexing scheme of memory addressing; we are just interested in the algorithm itself.
  • In both cases we have 4 functions to work with and 2 function calls with 2 operations being done on each function call. We will set these up as functions and calls to functions as the following: F1(), F2(), f(a), f(b), f(c) and f(d).

The Algorithms:

1st Case: - Only one summation but two independent function calls.

Sum n=1 : [1,100000] = F1(), F2();
                       F1() = { f(a) = f(a) + f(b); }
                       F2() = { f(c) = f(c) + f(d); }

2nd Case: - Two summations but each has its own function call.

Sum1 n=1 : [1,100000] = F1();
                        F1() = { f(a) = f(a) + f(b); }

Sum2 n=1 : [1,100000] = F1();
                        F1() = { f(c) = f(c) + f(d); }

Notice that F2() only exists in the Sum from Case 1, while F1() is contained in the Sum from Case 1 and in both Sum1 and Sum2 from Case 2. This will be evident later when we conclude that there is an optimization happening within the second algorithm.

In the first case, on each of the 100,000 iterations, Sum calls f(a), which adds f(b) to itself, and then calls f(c), which does the same but adds f(d) to itself. In the second case, we have Sum1 and Sum2, which both act the same as if they were the same function being called twice in a row.

In this case we can treat Sum1 and Sum2 as just plain old Sum where Sum in this case looks like this: Sum n=1 : [1,100000] { f(a) = f(a) + f(b); } and now this looks like an optimization where we can just consider it to be the same function.

Summary with Analogy

With what we have seen in the second case, it almost appears as if there is an optimization, since both for loops have the same exact signature, but that isn't the real issue. The issue isn't the work being done by f(a), f(b), f(c), and f(d). In the comparison between the two cases, it is the difference in the distance that the Summation has to travel in each case that gives you the difference in execution time.

Think of the For Loops as being the Summations that do the iterations, as being a Boss giving orders to two people, A & B, whose jobs are to meet C & D respectively, pick up a package from them, and return it. In this analogy, the for-loop or summation iterations and condition checks themselves don't actually represent the Boss. What actually represents the Boss here is not the mathematical algorithms directly but the concept of Scope and Code Block within a routine or subroutine, method, function, translation unit, etc. The first algorithm has 1 scope where the 2nd algorithm has 2 consecutive scopes.

Within the first case on each call slip, the Boss goes to A and gives the order and A goes off to fetch B's package then the Boss goes to C and gives the orders to do the same and receive the package from D on each iteration.

Within the second case, the Boss works directly with A to go and fetch B's package until all packages are received. Then the Boss works with C to do the same for getting all of D's packages.

Since we are working with 8-byte pointers and dealing with heap allocation, let's consider the following problem. Let's say the Boss is 100 feet from A and that A is 500 feet from C. We don't need to worry about how far the Boss is initially from C because of the order of execution. In both cases, the Boss initially travels to A first and then to C. This analogy isn't to say that these distances are exact; it is just a useful test-case scenario to show the workings of the algorithms.

In many cases when doing heap allocations and working with the cache and page files, these distances between address locations may not vary that much or they can vary significantly depending on the nature of the data types and the array sizes.

The Test Cases:

First Case: On the first iteration the Boss has to go 100 feet to give the order slip to A, and A goes off and does his thing; but then the Boss has to travel 500 feet to C to give him his order slip. Then on the next iteration, and every iteration after, the Boss has to go back and forth between the two, 500 feet each way.

Second Case: The Boss has to travel 100 feet on the first iteration to A, but after that, he is already there and just waits for A to get back until all slips are filled. Then the Boss has to travel 500 feet on the first iteration to C because C is 500 feet from A. Since this Boss( Summation, For Loop ) is being called right after working with A he then just waits there as he did with A until all of C's order slips are done.

The Difference In Distances Traveled

const n = 100000
distTraveledOfFirst = (100 + 500) + ((n-1)*(500 + 500));
// Simplify
distTraveledOfFirst = 600 + (99999 * 1000);
distTraveledOfFirst = 600 + 99999000;
distTraveledOfFirst = 99999600;
// Distance Traveled On First Algorithm = 99,999,600 ft (roughly 100 million feet)

distTraveledOfSecond = 100 + 500 = 600;
// Distance Traveled On Second Algorithm = 600 ft

The Comparison of Arbitrary Values

We can easily see that 600 is far less than roughly 100 million. Now, this isn't exact, because we don't know the actual difference in distance between which address of RAM, or which Cache or Page File, each call on each iteration will hit, due to many other unseen variables. This is just an assessment of the situation to be aware of, looking at it from the worst-case scenario.

From these numbers it would almost appear as if Algorithm One should be 99% slower than Algorithm Two; however, this is only the Boss's part or responsibility of the algorithms, and it doesn't account for the actual workers A, B, C, & D and what they have to do on each and every iteration of the loop. So the Boss's job only accounts for about 15 - 40% of the total work being done. The bulk of the work, which is done by the workers, has a slightly bigger impact toward keeping the ratio of the speed-rate differences to about 50 - 70%.

The Observation: - The differences between the two algorithms

In this situation, it is the structure of the process of the work being done. It goes to show that Case 2 is more efficient, both from the partial optimization of having a similar function declaration and definition (where only the variables differ by name) and from the shorter distance traveled.

We also see that the total distance traveled in Case 1 is much farther than in Case 2, and we can consider this distance traveled our Time Factor between the two algorithms. Case 1 has considerably more work to do than Case 2 does.

This is observable from the evidence of the ASM instructions that were shown in both cases. Along with what was already stated about these cases, this doesn't account for the fact that in Case 1 the boss will have to wait for both A & C to get back before he can go back to A again for each iteration. It also doesn't account for the fact that if A or B is taking an extremely long time then both the Boss and the other worker(s) are idle waiting to be executed.

In Case 2 the only one being idle is the Boss until the worker gets back. So even this has an impact on the algorithm.

The OP's Amended Question(s)

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

Regarding These Questions

As I have demonstrated without a doubt, there is an underlying issue even before the Hardware and Software become involved.

Now as for the management of memory and caching along with page files, etc. which all work together in an integrated set of systems between the following:

  • The Architecture { Hardware, Firmware, some Embedded Drivers, Kernels and ASM Instruction Sets }.
  • The OS{ File and Memory Management systems, Drivers and the Registry }.
  • The Compiler { Translation Units and Optimizations of the Source Code }.
  • And even the Source Code itself with its set(s) of distinctive algorithms.

We can already see that there is a bottleneck within the first algorithm before we even apply it to any machine with any arbitrary architecture, OS, or programming language, compared to the second algorithm. A problem already existed before involving the intrinsics of a modern computer.

The Ending Results

However, that is not to say that these new questions are not of importance, because they are, and they do play a role after all. They impact the procedures and the overall performance, and that is evident with the various graphs and assessments from the many who have given their answers and/or comments.

If you paid attention to the analogy of the Boss and the two workers A & B who had to retrieve packages from C & D respectively, and considering the mathematical notations of the two algorithms in question, you can see that, even without the involvement of the computer hardware and software, Case 2 is approximately 60% faster than Case 1.

When you look at the graphs and charts after these algorithms have been applied to some source code, compiled, optimized, and executed through the OS to perform their operations on a given piece of hardware, you can even see a little more degradation between the differences in these algorithms.

If the Data set is fairly small it may not seem all that bad of a difference at first. However, since Case 1 is about 60 - 70% slower than Case 2 we can look at the growth of this function in terms of the differences in time executions:

DeltaTimeDifference approximately = Loop1(time) - Loop2(time)
Loop1(time) = Loop2(time) + (Loop2(time)*[0.6,0.7]) // approximately
// So when we substitute this back into the difference equation we end up with 
DeltaTimeDifference approximately = (Loop2(time) + (Loop2(time)*[0.6,0.7])) - Loop2(time)
// And finally we can simplify this to
DeltaTimeDifference approximately = [0.6,0.7]*Loop2(time)

This approximation is the average difference between these two loops both algorithmically and machine operations involving software optimizations and machine instructions.

When the data set grows linearly, so does the difference in time between the two. Algorithm 1 has more fetches than Algorithm 2, which is evident when the Boss has to travel back and forth the maximum distance between A & C on every iteration after the first, while in Algorithm 2 the Boss travels to A once and then, after being done with A, travels the maximum distance only one time, when going from A to C.

Trying to have the Boss focus on doing two similar things at once and juggling them back and forth, instead of focusing on similar consecutive tasks, is going to make him quite angry by the end of the day, since he had to travel and work twice as much. Therefore do not lose the scope of the situation by letting your Boss get into an interpolated bottleneck, because the Boss's spouse and children wouldn't appreciate it.

Amendment: Software Engineering Design Principles

-- The difference between Local Stack and Heap Allocated computations within iterative for loops and the difference between their usages, their efficiencies, and effectiveness --

The mathematical algorithm that I proposed above mainly applies to loops that perform operations on data that is allocated on the heap.

  • Consecutive Stack Operations:
    • If the loops perform operations on data locally within a single code block or scope within the stack frame, it still sort of applies, but the memory locations are much closer, typically sequential, and the difference in distance traveled or execution time is almost negligible. Since no allocations are being done on the heap, the memory isn't scattered, and it isn't being fetched through RAM. The memory is typically sequential and relative to the stack frame and stack pointer.
    • When consecutive operations are being done on the stack, a modern Processor will cache repetitive values and addresses keeping these values within local cache registers. The time of operations or instructions here is on the order of nano-seconds.
  • Consecutive Heap Allocated Operations:
    • When you begin to apply heap allocations and the processor has to fetch the memory addresses on consecutive calls, depending on the architecture of the CPU, the Bus Controller, and the Ram modules the time of operations or execution can be on the order of micro to milliseconds. In comparison to cached stack operations, these are quite slow.
    • The CPU will have to fetch the memory address from Ram and typically anything across the system bus is slow compared to the internal data paths or data buses within the CPU itself.

So when you are working with data that needs to be on the heap and you are traversing through them in loops, it is more efficient to keep each data set and its corresponding algorithms within its own single loop. You will get better optimizations compared to trying to factor out consecutive loops by putting multiple operations of different data sets that are on the heap into a single loop.

It is okay to do this with data that is on the stack since they are frequently cached, but not for data that has to have its memory address queried every iteration.

This is where Software Engineering and Software Architecture Design comes into play. It is the ability to know how to organize your data, knowing when to cache your data, knowing when to allocate your data on the heap, knowing how to design and implement your algorithms, and knowing when and where to call them.

You might have the same algorithm that pertains to the same data set, but you might want one implementation design for its stack variant and another for its heap-allocated variant, because of the above issue that shows up in the algorithm's O(n) complexity when working with the heap.

From what I've noticed over the years many people do not take this fact into consideration. They will tend to design one algorithm that works on a particular data set and they will use it regardless of the data set being locally cached on the stack or if it was allocated on the heap.

If you want true optimization, yes it might seem like code duplication, but to generalize it would be more efficient to have two variants of the same algorithm. One for stack operations, and the other for heap operations that are performed in iterative loops!

Here's a pseudo example: Two simple structs, one algorithm.

struct A {
    int data;
    A() : data{0}{}
    A(int a) : data{a}{}
};

struct B {
    int data;
    B() : data{0}{}
    B(int b) : data{b}{}
};

template<typename T>
void Foo( T& t ) {
    // do something with t
}

// some looping operation: first stack then heap.

// stack data:
A dataSetA[10] = {};
B dataSetB[10] = {};

// For stack operations this is okay and efficient
for (int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]);
    Foo(dataSetB[i]);
}

// If the above two were on the heap then performing
// the same algorithm to both within the same loop
// will create that bottleneck
A* dataSetA = new A[10];
B* dataSetB = new B[10];
for ( int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]); // dataSetA is on the heap here
    Foo(dataSetB[i]); // dataSetB is on the heap here
} // this will be inefficient.

// To improve the efficiency above, put them into separate loops...

for (int i = 0; i < 10; i++ ) {
    Foo(dataSetA[i]);
}
for (int i = 0; i < 10; i++ ) {
    Foo(dataSetB[i]);
}
// This will be much more efficient than above.
// The code isn't perfect syntax, it's only pseudo code
// to illustrate a point.

This is what I was referring to by having separate implementations for stack variants versus heap variants. The algorithms themselves don't matter too much, it's the looping structures that you will use them in that do.

@Francis Cugler 2017-03-22 04:20:11

It has been a while since I posted this answer, but I also wanted to add a quick comment that may also help to understand this: In my analogy with the Boss as the for loop or the summations or iterations through a loop, we could also consider this boss to be the combination between the Stack Frame & Stack Pointer that manages the scope and stack variables and memory addressing of the for loops.

@Francis Cugler 2018-11-23 23:03:33

@PeterMortensen I have taken your advice into consideration by slightly modifying my original answer. I believe this is what you were suggesting.

@Mysticial 2011-12-17 21:17:37

Upon further analysis of this, I believe this is (at least partially) caused by the data alignment of the four-pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.

EDIT: It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.

Here's the test code:

#include <iostream>
#include <cstdlib>
#include <cstring>
#include <ctime>
using namespace std;

//#define ALLOCATE_SEPARATE
#define ONE_LOOP

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPARATE
    //  Four separate allocations - likely to share page alignment.
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    //  One contiguous allocation - breaks that alignment.
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#ifdef ONE_LOOP
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        }
        for(int j=0;j<n;j++){
            c1[j] += d1[j];
        }
#endif
    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    return 0;
}

Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ALLOCATE_SEPARATE
#define ONE_LOOP
seconds = 6.206

#define ALLOCATE_SEPARATE
//#define ONE_LOOP
seconds = 2.116

//#define ALLOCATE_SEPARATE
#define ONE_LOOP
seconds = 1.894

//#define ALLOCATE_SEPARATE
//#define ONE_LOOP
seconds = 1.993


  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.

As @Stephen Canon points out in the comments, there is a very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls.

5 Regions - Explanations

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data size increases, the amount of relative overhead goes down and the performance "saturates". Here the two-loop version is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still have an effect here, as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)

Region 3:

At this point, the data no longer fits in the L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

This is where we observe the performance drop of the single loop. And as mentioned, this is due to the alignment, which (most likely) causes false-aliasing stalls in the processor's load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in the cache. So you're bound by memory bandwidth.

[Graphs: 2 x Intel X5482 Harpertown @ 3.2 GHz; Intel Core i7 870 @ 2.8 GHz; Intel Core i7 2600K @ 4.4 GHz]

@Oliver Charlesworth 2011-12-17 21:20:28

+1: I think this is the answer. Contrary to what all the other answers say, it's not about the single loop variant inherently having more cache misses, it's about the particular alignment of the arrays causing the cache misses.

@Stephen Canon 2011-12-18 01:04:53

This; a false aliasing stall is the most likely explanation.

@Mysticial 2011-12-18 01:09:46

@StephenCanon I remember you mentioning that on this question. Yes, I agree, that is definitely a possibility as a result of the alignment.

@Mysticial 2011-12-18 01:25:57

@StephenCanon Just making sure: false aliasing would be this right? I'm about to make a huge edit to my answer with results on an actual Core 2 machine.

@Johannes Gerer 2011-12-18 01:56:54

It would be nice to see a comparable graph from these machines. Here is the code.

@Mysticial 2011-12-18 02:26:28

Edited with the graphs as well as (my) explanations for each of the 5 regions.

@Johannes Gerer 2011-12-18 02:45:08

But in Region 2 the two-loop version is faster!

@Mysticial 2011-12-18 02:47:14

Oh you're right... Then to be honest, I'm not sure. It could still have something to do with alignment in the L1 cache.

@Mysticial 2011-12-18 02:52:58

It could be due to cache bank conflicts. Agner Fog talks about that here for the Sandy Bridge architecture, but I'm sure it's applicable to Core 2 as well.

@Johannes Gerer 2011-12-18 03:04:08

It would be nice to have higher-resolution timing (e.g. by using the provided TBB). But that's it then.

@Mysticial 2011-12-18 03:39:11

Updated. I don't have TBB, but I used the Windows performance counters along with 10x the # of iterations.

@greatwolf 2011-12-18 03:44:45

@Mysticial slightly off-topic but what did you use to generate those pretty graphs? Is it from VS profiler?

@Mysticial 2011-12-18 03:45:34

@VictorT. I used the code the OP linked to. It generates a .csv file which I can open in Excel and make a graph from.

@Nawaz 2011-12-18 05:58:47

For me, all four cases produce almost the same result. Information: GCC 4.6.1 (MinGW), -O4 optimization flag; Intel(R) Core(TM) i3 M350 @ 2.27 GHz; 64-bit OS (Windows Basic).

@Mysticial 2011-12-18 06:02:00

@Nawaz What machine? Compiler? OS? I'm not sure if it matters. In this case, both the OP and I used Visual Studio 2010 on a Core 2 machine. On the Core i3/5/7, the difference is closer to 20%. I originally had benchmarks of these in my answer, but I took them out because they were sheer noise and made the answer longer than it needed to be.

@Nawaz 2011-12-18 06:02:23

@Mysticial: I just updated my comment with all those information.

@Nawaz 2011-12-18 06:05:44

@Mysticial: By the way, could you please explain this statement of yours : You'll notice that they all have the same alignment relative to the page. How would I notice that? I don't know much about alignment and page. So please explain this part a bit more.

@Mysticial 2011-12-18 06:07:34

@Nawaz Ah, yes, you have a Nehalem machine. The difference is a lot smaller on those. It's also possible that GCC is doing something differently. Stephen Canon pointed out false aliasing as the most likely culprit. If that's the case, I'd expect it to be very sensitive to the order of the instructions that the compiler generates.

@Nawaz 2011-12-18 06:09:44

@Mysticial: I don't understand what false aliasing means. Could you explain this? Or give me a link (if you've any) so that I will read it myself.

@Mysticial 2011-12-18 06:10:39

@Nawaz A page is typically 4KB. If you look at the hexadecimal addresses that I print out, the separately allocated tests all have the same address modulo 4096 (that is, 32 bytes from the start of a 4KB boundary). Perhaps GCC doesn't have this behavior. That could explain why you aren't seeing the differences.

@Mysticial 2011-12-18 06:11:32

@Nawaz About the false aliasing. Somebody in chat asked me about this. See the transcript here: (granted, it's interleaved with another conversation...)

@Nawaz 2011-12-18 06:13:51

@Mysticial: Thanks. Now I understand that page alignment thingy. But is that an issue? How does it matter?

@Nawaz 2011-12-18 06:19:20

@Mysticial: I couldn't really understand that, as the chat is too short and I don't understand a lot of the terminology there. Could you refer me to something (books or articles) so I can read about it myself in detail?

@Mysticial 2011-12-18 06:22:39

@Nawaz At a larger scale (with more than 4 pointers), this alignment will cause conflict cache misses. But for reasons that I mentioned in my answer, this may not be the only culprit. False aliasing due to partial address aliasing can occur if the addresses are very similar. (which will be the case if they are aligned) Anyways, it's hard to get into further detail, without deep knowledge of architecture. The key terms to search for are "out-of-order execution", "load/store unit", and "store buffer". EDIT: We can continue this in chat if you have more questions.

@Oliver Charlesworth 2011-12-18 11:17:15

You say e.g. "nothing fits in cache". What is the relevance of cache size here? Each element is modified only once; I don't see any advantage in being able to store large amounts of the working set in cache.

@Mysticial 2011-12-18 11:19:10

Well, the outer-loop (the test loop) is iterating many times. So if the entire data-set doesn't fit in cache, it will be repeatedly flushed to and from memory.

@Oliver Charlesworth 2011-12-18 11:28:25

Ah, right. I hadn't noticed you were running the entire test multiple times. Then this makes perfect sense.

@Johannes Gerer 2011-12-18 18:18:16

Sorry for the confusion. It was the second line of my question: "This loop is executed 10,000 times via another outer for loop."

@nanofarad 2014-04-15 23:16:38

Sorry to poke such an old post. Is your Harpertown Xeon a Netburst-microarchitecture CPU? Or is it also Core?

@Mysticial 2014-04-15 23:18:19

@hexafraction It's a 45nm Core 2 Penryn.

@Chris Cirefice 2014-08-27 08:53:57

@Mysticial Out of curiosity... n00b programmers like me aren't supposed to know about this kind of stuff yet... right? I mean, I get it after reading the question and answer, but I never would have been able to figure out the nature of the problem on my own, let alone how to solve it...

@Mysticial 2014-08-27 09:07:52

@ChrisCirefice Yep. This topic is highly specialized to HPC. So the majority of programmers wouldn't know it unless they were in that field or had stumbled upon it randomly either via SO or some other channel.

@Francis Cugler 2017-10-05 01:40:34

This is an excellent answer; however, even before any software or hardware is involved, there is already a difference between the two for loops. It is a mathematical/algorithmic problem, which I like to call an inefficient interpolated bottleneck problem. You can refer to the answer that I have already given.

@mathengineer 2018-07-11 07:00:53

It may be the old C++ compiler and optimizations. On my computer I obtained almost the same speed:

One loop: 1.577 ms

Two loops: 1.507 ms

I ran Visual Studio 2015 on an E5-1620 3.5 GHz processor with 16 GB RAM.

@OldCurmudgeon 2011-12-18 01:36:31

Imagine you are working on a machine where n was just the right value for it to be possible to hold only two of your arrays in memory at one time, but the total memory available, via disk caching, was still sufficient to hold all four.

Assuming a simple LIFO caching policy, this code:

for(int j=0;j<n;j++){
    a[j] += b[j];
}
for(int j=0;j<n;j++){
    c[j] += d[j];
}

would first cause a and b to be loaded into RAM and then be worked on entirely in RAM. When the second loop starts, c and d would then be loaded from disk into RAM and operated on.

The other loop:

for(int j=0;j<n;j++){
    a[j] += b[j];
    c[j] += d[j];
}

will page out two arrays and page in the other two every time around the loop. This would obviously be much slower.

You are probably not seeing disk caching in your tests but you are probably seeing the side effects of some other form of caching.

There seems to be a little confusion/misunderstanding here so I will try to elaborate a little using an example.

Say n = 2 and we are working with bytes. In my scenario we thus have just 4 bytes of RAM and the rest of our memory is significantly slower (say 100 times longer access).

Assuming a fairly dumb caching policy of "if the byte is not in the cache, put it there and get the following byte too while we are at it", you will get a scenario something like this:

  • With

    for(int j=0;j<n;j++){
     a[j] += b[j];
    }
    for(int j=0;j<n;j++){
     c[j] += d[j];
    }
  • cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.

  • set a[1] = a[1] + b[1] in cache. Cost = 1 + 1.
  • Repeat for c and d.
  • Total cost = (100 + 100 + 1 + 1) * 2 = 404

  • With

    for(int j=0;j<n;j++){
     a[j] += b[j];
     c[j] += d[j];
    }
  • cache a[0] and a[1] then b[0] and b[1] and set a[0] = a[0] + b[0] in cache - there are now four bytes in cache, a[0], a[1] and b[0], b[1]. Cost = 100 + 100.

  • eject a[0], a[1], b[0], b[1] from cache and cache c[0] and c[1] then d[0] and d[1] and set c[0] = c[0] + d[0] in cache. Cost = 100 + 100.
  • I suspect you are beginning to see where I am going.
  • Total cost = (100 + 100 + 100 + 100) * 2 = 800

This is a classic cache thrash scenario.

@Brooks Moses 2011-12-18 21:38:39

This is incorrect. A reference to a particular element of an array does not cause the entire array to be paged in from disk (or from non-cached memory); only the relevant page or cache line is paged in.

@OldCurmudgeon 2011-12-19 21:34:16

@Brooks Moses - If you walk through the whole array, as is happening here, then it will.

@Brooks Moses 2011-12-20 02:05:48

Well, yes, but that's what happens over the whole operation, not what happens each time around the loop. You claimed that the second form "will page out two arrays and page in the other two every time around the loop," and that's what I'm objecting to. Regardless of the size of the overall arrays, in the middle of this loop your RAM will be holding a page from each of the four arrays, and nothing will get paged out until well after the loop has finished with it.

@OldCurmudgeon 2011-12-20 10:35:39

In the particular case where n was just the right value for it only to be possible to hold two of your arrays in memory at one time then accessing all elements of four arrays in one loop must surely end up thrashing.

@OldCurmudgeon 2011-12-20 10:42:53

Say you have just ten bytes of RAM and these arrays are 5 bytes long. Loop 1 would terminate leaving all ten bytes of a1 and b1 in RAM after just one or two page faults (assuming the loop counter is held in a register). Loop 2, however, will page in a1 and b1 for the first assignment, then c1 and d1 for the second, and this process must repeat n times. The rest is just scaling.

@Brooks Moses 2011-12-21 23:37:56

Why are you saying that loop 2 pages in the entirety of a1 and b1 for the first assignment, rather than just the first page of each of them? (Are you assuming 5-byte pages, so a page is half of your RAM? That's not just scaling, that's completely unlike a real processor.)

@Brooks Moses 2011-12-21 23:44:30

More realistic numbers: 32kb of RAM, 1kb pages, arrays are 16kb long and aligned to page boundaries, and 1-byte ints. On the first pass through the loop, the first assignment causes the first 1kb page of a1 and b1 to be paged in. That takes up 2kb; you've still got 30kb free. So, the second assignment can page in the first 1kb page of c1 and d1 without evicting the first pages of a1 and b1. At that point, you can go around the loop 1023 more times with no page faults. No pages need to be evicted until long after you're done with them.

@mabraham 2013-11-27 10:03:14

Memory hierarchies load upon demand of an address. Only if the compiler issued some kind of pre-fetch instruction might this answer be relevant; if it did, this answer would illustrate why it should do so in a size-aware manner, but since the compiler does not know the memory size of the execution machine, it will not do so. In general, the compiler does not even know the cache size without explicit instructions.

@user1899861 2012-12-30 01:34:43

I cannot replicate the results discussed here.

I don't know if poor benchmark code is to blame, or what, but the two methods are within 10% of each other on my machine using the following code, and one loop is usually just slightly faster than two - as you'd expect.

Array sizes ranged from 2^16 to 2^24, using eight loops. I was careful to initialize the source arrays so the += assignment wasn't asking the FPU to add memory garbage interpreted as a double.

I played around with various schemes, such as putting the assignment of b[j], d[j] to InitToZero[j] inside the loops, and also with using += b[j] = 1 and += d[j] = 1, and I got fairly consistent results.

As you might expect, initializing b and d inside the loop using InitToZero[j] gave the combined approach an advantage, as they were done back-to-back before the assignments to a and c, but still within 10%. Go figure.

Hardware is Dell XPS 8500 with generation 3 Core i7 @ 3.4 GHz and 8 GB memory. For 2^16 to 2^24, using eight loops, the cumulative time was 44.987 and 40.965 respectively. Visual C++ 2010, fully optimized.

PS: I changed the loops to count down to zero, and the combined method was marginally faster. Scratching my head. Note the new array sizing and loop counts.

// MemBufferMystery.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <string>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define  dbl    double
#define  MAX_ARRAY_SZ    262145    //16777216    // AKA (2^24)
#define  STEP_SZ           1024    //   65536    // AKA (2^16)

int _tmain(int argc, _TCHAR* argv[]) {
    long i, j, ArraySz = 0, LoopKnt = 1024;
    time_t start, Cumulative_Combined = 0, Cumulative_Separate = 0;
    dbl *a = NULL, *b = NULL, *c = NULL, *d = NULL, *InitToOnes = NULL;

    a = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    b = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    c = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    d = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    InitToOnes = (dbl *)calloc(MAX_ARRAY_SZ, sizeof(dbl));
    // Initialize the source array to 1.0 values.
    for (j = 0; j < MAX_ARRAY_SZ; j++) {
        InitToOnes[j] = 1.0;
    }
    // Increase the size of the arrays and time each size.
    for (ArraySz = STEP_SZ; ArraySz < MAX_ARRAY_SZ; ArraySz += STEP_SZ) {
        a = (dbl *)realloc(a, ArraySz * sizeof(dbl));
        b = (dbl *)realloc(b, ArraySz * sizeof(dbl));
        c = (dbl *)realloc(c, ArraySz * sizeof(dbl));
        d = (dbl *)realloc(d, ArraySz * sizeof(dbl));
        // Outside the timing loop, initialize
        // b and d arrays to 1.0 values for consistent += performance.
        memcpy((void *)b, (void *)InitToOnes, ArraySz * sizeof(dbl));
        memcpy((void *)d, (void *)InitToOnes, ArraySz * sizeof(dbl));

        start = clock();
        for (i = LoopKnt; i; i--) {
            for (j = ArraySz - 1; j >= 0; j--) {
                a[j] += b[j];
                c[j] += d[j];
            }
        }
        Cumulative_Combined += (clock() - start);
        printf("\n %6i milliseconds for combined array sizes %i and %i loops",
                (int)(clock() - start), (int)ArraySz, (int)LoopKnt);
        start = clock();
        for (i = LoopKnt; i; i--) {
            for (j = ArraySz - 1; j >= 0; j--) {
                a[j] += b[j];
            }
            for (j = ArraySz - 1; j >= 0; j--) {
                c[j] += d[j];
            }
        }
        Cumulative_Separate += (clock() - start);
        printf("\n %6i milliseconds for separate array sizes %i and %i loops \n",
                (int)(clock() - start), (int)ArraySz, (int)LoopKnt);
    }
    printf("\n Cumulative combined array processing took %10.3f seconds",
            (dbl)(Cumulative_Combined / (dbl)CLOCKS_PER_SEC));
    printf("\n Cumulative separate array processing took %10.3f seconds",
            (dbl)(Cumulative_Separate / (dbl)CLOCKS_PER_SEC));

    free(a); free(b); free(c); free(d); free(InitToOnes);
    return 0;
}
I'm not sure why it was decided that MFLOPS was a relevant metric. I thought the idea was to focus on memory accesses, so I tried to minimize the amount of floating-point computation time. I left in the +=, but I am not sure why.

A straight assignment with no computation would be a cleaner test of memory access time and would create a test that is uniform irrespective of the loop count. Maybe I missed something in the conversation, but it is worth thinking twice about. If the plus is left out of the assignment, the cumulative time is almost identical at 31 seconds each.

@Mysticial 2012-12-31 19:15:45

The misalignment penalty that you mention here is when an individual load/store that is misaligned (including the unaligned SSE load/stores). But that is not the case here since the performance is sensitive to the relative alignments of the different arrays. There are no misalignments at the instruction level. Every single load/store is properly aligned.

@Johannes Gerer 2011-12-18 01:29:55

OK, the right answer definitely has to do something with the CPU cache. But to use the cache argument can be quite difficult, especially without data.

There are many answers, that led to a lot of discussion, but let's face it: Cache issues can be very complex and are not one dimensional. They depend heavily on the size of the data, so my question was unfair: It turned out to be at a very interesting point in the cache graph.

@Mysticial's answer convinced a lot of people (including me), probably because it was the only one that seemed to rely on facts, but it was only one "data point" of the truth.

That's why I combined his test (using a continuous vs. separate allocation) and @James' Answer's advice.

The graphs below show that most of the answers, and especially the majority of comments to the question and answers, can be considered completely wrong or true depending on the exact scenario and parameters used.

Note that my initial question was at n = 100,000. This point (by accident) exhibits special behavior:

  1. It possesses the greatest discrepancy between the one-loop and two-loop versions (almost a factor of three)

  2. It is the only point where the one-loop version (namely with continuous allocation) beats the two-loop version. (This made Mysticial's answer possible at all.)

The result using initialized data:

(graph)

The result using uninitialized data (this is what Mysticial tested):

(graph)

And this is a hard-to-explain one: Initialized data, that is allocated once and reused for every following test case of different vector size:

(graph)


Every low-level performance-related question on Stack Overflow should be required to provide MFLOPS information for the whole range of cache-relevant data sizes! It's a waste of everybody's time to think of answers, and especially to discuss them with others, without this information.

@Mysticial 2011-12-18 01:48:21

+1 Nice analysis. I didn't intend to leave the data uninitialized in the first place. It just happened that the allocator zeroed them anyway. So the initialized data is what matters. I just edited my answer with results on an actual Core 2 architecture machine and they are a lot closer to what you are observing. Another thing is that I tested a range of sizes n and it shows the same performance gap for n = 80000, n = 100000, n = 200000, etc...

@v.oddou 2017-11-30 05:36:26

@Mysticial I think the OS implements page zeroing whenever it gives new pages to a process, to avoid possible inter-process spying.

@ShadowRanger 2019-07-01 17:20:07

@v.oddou: The behavior depends on the OS too; IIRC, Windows has a thread to background zero-out freed pages, and if a request can't be satisfied from already zeroed pages, the VirtualAlloc call blocks until it can zero enough to satisfy the request. By contrast, Linux just maps the zero page as copy-on-write as much as necessary, and on write, it copies the new zeroes to a fresh page before writing in the new data. Either way, from the user mode process's perspective, the pages are zeroed, but the first use of uninitialized memory will usually be more expensive on Linux than on Windows.

@Guillaume Kiz 2012-08-17 15:23:58

The first loop alternates writing to each array. The second and third only make small jumps of element size.

Try writing two parallel lines of 20 crosses with pen and paper, separated by 20 cm. Try once finishing one line and then the other, and try another time writing a cross on each line alternately.

@FeRD 2019-08-03 02:47:27

Analogies to real-world activities are fraught with peril, when thinking about things like CPU instructions. What you're illustrating is effectively seek time, which would apply if we were talking about reading/writing data stored on a spinning disk, but there is no seek time in the CPU cache (or in RAM, or on an SSD). Accesses to disjoint regions of memory incur no penalty vs. adjacent accesses.

@James 2011-12-17 20:52:48

It's because the CPU doesn't have so many cache misses (where it has to wait for the array data to come from the RAM chips). It would be interesting for you to adjust the size of the arrays continually so that you exceed the sizes of the level 1 cache (L1), and then the level 2 cache (L2), of your CPU and plot the time taken for your code to execute against the sizes of the arrays. The graph shouldn't be a straight line like you'd expect.

@Oliver Charlesworth 2011-12-17 21:01:05

I don't believe there's any interaction between the cache size and the array size. Each array element is only used once, and can then be safely evicted. There may well be an interaction between the cache line size and the array size, though, if it causes the four arrays to conflict.

@Emilio Garavaglia 2011-12-17 20:49:03

It's not because of different code, but because of caching: RAM is slower than the CPU registers, and a cache memory sits inside the CPU to avoid writing to RAM every time a variable changes. But the cache is not as big as RAM, so it maps only a fraction of it.

The first code alternates between distant memory addresses at each loop iteration, continuously requiring the cache to be invalidated.

The second code doesn't alternate: it just flows over adjacent addresses twice. This lets all the work complete in the cache, invalidating it only after the second loop starts.

@Oliver Charlesworth 2011-12-17 21:22:52

Why would this cause the cache to continuously be invalidated?

@Emilio Garavaglia 2011-12-17 21:30:44

@OliCharlesworth: Think of the cache as a hard copy of a contiguous range of memory addresses. If you try to access an address that is not part of that range, you have to re-load the cache. And if something in the cache has been modified, it has to be written back to RAM, or it will be lost. In the sample code, 4 vectors of 100,000 integers (400 kBytes) are most likely more than the capacity of the L1 cache (128 or 256K).

@Oliver Charlesworth 2011-12-17 21:36:51

The size of the cache has no impact in this scenario. Each array element is only used once, and after that it doesn't matter if it's evicted. Cache size only matters if you have temporal locality (i.e. you're going to re-use the same elements in the future).

@Emilio Garavaglia 2011-12-18 08:43:42

@OliCharlesworth: If I have to load a new value into the cache, and there is already a value in it that has been modified, I first have to write it back, and this makes me wait for the write to happen.

@Oliver Charlesworth 2011-12-18 11:11:54

But in both variants of the OP's code, each value gets modified precisely once. You do the same number of write-backs in each variant.

@Brooks Moses 2011-12-18 21:43:23

Emilio, your analysis is wrong, because your claim that "a cache is a ... copy of a contiguous range of memory addresses" is wrong. That sort of contiguous range is called a "cache line", and real processor caches have many separate cache lines. Thus, they can hold contiguous sections from multiple different regions at once.

@Emilio Garavaglia 2013-04-04 07:11:53

@BrooksMoses: This is true, but irrelevant in this context (don't forget this is an ANSWER to a QUESTION). I'm not describing processor theory, but what is actually happening in THAT PARTICULAR CONTEXT. This answer [] is "correct", but it matches my analysis perfectly, which hence cannot be plainly wrong. It just works at a higher level of approximation, the one that a programmer who isn't programming caches should follow as a guideline. What you say is just a detail, operating at a more microscopic level.

@Puppy 2011-12-17 20:47:55

The second loop involves a lot less cache activity, so it's easier for the processor to keep up with the memory demands.

@Oliver Charlesworth 2011-12-17 20:49:08

You're saying that the second variant incurs fewer cache misses? Why?

@Puppy 2011-12-17 20:50:30

@Oli: In the first variant, the processor needs to access four memory lines at a time: a[i], b[i], c[i] and d[i]. In the second variant, it needs just two. This makes it much more viable to refill those lines whilst adding.

@Oliver Charlesworth 2011-12-17 20:52:22

But so long as the arrays don't collide in cache, each variant requires the exact same number of reads and writes from/to main memory. So the conclusion is (I think) that these two arrays happen to be colliding all the time.

@Puppy 2011-12-17 20:54:15

@Oli: But all the lines need reading at once, in the first variant. In the second, the processor can delay reading the second two arrays.

@Oliver Charlesworth 2011-12-17 20:55:39

I don't follow. Per instruction (i.e. per instance of x += y), there are two reads and one write. This is true for either variant. The cache<->CPU bandwidth requirement is therefore the same. So long as there are no conflicts, the cache<->RAM bandwidth requirement is also the same.

@Puppy 2011-12-17 21:00:43

@Oli: That's only true if the CPU can read faster than it can execute. In the double-loop variant, the CPU only has to predict the memory needs of the next += instruction in time to keep the ALU at maximum capacity. In the single-loop variant, the CPU needs to predict the memory needs of two += instructions. If the CPU can't react in time, the memory read won't be issued in time- that is, if the CPU can only see a certain time ahead to predict the memory needs, then in the single-loop version, it can't react in time.

@Oliver Charlesworth 2011-12-17 21:05:29

Either data can be transferred from cache to ALU fast enough to keep it utilised, or it can't be. So far as I understand, cache is "flat" in this regard; there is no difference in cost to read from one cache location than to read from another.

@Puppy 2011-12-17 21:07:56

@Oli: It's not about whether the cache is flat, it's about whether the CPU can see far enough ahead to issue the reads in advance. The more operations you try to perform, the less likely it is that the CPU can out-of-order execute them.

@Oliver Charlesworth 2011-12-17 21:17:37

Ok, but what prediction does the CPU do in this regard? The main thing that permits OOE is the ability to track register dependencies, no?

@Puppy 2011-12-17 21:19:52

@Oli: Got to admit, I don't know all that much about it. However, it seems fairly logical to me that it'd be much easier to look at a simpler loop.

@Oliver Charlesworth 2011-12-17 21:22:20

No worries; you forced me to think about it ;) Like most discussions about cache behaviour; I still don't know 100% which of us is correct... (although I think @Mysticial has found the issue in his answer).

@Puppy 2011-12-17 21:28:35

@Oli: I suspect that at this level, it's beyond the theory of cache and into the details of a given vendor's specific implementation on a specific CPU.

@Brooks Moses 2011-12-18 21:51:11

As noted in, the Pentium M's hardware prefetch can track 12 different forward streams (and I would expect later hardware to be at least as capable). Loop 2 is still only reading four streams, so is well within that limit.

@Puppy 2011-12-19 07:33:04

@Brooks: Later hardware is derived from the Pentium 3 architecture, not Pentium 4. More than that, there's a big difference between integral ALU ops and SSE operations.

@Ben Voigt 2012-06-20 16:18:14

@DeadMG: Later hardware is derived from Pentium M, which is derived from Pentium 3. Brooks's comment appears to acknowledge that.
