By Lukas Kalbertodt


2019-07-07 12:47:56

I noticed something really strange that I haven't seen before. The basic setup is described in this pseudo code:

TARGET_LOOP_TIME = X

loop forever:
    before = now()
    payload()
    payload_time = now() - before
    sleep(TARGET_LOOP_TIME - payload_time)

This setup is fairly common, e.g. for keeping a loop at 60 FPS. The interesting part is: the payload_time depends on the sleep duration! If TARGET_LOOP_TIME is high and the program therefore sleeps a lot, the payload_time is way higher than when the program does not sleep at all.
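In real Rust code, the pattern would look roughly like this (just a sketch; payload() is a stand-in for the actual per-frame work):

use std::time::{Duration, Instant};

fn main() {
    // Roughly 60 FPS: one iteration every ~16.7ms.
    let target_loop_time = Duration::from_micros(16_667);

    loop {
        let before = Instant::now();
        payload();
        let payload_time = before.elapsed();

        // Sleep away whatever is left of the frame budget.
        if target_loop_time > payload_time {
            std::thread::sleep(target_loop_time - payload_time);
        }
    }
}

/// Placeholder for the actual per-frame work.
fn payload() {}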

To measure this, I wrote this program:

use std::time::{Duration, Instant};

const ITERS: usize = 100;

fn main() {
    // A dummy variable to prevent the compiler from removing the dummy prime
    // code.
    let mut x = 0;

    // Iterate over different target loop times
    for loop_time in (1..30).map(|n| Duration::from_millis(n)) {
        let mut payload_duration = Duration::from_millis(0);

        for _ in 0..ITERS {
            let before = Instant::now();
            x += count_primes(3_500);
            let elapsed = before.elapsed();
            payload_duration += elapsed;

            // Sleep the remaining time
            if loop_time > elapsed {
                std::thread::sleep(loop_time - elapsed);
            }
        }

        let avg_duration = payload_duration / ITERS as u32;
        println!("loop_time {:.2?}  \t=> {:.2?}", loop_time, avg_duration);
    }

    println!("{}", x);
}

/// Dummy function.
fn count_primes(up_to: u64) -> u64 {
    (2..up_to)
        .filter(|n| (2..n / 2).all(|d| n % d != 0))
        .count() as u64
}

I iterate over different target loop times (1ms to 29ms) and run the loop body ITERS times for each. I compiled and ran this with cargo run --release. On my machine (Ubuntu), the program outputs:

loop_time 1.00ms    => 3.37ms
loop_time 2.00ms    => 3.38ms
loop_time 3.00ms    => 3.17ms
loop_time 4.00ms    => 3.25ms
loop_time 5.00ms    => 3.38ms
loop_time 6.00ms    => 4.05ms
loop_time 7.00ms    => 4.09ms
loop_time 8.00ms    => 4.48ms
loop_time 9.00ms    => 4.43ms
loop_time 10.00ms   => 4.22ms
loop_time 11.00ms   => 4.59ms
loop_time 12.00ms   => 5.53ms
loop_time 13.00ms   => 5.82ms
loop_time 14.00ms   => 6.18ms
loop_time 15.00ms   => 6.32ms
loop_time 16.00ms   => 6.96ms
loop_time 17.00ms   => 8.00ms
loop_time 18.00ms   => 7.97ms
loop_time 19.00ms   => 8.28ms
loop_time 20.00ms   => 8.75ms
loop_time 21.00ms   => 9.70ms
loop_time 22.00ms   => 9.57ms
loop_time 23.00ms   => 10.48ms
loop_time 24.00ms   => 10.29ms
loop_time 25.00ms   => 10.31ms
loop_time 26.00ms   => 10.82ms
loop_time 27.00ms   => 10.84ms
loop_time 28.00ms   => 10.82ms
loop_time 29.00ms   => 10.91ms

I made a plot of those numbers (the sleep_time is max(0, loop_time - avg_duration)):

[Plot of the numbers above: average payload duration and the derived sleep time, for each target loop time]

When the program does not sleep at all, the payload needs approximately 3.3ms (as the first three measurements show). As soon as the loop starts to sleep after the payload, the payload duration rises! In fact, it increases to roughly 10.5ms where it stays. Sleeping even longer does not increase the payload time.

Why? Why does the execution time of the piece of code depend on something I do afterwards (or before)? That does not make sense to me! It looks like the CPU says "I'm gonna sleep afterwards anyway, so let's take it slowly". I thought about caching effects, especially of the instruction cache, but loading instruction data from the main memory does not take 7ms! Something else is going on here!

Is there a way to fix this? I.e. to make the payload execute as quickly as possible regardless of sleep time?


@rodrigo 2019-07-07 13:32:05

I am quite sure that this is caused by CPU frequency throttling. When the OS sees that there is little work to do, it lowers the CPU frequency to save power.

When you do a lot of sleeps, you are telling the scheduler that you are not in such a hurry, and the CPU can take it easy.
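You can watch this happening directly. On Linux, with the usual cpufreq sysfs layout, the current frequency of a core is exposed in /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq. Here is a quick sketch that polls that file (assuming it exists on your machine); run it alongside your benchmark and you should see the reported frequency drop as soon as your loop starts sleeping:

use std::{fs, thread, time::Duration};

fn main() {
    // Path assumes the standard Linux cpufreq sysfs layout.
    let path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";

    for _ in 0..20 {
        // The file contains the current frequency of cpu0 in kHz.
        let khz: u64 = fs::read_to_string(path)
            .expect("no cpufreq sysfs entry?")
            .trim()
            .parse()
            .expect("unexpected file content");
        println!("cpu0: {} MHz", khz / 1000);
        thread::sleep(Duration::from_millis(500));
    }
}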

You can see that this is the case by running a CPU intensive task in another window, with low priority. For example, in Linux you can run:

$ nice bash -c 'while true ; do : ; done'

And at the same time, in another window run your program:

$ cargo run --release
loop_time 1.00ms    => 3.13ms
loop_time 2.00ms    => 3.17ms
loop_time 3.00ms    => 3.19ms
loop_time 4.00ms    => 3.13ms
loop_time 5.00ms    => 3.16ms
loop_time 6.00ms    => 3.22ms
loop_time 7.00ms    => 3.14ms
loop_time 8.00ms    => 3.15ms
loop_time 9.00ms    => 3.13ms
loop_time 10.00ms   => 3.18ms
loop_time 11.00ms   => 3.14ms
loop_time 12.00ms   => 3.17ms
loop_time 13.00ms   => 3.15ms
...

Avoiding this depends on your OS. For example, in Linux, you can fiddle with the /sys/devices/system/cpu/* options. I think that UPower provides some functions to manage it from a non-root application. It would be nice if there were a crate that managed this cross-system, but I don't know of any.

An easy but hacky way to fix this, if you don't mind the wasted power, is just to run an idle thread with a busy loop.

std::thread::spawn(|| {
    use thread_priority::*; //external crate thread-priority
    let thread_id = thread_native_id();
    set_thread_priority(
        thread_id,
        ThreadPriority::Min,
        ThreadSchedulePolicy::Normal(NormalThreadSchedulePolicy::Idle),
    )
    .unwrap();
    loop {}
});

Naturally, if you just want to avoid throttling in this piece of code, you can do a busy wait:

    //if loop_time > elapsed {
    //    std::thread::sleep(loop_time - elapsed);
    //}
    // Busy-wait the remaining time, to avoid CPU throttling
    while loop_time > before.elapsed() {
        //you may want to try both with and without yield
        std::thread::yield_now();
    }

@Lukas Kalbertodt 2019-07-07 13:59:31

Thanks! The CPU frequency thing makes sense. Regarding the solution: instead of a busy thread, I could probably just busy-wait in my main thread rather than sleeping.

@Lukas Kalbertodt 2019-07-07 14:00:58

I have an additional question though (in case someone can answer this): according to this paper (if I understand it correctly), the latency to switch between two CPU frequencies is fairly low (<100µs). So I would expect the CPU to increase its frequency fairly quickly after my payload starts. Why isn't that happening?

@rodrigo 2019-07-07 14:05:51

@LukasKalbertodt: It is not so easy. You seem to think that the CPU goes from 100% power to 0% and then back to 100%, and that the time you lose is the latency of the switch. While the switching latency is definitely a thing, your main issue is the frequency throttling itself: the OS sees that your code is busy only, say, 30% of the time, so it can reduce the CPU frequency to roughly 30% of its maximum without losing overall throughput, while saving a lot of power.

@rodrigo 2019-07-07 14:07:41

@LukasKalbertodt: In fact, you can read from your graph that the OS is trying to predict how long you are going to sleep and to pick the lowest frequency that still leaves you a bit of sleep time. Only up to a point, of course: the CPU frequency has a minimum limit, which is why your payload time plateaus around 10.5ms (roughly three times the full-speed 3.3ms, suggesting the CPU has settled near a third of its maximum frequency).

@Lukas Kalbertodt 2019-07-07 14:08:26

I see. Is there any way to tell the OS to "switch to full power now"? Or some kind of related hint? In my real application, I care about latency so I really want my piece of code to execute as quickly as possible (but also want to sleep the remaining time).

@rodrigo 2019-07-07 14:10:58

@LukasKalbertodt: Not that I know of that is both reliable and cross-platform. Your most reliable option, which is also the easiest, would be just to busy-wait. Let me add the code to my answer.

@Jens 2019-07-07 14:53:30

@LukasKalbertodt These settings are OS-specific. On Linux, you can echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; on Windows, you pick a high-performance plan in the power settings. There is also Turbo Boost, which dynamically increases the maximum frequency of a core if there is thermal budget left. I would recommend disabling power scaling and Turbo Boost in the BIOS for benchmarking.
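If you'd rather flip the governor from Rust than from a shell, something along these lines should work (just a sketch: it needs root and assumes the usual cpufreq sysfs layout):

use std::fs;

fn main() -> std::io::Result<()> {
    // Needs root; assumes the standard Linux cpufreq sysfs layout.
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let governor_file = path.join("cpufreq/scaling_governor");
        // Only the cpuN directories have a cpufreq subdirectory.
        if governor_file.exists() {
            fs::write(&governor_file, "performance")?;
            println!("set {} to performance", governor_file.display());
        }
    }
    Ok(())
}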

@Matthieu M. 2019-07-08 07:50:49

@LukasKalbertodt: I would also note that there are other reasons than throttling for the core frequency to vary. On multi-core CPUs, there's also scaling: if only a single core is in use, it can reach a higher frequency than when all cores are busy, for both power-supply and heat-dissipation reasons. Furthermore, the use of particularly power-hungry instructions (such as AVX-512) may also trigger this kind of frequency switch.
