By Cassio Neri


2019-05-12 17:03:02 8 Comments

In a research project of mine I'm writing C++ code. However, the generated assembly is one of the crucial points of the project. C++ doesn't provide direct access to flag manipulating instructions, in particular, to ADC but this shouldn't be a problem provided the compiler is smart enough to use it. Consider:

constexpr unsigned X = 0;

unsigned f1(unsigned a, unsigned b) {
    b += a;
    unsigned c = b < a;
    return c + b + X;
}

Variable c is a workaround to get my hands on the carry flag and add it to b and X. It looks I got luck and the (g++ -O3, version 9.1) generated code is this:

f1(unsigned int, unsigned int):
 add %edi,%esi
 mov %esi,%eax
 adc $0x0,%eax
 retq 

For all values of X that I've tested the code is as above (except, of course for the immediate value $0x0 that changes accordingly). I found one exception though: when X == -1 (or 0xFFFFFFFFu or ~0u, ... it really doesn't matter how you spell it) the generated code is:

f1(unsigned int, unsigned int):
 xor %eax,%eax
 add %edi,%esi
 setb %al
 lea -0x1(%rsi,%rax,1),%eax
 retq 

This seems less efficient than the initial code as suggested by indirect measurements (not very scientific though) Am I right? If so, is this a "missing optimization opportunity" kind of bug that is worth reporting?

For what is worth, clang -O3, version 8.8.0, always uses ADC (as I wanted) and icc -O3, version 19.0.1 never does.

I've tried using the intrinsic _addcarry_u32 but it didn't help.

unsigned f2(unsigned a, unsigned b) {
    b += a;
    unsigned char c = b < a;
    _addcarry_u32(c, b, X, &b);
    return b;
}

I reckon I might not be using _addcarry_u32 correctly (I couldn't find much info on it). What's the point of using it since it's up to me to provide the carry flag? (Again, introducing c and praying for the compiler to understand the situation.)

I might, actually, be using it correctly. For X == 0 I'm happy:

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov %edi,%eax
 adc $0x0,%eax
 retq 

For X == -1 I'm unhappy :-(

f2(unsigned int, unsigned int):
 add %esi,%edi
 mov $0xffffffff,%eax
 setb %dl
 add $0xff,%dl
 adc %edi,%eax
 retq 

I do get the ADC but this is clearly not the most efficient code. (What's dl doing there? Two instructions to read the carry flag and restore it? Really? I hope I'm very wrong!)

1 comments

@Peter Cordes 2019-05-12 18:25:17

mov + adc $-1, %eax is more efficient than xor-zero + setc + 3-component lea for both latency and uop count on most CPUs, and no worse on any still-relevant CPUs.1


This looks like a gcc missed optimization: it probably sees a special case and latches onto that, shooting itself in the foot and preventing the adc pattern recognition from happening.

I don't know what exactly it saw / was looking for, so yes you should report this as a missed-optimization bug. Or if you want to dig deeper yourself, you could look at the GIMPLE or RTL output after optimization passes and see what happens. If you know anything about GCC's internal representations. Godbolt has a GIMPLE tree-dump window you can add from the same dropdown as "clone compiler".


The fact that clang compiles it with adc proves that it's legal, i.e. that the asm you want does match the C++ source, and you didn't miss some special case that's stopping the compiler from doing that optimization. (Assuming clang is bug-free, which is the case here.)

That problem can certainly happen if you're not careful, e.g. trying to write a general-case adc function that takes carry in and provides carry-out from the 3-input addition is hard in C, because either of the two additions can carry so you can't just use the sum < a+b idiom after adding the carry to one of the inputs. I'm not sure it's possible to get gcc or clang to emit add/adc/adc where the middle adc has to take carry-in and produce carry-out.

e.g. 0xff...ff + 1 wraps around to 0, so sum = a+b+carry_in / carry_out = sum < a can't optimize to an adc because it needs to ignore carry in the special case where a = -1 and carry_in = 1.

So another guess is that maybe gcc considered doing the + X earlier, and shot itself in the foot because of that special case. That doesn't make a lot of sense, though.


What's the point of using it since it's up to me to provide the carry flag?

You're using _addcarry_u32 correctly.

The point of its existence is to let you express an add with carry in as well as carry out, which is hard in pure C. GCC and clang don't optimize it well, often not just keeping the carry result in CF.

If you only want carry-out, you can provide a 0 as the carry in and it will optimize to add instead of adc, but still give you the carry-out as a C variable.

e.g. to add two 128-bit integers in 32-bit chunks, you can do this

// bad on x86-64 because it doesn't optimize the same as 2x _addcary_u64
// even though __restrict guarantees non-overlap.
void adc_128bit(unsigned *__restrict dst, const unsigned *__restrict src)
{
    unsigned char carry;
    carry = _addcarry_u32(0, dst[0], src[0], &dst[0]);
    carry = _addcarry_u32(carry, dst[1], src[1], &dst[1]);
    carry = _addcarry_u32(carry, dst[2], src[2], &dst[2]);
    carry = _addcarry_u32(carry, dst[3], src[3], &dst[3]);
}

(On Godbolt with GCC/clang/ICC)

That's very inefficient vs. unsigned __int128 where compilers would just use 64-bit add/adc, but does get clang and ICC to emit a chain of add/adc/adc/adc. GCC makes a mess, using setcc to store CF to an integer for some of the steps, then add dl, -1 to put it back into CF for an adc.

GCC unfortunately sucks at extended-precision / biginteger written in pure C. Clang sometimes does slightly better, but most compilers are bad at it. This is why the lowest-level gmplib functions are hand-written in asm for most architectures.


Footnote 1: or for uop count: equal on Intel Haswell and earlier where adc is 2 uops, except with a zero immediate where Sandybridge-family's decoders special case that as 1 uop.

But the 3-component LEA with a base + index + disp makes it a 3-cycle latency instruction on Intel CPUs, so it's definitely worse.

On Intel Broadwell and later, adc is a 1-uop instruction even with a non-zero immediate, taking advantage of support for 3-input uops introduced with Haswell for FMA.

So equal total uop count but worse latency means that adc would still be a better choice.

https://agner.org/optimize/

@Cassio Neri 2019-05-12 20:22:04

Many thanks. +1 also for the explanation on _addcarry_u32. My real case (like to my post) is concerned about ADD followed by ADC where the carry after ADD is important but the one after ADC is not. Hence, I was not paying attention to the latter and I failed to appreciate the crucial point of _addcarry_u32 which is producing the carry out. Now, thanks to your post, _addcarry_u32 makes perfect sense to me. I'll keep that in mind. I have a feeling that I might need that soon and you probably saved me from another headache.

@Cassio Neri 2019-05-12 20:27:50

I will report the bug to gcc. Sorry for my rant in the post. This project is really uncovering bugs in many unexpected places (even godbolt) and I keep reporting them. This reason only is enough to keep me motivated and carry on with it.

@Peter Cordes 2019-05-12 20:33:36

@CassioNeri: yeah, it certainly gets frustrating when compilers are dumb and pessimize your code no matter how you express it in C. Having to fight against C and it's lack of primitives for add-with-carry, popcnt, bitscan, and so on doesn't help, but at least there are intrinsics if you have the luxury of targeting a specific implementation like GNU C for x86.

@Cassio Neri 2019-05-13 08:43:27

I've done my homework :-) gcc.gnu.org/bugzilla/show_bug.cgi?id=90447 The question became even more interesting after the addition of a third version where I don't create variable c and use b += X + (b < a) instead.

Related Questions

Sponsored Content

5 Answered Questions

[SOLVED] What are the rules about using an underscore in a C++ identifier?

1 Answered Questions

[SOLVED] Memory allocation and addressing in Assembly

  • 2019-01-17 05:01:39
  • D Kar
  • 313 View
  • 0 Score
  • 1 Answer
  • Tags:   c assembly x86-64

1 Answered Questions

[SOLVED] Avoiding the JMP in the JMP CALL POP technique for shellcode NASM?

1 Answered Questions

[SOLVED] g++ bug? (bool_val ? 0 : 1) returns neither 0 nor 1

  • 2015-07-16 13:34:37
  • mikewei
  • 1307 View
  • 15 Score
  • 1 Answer
  • Tags:   c++ gcc

0 Answered Questions

Cannot read ZF correctly?

  • 2016-10-27 18:47:57
  • David
  • 64 View
  • 0 Score
  • 0 Answer
  • Tags:   assembly x86 gas

1 Answered Questions

NASM on linux: Using sys_read adds extra line at the end

2 Answered Questions

[SOLVED] How does loop address alignment affect the speed on Intel x86_64?

2 Answered Questions

[SOLVED] Efficient 128-bit addition using carry flag

2 Answered Questions

1 Answered Questions

[SOLVED] x86 ADC carry flag and length

Sponsored Content