By balajimc55

2015-11-12 07:55:01 8 Comments

All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?

xorl   %eax, %eax
mov    $0, %eax
andl   $0, %eax


@Peter Cordes 2015-11-12 09:37:17

TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32 bits. xor r64, r64 is a waste of a byte, because it needs a REX prefix.

Even worse than that, Silvermont only recognizes xor r32,r32 as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d, not xor r10,r10.

GP-integer examples:

xor   eax, eax       ; RAX = 0.  Including AL=0 etc.
xor   r10d, r10d     ; R10 = 0
xor   edx, edx       ; RDX = 0

; small code-size alternative:    cdq    ; zero RDX if EAX is already zero

xor   rax,rax       ; waste of a REX prefix, and extra slow on Silvermont
mov   eax, 0        ; doesn't touch FLAGS, but not faster and takes more bytes

xor   al, al        ; false dep on some CPUs, not a zeroing idiom.
mov   al, 0         ; only 2 bytes, and probably better than xor al,al *if* you need to leave the rest of EAX/RAX unmodified

Zeroing a vector register is usually best done with pxor xmm, xmm. That's typically what gcc does (even before use with FP instructions).

xorps xmm, xmm can make sense. It's one byte shorter than pxor, but xorps needs execution port 5 on Intel Nehalem, while pxor can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).

On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps and pxor are handled the same way (as vector-integer instructions).

Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.

XMM/YMM/ZMM examples

 xorps   xmm0, xmm0         ; smallest code size (for non-AVX)
 pxor    xmm0, xmm0         ; costs an extra byte, runs on any port on Nehalem.
 xorps   xmm15, xmm15       ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX.  Code-size is the only penalty.

   ; with AVX
 vpxor xmm0, xmm0, xmm0    ; zeros X/Y/ZMM0
 vpxor xmm15, xmm0, xmm0   ; zeros X/Y/ZMM15, still only 2-byte VEX prefix

 vpxord xmm30, xmm30, xmm30  ; EVEX is unavoidable when zeroing high 16, but still prefer XMM or YMM for fewer uops on probable future AMD.  May be worth it to avoid needing vzeroupper.
 vpxord zmm30, zmm30, zmm30  ; Without AVX512VL you have to use a 512-bit instruction.

 vpxor   xmm15, xmm15, xmm15   ; 3-byte VEX prefix for high source reg
 vpxord  zmm0, zmm0, zmm0      ; EVEX prefix (4 bytes), and a 512-bit uop.

See "Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?" and
"What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?"

Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7 mask registers.

What's special about zeroing idioms like xor on various uarches

Some CPUs recognize sub same,same as a zeroing idiom like xor, but all CPUs that recognize any zeroing idioms recognize xor. Just use xor so you don't have to worry about which CPU recognizes which zeroing idiom.

xor (being a recognized zeroing idiom, unlike mov reg, 0) has some obvious and some subtle advantages (summary list, then I'll expand on those):

  • smaller code-size than mov reg,0. (All CPUs)
  • avoids partial-register penalties for later code. (Intel P6-family and SnB-family).
  • doesn't use an execution unit, saving power and freeing up execution resources. (Intel SnB-family)
  • smaller uop (no immediate data) leaves room in the uop cache-line for nearby instructions to borrow if needed. (Intel SnB-family).
  • doesn't use up entries in the physical register file. (Intel SnB-family (and P4) at least, possibly AMD as well since they use a similar PRF design instead of keeping register state in the ROB like Intel P6-family microarchitectures.)

Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
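For reference, the actual encodings (32-bit operand size, from the standard x86 opcode tables) look like this:

    31 C0             xor  eax, eax    ; 2 bytes
    B8 00 00 00 00    mov  eax, 0      ; 5 bytes (B8+rd id: full 32-bit immediate)
    83 E0 00          and  eax, 0      ; 3 bytes (sign-extended imm8), but keeps all the other downsides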

The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.

See my answer on another question about zeroing registers for some more details.

Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I highly recommend).

On AMD Bulldozer-family CPUs, mov immediate runs on the same EX0/EX1 integer execution ports as xor. mov reg,reg can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor over mov is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.

Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).

xor will tag the register as having the upper parts zeroed, so xor eax, eax / inc al / inc eax avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor, IvB only needs a merging uop when the high 8bits (AH) are modified and then the whole register is read, and Haswell even removes that.

From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):

The processor recognizes the XOR of a register with itself as setting it to zero. A special tag in the register remembers that the high part of the register is zero so that EAX = AL. This tag is remembered even in a loop:

    ; Example    7.9. Partial register problem avoided in loop
    xor    eax, eax
    mov    ecx, 100
LL: mov    al, [esi]
    mov    [edi], eax    ; No extra uop
    inc    esi
    add    edi, 4
    dec    ecx
    jnz    LL

(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.

pg82 of that guide also confirms that mov reg, 0 is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.

xor sets flags, which means you have to be careful when testing conditions. Since setcc is unfortunately only available with an 8bit destination, you usually need to take care to avoid partial-register penalties.

It would have been nice if x86-64 had repurposed one of the removed opcodes (like AAM) for a 16/32/64-bit setcc r/m, with the predicate encoded in the 3-bit reg field of the ModRM byte (the way some other single-operand instructions use it as extra opcode bits). But they didn't do that, and it wouldn't help for x86-32 anyway.

Ideally, you should use xor / set flags / setcc / read full register:

call  some_func
xor     ecx,ecx    ; zero *before* the test
test    eax,eax
setnz   cl         ; cl = (some_func() != 0)
add     ebx, ecx   ; no partial-register penalty here

This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies).

Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle, sete, and you either don't have a spare register, or you want to keep the xor out of the not-taken code path altogether.

There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It appears to be cheaper on SnB, but I didn't spend much time trying to measure. Using mov reg, 0 / setcc would have a significant penalty on older Intel CPUs, and still be somewhat worse on newer Intel.

Using setcc / movzx r32, r8 is probably the best alternative for Intel P6 & SnB families, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf / lahf or pushf / popf). IvB can eliminate movzx r32, r8 (i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). Haswell and later only eliminate regular mov instructions, so movzx takes an execution unit and has non-zero latency, making test/setcc/movzx worse than xor/test/setcc, but still at least as good as test/mov r,0/setcc (and much better on older CPUs).
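A sketch of that setcc / movzx pattern (register choices arbitrary), for when the flags are already set and no register was xor-zeroed in advance:

    test    edx, edx
    setnz   al            ; flag result into a byte register
    movzx   ecx, al       ; zero-extend into a full register (rename-eliminated on IvB, an ALU uop on HSW and later)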

Using setcc / movzx with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0/setcc for zeroing / dependency-breaking is probably the best alternative when xor/test/setcc isn't an option.
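A sketch of the mov reg,0 / setcc fallback (register choices arbitrary). It works because mov doesn't write FLAGS, so it can slot in between the flag-setting instruction and the setcc:

    test    eax, eax      ; or whatever flag-setting instruction you're working from
    mov     ecx, 0        ; doesn't touch FLAGS; breaks the dep on the old RCX value (but isn't a zeroing idiom)
    setnz   cl            ; no false dependency, at the cost of the longer mov encoding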

Of course, if you don't need setcc's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)

and with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor, and many disadvantages.

See Agner Fog's microarch guide for documentation of which zeroing idioms are recognized as dependency-breaking (e.g. sub same,same is on some but not all CPUs, while xor same,same is recognized on all). mov does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov works). xor only breaks dependency chains in the special case where src and dest are the same register, which is why mov is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)

Interestingly, the oldest P6 designs (PPro through Pentium III) didn't recognize xor-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls. So in some cases it was worth using both: mov to break the dependency, then xor-zeroing to zero again and set the internal tag bit that the high bits are zero, so EAX=AX=AL. (See Agner Fog's Example 6.17 in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul chain. This confirms Agner Fog's results, unfortunately.)
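On those early P6 CPUs, the use-both workaround described above would look like this (illustrative):

    mov     eax, 0        ; break the dependency on the old EAX value (mov always does)
    xor     eax, eax      ; zero again, setting the internal EAX=AX=AL tag to avoid partial-reg stalls
    mov     al, [esi]     ; a later byte write; reading EAX afterwards won't stall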

If it really makes your code nicer or saves instructions, then sure, zero with mov to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor, though.

@Z boson 2015-11-12 10:12:50

Interesting. So it's not really 100% free. I mean even though it does not use a port it still costs a micro-op. That's a subtlety I missed in Agner's manual. Thanks! So it has zero latency but throughput is 4 (or 0.25 reciprocal throughput).

@Ira Baxter 2015-11-12 10:41:05

Most arithmetic instructions OP R,S are forced by an out of order CPU to wait for the content of register R to be filled by previous instructions with register R as a target; this is a data dependency. The key point is that Intel/AMD chips have special hardware to break must-wait-for-data-dependencies on register R when XOR R,R is encountered, and does not necessarily do so for other register zeroing instructions. This means the XOR instruction can be scheduled for immediate execution, and this is why Intel/AMD recommend using it.

@Peter Cordes 2015-11-12 11:15:14

@IraBaxter: Yup, and just to avoid any confusion (because I have seen this misconception on SO), mov reg, src also breaks dep chains for OO CPUs (regardless of src being imm32, [mem], or another register). This dependency-breaking doesn't get mentioned in optimization manuals because it's not a special case that only happens when src and dest are the same register. It always happens for instructions that don't depend on their dest. (except for Intel's implementation of popcnt/lzcnt/tzcnt having a false dep on the dest.)

@Peter Cordes 2015-11-12 13:35:52

@Zboson: The "latency" of an instruction with no dependencies only matters if there was a bubble in the pipeline. It's nice for mov-elimination, but for zeroing instructions the zero-latency benefit only comes into play after something like a branch mispredict or I$ miss, where execution is waiting for the decoded instructions, rather than for data to be ready. But yes, mov-elimination doesn't make mov free, only zero latency. The "not taking an execution port" part usually isn't important. Fused-domain throughput can easily be the bottleneck, esp. with loads or stores in the mix.

@Z boson 2016-12-22 10:22:38

According to Agner, KNL does not recognize independence of 64-bit registers. So xor r64, r64 does not just waste a byte. As you say, xor r32, r32 is the best choice, especially with KNL. See section 15.7 "Special cases of independence" in his microarch manual if you want to read more.

@Peter Cordes 2016-12-22 10:35:27

@Zboson: I already started working on a KNL update for this answer after seeing that in Agner's update: need to point out that r32 is important even when it doesn't save a REX prefix (xor r8d,r8d). But then I got side-tracked setting up my new Skylake i7-6700k desktop with 16G of DDR4-2666 RAM :) Bit of an upgrade from 65nm Core2Duo, e.g. 8x faster video encoding with x264 (which has good tuning/optimization for Core2, unlike x265). I'll get back to that edit Real Soon Now, since I still have the text saved.

@Z boson 2016-12-22 10:40:03

I am getting a Skylake system soon as well. Boring! Well, not if you are coming from a Core2Duo. But now that I have access to KNL, I was highly disappointed in Skylake due to the lack of AVX512. So much that I quit thinking about x86 SIMD for a while. I know that internally Skylake changes the pipeline significantly over Broadwell (so I hear), even if the instruction set does not change, but still.

@Peter Cordes 2016-12-22 10:44:46

@Zboson: yeah, front-end bubbles should be a lot rarer with the increased uop-cache read bandwidth and increased legacy-decode throughput. (But the 4-uop/clock frontend max is still the same). More instructions running on more ports is potentially very cool. If I want to tune code to be good on recent Intel before SKL, I'm worried that SKL will avoid bottlenecks that HSW/BDW still have :/ I considered buying worse hardware :P We should take this to chat if you have any more to say. (I don't ATM). But I'd have to look how to create a chat when the comment thing doesn't offer the option.

@Z boson 2016-12-22 10:55:19

I would look into Function Multiversioning (FMV) with GCC if you want to optimize for SKL as well as other archs. See…

@Peter Cordes 2016-12-22 10:57:04

@Zboson: I meant that code I tested/tuned on SKL might fail to avoid performance pitfalls for HSW or SnB, regardless of instruction-set choice. The question isn't how to make different versions for HSW and SKL, it's how to tune the HSW version without testing on pre-SKL hardware.

@Z boson 2016-12-22 11:35:19

Okay, I understand what you mean now.

@BeeOnRope 2017-07-26 04:14:02

Here's an example of even the newest gcc somewhat stupidly issuing a mov reg, 0 rather than an xor for a simple function. Sure, it is probably doing that because it needs the flags preserved from the earlier cmp, but it could have just swapped the order! clang does fine, and icc also uses an xor, but only gets part marks because it pointlessly includes a mov esi, esi in the critical path.

@Peter Cordes 2017-07-26 04:22:29

@BeeOnRope: yeah, gcc does silly stuff like that sometimes. I wonder if it comes out of trying to save a register for cases where the cmp can't be deferred. (e.g. if it wanted to result in esi). In other cases of gcc making worse code (like setcc/movzx instead of xor/setcc), it usually looks like an idiom designed to reduce register pressure, used even when there isn't any.

@BeeOnRope 2017-07-26 04:47:13

Yeah, perhaps there is no feedback mechanism for the register pressure: it uses some "typical" tradeoff while generating code, and then when it gets to the end of a function it doesn't go back and relax the assumption that there might be register pressure.

@Peter Cordes 2017-07-26 04:57:03

@BeeOnRope: Yeah exactly. I'm not sure how "canned" some of its idioms are, but I imagine it's easier for a compiler to deal with setcc/movzx as a single thing that stays together, vs. having to add stuff to the function's internal representation to express "ok, we need an xor-zeroed register before the flag-setting", and maybe have a fallback in case that's hard to do. (Although in most cases you'd expect it would just end up redoing a test or cmp).

@hayalci 2017-12-29 00:24:53

ah, where's good old MIPS, with its "zero register" when you need it.

@Wes Turner 2017-12-29 05:29:18

Is there a way to do this with negative entropy (less heat) when the register value is known? "The thermodynamic meaning of negative entropy"

@Peter Cordes 2017-12-29 19:11:06

@WesTurner: Not in a regular x86 CPU using CMOS logic, like Intel Sandybridge-family. Running zeroing-idiom instructions uses about the same amount of power as running NOP instructions on SnB-family, cheaper than mov-immediate or a regular XOR, and probably less than any other instruction other than pause. But still much more than sitting in low-power sleep. Modern digital logic is very far from the information-theoretic limits of energy per computation, and nothing they do internally has negative cost.

@Evan Carroll 2018-10-19 00:44:34

I totally lose you when you get to setcc: what does that have to do with zeroing? How does it relate to the xor idiom?

@user784668 2018-12-16 10:43:01

@PeterCordes "See Agner Fog's Example 6.17. in his microarch pdf. He claims this also applies to P2, P3, and even (early?) PM, but I'm sceptical of that. A comment on the linked blog post says it was only PPro that had this oversight. It seems really unlikely that multiple generations of the P6 family existed without recognizing xor-zeroing as a dep breaker." I tested it on my Tualatin Pentium III and Dothan Pentium M by looping on 10× imul eax, eax/xor eax, eax, with the reasoning that if xor is dep-breaking then the loop will be throughput bound and latency bound if it's not…

@user784668 2018-12-16 10:48:21

…and on those CPUs the results are unambiguous: 50 cycles for each 22 instructions (i.e. one iteration), indicating a clear dependency chain; compare 10 cycles for each 22 instructions on more modern CPUs where xor is dependency breaking. So it's clear that Agner is correct here in that xor is not dependency breaking on Pentium II/III and Pentium M. It may have changed in Yonah, the last generation of Pentium M sold as Core Solo and Core Duo (note, not Core 2), but I don't have that hardware to test.
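The dependency test described in these comments would look something like this (a sketch reconstructed from the description; the label and counter are illustrative). If xor is dep-breaking, iterations overlap and the loop is throughput-bound; if not, every imul waits for the previous one's 5-cycle latency, giving the 50 cycles per iteration observed:

    mov     ecx, 100000000
LL:
    imul    eax, eax        ; imul/xor pair, repeated 10 times per iteration
    xor     eax, eax        ; if dep-breaking, the next imul can start immediately
    ; ... (the pair above repeated 9 more times: 22 instructions per iteration with dec/jnz)
    dec     ecx
    jnz     LL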

@Peter Cordes 2018-12-16 11:29:43

@Fanael: Thanks, I should have updated this a while ago. I checked on a Katmai PIII and found it wasn't dep-breaking a while ago, but never finished editing an update. Made one now to fix the two major things I left out.

@ecm 2019-09-27 19:34:06

"so in some cases it was worth using both." I am unclear on what the other one is meant as here, aside the xor-zeroing. Is it mov with immediate? And would you do them in the order mov reg, 0 \ xor reg, reg, or the other way around?

@Peter Cordes 2019-09-27 20:32:06

@ecm: Yes, zero the reg and break the dependency on the old value with mov reg, 0, then zero it again and set the internal EAX = AX = AL tag bit (avoiding partial-register stalls) with xor reg,reg. i.e. the "upper bits zeroed" tag. IDK why xor-zeroing didn't break the dependency; maybe recognizing that in the decoders or issue stage would have taken extra logic vs. handling it only in the boolean logic execution unit? But AFAIK it only worked with xor same,same, not just any xor that happened to produce 0, and the ALU doesn't know if its inputs came from a reg, mem, or immed.

@ecm 2019-09-27 20:37:15

Thanks! Will you edit your answer to make that part clearer?

@Peter Cordes 2019-09-27 20:44:29

@ecm: I think the context for that "both" got lost in previous edits; fixed. Thanks for pointing it out.
