By Adrian


2010-07-24 01:31:52 8 Comments

As far as I can tell, the only difference between __asm { ... }; and __asm__("..."); is that the first uses mov eax, var and the second uses movl %0, %%eax with :"=r" (var) at the end. What other differences are there? And what about just asm?

4 comments

@Ciro Santilli 新疆改造中心法轮功六四事件 2018-04-14 11:35:36

asm vs __asm__ in GCC

asm does not work with -std=c99. You have two alternatives:

  • use __asm__
  • use -std=gnu99

More details: error: ‘asm’ undeclared (first use in this function)
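As a minimal sketch of the first alternative, the function below uses the __asm__ spelling, so it should build with gcc -std=c99 as well as -std=gnu99. The name value_barrier is made up for illustration; the empty template emits no instructions, so the example does not depend on the target architecture.

```c
/* Sketch only: value_barrier is a hypothetical helper, not a GCC API.
   Builds with `gcc -std=c99` because it spells the keyword __asm__;
   plain `asm` is not a keyword in that mode. */
int value_barrier(int x)
{
    /* Empty template: no instructions are emitted. The "+r" constraint
       says x is both read and written in a register, so the optimizer
       has to treat the value as unknown after this statement. */
    __asm__ ("" : "+r" (x));
    return x;
}
```

The value comes back unchanged; only the compiler's knowledge of it is affected.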

__asm vs __asm__ in GCC

I could not find where __asm is documented (notably not mentioned at https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Alternate-Keywords.html#Alternate-Keywords ), but from the GCC 8.1 source they are exactly the same:

  { "__asm",        RID_ASM,    0 },
  { "__asm__",      RID_ASM,    0 },

so I would just use __asm__ which is documented.

@Peter Cordes 2018-05-28 22:19:48

The OP is asking about __asm { ... }; (with braces, not parens), so it's definitely not GNU C inline asm. Clang and MSVC support MSVC-style inline asm, but gcc doesn't.

@Ciro Santilli 新疆改造中心法轮功六四事件 2018-05-28 22:25:41

@PeterCordes thanks for the info. Partly answering this here because of GCC tag, and partly because of the Google hit theory. Now looking at history the GCC tag was not added by the OP, but not by me either ;-)

@Ciro Santilli 新疆改造中心法轮功六四事件 2018-05-28 22:31:28

@PeterCordes can you also add the appropriate microsoft compiler tag to the question in addition to gcc? I don't know which one it is thankfully :-)

@Peter Cordes 2018-05-28 22:38:56

visual-c++ is the tag for Microsoft's dialect of C++, implemented by MSVC.

@Peter Cordes 2016-03-12 15:53:58

There's a massive difference between MSVC inline asm and GNU C inline asm. GCC syntax is designed for optimal output without wasted instructions, for example when wrapping a single instruction. MSVC syntax is designed to be fairly simple, but AFAICT it's impossible to use without the latency and extra instructions of a round trip through memory for your inputs and outputs.

If you're using inline asm for performance reasons, this makes MSVC inline asm only viable if you write a whole loop entirely in asm, not for wrapping short sequences in an inline function. The example below (wrapping idiv with a function) is the kind of thing MSVC is bad at: ~8 extra store/load instructions.

MSVC inline asm (used by MSVC and probably icc, maybe also available in some commercial compilers):

  • looks at your asm to figure out which registers your code steps on.
  • can only transfer data via memory. Data that was live in registers is stored by the compiler to prepare for your mov ecx, shift_count, for example. So using a single asm instruction that the compiler won't generate for you involves a round-trip through memory on the way in and on the way out.
  • more beginner-friendly, but often impossible to avoid overhead getting data in/out. Even besides the syntax limitations, the optimizer in current versions of MSVC isn't good at optimizing around inline asm blocks, either.

GNU C inline asm is not a good way to learn asm. You have to understand asm very well so you can tell the compiler about your code. And you have to understand what compilers need to know. That answer also has links to other inline-asm guides and Q&As. The tag wiki has lots of good stuff for asm in general, but just links to that for GNU inline asm. (The stuff in that answer is applicable to GNU inline asm on non-x86 platforms, too.)

GNU C inline asm syntax is used by gcc, clang, icc, and maybe some commercial compilers which implement GNU C:

  • You have to tell the compiler what you clobber. Failure to do this will lead to breakage of surrounding code in non-obvious hard-to-debug ways.
  • Powerful but hard to read, learn, and use syntax for telling the compiler how to supply inputs, and where to find outputs. e.g. "c" (shift_count) will get the compiler to put the shift_count variable into ecx before your inline asm runs.
  • extra clunky for large blocks of code, because the asm has to be inside a string constant. So you typically need

    "insn   %[inputvar], %%reg\n\t"       // comment
    "insn2  %%reg, %[outputvar]\n\t"
    
  • very unforgiving / harder, but allows lower overhead esp. for wrapping single instructions. (wrapping single instructions was the original design intent, which is why you have to specially tell the compiler about early clobbers to stop it from using the same register for an input and output if that's a problem.)
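To make the bullets above concrete, here is a hedged sketch of two GNU C wrappers. rotl32 uses the "c" constraint to get shift_count into cl for a single rol instruction, and sum3 shows the multi-line string style plus an early-clobber ("=&r") output. Both function names are invented for this example, the asm assumes the default AT&T syntax on x86, and the #else branches are plain-C fallbacks so the file still builds on other targets.

```c
#include <stdint.h>

/* Rotate left by a variable count. The x86 rol instruction takes a
   variable count only in cl, which the "c" constraint requests. */
static inline uint32_t rotl32(uint32_t value, uint8_t shift_count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("roll %%cl, %[val]"
             : [val] "+r" (value)      /* read-write, any register */
             : "c" (shift_count)       /* input forced into ecx/cl */
             : "cc");                  /* flags are modified */
    return value;
#else
    /* Portable fallback (assumption: not part of the original answer). */
    shift_count &= 31;
    return (value << shift_count) | (value >> ((32 - shift_count) & 31));
#endif
}

/* Three instructions in one asm statement. out is written by the first
   instruction while b and c are still to be read, so it is marked
   early-clobber ("&") to keep it out of the registers holding b and c. */
static inline uint32_t sum3(uint32_t a, uint32_t b, uint32_t c)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t out;
    __asm__ ("movl  %[a], %[out]\n\t"   /* out = a   */
             "addl  %[b], %[out]\n\t"   /* out += b  */
             "addl  %[c], %[out]"       /* out += c  */
             : [out] "=&r" (out)
             : [a] "r" (a), [b] "r" (b), [c] "r" (c)
             : "cc");
    return out;
#else
    return a + b + c;
#endif
}
```

In real code neither wrapper is worth it, since compilers already emit rol and add from plain C; they are here only to show the constraint syntax.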


Example: full-width integer division (div)

On a 32bit CPU, dividing a 64bit integer by a 32bit integer, or doing a full-multiply (32x32->64), can benefit from inline asm. gcc and clang don't take advantage of idiv for (int64_t)a / (int32_t)b, probably because the instruction faults if the result doesn't fit in a 32bit register. So unlike this Q&A about getting quotient and remainder from one div, this is a use-case for inline asm. (Unless there's a way to inform the compiler that the result will fit, so idiv won't fault.)
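The "result doesn't fit" condition can be spelled out in portable C. The helper below is a sketch with an invented name (quotient_fits_int32); it reports whether a 64-bit by 32-bit signed division has a quotient that a single 32-bit idiv could return without faulting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if dividend / divisor is representable in
   32 bits, i.e. the case where a 64/32 -> 32 idiv would not fault. */
bool quotient_fits_int32(int64_t dividend, int32_t divisor)
{
    if (divisor == 0)
        return false;                 /* divide error */
    if (dividend == INT64_MIN && divisor == -1)
        return false;                 /* overflows even in 64-bit C */
    int64_t q = dividend / divisor;   /* full 64-bit division */
    return q >= INT32_MIN && q <= INT32_MAX;
}
```

A compiler cannot normally prove this condition for arbitrary operands, which is why it falls back to a full 64-bit division.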

We'll use calling conventions that put some args in registers (with hi even in the right register), to show a situation that's closer to what you'd see when inlining a tiny function like this.


MSVC

Be careful with register-arg calling conventions when using inline-asm. Apparently the inline-asm support is so badly designed/implemented that the compiler might not save/restore arg registers around the inline asm, if those args aren't used in the inline asm. Thanks @RossRidge for pointing this out.

// MSVC.  Be careful with __vectorcall & inline-asm: see above
// we could return a struct, but that would complicate things
int __vectorcall div64(int hi, int lo, int divisor, int *premainder) {
    int quotient, tmp;
    __asm {
        mov   edx, hi
        mov   eax, lo
        idiv  divisor
        mov   quotient, eax
        mov   tmp, edx
        // mov ecx, premainder   // Or this I guess?
        // mov   [ecx], edx
    }
    *premainder = tmp;
    return quotient;     // or omit the return with a value in eax
}

Update: apparently leaving a value in eax or edx:eax and then falling off the end of a non-void function (without a return) is supported, even when inlining. I assume this works only if there's no code after the asm statement. This avoids the store/reloads for the output (at least for quotient), but we can't do anything about the inputs. In a non-inline function with stack args, they will be in memory already, but in this use-case we're writing a tiny function that could usefully inline.
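A sketch of that fall-off-the-end idiom, under the assumption that it behaves as described above: the function name div64_quot is invented, the MSVC branch only applies to 32-bit x86 builds, and the #else branch is a plain-C stand-in so the example also compiles elsewhere.

```c
#include <stdint.h>

/* Hypothetical variant that returns only the quotient. */
int div64_quot(int hi, int lo, int divisor)
{
#if defined(_MSC_VER) && defined(_M_IX86)
    __asm {
        mov   edx, hi
        mov   eax, lo
        idiv  divisor
    }
    /* No return statement: the quotient is already in eax.
       MSVC may report warning C4035 (no return value) here. */
#else
    /* Fallback for other compilers: rebuild the 64-bit dividend. */
    int64_t dividend =
        (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo);
    return (int)(dividend / divisor);
#endif
}
```

The inputs are still loaded from named variables inside the block, so this only removes the store/reload on the output side.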


Compiled with MSVC 19.00.23026 /O2 on rextester (with a main() that finds the directory of the exe and dumps the compiler's asm output to stdout).

## My added comments use ## as a marker. ##
; ... define some symbolic constants for stack offsets of parameters
; 48   : int ABI div64(int hi, int lo, int divisor, int *premainder) {
    sub esp, 16                 ; 00000010H
    mov DWORD PTR _lo$[esp+16], edx      ## these symbolic constants match up with the names of the stack args and locals
    mov DWORD PTR _hi$[esp+16], ecx

    ## start of __asm {
    mov edx, DWORD PTR _hi$[esp+16]
    mov eax, DWORD PTR _lo$[esp+16]
    idiv    DWORD PTR _divisor$[esp+12]
    mov DWORD PTR _quotient$[esp+16], eax  ## store to a local temporary, not *premainder
    mov DWORD PTR _tmp$[esp+16], edx
    ## end of __asm block

    mov ecx, DWORD PTR _premainder$[esp+12]
    mov eax, DWORD PTR _tmp$[esp+16]
    mov DWORD PTR [ecx], eax               ## I guess we should have done this inside the inline asm so this would suck slightly less
    mov eax, DWORD PTR _quotient$[esp+16]  ## but this one is unavoidable
    add esp, 16                 ; 00000010H
    ret 8

There's a ton of extra mov instructions, and the compiler doesn't even come close to optimizing any of it away. I thought maybe it would see and understand the mov tmp, edx inside the inline asm, and make that a store to premainder. But that would require loading premainder from the stack into a register before the inline asm block, I guess.

This function is actually worse with __vectorcall than with the normal everything-on-the-stack ABI. With two inputs in registers, it stores them to memory so the inline asm can load them from named variables. If this were inlined, even more of the parameters could potentially be in registers, and it would have to store them all so the asm would have memory operands! So unlike gcc, we don't gain much from inlining this.

Doing *premainder = tmp inside the asm block means more code written in asm, but does avoid the totally braindead store/load/store path for the remainder. This reduces the instruction count by 2 total, down to 11 (not including the ret).

I'm trying to get the best possible code out of MSVC, not "use it wrong" and create a straw-man argument. But AFAICT it's horrible for wrapping very short sequences. Presumably there's an intrinsic function for 64/32 -> 32 division that allows the compiler to generate good code for this particular case, so the entire premise of using inline asm for this on MSVC could be a straw-man argument. But it does show you that intrinsics are much better than inline asm for MSVC.


GNU C (gcc/clang/icc)

gcc does even better than the output shown here when inlining div64, because it can typically arrange for the preceding code to generate the 64bit integer in edx:eax in the first place.

I can't get gcc to compile for the 32bit vectorcall ABI. Clang can, but it sucks at inline asm with "rm" constraints (try it on the godbolt link: it bounces the function arg through memory instead of using the register option in the constraint). The 64bit MS calling convention is close to the 32bit vectorcall, with the first two params in ecx, edx. The difference is that 2 more params go in regs before using the stack (and that the callee doesn't pop the args off the stack, which is what the ret 8 was about in the MSVC output.)

// GNU C
// change everything to int64_t to do 128b/64b -> 64b division
// MSVC doesn't do x86-64 inline asm, so we'll use 32bit to be comparable
int div64(int lo, int hi, int *premainder, int divisor) {
    int quotient, rem;
    asm ("idivl  %[divsrc]"
          : "=a" (quotient), "=d" (rem)    // a means eax,  d means edx
          : "d" (hi), "a" (lo),
            [divsrc] "rm" (divisor)        // Could have just used %4 instead of naming divsrc
            // note the "rm" to allow the src to be in a register or not, whatever gcc chooses.
            // "rmi" would also allow an immediate, but unlike adc, idiv doesn't have an immediate form
          : // no clobbers
        );
    *premainder = rem;
    return quotient;
}

Compiled with gcc -m64 -O3 -mabi=ms -fverbose-asm. With -m32 you just get 3 loads, idiv, and a store, as you can see from changing stuff in that godbolt link.

mov     eax, ecx  # lo, lo
idivl  r9d      # divisor
mov     DWORD PTR [r8], edx       # *premainder_7(D), rem
ret

For 32bit vectorcall, gcc would do something like

## Not real compiler output, but probably similar to what you'd get
mov     eax, ecx               # lo, lo
mov     ecx, [esp+4]           # premainder (first stack arg, just above the return address)
idivl   [esp+8]                # divisor
mov     DWORD PTR [ecx], edx   # *premainder_7(D), rem
ret   8

MSVC uses 13 instructions (not including the ret), compared to gcc's 4. With inlining, as I said, it potentially compiles to just one, while MSVC would still use probably 9. (It won't need to reserve stack space or load premainder; I'm assuming it still has to store about 2 of the 3 inputs. Then it reloads them inside the asm, runs idiv, stores two outputs, and reloads them outside the asm. So that's 4 loads/stores for input, and another 4 for output.)
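The MSVC example mentions that returning a struct would complicate things; with GNU C constraints it is straightforward. The following is only a sketch: divres32 and div64_32 are names invented here, the caller is assumed to guarantee that the quotient fits in 32 bits (otherwise idiv faults), and the #else branch is a plain-C fallback for non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical result type: both outputs of one idiv. */
typedef struct {
    int32_t quot;
    int32_t rem;
} divres32;

/* 64-bit / 32-bit -> 32-bit signed division with a single idiv.
   Precondition (not checked): divisor != 0 and the quotient fits
   in int32_t. */
static inline divres32 div64_32(int64_t dividend, int32_t divisor)
{
    divres32 r;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int32_t  hi = (int32_t)(dividend >> 32);   /* goes in edx */
    uint32_t lo = (uint32_t)dividend;          /* goes in eax */
    int32_t  q, rem;
    __asm__ ("idivl  %[divsrc]"
             : "=a" (q), "=d" (rem)            /* quotient, remainder */
             : "a" (lo), "d" (hi),
               [divsrc] "rm" (divisor)         /* register or memory */
             : "cc");                          /* flags are undefined after idiv */
    r.quot = q;
    r.rem  = rem;
#else
    r.quot = (int32_t)(dividend / divisor);
    r.rem  = (int32_t)(dividend % divisor);
#endif
    return r;
}
```

For example, div64_32(100, 7) yields a quotient of 14 and a remainder of 2. When the function inlines, the compiler is free to keep both fields in registers, so the struct adds no memory traffic by itself.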

@Ira Baxter 2016-03-12 15:57:33

Great answer!!!

@Ross Ridge 2016-03-12 16:52:13

I'd say Microsoft's inline assembly is less forgiving. It was never very reliable and it was always a bit of a crapshoot whether a new compiler version would break your code. Apparently it's also a huge maintenance nightmare for Microsoft, trying to ensure new compiler versions don't break existing code, and that's why they dropped support for it for their most recently supported architectures (IA-64, AMD64, ARM). In particular your MSVC example is dangerous, as it mixes a register based calling convention with inline assembly: msdn.microsoft.com/en-us/library/k1a8ss06.aspx.

@Peter Cordes 2016-03-12 16:55:15

@Ross: I assumed they dropped it for amd64 because it can't create good code, and vector intrinsics were good enough. Thanks for the heads up. Not that I ever would have recommended that anyone use MSVC inline asm for anything in the first place.

@Peter Cordes 2016-03-12 17:00:39

@Ross: Hmm, from my reading of that link, my specific use is not dangerous, because both register parameters are inputs to the inline asm (so they get stored to the stack). The general suggestion that this is a good example to modify into other code is dangerous, though. IDK how the compiler can check what regs you use and not see it should save args / restore args that will be clobbered by the asm, or put them in regs the asm doesn't use. "because a function has no way to tell which parameter is in which register" is nonsense. Compilers have to know this to make code from C source...

@Ross Ridge 2016-03-12 17:51:30

@PeterCordes Yah, for your specific example it should work. The fact that the problem exists demonstrates just how crudely inline assembler is implemented in Microsoft's compiler. I don't think the compiler even warns you that you're using a register that it needs for something else.

@Cody Gray 2016-12-16 10:05:34

For what it's worth, here's how I would write the function in inline asm for MSVC. The disassembly is shown at the bottom. I compiled the function with __stdcall, but __cdecl works just as well. Don't use __vectorcall or __fastcall; they only pessimize inline asm. Notice that the output is pretty much identical to GCC's, except for the need to explicitly load the parameters from the stack into registers. That's the hard limitation of MSVC's inline asm: it's completely unavoidable and inevitably leads to sub-optimal code.

@Cody Gray 2016-12-16 10:09:00

Note that this whole business does get inlined by the compiler, inline asm and all. Here's an example of that, with a stub function that exercises it. Again, the only horribleness is the way that parameters get passed in to the block. If you write it correctly, you can avoid all penalties for return values out of the block. Of course, the compiler doesn't parse the inline asm instructions, so it doesn't know enough to elide the setting of the remainder, even though the Test function never uses it. But I'm not sure that GCC does this, either.

@Cody Gray 2016-12-16 10:18:15

Unfortunately, as far as I'm aware, there is no intrinsic that will do this operation. There is a nice intrinsic for 32-bit multiplication, producing a 64-bit result, and a variety of other arithmetic and bit-shifting operations, but none for division. There is MulDiv, but that's an API, not an intrinsic, and saddled with certain legacy behaviors for edge cases. MSVC supports a 64-bit int type for 32-bit builds, and it works sensibly for some...

@Cody Gray 2016-12-16 10:23:35

...operations, but multiply, divide, modulo, and shift are not among them. These always result in calls to CRT functions (_allmul, _alldiv, _allrem, _allshl, _allshr, and their unsigned equivalents). Worse, those CRT functions are not as efficient as I'd like, even though they're implemented in assembly. They're never inlined, no matter what (which is apparently fairly common; GCC and Clang don't generally inline their lib functions, either), so you can't possibly do worse by writing your own inline asm. Especially because it avoids a function call, and here, can be much simpler.

@Peter Cordes 2016-12-16 14:29:10

@CodyGray: Thanks for the clarification. So falling off the bottom of a non-void function after inline asm leaves something in EAX is a supported idiom that MSVC can optimize around? (ah, you commented on that here) That does make it a lot less horrible than I thought. I'll have to revise some of this answer.

@Cody Gray 2016-12-16 14:36:35

Yes, totally supported and totally optimizable. Now, if you were to ask me to produce some documentation supporting this claim...I'd have to hunt around. There isn't something I can point to off the top of my head. Most of my knowledge comes from actually writing code, tweaking it, analyzing disassembly, and actually using it in projects that have extensive assertions for correctness validation. But I think the fact that it used to emit a warning but no longer does is compelling evidence that it's recognizing the idiom. Not hard to see how it could, since all calling conventions work this way.

@Peter Cordes 2016-12-16 14:45:06

@CodyGray: The thing that's weird for me is that it's "inline" asm, not necessarily "whole function body in asm". So there's no way to take advantage of this without wrapping each block of inline asm in a separate inline function (which I'd probably want to do anyway, but still). You couldn't write a second inline-asm block inside a function and use EAX as an input register, could you? (Maybe you could, but it sounds like a terrible idea :P) I guess supporting this makes sense as an optimization after they realize they didn't design any clean and efficient way to communicate results.

@Cody Gray 2016-12-16 14:48:52

I think I mentioned this in another comment, but the thing to keep in mind is that this is a very old feature from Microsoft's (or perhaps Lattice's?) first C compilers, where it was primarily used to call BIOS interrupts and stuff like that. So it was "inline", and often was only one or two instructions rather than an entire block, but no one cared about optimization, or returning values, or whatever. They've entirely let it languish, and don't even support inline asm on 64-bit targets. Yes, I always use it in an inline fxn. Two inline asm blocks in a single function => bad results.

@Evan Carroll 2018-03-04 21:48:46

You are a mad mad genius with lvl 9,000 explanatory powers @PeterCordes.

@oDisPo 2012-06-07 20:26:44

With the gcc compiler, there's no big difference. asm, __asm, and __asm__ are the same; the underscored spellings exist only to avoid namespace conflicts (for example, with a user-defined function named asm).

@Ben Voigt 2010-07-24 01:33:31

Which one you use depends on your compiler. This isn't standard like the C language.

@Steven Sudit 2010-07-24 01:35:29

Uhm, yeah. The whole point of those leading underbars is to make it clear that this is non-standard.

@Manav 2011-03-02 15:16:14

the first one works in VC++, the second one in gcc
