By BeeOnRope


2019-02-04 22:00:33 8 Comments

Consider the following simple program:

#include <cstring>
#include <cstdio>
#include <cstdlib>

void replace(char *str, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str[i] == '/') {
            str[i] = '_';
        }
    }
}

const char *global_str = "the quick brown fox jumps over the lazy dog";

int main(int argc, char **argv) {
  const char *str = argc > 1 ? argv[1] : global_str;
  replace(const_cast<char *>(str), std::strlen(str));
  puts(str);
  return EXIT_SUCCESS;
}

It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the c_repl function1. For example, a.out foo/bar prints:

foo_bar

Elementary stuff so far, right?

If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.

Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.

gcc and clang compile a binary that has the expected behavior, with or without passing a string on the command line. icc crashes, when you don't provide a string, however:

icc -xcore-avx2 char_replace.cpp && ./a.out
Segmentation fault (core dumped)

The underlying cause is the main loop for c_repl which looks like this:

  400c0c:       vmovdqu ymm2,YMMWORD PTR [rsi]
  400c10:       add    rbx,0x20
  400c14:       vpcmpeqb ymm3,ymm0,ymm2
  400c18:       vpblendvb ymm4,ymm2,ymm1,ymm3
  400c1e:       vmovdqu YMMWORD PTR [rsi],ymm4
  400c22:       add    rsi,0x20
  400c26:       cmp    rbx,rcx
  400c29:       jb     400c0c <main+0xfc>

It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.

This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!

Is my code legal C++ and therefore I should blame icc for miscompiling this, or I am wading into the UB swamp somewhere?


1 The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.

1 comments

@Peter Cordes 2019-02-05 05:38:12

Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false) can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.

Compilers aren't in general allowed to invent writes with if-conversion to branchless code.

Casting away const is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.


ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.


We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.

Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.

Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)

#include <stddef.h>

void replace(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str1[i] == '/') {
            str2[i] = '_';
        }
    }
}

It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:

..B1.15:                        # Preds ..B1.15 ..B1.14                //do{
        vmovdqu   ymm2, YMMWORD PTR [rsi+r8]                    #6.13   // load from str2
        vpcmpeqb  ymm3, ymm0, YMMWORD PTR [rdi+r8]              #5.24   // compare vs. str1
        vpblendvb ymm4, ymm2, ymm1, ymm3                        #6.13   // blend
        vmovdqu   YMMWORD PTR [r8+rsi], ymm4                    #6.13   // store to str2
        add       r8, 32                                        #4.5    // i+=32
        cmp       r8, rax                                       #4.5
        jb        ..B1.15       # Prob 82%                      #4.5   // }while(i<len);

For thread-safety, it's well known that inventing write via non-atomic read/rewrite is unsafe.

The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.

But even worse than that, str2 can be nullptr. Or pointing to close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page will happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.

Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for a byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)

(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)

As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.

(However, I'm pretty confident that Intel compiler devs would consider this a bug.)


Some less-plausible (to see in a real program) examples that this would also break:

Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).

Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).


For future readers: If you want to let compilers auto-vectorize this way:

You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.

In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).

Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).

vpcmpeqb k1, zmm0, [rdi]   ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1   ; masked store that only writes elements where the mask is true

ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.

So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.


AVX1 has masked stores (vmaskmovps/pd) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q.

@BeeOnRope 2019-02-05 06:08:27

I don't think you even need to use the char aliasing std::atomic objects to come up with a memory model problem: imagine the scenario where some thread T2 has properly synchronized exclusive access to str2 (eg because it was published from T1 to T2 via some mechanism), and is doing writes. Meanwhile T1 is running the replace function with str2, but this is allowed because you've arranged that no accessed occur: but the invented writes clobber stuff T2 is doing!

@BeeOnRope 2019-02-05 06:12:50

Alternately consider the case where T1 and T2 share some char array, but by convention T1 only ever accesses odd elements and T2 even. This is allowed because array elements are separate objects per the memory model (eg this means byte writes can't use a RMW of a larger word). Now either or both of T1/2 use the shared array as str2 organizing via str1 to only access "their" elements of str2. Seems legal. It will blow up on occasion tho because the merge is effectively a wide RMW which will clobber adjacent elements. No std::atomic aliasing through a char needed!

@Peter Cordes 2019-02-05 07:14:29

@BeeOnRope: oh yes, duh, another thread could simply be running the same search/replace with a key that's known to write a disjoint set of elements. Forgot about that case.

@BeeOnRope 2019-02-18 16:10:39

This a good answer, but I'm having trouble accepting it because I feel it lacks a clear answer to the headline question. Maybe a couple sentences at the start to answer the question? The first sentence is a bit confusing to me, maybe it could be clarified. I guess you are saying something like "The way ICC compiles this code is not evidence for UB in your code in C++...", right?

@Peter Cordes 2019-02-18 18:39:03

@BeeOnRope: good point. Added a section at the start. It might possibly be an overstatement that an if(false) can save you from everything except ill-formed code, but I can't think of a counterexample.

Related Questions

Sponsored Content

1 Answered Questions

[SOLVED] Can cython be compiled with icc?

1 Answered Questions

5 Answered Questions

[SOLVED] Why does std::ends cause string comparison to fail?

  • 2009-04-29 23:53:58
  • Andrew Potapov
  • 2165 View
  • 0 Score
  • 5 Answer
  • Tags:   c++ stl boost

6 Answered Questions

[SOLVED] c++ mysql sqlstring crashes 0xC0000005 unable to read memory

  • 2010-11-19 14:14:34
  • Miquel
  • 2722 View
  • 0 Score
  • 6 Answer
  • Tags:   c++ mysql string

2 Answered Questions

How can I see which compilation options are enabled on Intel ICC compiler?

  • 2015-12-16 11:13:34
  • Axel Borja
  • 1146 View
  • 11 Score
  • 2 Answer
  • Tags:   c++ g++ icc

0 Answered Questions

1 Answered Questions

1 Answered Questions

[SOLVED] C++ replace string

  • 2014-03-30 16:34:33
  • user3478522
  • 142 View
  • -2 Score
  • 1 Answer
  • Tags:   c++

1 Answered Questions

[SOLVED] libvlc-qt realloc error executing example program of libvlc-qt

  • 2013-10-04 07:13:36
  • pinco pallo pino
  • 770 View
  • 3 Score
  • 1 Answer
  • Tags:   c++ qt vlc

1 Answered Questions

[SOLVED] Can I compile by using ICC for several CPU architectures?

  • 2013-09-09 17:24:56
  • Alex
  • 481 View
  • 1 Score
  • 1 Answer
  • Tags:   c++ c x86-64 icc

Sponsored Content