By Devon


2019-02-05 14:15:27 8 Comments

I was solving a problem on Codeforces. Normally I first check whether the character is an upper- or lowercase English letter, then subtract or add 32 to convert it to the corresponding letter. But I found someone using ^= 32 to do the same thing. Here it is:

char foo = 'a';
foo ^= 32;
char bar = 'A';
bar ^= 32;
cout << foo << ' ' << bar << '\n'; // foo is A, and bar is a

I have searched for an explanation for this and didn't find one. So why does this work?

10 comments

@Damon 2019-02-06 11:43:34

Allow me to say that this is -- although it seems smart -- a really, really stupid hack. If someone recommends this to you in 2019, hit him. Hit him as hard as you can.
You can, of course, do it in your own software that you and nobody else uses if you know that you will never use any language but English anyway. Otherwise, no go.

The hack was arguably "OK" some 30-35 years ago when computers didn't really do much but English in ASCII, and maybe one or two major European languages. But... no longer so.

The hack works because US-Latin upper- and lowercases are exactly 0x20 apart from each other and appear in the same order, which is just one bit of difference. Which, in fact, this bit hack, toggles.

Now, the people creating code pages for Western Europe, and later the Unicode consortium, were smart enough to keep this scheme for e.g. German umlauts and French accented vowels. Not so for ß which (until someone convinced the Unicode consortium in 2017, and a large Fake News print magazine wrote about it, actually convincing the Duden -- no comment on that) didn't even exist as a versal (it transforms to SS). Now it does exist as a versal, but the two are 0x1DBF positions apart, not 0x20.

The implementors were, however, not considerate enough to keep this going. For example, if you apply your hack in some East European languages or the like (I wouldn't know about Cyrillic), you will get a nasty surprise. All those "hatchet" characters are examples of that, lowercase and uppercase are one apart. The hack thus does not work properly there.

There's much more to consider, for example, some characters do not simply transform from lower- to uppercase at all (they're replaced with different sequences), or they may change form (requiring different code points).

Do not even think about what this hack will do to stuff like Thai or Chinese (it'll just give you complete nonsense).

Saving a couple of hundred CPU cycles may have been very worthwhile 30 years ago, but nowadays there is really no excuse for not converting a string properly. There are library functions for performing this non-trivial task.
The time taken to convert several dozen kilobytes of text properly is negligible nowadays.

@Bill K 2019-02-06 17:58:23

I totally agree--although it is a good idea for every programmer to know why it works--might even make a good interview question.. What does this do and when should it be used :)

@Peter Cordes 2019-02-08 03:06:56

The lower-case and upper-case alphabetic ranges don't cross a %32 "alignment" boundary in the ASCII coding system.

This is why bit 0x20 is the only difference between the upper/lower case versions of the same letter.

If this wasn't the case, you'd need to add or subtract 0x20, not just toggle, and for some letters there would be carry-out to flip other higher bits. (And there wouldn't be a single operation that could toggle, and checking for alphabetic characters in the first place would be harder because you couldn't |= 0x20 to force lcase.)


Related ASCII-only tricks: you can check for an alphabetic ASCII character by forcing lowercase with c |= 0x20 and then checking if (unsigned) c - 'a' <= ('z'-'a'). So just 3 operations: OR + SUB + CMP against a constant 25. Of course, compilers know how to optimize (c>='a' && c<='z') into asm like this for you, so at most you should do the c|=0x20 part yourself. It's rather inconvenient to do all the necessary casting yourself, especially to work around default integer promotions to signed int.

unsigned char lcase = c | 0x20;             // force the case bit on (lowercase if c is a letter)
if (lcase - 'a' <= (unsigned)('z'-'a')) {   // lcase-'a' will wrap for characters below 'a'
    // c is alphabetic ASCII
}
// else it's not
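As a minimal sketch of how that check can guard the case flip, assuming an ASCII execution character set, the two helpers below (is_ascii_alpha and flip_ascii_case are hypothetical names, not from any library) leave every non-letter byte untouched:

```cpp
#include <string>

// Assumes ASCII. True only for A-Z and a-z; bytes >= 0x80 are rejected too.
inline bool is_ascii_alpha(unsigned char c) {
    unsigned char lcase = c | 0x20;  // force the case bit on
    // For anything below 'a' the subtraction goes negative and wraps to a huge unsigned value.
    return static_cast<unsigned>(lcase - 'a') <= static_cast<unsigned>('z' - 'a');
}

// Flip the case of ASCII letters only; punctuation, digits and non-ASCII bytes pass through.
inline std::string flip_ascii_case(std::string s) {
    for (char& ch : s) {
        if (is_ascii_alpha(static_cast<unsigned char>(ch))) {
            ch ^= 0x20;
        }
    }
    return s;
}
```

Calling flip_ascii_case("Hello, World! [42]") should give "hELLO, wORLD! [42]": the brackets and digits are not touched because they fail the range check.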

See also Convert a String In C++ To Upper Case (SIMD string toupper for ASCII only, masking the operand for XOR using that check.)

And also How to access a char array and change lower case letters to upper case, and vice versa (C with SIMD intrinsics, and scalar x86 asm case-flip for alphabetic ASCII characters, leaving others unmodified.)


These tricks are mostly only useful if hand-optimizing some text-processing with SIMD (e.g. SSE2 or NEON), after checking that none of the chars in a vector have their high bit set. (And thus none of the bytes are part of a multi-byte UTF-8 encoding for a single character, which might have different upper/lower-case inverses). If you find any, you can fall back to scalar for this chunk of 16 bytes, or for the rest of the string.

There are even some locales where toupper() or tolower() on some characters in the ASCII range produce characters outside that range, notably Turkish where I ↔ ı and İ ↔ i. In those locales, you'd need a more sophisticated check, or you should probably not try to use this optimization at all.


But in some cases, you're allowed to assume ASCII instead of UTF-8, e.g. Unix utilities with LANG=C (the POSIX locale), not en_CA.UTF-8 or whatever.

But if you can verify it's safe, you can toupper medium-length strings much faster than calling toupper() in a loop (like 5x), and last I tested with Boost 1.58, much much faster than boost::to_upper_copy<char*, std::string>() which does a stupid dynamic_cast for every character.

@YSC 2019-02-05 14:25:51

This uses the fact that ASCII values have been chosen by really smart people.

foo ^= 32;

This flips the 6th lowest bit [1] of foo (the uppercase flag of ASCII, sort of), transforming an ASCII upper case to a lower case and vice versa.

+---+------------+------------+
|   | Upper case | Lower case |  32 is 00100000
+---+------------+------------+
| A | 01000001   | 01100001   |
| B | 01000010   | 01100010   |
|            ...              |
| Z | 01011010   | 01111010   |
+---+------------+------------+

Example

'A' ^ 32

    01000001 'A'
XOR 00100000 32
------------
    01100001 'a'

And by property of XOR, 'a' ^ 32 == 'A'.
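The same worked example can be checked at compile time. This is only a sketch that assumes an ASCII execution character set; kCaseBit and toggle_case_bit are names made up for illustration:

```cpp
// Assumes ASCII. kCaseBit is a hypothetical name for the bit with value 32 (1 << 5).
constexpr char kCaseBit = 0x20;

// Flip that one bit. Only meaningful as a case change for A-Z and a-z.
constexpr char toggle_case_bit(char c) {
    return static_cast<char>(c ^ kCaseBit);
}

static_assert(toggle_case_bit('A') == 'a', "01000001 ^ 00100000 == 01100001");
static_assert(toggle_case_bit('a') == 'A', "01100001 ^ 00100000 == 01000001");
static_assert(toggle_case_bit(toggle_case_bit('q')) == 'q', "XOR with the same mask twice is a no-op");
```

Because the function flips the bit unconditionally, it also maps '[' to '{', which is why real code needs a letter check first.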

Notice

C++ is not required to use ASCII to represent characters. Another variant is EBCDIC. This trick only works on ASCII platforms. A more portable solution would be to use std::tolower and std::toupper, with the offered bonus to be locale-aware (it does not automagically solve all your problems though, see comments):

bool case_insensitive_equal(char lhs, char rhs)
{
    return std::tolower(lhs, std::locale{}) == std::tolower(rhs, std::locale{}); // std::locale{} optional, enables locale-awareness
}

assert(case_insensitive_equal('A', 'a'));

1) As 32 is 1 << 5 (2 to the power 5), it flips the 6th bit (counting from 1).

@Bathsheba 2019-02-05 14:35:56

EBCDIC was chosen by some very smart people too: works really nicely on punched cards cf. ASCII which is a mess. But this is a nice answer, +1.

@YSC 2019-02-05 14:37:21

@Bathsheba ASCII on punchcard? Who would? :D

@dan04 2019-02-05 17:54:20

I don't know about punch cards, but ASCII was used on paper tape. That's why the Delete character is encoded as 1111111: So you could mark any character as "deleted" by punching out all the holes in its column on the tape.

@Lord Farquaad 2019-02-05 18:48:28

@Bathsheba as someone who hasn't used a punchcard, it's very difficult to wrap my head around the idea that EBCDIC was intelligently designed.

@user3003999 2019-02-05 20:03:21

May I recommend re-checking which bit gets flipped?

@YSC 2019-02-05 20:05:02

@Rogem 😅 thank you

@Peteris 2019-02-05 21:25:07

@LordFarquaad IMHO the Wikipedia picture of how letters are written on a punchcard is an obvious illustration on how EBCDIC does make some (but not total, see / vs S) sense for this encoding. en.wikipedia.org/wiki/EBCDIC#/media/…

@arctiq 2019-02-05 22:01:25

The notice is incorrect. Even the reference page for std::tolower states so. Some characters have multiple equivalent forms - this solution will not work for them. It will also not handle for example 'á' and 'Á', even though it will accept them. Neither version is "portable."

@dan04 2019-02-05 23:35:51

And any one-character-at-a-time approach to case-folding will fail for, e.g., German ß and SS.

@marcelm 2019-02-05 23:46:21

"This trick only works on ASCII platforms." - True, a similar ^= 64 will work for EBCDIC though! (But not for ASCII anymore)

@Martin Bonner 2019-02-06 09:30:04

@dan04 Not to mention "what is the lower-case form of 'MASSE'?". For those that don't know, there are two words in German whose upper case form is MASSE; one is "Masse" and the other is "Maße". Proper tolower in German doesn't merely need a dictionary, it needs to be able to parse the meaning.

@phuclv 2019-02-06 13:53:37

@Deduplicator 2019-02-07 12:51:53

You are aware that std::tolower() is only defined for EOF and arguments in the unsigned char-range?

@YSC 2019-02-07 12:55:47

@Deduplicator You're talking about std::tolower (<cctype>) I think. This answer is about std::tolower (<locale>).

@Deduplicator 2019-02-07 13:44:26

@YSC Where is the second Argument then?

@YSC 2019-02-07 14:17:11

@Deduplicator With the joy to be a kid and all the other things I forgot. (fixed)

@chux 2019-02-08 15:12:06

@marcelm foo ^= 'a' ^ 'A'; would work for ASCII and EBCDIC.
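A small sketch of that idea: the mask is computed from the character literals themselves, so it comes out as 0x20 on ASCII and 0x40 on EBCDIC. The names kCaseMask and toggle_case_portable are hypothetical, and the sketch still assumes an encoding where upper and lower case differ by a single fixed bit pattern:

```cpp
// The mask is whatever bits differ between 'a' and 'A' in the execution character set.
constexpr char kCaseMask = 'a' ^ 'A';

// Still flips non-letters too, so callers need their own letter check.
constexpr char toggle_case_portable(char c) {
    return static_cast<char>(c ^ kCaseMask);
}

static_assert(toggle_case_portable('a') == 'A', "lower -> upper");
static_assert(toggle_case_portable('Z') == 'z', "upper -> lower");
```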

@Iiridayn 2019-02-06 22:35:08

See the second table at http://www.catb.org/esr/faqs/things-every-hacker-once-knew/#_ascii, and following notes, reproduced below:

The Control modifier on your keyboard basically clears the top three bits of whatever character you type, leaving the bottom five and mapping it to the 0..31 range. So, for example, Ctrl-SPACE, Ctrl-@, and Ctrl-` all mean the same thing: NUL.

Very old keyboards used to do Shift just by toggling the 32 or 16 bit, depending on the key; this is why the relationship between small and capital letters in ASCII is so regular, and the relationship between numbers and symbols, and some pairs of symbols, is sort of regular if you squint at it. The ASR-33, which was an all-uppercase terminal, even let you generate some punctuation characters it didn’t have keys for by shifting the 16 bit; thus, for example, Shift-K (0x4B) became a [ (0x5B)

ASCII was designed such that the shift and ctrl keyboard keys could be implemented without much (or perhaps any for ctrl) logic - shift probably required only a few gates. It probably made at least as much sense to store the wire protocol as any other character encoding (no software conversion required).
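The Ctrl behaviour described in the quote amounts to a single AND. A rough sketch, assuming ASCII (ctrl is a hypothetical helper name, not a real API):

```cpp
// Assumes ASCII. Models the Ctrl modifier: keep the low five bits, clear the top three.
constexpr char ctrl(char c) {
    return static_cast<char>(c & 0x1F);
}

static_assert(ctrl(' ') == 0 && ctrl('@') == 0 && ctrl('`') == 0, "all three map to NUL");
static_assert(ctrl('H') == '\b', "Ctrl-H is backspace (0x08)");
static_assert(ctrl('h') == ctrl('H'), "case does not matter, bit 0x20 is cleared");
```

This is also why Ctrl-[ produces ESC (0x1B) and Ctrl-I produces a tab (0x09).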

The linked article also explains many strange hacker conventions, such as "And control H does a single character and is an old^H^H^H^H^H classic joke." (found here).

@Iiridayn 2019-02-07 19:48:48

Could implement a shift toggle for more of ASCII w/foo ^= (foo & 0x60) == 0x20 ? 0x10 : 0x20, though this is only ASCII and therefore unwise for reasons stated in other answers. It can probably also be improved w/branch-free programming.

@Iiridayn 2019-02-08 23:34:06

Ah, foo ^= 0x20 >> !(foo & 0x40) would be simpler. Also a good example of why terse code is often considered unreadable ^_^.
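Wrapped in a function, the shorter expression looks like the sketch below. It assumes ASCII and models the old bit-paired keyboards from the quote, not a modern layout; shift_toggle is a hypothetical name:

```cpp
// Assumes ASCII. Letters and the symbols next to them (bit 0x40 set) toggle bit 0x20;
// digits and punctuation in 0x20..0x3F toggle bit 0x10 instead.
constexpr char shift_toggle(char c) {
    return static_cast<char>(c ^ (0x20 >> !(c & 0x40)));
}

static_assert(shift_toggle('a') == 'A', "letters flip the 32 bit");
static_assert(shift_toggle('1') == '!', "0x31 ^ 0x10 == 0x21");
static_assert(shift_toggle(';') == '+', "0x3B ^ 0x10 == 0x2B");
static_assert(shift_toggle('/') == '?', "0x2F ^ 0x10 == 0x3F");
```

For printable characters (0x20 to 0x7E) this agrees with the ternary version from the previous comment; the two differ only for control characters below 0x20.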

@Brian 2019-02-06 08:09:06

Plenty of good answers here that describe how this works, but why it works this way is to improve performance. Bitwise operations are faster than most other operations within a processor. You can quickly do a case insensitive comparison by simply not looking at the bit that determines case or change case to upper/lower simply by flipping the bit (those guys that designed the ASCII table were pretty smart).
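A sketch of such a comparison, assuming ASCII input only (ascii_iequal and is_ascii_letter are hypothetical names). Ignoring the case bit blindly would also equate '@' with '`', so the sketch checks that the characters are letters before accepting a difference in that bit:

```cpp
// Assumes ASCII. True only for A-Z and a-z.
inline bool is_ascii_letter(unsigned char c) {
    return static_cast<unsigned>((c | 0x20) - 'a') <= 25u;
}

// Equal if identical, or if they differ only in bit 0x20 and are letters.
inline bool ascii_iequal(char a, char b) {
    unsigned char ua = static_cast<unsigned char>(a);
    unsigned char ub = static_cast<unsigned char>(b);
    if (ua == ub) return true;
    return (ua ^ ub) == 0x20 && is_ascii_letter(ua);
}
```

If ua is a letter and ub differs from it only in bit 0x20, ub is necessarily the same letter in the other case, so checking one side is enough.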

Obviously, this isn't nearly as big of a deal today as it was back in 1960 (when work first began on ASCII) due to faster processors and Unicode, but there are still some low-cost processors on which this could make a significant difference, as long as you can guarantee only ASCII characters.

https://en.wikipedia.org/wiki/Bitwise_operation

On simple low-cost processors, typically, bitwise operations are substantially faster than division, several times faster than multiplication, and sometimes significantly faster than addition.

NOTE: I would recommend using standard libraries for working with strings for a number of reasons (readability, correctness, portability, etc). Only use bit flipping if you have measured performance and this is your bottleneck.

@Yves Daoust 2019-02-05 20:06:57

Xoring with 32 (00100000 in binary) sets or resets the sixth bit (from the right). This is strictly equivalent to adding 32 when that bit is clear, or subtracting 32 when it is set.

@Peter Cordes 2019-02-08 02:46:27

Another way to say this is that XOR is add-without-carry.
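Both statements can be checked at compile time. A sketch assuming ASCII; sum_identity_holds is a hypothetical helper that spells out the general rule: a sum is the carry-less sum (XOR) plus the carries (AND, shifted left by one).

```cpp
// Assumes ASCII. XOR with 32 adds 32 when the bit is clear and subtracts 32 when it is set.
static_assert(('A' ^ 32) == 'A' + 32, "bit clear: XOR acts like +32");
static_assert(('a' ^ 32) == 'a' - 32, "bit set: XOR acts like -32");

// General identity: a + b == (a ^ b) + 2 * (a & b).
constexpr bool sum_identity_holds(unsigned a, unsigned b) {
    return a + b == (a ^ b) + 2 * (a & b);
}

static_assert(sum_identity_holds('a', 32), "97 + 32 == 65 + 2 * 32");
static_assert(sum_identity_holds('A', 32), "no shared bits, so no carry term");
```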

@Jack Aidley 2019-02-05 14:18:03

It works because, as it happens, the difference between 'a' and 'A' in ASCII and derived encodings is 32, and 32 is also the value of the sixth bit. Flipping the 6th bit with an exclusive OR thus converts between upper and lower.

@Blaze 2019-02-05 14:21:18

Most likely your implementation of the character set will be ASCII. If we look at the table:


We see that there's a difference of exactly 32 between the value of a lowercase and an uppercase letter. Therefore, if we do ^= 32 (which equates to toggling the 6th least significant bit), it changes between a lowercase and uppercase character.

Note that it works with all the symbols, not just the letters. It toggles a character with the respective character where the 6th bit is different, resulting in a pair of characters that is toggled back and forth between. For the letters, the respective upper/lowercase characters form such a pair. A NUL will change into Space and the other way around, and the @ toggles with the backtick. Basically any character in the first column on this chart toggles with the character one column over, and the same applies to the third and fourth columns.

I wouldn't use this hack though, as there's no guarantee that it's going to work on any given system. Just use toupper and tolower instead, and queries such as isupper.

@Matthieu Brucher 2019-02-05 14:23:02

Well, it doesn't work for all letters that have a difference of 32. Otherwise, it would work between '@' and ' '!

@NathanOliver 2019-02-05 14:28:00

@MatthieuBrucher It is working, 32 ^ 32 is 0, not 64

@Matthieu Brucher 2019-02-05 14:29:02

@NathanOliver Yes, that's my point, it doesn't work between any two characters separated by 32, only those that have a specific pattern between them.

@Blaze 2019-02-05 14:31:22

@MatthieuBrucher it toggles @ with the backtick, not with space. Every char is part of a pair that is toggled back and forth between. Perhaps my explanation was unclear.

@Matthieu Brucher 2019-02-05 14:32:22

@Blaze yes, the explanation is not clear enough. People that don't know about logical binary operations may not understand. + the errors in the text itself.

@freedomn-m 2019-02-05 16:42:48

'@' and ' ' aren't "letters". Only [a-z] and [A-Z] are "letters". The rest are coincidences that follow the same rule. If someone asked you to "upper case ]", what would it be? it would still be "]" - "}" isn't the "upper case" of "]".

@Peter Cordes 2019-02-05 23:27:46

@MatthieuBrucher: Another way to make that point is that the lower-case and upper-case alphabetic ranges don't cross a %32 "alignment" boundary in the ASCII coding system. This is why bit 0x20 is the only difference between the upper/lower case versions of the same letter. If this wasn't the case, you'd need to add or subtract 0x20, not just toggle, and for some letters there would be carry-out to flip other higher bits. (And the same operation couldn't toggle, and checking for alphabetic characters in the first place would be harder because you couldn't |= 0x20 to force lcase.)

@Tom Blodget 2019-02-06 05:11:23

It is unlikely that the compiler's execution character encoding(charset) would be ASCII. It is unlikely that the locale would have ASCII as the character encoding (codeset).

@A C 2019-02-06 05:39:27

+1 for reminding me of all those visits to asciitable.com to stare at that exact graphic (and the extended ASCII version!!) for the last, I dunno, 15 or 20 years?

@Matthieu Brucher 2019-02-06 09:39:42

@PeterCordes yes, far better explanation, thanks!

@Bathsheba 2019-02-05 14:33:07

It's how ASCII works, that's all.

But in exploiting this, you are giving up portability as C++ doesn't insist on ASCII as the encoding.

This is why the functions std::toupper and std::tolower are implemented in the C++ standard library - you should use those instead.
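A minimal sketch of the library route, using std::toupper from <cctype>. The function name upper_copy is made up for illustration. Note the assumptions: this works one char at a time in the current C locale, so it is not Unicode-aware and cannot handle cases like ß, and the argument has to be converted to unsigned char first because std::toupper is only defined for EOF and values in the unsigned char range.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Uppercase a copy of s, one char at a time, according to the current C locale.
inline std::string upper_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        // Taking the parameter as unsigned char avoids passing negative values to std::toupper.
        return static_cast<char>(std::toupper(c));
    });
    return s;
}
```

In the default "C" locale, upper_copy("abc-123 xyz") should return "ABC-123 XYZ".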

@Alnitak 2019-02-05 14:59:04

There are protocols, though, which require that ASCII is used, such as DNS. In fact, the "0x20 trick" is used by some DNS servers to insert additional entropy into a DNS query as an anti-spoofing mechanism. DNS is case insensitive, but also supposed to be case preserving, so if you send a query with random case and get the same case back, it's a good indication that the response hasn't been spoofed by a third party.
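A rough sketch of the idea, not taken from any real resolver: randomize_case and case_echoed_exactly are hypothetical names, and std::mt19937 is used only to keep the example short (a real implementation would presumably want a cryptographically strong random source).

```cpp
#include <random>
#include <string>

// Randomly flip bit 0x20 on each ASCII letter of a query name. Dots, digits and hyphens are left alone.
inline std::string randomize_case(std::string name, std::mt19937& rng) {
    std::bernoulli_distribution flip(0.5);
    for (char& ch : name) {
        unsigned char lower = static_cast<unsigned char>(ch) | 0x20;
        if (lower >= 'a' && lower <= 'z' && flip(rng)) {
            ch ^= 0x20;
        }
    }
    return name;
}

// The response is accepted only if it echoes the name byte for byte, case included.
inline bool case_echoed_exactly(const std::string& sent, const std::string& echoed) {
    return sent == echoed;
}
```

An off-path attacker who does not see the query would have to guess the case of every letter, which adds roughly one bit of entropy per letter in the name.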

@Captain Man 2019-02-05 15:43:22

It's worth mentioning that a lot of encodings still have the same representation for the standard (not extended) ASCII characters. But still, if you're really worried about different encodings you should use the proper functions.

@Bathsheba 2019-02-05 15:44:27

@CaptainMan: Absolutely. UTF-8 is a thing of sheer beauty. Hopefully it gets "absorbed" into the C++ standard insofar that IEEE754 has for floating point.

@Hanjoung Lee 2019-02-05 14:22:46

Let's take a look at ASCII code table in binary.

A 1000001    a 1100001
B 1000010    b 1100010
C 1000011    c 1100011
...
Z 1011010    z 1111010

And 32 is 0100000 which is the only difference between lowercase and uppercase letters. So toggling that bit toggles the case of a letter.

@Mooing Duck 2019-02-06 00:49:43

"toggles the case" *only for ASCII

@dbkk 2019-02-06 04:00:02

@Mooing only for A-Za-z in ASCII. Lower case of "[" is not "{".

@KeyWeeUsr 2019-02-06 10:35:55

@dbkk { is shorter than [, so it is a "lower" case. No? Ok, I'll show myself out :D

@Guntram Blohm 2019-02-07 11:33:35

Trivia tidbit: In the 7 bit area, German computers had []{|} remapped to ÄÖÜäöü since we needed Umlauts more than those characters, so in that context, { (ä) actually was the lowercase [ (Ä).

@UKMonkey 2019-02-07 11:50:09

@KeyWeeUsr I'm now going to have to find a font to disprove your argument....

@ZeroKnight 2019-02-07 23:14:29

@GuntramBlohm Further trivia tidbit, this is why IRC servers consider foobar[] and foobar{} to be identical nicknames, as nicknames are case insensitive, and IRC has its origins in Scandinavia :)

@Andrea 2019-02-09 21:07:29

The phrase worth knowing is "ISO 646". Just as in the 8-bit era there were many national/regional ASCII supersets, in the 7-bit era ASCII was just one of many character sets that were 646 compatible. And thus, the ^= 32 trick actually works for (most?) ISO 646-based character sets, not just ASCII :D
