#### [SOLVED] What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

By Zxaos

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the furthest left bit that is a 1), what is the quickest/most efficient method of finding out?

I know that POSIX supports a `ffs()` method in strings.h to find the first set bit, but there doesn't seem to be a corresponding `fls()` method.

Is there some really obvious way of doing this that I'm missing?

What about in cases where you can't use POSIX functions for portability?

Edit: What about a solution that works on both 32 and 64 bit architectures (many of the code listings seem like they'd only work on 32 bit ints).

#### @Antonin GAVREL 2018-01-24 15:55:34

I assume your question is for an integer (called v below) and not an unsigned integer.

``````int v = 612635685; // whatever value you wish

unsigned int get_msb(int v)
{
int r = 31;                         // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.

while (!(v & 0x80000000) && r--) {   // mask of the highest bit
v <<= 1;                        // multiply integer by 2.
}
return r;                           // will even return -1 if no bit was set, allowing error catch
}
``````

If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly). Please let me know if I forgot anything. I haven't tested it but it should work just fine.

#### @chux - Reinstate Monica 2018-11-03 21:19:14

`v <<= 1` is undefined behavior (UB) when `v < 0`.

#### @M.M 2020-02-21 09:30:47

`0x8000000` , maybe you mean an extra 0 there .

#### @ephemient 2009-03-23 15:16:34

``` -- Built-in Function: int __builtin_clz (unsigned int x)
Returns the number of leading 0-bits in X, starting at the most
significant bit position.  If X is 0, the result is undefined.

-- Built-in Function: int __builtin_clzl (unsigned long)
Similar to `__builtin_clz', except the argument type is `unsigned
long'.

-- Built-in Function: int __builtin_clzll (unsigned long long)
Similar to `__builtin_clz', except the argument type is `unsigned
long long'.```

I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.

A useful trick if your input can be zero is `__builtin_clz(x | 1)`: unconditionally setting the low bit without modifying any others makes the output `31` for `x=0`, without changing the output for any other input.

To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's `__clz` (no header needed), or x86's `_lzcnt_u32` on CPUs that support the `lzcnt` instruction. (Beware that `lzcnt` decodes as `bsr` on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)

There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's `lzcnt` does that, too, while `bsr` produces a bit-index that the compiler has to flip unless you use `31-__builtin_clz(x)`.

(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)

#### @ratchet freak 2015-12-17 15:48:56

MSVC will have _BitScanReverse

#### @Peter Cordes 2016-12-18 06:58:41

The undefined-on-zero behaviour lets them compile to a single BSR instruction on x86, even when LZCNT isn't available. This is a big advantage for `__builtin_ctz` over `ffs`, which compiles to a BSF and a CMOV to handle the input-was-zero case. On architectures without a short-enough implementation (e.g. old ARM without the `clz` instruction), gcc emits a call to a libgcc helper function.

#### @NO_NAME 2019-12-14 11:53:45

Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.

``````#include <limits.h>

unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}

unsigned long flsl(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}

unsigned long long flsll(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
``````

And a little improved version for C++.

``````#include <climits>

constexpr unsigned int fls(const unsigned int value)
{
return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}

constexpr unsigned long fls(const unsigned long value)
{
return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}

constexpr unsigned long long fls(const unsigned long long value)
{
return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}
``````

The code assumes that `value` won't be `0`. If you want to allow 0, you need to modify it.

#### @SpartanWar 2019-06-15 22:20:41

My humble method is very simple:

MSB(x) = INT[Log(x) / Log(2)]

Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).

This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.

#### @Mikko Rantalainen 2019-10-04 19:27:35

That works if all you're interested is developer efficiency. If you want runtime efficiency, you need alternative algorithm.

#### @benrg 2019-10-11 17:48:01

This can fail due to roundoff error. For example, in CPython 2 and 3, `int(math.log((1 << 48) - 1) / math.log(2))` is 48.

#### @Glenn Slayden 2017-10-26 13:41:10

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using `Marshal.AllocHGlobal`. As you can see, for maximum performance, the whole example is written as native:

``````readonly static byte[] msb_tab_15;

// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
var p = new byte[0x8000];

for (byte n = 0; n < 16; n++)
for (int c = (1 << n) >> 1, i = 0; i < c; i++)
p[c + i] = n;

msb_tab_15 = p;
}
``````

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

Notice that the table entry for `0`, the sole integer for which the notion of 'highest set bit' is undefined, is given the value `-1`. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:

ulong (64-bit) Version

``````/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63

int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
``````

uint (32-bit) Version

``````/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
if ((int)v <= 0)
return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31

int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
return j + msb_tab_15[v >> (j + 1)];
}
``````

Various overloads for the above

``````public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];
``````

This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value `0` (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.

That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.

The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

``` 1  candidates.HighestOne_Tab16A               622,496
2  candidates.HighestOne_Tab16C               628,234
3  candidates.HighestOne_Tab8A                649,146
4  candidates.HighestOne_Tab8B                656,847
5  candidates.HighestOne_Tab16B               657,147
6  candidates.HighestOne_Tab16D               659,650
7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900
8  de_Bruijn.IndexOfMSB                       709,672
9  _old_2.HighestOne_Old2                     715,810
10  _test_A.HighestOne8                        757,188
11  _old_1.HighestOne_Old1                     757,925
12  _test_A.HighestOne5  (unsafe)              760,387
13  _test_B.HighestOne8  (unsafe)              763,904
14  _test_A.HighestOne3  (unsafe)              766,433
15  _test_A.HighestOne1  (unsafe)              767,321
16  _test_A.HighestOne4  (unsafe)              771,702
17  _test_B.HighestOne2  (unsafe)              772,136
18  _test_B.HighestOne1  (unsafe)              772,527
19  _test_B.HighestOne3  (unsafe)              774,140
20  _test_A.HighestOne7  (unsafe)              774,581
21  _test_B.HighestOne7  (unsafe)              775,463
22  _test_A.HighestOne2  (unsafe)              776,865
23  candidates.HighestOne_NoTab                777,698
24  _test_B.HighestOne6  (unsafe)              779,481
25  _test_A.HighestOne6  (unsafe)              781,553
26  _test_B.HighestOne4  (unsafe)              785,504
27  _test_B.HighestOne5  (unsafe)              789,797
28  _test_A.HighestOne0  (unsafe)              809,566
29  _test_B.HighestOne0  (unsafe)              814,990
30  _highest_one_bit.HighestOne                824,345
30  _bitarray_ext.RtlFindMostSignificantBit    894,069
31  candidates.HighestOne_Naive                898,865```

Notable is that the terrible performance of `ntdll.dll!RtlFindMostSignificantBit` via P/Invoke:

``````[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);
``````

It's really too bad, because here's the entire actual function:

``````    RtlFindMostSignificantBit:
bsr rdx, rcx
mov eax,0FFFFFFFFh
movzx ecx, dl
cmovne      eax,ecx
ret
``````

I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) `short` (16-bit) direct-lookup tables over the 128-byte (and 256-byte) `byte` (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:

``````public static int HighestOne_Tab8A(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;

int j;
j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
return j + msb_tab_8[v >> j];
}
``````

The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

``````const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
N_bsr64 = 0x03F79D71B4CB0A89;

readonly public static sbyte[]
bsf64 =
{
63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,
61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,
62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,
},
bsr64 =
{
0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
};

public static int IndexOfLSB(ulong v) =>
v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;

public static int IndexOfMSB(ulong v)
{
if ((long)v <= 0)
return (int)((v >> 57) & 64) - 1;

v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better
v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?
return bsr64[(v * N_bsr64) >> 58];
}
``````

There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the `IndexOfMSB` functions here--not the deBruijn `IndexOfLSB`--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

#### @Peter Cordes 2018-03-11 05:12:21

L1D cache on modern x86 CPUs is only 32kiB. A large LUT is likely to be worse than a small LUT unless you're using the same values repeatedly. If you aren't, you'll get frequent cache misses.

#### @Jonathan Mee 2015-01-12 19:05:36

has given us `log2`. This removes the need for all the special sauce `log2` implementations you see on this page. You can use the standard's `log2` implementation like this:

``````const auto n = 13UL;
const auto Index = (unsigned long)log2(n);

printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
``````

An `n` of `0UL` needs to be guarded against as well, because:

-∞ is returned and FE_DIVBYZERO is raised

I have written an example with that check that arbitrarily sets `Index` to `ULONG_MAX` here: https://ideone.com/u26vsi

The corollary to ephemient's gcc only answer is:

``````const auto n = 13UL;
unsigned long Index;

_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)
``````

The documentation for `_BitScanReverse` states that `Index` is:

Loaded with the bit position of the first set bit (1) found

In practice I've found that if `n` is `0UL` that `Index` is set to `0UL`, just as it would be for an `n` of `1UL`. But the only thing guaranteed in the documentation in the case of an `n` of `0UL` is that the return is:

0 if no set bits were found

Thus, similarly to the preferable `log2` implementation above the return should be checked setting `Index` to a flagged value in this case. I've again written an example of using `ULONG_MAX` for this flag value here: http://rextester.com/GCU61409

#### @Peter Cordes 2018-03-11 05:02:59

No, `_BitScanReverse` returns 0 only if the input was `0`. This is like x86's `BSR` instruction, which sets ZF based only on the input, not the output. Interesting that MS words the docs as leaving `index` unset when no `1` bit is found; that matches the x86 asm behaviour of `bsr`, too. (AMD documents it as leaving the destination register unmodified on src=0, but Intel just says undefined output even though their CPUs do implement the leave-unmodified behaviour.) This is unlike x86's `lzcnt`, which gives `32` for not-found.

#### @Jonathan Mee 2018-03-12 12:52:04

@PeterCordes `_BitScanReverse` uses zero-based indexing, thus if `n` is 1 then the index of the set bit is in fact 0. Unfortunately, as you say if `n` is 0 then the output is also 0 :( This means there's no way to use the return to distinguish between an `n` of 1 or 0. That's what I was trying to communicate. Do you think there's a better way to say this?

#### @Peter Cordes 2018-03-12 22:23:28

I think you're talking about how it sets `Index`. That's not the return value. It returns a boolean that's false if the input was zero (and this is why Index is passed by reference instead of being returned normally). godbolt.org/g/gQKJdE. And I checked: despite the wording of MS's docs, `_BitScanReverse` doesn't leave Index unset on `n==0`: you just get whatever value was in the register it happened to use. (Which in your case was probably the same register it used for `Index` afterward, leading to you seeing a `0`).

#### @technosaurus 2018-03-13 01:41:00

This question is not tagged c++.

#### @Jonathan Mee 2018-03-13 12:45:29

@technosaurus Thanks, I forgot myself. Given that the question is C we've actually had `log2` since C99.

#### @fgrieu 2018-04-03 16:59:15

_BitScanReverse returns "Nonzero if Index was set, or 0 if no set bits were found". When n=0 is passed, index is not modified ans thus the answer's code gives undefined result when n=0 instead of 13.

#### @Jonathan Mee 2018-04-03 19:51:58

@fgrieu No, `Index` is zero'd if all bits are unset as well: rextester.com/BHFXS35929 Meaning that testing the return of `_BitScanReverse` is the only way to know what an out parameter of 0 actually means.

#### @Jonathan Mee 2018-04-03 19:55:50

@fgrieu Ugh I just reread my answer, it said return when it should have said out param. I've updated.

#### @fgrieu 2018-04-03 20:37:45

@JonathanMee: from both a recent initially bad experience, MS's spec and Intel's spec: when the input n is zero, _BitScanReverse finds no bit set, leaves Index unspecified (or is it unchanged, but that's not what I experienced), and returns 0 to warn about that condition. In my experience, actual results will depend on compiler options and surrounding code.

#### @Jonathan Mee 2018-04-04 12:16:01

@fgrieu I believe the right choice here is `log2` but I agree I'm depending on undocumented behavior on my `_BitScanReverse` example, which is questionable at best. I've updated my answer with a check that appropriately sets a flag value on `Index`.

#### @timday 2009-03-23 00:00:51

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a `BSR` instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content source operand is 0, the content of the destination operand is undefined.

(If you're on PowerPC there's a similar `cntlz` ("count leading zeros") instruction.)

Example code for gcc:

``````#include <iostream>

int main (int,char**)
{
int n=1;
for (;;++n) {
int msb;
asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
std::cout << n << " : " << msb << std::endl;
}
return 0;
}
``````

See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.

#### @rlbond 2009-03-23 03:10:54

Actually this instruction is usually microcoded into a loop and is rather slow.

#### @timday 2009-03-23 09:26:07

Which one ? BSR or CNTLZ ? As I read the x86-timing.pdf referenced above, BSR is only slow on the Netburst Pentiums. I know nothing about PowerPC though.

#### @timday 2009-03-23 09:29:23

...OK, on closer inspection make that "BSR is only fast on P3/Pentium-M/Core2 x86s". Slow on Netburst and AMD.

#### @Crashworks 2009-03-23 19:28:56

cntlzw is a pipelined (not microcoded) op on most PPCs. Looks like it's about as fast as an add on Xenon.

#### @Stephen Canon 2013-05-06 20:02:37

@timday: also slow on (some?) Atom µarches.

#### @Ira Baxter 2014-06-21 09:00:19

@rlbond: Yes, but is it slower than some clever instruction sequence? How do I find out?

#### @timday 2014-06-21 18:33:33

There's absolutely no substitute for benchmarking your own (realistic) use cases. Whether the clever algorithms are faster or not might depend on what else the CPU's execution units is tied up with from "nearby" code (if anything), what the distribution of the values you're testing is... all sorts of things.

#### @Peter Cordes 2016-12-18 06:51:28

If you're using GNU C anyway, you should use use `__builtin_clz` (or `__builtin_clzll`), which has the same undefined-on-zero behaviour that lets it compile to a single BSR on x86. Or LZCNT if available, because that's faster on more CPUs (e.g. on AMD it's fast even though BSR is slow, maybe because BSR has the weird behaviour of setting ZF according to the input, not the result). Or whatever is optimal on the target arch, since it's not limited to x86. Anyway, gcc.gnu.org/wiki/DontUseInlineAsm when you can avoid it, since it defeats constant-propagation and some other optimizations.

#### @Peter Cordes 2016-12-18 07:17:19

@rlbond: huh, BSR on P4 Prescott is 2 uops with 16 cycle latency(!), with one per 4c throughput. But on earlier Netburst, it's only 4 cycle latency (still 2 uops), and one per 2c throughput. (source: agner.org/optimize). On most CPUs, it also has a dependency on its output which gcc doesn't account for (when the input is zero, the actual behaviour is to leave the destination unchanged). This can lead to problems like stackoverflow.com/questions/25078285/…. IDK why gcc missed BSR when fixing that.

#### @SPWorley 2009-03-22 23:49:25

This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.

My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.

``````double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023;  // assumes x86 endianness
``````

This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.

It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.

To do this safely, without undefined behaviour in C++ or C, use `memcpy` instead of pointer casting for type-punning. Compilers know how to inline it efficiently.

``````// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?

double ff=(double)(v|1);

uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;
``````

Or in C99 and later, use a `union {double d; uint32_t u[2];};`. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.

This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to `double`. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).

This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD `lzcnt`. x86 only got such an instruction with AVX512CD

#### @Crashworks 2009-03-23 00:43:59

Won't that cause a load hit store on moving between register sets?

#### @Michael Burr 2009-03-23 00:47:59

Isn't there a cost to loading/reading the FPU (as well as synchronizing with it)?

#### @MSN 2009-03-23 00:51:42

Yes. And gcc will do nasty things with code like this with -O2 due to type-aliasing optimizations.

#### @jalf 2009-03-23 01:23:43

casting between integer and floating point can be surprisingly expensive on x86 CPU's

#### @SPWorley 2009-03-23 02:02:21

Yep, the FPU costs are high. But actual time measurements showed this was faster than all-bit ops or especially any loops. Try it and take the fastest is always the best advice. I have not had a problem with GCC and -O2 with this though.

#### @Chris Dodd 2009-03-24 23:35:59

You can do this much more portably (and safely) with frexp

#### @R.. GitHub STOP HELPING ICE 2011-02-06 06:54:00

@Chris: And much more slowly. :-)

#### @dreamlax 2011-12-11 07:50:03

Isn't this undefined behaviour (reading a value through a pointer of an incompatible type)?

#### @D Coetzee 2012-01-03 08:35:23

Hacker's Delight explains how to correct for the error in 32-bit floats in 5-3 Counting Leading 0's. Here's their code, which uses an anonymous union to overlap asFloat and asInt: k = k & ~(k >> 1); asFloat = (float)k + 0.5f; n = 158 - (asInt >> 23); (and yes, this relies on implementation-defined behavior)

#### @David Given 2017-01-20 21:46:34

On the PowerPC, moving a value from a floating point register to an integer register (or vice versa) involves writing to memory and reading it back again.

#### @Peter Cordes 2018-03-11 04:25:04

This is a really neat idea, and potentially useful for a SIMD implementations without AVX512CD `VPLZCNTD` where type-punning an integer to float is free (because they use the same vector registers). For scalar of course, `bsr` is faster on all x86 CPUs newer than Pentium4 at least, so use an intrinsic for that. Anyway, this would be a much better answer if you avoided the strict-aliasing UB and the assumption that `unsigned long` was a 32-bit type.

#### @Sunsetquest 2019-10-20 15:40:17

Great idea - I had the same idea but then later realized I ran across your post before and that probably seeded my brain with the idea. I am pretty sure this is the fastest method in c# since lzcnt is not supported without high-overhead hacks. Since c#'s compiler is more locked down it would be a tad safer to use... except for maybe non-x86.

#### @Glenn Slayden 2020-04-07 07:24:17

warning: the code as shown returns the incorrect answer for values in the range `[7FFFFFFFFFFFFE00 - 7FFFFFFFFFFFFFFF]`; all of these return `63` instead of the correct MSB `62`. Taking `7FFFFFFFFFFFFFFF` with the code above, OR-ing with `1` has no effect, and the IEEE double representation is `43E0000000000000`. Applying unsigned `>> 20` to the high dword (or `>> 52` on the quad) gives `0x43e` which is 1086. Finally, `1086 - 1023 = 63`. This range of 512 error values are the only ones I'm currently aware of, but I haven't done an exhaustive check so there may be others.

#### @Glenn Slayden 2020-04-07 07:37:46

Another range of errors is `[FFFFFFFFFFFFFC00 - FFFFFFFFFFFFFFFF]`; these 1024 values all return `64` instead of the correct MSB index `63`.

#### @Glenn Slayden 2020-04-07 10:15:58

...and here's a fix that works correctly over the entire range of 64-bit integers: `double d = v & ~(v >> 1);` ... `int msb = (int)(*(ulong*)&d >> 52) - 1023;` ... You can stop here with just that if you don't mind that the (already dubious) input value 0 gives result `msb = -1023`. If you prefer a prettier -1 for this special case, just tack on the following line ... `msb |= (msb >> 31);` at the end.

#### @AMDG 2020-05-01 20:54:30

If the effects of GCC optimization on this are so important, `volatile` can help here to ensure desired behavior regardless of optimization level.

#### @Glenn Slayden 2020-05-01 22:01:20

@AMDG Are you responding to my comment? The errors I describe are not compiler errors, they are a property of the CPU, basically a hardware limitation of the IEEE technique shown in this answer. And it's not really the CPU's fault either; I believe what's going on is that information is lost because the IEEE spec is deliberately designed to provide more resolution near 1.0 at the expense of the +/- extremes, which is great for floating point values, but catastrophic when the FP unit is rudely co-opted for processing discrete integer values (and even worse, also under bitwise interpretation).

#### @AMDG 2020-05-02 01:13:43

@GlennSlayden No, actually MSN's comment. Thank you for your contribution.

#### @Harry Svensson 2017-05-27 20:35:03

Woaw, that was many answers. I am not sorry for answering on an old question.

``````int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
if(0xFFFFFFFF00000000&value){  value>>=(1<<5); result|=(1<<5);  }//if it is 32bit then remove this line
if(0x00000000FFFF0000&value){  value>>=(1<<4); result|=(1<<4);  }//and remove the 32msb
if(0x000000000000FF00&value){  value>>=(1<<3); result|=(1<<3);  }
if(0x00000000000000F0&value){  value>>=(1<<2); result|=(1<<2);  }
if(0x000000000000000C&value){  value>>=(1<<1); result|=(1<<1);  }
if(0x0000000000000002&value){  result|=(1<<0);  }
}else{
result=-1;
}
``````

This answer is pretty similar to another answer... oh well.

#### @greybeard 2017-10-26 17:31:19

Writing the shift amounts as `1<<k` is a nice touch. What about the masks? `(1 << (1<<k-1)-1<< (1<<k-1)`? (`most optimal`? You compare a superlative?)

#### @Harry Svensson 2017-10-27 01:44:44

@greybeard If you look at the edits of this question you will see when I added the "optimal" part. I forgot to remove it as I changed my answer. Also I'm not sure why you are are talking about the masks? (What masks? I'm not following you)

#### @greybeard 2017-10-27 07:40:55

((bit)mask are values used to select/clear bits selectively/used in `&` and `&~`.) You could replace the hex constants by the likes of `((type)1<<(1<<k))-1<<(1<<k)`.

#### @Harry Svensson 2017-10-27 21:33:04

Oh right, I'm using masks, I totally forgot about that. I did answer this a couple of month's ago... - Hmmm, well since it's evaluated during compile time I say it's equivalent to the hex values. However, one is cryptic and one is hexadecimal.

#### @Finnegan 2016-12-30 01:17:58

I know this question is very old, but just having implemented an msb() function myself, I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:

Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01. See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.

It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).

Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).

Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:

``````template <typename T>
auto msb(T n) -> int
{
static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
"msb<T>(): T must be an unsigned integral type.");

for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
{
if ((n & mask) != 0)
return i;
}

return 0;
}
``````

Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size. My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.

1The argument goes like this (rough draft): Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.

If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n

This series of average iterations is actually convergent and has a limit of 2 for n towards infinity

Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.

#### @johnwbyrd 2017-10-28 21:29:36

I don't think it's necessarily a fair assumption that the inputs to msb functions tend to be evenly distributed. In practice, these inputs tend to be interrupt registers or bitboards or some other data structure with unevenly distributed values. For a fair benchmark I think it is safer to assume that the outputs (not the inputs) will be evenly distributed.

#### @greggo 2012-10-14 20:36:04

Putting this in since it's 'yet another' approach, seems to be different from others already given.

returns `-1` if `x==0`, otherwise `floor( log2(x))` (max result 31)

Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.

This is what I use when I don't want to use `__builtin_clz` because of portability issues.

To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.

``````int log2floor( unsigned x ){
static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
int r = 0;
unsigned xk = x >> 16;
if( xk != 0 ){
r = 16;
x = xk;
}
// x is 0 .. 0xFFFF
xk = x >> 8;
if( xk != 0){
r += 8;
x = xk;
}
// x is 0 .. 0xFF
xk = x >> 4;
if( xk != 0){
r += 4;
x = xk;
}
// now x is 0..15; x=0 only if originally zero.
return r + wtab[x];
}
``````

#### @VoidStar 2016-08-16 06:33:17

Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than `_BitScanReverse` on any processor I've tested. However, in the general case, `_BitScanReverse` (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).

If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.

``````u8  inline log2 (u32 val)  {
u8  k = 0;
if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }
if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }
if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }
if (val > 0x00000003u) { val >>= 2;  k |= 2;  }
k |= (val & 2) >> 1;
return k;
}
``````

Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.

Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.

``````u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};

u8 log2_table(u32 val)  {
u8  k = 0;
if (val > 0x0000FFFFuL) { val >>= 16; k  = 16; }
if (val > 0x000000FFuL) { val >>=  8; k |=  8; }
k |= kTableLog2[val]; // precompute the Log2 of the low byte

return k;
}
``````

This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.

#### @Peter Cordes 2018-03-11 04:50:09

Some of the answers are branchless, but this probably will compile with conditional branches. Did you only benchmark with the same value repeatedly, or a simple pattern or something? Branch misprediction is a killer for performance. stackoverflow.com/questions/11227809/…

#### @user4924658 2015-05-21 12:32:17

thats some kind of binary search, it works with all kinds of (unsigned!) integer types

``````#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))

int msb(UINT x)
{
if(0 == x)
return -1;

int c = 0;

for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x >> i))
{
x >>= i;
c |= i;
}

return c;
}
``````

to make complete:

``````#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))

int lsb(UINT x)
{
if(0 == x)
return -1;

int c = UINT_BIT-1;

for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
if(static_cast<UINT>(x << i))
{
x <<= i;
c ^= i;
}

return c;
}
``````

#### @underscore_d 2015-10-11 16:21:11

Please consider not using ALL_CAPS for `typedef`s or indeed anything except preprocessor macros. This is a widely accepted convention.

#### @ChuckCottrill 2015-10-17 17:13:56

Note that what you are trying to do is calculate the integer log2 of an integer,

``````#include <stdio.h>
#include <stdlib.h>

unsigned int
Log2(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1; int k=0;
for( step = 1; step < bits; ) {
n |= (n >> step);
step *= 2; ++k;
}
//printf("%ld %ld\n",x, (x - (n >> 1)) );
return(x - (n >> 1));
}
``````

Observe that you can attempt to search more than 1 bit at a time.

``````unsigned int
Log2_a(unsigned long x)
{
unsigned long n = x;
int bits = sizeof(x)*8;
int step = 1;
int step2 = 0;
//observe that you can move 8 bits at a time, and there is a pattern...
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//if( x>1<<step2+8 ) { step2+=8;
//}
//}
//}
for( step2=0; x>1L<<step2+8; ) {
step2+=8;
}
//printf("step2 %d\n",step2);
for( step = 0; x>1L<<(step+step2); ) {
step+=1;
//printf("step %d\n",step+step2);
}
printf("log2(%ld) %d\n",x,step+step2);
return(step+step2);
}
``````

This approach uses a binary search

``````unsigned int
Log2_b(unsigned long x)
{
unsigned long n = x;
unsigned int bits = sizeof(x)*8;
unsigned int hbit = bits-1;
unsigned int lbit = 0;
unsigned long guess = bits/2;
int found = 0;

while ( hbit-lbit>1 ) {
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
//when value between guess..lbit
if( (x<=(1L<<guess)) ) {
//printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
hbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
//when value between hbit..guess
//else
if( (x>(1L<<guess)) ) {
//printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
lbit=guess;
guess=(hbit+lbit)/2;
//printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
}
}
if( (x>(1L<<guess)) ) ++guess;
printf("log2(x%ld)=r%d\n",x,guess);
return(guess);
}
``````

Another binary search method, perhaps more readable,

``````unsigned int
Log2_c(unsigned long x)
{
unsigned long v = x;
unsigned int bits = sizeof(x)*8;
unsigned int step = bits;
unsigned int res = 0;
for( step = bits/2; step>0; )
{
//printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
while ( v>>step ) {
v>>=step;
res+=step;
//printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
}
step /= 2;
}
if( (x>(1L<<res)) ) ++res;
printf("log2(x%ld)=r%ld\n",x,res);
return(res);
}
``````

And because you will want to test these,

``````int main()
{
unsigned long int x = 3;
for( x=2; x<1000000000; x*=2 ) {
//printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
}
return(0);
}
``````

#### @jemin 2015-06-29 10:21:10

The code:

``````    // x>=1;
unsigned func(unsigned x) {
double d = x ;
int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
printf( "The left-most non zero bit of %d is bit %d\n", x, p);
}
``````

Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

#### @underscore_d 2015-10-11 16:20:19

uhhhhh. what? how does this function? is it in any way portable?

#### @jemin 2015-10-25 15:01:59

Codes in the window is portable. The function FYL2X() is a fpu instruction, but may be ported and may found in some FPU/math library.

#### @Jim Balter 2016-12-28 10:00:58

@underscore_d It works because floating point numbers are normalized ... converting to double shifts the mantissa bits to eliminate leading zeros, and this code extracts the exponent and adjusts it to determine the number of bits shifted. It certainly isn't architecture-independent, but it will probably work on any machine you come across.

#### @Peter Cordes 2018-03-11 05:09:23

This is an alternate version of this answer, see there for comments on performance and portability. (Specifically the non-portability of pointer casting for type-punning.) It uses address math to only reload the high 32 bits of the `double`, which is probably good if it does actually store/reload instead of type-pun some other way, e.g. with a `movq` instruction like you might get here on x86.

#### @Glenn Slayden 2020-04-07 07:27:55

Also note my [comment to that answer], where I offer the dire warning that this method gives the wrong answer for values in (at least) the range `[7FFFFFFFFFFFFE00 - 7FFFFFFFFFFFFFFF]`.

#### @user3177100 2014-11-24 09:44:41

A version in C using successive approximation:

``````unsigned int getMsb(unsigned int n)
{
unsigned int msb  = sizeof(n) * 4;
unsigned int step = msb;
while (step > 1)
{
step /=2;
if (n>>msb)
msb += step;
else
msb -= step;
}
if (n>>msb)
msb++;
return (msb - 1);
}
``````

Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same. ( 4 loops when using "unsigned int")

#### @Peter Cordes 2018-03-11 04:58:05

If you write it with a ternary operator (`msb += (n>>msb) ? step : -step;`), more compilers are likely to make branchless asm, avoiding branch mispredicts on every step (stackoverflow.com/questions/11227809/…).

#### @David C. Rankin 2014-05-26 08:48:19

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:

``````#if defined(__LP64__) || defined(_LP64)
# define BUILD_64   1
#endif

#include <stdio.h>
#include <stdint.h>  /* for uint32_t */

/* CHAR_BIT  (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT  8
#endif  /* CHAR_BIT */

/*
* Find the log base 2 of an integer with the MSB N set in O(N)
* operations. (on 64bit & 32bit architectures)
*/
int
getmsb (uint32_t word)
{
int r = 0;
if (word < 1)
return 0;
#ifdef BUILD_64
union { uint32_t u[2]; double d; } t;  // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
while (word >>= 1)
{
r++;
}
#endif  /* BUILD_64 */
return r;
}
``````

#### @David C. Rankin 2014-06-21 09:25:52

Wasn't int r; originally defined above the `#ifdef BUILD_64` flag? In which case it would not need redefinition within the conditional.

#### @Noldorin 2009-03-22 23:46:27

Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.

[...], the `bsrl` assembly instruction computes the position of the most significant bit. Thus, we could use this `asm` statement:

``````asm ("bsrl %1, %0"
: "=r" (position)
: "r" (number));
``````

#### @Noldorin 2009-03-23 00:05:05

To expand: the standard loop solution (shifting left and checking MSB) is probably the most readable. As in all cases involving bit twiddling, the speed of ASM can't be beaten, though there's no point cluttering your code unless necessary. Hacks are an in-between solution - go one way or the other.

#### @jalf 2009-03-23 01:22:43

I'd say taking the logarithm would be a perfectly readable solution (check the generated asm to see if the compiler can optimize it to use this asm instruction)

#### @rlbond 2009-03-23 04:39:37

Sometimes the inline ASM solution is slower, depending on the implementation in CPU microcode.

#### @Noldorin 2009-03-23 12:04:53

@rlbound: I can hardly believe that, although I may be mistaken. On any modern CPU one would think that it would get translated to a single instruction....

#### @harold 2017-10-26 13:54:49

@Noldorin it's a bit late but.. It's by definition a single instruction, but if it's microcoded as rlbond suggests then that single instruction could decode to a whole bunch of µops internally. That tends to be the case on AMD's microarchitectures, and Intel Atom, but on normal Intel microarchitectures it's a single operation all the way down.

#### @Noldorin 2017-10-26 17:12:41

@harold Thanks for the elaboration. Good to know.

#### @Marco Amagliani 2013-12-01 01:17:13

``````int highest_bit(unsigned int a) {
int count;
std::frexp(a, &count);
return count - 1;
}
``````

?

#### @Peter Cordes 2018-03-11 04:43:37

This is a slow (but more portable) version of this answer, which explains why it works.

#### @Kaz 2011-12-11 07:43:13

Kaz Kylheku here

I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

(I happen to need this "find highest bit" for something, you see.)

I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.

The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.

Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).

``````int highest_bit_unrolled(long long n)
{
if (n & 0x7FFFFFFF00000000) {
if (n & 0x7FFF000000000000) {
if (n & 0x7F00000000000000) {
if (n & 0x7000000000000000) {
if (n & 0x4000000000000000)
return 63;
else
return (n & 0x2000000000000000) ? 62 : 61;
} else {
if (n & 0x0C00000000000000)
return (n & 0x0800000000000000) ? 60 : 59;
else
return (n & 0x0200000000000000) ? 58 : 57;
}
} else {
if (n & 0x00F0000000000000) {
if (n & 0x00C0000000000000)
return (n & 0x0080000000000000) ? 56 : 55;
else
return (n & 0x0020000000000000) ? 54 : 53;
} else {
if (n & 0x000C000000000000)
return (n & 0x0008000000000000) ? 52 : 51;
else
return (n & 0x0002000000000000) ? 50 : 49;
}
}
} else {
if (n & 0x0000FF0000000000) {
if (n & 0x0000F00000000000) {
if (n & 0x0000C00000000000)
return (n & 0x0000800000000000) ? 48 : 47;
else
return (n & 0x0000200000000000) ? 46 : 45;
} else {
if (n & 0x00000C0000000000)
return (n & 0x0000080000000000) ? 44 : 43;
else
return (n & 0x0000020000000000) ? 42 : 41;
}
} else {
if (n & 0x000000F000000000) {
if (n & 0x000000C000000000)
return (n & 0x0000008000000000) ? 40 : 39;
else
return (n & 0x0000002000000000) ? 38 : 37;
} else {
if (n & 0x0000000C00000000)
return (n & 0x0000000800000000) ? 36 : 35;
else
return (n & 0x0000000200000000) ? 34 : 33;
}
}
}
} else {
if (n & 0x00000000FFFF0000) {
if (n & 0x00000000FF000000) {
if (n & 0x00000000F0000000) {
if (n & 0x00000000C0000000)
return (n & 0x0000000080000000) ? 32 : 31;
else
return (n & 0x0000000020000000) ? 30 : 29;
} else {
if (n & 0x000000000C000000)
return (n & 0x0000000008000000) ? 28 : 27;
else
return (n & 0x0000000002000000) ? 26 : 25;
}
} else {
if (n & 0x0000000000F00000) {
if (n & 0x0000000000C00000)
return (n & 0x0000000000800000) ? 24 : 23;
else
return (n & 0x0000000000200000) ? 22 : 21;
} else {
if (n & 0x00000000000C0000)
return (n & 0x0000000000080000) ? 20 : 19;
else
return (n & 0x0000000000020000) ? 18 : 17;
}
}
} else {
if (n & 0x000000000000FF00) {
if (n & 0x000000000000F000) {
if (n & 0x000000000000C000)
return (n & 0x0000000000008000) ? 16 : 15;
else
return (n & 0x0000000000002000) ? 14 : 13;
} else {
if (n & 0x0000000000000C00)
return (n & 0x0000000000000800) ? 12 : 11;
else
return (n & 0x0000000000000200) ? 10 : 9;
}
} else {
if (n & 0x00000000000000F0) {
if (n & 0x00000000000000C0)
return (n & 0x0000000000000080) ? 8 : 7;
else
return (n & 0x0000000000000020) ? 6 : 5;
} else {
if (n & 0x000000000000000C)
return (n & 0x0000000000000008) ? 4 : 3;
else
return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
}
}
}
}
}

int highest_bit(long long n)
{
const long long mask[] = {
0x000000007FFFFFFF,
0x000000000000FFFF,
0x00000000000000FF,
0x000000000000000F,
0x0000000000000003,
0x0000000000000001
};
int hi = 64;
int lo = 0;
int i = 0;

if (n == 0)
return 0;

for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
int mi = lo + (hi - lo) / 2;

if ((n >> mi) != 0)
lo = mi;
else if ((n & (mask[i] << lo)) != 0)
hi = mi;
}

return lo + 1;
}
``````

Quick and dirty test program:

``````#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int highest_bit_unrolled(long long n);
int highest_bit(long long n);

main(int argc, char **argv)
{
long long n = strtoull(argv[1], NULL, 0);
int b1, b2;
long i;
clock_t start = clock(), mid, end;

for (i = 0; i < 1000000000; i++)
b1 = highest_bit_unrolled(n);

mid = clock();

for (i = 0; i < 1000000000; i++)
b2 = highest_bit(n);

end = clock();

printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);

printf("time1 = %d\n", (int) (mid - start));
printf("time2 = %d\n", (int) (end - mid));
return 0;
}
``````

Using only -O2, the difference becomes greater. The decision tree is almost four times faster.

I also benchmarked against the naive bit shifting code:

``````int highest_bit_shift(long long n)
{
int i = 0;
for (; n; n >>= 1, i++)
; /* empty */
return i;
}
``````

This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!

On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.

I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.

#### @Surt 2014-10-02 15:29:05

I'm not saying this isn't good, but Your test program here only tests on the same number, which after 2-3 iterations will have set the branch predictors to their final position and after that they will make perfect branch predictions. The good thing is that with a totally random distribution half the numbers will have close to perfect prediction, namely bit63.

#### @dangermouse 2011-10-14 12:29:54

I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.

``````int highest_bit(unsigned int a) {
static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
int l, h;

if (a == 0) return -1;

l = 0;
h = 32;

do {
int m = l + (h - l) / 2;

if ((a >> m) != 0) l = m;
else if ((a & (*mask << l)) != 0) h = m;

} while (l < h - 1);

return l;
}
``````

#### @JonesD 2011-07-09 08:14:18

Expanding on Josh's benchmark... one can improve the clz as follows

``````/***************** clz2 ********************/

#define NUM_OF_HIGHESTBITclz2(a) ((a)                              \
? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
: 0)
``````

Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.

#### @Josh 2011-07-08 14:20:26

Here are some (simple) benchmarks, of algorithms currently given on this page...

The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)

On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...

``````//////// go.c ///////////////////////////////
// compile with:  gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/***************** math ********************/

#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */    \
((unsigned) log2(a))         /* thus: do not use if a <= 0 */

#define NUM_OF_HIGHESTBITmath(a) ((a)               \
? (1U << POS_OF_HIGHESTBITmath(a))    \
: 0)

/***************** clz ********************/

unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */

#define NUM_OF_HIGHESTBITclz(a) ((a)                    \
? (1U << POS_OF_HIGHESTBITclz(a))  \
: 0)

/***************** i2f ********************/

double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)

#define NUM_OF_HIGHESTBITi2f(a) ((a)                    \
? (1U << POS_OF_HIGHESTBITi2f(a))  \
: 0)

/***************** asm ********************/

unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)

#define NUM_OF_HIGHESTBITasm(a) ((a)                    \
? (1U << POS_OF_HIGHESTBITasm(a))  \
: 0)

/***************** bitshift1 ********************/

#define NUM_OF_HIGHESTBITbitshift1(a) (({   \
OUT = a;                  \
OUT |= (OUT >> 1);                \
OUT |= (OUT >> 2);                \
OUT |= (OUT >> 4);                \
OUT |= (OUT >> 8);                \
OUT |= (OUT >> 16);               \
}), (OUT & ~(OUT >> 1)))          \

/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};

#define POS_OF_HIGHESTBITbitshift2(a) (({   \
OUT = a;                  \
OUT |= OUT >> 1;              \
OUT |= OUT >> 2;              \
OUT |= OUT >> 4;              \
OUT |= OUT >> 8;              \
OUT |= OUT >> 16;             \
OUT = (OUT >> 1) + 1;             \
}), POS[(OUT * 0x077CB531UL) >> 27])

#define NUM_OF_HIGHESTBITbitshift2(a) ((a)              \
? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
: 0)

#define LOOPS 100000000U

int main()
{
time_t start, end;
unsigned ui;
unsigned n;

/********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
printf("math\n");
for (ui = 0U; ui < 18; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));

printf("\n\n");

printf("clz\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));

printf("\n\n");

printf("i2f\n");
for (ui = 0U; ui < 18U; ++ui)
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));

printf("\n\n");

printf("asm\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
}

printf("\n\n");

printf("bitshift1\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
}

printf("\n\n");

printf("bitshift2\n");
for (ui = 0U; ui < 18U; ++ui) {
printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
}

/************************* Simple clock() benchmark ******************/
start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITmath(ui);
end = clock();
printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITclz(ui);
end = clock();
printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITi2f(ui);
end = clock();
printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITasm(ui);
end = clock();
printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift1(ui);
end = clock();
printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

start = clock();
for (ui = 0; ui < LOOPS; ++ui)
n = NUM_OF_HIGHESTBITbitshift2(ui);
end = clock();
printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");

return EXIT_SUCCESS;
}
``````

#### @Quinn Taylor 2011-02-11 15:31:26

Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

``````unsigned int v;
unsigned r = 0;

while (v >>= 1) {
r++;
}
``````

This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.

If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.

NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.

#### @Johan 2011-09-20 09:43:58

It seems to work, but I do not fully understand why we don't shift out bits forever and get stuck in the while loop.

#### @Quinn Taylor 2011-09-21 19:31:44

@Johan Stepping through with a debugger can help explain why the loop exits. Basically, its' because the expression in the condition evaluates to 0 (which is treated as false) once the last 1 bit has been shifted off the right.

#### @Johan 2011-09-22 06:27:46

Nice idea to use the end result like that :)

#### @Xantix 2012-09-03 05:56:18

note: must be unsigned, for signed integers the right shift fails for negative numbers.

#### @Chase 2014-01-03 20:31:53

Xantix: The shift in C/C++ is a logical shift, so it works fine. For Java, JavaScript, or D, you need to use the logical shift operator `>>>`. Plus probably the comparator `!= 0`, and some unspecified number of parenthesis.

#### @Gizmo 2014-11-18 15:42:20

might anyone be interested, this is 2 times faster than `return (unsigned int)log2(val);` (no optimizations)

#### @Tim Čas 2015-02-12 20:57:57

@Chase: No it's not. It's a logical shift for unsigned. For signed, it may or may not be a logical shift (and it's usually arithmetic, in fact).

#### @Jim Balter 2016-12-28 10:12:36

"this is 2 times faster than return (unsigned int)log2(val)" -- the faintest praise.

#### @Protagonist 2009-03-23 00:32:11

This should be lightning fast:

``````int msb(unsigned int v) {
static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v = (v >> 1) + 1;
return pos[(v * 0x077CB531UL) >> 27];
}
``````

#### @jalf 2009-03-23 01:30:04

7 bit shifts, 5 or instructions, a multiplty and a potential cache miss. :) Did you benchmark it, or look at the assembler generated? It could end up quite slow, depending on how much of it the compiler can eliminate.

#### @Eduardo 2009-03-23 19:33:25

Hi, just wondering jalf, how do you now the thing with the possible cache miss?

#### @Protagonist 2009-03-23 19:56:01

Arno's answer doesn't work beyond 24 bits, rlbond's answer doesn't even work (it returns an answer where the most significant one bit is turned on and all the other bits are turned off--not the answer desired), and Vasil's code fragment isn't complete.

#### @Protagonist 2009-03-23 19:59:22

I'm new here. I don't get the negative votes guys. I have provided the only answer with source code that actually works.

#### @unwind 2009-03-24 15:00:32

The "possible cache miss" is probably due to this code requiring access to its lookup table. If that table is not cached when this is called, there will be a stall while it's fetched. This might make the worst-case performance far worse than the solutions not using a LUT.

#### @j_random_hacker 2011-07-08 06:05:20

@jalf: With you on the multiply, and the dependency chain of shifts is not ideal, but you're worried about cache misses? LOL... The only time a cache miss would hurt performance is if this routine was being called often in a tight loop, in which case the entire table would most likely be cached by call #2! Cache performance on a read-only table of 32 ints is not ever something to worry about.

#### @jalf 2011-07-08 07:46:07

not really the point. It uses a lot more data cache than necessary (more than one cache line, even), and more instruction cache than necessary. You'll likely get cache misses that could have been avoided the first time you call the function, and it will pollute the cache more than necessary, so after the call, other code might encounter more misses than necessary. LUT's often aren't worth the trouble because cache misses are expensive. But I only said it was something I'd want to benchmark before I claimed it was "lightning fast". Not that it is definitely a problem.

#### @Gizmo 2014-11-18 15:46:46

it does seem that this is 1.85 times faster than Quin Taylors solution

#### @t0rakka 2015-01-09 14:08:18

The problem is at the end where you extract the most significant bit of the generated mask. You do (v >> 1) + 1, which does erraneously set the least significant bit to one (the +1..). If you change the order, (v + 1) >> 1, now you miss the case where msb is 0x80000000. 0x80000000 expanded into bit mask is 0xffffffff, + 1, is zero, not 0x80000000. The correct way to handle this is to do the expansion into mask like you do.. but instead of isolating the highest set bit use the DeBruijn variant where you get unique hash with mask instead of single isolated bit. Both are perfect unique.

#### @t0rakka 2015-01-09 14:09:23

static const uint8 table[] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; return table[(v * 0x07c4acdd) >> 27];

#### @t0rakka 2015-01-09 14:32:45

Never mind, it was my bad; I compiled this on a 64 bit compiler so the 0x077CB531UL had to be changed into 0x077CB531 to get a correct result. The mask-debruijn variant saves two instructions because the highest set bit doesn't need to be isolated. I recommend 32 bit multiplication even when using 64 bit compiler. :)

#### @ChuckCottrill 2015-10-14 01:31:58

The table has 32 entries, and every value is < 255 (127), so define the table as type unsigned char, and it will fit in a single 32 byte L1 cache line. And the whole thing fits in two cache lines.

#### @chux - Reinstate Monica 2018-11-03 20:25:20

Re: have provided the only answer with source code that actually works, this answer fails when `unsigned` is not 32-bit. Good, but not universal.

#### @Stefan Reich 2020-06-23 15:41:20

@Protagonist Nobody gets the downvotes on StackOverflow.

#### @Vasil 2009-03-22 23:51:08

Think bitwise operators.

I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:

``````position = sizeof(int)*8
while(!(n & cmp)){
n <<=1;
position--;
}
``````

#### @dreamlax 2009-03-23 03:31:51

What do you mean converting to a string? The definition of ffs takes an int and returns an int. Where would the conversion be? And what purpose would the conversion serve if we are looking for bits in a word?

#### @Vasil 2009-03-23 03:37:38

I didn't know of that function.

#### @Peter Cordes 2018-03-11 04:53:59

The `8` should be `CHAR_BIT`. This is very unlikely to be the fastest way, because branch misprediction will happen on exiting the loop unless this is used with the same input repeatedly. Also, for small inputs (lots of zeros), it has to loop a lot. This is like the fallback way that you'd use as the easy-to-verify version in a unit test to compare against optimized versions.

#### @rlbond 2009-03-23 03:21:45

``````unsigned int
msb32(register unsigned int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(x & ~(x >> 1));
}
``````

1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.

#### @Protagonist 2009-03-23 21:55:22

The above code doesn't answer the question. It returns an unsigned integer where the most significant on bit in x remains on and all the other bits are turned off. The question was to return the position of the most significant on bit.

#### @R.. GitHub STOP HELPING ICE 2011-02-06 06:55:21

You can then use a De Bruijn sequence approach to find the index of the bit that's set. :-)

#### @rlbond 2011-02-08 01:27:14

@Protagonist, he said in a comment that either suffices.

#### @Quinn Taylor 2011-02-11 16:14:59

This one (from that same page) would do what you need, but it requires an additional function. aggregate.org/MAGIC/#Log2%20of%20an%20Integer

#### @Peter Cordes 2016-12-18 07:07:13

BSR is fast on Intel CPUs since Core2 at least. LZCNT is fast on AMD CPUs, and gcc uses it for `__builtin_clz` if it's enabled with `-march=native` or something (since it's fast on every CPU that supports it). Even on CPUs like AMD Bulldozer-family where BSR is "slow", it's not that slow: 7 m-ops with 4 cycle latency and one per 4c throughput. On Atom, BSR is really slow: 16 cycles. On Silvermont, it's 10 uops with 10 cycle latency. This might be slightly lower latency than BSR on Silvermont, but IDK.

#### @Jim Balter 2016-12-28 10:56:36

"this is usually faster than the BSR instruction" -- not relevant, since they don't do the same thing.

### [SOLVED] Easy interview question got harder: given numbers 1..100, find the missing number(s) given exactly k are missing

• 2010-08-16 10:26:58
• polygenelubricants
• 276361 View
• 1170 Score
• Tags:   algorithm math