By Matt Howells


2008-09-20 19:04:38 8 Comments

8 bits representing the number 7 look like this:

00000111

Three bits are set.

What are algorithms to determine the number of set bits in a 32-bit integer?

30 comments

@Bamidele Alegbe 2019-08-25 22:00:15

This is the implementation in golang

func CountBitSet(n int) int {


    count := 0
    for n > 0 {
      count += n & 1
      n >>= 1

    }
    return count
}

@Ciro Santilli 新疆改造中心法轮功六四事件 2019-07-31 07:44:33

C++20 std::popcount

The following proposal has been merged http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html and should add it to a the <bit> header.

I expect the usage to be like:

#include <bit>
#include <iostream>

int main() {
    std::cout << std::popcount(0x55) << std::endl;
}

I'll give it a try when support arrives to GCC, GCC 9.1.0 with g++-9 -std=c++2a still doesn't support it.

The proposal says:

Header: <bit>

namespace std {

  // 25.5.6, counting
  template<class T>
    constexpr int popcount(T x) noexcept;

and:

template<class T>
  constexpr int popcount(T x) noexcept;

Constraints: T is an unsigned integer type (3.9.1 [basic.fundamental]).

Returns: The number of 1 bits in the value of x.

std::rotl and std::rotr were also added to do circular bit rotations: Best practices for circular shift (rotate) operations in C++

@Varun Gusain 2017-08-23 17:31:11

what you can do is

while(n){
    n=n&(n-1);
    count++;
}

the logic behind this is the bits of n-1 is inverted from rightmost set bit of n. if n=6 i.e 110 then 5 is 101 the bits are inverted from rightmost set bit of n. so if we & these two we will make the rightmost bit 0 in every iteration and always go to the next rightmost set bit.Hence, counting the set bit.The worst time complexity will be O(logn) when every bit is set.

@Arjun Singh 2019-06-06 16:12:25

Simple algorithm to count the number of set bits:

int countbits(n){
     int count = 0;
     while(n != 0){
        n = n & (n-1);
        count++;
   }
   return count;
}

Take the example of 11 (1011) and try manually running through the algorithm. Should help you a lot!

@John Dimm 2013-12-20 06:55:35

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.

unsigned int bitCount(unsigned int x)
{
  x = ((x >> 1) & 0b01010101010101010101010101010101)
     + (x       & 0b01010101010101010101010101010101);
  x = ((x >> 2) & 0b00110011001100110011001100110011)
     + (x       & 0b00110011001100110011001100110011); 
  x = ((x >> 4) & 0b00001111000011110000111100001111)
     + (x       & 0b00001111000011110000111100001111); 
  x = ((x >> 8) & 0b00000000111111110000000011111111)
     + (x       & 0b00000000111111110000000011111111); 
  x = ((x >> 16)& 0b00000000000000001111111111111111)
     + (x       & 0b00000000000000001111111111111111); 
  return x;
}

The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.

@Nopik 2014-08-22 07:38:28

This solution seem to have minor problem, related to operator precedence. For each term it should say: x = (((x >> 1) & 0b01010101010101010101010101010101) + (x & 0b01010101010101010101010101010101)); (i.e. extra parens added).

@decrypto 2018-02-11 13:12:57

This will also work fine :

int ans = 0;
while(num){
 ans += (num &1);
 num = num >>1;
}    
return ans;

@Bergi 2018-02-11 13:17:41

How is this different from this existing answer?

@greybeard 2018-02-11 19:10:58

Both without a mention of S.E. Anderson's Bit Twiddling Hacks or B. Kernighan. Sigh.

@S.S. Anne 2019-09-19 17:31:38

This only counts the number of initial set bits.

@stacktay 2017-11-08 14:50:34

  private int get_bits_set(int v)
    {
      int c; // c accumulates the total bits set in v
        for (c = 0; v>0; c++)
        {
            v &= v - 1; // clear the least significant bit set
        }
        return c;
    }

@Nils Pipenbrinck 2008-09-20 19:23:05

Also consider the built-in functions of your compilers.

On the GNU compiler for example you can just use:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a cpu instruction to do the same job faster.

The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have the popcount for years.


On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt or -msse4.2 to also enable the vector instructions that were added in the same generation. See GCC x86 options. -march=nehalem (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.


Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt.)

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

This source isn't x86-specific or GNU-specific at all, but only compiles well for x86 with gcc/clang/icc.

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

@Mike F 2008-09-25 03:29:58

I agree that this is good practice in general, but on XCode/OSX/Intel I found it to generate slower code than most of the suggestions posted here. See my answer for details.

@Calyth 2009-01-09 23:38:30

AFAIK the only x86 CPU that could do a pop-count in a single instruction would be the AMD Phenom/Barcelona (Family 10h). has about a latency of 4 cycles or so?

@matja 2009-11-24 10:31:53

The Intel i5/i7 has the SSE4 instruction POPCNT which does it, using general purpose registers. GCC on my system does not emit that instruction using this intrinsic, i guess because of no -march=nehalem option yet.

@Nils Pipenbrinck 2009-11-24 13:29:35

@matja, my GCC 4.4.1 emits the popcnt instruction if I compile with -msse4.2

@deft_code 2010-09-04 18:18:38

use c++'s std::bitset::count. after inlining this compiles to a single __builtin_popcount call.

@nlucaroni 2013-07-24 17:11:30

the intrinsic mentioned (_popcnt32/64) is located in immintrin.h and available if one has the POPCNT CPUID Feature Flag. It's not really part of SSE --at least that is my interpretation of the information in the Intrinsics Guide 3.0.1 provided by Intel)

@Nils Pipenbrinck 2013-07-24 19:45:44

@nlucaroni Well, yes. Times are changing. I've wrote this answer in 2008. Nowadays we have native popcount and the intrinsic will compile down to a single assembler statement if the platform allows that.

@Michael 2015-12-24 16:46:41

Unfortunately, call/return can be costly, so for inner loops I would prefer an inline version of Matt's NumberOfSetBits().

@Peter Cordes 2017-08-06 19:04:26

@Michael: __builtin functions aren't real functions that get called. If compiling for a target that supports it as a single instruction (e.g. x86 with -mpopcnt), it inlines to just that. However, without that gcc may emit a call to a libgcc.a function like __popcountdi2 instead of inlining those instructions. It's up to the compiler, though; clang4.0 chooses to inline, just like it does for std::bitset.count(). godbolt.org/g/TueMQt

@BeeOnRope 2017-08-06 20:25:14

Just to update the "state of popcnt instruction performance" mentioned in some of the above comments, the last several generations of Intel CPUs have been able to issue 1 popcnt per cycle, with a latency of 3 cycles, and AMD Zen architecture can issue 4 (!!) popcnt instructions per cycle with a latency of 1 cycle. So we can say that popcnt is "really fast" on modern hardware, and in AMD's case as fast as trivial instructions like or and add.

@Matt Howells 2008-09-20 19:05:22

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions (like x86's popcnt, on CPUs where it's supported) will almost certainly be fastest. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed).

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

int numberOfSetBits(int i)
{
     // Java: use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.


This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

References:

https://graphics.stanford.edu/~seander/bithacks.html

https://en.wikipedia.org/wiki/Hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)

@j_random_hacker 2009-05-25 12:23:35

+1. The first line in your NumberOfSetBits() is very cool -- just 3 instructions, instead of the 4 you would need if you separately masked out the even- and odd-numbered bits and added them (appropriately shifted) together.

@Jason S 2009-11-22 06:51:19

ha! love the NumberOfSetBits() function, but good luck getting that through a code review. :-)

@Craig McQueen 2009-12-15 02:18:41

Maybe it should use unsigned int, to easily show that it is free of any sign bit complications. Also would uint32_t be safer, as in, you get what you expect on all platforms?

@Maciej Hehl 2010-04-25 19:56:02

I checked a few things while answering to this question stackoverflow.com/questions/2709430/… and I think there is a bug. last line: return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; should be changed to: return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24; (first the sum and bitwise & later)

@Matt Howells 2010-04-28 10:04:26

@Maciej: Can you please provide a case where it does not return the expected result?

@Maciej Hehl 2010-04-29 14:19:51

@Matt Howells I'm sorry. It looks like I got lost among those parenthesis. I had a bug in my implementation. it didn't work for numbers > 15. I checked the wikipedia article and there were those parenthesis. I was sure that was my problem. I fixed the parenthesis and it started to work. Looks like I fixed something else. But I think I learned something. Inability to reproduce the "bug" made me check the precedence of the operators :). Thank You and I apologize for the mess

@Robert S. Barnes 2011-03-29 07:54:12

Is there a portable version of this code? For instance, how does it behave with 9 bit bytes or other unusual architectures?

@R.. GitHub STOP HELPING ICE 2011-05-14 21:55:39

@nonnb: Actually, as written, the code is buggy and needs maintenance. >> is implementation-defined for negative values. The argument needs to be changed (or cast) to unsigned, and since the code is 32-bit-specific, it should probably be using uint32_t.

@Matt Howells 2011-06-14 11:53:16

@R.. Who says it is C++?

@Peter Hosey 2011-08-04 05:36:58

I took the “write-only” bit, true as it was, as a challenge, and set to decode the code. I got about halfway; I'm not sure exactly what the additions and multiplication do in the context of this function. I've edited the answer to include the explanation as far as I could get it; I invite anyone smarter than me to finish it.

@Matt Howells 2011-08-22 08:58:38

@Peter Hosey: Sorry, but I feel your comments in the code don't add much value and perhaps the explanation would be better in a comment.

@dash-tom-bang 2012-12-05 00:42:48

It's not really magic. It's adding sets of bits but doing so with some clever optimizations. The wikipedia link given in the answer does a good job of explaining what's going on but I'll go line by line. 1) Count up the number of bits in every pair of bits, putting that count in that pair of bits (you'll have 00, 01, or 10); the "clever" bit here is the subtract that avoids one mask. 2) Add pairs of those sums of bitpairs into their corresponding nibbles; nothing clever here but each nibble will now have a value 0-4. (cont'd)

@dash-tom-bang 2012-12-05 00:43:05

3) does too much on one line, but up to the mask, the nibbles are added up into bytes, then the multiply adds all of the bytes into the high byte, which is then shifted down, leaving the result.

@dash-tom-bang 2012-12-05 00:48:15

Another note, this extends to 64 and 128 bit registers by simply extending the constants appropriately. Interestingly (to me), those constants are also ~0 / 3, 5, 17, and 255; the former three being 2^n+1. This all makes more sense the more you stare at it and think about it in the shower. :)

@greggo 2012-12-14 14:26:01

Another note, there's complaints above about >> extension being unspecified for negative ints, however, all of the results from >> are masked in such a way as to discard the extended bits .. except for the final >> which is OK since the upper 2 bits of its input are mathematically guaranteed to be zero. This may apply to the Jave concern as well.

@anon58192932 2013-02-09 02:42:53

What if our input is a byte? I don't believe it's working for me when I cast my input from a byte to an int.

@Lefteris E 2013-06-13 10:07:29

The solution by bcdabcd987 is the same algorithm before being optimized to the point of becoming illegible...

@Jingguo Yao 2013-12-25 05:53:44

For a explanation of this algorithm, refer to page 179-180 of Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors.

@Apriori 2014-06-08 20:45:28

@R: greggo is correct. If ones are shifted in then they are masked away. The final shift follows a multiplication summing every byte into the high byte. Since the high byte is summing bits set, it will never exceed 32 and only requires at most 6 bits. So the high bit will never be set and the number will never be negative.

@keyser 2015-02-06 16:34:10

I find it weird that this answer isn't closer to this one in terms of upvotes. Performance-wise it seems really stable to depend on the number of bits set. Maybe I misunderstood just how fast this one can be.

@Maarten Bodewes 2015-03-18 18:42:09

For Java you can just call Integer.bitCount(int) (since 1.5 but that's a while back already).

@Lewis Diamond 2016-05-30 17:06:08

You can also generate a neat lookup table using variadic templates in C++. Still doesn't perform as well as the builtin though. bitbucket.org/ldiamond/popcount/src

@corsiKa 2016-07-22 19:39:08

@JasonS If you cite where you get it from, for example Numerical Recipes or Art of Programming, and someone still denies it from code review as unmaintainable, that person shouldn't be reviewing code. There are some bibles/oracles we simply trust as sources of good algorithms so we don't go reinventing things ourselves.

@Jason S 2016-07-23 19:56:43

Even with citations, that doesn't mean the code has been copied correctly, or that the original code did not contain errors for edge cases.

@greggo 2017-03-22 15:33:56

It's not hard to write a test suite that just goes through all 2**32 possible inputs and compares to a transparent reference implementation. Given that there are no loops or conditionals, no variable shifts (and thus no room for UD behavior, if you are using unsigned) that counts as proof. Not so much for the 64-bit version though. Incidentally, __builtin_popcount() (gcc,clang) will generate something very much like this when there is no popcount instruction.

@Albert van der Horst 2019-03-05 10:52:50

Never use shift operators and masks on signed quantities.

@Doin 2019-11-09 15:42:17

If implementing this in Javascript, change the first line to i = (i|0) - ((i >> 1) & 0x55555555); The |0 coerces the value to an int32, giving a significant speed-up. Also using >>> over >> seems slightly faster (although as previous comments have pointed out, both will in fact work correctly in this algorithm). See performance tests here: jsperf.com/fast-int32-bit-count

@cipilo 2017-07-07 23:51:50

I have not seen this approach anywhere:

int nbits(unsigned char v) {
    return ((((v - ((v >> 1) & 0x55)) * 0x1010101) & 0x30c00c03) * 0x10040041) >> 0x1c;
}

It works per byte, so it would have to be called 4 times for a 32-bit integer. It is derived from the sideways addition but uses two 32-bit multiplications to reduce the number of instructions to only 7.

Most current C compilers will optimize this function using SIMD (SSE2) instructions when it is clear that the number of requests is a multiple of 4, and it becomes quite competitive. It is portable, can be defined as a macro or inline function and does not need data tables.

This approach can be extended to work on 16 bits at a time, using 64-bit multiplications. However, it fails when all 16 bits are set, returning zero, so it can be used only when the 0xffff input value is not present. It is also slower due to the 64-bit operations and does not optimize well.

@rashedcs 2017-06-15 12:01:54

You can use built in function named __builtin_popcount(). There is no__builtin_popcount in C++ but it is a built in function of GCC compiler. This function return the number of set bit in an integer.

int __builtin_popcount (unsigned int x);

Reference : Bit Twiddling Hacks

@diugalde 2016-11-03 21:02:05

I always use this in Competitive Programming and it's easy to write and efficient:

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}

@Jonatan Kaźmierczak 2016-10-04 21:05:15

In Java 8 or 9 just invoke Integer.bitCount .

@Manish Mulani 2012-05-05 07:12:53

I use the below code which is more intuitive.

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

Logic : n & (n-1) resets the last set bit of n.

P.S : I know this is not O(1) solution, albeit an interesting solution.

@ealfonso 2018-01-14 19:55:48

this is good for "sparse" numbers with a low number of bits, as it is O(ONE-BITS). It is indeed O(1) since there are at most 32 one-bits.

@Burhan ARAS 2015-09-25 17:31:37

public class BinaryCounter {

private int N;

public BinaryCounter(int N) {
    this.N = N;
}

public static void main(String[] args) {

    BinaryCounter counter=new BinaryCounter(7);     
    System.out.println("Number of ones is "+ counter.count());

}

public int count(){
    if(N<=0) return 0;
    int counter=0;
    int K = 0;
    do{
        K = biggestPowerOfTwoSmallerThan(N);
        N = N-K;
        counter++;
    }while (N != 0);
    return counter;

}

private int biggestPowerOfTwoSmallerThan(int N) {
    if(N==1) return 1;
    for(int i=0;i<N;i++){
        if(Math.pow(2, i) > N){
            int power = i-1;
            return (int) Math.pow(2, power);
        }
    }
    return 0;
}
}

@Jean-François Fabre 2017-02-08 21:18:29

(int) Math.pow(2, power) is that a joke?

@Shree Harsha 2019-12-19 04:57:45

Don't use an axe to cut your nails when you have a nailclipper.

@Erorr 2016-06-01 02:16:13

I think the Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

int countSetBits(unsigned int n) { 
    unsigned int n; // count the number of bits set in n
    unsigned int c; // c accumulates the total bits set in n
    for (c=0;n>0;n=n&(n-1)) c++; 
    return c; 
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

@ErmIg 2016-02-15 14:33:01

I found an implementation of bit counting in an array with using of SIMD instruction (SSSE3 and AVX2). It has in 2-2.5 times better performance than if it will use __popcnt64 intrinsic function.

SSSE3 version:

#include <smmintrin.h>
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum =  _mm128_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum =  _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

@KeineKaefer 2016-02-04 11:45:31

How about converting the integer to a binary string and count the ones?

php solution:

substr_count( decbin($integer), '1' );

@gexicide 2016-07-21 10:55:11

Sounds like the javascript way. I would suggest using a webservice instead!

@Clearer 2018-04-18 11:18:08

So, first do an expensive conversion and then do a set of expensive comparisons? Sounds like a very slow method.

@daniel 2008-09-20 19:10:30

Why not iteratively divide by 2?

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2  

I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity

@Matt Howells 2008-09-20 19:40:13

That'll work and is easy to understand, but there are faster methods.

@Mike F 2008-09-20 21:50:32

Unless you do this a LOT, the performance impact would be negligible. So all things being equal, I agree with daniel that 'best' implies "doesn't read like gibberish".

@Matt Howells 2008-09-21 17:47:38

I deliberately didn't define 'best', to get a variety of methods. Lets face it if we have got down to the level of this sort of bit-twiddling we are probably looking for something uber-fast that looks like a chimp has typed it.

@Mecki 2008-09-25 10:35:31

Bad code. A compiler might make good one out of it, but in my tests GCC did not. Replace (n%2) with (n&1); AND being much faster than MODULO. Replace (n/=2) with (n>>=1); bitshifting much faster than division.

@Mike F 2008-09-25 13:32:50

@Mecki: In my tests, gcc (4.0, -O3) did do the obvious optimisations.

@chux - Reinstate Monica 2013-10-15 15:22:58

Negative values of n always return 0.

@greggo 2017-03-28 19:03:06

Regarding optimization of %2 and /2: n%2 and n/2 are only equivalent to n&1 and n>>1 when n >=0. So, with unsigned it's fine (and the example is only really correct with unsigned, anyway). In this example, even with int, the compiler could prove that those ops are never reached unless n>=0, so it's not a reflection on the general case; if compiler can't prove n>=0, you will get something like (n -(n>>31))>>1 for the n/2, and similar weirdness for the n%2. In general case, if( (n%2)!=0) could optimize to (n&1)!=0 but if( (n%2)==1) cannot.

@Ronald Souza 2019-09-15 06:52:59

The 2 statements within the loop (namely, "if (n % 2) == 1" and "count += 1") could be replaced by a single, faster one: "count += (n % 2) == 1", leveraging the C++ support to implicit cast between bool and int. Interesting answer anyway. Upvoted.

@Anders Cedronius 2015-11-18 22:45:59

Another Hamming weight algorithm if you're on a BMI2 capable CPU

the_weight=__tzcnt_u64(~_pext_u64(data[i],data[i]));

Have fun!

@greybeard 2015-11-18 23:07:33

(cram set bits to the low end, invert, count unset bits from the low end)

@Peter Cordes 2017-09-22 06:42:10

Fun, but of no practical value. All BMI2 CPUs have popcnt. pext same,same to pack the bits could be an interesting building-block for something else, but tzcnt and pext both run on the same port as popcnt on Intel CPUs, and pext is very slow on AMD. (agner.org/optimize). You can sort of emulate pext x,x with (1ULL << popcnt(x)) - 1, except for the x==0 case. x86 shifts can't shift out all the bits, because they mask the shift count, and you have to watch out for C undefined behaviour with out of range counts.

@Robert S. Barnes 2011-03-29 08:04:29

Here is a portable module ( ANSI-C ) which can benchmark each of your algorithms on any architecture.

Your CPU has 9 bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure a way to make the "Hacker's Delight" algorithm portable feel free to add it in.

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Wieght of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.  
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTalbe: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder. 
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String represenations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

.

#include <limits.h>

#include "bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
 * values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
    int count;

    /* Starting with:
     * 1100 - 1 == 1011,  1100 & 1011 == 1000
     * 1000 - 1 == 0111,  1000 & 0111 == 0000
     */
    for ( count = 0; val; ++count )
        val &= val - 1;

    return count;
}

/* Looks up each byte of the integer in a lookup table.
 *
 * The first time the function is called it initializes the lookup table.
 */
static int _tableBitCount( unsigned int val ) {
    int bCount = 0;

    if ( !_lookupTableInitialized ) {
        unsigned int i;
        for ( i = 0; i != UCHAR_MAX + 1; ++i )
            _bitCountTable[i] =
                ( unsigned char )_defaultBitCount( i );

        _lookupTableInitialized = 1;
    }

    for ( ; val; val >>= CHAR_BIT )
        bCount += _bitCountTable[val & UCHAR_MAX];

    return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = { "onTheFly", "lookupTable" };

void setStrategy( enum strategy s ) {
    switch ( s ) {
    case onTheFly:
        _bitcount = _defaultBitCount;
        break;
    case lookupTable:
        _bitcount = _tableBitCount;
        break;
    case strategyCount:
        break;
    }
}

/* Just a forwarding function which will call whichever version of the
 * algorithm has been selected by the client 
 */
int bitcount( unsigned int val ) {
    return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo random numbers to benmark each Hamming
 * Weight algorithm.
 */
void benchmark( int reps ) {
    clock_t start, stop;
    int i, j;
    static const int iterations = 1000000;

    for ( j = 0; j != strategyCount; ++j ) {
        setStrategy( j );

        srand( 257 );

        start = clock(  );

        for ( i = 0; i != reps * iterations; ++i )
            bitcount( rand(  ) );

        stop = clock(  );

        printf
            ( "\n\t%d psudoe-random integers using %s: %f seconds\n\n",
              reps * iterations, strategyNames[j],
              ( double )( stop - start ) / CLOCKS_PER_SEC );
    }
}

int main( void ) {
    int option;

    while ( 1 ) {
        printf( "Menu Options\n"
            "\t1.\tPrint the Hamming Weight of an Integer\n"
            "\t2.\tBenchmark Hamming Weight implementations\n"
            "\t3.\tExit ( or cntl-d )\n\n\t" );

        if ( scanf( "%d", &option ) == EOF )
            break;

        switch ( option ) {
        case 1:
            printf( "Please enter the integer: " );
            if ( scanf( "%d", &option ) != EOF )
                printf
                    ( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                      option, option, bitcount( option ) );
            break;
        case 2:
            printf
                ( "Please select number of reps ( in millions ): " );
            if ( scanf( "%d", &option ) != EOF )
                benchmark( option );
            break;
        case 3:
            goto EXIT;
            break;
        default:
            printf( "Invalid option\n" );
        }

    }

 EXIT:
    printf( "\n" );

    return 0;
}

#endif

@user1222021 2012-10-10 16:12:38

I like very much your plug-in, polymorphic approach, as well as the switch to build as a reusable library or stand-alone, test executable. Very well thought =)

@herohuyongtao 2014-01-14 12:53:05

This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}

@Adrian Mole 2019-08-23 13:05:48

This is essentially Brian Kernighan's (remember him?) algorithm, with the minor change that he used the more succinct n &= (n-1) form.

@abelenky 2015-02-12 04:19:05

int countBits(int x)
{
    int n = 0;
    if (x) do n++;
           while(x=x&(x-1));
    return n;
}   

Or also:

int countBits(int x) { return (x)? 1+countBits(x&(x-1)): 0; }

@vidit 2013-04-12 19:14:12

I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

It works because you can count the total number of set bits by dividing in two halves, counting the number of set bits in both halves and then adding them up. Also know as Divide and Conquer paradigm. Let's get into detail..

v = v - ((v >> 1) & 0x55555555); 

The number of bits in two bits can be 0b00, 0b01 or 0b10. Lets try to work this out on 2 bits..

 ---------------------------------------------
 |   v    |   (v >> 1) & 0b0101   |  v - x   |
 ---------------------------------------------
   0b00           0b00               0b00   
   0b01           0b00               0b01     
   0b10           0b01               0b01
   0b11           0b01               0b10

This is what was required: the last column shows the count of set bits in every two bit pair. If the two bit number is >= 2 (0b10) then and produces 0b01, else it produces 0b00.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333); 

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits, now we sum up that count in every 4 bits.

v & 0b00110011         //masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know—because of our previous operations—that every nibble has the count of set bits in it. Let's look an example. Suppose we have the byte 0b01000010. It means the first nibble has its 4bits set and the second one has its 2bits set. Now we add those nibbles together.

0b01000010 + 0b01000000

It gives us the count of set bits in a byte, in the first nibble 0b01100010 and therefore we mask the last four bytes of all the bytes in the number (discarding them).

0b01100010 & 0xF0 = 0b01100000

Now every byte has the count of set bits in it. We need to add them up all together. The trick is to multiply the result by 0b10101010 which has an interesting property. If our number has four bytes, A B C D, it will result in a new number with these bytes A+B+C+D B+C+D C+D D. A 4 byte number can have maximum of 32 bits set, which can be represented as 0b00100000.

All we need now is the first byte which has the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32 bit words but can be easily modified for 64 bit words.

@chux - Reinstate Monica 2013-10-15 15:40:06

What is the c = about? Looks like is should be eliminated. Further, suggest an extra paren set A"(((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24" to avoid some classic warnings.

@chux - Reinstate Monica 2013-10-15 15:49:05

An important feature is that this 32-bit routine works for both popcount(int v) and popcount(unsigned v). For portability, consider popcount(uint32_t v), etc. Really like the *0x1010101 part.

@v.oddou 2015-03-31 01:34:39

sauce ? (book, link, invetors's names etc) would be VERY welcomed. Because then we can paste that in our codebases with a comment to where it comes from.

@emem 2016-02-06 09:02:10

I think for better clarity the last line should be written as: return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24; so we don't need to count letters to see what you are actually doing (since you discarded the first 0, I accidentally thought you used the wrong (flipped) bit pattern as mask - that is until I noted there are only 7 letters and not 8).

@George Koehler 2018-01-03 03:05:29

That multiplication by 0x01010101 might be slow, depending on processor. For example, in my old PowerBook G4, 1 multiplication was about as slow as 4 additions (not as bad as division, where 1 division was about as slow as 23 additions).

@Nikhil Katre 2015-01-09 02:40:46

I am giving two algorithms to answer the question,

  package countSetBitsInAnInteger;

    import java.util.Scanner;

    public class UsingLoop {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        try{
        System.out.println("Enter a integer number to check for set bits in it");
        int n = in.nextInt();
        System.out.println("Using while loop, we get the number of set bits as: "+usingLoop(n));
        System.out.println("Using Brain Kernighan's Algorithm, we get the number of set bits as: "+usingBrainKernighan(n));
        System.out.println("Using ");
        }
        finally{
        in.close();
        }
    }
    private static int usingBrainKernighan(int n) {
        int count = 0;
        while(n>0){
            n&=(n-1);
            count++;
        }
        return count;
    }/*
    Analysis:
        Time complexity = O(lgn)
        Space complexity = O(1)
    */
    private static int usingLoop(int n) {
        int count = 0;
        for(int i=0;i<32;i++){
            if((n&(1<<i))!=0)
                count++;
        }
        return count;
    }
    /*
    Analysis:
        Time Complexity = O(32) // Maybe the complexity is O(lgn)
        Space Complexity = O(1)
    */
    }

@Noether 2008-09-20 19:14:39

If you happen to be using Java, the built-in method Integer.bitCount will do that.

@Vallabh Patade 2013-04-12 10:19:08

When sun provided different APIs, it must be using some logic on background, right?

@Marco Bolis 2015-01-05 16:37:39

As a side note, Java's implementation uses the same algorithm pointed out by Kevin Little.

@divillysausages 2017-05-04 10:08:42

Implementation aside, this is probably the clearest message of intent for developers maintaining your code after you (or when you come back to it 6 months later)

@Mufaddal Kagda 2014-06-07 16:41:10

int bitcount(unsigned int n)
{ 
      int count=0;
      while(n)
      {
           count += n & 0x1u;
           n >>= 1;
      }
      return  count;
 }

Iterated 'count' runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful, if 1'S or the set bits are sparse and among the least significant bits.

@Michael Dorfman 2008-09-23 09:20:15

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.

@dadhi 2014-01-30 11:32:25

Fast C# solution using pre-calculated table of Byte bit counts with branching on input size.

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS = 
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}

@user924272 2016-01-10 23:23:18

Ironically, that table could have been created by any of the algorithms posted in this thread! Nevertheless, using tables like this means constant-time performance. Going one step further and creating a 64K translation table would therefore halve the AND, SHIFT and ADD operations necessary. An interesting subject for bit manipulators!

@greggo 2017-03-28 18:42:29

Bigger tables can be slower (and not constant-time) due to cache issues. You can 'look up' 3 bits at a time with (0xe994 >>(k*2))&3,without memory access...

@Stefan 2013-11-26 23:26:10

Here is a solution that has not been mentioned so far, using bitfields. The following program counts the set bits in an array of 100000000 16-bit integers using 4 different methods. Timing results are given in parentheses (on MacOSX, with gcc -O3):

#include <stdio.h>
#include <stdlib.h>

#define LENGTH 100000000

typedef struct {
    unsigned char bit0 : 1;
    unsigned char bit1 : 1;
    unsigned char bit2 : 1;
    unsigned char bit3 : 1;
    unsigned char bit4 : 1;
    unsigned char bit5 : 1;
    unsigned char bit6 : 1;
    unsigned char bit7 : 1;
} bits;

unsigned char sum_bits(const unsigned char x) {
    const bits *b = (const bits*) &x;
    return b->bit0 + b->bit1 + b->bit2 + b->bit3 \
         + b->bit4 + b->bit5 + b->bit6 + b->bit7;
}

int NumberOfSetBits(int i) {
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

#define out(s) \
    printf("bits set: %lu\nbits counted: %lu\n", 8*LENGTH*sizeof(short)*3/4, s);

int main(int argc, char **argv) {
    unsigned long i, s;
    unsigned short *x = malloc(LENGTH*sizeof(short));
    unsigned char lut[65536], *p;
    unsigned short *ps;
    int *pi;

    /* set 3/4 of the bits */
    for (i=0; i<LENGTH; ++i)
        x[i] = 0xFFF0;

    /* sum_bits (1.772s) */
    for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=sum_bits(*p++));
    out(s);

    /* NumberOfSetBits (0.404s) */
    for (i=LENGTH*sizeof(short)/sizeof(int), pi=(int*)x, s=0; i--; s+=NumberOfSetBits(*pi++));
    out(s);

    /* populate lookup table */
    for (i=0, p=(unsigned char*) &i; i<sizeof(lut); ++i)
        lut[i] = sum_bits(p[0]) + sum_bits(p[1]);

    /* 256-bytes lookup table (0.317s) */
    for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=lut[*p++]);
    out(s);

    /* 65536-bytes lookup table (0.250s) */
    for (i=LENGTH, ps=x, s=0; i--; s+=lut[*ps++]);
    out(s);

    free(x);
    return 0;
}

While the bitfield version is very readable, the timing results show that it is over 4x slower than NumberOfSetBits(). The lookup-table based implementations are still quite a bit faster, in particular with a 65 kB table.

@Kevin Cox 2014-02-10 02:21:57

Note that this is a microbenchmark and should be taken with a grain of salt. For example, setting up the lookup table just before using it is priming the cache in a way that may be rare in real code.

Related Questions

Sponsored Content

50 Answered Questions

9 Answered Questions

45 Answered Questions

29 Answered Questions

[SOLVED] How do I create a URL shortener?

  • 2009-04-12 16:29:15
  • caw
  • 246817 View
  • 649 Score
  • 29 Answer
  • Tags:   algorithm url

28 Answered Questions

[SOLVED] How do you set, clear, and toggle a single bit?

7 Answered Questions

[SOLVED] Ukkonen's suffix tree algorithm in plain English

36 Answered Questions

[SOLVED] How to pair socks from a pile efficiently?

38 Answered Questions

[SOLVED] Find an integer not among four billion given ones

24 Answered Questions

[SOLVED] Image Processing: Algorithm Improvement for 'Coca-Cola Can' Recognition

9 Answered Questions

[SOLVED] How to find time complexity of an algorithm

Sponsored Content