[SOLVED] Given a 32 bit number, what is an efficient way to scale each byte by a certain factor?

Given a uint32 number 0x12345678 (e.g., an RGBW color value), how could I efficiently and dynamically scale each byte in it, given a scaling factor `0 <= f <= 1` (or an equivalent integer divisor)?

I know I could do this a longer way (break the number into its components, perhaps via a struct, and loop to manipulate each in turn), but is there a way to do it faster, without looping? (Static value mapping would be another way, but a dynamic method is preferable.)
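For reference, the longer way might look something like this (a sketch; it assumes an integer `scale` in 0..256 standing in for `f * 256`, and uses `memcpy` rather than a struct to stay portable):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the per-byte loop approach: copy out the bytes, scale each
 * in turn, and reassemble. `scale` is a hypothetical 0..256 fixed-point
 * factor, i.e. f * 256. */
static uint32_t scale_bytes_loop(uint32_t value, uint16_t scale)
{
    uint8_t b[4];
    memcpy(b, &value, sizeof b);          /* avoids strict-aliasing issues */
    for (int i = 0; i < 4; i++)
        b[i] = (uint8_t)((b[i] * scale) >> 8);
    memcpy(&value, b, sizeof b);
    return value;
}
```

Because the same operation is applied to every byte, the result is independent of endianness.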

Edit: C++ (C ideas are interesting as well), embedded, hundreds or thousands of pixels (not millions). Specifically, scaling RGBW LEDs.

One other thing that came up - this is with gcc, so type punning is allowed (I already use it for similar things - I just wanted to see if there's a better way than that).

Edit again: This is for embedded platforms (microcontrollers). While I'm all for answers that help a wider audience, I purposely asked about this in the context of the language(s) and algorithms rather than optimizations for specific platforms and instruction sets, as platform-specific optimizations can vary if present at all.

@Edgar Bonet 2019-07-09 10:37:17

It has been mentioned in the discussion that the optimal solution can be architecture specific. Someone also suggested to code it in assembly. Assembly has a cost in terms of portability, but it also raises the question of whether (and by how much) you can beat the compiler's optimizer.

I did an experiment on an Arduino, which is based on an AVR microcontroller. This is a very limited 8-bit, Harvard, RISC MCU, with an 8 × 8 → 16-bit hardware multiplier.

Here is the straightforward implementation, using type-punning to multiply the individual bytes:

``````static inline uint32_t scale_pixel(uint32_t rgbw, uint16_t scale)
{
    union {
        uint32_t value;
        uint8_t bytes[4];
    } x = { .value = rgbw };
    x.bytes[0] = x.bytes[0] * scale >> 8;
    x.bytes[1] = x.bytes[1] * scale >> 8;
    x.bytes[2] = x.bytes[2] * scale >> 8;
    x.bytes[3] = x.bytes[3] * scale >> 8;
    return x.value;
}
``````

Compiled with gcc at `-Os` (typical in these memory-constrained devices) this takes 28 CPU cycles to execute, i.e. 7 cycles per byte. The compiler is smart enough to allocate `rgbw` and `x` to the same CPU registers and thus avoid a copy.

Here is the version based on harold's answer:

``````static inline uint32_t scale_pixel(uint32_t rgbw, uint16_t scale)
{
    uint32_t rb = rgbw & 0x00FF00FF;
    uint32_t gw = (rgbw >> 8) & 0x00FF00FF;
    rb *= scale;
    gw *= scale;
    uint32_t out = ((rb >> 8) & 0x00FF00FF) | (gw & 0xFF00FF00);
    return out;
}
``````

This is a very smart optimization that is likely to pay off on a 32-bit MCU. On this small 8-bitter, however, it took 176 CPU cycles to execute! The generated assembly features two calls to a library function that implements a full 32-bit multiplication, along with a lot of register moving and clearing.

Lastly, here is my inline assembly version:

``````static inline uint32_t scale_pixel(uint32_t rgbw, uint16_t scale)
{
    asm(
        "tst %B[scale]           \n\t"  // test high byte of scale
        "brne 0f                 \n\t"  // if non-zero, we are done
        "mul %A[rgbw], %A[scale] \n\t"  // multiply LSB
        "mov %A[rgbw], r1        \n\t"  // move result into place
        "mul %B[rgbw], %A[scale] \n\t"  // same with the three other bytes
        "mov %B[rgbw], r1        \n\t"  // ...
        "mul %C[rgbw], %A[scale] \n\t"
        "mov %C[rgbw], r1        \n\t"
        "mul %D[rgbw], %A[scale] \n\t"
        "mov %D[rgbw], r1        \n"
        "0:"
        : [rgbw] "+r" (rgbw)   // output
        : [scale] "r" (scale)  // input
        : "r0", "r1"           // clobbers
    );
    return rgbw;
}
``````

This one uses the fact that the scale factor can be no larger than 256. In fact, any factor greater than 256 is treated as 256, which could be considered a feature. The execution takes 14 cycles, and only 3 cycles if the scale is 256.

Summary:

• 176 cycles for the version optimized for a 32-bit core
• 28 cycles for the naive type-punning version
• 14 cycles for the assembly version

My conclusion from this experiment is that you are looking at the kind of micro-optimization where the architecture really matters. You can't seriously try to optimize this at the C level without any assumption about the architecture it will run on. Also, if a factor of 2 in speed matters to you, it is worth trying an implementation in assembly. Use conditional compilation to enable the asm implementation on the targeted architecture, and fall back to a generic C implementation on any other architecture.
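As a concrete sketch of that fallback pattern (the AVR branch reuses the inline asm above; the generic branch is a loop form of the type-punning version):

```c
#include <stdint.h>

/* Sketch: select the AVR asm implementation at compile time and fall
 * back to portable C elsewhere. */
static inline uint32_t scale_pixel(uint32_t rgbw, uint16_t scale)
{
#if defined(__AVR__)
    asm(
        "tst %B[scale]           \n\t"  // scale >= 256: leave rgbw unchanged
        "brne 0f                 \n\t"
        "mul %A[rgbw], %A[scale] \n\t"  // per-byte 8x8 -> 16 multiply,
        "mov %A[rgbw], r1        \n\t"  // keep the high byte (in r1)
        "mul %B[rgbw], %A[scale] \n\t"
        "mov %B[rgbw], r1        \n\t"
        "mul %C[rgbw], %A[scale] \n\t"
        "mov %C[rgbw], r1        \n\t"
        "mul %D[rgbw], %A[scale] \n\t"
        "mov %D[rgbw], r1        \n"
        "0:"
        : [rgbw] "+r" (rgbw)
        : [scale] "r" (scale)
        : "r0", "r1"
    );
    return rgbw;
#else
    union { uint32_t value; uint8_t bytes[4]; } x = { .value = rgbw };
    for (int i = 0; i < 4; i++)
        x.bytes[i] = (uint8_t)((x.bytes[i] * scale) >> 8);
    return x.value;
#endif
}
```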

@caf 2019-07-08 14:15:41

You can directly calculate the power-of-two fractions of the input values with shifts and masks:

``````unsigned long src_2 = ((src >> 1) & 0x7f7f7f7fUL) + (src & 0x01010101UL);
unsigned long src_4 = ((src >> 2) & 0x3f3f3f3fUL) + ((src >> 1) & 0x01010101UL);
unsigned long src_8 = ((src >> 3) & 0x1f1f1f1fUL) + ((src >> 2) & 0x01010101UL);
unsigned long src_16 = ((src >> 4) & 0x0f0f0f0fUL) + ((src >> 3) & 0x01010101UL);
unsigned long src_32 = ((src >> 5) & 0x07070707UL) + ((src >> 4) & 0x01010101UL);
unsigned long src_64 = ((src >> 6) & 0x03030303UL) + ((src >> 5) & 0x01010101UL);
unsigned long src_128 = ((src >> 7) & 0x01010101UL) + ((src >> 6) & 0x01010101UL);
unsigned long src_256 = ((src >> 7) & 0x01010101UL);
``````

(Here `src_2` is `src` with each field individually divided by 2, `src_4` is `src` with each field individually divided by 4 and so on).

Any of the other fractions from 0/256 through 255/256 can be made by optionally adding each of these values (e.g., 0.75 is `src_2 + src_4`). This might be useful if your embedded system does not have a fast multiplier (you can precalculate the necessary masks from the scaling factor once before processing all pixels), or if you only really need a limited set of scaling factors (you can just hardcode the combinations of power-of-two fractions you need into a set of specialised scaling functions).

For example a specialised scale-by-0.75 function in its inner loop would just do:

``````dest = ((src >> 1) & 0x7f7f7f7fUL) + (src & 0x01010101UL) +
       ((src >> 2) & 0x3f3f3f3fUL) + ((src >> 1) & 0x01010101UL);
``````

Although not applicable to your use case, this method can also be used to precalculate masks that apply different scaling factors to each component of the vector.
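The general combination could be sketched as follows (illustrative only, not from the answer): walk the bits of an 8-bit factor `n` (meaning `n/256`) and accumulate the matching power-of-two fractions, rounding terms included. Since each per-byte term is at most `ceil(255/2^k)`, the per-byte sum stays below 256, so the lanes cannot overflow into each other:

```c
#include <stdint.h>

/* Illustrative generalization: scale each byte of src by n/256, for
 * n in 0..255, by summing the power-of-two fractions selected by the
 * bits of n (the src_2, src_4, ... terms above). */
static uint32_t scale_by_fraction(uint32_t src, unsigned n)
{
    uint32_t acc = 0;
    for (unsigned k = 1; k <= 8; k++) {
        if (n & (256u >> k)) {                            /* bit for the 1/2^k term */
            uint32_t mask = 0x01010101UL * (0xFFu >> k);  /* 0x7f7f7f7f, 0x3f3f3f3f, ... */
            acc += ((src >> k) & mask)                    /* the fraction itself */
                 + ((src >> (k - 1)) & 0x01010101UL);     /* per-byte rounding bit */
        }
    }
    return acc;
}
```

In practice you would precompute the selected masks once per scaling factor rather than branching per pixel, as the answer suggests.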

@harold 2019-07-08 07:48:10

The number of multiplications can be reduced by using each multiplication more effectively, on more "full" bits at once, not wasting as many bits on emptiness. Some padding bits are still needed to ensure that the product for one channel doesn't corrupt the result for another channel. Using an 8-bit fixed-point scale, and since there are 8 bits per channel, the output is 16 bits per channel, so two of them fit into the `uint32_t` side by side; that needs 8 bits of padding. So R and B (with 8 zeroes between them) can be scaled together with one multiplication, same for G and W. The result is the high 8 bits of the 16-bit result per channel. So something like this (not tested):

``````uint32_t RB = RGBW & 0x00FF00FF;
uint32_t GW = (RGBW >> 8) & 0x00FF00FF;
RB *= scale;
GW *= scale;
uint32_t out = ((RB >> 8) & 0x00FF00FF) | (GW & 0xFF00FF00);
``````

The `scale` is a number from 0..256, interpreted as 0..1 in steps of 1/256. So `scale = 128` corresponds to halving the channel values, and so on.
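If you start from a floating-point factor `f` in [0, 1], the corresponding `scale` is just `f * 256`, rounded. A sketch (the helper name is made up; this would run once, outside any per-pixel loop):

```c
#include <stdint.h>

/* Illustrative helper: convert a factor f in [0, 1] to the 0..256
 * fixed-point scale used above. */
static uint16_t scale_from_factor(float f)
{
    return (uint16_t)(f * 256.0f + 0.5f);  /* round to nearest */
}
```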

It's possible to add a rounding step, just by adding a suitable bias after multiplying.
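A sketch of that rounding step (illustrative; it adds 128 to each 16-bit lane after the multiply, which rounds the kept high byte to nearest and cannot overflow a lane, since 255 × 256 + 128 < 2^16):

```c
#include <stdint.h>

/* The two-multiply scaling with a rounding bias added after each
 * multiply. The function name is illustrative. scale is 0..256. */
static uint32_t scale_pixel_rounded(uint32_t rgbw, uint16_t scale)
{
    uint32_t rb = rgbw & 0x00FF00FFu;
    uint32_t gw = (rgbw >> 8) & 0x00FF00FFu;
    rb = rb * scale + 0x00800080u;  /* +128 per 16-bit lane: round to nearest */
    gw = gw * scale + 0x00800080u;
    return ((rb >> 8) & 0x00FF00FFu) | (gw & 0xFF00FF00u);
}
```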

Each multiplication produces a 16-bit product per channel; only the high byte of each product is used, and the low bytes are discarded.

Here is a quick-bench comparing various scaling methods, from Timo in the comments.

@Scheff 2019-07-08 07:50:44

...and even without type-punning... ;-)

@4386427 2019-07-08 08:00:37

Seems to me that there is something wrong with the calculation of the final result. Shifting RB will destroy the "result" bits. Further, GW was shifted when initialized but not shifted when calculating the result.

@harold 2019-07-08 08:01:32

@4386427 that's supposed to happen, the result bits are the high 8 bits of the 16 bit results, so in a way they implicitly got shifted left by 8. E: I'll sketch a diagram maybe that'll help

@4386427 2019-07-08 08:03:02

@harold hmm... how can the result be in the 8 high bits when the scaling factor is between zero and 1?

@harold 2019-07-08 08:06:27

@4386427 because it's an integer between 0 and 255 now (well or 256 I suppose for "scale by 1")

@Timo 2019-07-08 08:16:44

If the scale is 0..1 shouldn't your method do a division instead of a multiplication? To quote OP: "given a scaling factor 0 <= f <= 1 (or an equivalent integer divisor)"

@harold 2019-07-08 08:23:01

@Timo I took that as not knowing about this option (so OP tried to not limit the options, but accidentally did so anyway, depending on the interpretation), but scale is just `f * 256` so it's not any harder to use than if there was a divisor `1 / f`

@4386427 2019-07-08 08:27:07

@harold okay, I see.... I think the answer would be better if it included the calculation of `scale` from `f`. At least I got confused as I thought `scale` was in range [0..1].

@Timo 2019-07-08 08:35:48

I see now, didn't notice you were using the MSB of the result instead of the LSB. Here is a little comparison of different methods. Your algorithm (scale6) wins by far (13 instructions on clang) even without any SIMD.

@harold 2019-07-08 09:02:33

@Timo it doesn't work right. Using `dump` like that is not sufficient to force re-evaluation in a loop. Even with `benchmark::DoNotOptimize`, what ends up happening is that the result is computed once, and then stored to a variable a million times. Not sure what to do here. Though in any case, OP mentioned this is embedded, so benchmarks on x64 may not be super relevant.

@harold 2019-07-08 09:07:10

@Timo ok how's this? (I put the `factor * 256` outside the loop which was my intent)

@joH1 2019-07-08 09:07:29

@Timo adding `volatile` qualifiers to the `dump` variables in the benchmark functions, like here, drastically changes the execution time, so you're definitely right about the function being evaluated only once. The result is that all 6 benchmarks perform equivalently now

@harold 2019-07-08 09:22:16

@MaxLanghof ok thanks, so that's how that `DoNotOptimize` is supposed to be used then. I see it's re-evaluating the `f * 256` too now, so which option is a more accurate reflection of reality depends on how OP is actually using this, with the same scale for all pixels vs a changing scale

@supercat 2019-07-08 20:05:22

@Scheff: Type punning could improve performance if it were necessary to multiply pairs of pixels by the same value on a 64-bit processor.

@caf 2019-07-09 01:32:38

You could do a single multiplication by expanding the input into one `uint64_t`, if the target has native 64 bit arithmetic that could be an improvement (or equally it might not - from what I can tell that would require an additional shift).
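That expansion could be sketched like this (illustrative only; it assumes `scale` in 0..256, and relies on each 16-bit lane not overflowing into the next, since 255 × 256 < 2^16):

```c
#include <stdint.h>

/* Sketch of the single-multiply idea: spread the four bytes into a
 * uint64_t with 8 bits of padding between them, multiply once, then
 * repack the high byte of each 16-bit lane. Name is illustrative. */
static uint32_t scale_pixel64(uint32_t rgbw, uint16_t scale)
{
    uint64_t v = (rgbw & 0xFFu)
               | ((uint64_t)(rgbw & 0xFF00u) << 8)
               | ((uint64_t)(rgbw & 0xFF0000u) << 16)
               | ((uint64_t)(rgbw & 0xFF000000u) << 24);
    v *= scale;  /* one multiplication for all four channels */
    v >>= 8;     /* result byte of each lane moves to the lane bottom */
    return (uint32_t)((v & 0xFFu)
                    | ((v >> 8) & 0xFF00u)
                    | ((v >> 16) & 0xFF0000u)
                    | ((v >> 24) & 0xFF000000u));
}
```

As the comment notes, whether the extra packing and unpacking shifts beat two 32-bit multiplies depends on the target.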

@MartyMacGyver 2019-07-09 04:22:43

To be clear, as @harold and others understood, this means transforming the original range [0-255] to some smaller integer range starting at zero (it's easy enough to offset if needed). So, a factor of 50%, 0.5, 1/2, 128/256 or such scales [0-255] to [0-127]. Trivial scalings are easy - I was curious if there's a more general algorithm and as this is embedded, processor- and compiler-specific hacks wouldn't be very portable (gcc punning being a handy exception: it seems to be present on all the platforms I'd use). Native 64-bit processing on such platforms is uncommon.