Given a vector of three (or four) floats, what is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 worth it?
What's the cost of moving to the FPU, then faddp, faddp? What's the fastest specific instruction sequence?
"Try to arrange things so you can sum four vectors at a time" will not be accepted as an answer. :-) For example, when summing an array you can use multiple vector accumulators for vertical sums (to hide addps latency) and reduce down to one after the loop, but then you still need to horizontally sum that last vector.
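
To make the array case concrete, here is a minimal sketch in C with SSE1 intrinsics: two vector accumulators in the loop, then one horizontal sum of the last vector. The names `hsum_ps_sse1` and `sum_array`, the choice of two accumulators, and the particular shuffle/add sequence are assumptions for illustration only. This is one plausible sequence, not a claim about which is fastest.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Horizontal sum of the four floats in v, using only SSE1 shuffles and adds.
   Illustrative sequence; element order below is listed low to high. */
static float hsum_ps_sse1(__m128 v)
{
    /* shuf = [v1, v0, v3, v2]: swap the two elements within each pair */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = [v0+v1, v0+v1, v2+v3, v2+v3] */
    __m128 sums = _mm_add_ps(v, shuf);
    /* move the high half of sums down, so the low element is v2+v3 */
    shuf = _mm_movehl_ps(shuf, sums);
    /* scalar add: low element becomes (v0+v1) + (v2+v3) */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

/* Sum n floats with two independent vector accumulators (vertical sums),
   so consecutive addps instructions do not all wait on one dependency chain. */
float sum_array(const float *a, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(a + i));      /* unaligned loads */
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(a + i + 4));
    }

    /* Reduce the accumulators to one vector, then sum it horizontally. */
    float total = hsum_ps_sse1(_mm_add_ps(acc0, acc1));

    /* Scalar tail for the remaining 0..7 elements. */
    for (; i < n; ++i)
        total += a[i];
    return total;
}
```

The loop part is cheap per element; the question is about the cost of the `hsum_ps_sse1` step, which is the part that cannot be amortized when you only have a single vector. Note that the vectorized sum adds elements in a different order than a plain scalar loop, so results can differ in the last bits for inputs that are not exactly representable.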