You have a vector of three (or four) floats. What is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions in SSE3 (haddps) worth it? What's the cost of moving to the FPU, then faddp, faddp? What's the fastest specific instruction sequence?
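For concreteness, here is a rough sketch of the shuffle-and-add SSE variant written with intrinsics. It is only a baseline to compare against, not a claim that it is the fastest sequence. The function names `hsum_ps_sse1` and `hsum3` are made up for illustration, and the code assumes an x86 target with SSE1 available through `<xmmintrin.h>`.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only */

/* Hypothetical helper: sum the four lanes of v = [a, b, c, d]. */
static float hsum_ps_sse1(__m128 v)
{
    /* Copy the high pair over the low pair: [c, d, c, d]. */
    __m128 hi = _mm_movehl_ps(v, v);
    /* Lane 0 = a + c, lane 1 = b + d (upper lanes are not needed). */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* Broadcast lane 1 (b + d) so it can be added to lane 0. */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* Scalar add in lane 0: (a + c) + (b + d). */
    __m128 sum1 = _mm_add_ss(sum2, odd);
    /* Extract lane 0 as a plain float. */
    return _mm_cvtss_f32(sum1);
}

/* Hypothetical helper for the three-float case: the fourth lane is
   filled with zero so the same four-lane reduction can be reused. */
static float hsum3(const float *p)
{
    __m128 v = _mm_set_ps(0.0f, p[2], p[1], p[0]);
    return hsum_ps_sse1(v);
}
```

Note that the additions are grouped as (a + c) + (b + d), so the rounding can differ slightly from a left-to-right scalar sum.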
"Try to arrange things so you can sum four vectors at a time" will not be accepted as an answer. :-)