There are some operations where exposing fused multiply-add (FMA3 / FMA4) directly would be useful.

It looks like you're already using it internally to optimize some of the math functions, so it would be nice to be able to take advantage of it in higher-level functions as well.
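
For context, besides speed, the main thing a direct FMA gives you is a single rounding step. Here is a minimal sketch in plain Rust, using the standard `f64::mul_add` rather than anything from this library; the function names `mul_then_add` and `fused` are made up for illustration:

```rust
/// Multiply, then add, as two separate operations: the product is rounded
/// to f64 before the addition happens.
fn mul_then_add(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Fused multiply-add via the standard library: a * b + c is evaluated
/// with a single rounding at the very end.
fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

fn main() {
    let eps = 1.0 / (1u64 << 27) as f64; // 2^-27, exactly representable
    let a = 1.0 + eps;
    let b = 1.0 - eps;
    // The exact product is 1 - 2^-54, which is not representable as an f64.
    println!("separate: {:e}", mul_then_add(a, b, -1.0));
    println!("fused:    {:e}", fused(a, b, -1.0));
}
```

With these inputs the separate version returns 0, because the product rounds to exactly 1.0 before the subtraction, while the fused version keeps the -2^-54 residual.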

Here's one use case.