# SIMD notes
Most of the basic functionality I want is available as of SSE3, which was available on both Intel and AMD CPUs by 2005-2006. Assume any vectorized code has at least SSE3.
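Here is an example of what the SSE3 baseline provides: a complex multiply built on two SSE3 additions, `movddup` and `addsubpd`. The notes don't say which operations count as "basic functionality", so complex arithmetic on `{re, im}` pairs of doubles is an assumption, and the name `cmul_sse3` is hypothetical. This is a minimal sketch for GCC or Clang on x86-64. The `__attribute__((target("sse3")))` extension lets it compile without `-msse3`.

```c
#include <pmmintrin.h> /* SSE3 intrinsics (also pulls in SSE/SSE2) */

/* Hypothetical helper: out = x * y, where each complex number is stored
   as {re, im}. (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
__attribute__((target("sse3")))
void cmul_sse3(const double x[2], const double y[2], double out[2])
{
    __m128d vx = _mm_loadu_pd(x);           /* [a, b]                     */
    __m128d vy = _mm_loadu_pd(y);           /* [c, d]                     */
    __m128d re = _mm_movedup_pd(vy);        /* [c, c]  (SSE3 movddup)     */
    __m128d im = _mm_unpackhi_pd(vy, vy);   /* [d, d]                     */
    __m128d sw = _mm_shuffle_pd(vx, vx, 1); /* [b, a]                     */
    __m128d t1 = _mm_mul_pd(vx, re);        /* [ac, bc]                   */
    __m128d t2 = _mm_mul_pd(sw, im);        /* [bd, ad]                   */
    /* SSE3 addsubpd: subtracts in the low element, adds in the high one */
    _mm_storeu_pd(out, _mm_addsub_pd(t1, t2)); /* [ac - bd, bc + ad]     */
}
```

With x = 1 + 2i and y = 3 + 4i, `out` becomes `{-5.0, 10.0}`, which is -5 + 10i.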
## SSSE3

Nothing handy for me.
## SSE4.1

Has a dot product (`dpps`/`dppd`) and streaming loads (`movntdqa`). Both may be handy.
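Below is a sketch of the SSE4.1 dot product (`dppd`, exposed as `_mm_dp_pd`) applied to two arrays of doubles. The function name `dot_sse41` and the pairwise loop structure are illustrative assumptions. The sketch targets GCC or Clang on x86-64 and needs a CPU with SSE4.1.

```c
#include <stddef.h>
#include <smmintrin.h> /* SSE4.1 intrinsics */

/* Hypothetical helper: returns sum(a[i] * b[i]) for i < n. */
__attribute__((target("sse4.1")))
double dot_sse41(const double *a, const double *b, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        /* imm 0x31: bits 4-5 multiply both elements, bit 0 writes the
           sum to element 0 only (element 1 becomes 0) */
        acc = _mm_add_sd(acc, _mm_dp_pd(va, vb, 0x31));
    }
    double sum = _mm_cvtsd_f64(acc);
    for (; i < n; ++i)          /* scalar tail when n is odd */
        sum += a[i] * b[i];
    return sum;
}
```

`dppd` is convenient for short, fixed-size vectors. For long arrays, a loop that accumulates `_mm_mul_pd` products and does one horizontal add at the end is usually faster, because `dppd` decodes to several uops.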
## SSE4.2

Nothing handy here (it mostly adds string-comparison and CRC32 instructions).
## AVX

In XMM (128-bit) land, AVX just lifts all the SSEn instructions to VEX encodings with non-destructive three-operand forms.
In YMM (256-bit) land, AVX has pretty much everything I need. The exception is a shuffle across the two 128-bit lanes as a single op: that takes a combination of `vperm2f128` and `vpermilpd`.
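The following sketch shows the missing single-op cross-lane shuffle in practice. It reverses four doubles using only AVX. One `vperm2f128` (`_mm256_permute2f128_pd`) swaps the 128-bit lanes, and one `vpermilpd` (`_mm256_permute_pd`) swaps the pair inside each lane. Reversal is just an example permutation chosen here, and the name `reverse4_avx` is hypothetical. The code assumes GCC or Clang on x86-64. The caller must confirm AVX support first, for example with `__builtin_cpu_supports("avx")`.

```c
#include <immintrin.h> /* AVX intrinsics */

/* Hypothetical helper: out = [x3 x2 x1 x0] for in = [x0 x1 x2 x3].
   Two shuffles are needed because no single AVX instruction permutes
   doubles freely across the two 128-bit lanes. */
__attribute__((target("avx")))
void reverse4_avx(const double in[4], double out[4])
{
    __m256d v = _mm256_loadu_pd(in);                      /* [x0 x1 | x2 x3] */
    /* vperm2f128, imm 0x01: low lane <- high lane, high lane <- low lane */
    __m256d swapped = _mm256_permute2f128_pd(v, v, 0x01); /* [x2 x3 | x0 x1] */
    /* vpermilpd, imm 0x5 (binary 0101): swap the pair inside each lane */
    __m256d r = _mm256_permute_pd(swapped, 0x5);          /* [x3 x2 | x1 x0] */
    _mm256_storeu_pd(out, r);
}
```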
## AVX2

Has gather (e.g. strided reads) and a cross-lane shuffle as a single op (`vpermpd`).
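This sketch covers both AVX2 features. It does a strided read with `vgatherdpd` (`_mm256_i32gather_pd`) and reverses four doubles with a single `vpermpd` (`_mm256_permute4x64_pd`). Both function names are hypothetical. The code assumes GCC or Clang on x86-64 and a CPU with AVX2, which callers can check with `__builtin_cpu_supports("avx2")`.

```c
#include <immintrin.h> /* AVX2 intrinsics */

/* Hypothetical helper: out[i] = base[i * stride] for i = 0..3, as one
   vgatherdpd. The indices are element offsets; scale 8 (= sizeof(double))
   turns them into byte offsets. */
__attribute__((target("avx2")))
void gather4_avx2(const double *base, int stride, double out[4])
{
    __m128i idx = _mm_setr_epi32(0, stride, 2 * stride, 3 * stride);
    _mm256_storeu_pd(out, _mm256_i32gather_pd(base, idx, 8));
}

/* Hypothetical helper: reverse [x0 x1 x2 x3] with one cross-lane vpermpd.
   imm 0x1B (binary 00 01 10 11) selects source elements 3, 2, 1, 0
   for destination elements 0, 1, 2, 3. */
__attribute__((target("avx2")))
void reverse4_avx2(const double in[4], double out[4])
{
    __m256d v = _mm256_loadu_pd(in);
    _mm256_storeu_pd(out, _mm256_permute4x64_pd(v, 0x1B));
}
```

For a row-major matrix `m` with 3 columns, `gather4_avx2(m, 3, col)` reads the first four entries of its first column. A gather is not guaranteed to beat four scalar loads on every microarchitecture, so benchmark it before relying on it.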
For now, let's ignore non-temporal loads and stores.