You might have seen that .NET 8 will offer AVX-512 (Advanced Vector Extensions 512) support for high-performance vector multiplication, such as it occurs in image processing or AI applications.
If you are a .NET developer, you might have asked yourself whether it is really worth using AVX features for vector multiplication to improve your application's performance.
In this post I compare the use of a 128-bit AVX instruction from System.Runtime.Intrinsics with a manual implementation of vector multiplication. In my example on an Intel Xeon, the intrinsic translates to the vpmulld instruction in native code, which clearly beats the classical approach: it is more than 20 times faster in execution.
Here is a short example of how to use these intrinsics in your own code:
using System.Runtime.Intrinsics;
var s1 = new ReadOnlySpan<int>(new[] { 1, 0, 1, 0 });
var s2 = new ReadOnlySpan<int>(new[] { 2, 1, 2, 1 });
var v1 = Vector128.Create(s1);
var v2 = Vector128.Create(s2);
var result1 = v1 * v2; // element-wise product: <2, 0, 2, 0>
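The snippet above multiplies exactly four elements. Real workloads usually process longer arrays, so here is a minimal sketch of how the same intrinsic can be applied in a loop, with a scalar tail for leftover elements. The helper name MultiplyElementwise is my own, and the sketch assumes .NET 7 or later, where the Vector128 span overloads and operators are available:

```csharp
using System;
using System.Runtime.Intrinsics;

static int[] MultiplyElementwise(ReadOnlySpan<int> a, ReadOnlySpan<int> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Inputs must have the same length.");

    var result = new int[a.Length];
    int i = 0;

    // Multiply four 32-bit integers per iteration (Vector128<int>.Count == 4).
    for (; i <= a.Length - Vector128<int>.Count; i += Vector128<int>.Count)
    {
        var product = Vector128.Create(a.Slice(i)) * Vector128.Create(b.Slice(i));
        product.CopyTo(result.AsSpan(i));
    }

    // Scalar tail for lengths that are not a multiple of four.
    for (; i < a.Length; i++)
        result[i] = a[i] * b[i];

    return result;
}

var r = MultiplyElementwise(new[] { 1, 0, 1, 0, 3 }, new[] { 2, 1, 2, 1, 3 });
Console.WriteLine(string.Join(", ", r)); // 2, 0, 2, 0, 9
```

The JIT compiles the loop body down to the same vpmulld instruction we will see below, so the per-element cost stays low for long inputs.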
You might think: OK, that looks more complicated than just writing the scalar (non-SIMD) multiplication of the two vectors manually, as shown below.
var s1 = new ReadOnlySpan<int>(new[] { 1, 0, 1, 0 });
var s2 = new ReadOnlySpan<int>(new[] { 2, 1, 2, 1 });
var result2 = new int[] { s1[0] * s2[0], s1[1] * s2[1], s1[2] * s2[2], s1[3] * s2[3] };
Is it worth the effort?
To determine the better solution, we look at the native code generated on an Intel Xeon Platinum 8370C virtual machine. For the sake of clarity, I use the direct translation without additional JIT optimization.
The multiplication for result1 translates to the following (the actual multiplication is the vpmulld instruction):
00007FF9C9747997 vmovapd xmm0,xmmword ptr [rbp+0C0h]
00007FF9C974799F vpmulld xmm0,xmm0,xmmword ptr [rbp+0B0h]
00007FF9C97479A8 vmovapd xmmword ptr [rbp+40h],xmm0
00007FF9C97479AD vmovapd xmm0,xmmword ptr [rbp+40h]
00007FF9C97479B2 vmovapd xmmword ptr [rbp+0A0h],xmm0
The multiplication for result2 translates to the following (the multiplications are the imul instructions):
00007FF9C71E7EE8 mov rcx,7FF9C71CBF38h
00007FF9C71E7EF2 mov edx,4
... many lines ...
00007FF9C71E804D mov qword ptr [rbp+78h],rax
00007FF9C71E8051 mov ecx,dword ptr [rbp+84h]
00007FF9C71E8057 mov rdx,qword ptr [rbp+78h]
00007FF9C71E805B imul ecx,dword ptr [rdx]
00007FF9C71E805E mov dword ptr [rbp+74h],ecx
00007FF9C71E8061 mov rcx,qword ptr [rbp+0D8h]
00007FF9C71E8068 mov edx,dword ptr [rbp+90h]
00007FF9C71E806E cmp edx,dword ptr [rcx+8]
... many lines ...
00007FF9C71E80E4 lea rcx,[rcx+r8*4+10h]
00007FF9C71E80E9 mov edx,dword ptr [rbp+54h]
00007FF9C71E80EC mov dword ptr [rcx],edx
00007FF9C71E80EE mov rcx,qword ptr [rbp+0D8h]
00007FF9C71E80F5 mov qword ptr [rbp+168h],rcx
In the first translation the AVX vector multiplication happens in a single opcode (vpmulld), with 5 opcodes in total. In the second translation the vector multiplication needs 3 multiplication opcodes (imul) and 106 opcodes in total. I am leaving the exact cycle analysis aside for now, but it does not get any better for the “classical” unoptimized approach. In total, the AVX vector multiplication in my example is more than 20 times faster!
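If you want to reproduce the measurement behind the 20x figure yourself, a minimal BenchmarkDotNet sketch could look like the following. The class and method names are my own, the BenchmarkDotNet NuGet package is required, and the exact factor will vary with CPU and JIT settings:

```csharp
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class VectorMultiplyBenchmark
{
    private readonly int[] _a = { 1, 0, 1, 0 };
    private readonly int[] _b = { 2, 1, 2, 1 };

    // Element-wise product via a single 128-bit SIMD multiply.
    [Benchmark]
    public Vector128<int> Avx128() => Vector128.Create(_a) * Vector128.Create(_b);

    // Element-wise product via four scalar multiplications.
    [Benchmark(Baseline = true)]
    public int[] Scalar() =>
        new[] { _a[0] * _b[0], _a[1] * _b[1], _a[2] * _b[2], _a[3] * _b[3] };
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<VectorMultiplyBenchmark>();
}
```

BenchmarkDotNet runs the methods in a Release build with proper warm-up, which is the fair way to compare JIT-compiled code rather than timing a Debug build by hand.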
I would say that is worth the effort of using System.Runtime.Intrinsics in a compute-intensive application!
Happy coding 🚀! If you like my post follow me on LinkedIn 🔔!