Efficient Vector / Tensor Multiplication with AVX in C# / .NET

You might have seen that .NET 8 will offer AVX-512 (Advanced Vector Extensions) support for high-performance vector multiplication, as it occurs in image processing or AI applications.

If you are a .NET developer, you might have asked yourself whether it is really worth using AVX features for vector multiplication to improve your application's performance.

In this post I compare the use of a 128-bit AVX instruction from System.Runtime.Intrinsics with a manual implementation of vector multiplication. In my example on an Intel Xeon, the AVX call translates to vpmulld in native code, which clearly outperforms the classical approach: in my measurements it executes more than 20 times faster.

Here is a short example of how to use AVX instructions in your implementation:

using System.Runtime.Intrinsics;

// Two four-element input vectors.
var s1 = new ReadOnlySpan<int>(new[] { 1, 0, 1, 0 });
var s2 = new ReadOnlySpan<int>(new[] { 2, 1, 2, 1 });

// Load each span into a 128-bit SIMD register.
var v1 = Vector128.Create(s1);
var v2 = Vector128.Create(s2);

// Element-wise multiplication in a single SIMD operation.
var result1 = v1 * v2;
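The four-element example above generalizes to longer arrays: you can process the input in chunks of Vector128<int>.Count elements and fall back to scalar code for the remainder. Here is a minimal sketch under that idea (the method name MultiplyElementwise is my own, not part of any library):

```csharp
using System;
using System.Runtime.Intrinsics;

static int[] MultiplyElementwise(ReadOnlySpan<int> a, ReadOnlySpan<int> b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Inputs must have the same length.");

    var result = new int[a.Length];
    int i = 0;

    // Process four ints at a time with 128-bit SIMD registers.
    for (; i <= a.Length - Vector128<int>.Count; i += Vector128<int>.Count)
    {
        var product = Vector128.Create(a.Slice(i)) * Vector128.Create(b.Slice(i));
        product.CopyTo(result.AsSpan(i));
    }

    // Scalar fallback for the remaining 0-3 elements.
    for (; i < a.Length; i++)
        result[i] = a[i] * b[i];

    return result;
}

var r = MultiplyElementwise(new[] { 1, 0, 1, 0, 3 }, new[] { 2, 1, 2, 1, 4 });
Console.WriteLine(string.Join(", ", r)); // 2, 0, 2, 0, 12
```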

You might think that this looks more complicated than simply programming the scalar multiplication of two vectors manually, as shown below.

var s1 = new ReadOnlySpan<int>(new[] { 1, 0, 1, 0 });
var s2 = new ReadOnlySpan<int>(new[] { 2, 1, 2, 1 });

var result2 = new int[] { s1[0] * s2[0], s1[1] * s2[1], s1[2] * s2[2], s1[3] * s2[3] };

Is it worth the effort?

In order to determine the better solution, we look at the generated native code on an Intel Xeon Platinum 8370C virtual machine. For the sake of clarity, I use the direct translation without additional optimization.
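If you want to inspect the generated code yourself, one way (assuming .NET 7 or later) is to let the JIT print its disassembly via environment variables; the method name below is a placeholder for whichever method contains your multiplication:

```shell
# Disable tiered compilation so the JIT emits its final code immediately,
# then ask it to dump the disassembly of the method of interest.
export DOTNET_TieredCompilation=0
export DOTNET_JitDisasm="MultiplyElementwise"
dotnet run -c Release
```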

The multiplication for result1 is translated to the following, with the actual multiplication performed by vpmulld:

00007FF9C9747997  vmovapd     xmm0,xmmword ptr [rbp+0C0h]  
00007FF9C974799F  vpmulld     xmm0,xmm0,xmmword ptr [rbp+0B0h]  
00007FF9C97479A8  vmovapd     xmmword ptr [rbp+40h],xmm0  
00007FF9C97479AD  vmovapd     xmm0,xmmword ptr [rbp+40h]  
00007FF9C97479B2  vmovapd     xmmword ptr [rbp+0A0h],xmm0 

The multiplication for result2 is translated to the following, with the multiplications performed by imul:

00007FF9C71E7EE8  mov         rcx,7FF9C71CBF38h  
00007FF9C71E7EF2  mov         edx,4  

... many lines ...

00007FF9C71E804D  mov         qword ptr [rbp+78h],rax  
00007FF9C71E8051  mov         ecx,dword ptr [rbp+84h]  
00007FF9C71E8057  mov         rdx,qword ptr [rbp+78h]  
00007FF9C71E805B  imul        ecx,dword ptr [rdx]  
00007FF9C71E805E  mov         dword ptr [rbp+74h],ecx  
00007FF9C71E8061  mov         rcx,qword ptr [rbp+0D8h]  
00007FF9C71E8068  mov         edx,dword ptr [rbp+90h]  
00007FF9C71E806E  cmp         edx,dword ptr [rcx+8]   

... many lines ...

00007FF9C71E80E4  lea         rcx,[rcx+r8*4+10h]  
00007FF9C71E80E9  mov         edx,dword ptr [rbp+54h]  
00007FF9C71E80EC  mov         dword ptr [rcx],edx  
00007FF9C71E80EE  mov         rcx,qword ptr [rbp+0D8h]  
00007FF9C71E80F5  mov         qword ptr [rbp+168h],rcx  

In the first translation the AVX vector multiplication takes place in a single opcode (vpmulld), with 5 opcodes in total. In the second translation the vector multiplication needs 3 multiplication opcodes (imul) and 106 opcodes in total. I am leaving the exact cycle analysis aside for now, but it does not get any better for the "classical" unoptimized approach. In total, the AVX vector multiplication in my example is more than 20 times faster!
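If you would rather measure the speedup on your own hardware than count opcodes, a rough Stopwatch micro-benchmark along these lines will do. Note this is only a sketch: the JIT may hoist work out of such loops, so a tool like BenchmarkDotNet is the more rigorous choice, and the exact factor will vary by machine.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.Intrinsics;

const int iterations = 10_000_000;
int[] a = { 1, 0, 1, 0 };
int[] b = { 2, 1, 2, 1 };

// SIMD path: one 128-bit multiply per iteration.
var v1 = Vector128.Create((ReadOnlySpan<int>)a);
var v2 = Vector128.Create((ReadOnlySpan<int>)b);
var vr = Vector128<int>.Zero;
var sw = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
    vr = v1 * v2;
sw.Stop();
Console.WriteLine($"SIMD:   {sw.ElapsedMilliseconds} ms, result {vr}");

// Scalar path: four separate multiplications per iteration.
int r0 = 0, r1 = 0, r2 = 0, r3 = 0;
sw.Restart();
for (int i = 0; i < iterations; i++)
{
    r0 = a[0] * b[0];
    r1 = a[1] * b[1];
    r2 = a[2] * b[2];
    r3 = a[3] * b[3];
}
sw.Stop();
Console.WriteLine($"Scalar: {sw.ElapsedMilliseconds} ms, result ({r0}, {r1}, {r2}, {r3})");
```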

I would say that is worth the effort of using System.Runtime.Intrinsics when you have a compute-intensive application!

Happy coding 🚀! If you like my post follow me on LinkedIn 🔔!