I saw this blog entry about the relative times to compute hashes in various languages. It’s not a particularly good test because it is testing the performance of the crypto libraries associated with the platform, but is posing as a language benchmark.

For fun, I decided to do a somewhat more thorough examination of the performance of the languages instead of just the platform.

For C#, I used the version of MD5 implemented by Bouncy Castle.

As far as code goes, it’s OK, if not a bit repetitive.

In my first round of F#, I decided to try to make the code clean or at least cleaner than the C# reference implementation, so I decided to make the core routine, ProcessBlock, be monadic and operate on an immutable type. In addition, I refactored the code to table drive the cycles of the hash to eliminate copy/paste code while taking advantage of F#’s partial function application:

type MD5Hash = { H1 : uint32; H2 : uint32; H3 : uint32; H4 : uint32; } let newMD5 = { H1 = 0x67452301u; H2 = 0xefcdab89u; H3 = 0x98badcfeu; H4 = 0x10325476u; } let inline rotateLeft x n = (x <<< n) ||| (x >>> (32 - n)) let inline F u v w = (u &&& v) ||| (~~~u &&& w) let inline G u v w = (u &&& w) ||| (v &&& ~~~w) let inline H u v w = u ^^^ v ^^^ w let inline K u v w = v ^^^ (u ||| ~~~w) let round1Indexes = [| 0..15 |] let round1Addends = [| 0xd76aa478u; 0xe8c7b756u; 0x242070dbu; 0xc1bdceeeu; 0xf57c0fafu; 0x4787c62au; 0xa8304613u; 0xfd469501u; 0x698098d8u; 0x8b44f7afu; 0xffff5bb1u; 0x895cd7beu; 0x6b901122u; 0xfd987193u; 0xa679438eu; 0x49b40821u |] let round1Rotates = [| 7; 12; 17; 22 |] let round2Indexes = [| 1; 6; 11; 0; 5; 10; 15; 4; 9; 14; 3; 8; 13; 2; 7; 12 |] let round2Addends = [| 0xf61e2562u; 0xc040b340u; 0x265e5a51u; 0xe9b6c7aau; 0xd62f105du; 0x02441453u; 0xd8a1e681u; 0xe7d3fbc8u; 0x21e1cde6u; 0xc33707d6u; 0xf4d50d87u; 0x455a14edu; 0xa9e3e905u; 0xfcefa3f8u; 0x676f02d9u; 0x8d2a4c8au; |] let round2Rotates = [| 5; 9; 14; 20 |] let round3Indexes = [| 5; 8; 11; 14; 1; 4; 7; 10; 13; 0; 3; 6; 9; 12; 15; 2 |] let round3Addends = [| 0xfffa3942u; 0x8771f681u; 0x6d9d6122u; 0xfde5380cu; 0xa4beea44u; 0x4bdecfa9u; 0xf6bb4b60u; 0xbebfbc70u; 0x289b7ec6u; 0xeaa127fau; 0xd4ef3085u; 0x04881d05u; 0xd9d4d039u; 0xe6db99e5u; 0x1fa27cf8u; 0xc4ac5665u; |] let round3Rotates = [| 4; 11; 16; 23 |] let round4Indexes = [| 0; 7; 14; 5; 12; 3; 10; 1; 8; 15; 6; 13; 4; 11; 2; 9 |] let round4Addends = [| 0xf4292244u; 0x432aff97u; 0xab9423a7u; 0xfc93a039u; 0x655b59c3u; 0x8f0ccc92u; 0xffeff47du; 0x85845dd1u; 0x6fa87e4fu; 0xfe2ce6e0u; 0xa3014314u; 0x4e0811a1u; 0xf7537e82u; 0xbd3af235u; 0x2ad7d2bbu; 0xeb86d391u; |] let round4Rotates = [| 6; 10; 15; 21 |] let inline cycle func (X:uint32[]) (xindexes:int[]) (xaddends:uint32[]) (rotates:int[]) (a, b, c, d) = let a = (rotateLeft (a + (func b c d) + X.[xindexes.[0]] + xaddends.[0]) (rotates.[0])) + b let d = (rotateLeft (d + (func a b c) + X.[xindexes.[1]] + xaddends.[1]) (rotates.[1])) + a let c = (rotateLeft (c + (func d a b) + X.[xindexes.[2]] + xaddends.[2]) (rotates.[2])) + d let b = (rotateLeft (b + (func c d a) + X.[xindexes.[3]] + xaddends.[3]) (rotates.[3])) + c let a = (rotateLeft (a + (func b c d) + X.[xindexes.[4]] + xaddends.[4]) (rotates.[0])) + b let d = (rotateLeft (d + (func a b c) + X.[xindexes.[5]] + xaddends.[5]) (rotates.[1])) + a let c = (rotateLeft (c + (func d a b) + X.[xindexes.[6]] + xaddends.[6]) (rotates.[2])) + d let b = (rotateLeft (b + (func c d a) + X.[xindexes.[7]] + xaddends.[7]) (rotates.[3])) + c let a = (rotateLeft (a + (func b c d) + X.[xindexes.[8]] + xaddends.[8]) (rotates.[0])) + b let d = (rotateLeft (d + (func a b c) + X.[xindexes.[9]] + xaddends.[9]) (rotates.[1])) + a let c = (rotateLeft (c + (func d a b) + X.[xindexes.[10]] + xaddends.[10]) (rotates.[2])) + d let b = (rotateLeft (b + (func c d a) + X.[xindexes.[11]] + xaddends.[11]) (rotates.[3])) + c let a = (rotateLeft (a + (func b c d) + X.[xindexes.[12]] + xaddends.[12]) (rotates.[0])) + b let d = (rotateLeft (d + (func a b c) + X.[xindexes.[13]] + xaddends.[13]) (rotates.[1])) + a let c = (rotateLeft (c + (func d a b) + X.[xindexes.[14]] + xaddends.[14]) (rotates.[2])) + d let b = (rotateLeft (b + (func c d a) + X.[xindexes.[15]] + xaddends.[15]) (rotates.[3])) + c (a, b, c, d) let round1 X = cycle F X round1Indexes round1Addends round1Rotates let round2 X = cycle G X round2Indexes round2Addends round2Rotates let round3 X = cycle H X round3Indexes round3Addends round3Rotates let round4 X = cycle K X round4Indexes round4Addends round4Rotates let processBlock md (X:uint32[]) = let (a, b, c, d) = (md.H1, md.H2, md.H3, md.H4) |> round1 X |> round2 X |> round3 X |> round4 X { H1 = md.H1 + a; H2 = md.H2 + b; H3 = md.H3 + c; H4 = md.H4 + d }

Once this was all said and done, the F# code ran 20% slower than the equivalent C#. This was unexpected. I figured the compiler would have a field day on the invariants and be able to expand things out. Weird. Turns out that a chunk of the time was going into the partial function application, so I eliminated that, which also meant eliminating that nice factoring that let me table drive everything to cut down the repetition. After doing that, the F# was 10% slower than the equivalent C#. In running performance analysis, I F# was hitting the garbage collector routinely, whereas C# was not, so I re-refactored processBlock and eliminated all allocation. Still slower. I looked at the IL and realized that I was paying a penalty for all the locals. If you look at the original version, there 5 bindings to each of a, b, c, and d – that means 20 locals. Once expanded into a full version of all 4 rounds, there are 80 locals.

I re-re-refactored the code to mutate a, b, c, and d as locals. Once this was done, the F# code ran 5% faster than the C#, almost certainly due to the inline keyword eliminating method calls.

Using the same benchmark test as Vasily in his blog, I found on my machine that C# will run his test in 5.7 seconds, F# in 5.1s and the .NET Crypto library in 2.1. The difference between the C#/F# and the .NET library version has everything to do with the .NET library version likely being written in C++ or assembly. The ratio of 2.4:1 between F# and .NET library code is consistent with results that I’ve seen for comparing pure IL to a similar native C++ implementation of equivalent code.

Here is the much uglier version of the F# code:

let processBlock1 (hash:uint32[]) (X:uint32[]) = let mutable a = hash.[0] let mutable b = hash.[1] let mutable c = hash.[2] let mutable d = hash.[3] // round 1 a <- (rotateLeft (a + (F b c d) + X.[0] + 0xd76aa478u) 7) + b d <- (rotateLeft (d + (F a b c) + X.[1] + 0xe8c7b756u) 12) + a c <- (rotateLeft (c + (F d a b) + X.[2] + 0x242070dbu) 17) + d b <- (rotateLeft (b + (F c d a) + X.[3] + 0xc1bdceeeu) 22) + c a <- (rotateLeft (a + (F b c d) + X.[4] + 0xf57c0fafu) 7) + b d <- (rotateLeft (d + (F a b c) + X.[5] + 0x4787c62au) 12) + a c <- (rotateLeft (c + (F d a b) + X.[6] + 0xa8304613u) 17) + d b <- (rotateLeft (b + (F c d a) + X.[7] + 0xfd469501u) 22) + c a <- (rotateLeft (a + (F b c d) + X.[8] + 0x698098d8u) 7) + b d <- (rotateLeft (d + (F a b c) + X.[9] + 0x8b44f7afu) 12) + a c <- (rotateLeft (c + (F d a b) + X.[10] + 0xffff5bb1u) 17) + d b <- (rotateLeft (b + (F c d a) + X.[11] + 0x895cd7beu) 22) + c a <- (rotateLeft (a + (F b c d) + X.[12] + 0x6b901122u) 7) + b d <- (rotateLeft (d + (F a b c) + X.[13] + 0xfd987193u) 12) + a c <- (rotateLeft (c + (F d a b) + X.[14] + 0xa679438eu) 17) + d b <- (rotateLeft (b + (F c d a) + X.[15] + 0x49b40821u) 22) + c // round 2 a <- (rotateLeft (a + (G b c d) + X.[1] + 0xf61e2562u) 5) + b d <- (rotateLeft (d + (G a b c) + X.[6] + 0xc040b340u) 9) + a c <- (rotateLeft (c + (G d a b) + X.[11] + 0x265e5a51u) 14) + d b <- (rotateLeft (b + (G c d a) + X.[0] + 0xe9b6c7aau) 20) + c a <- (rotateLeft (a + (G b c d) + X.[5] + 0xd62f105du) 5) + b d <- (rotateLeft (d + (G a b c) + X.[10] + 0x02441453u) 9) + a c <- (rotateLeft (c + (G d a b) + X.[15] + 0xd8a1e681u) 14) + d b <- (rotateLeft (b + (G c d a) + X.[4] + 0xe7d3fbc8u) 20) + c a <- (rotateLeft (a + (G b c d) + X.[9] + 0x21e1cde6u) 5) + b d <- (rotateLeft (d + (G a b c) + X.[14] + 0xc33707d6u) 9) + a c <- (rotateLeft (c + (G d a b) + X.[3] + 0xf4d50d87u) 14) + d b <- (rotateLeft (b + (G c d a) + X.[8] + 0x455a14edu) 20) + c a <- (rotateLeft (a + (G b c d) + X.[13] + 0xa9e3e905u) 5) + b d <- (rotateLeft (d + (G a b c) + X.[2] + 0xfcefa3f8u) 9) + a c <- (rotateLeft (c + (G d a b) + X.[7] + 0x676f02d9u) 14) + d b <- (rotateLeft (b + (G c d a) + X.[12] + 0x8d2a4c8au) 20) + c // round 3 a <- (rotateLeft (a + (H b c d) + X.[5] + 0xfffa3942u) 4) + b d <- (rotateLeft (d + (H a b c) + X.[8] + 0x8771f681u) 11) + a c <- (rotateLeft (c + (H d a b) + X.[11] + 0x6d9d6122u) 16) + d b <- (rotateLeft (b + (H c d a) + X.[14] + 0xfde5380cu) 23) + c a <- (rotateLeft (a + (H b c d) + X.[1] + 0xa4beea44u) 4) + b d <- (rotateLeft (d + (H a b c) + X.[4] + 0x4bdecfa9u) 11) + a c <- (rotateLeft (c + (H d a b) + X.[7] + 0xf6bb4b60u) 16) + d b <- (rotateLeft (b + (H c d a) + X.[10] + 0xbebfbc70u) 23) + c a <- (rotateLeft (a + (H b c d) + X.[13] + 0x289b7ec6u) 4) + b d <- (rotateLeft (d + (H a b c) + X.[0] + 0xeaa127fau) 11) + a c <- (rotateLeft (c + (H d a b) + X.[3] + 0xd4ef3085u) 16) + d b <- (rotateLeft (b + (H c d a) + X.[6] + 0x04881d05u) 23) + c a <- (rotateLeft (a + (H b c d) + X.[9] + 0xd9d4d039u) 4) + b d <- (rotateLeft (d + (H a b c) + X.[12] + 0xe6db99e5u) 11) + a c <- (rotateLeft (c + (H d a b) + X.[15] + 0x1fa27cf8u) 16) + d b <- (rotateLeft (b + (H c d a) + X.[2] + 0xc4ac5665u) 23) + c // round 4 a <- (rotateLeft (a + (K b c d) + X.[0] + 0xf4292244u) 6) + b d <- (rotateLeft (d + (K a b c) + X.[7] + 0x432aff97u) 10) + a c <- (rotateLeft (c + (K d a b) + X.[14] + 0xab9423a7u) 15) + d b <- (rotateLeft (b + (K c d a) + X.[5] + 0xfc93a039u) 21) + c a <- (rotateLeft (a + (K b c d) + X.[12] + 0x655b59c3u) 6) + b d <- (rotateLeft (d + (K a b c) + X.[3] + 0x8f0ccc92u) 10) + a c <- (rotateLeft (c + (K d a b) + X.[10] + 0xffeff47du) 15) + d b <- (rotateLeft (b + (K c d a) + X.[1] + 0x85845dd1u) 21) + c a <- (rotateLeft (a + (K b c d) + X.[8] + 0x6fa87e4fu) 6) + b d <- (rotateLeft (d + (K a b c) + X.[15] + 0xfe2ce6e0u) 10) + a c <- (rotateLeft (c + (K d a b) + X.[6] + 0xa3014314u) 15) + d b <- (rotateLeft (b + (K c d a) + X.[13] + 0x4e0811a1u) 21) + c a <- (rotateLeft (a + (K b c d) + X.[4] + 0xf7537e82u) 6) + b d <- (rotateLeft (d + (K a b c) + X.[11] + 0xbd3af235u) 10) + a c <- (rotateLeft (c + (K d a b) + X.[2] + 0x2ad7d2bbu) 15) + d b <- (rotateLeft (b + (K c d a) + X.[9] + 0xeb86d391u) 21) + c hash.[0] <- hash.[0] + a hash.[1] <- hash.[1] + b hash.[2] <- hash.[2] + c hash.[3] <- hash.[3] + d

and here is the wrapping to make it into a Bouncy Castle GeneralDigest object:

type MD5FSharp() = inherit BCStuff.GeneralDigest() let mutable X = Array.create 16 0u let mutable hash = [| 0x67452301u; 0xefcdab89u; 0x98badcfeu; 0x10325476u |] let mutable xOff = 0 override this.AlgorithmName = "MD5" override this.GetDigestSize() = 16 override this.ProcessBlock() = processBlock1 hash X xOff <- 0 override this.ProcessWord(input, inOff) = X.[xOff] <- Pack.LE_To_UInt32(input, inOff) xOff <- xOff + 1 if xOff = 16 then this.ProcessBlock() override this.ProcessLength(bitLength) = if xOff > 14 then if xOff = 15 then X.[15] <- 0u this.ProcessBlock() for i in xOff .. 13 do X.[i] <- 0u X.[14] <- (uint32 (uint64 bitLength)) X.[15] <- (uint32 ((uint64 bitLength) >>> 32)) override this.DoFinal(output, outOff) = this.Finish() Pack.UInt32_To_LE(hash.[0], output, outOff); Pack.UInt32_To_LE(hash.[1], output, outOff + 4); Pack.UInt32_To_LE(hash.[2], output, outOff + 8); Pack.UInt32_To_LE(hash.[3], output, outOff + 12); this.Reset() this.GetDigestSize() override this.Reset() = base.Reset() hash <- [|0x67452301u; 0xefcdab89u; 0x98badcfeu; 0x10325476u |] xOff <- 0 Array.fill X 0 X.Length 0u

For the actual timing, I used a C# stub which is nearly identical to Vasily’s rig:

static TimeSpan DigestIt(byte[] input, byte[] output, GeneralDigest dig) { Stopwatch sw = Stopwatch.StartNew(); for (int i=0; i < 100; i++) { dig.Reset(); dig.BlockUpdate(input, 0, input.Length); dig.DoFinal(output, 0); } sw.Stop(); return sw.Elapsed; } static TimeSpan DigestIt(byte[] input, byte[] output, System.Security.Cryptography.HashAlgorithm dig) { Stopwatch sw = Stopwatch.StartNew(); for (int i = 0; i < 100; i++) { dig.Initialize(); dig.TransformFinalBlock(input, 0, input.Length); } sw.Stop(); return sw.Elapsed; } static void Main(string[] args) { GeneralDigest csDig = new MD5Digest(); GeneralDigest fsDig = new MD5.MD5FSharp(); byte[] input = new byte[10000000]; byte[] output = new byte[csDig.GetDigestSize()]; System.Security.Cryptography.MD5CryptoServiceProvider netDig = new System.Security.Cryptography.MD5CryptoServiceProvider(); TimeSpan cs = DigestIt(input, output, csDig); TimeSpan fs = DigestIt(input, output, fsDig); TimeSpan net = DigestIt(input, output, netDig); Console.WriteLine("CS: {0} FS: {1} .NET {2} ", cs, fs, net); Console.WriteLine("Done."); }

As a conclusion, when you’re running benchmarks, you need to be aware of the landscape that you’re testing. Are you testing the language or are you testing the libraries? If you’re testing the language, make sure you’re doing the actual work in the language. I was hoping for better from the initial implementation in F# as it is significantly prettier than the final implementation, but I’ll note that syntax aside, the same approach could have been done in C# and it would have read just as well. This particular problem is also not especially well-suited for F#’s strengths. Generally speaking, performance tuned number-crunching tends to look bad in any language. I had a very similar experience working with writing a PngDecoder in C#, wherein I found that one of the biggest bottlenecks in decoding was the Paeth filter, which could only be improved by manually inlining code, but the result was significant enough to matter. This, as is turns out, is the crux of many problems that I encounter in image processing. If the inside of your loop is going to run a hundred million times, everything you do inside that loop will get magnified a hundred million times. Function calls are not free when you make them 100 million times. And this is exactly why I was able to make the F# 5% faster in this code. Instead of calling the functions F, G, H, K, and rotateLeft as the C# does, the F# compiler uses them inline. This becomes significant especially with rotateLeft as this is not a particularly big function and the overhead of making the call and return are about the same as the actual work.

## About the Author

Follow on Twitter More Content by Steve Hawley