Benchmarking in .NET or How I Learned to Stop Worrying and Love Perf Regression Testing

·

7 min read

In an earlier post we were looking at the performance of function pointers introduced in .NET 5/C# 9.0. To do that we used BenchmarkDotNet (docs), the de-facto .NET benchmarking framework. This is a shotgun dump of notes to get you writing your own benchmarks- mainly for performance regression testing.

Basics

# Install benchmark template
dotnet new -i BenchmarkDotNet.Templates
# Create new benchmark project `benchmarks`
dotnet new benchmark -o benchmarks --console-app -v 0.12.1

Confusingly, you'll find references to installing BenchmarkDotNet.Tool global tool which has been deprecated (see issue 1670 and PR 1572). Likewise, the dotnet project template BenchmarkDotNet.Templates defaults to: NET Core App 3.0 (which is out of support), and version 0.12.0 of BenchmarkDotNet (which doesn't support NET 5.0 and may fail with System.NotSupportedException: Unknown .NET Runtime). Open benchmarks/benchmarks.csproj and replace:

-<TargetFramework>netcoreapp3.0</TargetFramework>
+<TargetFramework>net5.0</TargetFramework>

The template is pretty simple, so it's just as easy to run dotnet new classlib -o benchmarks and populate benchmarks.csproj with:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net5.0</TargetFramework>
    <OutputType>Exe</OutputType>
  </PropertyGroup>
  <PropertyGroup>
    <PlatformTarget>AnyCPU</PlatformTarget>
    <DebugType>portable</DebugType>
    <DebugSymbols>true</DebugSymbols>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <Optimize>true</Optimize>
    <Configuration>Release</Configuration>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.12.1" />
    <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.12.1" Condition="'$(OS)' == 'Windows_NT'" />
  </ItemGroup>
</Project>

In benchmarks/Program.cs, replace the main body with the following to get the same functionality as the benchmark runner CLI (from the docs):

using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Running;

namespace benchmarks
{
    class Program
    {
        static void Main(string[] args) =>
            BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }
}

Create a benchmark in Benchmarks.cs:

using BenchmarkDotNet;
using BenchmarkDotNet.Attributes;

namespace benchmarks
{
    public class Delegate
    {
        [Benchmark]
        public void CallDelegate()
        {
            // Do nothing for now
        }
    }
}

Try running the benchmarks:

dotnet run --project ./Benchmarks/benchmarks.csproj -- --filter "benchmarks.*"

If you leave off --filter it will wait for input from stdin.

To see the list of CLI options:

dotnet run --project ./Benchmarks/benchmarks.csproj -- -h

More

Invariably, you're testing some other library as opposed to writing stand-alone benchmarks. And then at some point it may fail with Assembly XXX which defines benchmarks references non-optimized YYY. Everything needs to be compiled for release:

# Add a project reference to assembly we're benchmarking
dotnet add benchmarks reference ./nng.NET.Shared/nng.NET.Shared.csproj
# Note --configuration
dotnet run --project ./Benchmarks/benchmarks.csproj --configuration release -- --filter "benchmarks.*"

Similar to other testing frameworks, you can have initialization and cleanup (that is excluded from benchmark timings) with [GlobalSetup]/[GlobalCleanup] and [IterationSetup]/[IterationCleanup] for per-class and per-iteration behavior, respectively (docs):

public class Delegate
{
    delegate int nng_aio_set_output_delegate(nng_aio aio, int index, IntPtr arg);
    nng_aio_set_output_delegate nng_aio_set_output;

    [GlobalSetup]
    public void GlobalSetup()
    {
        var handle = DllImport.Load();
        var ptr = NativeLibrary.GetExport(handle, "nng_aio_set_output");
        nng_aio_set_output = Marshal.GetDelegateForFunctionPointer<nng_aio_set_output_delegate>(ptr);
    }

    [Benchmark]
    public void CallDelegate()
    {
        var _ = nng_aio_set_output(nng_aio.Null, 9, IntPtr.Zero);
    }
}

There are two main approaches to defining benchmarks:

Which allows you to configure all sorts of interesting things: benchmark parameters, runtime jobs (e.g. .NET version, JIT/GC), etc.

If you've got multiple classes the CLI results are spread out. To get a single, tidy table at the end:

var config = ManualConfig.Create(DefaultConfig.Instance)
    .WithOptions(ConfigOptions.JoinSummary);
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);

ManualConfig has numerous more options. Everything in ConfigOptions can be specified as CLI arguments after --. For example, JoinSummary corresponds to --join. Reference -- -h for others.

To quickly iterate at the expense of accuracy use: --job Dry, --runOncePerIteration, or some combination of --warmupCount 1 --iterationCount 1 --invocationCount 1 --unrollFactor 1.

Reports

By default, csv, html, and markdown results are written to BenchmarkDotNet.Artifacts/results/. Other formats are possible with -e, and the base output path can be specified with -a/--artifacts. The markdown is suitable for posts such as this, CI web interfaces, etc.:

BenchmarkDotNet=v0.12.1, OS=macOS 11.3.1 (20E241) [Darwin 20.4.0]
Intel Core i5-8279U CPU 2.40GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.101
[Host]     : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT
DefaultJob : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT
TypeMethodMeanErrorStdDevRatioRatioSD
DelegateCallDelegate16.934 ns0.1687 ns0.1578 ns16.230.23
DllImportCallDllImport6.713 ns0.0666 ns0.0590 ns6.440.09
InterfaceCallInterfaceToDllImport9.884 ns0.1892 ns0.2652 ns9.550.35
PointerCallFunctionPointer6.764 ns0.1519 ns0.1347 ns6.490.18
DllImportCallManaged1.042 ns0.0125 ns0.0110 ns1.000.00

To get graphs, install the R programming language and use -e rplot. If rscript is in your PATH environment variable, it will automatically output several png files, otherwise you can manually run rscript BenchmarkDotNet.Artifacts/results/BuildPlots.R:

Comparisons

What's "fast" (or "slow")? Benchmarks are sometimes useless without something to compare them to.

You can define a baseline:

[Benchmark(Baseline = true)]
public void CallManaged()
{
    var _ = Managed(IntPtr.Zero, 9, IntPtr.Zero);
}

// Prevent inline-ing
[MethodImpl(MethodImplOptions.NoInlining)]
static nint Managed(nint _unused0, int _unused1, nint _unused2) => 0;

Having a baseline also gives you "ratio" and "ratioSD" (StdDev) columns in the output.

BenchmarkDotnet doesn't natively support comparing results from different runs (i.e. before and after code changes), but you can use ResultsComparer from the .NET project:

# Clone repo and build ResultsComparer
git clone https://github.com/dotnet/performance dotnet_perf
pushd dotnet_perf/src/tools/ResultsComparer
dotnet build -c release
popd
# View CLI options
dotnet ./dotnet_perf/artifacts/bin/ResultsComparer/Release/net5.0/ResultsComparer.dll -h

# Run benchmarks with full JSON reports
dotnet run --project ./Benchmarks/benchmarks.csproj -c release -- -f "benchmarks.*" -e fulljson
# Compare with results in ./base
dotnet ./dotnet_perf/artifacts/bin/ResultsComparer/Release/net5.0/ResultsComparer.dll --base ./base --diff ./BenchmarkDotNet.Artifacts --threshold '5%'

ResultsComparer requires "full" JSON reports. So, you'll have to do one of:

  • Run with -e fulljson
  • Call AddExporter(BenchmarkDotNet.Exporters.Json.JsonExporter.Full) on a config
  • Apply [BenchmarkDotNet.Attributes.JsonExporterAttribute.Full] attribute

--threshold sets the permitted variation between results. It can be a percentage, or length of time like 1ms.

Stdout contains markdown results:

Slowerdiff/baseBase Median (ns)Diff Median (ns)Modality
benchmarks.Interface.CallInterfaceToDllImport112104.1011.051239001.46

Oops, looks like my Thread.Sleep(1) had a measureable impact.

ResultsComparer has some unexpected behavior. It throws an exception when there's multiple reports in the same folder, so you may need to clear results from repeated runs. And it doesn't work with results from "dry" jobs.

CI

We're looking to add this as part of our Github Action workflows, but it applies equally well to any other CI/CD platform.

It's tempting to fail a build/PR if any benchmark is slower. But, sometimes it's unavoidable with a bug fix or "correct" implementation. IF we wanted to do it anyway, check if any test is "Slower":

# Output comparison as CSV `./csv`
dotnet ResultsComparer.dll --base ./base --diff ./BenchmarkDotNet.Artifacts --threshold '5%' --csv ./csv
# Parse csv, give it a header (it doesn't have one), and get 3rd column
$column = Import-csv -Path ./csv -Delimiter ';' -Header 'Benchmark','Source','Result' `
    | Select-Object 'Result'
# Check for "Slower"
($column | % { $_.Result }).Contains("Slower")

What we'd really like is a summary in the comments of a pull request- similar to our code coverage reports. There's multiple ways to add a comment from a workflow:

Using HTTP REST, PowerShell works on all platforms and requires no external dependencies:

# Run ResultsComparer and capture markdown output
$output = dotnet ResultsComparer.dll --base ./BenchmarkDotNet.Artifacts/ --diff ./diff --threshold '5%'
# Convert string[] output to single string with line-breaks
$body = $output -join '`n'
$secure_token = ConvertTo-SecureString -String $GITHUB_TOKEN -AsPlainText
# POST comment to PR
Invoke-WebRequest `
    -Method Post `
    -Headers @{accept= 'application/vnd.github.v3+json'} `
    -Uri https://api.github.com/repos/$USER/$REPO/issues/$PR_ID/comments `
    -Body (ConvertTo-Json @{body=$body}) `
    -Authentication OAuth -Token $secure_token

Where $PR_ID can be obtained from ${{github.event.issue.number}} in the workflow.

Assuming everything is correct:

We've now got a bunch pieces, but haven't really settled on how we want the process to work. Similar to before, we can attach any of the generated reports with upload-artifact/download-artifact or upload-release-asset as build artifacts or release assets, respectively. And then generate comparison reports. But, it feels best to use a wide gamut of fine-grained benchmarks to catch performance regressions in PRs, and then a smaller number of broader, generalized tests to highlight changes between releases. This calls for more experimentation.