This is a follow up to my original post, Improving FreeImage.Standard performance with Span
Where did we get to?
In the previous post, we saw that by using
Span<T>, it was possible to greatly reduce memory allocations - resulting in significant performance improvements.
I noticed when profiling memory allocations that there were still 10,000 instances of
byte. This was an interesting number because I was profiling 10K iterations of the loop - which meant that each iteration was only allocating a single
byte . I decided to see if I could eliminate this last allocation.
Shared ArrayPool limits
Looking at the code a bit further, it turned out that I had missed an important note on the
ArrayPool<T>.Shared property - the arrays in this pool have a maximum length of 2^20 bytes (1.048576 MB). Adam Sitnik's excellent article "pooling large arrays with ArrayPool" has some more detail on this.
Requesting an array larger than this from the pool results in a new array being allocated, and crucially, this new array will not return to the pool when calling
.Return() - the reference is simply abandoned, and the array will eventually be garbage collected.
My code was falling into this trap - it would request a buffer up to the full length of the image, which might be much larger than this limit.
By ensuring that the requested buffer size is always <= 2^20 bytes, I was able to eliminate all heap allocations from the stream I/O code.
This is significant because every time the garbage collector runs, all managed threads within the process are paused while the GC does it's work. Avoiding this improves application performance, particularly in heavily multi-threaded scenarios.
My optimized code now runs in 25% or less of the original time and avoids 2 - 3 garbage collections per operation (across all generations):
| Method | Job | Runtime | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated | |------------ |----- |-------- |-----------:|----------:|----------:|------:|---------:|---------:|---------:|----------:| | WriteLegacy | Clr | Clr | 2,895.0 us | 57.605 us | 72.852 us | 1.00 | 234.3750 | 234.3750 | 234.3750 | 3001944 B | | WriteSpan | Clr | Clr | 480.0 us | 8.374 us | 7.833 us | 0.17 | - | - | - | - | | | | | | | | | | | | | | WriteLegacy | Core | Core | 2,593.3 us | 7.686 us | 7.190 us | 1.00 | 332.0313 | 332.0313 | 332.0313 | 3000128 B | | WriteSpan | Core | Core | 654.7 us | 5.418 us | 5.068 us | 0.25 | - | - | - | - |
Although the BenchmarkDotNet results showed that the updated code no longer made any allocations, it also threw up an unexpected result (which the sharp-eyed reader might have already spotted in the table above): .NET Core 2.1 took longer to execute the benchmark than .NET Framework 4.7.2.
This was surprising because Microsoft have made performance a central tenet of .NET Core and have dedicated a lot of engineering effort in this area.
I was able to iteratively remove code from my benchmark until I was left with a very small amount of code that seemed to be responsible for the performance that I was seeing - the root cause appeared to be calls to
MemoryStream.Write(buffer, offset, length).
| Method | Job | Runtime | Mean | Error | StdDev | |------------ |----- |-------- |----------:|----------:|----------:| | StreamWrite | Clr | Clr | 68.21 us | 0.4611 us | 0.4088 us | | StreamWrite | Core | Core | 118.42 us | 0.5408 us | 0.5058 us |
I re-ran the benchmark using Core versions 2.2 and 3.0, and in all cases the results were consistent.
Just for good measure, I also ran the benchmark on my work PC - this was where things got interesting, as it showed Core and Framework taking almost exactly the same amount of time to run!
| Method | Job | Runtime | Mean | Error | StdDev | |------------ |----- |-------- |---------:|----------:|----------:| | StreamWrite | Clr | Clr | 75.37 us | 0.9133 us | 0.7626 us| | StreamWrite | Core | Core | 75.39 us | 0.8624 us | 0.8067 us|
The main difference between my home and work PCs is the CPU - an AMD Ryzen and an Intel i7. I was able to reproduce the first set of results on another Ryzen-based PC, so i opened an issue on Github.
At this point, I believed the issue to be related to .NET Core on a Ryzen CPU. However, I was then able to reproduce again, this time on a virtual machine running on top of an Intel Xeon processor.
| Method | Job | Runtime | Mean | Error | StdDev | |------------ |----- |-------- |----------:|----------:|----------:| | StreamWrite | Clr | Clr | 89.87 us | 1.7903 us | 2.1312 us | | StreamWrite | Core | Core | 110.88 us | 0.8309 us | 0.7772 us |
I'm really interested to see what the CoreCLR team find here.
During my investigation into the .NET Core performance issue I made a number of changes to the code to try to eliminate possible sources of error. In several cases this actually resulted in code that executes marginally faster.
- Partial loop unrolling
- Eliminated complex / floating point arithmetic
- Eliminated branching inside loops
All of these techniques improve performance slightly at the expense of readability. It's important not to get carried away - only optimize code that needs to be fast.
Benchmarks aren't real life
Having spent quite a bit of time attempting to improve stream I/O performance, I found that my changes only provide noticeable improvements in benchmark execution speed when reading bitmap files (i.e. .bmp).
When reading and writing more complex formats (JPEG, TIFF, etc), FreeImage spends a lot of time encoding / decoding on the CPU. Although I had succeeded in reducing the CPU time spent in the I/O methods, these take microseconds to execute - which isn't a noticeable gain when decoding a decent-sized JPEG takes 10ms or more.
My production code that uses FreeImage.Standard mostly works with JPEG and TIFF images, so the microseconds saved in I/O don't gain me anything directly. However, I am seeing greatly improved performance overall due to a large reduction in the number of garbage collections being triggered:
- Gen0 collections reduced by 97.3%
- Gen1 collections reduced by 96.9%
- Gen2 collections reduced by 75.0%
- Total memory allocated per operation reduced by 65.8%
- Daily throughput up by 22% (1338 ops/day)
- Daily memory usage down by 8.6TB (54.7%)