With the introduction of both Apple Silicon (M1) and AWS Graviton, ARM processors have started to enter the mainstream desktop and server markets. Long-term, things are looking good for ARM in the datacenter, but are we there yet?
The ARM-based Apple Silicon has been praised for delivering good performance with low power usage and heat generation, which is perfect for a portable device. AWS Graviton 2 also performs well from a price-performance perspective. Even Microsoft is dipping its toes in the water, so obviously there’s something to this. So is it any good for our Halon MTA? Well, the short answer is that it depends.
Let’s start with the desktop side of things. With our Visual Studio Code extensions it’s very easy to run a local containerised MTA for testing purposes. Some of us here at Halon use Apple laptops with the new M1 chip, and everything works fine in x86 mode under Rosetta 2, but we of course wanted to test it natively! Did it work without issues? Absolutely. Was it exceptional? Yes, in a few ways it was. Benchmarking Halon MTA on Apple M1 against similar x86-based desktops and even Xeon servers reveals excellent per-core performance. DKIM signing and verification performance really stood out, beating the latest (and significantly more power-hungry) Intel Xeon processors with the same number of cores. However, the workload you generate when testing your configuration and scripts is honestly too light for any difference in performance to be noticeable. In the screenshot below, you can see Halon MTA running natively and containerised on an ARM-based Mac.
Given what you just read about Halon MTA running on Apple’s ARM chips, surely the ARM-based c6g (Graviton 2) instances must be perfect for running Halon MTA on Amazon AWS? Well, here it gets a little complicated. During our first benchmarks of c6g.2xlarge instances (ARM) against c5.2xlarge (traditional x86) we saw a steady performance increase of around 20%, while being roughly 20% cheaper. Fantastic! TLS speeds with AES-128 were also phenomenal, about 30% faster. But that was until we started testing more complex workflows.
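To put those first numbers in perspective, here is a minimal sketch of what "around 20% faster while roughly 20% cheaper" means in throughput per dollar. The function name and the normalised x86 baseline of 1.0 are our own illustrative assumptions; the two ratios are just the approximate figures quoted above, not exact measurements.

```python
def relative_value(throughput_ratio: float, price_ratio: float) -> float:
    """Throughput per unit of cost, relative to a baseline instance scored 1.0."""
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return throughput_ratio / price_ratio


# c6g.2xlarge relative to c5.2xlarge, using the approximate figures above:
# about 20% more throughput at about 20% lower price.
initial_value = relative_value(throughput_ratio=1.20, price_ratio=0.80)
print(f"c6g.2xlarge: {initial_value:.2f}x throughput per dollar")
```

In other words, the two 20% figures compound to roughly one and a half times the throughput per dollar, which is why the initial results looked so promising.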
When testing isolated DKIM signing with RSA 1024 and EdDSA (elliptic curve), the x86-based c5.2xlarge was about 20% faster than the ARM-based c6g.2xlarge. With RSA 2048, the difference was even bigger, with the x86-based instance being more than twice as fast. Consequently, when benchmarking a typical email sending setup with DKIM signing (using RSA 2048) and queuing to EBS storage, the signing slowed down the ARM-based c6g.2xlarge instance to the point where it no longer made economic sense. This was verified using OpenSSL’s speed command testing RSA 2048 per-core performance, listed in the table below.
Table: RSA 2048 per-core performance in Sign / s and Verify / s (rows include Apple Mac Mini M1).
This might appear very strange, given the exceptional per-core RSA performance of Apple’s M1 chip. Evidently, it’s more about the specific CPU than the architecture itself. The following tweet suggests the Amazon AWS team are aware of the RSA performance issues.
got a response from AWS: “Our Graviton teams are aware of a performance gap for RSA and have acknowledged it to be a sub-optimal workload for Graviton2.”
It’s possible that this can be attributed to ARM’s Neoverse N1 platform, on which Graviton 2 is based, because Cloudflare’s benchmarks of another N1-based CPU, Altra from Ampere, seem to show similar RSA performance.
To summarise: ARM can be fast, and is almost certainly here to stay. Is it good for running an MTA server such as Halon? It could be, but it very much depends on the workload, configuration and the type of ARM chip. Fingers crossed for Graviton 3 addressing those RSA issues.
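To make "it depends" a little more concrete, the sketch below applies a simple break-even rule: a cheaper instance only pays off if it keeps a larger share of the throughput than it costs in price. The scenario labels, the helper name and the idea of treating isolated signing speed as whole-workload throughput are simplifying assumptions of ours; the ratios are the rough figures mentioned in this post.

```python
# Approximate price of c6g.2xlarge relative to c5.2xlarge ("roughly 20% cheaper").
ARM_PRICE_RATIO = 0.80

# ARM throughput relative to x86 (x86 = 1.0), using the rough figures from this post.
# Assumption: the signing scenarios treat signing as the only bottleneck.
scenarios = {
    "general delivery (ARM ~20% faster)": 1.20,
    "RSA 1024 / EdDSA signing (x86 ~20% faster)": 1 / 1.20,
    "RSA 2048 signing (x86 at least 2x faster)": 0.50,  # upper bound for ARM
}


def arm_is_cheaper_per_message(throughput_ratio, price_ratio=ARM_PRICE_RATIO):
    """ARM wins on cost per message only if its throughput ratio exceeds its price ratio."""
    return throughput_ratio > price_ratio


verdicts = {name: arm_is_cheaper_per_message(r) for name, r in scenarios.items()}
for name, arm_wins in verdicts.items():
    print(f"{name}: {'ARM' if arm_wins else 'x86'} is cheaper per message")
```

Under these assumptions the break-even point sits at 80% of the x86 throughput, so a signing-bound RSA 2048 workload falls well below it, while the lighter signing algorithms stay just above.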
All tests were performed on AWS c5.2xlarge and c6g.2xlarge instances, as well as a Mac Mini with the M1 chip.
The Halon MTA is a flexible email operations and security platform.
It enables organisations that operate large-scale email services to offer competitive features by rapid implementation
and to lower maintenance costs through reliable deployment and reduced complexity.