ChaCha20: add optimized versions for amd64 (SSSE3 & AVX2) #392
base: 1.1
Speeds up local iperf3 in Linux network namespaces by roughly 20%.
I tried the NEON implementation from the same author. It doesn't have as much of an edge over the generic one, at least on Neoverse-N1, so iperf3 speeds are comparable. If anyone has an old Raspberry Pi and wants to test this on a slower CPU, be my guest. The code is in pretty bad shape since it's a work in progress, but it does work. Otherwise, I see no reason to merge NEON support.
generic
NEON
Badly modified `ns_ping.py` for running iperf3
#!/usr/bin/env python3

"""Create two network namespaces and run ping between them."""

import os
import signal
import subprocess as subp
import typing as T

from testlib import external as ext, util, template, cmd
from testlib.log import log
from testlib.proc import Tinc, Script
from testlib.test import Test

util.require_root()
util.require_command("ip", "netns", "list")
util.require_path("/dev/net/tun")

IP_FOO = "192.168.1.1"
IP_BAR = "192.168.1.2"
MASK = 24


def init(ctx: Test) -> T.Tuple[Tinc, Tinc]:
    """Initialize new test nodes."""
    foo, bar = ctx.node(), ctx.node()

    log.info("create network namespaces")
    assert ext.netns_add(foo.name)
    assert ext.netns_add(bar.name)

    log.info("initialize two nodes")

    stdin = f"""
        init {foo}
        set Port 0
        set Subnet {IP_FOO}
        set Interface {foo}
        set Address localhost
        set AutoConnect no
    """
    foo.cmd(stdin=stdin)
    foo.add_script(Script.TINC_UP, template.make_netns_config(foo.name, IP_FOO, MASK))
    foo.start()

    stdin = f"""
        init {bar}
        set Port 0
        set Subnet {IP_BAR}
        set Interface {bar}
        set Address localhost
        set AutoConnect no
    """
    bar.cmd(stdin=stdin)
    bar.add_script(Script.TINC_UP, template.make_netns_config(bar.name, IP_BAR, MASK))

    cmd.exchange(foo, bar)

    return foo, bar


def ping(namespace: str, ip_addr: str) -> int:
    """Send pings between two network namespaces."""
    log.info("pinging node from netns %s at %s", namespace, ip_addr)
    proc = subp.run(
        ["ip", "netns", "exec", namespace, "iperf3", "-t", "60", "-c", ip_addr], check=False
    )

    log.info("ping finished with code %d", proc.returncode)
    return proc.returncode


with Test("ns-ping") as context:
    foo_node, bar_node = init(context)
    bar_node.cmd("start")

    log.info("waiting for nodes to come up")
    bar_node[Script.TINC_UP].wait()

    log.info("add script foo/host-up")
    bar_node.add_script(foo_node.script_up)

    log.info("add ConnectTo clause")
    bar_node.cmd("add", "ConnectTo", foo_node.name)

    log.info("bar waits for foo")
    bar_node[foo_node.script_up].wait()

    subp.Popen(
        ["ip", "netns", "exec", bar_node.name, "iperf3", "-s"],
        stdout=subp.DEVNULL,
        stderr=subp.DEVNULL,
    )

    log.info("ping must work after connection is up")
    assert not ping(foo_node.name, IP_BAR)
The benchmark code you wrote is not really measuring the usage pattern tinc has. The [...]
Performance measurements using debug builds are mostly useless. Have you tried configuring the build system with [...]?
I also have a branch (PR #360) that adds AES-256-GCM support to SPTPS, which depends on OpenSSL, but it will then also use OpenSSL for Chacha20-Poly1305. It might be interesting to compare the speed with this optimized version as well.
Oh yes. If you're willing to reintroduce dependency on libssl, it would of course be best. We get highly optimized code for all possible architectures for free.
TL;DR:
*: probably a debug build
detailed results
generic
avx2
alt (OpenSSL 1.1.1.o)
alt (OpenSSL 3.1)
I don't remember what flags were used to build the library. It may be a debug build, so don't pay much attention to it.
I'll leave it here for now until after #360 is merged.
With the 'new' protocol, ChaCha is taking a decent amount of CPU time, at least in a debug build (optimization makes perf output unreadable):
tincd is using the lowest common denominator implementation of this function. Let's add a couple of optimized ones based on compiler intrinsics.
All the hard work has been done by Romain Dolbeau. I just copied it with some adjustments.
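For context on what is being vectorized: the core of ChaCha20 is the quarter round, a short sequence of 32-bit additions, XORs and rotations. The sketch below is a plain Python rendering of that operation as specified in RFC 7539, written in Python only to match the test script in this thread. It is not the generic C code shipped in tinc, and the helper names `rotl32` and `quarter_round` are chosen here for illustration.

```python
MASK32 = 0xFFFFFFFF  # ChaCha20 works on 32-bit words


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a: int, b: int, c: int, d: int):
    """One ChaCha quarter round on four 32-bit words (RFC 7539, section 2.1)."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d
```

A scalar implementation applies this to four words at a time. SIMD versions gain speed by running the same add/xor/rotate steps on several words, or several blocks, per instruction.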
Compatibility
x86 / amd64
We'll be shipping three versions of the function (or two, with old compilers that lack AVX2 support):
The right one is picked at runtime depending on current CPU capabilities.
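The selection logic can be sketched as follows. This is a language-neutral illustration written in Python to match the test script in this thread, not the PR's C code: the function names, the `compiled` parameter and the flag strings are assumptions, and the real code would query the CPU directly rather than take a set of flag names.

```python
from typing import Callable, Iterable


# Hypothetical stand-ins for the three builds of the ChaCha20 routine.
def chacha_generic(data: bytes) -> str:
    return "generic"


def chacha_ssse3(data: bytes) -> str:
    return "ssse3"


def chacha_avx2(data: bytes) -> str:
    return "avx2"


# Ordered from most to least preferred. The generic version needs no CPU flag.
CANDIDATES = [("avx2", chacha_avx2), ("ssse3", chacha_ssse3)]


def pick_impl(
    cpu_flags: Iterable[str],
    compiled: Iterable[str] = ("avx2", "ssse3"),
) -> Callable[[bytes], str]:
    """Return the best version that is both compiled in and supported by the CPU."""
    flags = set(cpu_flags)
    built = set(compiled)
    for name, impl in CANDIDATES:
        if name in built and name in flags:
            return impl
    return chacha_generic
```

Passing `compiled=("ssse3",)` models the old-compiler case where the AVX2 version was never built, so an AVX2-capable CPU still falls back to SSSE3.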
Other architectures
Only the old C implementation is used. ARM NEON could be added later.
Benchmarks
`performance` CPU governor, as few processes as possible, all the basic benchmarking stuff.
TL;DR: 20-22% increase in throughput.
bench_chacha.c
Percentage is relative to generic C implementation.
C
SSE
AVX2
Always resolving the correct function (instead of doing it once and storing in a pointer) is a bit slower:
iperf3:
buildtype=release
C
SSSE3
AVX2
I thought that optimizing for a specific CPU, or the auto-vectorization performed at `-O3`, might remove the need to write assembly:
buildtype=release + -march=native -mtune=native
C
AVX2
-O3 + -march=native -mtune=native
C
AVX2
Not really.