Here are some of the performance optimizations specifically done on Blitzping:
* Pre-Generation : All the static parts of the packet buffer get generated once, outside of the sendto() tight loop;
* Asynchronous : Configuring raw sockets to be non-blocking by default;
* Multithreading : Polling the same socket in sendto() from multiple threads; and
* Compiler Flags : Compiling with -Ofast, -flto, and -march=native (though these actually had little effect; at this point, the bottleneck is the kernel's own sendto() routine).
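To make the first two items more concrete, here is a minimal sketch in C. It is not taken from Blitzping's source: the names build_template(), patch_source(), set_nonblocking(), and send_burst(), the fixed 40-byte IPv4 + TCP SYN layout, and the port-cycling scheme are all assumptions chosen for illustration. Checksums are left at zero for brevity, and threading is omitted; in a multithreaded variant, each thread would presumably run such a loop on its own copy of the buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical layout: 20-byte IPv4 header + 20-byte TCP header. */
enum { PACKET_LEN = 40 };

/* Pre-generation: fill in every field that never changes, exactly once. */
void build_template(uint8_t pkt[PACKET_LEN], uint32_t dst_ip_host,
                    uint16_t dst_port)
{
    uint32_t dst = htonl(dst_ip_host);
    uint16_t dport = htons(dst_port);

    memset(pkt, 0, PACKET_LEN);
    pkt[0] = 0x45;            /* IPv4, header length of 5 words      */
    pkt[3] = PACKET_LEN;      /* total length (low byte; high is 0)  */
    pkt[8] = 64;              /* time to live                        */
    pkt[9] = IPPROTO_TCP;     /* payload protocol                    */
    memcpy(pkt + 16, &dst, sizeof dst);      /* destination address  */
    memcpy(pkt + 22, &dport, sizeof dport);  /* destination port     */
    pkt[32] = 0x50;           /* TCP data offset of 5 words          */
    pkt[33] = 0x02;           /* SYN flag                            */
}

/* Per-packet work: overwrite only the source address and source port. */
void patch_source(uint8_t pkt[PACKET_LEN], uint32_t src_ip_host,
                  uint16_t src_port)
{
    uint32_t src = htonl(src_ip_host);
    uint16_t sport = htons(src_port);

    memcpy(pkt + 12, &src, sizeof src);
    memcpy(pkt + 20, &sport, sizeof sport);
}

/* Asynchronous: switch an already-open socket to non-blocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* The tight loop: patch two fields, call sendto(), repeat.
 * Returns how many packets the kernel accepted. */
size_t send_burst(int fd, const struct sockaddr_in *dst,
                  uint8_t pkt[PACKET_LEN], uint32_t src_ip_host,
                  size_t count)
{
    size_t sent = 0;

    assert(fd >= 0);
    for (size_t i = 0; i < count; i++) {
        /* Cycle through source ports 1024..65023 (illustrative only). */
        patch_source(pkt, src_ip_host, (uint16_t)(1024 + i % 64000));

        ssize_t n = sendto(fd, pkt, PACKET_LEN, 0,
                           (const struct sockaddr *)dst, sizeof *dst);
        if (n == PACKET_LEN)
            sent++;
        else if (n == -1 && errno != EAGAIN && errno != EWOULDBLOCK)
            break;  /* real error: stop; a full send buffer is skipped */
    }
    return sent;
}
```

To actually emit these packets on Linux, fd would need to be a raw socket such as socket(AF_INET, SOCK_RAW, IPPROTO_RAW), which requires root or CAP_NET_RAW; a real sender would also have to fill in the TCP checksum.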
Shown below are comparisons of the three programs across two CPUs (more details at the GitHub repository):
# Quad-Core "Rockchip RK3328" CPU @ 1.3 GHz. (ARMv8-A) #
+--------------------+--------------+--------------+---------------+
| ARM (4 x 1.3 GHz)  |    nping     |    hping3    |   Blitzping   |
+--------------------+--------------+--------------+---------------+
| Num. Instances     | 4 (1 thread) | 4 (1 thread) | 1 (4 threads) |
| Pkts. per Second   |   ~65,000    |   ~80,000    |   ~275,000    |
| Bandwidth (MiB/s)  |    ~2.50     |    ~3.00     |    ~10.50     |
+--------------------+--------------+--------------+---------------+
# Single-Core "Qualcomm Atheros QCA9533" SoC @ 650 MHz. (MIPS32r2) #
+--------------------+--------------+--------------+---------------+
| MIPS (1 x 650 MHz) |    nping     |    hping3    |   Blitzping   |
+--------------------+--------------+--------------+---------------+
| Num. Instances     | 1 (1 thread) | 1 (1 thread) | 1 (1 thread)  |
| Pkts. per Second   |    ~5,000    |   ~10,000    |    ~25,000    |
| Bandwidth (MiB/s)  |    ~0.20     |    ~0.40     |     ~1.00     |
+--------------------+--------------+--------------+---------------+
I tested Blitzping against both hping3 and nping on two different routers, both running OpenWRT 23.05.03 (Linux Kernel v5.15.150) with the "masquerading" option (i.e., NAT) turned off in the firewall; one device had a single-core 32-bit MIPS SoC, and the other had a 64-bit quad-core ARMv8 CPU. On the quad-core CPU, because both hping3 and nping were designed without multithreading capabilities (unlike Blitzping), I made the competition "fairer" by launching each of them as four individual processes, whereas Blitzping ran as a single process with four threads. Across all runs and on both devices, CPU usage remained at 100%, entirely dedicated to the currently running program. Finally, the connection speed itself was not a bottleneck: both devices were connected to an otherwise-unused 200 Mb/s (23.8419 MiB/s) download/upload line through a WAN ethernet interface.

It is important to note that Blitzping was not doing any less than hping3 and nping; in fact, it was doing more. While hping3 and nping only randomized the source IP and port of each packet to a fixed address, Blitzping randomized not only the source port but also the IP within a CIDR range, a capability that is more computationally intensive and a feature that both hping3 and nping lacked in the first place. Lastly, hping3 and nping were both launched with the "best-case" command-line parameters so as to maximize their speed and disable runtime stdio logging.
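For readers wondering what randomizing an IP "within a CIDR range" involves per packet, the following is one possible sketch in C. The function names and the choice of a xorshift32 generator are my own assumptions for illustration; the text above does not say which random-number generator Blitzping uses.

```c
#include <assert.h>
#include <stdint.h>

/* Small, fast pseudo-random generator (xorshift32). The state must be
 * seeded with a nonzero value. Assumed here purely for illustration. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Pick an address inside base/prefix_len (host byte order): keep the
 * network bits of the base and randomize only the host bits. */
uint32_t random_ip_in_cidr(uint32_t base_host, unsigned prefix_len,
                           uint32_t *rng_state)
{
    assert(prefix_len <= 32);

    /* A shift by 32 is undefined in C, so a /32 is handled separately. */
    uint32_t host_mask =
        (prefix_len == 32) ? 0u : (0xFFFFFFFFu >> prefix_len);

    return (base_host & ~host_mask) | (xorshift32(rng_state) & host_mask);
}
```

For example, calling random_ip_in_cidr(0xC0A80000u, 16, &state) would always return an address in 192.168.0.0/16, with the lower 16 bits varying from call to call. This adds a random-number draw and a few bitwise operations to every packet.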