Cheers,
For a week now we have had our first NGF cluster in production use. The system seems to behave OK, but I am unsure whether the current CPU and I/O load would be considered "normal" or whether something could be done to lower it. Since this is not really a support case (maybe this is completely normal behaviour, but it does seem high to me), I thought I would check with the community.
I apologize in advance if this seems boring or unnecessarily specific.
The system is a cluster of two NG Firewalls (7.1.1), installed on vSphere 6.5 with 2 CPUs and 4 GB RAM per VM. Only the Firewall service is active.
The current firewall throughput is roughly as follows:
- 3000 - 7000 concurrent sessions
- between 50 and 130 Mbit/s throughput for the Forwarding Firewall
With this throughput, the active firewall node shows the following load:
- a continuous load average of about 2
- according to top: %us and %sy each around 5%, but %wa (I/O wait) usually around 30% or more, and %si (soft interrupts) between 20 and 30%
- iotop shows a write throughput of around 500 KB/s with regular bursts of up to 4 MB/s (but not more)
- the process doing almost all of the I/O (about 90%, according to iotop) is kjournald
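(In case the exact commands matter: the numbers above come from watching top and iotop interactively. Something along these lines should reproduce them in batch mode - assuming pidstat from the sysstat package is available on the box:)

    top -b -n 1 | head -20    # CPU breakdown, including %wa and %si
    iotop -b -o -n 3 -d 5     # only processes that are actually doing I/O
    pidstat -d 5 3            # per-process read/write KB/s as a cross-check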
Compared to that, the secondary node has an average load of about 0.1.
My current analysis and understanding is that "something" on the active node is doing a lot of I/O, which shows up as the high I/O wait in top and results in the curiously high overall load of 2 even though the CPU itself isn't doing all that much (as far as I know, tasks stuck in uninterruptible I/O sleep count towards the Linux load average).
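(A quick way to confirm this is to look for tasks in uninterruptible sleep - state "D" - while the load is high; on the active node I would expect to catch a few of them most of the time:)

    ps -eo state,pid,comm | awk '$1=="D"'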
My first assumption was that the underlying storage might have an issue - the VMs are on a shared datastore provided by vSAN. But a quick test with bonnie++ on another VM on the same datastore showed a write throughput of well above 100 MB/s (at least 25x the usual throughput of the firewall VM), so I am fairly certain that the current disk I/O of the active firewall node shouldn't be an issue, at least from a throughput point of view.
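(For anyone who wants to repeat the comparison, a plain sequential run along these lines is enough - the file size set to roughly twice the test VM's RAM so the page cache doesn't skew the result; directory and user are of course placeholders:)

    bonnie++ -d /mnt/test -s 8192 -n 0 -u root    # 8 GB test file, skip the small-file creation tests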
What is also puzzling to me is that kjournald is shown as the process with the most I/O. If another process were the culprit, I could simply run lsof against it to check its open files, but lsof on kjournald yields basically nothing - which makes sense, I suppose, since it is a kernel thread handling the ext3 journal and has no open files of its own.
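(One idea I have not tried yet: since kjournald only commits what other processes dirty, the old vm.block_dump switch might reveal the real writers - assuming the NGF kernel still supports it, and ideally with syslog stopped for the moment so the kernel messages don't generate extra I/O themselves:)

    echo 1 > /proc/sys/vm/block_dump
    sleep 30
    echo 0 > /proc/sys/vm/block_dump
    dmesg | grep dirtied | tail -50    # process names and the files/inodes they dirtied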
I suspected the activity log as the most likely candidate and deactivated logging to the activity log for a couple of rules that produced the majority of the activity log entries - but this didn't help at all.
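(Another thing I might try is to simply look for files that keep growing, independent of which process writes them - something like this, run a few times and compared, assuming GNU find is on the box:)

    find / -xdev -type f -mmin -2 -size +1M -printf '%s %p\n' 2>/dev/null | sort -rn | head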
I am a bit unsure whether this is expected behaviour for the load described above or whether it should be significantly lower. There shouldn't be much additional load or throughput on this firewall in the foreseeable future, but we will be implementing additional NGF clusters next year, and I would like to have a good understanding of the expected load profile before we run into performance trouble then.
Sorry for the long post - but maybe someone can give me a good hint as to what might be causing this and whether there is an easy way to bring the load down.
Cheers and thx,
Michael