What load / IO waits are considered "normal" or expected?

NGF load I/O

1 reply to this topic

#1 Michael Szczuka

Posted 15 November 2017 - 09:51 AM

Cheers,

Our first NGF cluster has been in production for a week now. The system seems to behave OK, but I am unsure whether the current CPU and I/O load would be considered "normal" or whether something could be done to lower it. Since this is not really a support case (this may be completely normal behaviour, but it does seem high to me), I thought I would check with the community.

I apologize in advance if this seems boring or unnecessarily specific.

 

The system is a cluster of two NG Firewalls (7.1.1), installed on vSphere 6.5 with 2 CPUs and 4 GB RAM for each VM. Only the Firewall service is active.

The current Firewall throughput is roughly like this: 

  • 3000 - 7000 concurrent sessions
  • between 50 and 130 MBit/sec throughput for the Forwarding Firewall

With this throughput, the active firewall node shows the following load:

  • continuous load average of about 2
  • according to top: CPU %us and %sy each around 5%, but %wa (I/O waits) usually around 30% or more, and %si (soft interrupts) between 20% and 30%
  • iotop shows a write throughput of around 500 KB/sec with regular bursts of up to 4 MB/sec (but not more)
  • the process doing almost all the I/O (about 90%, based on iotop) is kjournald

Compared to that, the secondary node has an average load of 0.1
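
The top figures above can be cross-checked directly against the counters in /proc/stat. A minimal sketch, assuming a Linux-style /proc/stat is readable on the box; the function names are made up for illustration:

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return the jiffy counters from the aggregate 'cpu' line as a dict."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Share of each CPU state between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart, and return percentages."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return cpu_percentages(first, second)
```

Calling `sample(5.0)` on the active node and looking at the `iowait` and `softirq` entries should give roughly the same picture as %wa and %si in top.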

 

My current analysis and understanding is that "something" is doing lots of I/O on the active node which manifests as high I/O waits seen with top, resulting in the curiously high overall load of 2 (since the CPU itself isn't doing all that much).
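
On Linux, the load average counts tasks in uninterruptible sleep (state D, typically waiting on disk I/O) as well as runnable tasks (state R), which is how a load of 2 can coexist with a mostly idle CPU. A rough sketch for checking which states tasks are in, assuming /proc is readable; it looks at one state per PID only and ignores individual threads under /proc/<pid>/task:

```python
import os
from collections import Counter


def task_state(stat_text):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split after the last ')'.
    """
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def count_states(proc_root="/proc"):
    """Count processes per state letter (R, S, D, ...)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[task_state(f.read())] += 1
        except (OSError, ValueError, IndexError):
            # Process exited between listdir() and open(), or unreadable entry.
            continue
    return counts
```

Running `count_states()` a few times in a row and watching the `D` count would show whether tasks blocked on I/O account for most of the load figure.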

 

My first assumption was that the underlying storage might have an issue - the VMs are on a shared datastore, provided by vSAN.

But a quick test with bonnie++ on another VM on the same datastore showed a write throughput of well above 100 MB/sec (at least 25x higher than the usual throughput of the firewall VM). So I am fairly certain that the current disk I/O of the active firewall node shouldn't be an issue (from a throughput POV).

What also puzzles me is that kjournald is shown as the process with the most I/O. If it were any other process, I could just run lsof on it to see which files it has open, but lsof on kjournald yields basically nothing.
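
kjournald is the ext3 journaling kernel thread, so it has no open files of its own for lsof to list; it commits changes that other processes make to the filesystem. One way to look for those processes might be to compare the write_bytes counter in /proc/<pid>/io between two snapshots. A sketch, assuming the kernel has per-task I/O accounting enabled (iotop relies on it too) and that the script runs as root; the helper names are hypothetical, and whether this attributes the journal traffic usefully on an NGF box is untested:

```python
import os


def parse_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = int(value)
    return result


def snapshot(proc_root="/proc"):
    """Map pid -> (command name, write_bytes) for every readable process."""
    snap = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "io")) as f:
                written = parse_io(f.read()).get("write_bytes", 0)
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
            name = stat[stat.index("(") + 1:stat.rindex(")")]
        except (OSError, ValueError):
            # Process vanished, permission denied, or unexpected format.
            continue
        snap[int(entry)] = (name, written)
    return snap


def top_writers(before, after, limit=10):
    """Return (bytes written, pid, name) tuples, largest writers first."""
    rows = []
    for pid, (name, written) in after.items():
        if pid in before:
            delta = written - before[pid][1]
            if delta > 0:
                rows.append((delta, pid, name))
    rows.sort(reverse=True)
    return rows[:limit]
```

Taking one `snapshot()`, waiting a minute or so, taking a second one and passing both to `top_writers()` gives a ranked list of processes that caused writes in that window.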

 

I suspected the activity log as the most likely candidate and deactivated logging to the activity log for a couple of rules that produced the majority of the entries, but this didn't help at all.

 

I am unsure whether this is expected behaviour for the traffic described above or whether the load should be significantly lower. There shouldn't be much additional load or throughput on this firewall in the foreseeable future, but we will be implementing additional NGF clusters next year, and I would like to have a good understanding of the expected load profile before we run into performance trouble then.

 

Sorry for the long post, but maybe someone can give me a good hint as to what might be causing this and whether there is an easy way to bring the load down.

 

Cheers and thx,

Michael

 



#2 Manuel Huber

Posted 30 January 2018 - 04:17 PM

I'd like to bump this topic because I'd be very interested in a reply from someone with more knowledge. Maybe Barracuda could give us some insight?
From my experience, a load of 2 for this number of sessions and this throughput seems realistic.
Example on a comparable setup: 20000-25000 sessions, 80 MBit/s throughput (both according to the firewall's statistics) --> load 1.8
But of course this depends heavily on which services are active. Sometimes I also see higher load on boxes for no obvious reason. For example, a few F600s show a funny pattern: 4h of low load around 0.2, then 2h at 1.2, then 0.2 again for 4h and so on, with no relation to sessions, throughput or scheduled tasks (these F600s have relatively low traffic at the moment). Even the passive HA boxes show this pattern. So in short, I have no clue why there is load on these boxes (and Barracuda support couldn't give any insight either).

I'm also bringing this topic back up because we have some big firewalls with a pretty constant load >1 per core, most likely due to I/O, and even though they are equipped with fast SSDs, we don't know how to prepare for higher demands other than by turning off features. But actually we'd like to turn on more!