Optimize Linux Kernel to reduce softirqs using Red Hat Tuned

Overview
Explanation
Experiment Methods
How sysjitter works
Conclusion
References

Overview

The ultimate goal is to reduce the number of soft interrupts (softirqs) on the live quant trading machines. The experiment changes kernel command-line, sysctl, sysfs, and CPU parameters through a Tuned profile, then compares the kernel tracing history of a CPU against the output of sysjitter (a program that detects jitter on each core of the CPU) to determine the impact of each parameter on soft interrupts.

The problem so far is that only a few parameters are clearly useful, such as tsc=reliable and nohz_full=…, while the rest show no clear effect on soft interrupts.

In addition, even when the same Tuned configuration is run several times, the difference in results may be relatively large.

All the tests were done on core 1. (Rocky Linux 9.2, AMD 7950X, same configuration as the live trading machine we use for quantitative trading)

Heuristically determined optimal Tuned configuration so far:

[bootloader]
#Command line arguments passed to the kernel at system boot time
#Any changes here will only take effect after reboot.
cmdline = processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll mce=ignore_ce nmi_watchdog=0 coredump_filter=0x3b iommu=off intel_iommu=off amd_iommu=off isolcpus=1-15 nosoftlockup transparent_hugepage=never rcu_nocbs=1-15 nohz_full=1-15 irqaffinity=0 selinux=0 audit=0 tsc=reliable skew_tick=1

[cpu]
governor=performance
energy_perf_bias=performance

[sysfs]
/sys/bus/workqueue/devices/writeback/cpumask = 1
/sys/devices/virtual/workqueue/cpumask = 1
/sys/devices/virtual/workqueue/*/cpumask = 1
/sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1

[script]
script=${i:PROFILE_DIR}/script.sh

[sysctl]
kernel.hung_task_timeout_secs = 600
kernel.nmi_watchdog = 0
vm.stat_interval = 10
kernel.timer_migration = 0
kernel.sched_rt_runtime_us = -1
#kernel.sched_min_granularity_ns=10000000
vm.dirty_ratio = 10
vm.dirty_background_ratio=3
vm.nr_hugepages=4096
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.ipv4.tcp_ecn=0
#net.core.netdev_max_backlog=250000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=16777216
net.ipv4.tcp_mem=16777216 16777216 16777216
net.ipv4.tcp_rmem=32768 436600 16777216
net.ipv4.tcp_wmem=8192 436600 16777216
net.ipv4.tcp_low_latency=1

Explanation

bootloader parameters

processor.max_cstate=0 and intel_idle.max_cstate=0: These two parameters limit the deepest C-state the CPU can enter. C-states are the power management states of the CPU: the larger the number (e.g., C1-C3), the deeper the power saving state, but the longer it takes to recover from this state to normal operation. Setting the limit to 0 prevents the CPU from entering a power saving state, which removes the delay of recovering from it. References: Prevent CPU Idling; EXABLAZE: Benchmarking

idle=poll: This parameter sets the idle state of the CPU to poll, i.e. the CPU will always poll when idle and will not go into a power saving state.

iommu=off, intel_iommu=off, and amd_iommu=off: These parameters turn off the IOMMU, a hardware unit, often used for virtualization, that manages device access to memory. It affects DMA performance on some newer platforms such as the AMD 7950X.

nmi_watchdog=0: This parameter turns off the NMI watchdog, a mechanism used to detect hard lockups on the system. Turning it off reduces some of the system overhead.

coredump_filter=0x3b: This parameter sets the default core dump filter, a bitmask that selects which kinds of memory mappings are written to a core dump.
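To make the 0x3b value less opaque, the following is a minimal sketch that decodes the mask into the mapping types it selects. The bit meanings follow the core(5) manual page; the function and table names are hypothetical and only serve this illustration.

```python
# Bit positions of the core dump filter, as listed in core(5).
# The names COREDUMP_FILTER_BITS and decode_coredump_filter are illustrative only.
COREDUMP_FILTER_BITS = {
    0: "anonymous private mappings",
    1: "anonymous shared mappings",
    2: "file-backed private mappings",
    3: "file-backed shared mappings",
    4: "ELF headers",
    5: "private huge pages",
    6: "shared huge pages",
    7: "private DAX pages",
    8: "shared DAX pages",
}


def decode_coredump_filter(mask: int) -> list[str]:
    """Return the mapping types whose bit is set in the mask."""
    return [name for bit, name in COREDUMP_FILTER_BITS.items() if mask & (1 << bit)]


if __name__ == "__main__":
    for name in decode_coredump_filter(0x3B):
        print(name)
```

With 0x3b, bits 0, 1, 3, 4 and 5 are set, so file-backed private mappings and shared huge pages are left out of the dump.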

isolcpus=1-15: This parameter sets the CPU isolation. In this example, CPU1 to CPU15 are isolated, they will not be used by the OS scheduler and only specific processes can use these CPUs.

nosoftlockup: This parameter disables soft lockup detection. A soft lockup is a situation in which a CPU stays in kernel mode for a long time without giving other tasks a chance to run. Turning off this detection reduces some of the system overhead.

transparent_hugepage=never: This parameter turns off transparent huge pages (THP). THP is a memory management technique in which the kernel backs memory with huge pages automatically, and turning it off can reduce some of the memory management overhead.

rcu_nocbs=1-15: This parameter sets the RCU (Read-Copy-Update) no-callback CPUs. In this example, CPU1 to CPU15 are set as no-callback CPUs: their RCU callbacks are offloaded to kernel threads instead of being handled on those CPUs.

nohz_full=1-15: This parameter sets the full dynticks CPUs. In this example, CPU1 to CPU15 are set as full dynticks CPUs, and their clock interrupts will be turned on and off dynamically. Reference: CPU Isolation - Nohz_full troubleshooting: broken TSC/clocksource

irqaffinity=0: This parameter sets the default IRQ affinity. In this example, all IRQs will be sent to CPU0 by default. Reference: EXABLAZE: Benchmarking

selinux=0 and audit=0: These parameters turn off SELinux, a security module, and audit, a mechanism for logging system activity, so turning them off reduces some system overhead. Reference: EXABLAZE: Benchmarking

tsc=reliable: This entry tells the kernel that the timestamp counter (TSC) is reliable, is synchronized across all processors, and will not stop when the processor transitions between P and C states. This may help to improve the performance of certain applications that rely on accurate time measurement.

skew_tick=1: In a multiprocessor system, each processor normally receives periodic clock interrupts that drive the system clock and trigger timers. However, this may result in all processors receiving clock interrupts at the same time, and the resulting contention may affect system performance. To avoid this, the Linux kernel provides the skew_tick parameter. When skew_tick is set to 1, the kernel staggers clock interrupts between processors to reduce this contention. When skew_tick is set to 0, which is the kernel default, the kernel does not stagger clock interrupts and all processors receive them at the same time.

/sys/bus/workqueue/devices/writeback/cpumask = 1: This parameter sets the CPU mask for the writeback workqueue, a hexadecimal bitmask that specifies which CPUs are allowed to run its work items. In this example, the value “1” means that only CPU 0 can perform these tasks.

/sys/devices/virtual/workqueue/cpumask = 1: This parameter sets the global CPU mask for unbound workqueues. Again, a value of “1” means that only CPU 0 can run their work items.

/sys/devices/virtual/workqueue/*/cpumask = 1: This parameter sets the CPU mask for each individual workqueue exported under that directory.

/sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1: This parameter sets whether or not to ignore the Correctable Errors reported by Machine Check. If set to “1”, the system will ignore these errors.
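Because the cpumask values above (and the smp_affinity values written by the script later on) are hexadecimal bitmasks, it can help to convert between masks and CPU numbers. The following is a small sketch; the helper names are hypothetical.

```python
def cpus_from_mask(mask_text: str) -> list[int]:
    """Expand a hex cpumask such as "1" or "fffe" into a list of CPU numbers."""
    # sysfs prints wide masks as comma-separated 32-bit groups, e.g. "ff,ffffffff".
    mask = int(mask_text.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


def mask_from_cpus(cpus) -> str:
    """Build the hex cpumask that selects exactly the given CPUs."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


if __name__ == "__main__":
    print(cpus_from_mask("1"))            # housekeeping mask used in this profile
    print(mask_from_cpus(range(1, 16)))   # mask covering the isolated CPUs 1-15
```

A mask of 1 selects only CPU 0, while the isolated CPUs 1-15 correspond to the mask fffe.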

sysctl Parameters

kernel.hung_task_timeout_secs = 600: This parameter sets how long, in seconds, a task may remain blocked (i.e., unable to continue execution) before the kernel treats it as hung. If a task does not make progress within this time, a warning message is printed.

kernel.nmi_watchdog = 0: This parameter disables the NMI watchdog, a mechanism used to detect hard lockups in the system, and disabling it reduces system overhead.

vm.stat_interval = 10: This parameter sets the interval in seconds between updates of virtual memory statistics.

kernel.timer_migration = 0: This parameter turns off timer migration. Timer migration is an optimization technique to migrate timers from one CPU to another, and turning it off can reduce some system overhead.

kernel.sched_rt_runtime_us = -1: This parameter sets the amount of CPU time in microseconds that a real-time task can use in each scheduling cycle. A setting of -1 means there is no limit.

vm.dirty_ratio = 10 and vm.dirty_background_ratio = 3: These parameters set thresholds on the amount of dirty memory (i.e., memory that has not yet been written to disk), expressed as a percentage of available memory. dirty_background_ratio is the threshold at which the system starts writing dirty memory to disk in the background, and dirty_ratio is the threshold at which a process that is generating writes is forced to write dirty memory out itself.

vm.nr_hugepages=4096: This parameter sets the number of huge pages reserved by the system. Huge pages are a memory management technique that reduces the size of the page tables and the number of TLB misses in order to increase the efficiency of memory accesses.
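As a rough worked example of what this reservation costs in memory, the sketch below multiplies the page count by the huge page size. It assumes the common x86-64 default of 2048 kB per huge page; the actual value should be read from the Hugepagesize line of /proc/meminfo. The function names are hypothetical.

```python
def hugepage_size_kib(meminfo_text: str) -> int:
    """Extract the Hugepagesize value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("Hugepagesize:"):
            return int(line.split()[1])
    raise ValueError("Hugepagesize not found")


def hugepage_pool_bytes(nr_hugepages: int, size_kib: int = 2048) -> int:
    """Memory reserved by the huge page pool; 2048 kB per page is an assumed default."""
    return nr_hugepages * size_kib * 1024


if __name__ == "__main__":
    sample = "MemTotal:       65536000 kB\nHugepagesize:       2048 kB\n"
    size = hugepage_size_kib(sample)
    print(hugepage_pool_bytes(4096, size) / 1024**3, "GiB")
```

Under that assumption, 4096 huge pages reserve 8 GiB of RAM that is no longer available for ordinary allocations.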

net.ipv4.tcp_timestamps=0, net.ipv4.tcp_sack=0, and net.ipv4.tcp_ecn=0: These parameters turn off TCP timestamps, SACK (Selective Acknowledgement), and ECN (Explicit Congestion Notification). These are TCP optimization techniques, and turning them off reduces some of the network overhead.

net.core.rmem_max=16777216, net.core.wmem_max=16777216, net.core.rmem_default=16777216, net.core.wmem_default=16777216 and net.core.optmem_max=16777216: These parameters set the maximum and default sizes of the socket receive and transmit buffers, and the maximum option memory for a socket.

net.ipv4.tcp_mem=16777216 16777216 16777216, net.ipv4.tcp_rmem=32768 436600 16777216, and net.ipv4.tcp_wmem=8192 436600 16777216: These parameters set the overall TCP memory usage limits and the minimum, default, and maximum sizes of the TCP receive and transmit buffers. Note that tcp_mem is measured in pages, while tcp_rmem and tcp_wmem are measured in bytes.

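net.ipv4.tcp_low_latency=1: This parameter enables TCP’s low latency mode. In this mode, TCP minimizes latency instead of maximizing throughput. The kernel documentation describes this as a legacy option that no longer has any effect on recent kernels, so it is likely harmless but inactive here.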

cpu parameters

governor=performance: This parameter sets the CPU’s frequency scaling governor to performance; in this mode, the CPU will try to operate at the highest possible frequency for best performance.

energy_perf_bias=performance: This parameter sets the CPU’s energy performance preference to performance; in this mode, the CPU prioritizes performance over energy consumption.

Both of these parameters are set to increase the performance of the CPU in order to reduce the processing time of soft interrupts. A soft interrupt (softirq) is deferred work that the kernel raises, typically to finish processing that a hardware interrupt started, and it still has to be handled by a CPU. The higher the CPU performance, the shorter the processing time for soft interrupts, which should reduce the disturbance they cause on cores 1-15.

script.sh script

The commands in start() are executed automatically each time the tuned service is started.

#!/usr/bin/sh

. /usr/lib/tuned/functions

start() {
  # Point every numbered IRQ at CPU 0 (mask 1)
  for irq in /proc/irq/[0-9]*
    do echo 1 > "$irq/smp_affinity"
  done
  # Cycle CPUs 1-15 offline and back online
  for cpu in $(seq 1 1 15)
    do echo 0 > /sys/devices/system/cpu/cpu${cpu}/online
  done
  for cpu in $(seq 1 1 15)
    do echo 1 > /sys/devices/system/cpu/cpu${cpu}/online
  done

  #Shut down services, whether loaded or unloaded.
  systemctl stop cpupower
  systemctl stop irqbalance
  systemctl stop firewalld
  systemctl stop cpuspeed
  systemctl stop cpufreqd
  systemctl stop powerd

  # Disable periodic machine check polling on CPUs 1-15
  for cpu in $(seq 1 1 15)
    do echo 0 > /sys/devices/system/machinecheck/machinecheck${cpu}/check_interval
  done

  ethtool -C enp1s0f0 rx-usecs 0 adaptive-rx off
  ethtool -C enp1s0f1 rx-usecs 0 adaptive-rx off
  ethtool -C enp2s0f0np0 rx-usecs 0 adaptive-rx off
  ethtool -C enp2s0f1np1 rx-usecs 0 adaptive-rx off

  echo 8192 > /sys/kernel/debug/tracing/buffer_size_kb

  return "$?"
}

stop() {
  return "$?"
}

process $@

Experiment Methods

systemctl status tuned

# If inactive:
systemctl restart tuned

cat /proc/cmdline

# tuned self-test:
tuned-adm active
tuned-adm verify
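
In addition to tuned-adm verify, the boot parameters can be checked directly against the text of /proc/cmdline. The following is a minimal sketch; the function name and the EXPECTED list are hypothetical and should be adapted to the profile in use.

```python
# Parameters from the [bootloader] section that we expect to see after reboot.
# This subset is illustrative; extend it to match your own cmdline.
EXPECTED = [
    "idle=poll",
    "isolcpus=1-15",
    "rcu_nocbs=1-15",
    "nohz_full=1-15",
    "irqaffinity=0",
    "tsc=reliable",
]


def missing_cmdline_args(cmdline: str, expected: list[str]) -> list[str]:
    """Return the expected parameters that do not appear in the kernel command line."""
    present = set(cmdline.split())
    return [arg for arg in expected if arg not in present]


def read_proc_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Calling missing_cmdline_args(read_proc_cmdline(), EXPECTED) on the tuned machine should return an empty list; anything it returns was not applied, usually because the machine has not been rebooted since the profile changed.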

First, increase the size of the tracing buffer. This is important: otherwise the buffer will not be large enough, and during a long run older data will be overwritten and not captured.

echo 8192 | sudo tee /sys/kernel/debug/tracing/buffer_size_kb

Turn tracing off, clear the tracing data, and turn tracing on again:

TRACING=/sys/kernel/debug/tracing/

Then:

# Make sure tracing is off for now
echo 0 > $TRACING/tracing_on

# Flush previous traces
echo > $TRACING/trace

# Record disturbance from other tasks
echo 1 > $TRACING/events/sched/sched_switch/enable

# Record disturbance from interrupts
echo 1 > $TRACING/events/irq_vectors/enable

# Now we can start tracing
echo 1 > $TRACING/tracing_on

Then run sysjitter while tracing is on:

cd cns-sysjitter-master
./sysjitter --runtime 60 --cores 1-15 100

Then:

# Disable tracing again
echo 0 > $TRACING/tracing_on

# Save the tracing data from the desired core (here CPU 7) in a file
cat $TRACING/per_cpu/cpu7/trace > trace.7

Finally, analyze the results:

scp [email protected]:/root/cns-sysjitter-master/trace.7 .
python analyze.py --start_time 3266.768148209 --end_time 3326.768156395 --filename trace.7 --verbose
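
The analyze.py script itself is not shown here. As a rough idea of what such an analysis involves, the following hypothetical sketch (not the author's script) counts trace events by name inside a time window, assuming the default ftrace text layout of "TASK-PID [CPU] FLAGS TIMESTAMP: EVENT: details".

```python
import re
from collections import Counter

# Assumed default ftrace line layout; header lines starting with "#" do not match.
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):"
)


def count_events(lines, start_time: float, end_time: float) -> Counter:
    """Count events per name whose timestamp lies within [start_time, end_time]."""
    counts = Counter()
    for line in lines:
        match = TRACE_LINE.match(line)
        if match is None:
            continue  # comment header or a line in an unexpected format
        if start_time <= float(match.group("ts")) <= end_time:
            counts[match.group("event")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "# tracer: nop",
        "          <idle>-0       [007] d.h1.  3266.800000: local_timer_entry: vector=236",
        "       sysjitter-4242    [007] d..2.  3270.000000: sched_switch: prev_comm=sysjitter",
    ]
    print(count_events(sample, 3266.768148209, 3326.768156395))
```

The window bounds correspond to the start and end of the sysjitter run, so that only disturbances occurring during the measurement are counted.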

How sysjitter works

Main function: The main function of the program, main(), first parses the command-line arguments, then initializes some variables and data structures, then creates a thread on each CPU. Each thread measures the jitter on its CPU, and the results are written to a file.

The main workflow of this program is to first get the current CPU frequency and time, and then continuously read the CPU cycle counter in a loop. By comparing the cycle counts from two consecutive reads, the jitter of the CPU during that interval can be calculated. This process is repeated on each CPU to obtain the jitter of all CPUs in the system.
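
The core idea can be sketched in a few lines: poll a clock in a tight loop and report every pair of consecutive readings that are further apart than a threshold. This is only an illustration of the technique and not sysjitter's code; it uses Python's time.perf_counter_ns() instead of the CPU cycle counter, and it assumes that the trailing 100 on the sysjitter command line is a threshold in nanoseconds.

```python
import time


def find_gaps(timestamps_ns, threshold_ns):
    """Return (start, duration) for consecutive readings spaced more than threshold_ns apart."""
    gaps = []
    for prev, cur in zip(timestamps_ns, timestamps_ns[1:]):
        delta = cur - prev
        if delta > threshold_ns:
            gaps.append((prev, delta))
    return gaps


def sample_clock(duration_s):
    """Busy-poll the monotonic clock for duration_s seconds and return every reading."""
    samples = [time.perf_counter_ns()]
    deadline = samples[0] + int(duration_s * 1e9)
    while samples[-1] < deadline:
        samples.append(time.perf_counter_ns())
    return samples


if __name__ == "__main__":
    readings = sample_clock(0.01)
    # The interpreter is far slower than a C loop reading the TSC, so a much
    # larger threshold than 100 ns is needed for this to be meaningful.
    for start, duration in find_gaps(readings, 50_000):
        print(f"gap of {duration} ns starting at {start}")
```

Each reported gap is a period during which the polling thread was not running, for example because an interrupt or another task took over the core.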

Conclusion

Running at different times or continuously with the same configuration can sometimes result in large differences in soft interrupts.

Most of the following event types can be seen in the trace data, except when the CPU is idle.

call_function_single_entry and call_function_entry: These two events are related to SMP (symmetric multiprocessing) function calls, in which one processor asks another processor to run a function. The target CPU runs the function in interrupt context when the inter-processor interrupt (IPI) arrives.

irq_work_entry: IRQ work is a mechanism for queuing a callback to run in interrupt context shortly afterwards. It is typically used by code that cannot do the work directly in its current context.

sched_switch: This is a scheduling event indicating that the kernel is switching to a new task. This can happen for a number of reasons, including the current task exhausting its time slice, or a higher priority task becoming runnable.

local_timer_entry: This event indicates the start of a local timer interrupt. Timer interrupts are central to the operating system and are used for tracking time, scheduling tasks, and so on.

reschedule_entry: This event indicates a reschedule interrupt, in which another CPU asks this CPU to run the scheduler. This may be due to the current task being blocked, or a higher priority task becoming runnable.

It is useful to add tsc=reliable.

Adding the sysctl and sysfs commands in cpu-partitioning.conf does not have a significant effect.

With idle=poll added, the CPU never enters the idle state. With idle=poll removed, the LOC (local timer interrupt) count does not increase while the CPU is idle.

We need to add idle=poll and can remove nohz=on. The former keeps the CPU from entering idle, while the latter only reduces the tick to once a second after the CPU enters idle, which is not useful if the former is already in place. So we only need to add idle=poll, not nohz=on.
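
The LOC counters mentioned above come from the LOC row of /proc/interrupts. A small sketch for comparing two snapshots of that file per CPU is shown below; the function names are hypothetical.

```python
def local_timer_counts(interrupts_text: str) -> list[int]:
    """Return the per-CPU counters from the LOC row of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.partition(":")
        if label.strip() == "LOC":
            # Trailing words such as "Local timer interrupts" are skipped.
            return [int(token) for token in rest.split() if token.isdigit()]
    raise ValueError("no LOC row found")


def loc_delta(before: str, after: str) -> list[int]:
    """Per-CPU increase in local timer interrupts between two snapshots."""
    old = local_timer_counts(before)
    new = local_timer_counts(after)
    return [b - a for a, b in zip(old, new)]


if __name__ == "__main__":
    before = "           CPU0       CPU1\n LOC:       1000        200   Local timer interrupts\n"
    after = "           CPU0       CPU1\n LOC:       1950        203   Local timer interrupts\n"
    print(loc_delta(before, after))
```

Taking one snapshot before and one after a test run shows how many timer ticks each isolated core received during the run.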

Explanation and summary

The nohz=on option enables tickless idle mode, which disables timer interrupts when the system is idle. Tickless idle mode is designed to save energy and may not be necessary for us.

nohz_full=${cpus} enables full dynticks mode, which allows the kernel to reduce the frequency of timer ticks on those CPUs to once every 1s (or once every 2-3 seconds in practice) when the CPU is idle or running a single task. This reduces CPU interference and improves overall system performance. The remaining 1s tick is offloaded to the rest of the CPUs. tsc=reliable needs to be added manually on AMD processors and on kernels prior to version 5.16.

For tsc=reliable: the kernel can use the TSC as its primary time source if the TSC is synchronized across all processors. The TSC is a counter that is incremented every CPU cycle, so it can provide very high-precision time information. However, if the TSC is not synchronized between processors, or stops counting when a processor enters a low-power state, then using the TSC may cause problems. tsc=reliable tells the kernel that the TSC is reliable and can be used as a time source.