This is part three in a series where we examine “gaps” in our distributed traces.
In this final part, we’ll talk about monitoring improvements we’ve made to mitigate syscall delays and other risks associated with Linux Audit.
What is Linux Audit? #
Linux Audit generates audit records in response to syscalls. These records may contain security-relevant information, for example about privilege escalation attempts or suspicious network activity. When syscalls happen — like the
accept4 syscall in diagram 1 below — Linux enqueues audit records onto an internal backlog for asynchronous processing by userspace consumers like
Options, risks and signals #
On high throughput systems, Linux may produce audit records faster than userspace consumers can handle. Depending on how Linux audit is configured, this can lead to memory exhaustion, lost audit records, or syscall delays.
There are a number of Linux Audit configuration options, but the two we’ll look at are
backlog_wait_time. These options govern, respectively, the maximum size of the backlog, and how long Linux will wait for a full backlog to drain before dropping (a.k.a. “losing”) a record.
These options can be adjusted to make risk tradeoffs. Disabling
backlog_limit or setting it to a very high value reduces the risk of backlog overflows, but increases the risk of memory exhaustion. Decreasing
backlog_wait_time can reduce the duration of delays induced by Linux audit waiting for the backlog to drain, but can also increase the risk of dropped audit records.
Depending on the risk profile of your Linux Audit configuration, you may want to prioritize monitoring some signals over others. For example, if the
backlog_limit is non-zero, keep an eye on backlog utilization to mitigate the risks associated with backlog overflow. On the other hand, if it is set to zero, then pay closer attention to system memory utilization.
Assume for a moment that there is a high throughput Linux machine where the userspace audit record consumer is periodically struggling to keep pace with the rate of record production.
The table below shows the severity of risks and the importance of signals relative to different Linux Audit configurations that could be set on that machine. Actual risk severity and signal importance depends on factors such as machine hardware and workload characteristics.
|Linux Audit configuration options|
|Relative risk severity|
|Memory exhaustion (from audit records)||High||Low||Low|
|Lost audit records||High||Medium||Low|
|Syscall delays (induced by Linux Audit)||None||None||High|
|Relative signal importance|
Metrics and measurement tools #
At TpT, our
backlog_limit is set to 5000, and our
backlog_wait_time is set to 15000. We wrote previously about how these settings, combined with high throughput and insufficient memory allocation for our userspace audit record processor, led to backlog overflow and syscall delays. Given our configuration and risk profile, it makes sense for us to prioritize monitoring backlog utilization, backlog waiting, and lost audit records.
The most natural source of these signals is the status object. Along with Linux audit configuration option values, the kernel maintains internal counters for backlog size and lost records, which it exposes through the status object.
Collecting status metrics #
The status object can be retrieved with
auditctl -s or via a Netlink socket.
$ sudo auditctl -s | grep backlog\|lost | sort backlog 0 backlog_limit 5000 backlog_wait_time 15000 lost 0
We expose these metrics with linux-audit-exporter, a simple Prometheus exporter we created to make it easier to work with Linux Audit status. We use the bundled Helm chart to deploy it to all of our EKS nodes, and then use DataDog to collect the output. From there, we use DataDog to monitor backlog utilization (
100 * (backlog / backlog_limit)) and dropped records (
Backlog waiting #
This is a good start, but incomplete. It’s possible for the backlog to overflow, the kernel to induce delays in syscalls, and then for the backlog to drain while syscalls are delayed. If this happens between metric collection passes, then we won’t see spikes in backlog utilization, and will be left wondering why our application performance is degraded.
That possibility is why it’s important to also monitor backlog waiting. We contributed the
backlog_wait_time_actual counter to Linux, which reports the amount of time Linux Audit spends waiting for the backlog to drain. This counter makes it possible to monitor backlog waiting through the Linux Audit status object.
audit_log_start and eBPF #
Unfortunately, it isn’t available until Linux 5.9, and (at the time of this writing) we’re using AWS stock images running 4.14. Running a user-provided kernel or non-stock EKS AMIs involves more operational responsibility than we care to assume.
So, we looked for another way to gather information about backlog waiting.
audit_log_start is the internal kernel function that puts syscalls to sleep when the backlog is full. If this function is slow, there’s a good chance that it’s because the kernel has put the function caller to sleep in order to wait for the backlog to drain.
audit_log_start latency is a reasonable substitute signal for
What is eBPF? #
eBPF is a powerful framework for running safe and performant code inside the kernel. Unlike userspace programs, eBPF programs run inside the kernel, meaning that they can observe kernel behavior without expensive context switches from userspace to kernel mode. Unlike kernel modules, eBPF programs are checked at compile-time by a verifier that forbids unbounded loops or attempts to access uninitialized variables. Additionally, eBPF programs run inside an abstracted runtime context, which requires eBPF programs to access kernel data through safe eBPF helper functions.
These constraints provide a very high degree of safety, at the cost of flexibility. In spite of these constraints, it’s possible to build useful networking, security, and observability tools with eBPF. Check out Cilium, Falco, and Katran.
audit_log_start latency by supplying this program to ebpf_exporter, which compiles the program and loads it into the kernel’s eBPF virtual machine. We deploy the program and ebpf_exporter together to all of our EKS nodes with this Helm chart.
Once loaded into the kernel, our program is called whenever a call to
audit_log_start starts and ends. Our program uses the start and end time to record the call latency in a histogram. These histograms are exported as Prometheus metrics by ebpf_exporter, collected and monitored by DataDog.
audit_log_start latency as a substitute for
backlog_wait_time_actual does have some drawbacks.
- Imperfect signal.
audit_log_startlatency is not a direct measurement of backlog waiting behavior. Conceivably, latency could spike for reasons other than backlog overflow. It may help to also measure
schedule_timeout, and correlate changes to those two measurements.
audit_log_startis an internal kernel function, and is not a part of the Linux API.
audit_log_startis stable through kernel version 5.9, at which point
backlog_wait_time_actualcan be used instead.
The table below will hopefully guide your selection of tools to measure signals relevant to the risk profile of your Linux Audit configuration.
|Memory exhaustion||>= 2.6||Memory utilization||Any Linux monitoring system|
|Lost records||>= 2.6||Backlog utilization||linux-audit-exporter|
|Syscall delays||>= 4.4, < 5.9||ebpf_exporter|
|Syscall delays||>= 5.9||linux-audit-exporter|