Monitoring Linux Audit

This is part three in a series where we examine “gaps” in our distributed traces.

Previously, we wrote about encountering these gaps and about using ftrace to trace them to syscall delays induced by Linux Audit.

In this final part, we’ll talk about monitoring improvements we’ve made to mitigate syscall delays and other risks associated with Linux Audit.

What is Linux Audit? #

Linux Audit generates audit records in response to syscalls. These records may contain security-relevant information, for example about privilege escalation attempts or suspicious network activity. When syscalls happen — like the accept4 syscall in diagram 1 below — Linux enqueues audit records onto an internal backlog for asynchronous processing by userspace consumers like auditd.

Diagram 1 — Linux Audit enqueues audit records about syscalls into backlog for consumption by userspace consumers like auditd.
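
Which syscalls produce records depends on the audit rules loaded on the machine. For example, a rule like the following (the key name is purely illustrative) asks Linux Audit to record every accept4 call made by 64-bit programs:

$ sudo auditctl -a always,exit -F arch=b64 -S accept4 -k network-accept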

Options, risks and signals #

On high-throughput systems, Linux may produce audit records faster than userspace consumers can handle them. Depending on how Linux Audit is configured, this can lead to memory exhaustion, lost audit records, or syscall delays.

Options #

There are a number of Linux Audit configuration options, but the two we’ll look at are backlog_limit and backlog_wait_time. These options govern, respectively, the maximum size of the backlog, and how long Linux will wait for a full backlog to drain before dropping (a.k.a. “losing”) a record.
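
Both options are set with auditctl (and are usually persisted in the audit rules file); the values below are illustrative, not recommendations:

$ sudo auditctl -b 8192                    # backlog_limit: allow up to 8192 queued records
$ sudo auditctl --backlog_wait_time 60000  # backlog_wait_time: how long to wait on a full backlog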

Risks #

These options can be adjusted to make risk tradeoffs. Disabling backlog_limit or setting it to a very high value reduces the risk of backlog overflows, but increases the risk of memory exhaustion. Decreasing backlog_wait_time can reduce the duration of delays induced by Linux Audit waiting for the backlog to drain, but can also increase the risk of dropped audit records.

Signals #

Depending on the risk profile of your Linux Audit configuration, you may want to prioritize monitoring some signals over others. For example, if the backlog_limit is non-zero, keep an eye on backlog utilization to mitigate the risks associated with backlog overflow. On the other hand, if it is set to zero, then pay closer attention to system memory utilization.

Guidance #

Assume for a moment that there is a high-throughput Linux machine whose userspace audit record consumer periodically struggles to keep pace with the rate of record production.

The table below shows the severity of risks and the importance of signals relative to different Linux Audit configurations that could be set on that machine. Actual risk severity and signal importance depend on factors such as machine hardware and workload characteristics.

                                           Config A  Config B  Config C
Linux Audit configuration options
  backlog_limit                            0         64        64
  backlog_wait_time                        0         0         60000
Relative risk severity
  Memory exhaustion (from audit records)   High      Low       Low
  Lost audit records                       High      Medium    Low
  Syscall delays (induced by Linux Audit)  None      None      High
Relative signal importance
  Memory utilization                       High      Low       Low
  Backlog utilization                      None      High      Medium
  Backlog waiting                          None      None      High

Metrics and measurement tools #

At TpT, our backlog_limit is set to 5000, and our backlog_wait_time is set to 15000. We wrote previously about how these settings, combined with high throughput and insufficient memory allocation for our userspace audit record processor, led to backlog overflow and syscall delays. Given our configuration and risk profile, it makes sense for us to prioritize monitoring backlog utilization, backlog waiting, and lost audit records.

The most natural source of these signals is the Linux Audit status object. Along with Linux Audit configuration option values, the kernel maintains internal counters for backlog size and lost records, which it exposes through the status object.

Collecting status metrics #

The status object can be retrieved with auditctl -s or via a Netlink socket.

$ sudo auditctl -s | grep -E 'backlog|lost' | sort
backlog 0
backlog_limit 5000
backlog_wait_time 15000
lost 0
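
For programmatic collection, the same status can be requested over a Netlink socket. The sketch below uses libaudit (the userspace library that ships with auditd) to send an AUDIT_GET request and print a few of the returned counters; it is a minimal illustration, not necessarily how linux-audit-exporter implements it, and assumes the libaudit development headers are installed.

/* Minimal sketch: fetch Linux Audit status over Netlink via libaudit.
 * Build with: gcc -o audit_status audit_status.c -laudit
 * Needs CAP_AUDIT_CONTROL (e.g. run as root). */
#include <stdio.h>
#include <libaudit.h>

int main(void)
{
    int fd = audit_open();                  /* opens a NETLINK_AUDIT socket */
    if (fd < 0) {
        perror("audit_open");
        return 1;
    }

    if (audit_request_status(fd) <= 0) {    /* sends an AUDIT_GET request */
        perror("audit_request_status");
        audit_close(fd);
        return 1;
    }

    struct audit_reply reply;
    if (audit_get_reply(fd, &reply, GET_REPLY_BLOCKING, 0) > 0 &&
        reply.type == AUDIT_GET) {
        printf("backlog        %u\n", reply.status->backlog);
        printf("backlog_limit  %u\n", reply.status->backlog_limit);
        printf("lost           %u\n", reply.status->lost);
    }

    audit_close(fd);
    return 0;
}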

We expose these metrics with linux-audit-exporter, a simple Prometheus exporter we created to make it easier to work with Linux Audit status. We use the bundled Helm chart to deploy it to all of our EKS nodes, and collect the output with DataDog. From there, we monitor backlog utilization (100 * (backlog / backlog_limit)) and dropped records (lost).

Backlog waiting #

This is a good start, but incomplete. It’s possible for the backlog to overflow, the kernel to induce delays in syscalls, and then for the backlog to drain while syscalls are delayed. If this happens between metric collection passes, then we won’t see spikes in backlog utilization, and will be left wondering why our application performance is degraded.

Diagram 2 — how backlog overflow and waiting can escape detection between monitoring collection passes.

That possibility is why it’s important to also monitor backlog waiting. We contributed the backlog_wait_time_actual counter to Linux, which reports the amount of time Linux Audit spends waiting for the backlog to drain. This counter makes it possible to monitor backlog waiting through the Linux Audit status object.
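
On kernels 5.9 and newer with a recent audit userspace, the counter shows up alongside the other status fields (exact output varies by auditctl version):

$ sudo auditctl -s | grep backlog_wait_time_actual
backlog_wait_time_actual 0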

audit_log_start and eBPF #

Unfortunately, backlog_wait_time_actual isn’t available until Linux 5.9, and (at the time of this writing) we’re using stock AWS images running kernel 4.14. Running a user-provided kernel or non-stock EKS AMIs involves more operational responsibility than we care to assume.

So, we looked for another way to gather information about backlog waiting.

What is audit_log_start? #

audit_log_start is the internal kernel function that puts syscalls to sleep when the backlog is full. If this function is slow, there’s a good chance that it’s because the kernel has put the function caller to sleep in order to wait for the backlog to drain.

While imperfect, audit_log_start latency is a reasonable substitute signal for backlog_wait_time_actual.

What is eBPF? #

eBPF is a powerful framework for running safe and performant code inside the kernel. Unlike userspace programs, eBPF programs run inside the kernel, meaning that they can observe kernel behavior without expensive context switches between userspace and kernel mode. Unlike kernel modules, eBPF programs are checked at load time by an in-kernel verifier that forbids unbounded loops and attempts to access uninitialized memory. Additionally, eBPF programs run inside an abstracted runtime context and access kernel data through safe eBPF helper functions.

These constraints provide a very high degree of safety, at the cost of flexibility. In spite of these constraints, it’s possible to build useful networking, security, and observability tools with eBPF. Check out Cilium, Falco, and Katran.

ebpf_exporter #

We collect audit_log_start latency by supplying this program to ebpf_exporter, which compiles the program and loads it into the kernel’s eBPF virtual machine. We deploy the program and ebpf_exporter together to all of our EKS nodes with this Helm chart.

Diagram 3 — eBPF program measures audit_log_start latency, and ebpf_exporter exposes histograms in Prometheus format.

Once loaded into the kernel, our program runs whenever audit_log_start is entered and whenever it returns. It uses the entry and return timestamps to record the call latency in a histogram. ebpf_exporter exports these histograms as Prometheus metrics, which DataDog then collects and monitors.
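
To make that concrete, here is a minimal BCC-style sketch of such a measurement. It is an illustration, not the exact program we deploy; the kprobe/kretprobe attachment to audit_log_start and the ebpf_exporter configuration that reads the histogram table are omitted.

// Sketch: measure audit_log_start latency with a kprobe/kretprobe pair.
#include <uapi/linux/ptrace.h>

BPF_HASH(start_ns, u64);                 // entry timestamp, keyed by pid_tgid
BPF_HISTOGRAM(audit_log_start_latency);  // log2 histogram of latency in microseconds

// Attach as a kprobe on audit_log_start: record when the call started.
int trace_entry(struct pt_regs *ctx)
{
    u64 id = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start_ns.update(&id, &ts);
    return 0;
}

// Attach as a kretprobe on audit_log_start: compute and record the latency.
int trace_return(struct pt_regs *ctx)
{
    u64 id = bpf_get_current_pid_tgid();
    u64 *tsp = start_ns.lookup(&id);
    if (tsp == 0)
        return 0;                        // missed the entry probe

    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    audit_log_start_latency.increment(bpf_log2l(delta_us));
    start_ns.delete(&id);
    return 0;
}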

Drawbacks #

Using audit_log_start latency as a substitute for backlog_wait_time_actual does have some drawbacks.

  1. Imperfect signal. audit_log_start latency is not a direct measurement of backlog waiting behavior. Conceivably, latency could spike for reasons other than backlog overflow. It may help to also measure schedule_timeout and correlate changes across the two measurements.
  2. Maintainability. audit_log_start is an internal kernel function and is not part of the stable Linux API. That said, it is stable through kernel version 5.9, at which point backlog_wait_time_actual can be used instead.

Guidance #

The table below will hopefully guide your selection of tools to measure signals relevant to the risk profile of your Linux Audit configuration.

Risk               Kernel         Signal                    Tool
Memory exhaustion  >= 2.6         Memory utilization        Any Linux monitoring system
Lost records       >= 2.6         Backlog utilization       linux-audit-exporter
Syscall delays     >= 4.4, < 5.9  audit_log_start           ebpf_exporter
Syscall delays     >= 5.9         backlog_wait_time_actual  linux-audit-exporter