Speech Script — eBPF Memory Leak Detection (Slides 27–33, ~5 minutes)
Slide 27 — Title: Memory Leak (~30 sec)
Hi everyone, I'll be presenting the memory leak detection component of our eBPF project.
Memory leaks, that is, allocations that are never freed, are a critical issue in long-running services. Over time they cause growing memory pressure, degraded performance, and eventually out-of-memory (OOM) failures.
Traditional detection tools impose too much overhead to run in production, which motivated our adoption of eBPF for this task. Let's first look at how eBPF compares with these traditional tools. Next slide, please.
Slide 28 — eBPF vs Traditional Tools (~90 sec)
This table compares eBPF against two established tools: Valgrind and AddressSanitizer.
Valgrind introduces roughly 10 to 20x performance overhead. It does not require recompilation, but the target process has to be relaunched under Valgrind, so it is unsuitable for production use.
AddressSanitizer reduces the overhead to roughly 2 to 3x, but it requires recompiling with instrumentation flags such as -fsanitize=address. It also requires a relaunch and is not practical in production.
eBPF operates at less than 5% overhead. It requires no code changes, no recompilation, and no process restart: it attaches dynamically to a running process.
Additionally, it supports kernel-space allocation tracing through kmalloc and kfree, which neither Valgrind nor ASan can do.
The tradeoffs: our eBPF approach only detects unfreed allocations. It cannot identify use-after-free, buffer overflows, or double-free issues. It requires Linux kernel 4.9 or above and root privileges; on kernel 5.8 and later, the CAP_BPF capability, together with CAP_PERFMON for tracing, can be granted instead of full root.
However, for continuous leak monitoring in production environments, these tradeoffs are acceptable, and eBPF is the appropriate tool. Now let's look at what data we actually capture. Next slide, please.
Slide 29 — What We Capture (~60 sec)
Our implementation hooks into memory allocation functions at two levels.
In user space, we use uprobes to intercept malloc, calloc, realloc, and free. In kernel space, we use kprobes to intercept kmalloc, kfree, and kmem_cache_alloc.
For each allocation event, we record the memory address and allocation size, timestamps for both allocation and free, the full stack trace, and the PID and TID.
When free is called, we match the freed address against the corresponding allocation record. Any allocation without a matching free is flagged as a potential leak.
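Backup note for Q&A: the matching step can be illustrated with a minimal user-space sketch in Python. This is a simplified model, not our actual eBPF program; the names `Allocation` and `LeakTracker` are hypothetical, and the real tool keeps this table in a BPF map keyed by address.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """One recorded allocation event (field names are illustrative)."""
    address: int
    size: int
    timestamp_ns: int
    stack: tuple
    pid: int
    tid: int


class LeakTracker:
    """Tracks outstanding allocations keyed by address."""

    def __init__(self):
        self.outstanding = {}

    def on_alloc(self, alloc):
        # A later allocation at the same address replaces the older record.
        self.outstanding[alloc.address] = alloc

    def on_free(self, address):
        # Remove the matching record; return None if the address is unknown
        # (for example, it was allocated before tracing started).
        return self.outstanding.pop(address, None)

    def potential_leaks(self):
        # Whatever is still outstanding has no matching free yet.
        return list(self.outstanding.values())


tracker = LeakTracker()
tracker.on_alloc(Allocation(0x1000, 64, 100, ("main", "load_config"), 42, 42))
tracker.on_alloc(Allocation(0x2000, 128, 200, ("main", "handle_request"), 42, 43))
tracker.on_free(0x1000)
leaks = tracker.potential_leaks()
```

After these calls, only the 128-byte allocation at 0x2000 remains outstanding, so it is the one reported as a potential leak.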
The output provides two views: a timeline of outstanding unfreed allocations, and a list of the top leaking call stacks ranked by total leaked bytes.
This allows developers to directly identify the code paths responsible for leaks. Let's see this in action with kernel-mode tracing. Next slide, please.
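Backup note for Q&A: the ranked view of leaking call stacks can be sketched as a simple aggregation. This is an illustrative Python model under the assumption that each outstanding allocation is available as a `(stack, size)` pair; the function name `top_leaking_stacks` is hypothetical.

```python
from collections import defaultdict


def top_leaking_stacks(outstanding, limit=10):
    """Group outstanding allocations by call stack and rank by total bytes.

    outstanding: iterable of (stack, size) pairs, where stack is a tuple
    of frame names. Returns a list of (stack, total_bytes, count) tuples,
    largest total first.
    """
    totals = defaultdict(lambda: [0, 0])
    for stack, size in outstanding:
        totals[stack][0] += size
        totals[stack][1] += 1
    ranked = sorted(
        ((stack, total, count) for stack, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:limit]


sample = [
    (("main", "parse"), 64),
    (("main", "cache_put"), 4096),
    (("main", "parse"), 64),
    (("main", "cache_put"), 4096),
]
ranking = top_leaking_stacks(sample)
```

Ranking by total bytes rather than by count keeps a few large leaks from being hidden behind many small ones.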
Slide 30 — Trace Allocations in Kernel Mode (~30 sec)
This slide demonstrates kernel-mode tracing in practice. We capture kmalloc calls along with their allocation sizes and complete kernel stack traces.
This capability enables detection of memory leaks in kernel modules, device drivers, and kernel subsystems, areas that are entirely invisible to user-space tools like Valgrind and ASan.
Now, raw terminal output is useful for debugging, but we also need a way to share findings with the team. So we built a visualization layer. Next slide, please.
Slide 31 — HTML Visual Report (~45 sec)
Beyond raw log output, we developed a visualization tool that parses memleak logs and generates a self-contained HTML report.
The report has three components: trend charts showing memory usage and allocation count over time, automated detection logic that flags continuously growing patterns as suspected leaks, and a detailed table with expandable stack traces.
This lets engineers go from spotting a leak to locating the responsible code path in seconds. Let me show you the actual report output. Next slide, please.
Slide 32 — Report Demonstration 1 (~25 sec)
This slide shows the actual HTML report output. The trend charts show memory growth patterns over time, making it easy to spot abnormal accumulation at a glance.
The alert section highlights allocations that grew consistently across all captured snapshots, which is a strong indicator of a genuine leak. Next slide, please.
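Backup note for Q&A: the alert rule, growth across every captured snapshot, can be sketched as follows. This is a minimal Python illustration and not the report's actual code; the function name `suspected_leaks`, the snapshot layout, and the `min_snapshots` threshold are all assumptions made for the example.

```python
def suspected_leaks(snapshots, min_snapshots=3):
    """Flag stacks whose outstanding bytes grew in every snapshot.

    snapshots: list of dicts mapping a stack identifier to outstanding
    bytes, ordered oldest first. Returns (stack, growth_in_bytes) pairs,
    largest growth first.
    """
    if len(snapshots) < min_snapshots:
        return []  # too few data points to call anything a trend
    suspects = []
    for stack in snapshots[-1]:
        series = [snap.get(stack) for snap in snapshots]
        if None in series:
            continue  # stack missing from some snapshot: not "all snapshots"
        if all(later > earlier for earlier, later in zip(series, series[1:])):
            suspects.append((stack, series[-1] - series[0]))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


history = [
    {"cache_put": 100, "parse": 50},
    {"cache_put": 200, "parse": 50},
    {"cache_put": 300, "parse": 40},
]
alerts = suspected_leaks(history)
```

Here only `cache_put` grows in every snapshot, so it is the only stack flagged; `parse` stays flat and then shrinks, which looks like normal churn.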
Slide 33 — Report Demonstration 2 (~20 sec)
This slide shows the detail table, which supports sorting by severity and expanding individual entries to examine full stack traces.
This reporting layer turns raw detection data into an actionable diagnostic artifact that can be shared across the team.
To summarize: we implemented a production-ready memory leak detection solution using eBPF, with zero code modifications, less than 5% overhead, coverage of both user space and kernel space, and automated visualization for efficient root-cause analysis. That's all from me, thank you.
Timing Guide
| Slide | Content | Time |
|---|---|---|
| 27 | Memory Leak | 30s |
| 28 | eBPF vs Traditional Tools | 90s |
| 29 | What We Capture | 60s |
| 30 | Trace Allocations in Kernel Mode | 30s |
| 31 | HTML Visual Report | 45s |
| 32 | Report Demonstration 1 | 25s |
| 33 | Report Demonstration 2 + Closing | 20s |