Runtime Instrumentation Performance: Benchmark Results Across 12 Language Runtimes

The most common objection we hear from engineering teams evaluating runtime instrumentation is overhead: "We can't afford to add latency to our request path." It's a reasonable concern, and one we take seriously enough to have spent considerable engineering time measuring it. This post presents the full benchmark data from our internal testing across 12 language runtimes. We're sharing it because we've seen other vendors present overhead numbers without methodology context, which makes meaningful evaluation impossible. Here's what we measured, how we measured it, and where the variance comes from.

Test Methodology

We ran benchmarks across the following runtimes: Node.js 20 LTS, Python 3.12, Go 1.22, Java 21 (HotSpot JVM), Java 21 (GraalVM native), Ruby 3.3, PHP 8.3, Rust (via WASM runtime, tokio-based), .NET 8 (C#), Elixir/Erlang OTP 26, Scala 3 on JVM, and Kotlin on JVM.

For each runtime, we ran three test applications:

A minimal HTTP server responding to GET requests with a static JSON payload (latency-sensitive, minimal computation)
A data processing application performing in-memory aggregations on a 50MB dataset (CPU-bound, moderate call depth)
A realistic API server with database queries, external HTTP calls (mocked with a local stub), and business logic spanning 15–20 function hops per request

We measured: p50 latency, p99 latency, requests per second at 100% CPU utilization, and resident memory increase. Each test ran for 10 minutes with a 2-minute warmup period excluded from measurements. All tests ran on identical hardware: 4-core x86-64 VMs at 2.6 GHz, 8 GB RAM, running Ubuntu 22.04. The Runtimekindle instrumentation agent was deployed as a sidecar container sharing network namespace with the test application. Instrumentation was configured for call-graph capture with 10-second snapshot intervals — the default production configuration.
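As a rough illustration of how the latency columns can be derived from raw samples, the sketch below drops the warmup window and compares one percentile between a baseline run and an instrumented run. The function names, the (timestamp, latency) sample format, and the nearest-rank percentile method are assumptions made for this example; the published harness may compute these values differently.

```python
import math

WARMUP_SECONDS = 120  # the 2-minute warmup excluded from measurements


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(samples, warmup=WARMUP_SECONDS):
    """Keep only latencies recorded after the warmup window.

    `samples` is a list of (seconds_since_start, latency_ms) pairs
    (an assumed format, chosen for illustration).
    """
    return [latency for t, latency in samples if t >= warmup]


def overhead_pct(baseline, instrumented, pct):
    """Relative change of one latency percentile, in percent."""
    base = percentile(steady_state(baseline), pct)
    inst = percentile(steady_state(instrumented), pct)
    return (inst - base) / base * 100
```

Calling overhead_pct(baseline, instrumented, 50) and overhead_pct(baseline, instrumented, 99) on two runs of the same test application gives figures comparable to the p50 and p99 overhead columns.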

Raw numbers across all 12 runtimes:

Runtime	p50 latency overhead	p99 latency overhead	Throughput reduction	Memory overhead
Node.js 20	+1.4%	+2.1%	-1.6%	+22 MB
Python 3.12	+2.8%	+3.9%	-2.7%	+18 MB
Go 1.22	+0.6%	+1.1%	-0.8%	+11 MB
Java 21 (HotSpot)	+1.9%	+2.6%	-2.1%	+34 MB
Java 21 (GraalVM)	+1.2%	+1.8%	-1.4%	+28 MB
Ruby 3.3	+3.1%	+4.8%	-3.4%	+16 MB
PHP 8.3	+2.2%	+3.0%	-2.4%	+14 MB
Rust (WASM/tokio)	+0.4%	+0.9%	-0.5%	+8 MB
.NET 8 (C#)	+1.6%	+2.3%	-1.8%	+29 MB
Elixir/OTP 26	+2.4%	+3.6%	-2.6%	+20 MB
Scala 3 (JVM)	+2.0%	+2.8%	-2.2%	+36 MB
Kotlin (JVM)	+1.8%	+2.5%	-1.9%	+31 MB

Average across all 12 runtimes: 1.8% p50 latency overhead, 2.6% p99 latency overhead, 1.9% throughput reduction, 22 MB memory overhead. These numbers match the summary in our product documentation. The range is 0.4% to 3.1% for p50, with the outliers being Rust (near-zero overhead due to eBPF-based observation with minimal user-space cost) and Ruby (highest overhead due to the instrumentation approach required for the Ruby VM).
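For readers who want to check the averages, the following sketch recomputes them from the table rows. The dictionary layout and function name are illustrative; the values are copied from the table above.

```python
from statistics import mean

# (p50 overhead %, p99 overhead %, throughput reduction %, memory overhead MB)
RESULTS = {
    "Node.js 20": (1.4, 2.1, 1.6, 22),
    "Python 3.12": (2.8, 3.9, 2.7, 18),
    "Go 1.22": (0.6, 1.1, 0.8, 11),
    "Java 21 (HotSpot)": (1.9, 2.6, 2.1, 34),
    "Java 21 (GraalVM)": (1.2, 1.8, 1.4, 28),
    "Ruby 3.3": (3.1, 4.8, 3.4, 16),
    "PHP 8.3": (2.2, 3.0, 2.4, 14),
    "Rust (WASM/tokio)": (0.4, 0.9, 0.5, 8),
    ".NET 8 (C#)": (1.6, 2.3, 1.8, 29),
    "Elixir/OTP 26": (2.4, 3.6, 2.6, 20),
    "Scala 3 (JVM)": (2.0, 2.8, 2.2, 36),
    "Kotlin (JVM)": (1.8, 2.5, 1.9, 31),
}


def column_means(results):
    """Return the mean of each metric column across all runtimes."""
    return tuple(mean(column) for column in zip(*results.values()))


p50, p99, throughput, memory_mb = column_means(RESULTS)
```

The unrounded means are about 1.78% (p50), 2.62% (p99), 1.95% (throughput), and 22.25 MB (memory), which is consistent with the rounded figures quoted above.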

Why the Variance Exists Across Runtimes

The instrumentation approach differs by runtime, and the approach determines the overhead profile.

For Go and Rust, we use eBPF probes to observe function entry and return events at the kernel level. eBPF runs in a sandboxed kernel context with near-zero user-space cost. The Go and Rust numbers are the best-case scenario for what runtime instrumentation can cost — they're close to the theoretical floor for observability overhead.

For JVM-based runtimes (Java, Scala, Kotlin), we use JVMTI (the JVM Tool Interface) together with bytecode transformation via a Java agent, which relies on the java.lang.instrument API. This is the standard approach for JVM observability (used by APM tools like Datadog and New Relic), and the overhead is comparable to what you'd see from those tools in call-tracing mode. The 34 MB memory overhead for HotSpot JVM reflects the agent's own heap usage plus the call-graph data structures we maintain in memory.

For Python, Ruby, and PHP, the instrumentation approach is language-specific. Python uses sys.settrace combined with eBPF for the C extension layer. Ruby uses TracePoint. PHP uses its built-in tracing extension hooks. These approaches have more per-call overhead than eBPF-only instrumentation, which explains the higher percentages for those three runtimes. We're actively working on eBPF-based approaches for Python and Ruby that should bring overhead below 2% — current designs look promising but aren't ready for production deployment.
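To give a feel for why hook-based tracing costs more per call, here is a minimal sketch of call-edge capture using Python's sys.settrace. It is not the agent's implementation: the function name capture_call_edges and the (caller, callee) edge representation are assumptions for illustration. The point is that the interpreter invokes a Python-level callback on every function call, which is the per-call cost described above.

```python
import sys
from collections import Counter


def capture_call_edges(func, *args, **kwargs):
    """Run func and count the (caller, callee) edges reported by sys.settrace."""
    edges = Counter()

    def tracer(frame, event, arg):
        # The global trace function receives a "call" event for each new frame.
        if event == "call":
            caller = frame.f_back.f_code.co_name if frame.f_back else "<root>"
            edges[(caller, frame.f_code.co_name)] += 1
        return None  # skip line-level tracing inside the callee

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, edges
```

Passing a request handler to capture_call_edges returns the handler's result together with a counter of the call edges exercised during that one invocation.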

The memory overhead column is predominantly the call-graph data structure. We maintain a sliding window of recent call-graph state in memory (configurable window size, default 10 minutes). The memory cost scales with the number of distinct call paths observed, not with request volume. A high-throughput service that exercises the same call paths repeatedly won't accumulate more memory than a lower-throughput service with the same code coverage.
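The scaling behavior described above can be sketched with a small data structure. This is a simplified model, not the agent's actual storage format: the class name CallPathWindow, the tuple-of-names path representation, and the last-seen eviction rule are all assumptions for illustration.

```python
import time


class CallPathWindow:
    """Tracks the distinct call paths seen within a sliding time window."""

    def __init__(self, window_seconds=600, clock=time.monotonic):
        self.window_seconds = window_seconds  # default mirrors the 10-minute window
        self._clock = clock
        self._last_seen = {}  # call path (tuple of names) -> last-seen timestamp

    def record(self, path):
        """Record one observation; a repeated path only refreshes its timestamp."""
        self._last_seen[tuple(path)] = self._clock()

    def evict_expired(self):
        """Drop paths that have not been observed within the window."""
        cutoff = self._clock() - self.window_seconds
        expired = [p for p, ts in self._last_seen.items() if ts < cutoff]
        for p in expired:
            del self._last_seen[p]

    def __len__(self):
        return len(self._last_seen)
```

Recording the same path ten thousand times leaves one entry in the dictionary, so in this model memory follows the number of distinct paths and not the request count.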

What These Numbers Mean in Practice

A 1.8% p50 latency overhead means: if your uninstrumented service has a 50ms median response time, you'll see roughly 50.9ms with instrumentation enabled. If your SLO is p99 < 200ms and your current p99 is 180ms, you have 20ms of headroom — the 2.6% p99 overhead adds about 4.7ms, well within budget.
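The arithmetic above is easy to repeat with your own numbers. The helper names below are illustrative, and the calculation assumes the overhead applies as a simple percentage of the uninstrumented latency.

```python
def instrumented_latency(baseline_ms, overhead_pct):
    """Expected latency after applying a percentage overhead."""
    return baseline_ms * (1 + overhead_pct / 100)


def slo_headroom(baseline_ms, overhead_pct, slo_ms):
    """Milliseconds of SLO budget left once instrumentation is enabled."""
    return slo_ms - instrumented_latency(baseline_ms, overhead_pct)
```

With the figures from the paragraph above, instrumented_latency(50, 1.8) is about 50.9 ms, and slo_headroom(180, 2.6, 200) is about 15.3 ms: the 20 ms of headroom minus the roughly 4.7 ms added at p99.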

The throughput reduction matters most for services running near CPU capacity. At 60–70% CPU utilization (a reasonable sustained load level), the throughput reduction is proportionally lower than the numbers above — those figures were measured at 100% CPU, which is where instrumentation overhead has its maximum relative impact. Under normal operating conditions, expect roughly half the throughput impact of the peak figures.

Memory overhead deserves a separate look for JVM services. If you're already tuning JVM heap sizes carefully, an additional 30–36 MB from the instrumentation agent needs to be accounted for in your container memory limits. We've seen teams hit container OOM events because they allocated exactly enough memory for their application workload without leaving room for the agent. Add at least 64 MB to your container memory limit when enabling instrumentation on JVM services.

Configuring for Lower Overhead When Needed

The default configuration captures the full call-graph with 10-second snapshot intervals. For services where even sub-2% overhead is a concern, several configuration levers reduce it further:

Sampling mode: Instead of instrumenting every request, sample a percentage (default: 100%, configurable to as low as 5%). Overhead drops roughly in proportion to the sampling rate. Call-graph completeness drops too: at 10% sampling you may miss call paths that appear in only a small fraction of requests. For high-traffic services with consistent call patterns, 10–20% sampling provides adequate coverage. Applying proportional scaling to the p50 figures above, 10% sampling lands under 0.3% overhead on most runtimes, and 20% sampling stays at or below roughly 0.6%.

Snapshot interval: Increasing the call-graph snapshot interval from 10 seconds to 60 seconds reduces the frequency of call-graph serialization and transmission. The improvement is modest, around a 15% relative reduction in overhead (for example, from 1.8% to about 1.5%), but measurable on throughput-sensitive services.

Exclusion lists: You can exclude specific packages, modules, or function name patterns from instrumentation. If your service calls a known-safe library that accounts for 40% of function call volume, excluding it from instrumentation scope reduces overhead proportionally.
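The three levers above can be combined into a back-of-the-envelope estimate. The sketch below is a planning aid only. It assumes the levers are independent and multiply together, which the benchmarks above do not establish, and the function and parameter names are hypothetical, not product configuration keys.

```python
def estimate_overhead(base_pct, sampling=1.0, long_snapshot_interval=False,
                      excluded_call_share=0.0):
    """Rough overhead estimate (in percent) after applying the three levers.

    base_pct: measured overhead at the default configuration.
    sampling: fraction of requests instrumented (1.0 means 100%).
    long_snapshot_interval: True for a 60-second interval instead of 10 seconds.
    excluded_call_share: fraction of function call volume excluded from scope.
    """
    if not 0.0 < sampling <= 1.0:
        raise ValueError("sampling must be in (0, 1]")
    if not 0.0 <= excluded_call_share < 1.0:
        raise ValueError("excluded_call_share must be in [0, 1)")

    # Assumption: sampling and exclusion each scale overhead proportionally.
    estimate = base_pct * sampling * (1.0 - excluded_call_share)
    if long_snapshot_interval:
        estimate *= 0.85  # the "around 15%" reduction quoted above
    return estimate
```

For example, estimate_overhead(1.8, sampling=0.1) gives roughly 0.18%, and estimate_overhead(2.0, excluded_call_share=0.4) gives roughly 1.2%. Treat these as starting points to validate against your own measurements.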

We don't recommend reducing the sampling rate below 10% for production security purposes: at very low sampling rates, you risk missing the call paths that matter for reachability analysis. For development and staging environments, where you want instrumentation present with minimal performance impact, sampling at the 5% minimum works well for call-graph construction purposes.

Comparing to Alternative Approaches

For context: the overhead figures above are for active call-graph capture. Alternative security approaches have their own overhead profiles.

DAST scanning (automated browser-based or API-based attack simulation) runs against a deployed service and has no per-request overhead on production — but it requires a staging environment and doesn't provide production reachability data. IAST (Interactive Application Security Testing) typically shows overhead of 5–15% because it instruments at a finer granularity than call-graph capture, tracking individual data flows through the application. CSPM tools that scan infrastructure configuration have near-zero runtime overhead but don't address application-layer vulnerability reachability.

Our approach sits between IAST and passive observation in terms of overhead. It's more expensive than pure passive monitoring but less expensive than full IAST data-flow tracing, and it generates the specific data we need: which call paths are live, which findings lie on those paths, which can be suppressed as unreachable.

Reproducing These Results

We've published the benchmark test harness in our GitHub repository (link in the footer). The test applications and benchmark configurations are versioned. If you want to run your own benchmarks against your specific workload, which we encourage, the methodology notes above describe how we controlled for variables. The most important choice for meaningful results is to test at realistic load levels, not only at synthetic peak load. Real-world instrumentation overhead is lower than peak-load overhead because real services don't run at 100% CPU utilization continuously.

If your benchmarks show significantly different results than ours, we'd want to know. Runtime behavior varies across container configurations, kernel versions, and CPU architectures in ways we may not have covered in our test matrix. Reach out with your data and we'll investigate.
