How We Use LLMs to Summarize Security Alerts in Plain English

There is a translation problem at the center of most AppSec workflows. Security scanners produce findings written for security engineers: CVE identifiers, CVSS scores, CWE classifications, references to NVD advisories from 2018. The people who need to act on those findings are application developers, who think in terms of function names, HTTP endpoints, database queries, and pull requests. The gap between those two vocabularies is where remediation time disappears.

We started using LLMs to bridge that gap about 14 months ago, and it changed the way our triage workflow operates. The change was measurable: internal triage time fell 60% within the first quarter. Here is exactly what we built and why the prompting decisions matter.

The Problem with CVE-Centric Alert Format

A standard SAST alert for a SQL injection vulnerability looks like this:

CWE-89: Improper Neutralization of Special Elements used in an SQL Command. CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H (8.8). Affected: src/api/users.py:147. See https://cwe.mitre.org/data/definitions/89.html

That alert is technically accurate. It tells a security engineer exactly what they need to know to classify the finding. It tells an application developer almost nothing about what to actually fix. It names a file and a line number, but not the function involved. What is the vulnerable function doing? What input reaches it? What does a fix look like? None of that is present.
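To make the missing context concrete, here is a minimal sketch of the kind of code a CWE-89 alert typically points at, next to the parameterized fix a developer would want to see. The function names, the table, and the use of sqlite3 are assumptions for illustration; the alert above does not show the real code at src/api/users.py:147.

```python
import sqlite3


def find_user_unsafe(conn, username):
    # Vulnerable: user input is concatenated into the SQL text (CWE-89).
    query = "SELECT id, name FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, username):
    # Fixed: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


# Hypothetical in-memory table standing in for the application's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # the injected OR clause matches every row
print(len(find_user_safe(conn, payload)))    # the payload is compared as a literal string
```

A developer reading the two functions side by side knows what to change. A developer reading only the CWE identifier and the CVSS vector does not.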

When we looked at our own internal triage data before implementing LLM summaries, the average time from alert creation to engineer assignment was 4.2 days for HIGH-severity findings. That lag was not caused by engineers ignoring security — it was caused by the translation cost. Security leads were manually writing context notes on each ticket before assigning them. One security engineer was spending 6–8 hours per week doing nothing but translating CVE jargon into developer-facing fix instructions.

The Three-Part Summary Structure

After several iterations, we converged on a three-part summary structure for every alert that flows through Runtimekindle's triage layer:

What is exposed: One sentence describing the vulnerability in application terms, not security taxonomy terms. Function name, call path, and the type of input that reaches the vulnerable code.
How it could be exploited: One to two sentences describing the realistic attack scenario given the application's context — not the theoretical worst case from the CVE description, but the specific exploit path given what Runtimekindle knows about the application's inputs and request handlers.
What to change: A concrete code-level fix recommendation — parameterized query, input validation function, library version, configuration change — phrased as a code review suggestion.

That structure is what an experienced security engineer would write manually for each finding. The LLM generates it at scale, for every finding it handles, without a security engineer writing each one by hand.
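The three parts can be represented as a small record that renders into the ticket text. This is a sketch only: the class name, field names, and the example wording are hypothetical, not Runtimekindle's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertSummary:
    what_is_exposed: str            # one sentence, in application terms
    how_it_could_be_exploited: str  # one to two sentences, realistic scenario
    what_to_change: str             # code-review-style fix suggestion

    def render(self) -> str:
        # Produce the developer-facing text in the same order as the structure above.
        return "\n".join([
            f"What is exposed: {self.what_is_exposed}",
            f"How it could be exploited: {self.how_it_could_be_exploited}",
            f"What to change: {self.what_to_change}",
        ])


# Illustrative content; the function and endpoint names are invented.
example = AlertSummary(
    what_is_exposed="find_user() builds a SQL string from the 'name' request field.",
    how_it_could_be_exploited="An authenticated caller can send a crafted name "
                              "to read or modify rows in the users table.",
    what_to_change="Pass 'name' as a bound parameter instead of concatenating it.",
)
print(example.render())
```

Keeping the three parts as separate fields, rather than one free-text blob, makes it possible to check that a generated summary is complete before it reaches a ticket.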

The Prompting Architecture

The summary quality depends entirely on what context you inject into the prompt. Raw CVE data plus source code is not enough to produce useful summaries consistently. Here is the context we include:

The vulnerable code snippet: 20–40 lines around the finding location, enough to show the function signature and the call that triggers the vulnerability.
The call chain from the entry point: Derived from the runtime call-graph. Which HTTP handler or background job calls the vulnerable function? What parameters does it receive? This is the part that CVE databases cannot provide — it is specific to the application's actual architecture.
The runtime reachability score: Is this finding on a live call path? If the reachability score is low, the LLM summary notes that explicitly: "This function is not currently called in production. Monitor for reachability changes before treating as high priority."
The remediation pattern for the CWE class: We maintain a library of 80+ CWE-to-fix-pattern mappings. The prompt includes the relevant pattern as a hint, which dramatically reduces hallucinated or generic fix suggestions.

Without the call chain context, the LLM produces summaries that are accurate about the vulnerability class but vague about the specific application impact. With the call chain, the summaries read like they were written by someone who actually read the code.
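The four context items described above could be assembled into a prompt along these lines. This is a hedged sketch: the field names, the prompt wording, the 0.0 to 1.0 score range, and the LOW_REACHABILITY cutoff are all assumptions, and the call to the model itself is left out.

```python
from dataclasses import dataclass
from typing import List

LOW_REACHABILITY = 0.2  # hypothetical cutoff; the article does not state one


@dataclass
class FindingContext:
    cwe_id: str            # e.g. "CWE-89"
    location: str          # file and line reported by the scanner
    code_snippet: str      # 20-40 lines around the finding
    call_chain: List[str]  # entry point first, vulnerable function last
    reachability: float    # assumed to range from 0.0 to 1.0
    fix_pattern: str       # hint from the CWE-to-fix-pattern library


def build_prompt(ctx: FindingContext) -> str:
    sections = [
        "Summarize this finding for an application developer in three parts: "
        "what is exposed, how it could be exploited, what to change.",
        f"Finding: {ctx.cwe_id} at {ctx.location}",
        "Code:\n" + ctx.code_snippet,
        "Call chain: " + " -> ".join(ctx.call_chain),
        f"Reachability score: {ctx.reachability:.2f}",
        "Remediation pattern hint: " + ctx.fix_pattern,
    ]
    if ctx.reachability < LOW_REACHABILITY:
        # Ask the model to be explicit about dead or unreached code.
        sections.append(
            "This finding has low reachability. State explicitly that the "
            "function is not currently called in production."
        )
    return "\n\n".join(sections)


# Invented example values for illustration.
ctx = FindingContext(
    cwe_id="CWE-89",
    location="src/api/users.py:147",
    code_snippet="def find_user(name): ...",
    call_chain=["POST /api/users/search", "search_users", "find_user"],
    reachability=0.05,
    fix_pattern="Use a parameterized query.",
)
print(build_prompt(ctx))
```

Building the prompt from a typed record makes a missing call chain or missing pattern hint visible at construction time instead of surfacing later as a vague summary.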

What We Got Wrong the First Time

Our first implementation did not include the reachability score in the prompt. The LLM summaries were technically correct but created urgency even for findings in dead code. Engineers would read the summary, see a realistic-sounding exploit scenario, and start prioritizing remediation — for a function that was never called in production.

Adding the reachability score to the prompt — and explicitly instructing the model to note low-reachability findings — fixed that. We measured a 31% reduction in engineer time spent on suppressed findings within four weeks of that change. The summaries became honest about the risk level, not just accurate about the vulnerability class.

The second mistake was not providing CWE-to-fix-pattern mappings. Without those, the model's fix suggestions for memory safety issues in C/C++ code or deserialization vulnerabilities in Java were generic enough to be unhelpful — "use safe deserialization practices" is technically correct advice and completely useless as a code-level instruction. The pattern library made the difference between advice developers read and advice they act on.
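A pattern library of this kind can be as simple as a lookup keyed by CWE identifier. The sketch below assumes a plain dictionary and uses three illustrative entries with wording written for this example; it is not the actual 80+ entry library.

```python
from typing import Optional

# Illustrative entries only; the pattern text is invented for this sketch.
FIX_PATTERNS = {
    "CWE-89": "Replace string-built SQL with a parameterized query; "
              "pass user input as bound parameters.",
    "CWE-502": "Do not deserialize untrusted bytes into arbitrary objects; "
               "parse into an allow-listed schema instead.",
    "CWE-79": "Encode output for the HTML context it is rendered in; "
              "do not insert raw user input into markup.",
}


def fix_pattern_for(cwe_id: str) -> Optional[str]:
    # Normalize the identifier so "cwe-89" and " CWE-89 " resolve to the same entry.
    return FIX_PATTERNS.get(cwe_id.strip().upper())


print(fix_pattern_for("CWE-502"))
print(fix_pattern_for("CWE-9999"))  # no entry: a signal that the model would be guessing
```

Returning None for an unknown class, rather than a generic fallback string, keeps the "no established pattern" case distinguishable so it can be handled separately.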

Measuring Summary Quality

We track three quality metrics for LLM-generated summaries:

Fix adoption rate: Of the findings where the LLM suggested a specific fix, what percentage had that exact fix applied? Currently 68% for HIGH-reachability findings. That means in roughly two-thirds of cases, the developer looked at the summary, agreed with the suggested fix, and implemented it without needing additional guidance.
Clarification request rate: How often does an engineer open a ticket or Slack thread asking for more context on a finding? Before LLM summaries, this happened on 38% of HIGH-severity findings. After, it dropped to 11%. The summaries answer the questions engineers would otherwise ask.
Mean time to remediate for HIGH-reachability findings: Our data shows HIGH-reachability findings close in an average of 3.4 days with LLM summaries vs. 19.1 days without. That is not purely attributable to the summaries, since reachability filtering is doing some of that work, but the summary quality is a measurable contributor based on cohort analysis.
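The three metrics above reduce to simple ratios and a mean over per-finding records. The following sketch assumes a hypothetical record shape and uses made-up sample data; it does not reproduce the figures quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FindingRecord:
    suggested_fix_applied: bool         # the exact suggested fix was merged
    clarification_requested: bool       # a ticket or thread asked for more context
    days_to_remediate: Optional[float]  # None while the finding is still open


def fix_adoption_rate(records: List[FindingRecord]) -> float:
    return sum(r.suggested_fix_applied for r in records) / len(records)


def clarification_request_rate(records: List[FindingRecord]) -> float:
    return sum(r.clarification_requested for r in records) / len(records)


def mean_time_to_remediate(records: List[FindingRecord]) -> Optional[float]:
    # Only closed findings contribute; open ones have no remediation time yet.
    closed = [r.days_to_remediate for r in records if r.days_to_remediate is not None]
    return sum(closed) / len(closed) if closed else None


# Made-up sample data for illustration.
sample = [
    FindingRecord(True, False, 2.0),
    FindingRecord(True, False, 4.0),
    FindingRecord(False, True, None),
    FindingRecord(True, False, 3.0),
]
print(fix_adoption_rate(sample))
print(clarification_request_rate(sample))
print(mean_time_to_remediate(sample))
```

Excluding open findings from the mean is a design choice worth stating wherever the number is reported, because it makes the metric look better while a backlog is growing.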

What LLM Triage Does Not Replace

LLM summaries are a force multiplier for security triage. They are not a substitute for human judgment on complex findings. Novel attack chains, multi-step privilege escalation paths, and business logic vulnerabilities require a security engineer to trace through the application manually. The LLM summary will get the vulnerability class right but may miss the business context that makes a MEDIUM finding functionally critical.

We route findings above a certain complexity threshold to human review before LLM summary generation. The threshold is defined by call depth, number of affected components, and whether the CWE class lacks an established remediation pattern. That routing logic keeps the high-value cases in front of a security engineer while handling the long tail automatically.
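The routing criteria above can be sketched as a single predicate. The threshold values and all names here are hypothetical; the article does not state the actual cutoffs.

```python
from dataclasses import dataclass

MAX_CALL_DEPTH = 8   # hypothetical threshold
MAX_COMPONENTS = 3   # hypothetical threshold


@dataclass
class FindingComplexity:
    call_depth: int           # frames between entry point and vulnerable call
    affected_components: int  # services or modules the finding touches
    has_fix_pattern: bool     # an established remediation pattern exists


def route(finding: FindingComplexity) -> str:
    # Any single criterion is enough to send the finding to a human first.
    needs_human = (
        not finding.has_fix_pattern
        or finding.call_depth > MAX_CALL_DEPTH
        or finding.affected_components > MAX_COMPONENTS
    )
    return "human_review" if needs_human else "llm_summary"


print(route(FindingComplexity(call_depth=3, affected_components=1, has_fix_pattern=True)))
print(route(FindingComplexity(call_depth=12, affected_components=1, has_fix_pattern=True)))
```

Combining the criteria with "or" errs toward human review, which matches the stated goal of keeping the complex cases in front of a security engineer.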

The framing that has worked for us internally is this: LLM summaries handle the 85% of findings that have well-understood vulnerability classes and clear remediation paths. The other 15% need a human who understands both the security domain and the application's specific business logic. The goal is to make sure that human is spending their time on that 15% rather than writing context notes for the 85%.

Security triage is not a knowledge problem — your team already knows how to fix SQL injection. It is a communication and translation problem at scale. That is the specific problem where LLMs, used carefully, are genuinely useful right now.
