LLMs for Code Security: How Smart Few-Shot Example Selection Boosts Vulnerability Detection

About the Research & Authors

This summary is a simplified, educational interpretation of the original academic paper. All ideas and findings are drawn directly from the researchers’ work and restated here for wider accessibility.

“On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection”
by Md Abdul Hannan, Ronghao Ni, Chi Zhang, Limin Jia, Ravi Mangal, and Corina S. Pasareanu
Affiliations: Carnegie Mellon University and Colorado State University
Published on arXiv (October 2025)
— Artificial Intelligence and Software Engineering (cs.AI / cs.SE) category

Note: This overview is not affiliated with the authors, Carnegie Mellon University, or Colorado State University. It aims to communicate the main research concepts and results in simple terms. For technical details, experiments, and data, see the official arXiv publication.

TL;DR: LLMs can miss software vulnerabilities. This paper shows that you can boost their accuracy without retraining them by carefully choosing the few-shot examples you place in the prompt. Two ideas help most: (1) add examples the model tends to get wrong (so it learns from its mistakes), and (2) add examples that are very similar to the code you want it to analyze. Combining both is usually best.

Why it matters

  • Fewer missed bugs
  • More explainable prompts
  • No model retraining needed

The Problem

Large Language Models (LLMs) like GPT or Qwen are powerful at writing, completing, and explaining code. However, when it comes to finding security vulnerabilities, their performance can be inconsistent. Identifying subtle flaws in logic, memory handling, or data validation often requires context that a model doesn’t automatically have.

In practice, how well an LLM detects vulnerabilities depends heavily on the examples it sees in the prompt. If the examples are random or irrelevant, results vary wildly — sometimes missing critical bugs or flagging harmless code. But when examples are strategically chosen to teach the model what real vulnerabilities look like, accuracy and consistency rise dramatically.

The research addresses this challenge by asking a key question: “How can we choose the right few-shot examples to make an LLM a better security auditor?”

Two Simple Ideas That Work

1) Learn From Mistakes (LFM)

The Learn From Mistakes (LFM) approach starts from a simple but powerful idea: when a model fails, that failure contains valuable information. Instead of ignoring its wrong predictions, researchers use them as teaching material for future prompts.

To build the LFM set, the model is first tested on a labeled dataset of code snippets—some secure, some vulnerable. Every time the model mislabels a piece of code (for example, marking vulnerable code as safe), that example is flagged. Over time, these “hard” examples—where the model struggles—become the perfect training aids.

The process looks like this:

Labeled code set → query the LLM → collect its mistakes → few-shot set

The selected examples are then added as few-shot demonstrations inside the prompt for future predictions. Confronted with the types of errors it previously made, the model learns the boundary between “safe” and “vulnerable” code more sharply.
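To make the loop concrete, here is a minimal Python sketch of how such a mistake pool could be collected. The ask_llm helper, the prompt wording, and the lowercase "vulnerable"/"safe" labels are placeholder assumptions, not the authors’ implementation:

```python
# Sketch: collect the "hard" examples the model gets wrong (the LFM pool).
# ask_llm is a placeholder callable returning "vulnerable" or "safe";
# labeled_dataset is an iterable of (code, true_label) pairs with lowercase labels.

def build_lfm_pool(labeled_dataset, ask_llm):
    """Query the LLM on every labeled snippet and keep the ones it mislabels."""
    mistakes = []
    for code, true_label in labeled_dataset:
        prompt = (
            "Is the following code vulnerable or safe? Answer with one word.\n\n"
            + code
        )
        prediction = ask_llm(prompt).strip().lower()
        if prediction != true_label:        # the model got this example wrong
            mistakes.append((code, true_label))
    return mistakes                         # reused later as few-shot demonstrations
```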

Researchers found that LFM significantly improves recall—meaning the model catches more actual vulnerabilities. However, this sometimes comes at the cost of precision: the model might flag a few safe examples as risky, especially if the mistake-heavy examples dominate the prompt.

Effect: The LFM method teaches the model to focus on its weak spots, helping it identify subtle bugs it previously missed. It’s most effective when combined with other strategies—like Learn From Nearest Neighbors (LFNN)—to balance recall and precision.

2) Learn From Nearest Neighbors (LFNN)

The Learn From Nearest Neighbors (LFNN) approach takes inspiration from how humans learn by analogy. When facing a new coding problem, we often recall similar examples we’ve seen before. LFNN brings this intuition to LLMs by helping them analyze new code based on examples that are most alike in structure, syntax, and vulnerability patterns.

The key step is to represent each code snippet as a vector embedding—a numerical summary that captures its semantic meaning and logical behavior. By doing this, the system can compute how close two pieces of code are to each other in a high-dimensional “embedding space.”

When the LLM is asked to evaluate a new snippet, the LFNN method finds the nearest neighbors—examples from a labeled dataset that have the most similar embeddings. Those examples (along with their vulnerability labels) are then added to the model’s prompt. This gives the model immediate, contextually relevant cues before it makes a decision.

Your code → closest labeled examples → few-shot set
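A minimal sketch of that nearest-neighbor lookup, assuming an embed function that maps a code snippet to a numeric vector; the specific embedding model and the cosine-similarity ranking are illustrative choices, and the paper’s exact setup may differ:

```python
import numpy as np

# Sketch: rank labeled snippets by cosine similarity to the query embedding.
# embed is any code-embedding function returning a 1-D vector (an assumption here).

def select_nearest_neighbors(query_code, labeled_dataset, embed, k=4):
    """Return the k labeled snippets whose embeddings are closest to the query."""
    query_vec = np.asarray(embed(query_code), dtype=float)
    scored = []
    for code, label in labeled_dataset:
        vec = np.asarray(embed(code), dtype=float)
        sim = float(np.dot(query_vec, vec)
                    / (np.linalg.norm(query_vec) * np.linalg.norm(vec) + 1e-9))
        scored.append((sim, code, label))
    scored.sort(key=lambda t: t[0], reverse=True)   # most similar first
    return [(code, label) for _, code, label in scored[:k]]
```

In practice you would embed the labeled dataset once and cache the vectors rather than re-embedding every snippet on each query.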

In simpler terms, LFNN works like a recommendation system for code: when analyzing a new snippet, it recalls the “most relevant past experiences” from the dataset. This leads to predictions that are more consistent and context-aware, especially when the model faces patterns it has not been explicitly trained on.

Researchers observed that LFNN improves precision—the model is less likely to raise false alarms because it learns from code that truly resembles the current context. It’s particularly effective for models that already understand programming structure well, such as those fine-tuned for coding tasks.

Effect: The LFNN approach provides contextual grounding and smoother decision-making by anchoring predictions to relevant real-world examples. It shines when combined with LFM, creating a balance between learning from similarity and mistakes.

Best of Both Worlds: Combine Them

Both Learn From Mistakes (LFM) and Learn From Nearest Neighbors (LFNN) solve different parts of the same problem. LFM teaches the model to recognize and correct its weaknesses, while LFNN ensures it sees examples that are contextually similar to the current code. On their own, each approach helps — but when they’re combined strategically, their strengths reinforce each other and lead to a more balanced, robust model.

The researchers experimented with three ways of combining LFM and LFNN to see which produced the best trade-off between recall (catching real bugs) and precision (avoiding false alarms):

  1. Union (Simple Merge): Create two separate sets — one from LFM (examples that expose common mistakes) and one from LFNN (examples most similar to the target code) — then merge them into a single few-shot prompt. This gives the model both “lesson” examples and “contextual” examples at once. It’s the simplest strategy and already provides a big improvement over random selection.
  2. Neighbors-then-Mistakes: Start with LFNN by selecting the most similar examples to the target code, then run LFM only on that subset to keep the focus on code patterns where the model tends to fail within relevant contexts. This approach is like giving the model a short, targeted tutoring session on the most confusing cases that resemble the task at hand.
  3. Mistakes-Neighbors-Mistakes (Iterative Learning): First, apply LFM globally to find where the model struggles most. Next, for each query, use LFNN to select local examples that are similar to the input code. Finally, apply LFM again, but this time only on those local neighbors. The result is a refined prompt that reflects both the model’s general weaknesses and the specific context of the code being tested. This was the most effective combination in experiments, producing high accuracy and consistency across datasets.
Method 1: LFM (global) + LFNN (per query) → union. Method 2: LFNN (per query) → LFM (seeded by LFNN). A code sketch of all three strategies follows.
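Reusing the build_lfm_pool and select_nearest_neighbors sketches from above, the three strategies could be wired together roughly like this; the function names, pool sizes, and slicing are illustrative assumptions, not the authors’ code:

```python
# Sketch of the three combination strategies described above.
# Reuses build_lfm_pool and select_nearest_neighbors from the earlier sketches;
# the k*4 shortlist sizes are arbitrary illustrative choices.

def union_examples(query_code, dataset, embed, ask_llm, k=4):
    """Method 1: merge a global mistake pool with the query's nearest neighbors."""
    mistakes = build_lfm_pool(dataset, ask_llm)[:k]
    neighbors = select_nearest_neighbors(query_code, dataset, embed, k)
    return mistakes + neighbors

def neighbors_then_mistakes(query_code, dataset, embed, ask_llm, k=4):
    """Method 2: shortlist similar examples, then keep only the ones the model gets wrong."""
    shortlist = select_nearest_neighbors(query_code, dataset, embed, k * 4)
    return build_lfm_pool(shortlist, ask_llm)[:k]

def mistakes_neighbors_mistakes(query_code, dataset, embed, ask_llm, k=4):
    """Method 3: globally hard examples -> local neighbors -> re-check for mistakes."""
    hard_pool = build_lfm_pool(dataset, ask_llm)
    local = select_nearest_neighbors(query_code, hard_pool, embed, k * 4)
    return build_lfm_pool(local, ask_llm)[:k]
```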

In short, the combination methods ensure that the model learns smarter, not harder. It gets the benefit of LFM’s self-awareness — understanding its weak spots — and LFNN’s contextual grounding — staying close to the kind of code it’s analyzing. This dual approach helps LLMs become more consistent vulnerability detectors across multiple languages and datasets.

Rule of thumb: Use LFM when your model keeps missing real bugs (to raise recall). Use LFNN when it flags too many safe snippets (to improve precision). And if you need a balanced, reliable system — combine both.

How They Tested It

The authors ran experiments on several datasets spanning C/C++, Python, and JavaScript, using multiple code-focused LLMs (both open- and closed-source). To judge performance, they compared each method against two baselines:

  • Zero-shot: the model gets only the task instruction—no examples in the prompt.
  • Random few-shot: the model sees a small set of randomly chosen labeled examples in the prompt.
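For a rough sense of what the two baselines look like in practice, here are illustrative prompt templates; the wording is an assumption, not the paper’s exact prompts:

```python
# Illustrative baseline prompts; wording is an assumption, not the paper's templates.

ZERO_SHOT_PROMPT = (
    "You are a security auditor. Decide whether the code below is vulnerable or safe. "
    "Answer with exactly one word.\n\nCode:\n{query_code}\n\nAnswer:"
)

def random_few_shot_prompt(examples, query_code):
    """Prepend a small set of randomly chosen labeled examples before the query."""
    shots = "\n\n".join(f"Code:\n{code}\nAnswer: {label}" for code, label in examples)
    return (
        "You are a security auditor. Label each code snippet as vulnerable or safe.\n\n"
        + shots
        + f"\n\nCode:\n{query_code}\nAnswer:"
    )
```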

What the metrics mean

  • Accuracy: overall fraction of correct predictions (can be misleading if classes are imbalanced).
  • Precision: of the code flagged as vulnerable, how much was truly vulnerable (fewer false alarms → higher precision).
  • Recall: of all truly vulnerable code, how much was caught (fewer misses → higher recall).
  • F1 score: the harmonic mean of precision and recall. It rises only when both are strong, so it’s a good single number for security detection.
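For reference, precision, recall, and F1 can be computed directly from counts of true positives, false positives, and false negatives; the tiny worked example in the comment is only for illustration:

```python
# Precision, recall, and F1 from counts of true positives (tp),
# false positives (fp), and false negatives (fn).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Worked example: 8 real vulnerabilities caught, 2 false alarms, 2 missed bugs
# -> precision 0.8, recall 0.8, F1 0.8
print(precision_recall_f1(tp=8, fp=2, fn=2))
```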

Key takeaways

  • LFNN (nearest neighbors) improved precision and gave steadier results, especially on Python/Node.js-style code.
  • LFM (learn from mistakes) raised recall—it caught more real bugs—but can reduce precision if overused.
  • Combining both was the most robust overall, balancing precision and recall across datasets and models.
Schematic relative F1 vs. the random few-shot baseline (0% = same as random few-shot): Zero-shot ~−30%, LFNN ~+30%, LFM ~+20%, Combined ~+40%. These are illustrative values, not per-dataset exact results; a higher relative F1 means a better balance of precision and recall.

What You Can Apply Today

Quick recipe

  1. Keep a small, labeled library of code snippets (vulnerable + safe).
  2. Embed your snippets and your query with a code embedding model.
  3. Pick the top k nearest neighbors (LFNN).
  4. During triage, note where the model misclassifies; add those to your teaching set (LFM).
  5. Construct prompts with a mix of LFNN (relevance) + LFM (lessons learned).
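A compact sketch tying steps 2 through 5 together, reusing the select_nearest_neighbors and build_lfm_pool helpers from the earlier sketches; the prompt wording and parameter defaults are illustrative, and embed and ask_llm stand in for your own embedding and model calls:

```python
# Sketch of the recipe: LFNN neighbors plus LFM mistakes in one few-shot prompt.
# Reuses select_nearest_neighbors and build_lfm_pool from the sketches above.

def build_combined_prompt(query_code, dataset, embed, ask_llm, k_nn=3, k_lfm=3):
    neighbors = select_nearest_neighbors(query_code, dataset, embed, k=k_nn)
    mistakes = build_lfm_pool(dataset, ask_llm)[:k_lfm]
    shots = "\n\n".join(
        f"Code:\n{code}\nAnswer: {label}" for code, label in neighbors + mistakes
    )
    return (
        "You are a security auditor. Label each code snippet as vulnerable or safe.\n\n"
        + shots
        + f"\n\nCode:\n{query_code}\nAnswer:"
    )
```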

Tuning tips

  • Too many misses? Add more LFM examples (the hard ones).
  • Too many false alarms? Favor closer LFNN neighbors.
  • Keep examples short, labeled, and language-matched to your query.
  • Rotate examples occasionally to prevent overfitting to one pattern.

Bottom Line

The research highlights a powerful insight: it’s not about giving LLMs more data, but the right data. Carefully chosen examples can make a language model perform like a much larger one—without any retraining.

By combining examples that are contextually similar to the code under review (LFNN) with those that expose its past mistakes (LFM), the model learns to reason more carefully about what truly makes code vulnerable. This dual strategy turns prompting into a form of lightweight, on-the-fly training.

In short: smarter example selection = stronger, more trustworthy vulnerability detection. It’s a practical path toward safer software—powered by better prompts, not bigger models.

Based on the paper “On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection”.