Hello everyone! LLM applications are growing really fast these days. Most modern apps now have an LLM somewhere in the background. Have you ever made a tiny change to a prompt and later realized your LLM app started acting weird? Building LLM apps is easy, but keeping them reliable in production is harder.
Even small changes can quietly affect output quality, safety, and user experience, sometimes in ways you only notice when it’s too late.
That’s why just trusting your instincts is not enough. We need a clear, step-by-step way to check and monitor our models so we can know for sure whether a change makes the app better or worse.
In this article, we’ll walk through practical strategies to monitor and test LLM applications, with a focus on A/B testing, so you can ship with confidence.
✨ What is A/B testing?
A/B testing is more than just comparing two prompts or models. It’s a method to understand how different variations perform and which changes truly improve your app. Unlike traditional software, where inputs produce predictable outputs, LLMs can generate different results even with the same prompt. That’s why metrics are important to measure performance accurately.
A/B testing isn’t just a comparison. It’s about taking full control of your model’s evolution. It moves you from saying, “I think it’s better” to confidently knowing, “I know it’s better,” based on real data.
With the right evaluation and monitoring in place, you can fine-tune your LLM applications safely and ensure a reliable, high-quality experience for your users.
✨ Key Challenges in LLM Evaluation
Evaluating LLM applications is harder than evaluating traditional software because these systems are not deterministic and must be judged on several quality metrics at once.
- The output of an LLM application can change even with the same inputs. To account for this variability, we need larger sample sizes and more careful statistical methods.
- We can’t evaluate LLMs using only one metric. Quality depends on accuracy, safety, cost, latency, user satisfaction, and more. Optimizing a single metric can hurt the others.
- The results of agentic applications depend on chat history, user personality, and other contextual factors. During testing, we need to consider these factors so the results reflect real-life scenarios.
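To make the variability point concrete, here is a minimal sketch of one way to quantify uncertainty when comparing two variants: a paired percentile bootstrap over per-item scores. The scores are made-up illustrative values, and `bootstrap_diff_ci` is a hypothetical helper name, not part of any library.

```python
import random
from statistics import mean


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(B - A) on paired per-item scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-item differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical quality scores (0 to 1) for the same 10 test inputs.
variant_a = [0.70, 0.65, 0.80, 0.75, 0.60, 0.72, 0.68, 0.77, 0.66, 0.71]
variant_b = [0.74, 0.70, 0.79, 0.80, 0.66, 0.75, 0.70, 0.82, 0.69, 0.76]

observed, low, high = bootstrap_diff_ci(variant_a, variant_b)
print(f"mean improvement: {observed:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the improvement is unlikely to be sampling noise. With only ten items the interval is wide, which is why larger samples matter.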
✨ Key Metrics for LLM Evaluation
It is important to understand the key metrics/criteria we use to properly evaluate the performance of LLM applications. These metrics help us measure the quality, safety and system performance of the model’s outputs.
Text Quality & Performance Metrics
- Accuracy: It measures how often the agent produces correct outputs. Accuracy is a fundamental and critical metric, but it is not enough on its own. A model can produce correct outputs yet still be insufficient or use the wrong format.
- Faithfulness/Groundedness: These metrics check if the model’s answers match the given context and real-world facts. Faithfulness means the output follows the source information correctly. Groundedness means the facts in the output are true and come from trusted sources. For example, if the model gives historical facts or population numbers, they must be accurate and reliable.
- Conciseness & Relevance: These metrics measure whether the response actually answers the question and does so without unnecessary content. Even if a response is factually correct, it is considered a failure if it is not relevant to the question.
- Fluency: Measures how natural and grammatically correct the model’s output is. A fluent response is easy to read and sounds like it was written by a human.
- Task Completion Rate: Shows how often the agent successfully completes a user’s task. It reflects both how correct the response is and how useful it is in real use.
Operational Metrics
- Latency: Measures the time it takes for the model to generate a response. Response time directly affects the user experience. Even if a variant is technically strong, it can still be considered a failure if it is too slow in practice.
- Token usage & Cost: Measures the number of input and output tokens used and the cost of generating a response. We need to balance cost and quality. If the cost increases by 50% but the quality improves by only 5%, we may not choose that variant.
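As a rough illustration of the cost side of this trade-off, the sketch below computes the per-request cost from token counts and compares the relative change in cost and quality between two variants. The prices, token counts, and quality scores are invented for the example; substitute your provider's real pricing.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD of one request, given prices per one million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (0.5 means +50%)."""
    return (new - old) / old


# Hypothetical prices (USD per 1M tokens) and per-request token counts.
cost_a = request_cost(800, 200, input_price_per_m=0.30, output_price_per_m=2.50)
cost_b = request_cost(1200, 300, input_price_per_m=0.30, output_price_per_m=2.50)

# Hypothetical average quality scores for each variant.
quality_a, quality_b = 0.80, 0.84

cost_increase = relative_change(cost_a, cost_b)
quality_increase = relative_change(quality_a, quality_b)
print(f"cost: {cost_increase:+.0%}, quality: {quality_increase:+.0%}")
```

Putting both changes on the same relative scale makes it easier to decide whether the quality gain justifies the extra spend.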
Safety & Ethics Metrics
- Toxicity Score: Measures whether the response contains harmful, offensive, or biased content.
- Hallucination Detection: Sometimes a model can generate responses that are not real or factual. These are called hallucinations. This metric measures whether a response contains hallucinations.
User Experience Metrics
- User Retention: Measures how many users continue to use the application over time. High retention indicates that users find the model useful.
- Satisfaction Rating: Measures how satisfied users are with the agent’s responses or overall experience. This can be collected through surveys, ratings, or feedback.
✨ Evaluation Methods for LLMs
We have learned the key metrics for evaluating LLMs. Now, we can look at the evaluation techniques. In this section, we present different methods used to assess model performance.
🎯Reference-Based Scoring Methods
These methods measure the model’s output against reference texts (gold standards). They are commonly used for tasks like translation, summarization, and question answering.
🚀 F1 Score
F1 score is the harmonic mean of precision and recall. It is commonly used for classification or QA tasks because accuracy alone is not enough. Sometimes the model can have high recall but still produce many wrong predictions. F1 score balances precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
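For QA tasks, F1 is often computed at the token level between the predicted answer and the reference answer. The following is a minimal sketch of that idea, assuming simple lowercase whitespace tokenization; `token_f1` is a hypothetical helper name and the example strings are invented.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Perfect recall, but low precision because the answer is padded with extra words.
print(token_f1("the capital of France is Paris", "Paris"))
```

Here recall is 1 and precision is 1/6, so F1 is 2/7: the harmonic mean penalizes the imbalance between the two.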
🚀 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
It compares the overlap of n-grams between the generated text and reference text. It is often used for summarization tasks.
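Imagine we have a story summary and want to calculate ROUGE scores. We will calculate ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L for it.
Human Summary (reference): "Alice falls into a rabbit hole and discovers a magical world."
AI Generated Summary: "Alice falls down a rabbit hole and finds a magical land."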
ROUGE-1 (Unigram): We count each word individually and calculate Recall, Precision, and F1 score.
Matched Unigrams: [“Alice”, “falls”, “a”, “rabbit”, “hole”, “and”, “a”, “magical”]
Human Summary Words/Unigrams Count: 11
AI Generated Summary Words/Unigrams Count: 11
Matched Unigrams Count: 8
Recall = 8 / 11 ≈ 0.727, Precision = 8 / 11 ≈ 0.727, F1 ≈ 0.727
ROUGE-2 (Bigram): We group words into pairs (bigrams) and calculate Recall, Precision, and F1 score.
Matched Bigrams: [“Alice falls”, “a rabbit”, “rabbit hole”, “hole and”, “a magical”]
Matched Bigrams Count: 5
Each summary has 11 words, so each contains 10 bigrams: Recall = 5 / 10 = 0.5, Precision = 5 / 10 = 0.5, F1 = 0.5
ROUGE-L (Longest Common Subsequence): It measures the longest common subsequence (LCS) between the reference and the model’s output. Unlike ROUGE-1, which ignores word order, and ROUGE-2, which only rewards adjacent word pairs, it rewards words that appear in the same order even when other words come between them.
LCS = "Alice falls a rabbit hole and a magical"
Length of LCS = 8 words
Recall = 8 / 11 ≈ 0.727, Precision = 8 / 11 ≈ 0.727, F1 ≈ 0.727
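The three hand calculations above can be reproduced with a few lines of plain Python. This is a simplified sketch (lowercasing, stripping periods, plain F1 for ROUGE-L), so scores from real ROUGE packages may differ slightly; the helper names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    return text.lower().replace(".", "").split()


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prf(matches, candidate_total, reference_total):
    """Precision, recall and F1 from a match count."""
    precision = matches / candidate_total if candidate_total else 0.0
    recall = matches / reference_total if reference_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Multiset intersection: repeated n-grams only match as often as both sides contain them.
    matches = sum((Counter(cand) & Counter(ref)).values())
    return prf(matches, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(candidate, reference):
    return prf(lcs_length(candidate, reference), len(candidate), len(reference))


reference = tokenize("Alice falls into a rabbit hole and discovers a magical world.")
candidate = tokenize("Alice falls down a rabbit hole and finds a magical land.")

print("ROUGE-1 (P, R, F1):", rouge_n(candidate, reference, 1))
print("ROUGE-2 (P, R, F1):", rouge_n(candidate, reference, 2))
print("ROUGE-L (P, R, F1):", rouge_l(candidate, reference))
```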
Additionally, we can use the Hugging Face evaluate library to calculate ROUGE scores.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["Alice falls down a rabbit hole and finds a magical land."]
references = ["Alice falls into a rabbit hole and discovers a magical world."]
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE-1:", results['rouge1'])
print("ROUGE-2:", results['rouge2'])
print("ROUGE-L:", results['rougeL'])
🚀 BLEU (Bilingual Evaluation Understudy)
It measures the overlap of n-grams between the AI-generated response and the human reference text. It is similar to ROUGE, but BLEU focuses on precision. The score ranges from 0 to 1, where 1 means a perfect match. It is usually used for text translation tasks.
BLEU counts each n-gram match only up to the maximum number it appears in the reference, so repeating words in the model’s output won’t unfairly increase the score.
Matched unigrams (clipped by reference counts):
[“the”, “weather”, “is”, “sunny”, “and”, “warm”] → 6 matches
Total candidate unigrams: 7
Unigram precision = 6 / 7 ≈ 0.857
Brevity Penalty (BP) reduces the BLEU score when the model’s output is shorter than the reference. It prevents very short outputs from getting unfairly high scores.
BP = 1 if c > r, otherwise BP = exp(1 − r / c), where c is the length of the model’s output and r is the length of the reference.
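The clipping rule and the brevity penalty described above can be sketched in a few lines of Python. The sentences are a made-up example (not the one from the walkthrough), and the helper names are hypothetical. Full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean, which this sketch leaves out.

```python
import math
from collections import Counter


def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram counts at most
    as many times as it appears in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0


def brevity_penalty(candidate_len, reference_len):
    """1.0 when the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


reference = "the cat is on the mat".split()
candidate = "the the the the".split()

# "the" appears 4 times in the candidate but only 2 times in the reference,
# so only 2 of the 4 candidate unigrams are counted as matches.
print("clipped unigram precision:", clipped_precision(candidate, reference))
print("brevity penalty:", brevity_penalty(len(candidate), len(reference)))
```

Without clipping, the repeated word would score a precision of 4/4, which is exactly the inflation the clipping rule is meant to prevent.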
The following code calculates the BLEU score for an AI translation compared to a human reference using Hugging Face’s evaluate library.
import evaluate

bleu = evaluate.load("bleu")
original_text = "Le temps aujourd'hui est ensoleillé et chaud."  # Source (French)
reference_translation = ["The weather today is sunny and warm."]  # Human Translation
ai_translation = ["Today the weather is sunny and warm."]  # AI Translation
# Calculate BLEU
results = bleu.compute(predictions=ai_translation, references=reference_translation)
print(results)
Note: BLEU scores can vary between different libraries because they handle tokenization, punctuation, and lowercasing differently. SacreBLEU solves this problem by standardizing these steps, making the scores consistent and reliable.
In A/B testing on OpenAI’s large language models, researchers used BLEU scores together with the Wilcoxon signed-rank test to determine statistical significance.
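As an illustration of how a paired significance test of this kind works, here is a simplified Wilcoxon signed-rank sketch in plain Python. It uses the normal approximation without tie or continuity corrections, which is rough for small samples; for real analyses a library routine such as `scipy.stats.wilcoxon` is the usual choice. The BLEU values (scaled to 0-100) are invented, and the function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (W, z, two-sided p) for paired scores, normal approximation."""
    diffs = [b - a for a, b in zip(scores_a, scores_b) if b != a]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under the normal approximation
    return w, z, p


# Hypothetical per-sentence BLEU scores for the same 8 inputs under two variants.
bleu_a = [31, 28, 35, 40, 22, 30, 27, 33]
bleu_b = [34, 27, 39, 45, 28, 30, 29, 40]

w, z, p = wilcoxon_signed_rank(bleu_a, bleu_b)
print(f"W = {w}, z = {z:.2f}, p = {p:.3f}")
```

The test only looks at the sign and rank of each paired difference, so it does not assume the scores are normally distributed.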
🎯Model-Based Evaluation Methods
Model-based evaluation methods measure how similar the meaning of the model’s output is to the reference text. They are useful for tasks like summarization, translation, or question answering, where the exact words may differ but the overall meaning should match.
🚀 BERTScore
BERTScore evaluates semantic similarity between AI outputs and reference texts using token embeddings. Unlike BLEU or ROUGE, it can capture meaning even when the exact words differ.
To calculate BERTScore, we follow these three steps:
- Extract token embeddings using a model like BERT, RoBERTa, or DeBERTa.
- Compute cosine similarity between the AI-generated output and reference token embeddings.
- Calculate precision, recall, and F1 scores based on these similarities.
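The matching logic behind these three steps can be sketched without a real model. The example below uses tiny hand-written 2-D vectors in place of BERT embeddings (purely an assumption for illustration) and leaves out extras such as IDF weighting and baseline rescaling; the function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Precision, recall, F1 from greedy max-similarity token matching."""
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(row) for row in sim) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(sim[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "embeddings": two candidate tokens, three reference tokens.
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```

Every candidate token has an exact counterpart, so precision is 1, while the third reference token is only partially covered, which lowers recall.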
In this example, we show how to calculate BERTScore using the bert_score library, which loads its models through Hugging Face Transformers.
from bert_score import score

# Example texts
candidates = ["Today the weather is sunny and warm."]  # AI-generated output
references = ["The weather today is sunny and warm."]  # Human reference
# Calculate BERTScore
P, R, F1 = score(
    candidates,
    references,
    lang="en",
    model_type="bert-base-uncased"
)
# Print results
print("BERTScore Precision:", P.mean().item())
print("BERTScore Recall:", R.mean().item())
print("BERTScore F1:", F1.mean().item())
🎯G-Eval & LLM-as-Judge
G-Eval and LLM-as-Judge use a language model to check the quality of AI-generated text. They look at fluency, correctness, relevance, and overall usefulness. G-Eval asks the judge model to follow explicit evaluation criteria and step-by-step instructions, then return a score in a fixed format. LLM-as-Judge is the broader idea of letting any LLM act like a human evaluator. These methods catch errors and meaning problems that automatic metrics like BLEU or BERTScore might miss.
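A minimal sketch of the LLM-as-Judge pattern might look like the following. The prompt wording, the 1 to 5 scale, and the `judge` function are assumptions made for illustration, and `fake_llm` stands in for a real model call so the example runs offline.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading an answer.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Rate the model answer's correctness from 1 (wrong) to 5 (fully correct).
Reply in the form "Score: <number>" followed by a one-sentence reason."""


def judge(question: str, reference: str, answer: str,
          call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask a judge model for a 1-5 score; return None if the reply can't be parsed."""
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)
    reply = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_llm(prompt: str) -> str:
    # Stand-in for a real API call; always returns the same canned verdict.
    return "Score: 4. The answer is correct but omits the year."


score = judge(
    question="Who wrote Alice's Adventures in Wonderland?",
    reference="Lewis Carroll, in 1865.",
    answer="Lewis Carroll.",
    call_llm=fake_llm,
)
print(score)
```

Forcing a fixed reply format and handling unparseable replies explicitly keeps the judge's output usable as a numeric metric.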
✨ What is Langfuse?
Langfuse is a platform for monitoring, evaluating, and analyzing large language models in production. It helps teams see what their models are doing, find problems like hallucinations or toxic outputs, and track performance over time. You can run A/B tests, follow key metrics, and view trends, making model evaluation easier and more practical.
✨Creating and Managing Prompt Versions
You can add your system prompts in Langfuse. In the image below, I show which button you need to click.
After creating a new prompt, you can assign labels such as development or QA to its versions. This allows you to use different prompt versions in production and development environments. When a new version is ready, you just need to assign the production label to it to use it in production.
Now you just need to create a Langfuse client to fetch prompts from Langfuse. When fetching a prompt, we provide the prompt name and the label (or version) we want.
import os

from langfuse import Langfuse


def get_langfuse_client() -> Langfuse:
    """Create a Langfuse client with explicit configuration.

    Required env vars: LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_HOST
    """
    secret_key = os.environ.get("LANGFUSE_SECRET_KEY", "")
    public_key = os.environ.get("LANGFUSE_PUBLIC_KEY", "")
    host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
    if not secret_key or not public_key:
        raise EnvironmentError(
            "Missing LANGFUSE_SECRET_KEY or LANGFUSE_PUBLIC_KEY. "
            "Copy .env.example to .env and fill in the values."
        )
    return Langfuse(
        secret_key=secret_key,
        public_key=public_key,
        host=host,
    )


def get_prompt(name: str, *, label: str = "production", **variables) -> str:
    """Fetch a prompt template from Langfuse by name.

    Args:
        name: Prompt name in Langfuse (e.g. "classifier-prompt").
        label: Prompt label to fetch (default: "production").
        **variables: Template variables to compile (e.g. input="some text").
    Returns:
        The compiled prompt string.
    """
    langfuse = get_langfuse_client()
    prompt = langfuse.get_prompt(name, type="text", label=label)
    return prompt.compile(**variables)
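Langfuse prompt templates mark variables with double curly braces, for example {{input}}. To illustrate what the compile step does conceptually, here is a local stand-in that fills such placeholders. It is a simplified sketch, not Langfuse's implementation; the function name `compile_template` and the choice to leave unknown placeholders untouched are assumptions of this example.

```python
import re


def compile_template(template: str, **variables) -> str:
    """Replace {{name}} placeholders with the given variables."""
    def substitute(match):
        key = match.group(1)
        # Leave placeholders without a matching variable unchanged (assumed behavior).
        return str(variables[key]) if key in variables else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Classify the sentiment of this text: {{input}}"
print(compile_template(template, input="I absolutely love this product!"))
```

Keeping the template in Langfuse and only passing variables from code is what lets you change the prompt wording without redeploying the application.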
✨ Observing Prompts
Let’s imagine we have a classification LLM application. This application classifies a given text into four categories: positive, negative, neutral, and harmful. In Langfuse, you can use the @observe decorator to automatically observe your application. Let’s look at an example.
from langfuse import observe


@observe()
def classify_text(text: str) -> str:
    """Full classification pipeline with nested spans."""
    prompt = get_prompt("classifier-prompt", input=text)
    # get_openrouter_client() is a helper not shown here; it is assumed to
    # return an OpenAI-compatible client.
    client = get_openrouter_client()
    response = client.chat.completions.create(
        model="google/gemini-2.5-flash",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=50,
    )
    return response.choices[0].message.content
When we run our example, we can see the inputs and outputs in the Langfuse tracing section. In this section, we can see inputs, outputs, latency, token usage, and more.
If the application raises an error, we can also see the error reason here.
The @observe decorator is really useful. However, sometimes we need more detailed and specific logs. For this, we can update observation spans manually.
Now let’s look at a different example. In this application, our model extracts information from user text. Let’s start with the ExtractedEntities class, and then write the extract_entities function.
from pydantic import BaseModel, ValidationError


class ExtractedEntities(BaseModel):
    persons: list[str]
    locations: list[str]
    dates: list[str]
    organizations: list[str]


def extract_entities(text: str) -> ExtractedEntities:
    """Full extraction pipeline using manual `with` spans.

    Trace structure:
        extract-entities (root span)
        ├── llm-call (generation)
        └── postprocess (span) + scores
    """
    langfuse = get_langfuse_client()
    with langfuse.start_as_current_observation(
        name="extract-entities",
        as_type="span",
        model="google/gemini-2.5-flash",
    ) as root_span:
        root_span.update(input=text)

        # Generation span: the LLM call itself
        with root_span.start_as_current_observation(
            name="llm-call",
            as_type="generation",
            model="google/gemini-2.5-flash",
            model_parameters={"temperature": 0},
        ) as generation_span:
            system_prompt = get_prompt("extractor-prompt")
            client = get_openrouter_client()
            response = client.chat.completions.create(
                model="google/gemini-2.5-flash",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
                temperature=0,
                max_tokens=500,
                response_format={"type": "json_object"},
            )
            raw_json = response.choices[0].message.content
            generation_span.update(
                input=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
                output=raw_json,
                usage_details={
                    "input": response.usage.prompt_tokens,
                    "output": response.usage.completion_tokens,
                },
            )

        # Postprocess span: parse the JSON and attach quality scores
        with root_span.start_as_current_observation(
            name="postprocess", as_type="span"
        ) as postprocess_span:
            postprocess_span.update(input=raw_json)
            try:
                entities = ExtractedEntities.model_validate_json(raw_json)
                postprocess_span.score(
                    name="valid_json",
                    value=1.0,
                    data_type="BOOLEAN",
                    comment="LLM returned valid JSON matching schema",
                )
            except ValidationError as e:
                postprocess_span.score(
                    name="valid_json",
                    value=0.0,
                    data_type="BOOLEAN",
                    comment=f"Invalid JSON: {e}",
                )
                # Fall back to an empty result so the pipeline can continue
                entities = ExtractedEntities(
                    persons=[], locations=[], dates=[], organizations=[]
                )

            # Hallucination check: every entity must appear in the source text
            text_lower = text.lower()
            hallucinated = []
            for key in ["persons", "locations", "organizations"]:
                for entity in getattr(entities, key):
                    if entity.lower() not in text_lower:
                        hallucinated.append(f"{key}: {entity}")
            postprocess_span.score(
                name="no_hallucination",
                value=1.0 if not hallucinated else 0.0,
                data_type="BOOLEAN",
                comment="Clean" if not hallucinated else f"Hallucinated: {hallucinated}",
            )
            postprocess_span.update(output=entities.model_dump())

        root_span.update(output=entities.model_dump())
    return entities
In this example, we show how to track an extraction pipeline using Langfuse spans. The pipeline has three main spans and scores to check output quality:
- Root Span (extract-entities): This span covers the whole pipeline. It records the input text and acts as the parent for all other spans.
- LLM Call Span (llm-call): This span tracks the call to the language model. It logs the system prompt, user input, and raw model output.
- Postprocess Span (postprocess): This span handles converting the raw output into structured entities. It also validates the JSON and updates the final results.
- Scores (valid_json & no_hallucination): Scores check the quality of the output. valid_json ensures the output matches the expected format, and no_hallucination checks that entities appear in the input text.
✨Langfuse Scores
Langfuse allows you to add scores to your spans to measure the quality of model outputs. Scores can be used for things like:
- Validation checks: whether the output matches the expected JSON schema.
- Hallucination detection: checking if the entities in the output actually appear in the input text.
- Custom metrics: you can define your own scoring logic depending on your application.
Scores have a name, value, data type, and optional comment, which makes it easy to track and analyze them in the Langfuse dashboard.
You can view your scores in the Prompt Tracing or Scores tab under the Evaluation section.
In the Scores section, Langfuse provides dashboards to analyze your metrics.
- If you select a single score, you can view its distribution and trends over time.
- If you select two scores, you can compare them using heatmaps, correlation analysis, and other statistical metrics.
This allows you to monitor model performance, detect issues, and track improvements across different runs.
✨Langfuse Dataset
A dataset is a collection of model inputs and outputs. In this section, you can create your own test datasets and run experiments with your prompts. Langfuse allows you to create datasets either in the UI or via API in your code.
Think about having many example inputs and outputs for classifier and extractor LLM applications. Let’s create a dataset for these examples using the API SDK.
# example for classifier
[
    {"input": "I absolutely love this product!", "expected_output": "positive"},
    {"input": "Best experience I've ever had!", "expected_output": "positive"}
    ...
]

import json


def create_classifier_dataset():
    # DATASETS_DIR is a pathlib.Path to the folder with the golden JSON files (defined elsewhere)
    langfuse = get_langfuse_client()
    dataset_name = "sentiment-classifier-v1"
    langfuse.create_dataset(name=dataset_name)
    items = json.loads((DATASETS_DIR / "classifier_golden.json").read_text())
    for item in items:
        langfuse.create_dataset_item(
            dataset_name=dataset_name,
            input=item["input"],
            expected_output=item["expected_output"],
        )
With the create_dataset function, we create a dataset. Each dataset must have a unique name. Using the create_dataset_item function, we can add items to the dataset. After running this code, you should see your dataset in the Langfuse UI.
Now we can run experiments for our prompts. There are two ways to run experiments: in the UI or via the API SDK.
Run Experiments via UI
The experiment begins with two key choices: selecting the prompt version and assigning the model. Here, you decide which version of the prompt to test and which model will run it (e.g. gemini-2.5-flash). This setup makes sure your tests are consistent and easy to repeat.
Then we need to choose the dataset for the experiment. You should make sure the dataset matches the variables in your prompt. Langfuse helps with this automatically. The green “Valid configuration” box shows that the system has checked your dataset against your prompt placeholders (like input). This ensures your experiment runs smoothly without missing data.
Now it’s time to decide how we measure success. Here, you pick Evaluators to automatically score your experiment results. Langfuse provides several managed evaluators like Correctness, Relevance, and Hallucination Detection.
For our case, Correctness is the most suitable evaluator because our app classifies text into four categories: Positive, Negative, Harmful, and Neutral. Choosing the right metric ensures the results are meaningful and consistent.
After setting up the evaluator, we can view the experiment overview, which shows the prompt, model, dataset name, and more.
When the experiment is finished, we can see the evaluator score, model output, and expected output in the dataset’s experiments section. This is how you can manage your datasets and experiments in the UI. Next, let’s explore an API SDK example.
Run Experiments via API SDK
To run an experiment, first get the dataset from Langfuse and pass it to the run_experiment function. This function takes the experiment name, data, task, and evaluators.
The task is the function that runs on each dataset item. In practice, it is your LLM application: in our example, it could be classify_text or extract_entities.
langfuse = get_langfuse_client()
model = "nvidia/nemotron-3-nano-30b-a3b:free"
# Short label for the experiment name (assumed; short_name is not defined in the original snippet)
short_name = model.split("/")[-1]
extractor_dataset = langfuse.get_dataset("entity-extractor-v1")


def make_extractor_task(model: str):
    # Assumes a variant of extract_entities that accepts a `model` argument
    def task(*, item, **kwargs):
        result = extract_entities(item.input, model=model)
        return result.model_dump()
    return task


print(f"\n=== Extractor: {model} ===")
ext_result = langfuse.run_experiment(
    name=f"extractor-{short_name}",
    data=extractor_dataset.items,
    task=make_extractor_task(model),
    evaluators=[entity_recall_evaluator, hallucination_evaluator],
)
print(ext_result.format())
Evaluators are functions that compare the task’s output with the expected results. Langfuse comes with built-in evaluators, but you can also write your own.
A custom evaluator receives keyword arguments such as input, output, and expected_output, and returns an Evaluation object with a name, a score, and a comment. The hallucination evaluator below uses output and input: it assigns a score of 1 if every extracted entity appears in the source text and 0 otherwise, and the comment lists any entities that were not found.
from langfuse import Evaluation


def hallucination_evaluator(*, output, input, **kwargs):
    """Check if any extracted entity is NOT in the source text."""
    text_lower = input.lower()
    hallucinated = []
    for key in ["persons", "locations", "organizations"]:
        for entity in output.get(key, []):
            if entity.lower() not in text_lower:
                hallucinated.append(f"{key}: {entity}")
    score = 1.0 if not hallucinated else 0.0
    return Evaluation(
        name="no_hallucination",
        value=score,
        data_type="BOOLEAN",
        comment="Clean" if not hallucinated else f"Hallucinated: {hallucinated}",
    )
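The experiment code also references an entity_recall_evaluator that is not shown in this article. One possible shape for it is sketched below, assuming expected_output is a dict with the same keys as the model output. To keep the sketch runnable offline, a small dataclass stands in for Langfuse's Evaluation class; in a real project you would use the Evaluation object from the SDK instead.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Local stand-in for the SDK's Evaluation object (assumption for this sketch)."""
    name: str
    value: float
    comment: str = ""


def entity_recall_evaluator(*, output, expected_output, **kwargs):
    """Share of expected entities that the model actually extracted."""
    expected = {
        (key, entity.lower())
        for key, values in expected_output.items()
        for entity in values
    }
    found = {
        (key, entity.lower())
        for key, values in output.items()
        for entity in values
    }
    if not expected:
        return Evaluation(name="entity_recall", value=1.0, comment="No expected entities")
    missing = sorted(f"{key}: {entity}" for key, entity in expected - found)
    recall = len(expected & found) / len(expected)
    return Evaluation(
        name="entity_recall",
        value=recall,
        comment="All found" if not missing else f"Missing: {missing}",
    )


# Hypothetical dataset item: the model found 2 of the 3 expected entities.
result = entity_recall_evaluator(
    output={"persons": ["ada lovelace"], "locations": ["London"]},
    expected_output={"persons": ["Ada Lovelace"], "locations": ["London", "Paris"]},
)
print(result.value, result.comment)
```

Recall complements the hallucination check: one catches entities the model invented, the other catches entities it missed.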
When we run these experiments, we can see the outputs directly in the terminal. In Langfuse, if we go to the dataset section, we should see our experiments listed. Clicking on each experiment shows detailed results, letting us review how the model performed on every item.
These experiments ran nvidia/nemotron-3-nano-30b-a3b:free through OpenRouter. Since this is a free model, no usage cost is recorded in Langfuse.
You can compare different experiment results by clicking the Compare button in the top right corner. This view lets you compare outputs, latency, cost, and your evaluator results side by side.
✨Langfuse LLM-as-a-Judge
We mentioned G-Eval and the LLM-as-a-judge methodology in the model-based evaluation section. LLM-as-a-judge is a simple idea: instead of writing functions to evaluate a model’s output, we use another LLM to do the evaluation.
For example, instead of checking if two outputs match exactly, we can ask a model questions like “Is this answer correct?” or “Does this output match the expected result?” The model then acts as a judge and gives a decision.
To use this in Langfuse, you need to set up an evaluator and define a default model in the LLM connections section. This model will be used as the judge.
Wait a minute… we already did this.
If you’ve run experiments in the UI, you’ve probably used it without even noticing. When you select the “Correctness” evaluator in Langfuse, it already works as an LLM-as-a-judge. Instead of doing a strict comparison, it uses another model to decide whether the output is correct based on the expected result.
In the LLM-as-a-Judge section, you can see how many times it was used, which evaluator was used, etc.
You can also filter LLM-as-a-Judge traces in the Tracing section and view inputs, outputs, costs, etc., when LLM-as-a-Judge runs.
✨Langfuse Dashboards
In the Home section, you can explore dashboards showing your metrics, such as traces, model costs, and scores. Use charts here to analyze and track your LLM’s performance.
In the Dashboard section, you can create custom dashboards for your needs.
