Evaluating and Monitoring LLMs with Langfuse: A/B Testing & Metrics

Hello everyone! LLM applications are growing really fast these days. Most modern apps now have an LLM somewhere in the background. Have you ever made a tiny change to a prompt and later realized your LLM app started acting weird? Building LLM apps is easy, but keeping them reliable in production is harder.

Even small changes can quietly affect output quality, safety, and user experience, sometimes in ways you only notice when it’s too late.
That’s why just trusting your instincts is not enough. We need a clear, step-by-step way to check and monitor our models so we can know for sure whether a change makes the app better or worse.

In this article, we’ll walk through practical strategies to monitor and test LLM applications, with a focus on A/B testing, so you can ship with confidence.

✨ What is A/B testing?

A/B testing is more than just comparing two prompts or models. It’s a method to understand how different variations perform and which changes truly improve your app. Unlike traditional software, where inputs produce predictable outputs, LLMs can generate different results even with the same prompt. That’s why metrics are important to measure performance accurately.

A/B testing isn’t just a comparison. It’s about taking full control of your model’s evolution. It moves you from saying, “I think it’s better” to confidently knowing, “I know it’s better,” based on real data.

With the right evaluation and monitoring in place, you can fine-tune your LLM applications safely and ensure a reliable, high-quality experience for your users.

✨ Key Challenges in LLM Evaluation

Evaluating LLM applications is harder than evaluating traditional software because these systems are non-deterministic and their quality has several dimensions.

The output of an LLM application can change even with the same inputs. To account for this variability, we need larger sample sizes and more careful statistical methods.
We can’t evaluate LLMs using a single metric. Quality depends on accuracy, safety, cost, latency, user satisfaction, and more. Optimizing for just one metric can hurt the others.
The results of agentic applications depend on chat history, user personality, and other contextual factors. During testing, we need to account for these factors so the results reflect real-life scenarios.

✨ Key Metrics for LLM Evaluation

It is important to understand the key metrics/criteria we use to properly evaluate the performance of LLM applications. These metrics help us measure the quality, safety and system performance of the model’s outputs.

Text Quality & Performance Metrics

Accuracy: It measures how often the agent produces correct outputs. Accuracy is a fundamental and critical metric, but it is not enough on its own. A model can produce correct outputs yet still be insufficient or use the wrong format.
Faithfulness/Groundedness: These metrics check if the model’s answers match the given context and real-world facts. Faithfulness means the output follows the source information correctly. Groundedness means the facts in the output are true and come from trusted sources. For example, if the model gives historical facts or population numbers, they must be accurate and reliable.
Conciseness & Relevance: These metrics measure whether the response addresses the question and does so without unnecessary content. Even if the result is correct, if it is not relevant to the question, it is considered a failure.
Fluency: Measures how natural and grammatically correct the model’s output is. A fluent response is easy to read and sounds like it was written by a human.
Task Completion Rate: Shows how often the agent successfully completes a user’s task. It reflects both how correct the response is and how useful it is in real use.

Operational Metrics

Latency: Measures the time it takes for the model to generate a response. Response time directly affects the user experience. Even if a variant is technically strong, it can still be considered a failure if it is too slow in practice.
Token usage & Cost: Measures the number of input and output tokens used and the cost of generating a response. We need to balance cost and quality. If the cost increases by 50% but the quality improves by only 5%, we may not choose that variant.
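The trade-off described above can be made concrete with a small calculation. The sketch below is only illustrative: the token counts, per-token prices, and quality scores are made-up numbers, and `request_cost` and `relative_change` are hypothetical helper names, not part of any library.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request, given token counts and prices per 1,000 tokens."""
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)


def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, e.g. 0.5 means +50%."""
    return (new - old) / old


# Hypothetical numbers for two variants (not real prices or measurements)
cost_a = request_cost(800, 200, input_price_per_1k=0.001, output_price_per_1k=0.002)
cost_b = request_cost(1200, 300, input_price_per_1k=0.001, output_price_per_1k=0.002)
quality_a, quality_b = 0.80, 0.84

cost_increase = relative_change(cost_a, cost_b)
quality_gain = relative_change(quality_a, quality_b)
print(f"Cost change: {cost_increase:+.0%}, quality change: {quality_gain:+.0%}")
```

Putting both changes on the same relative scale makes it easier to decide whether variant B is worth shipping.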

Safety & Ethics Metrics

Toxicity Score: Measures whether the response contains harmful, offensive, or biased content.
Hallucination Detection: Sometimes a model can generate responses that are not real or factual. These are called hallucinations. This metric measures whether a response contains hallucinations.

User Experience Metrics

User Retention: Measures how many users continue to use the application over time. High retention indicates that users find the model useful.
Satisfaction Rating: Measures how satisfied users are with the agent’s responses or overall experience. This can be collected through surveys, ratings, or feedback.

✨ Evaluation Methods for LLMs

We have learned the key metrics for evaluating LLMs. Now, we can look at the evaluation techniques. In this section, we present different methods used to assess model performance.

🎯 Reference-Based Scoring Methods

These methods measure the model’s output against reference texts (gold standards). They are commonly used for tasks like translation, summarization, and question answering.

🚀 F1 Score
F1 score is the harmonic mean of precision and recall. It is commonly used for classification or QA tasks because accuracy alone is not enough. For example, a model can have high recall but still produce many wrong predictions (low precision). F1 score balances the two:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

TP = true positives, FP = false positives, FN = false negatives
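As a minimal sketch, precision, recall, and F1 can be computed directly from the TP, FP, and FN counts. The function name and the counts in the example are made up for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: the model finds most positives (high recall)
# but also raises many false alarms (lower precision).
precision, recall, f1 = precision_recall_f1(tp=90, fp=60, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two values than a plain average would.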

🚀 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
It compares the overlap of n-grams between the generated text and reference text. It is often used for summarization tasks.

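Imagine we have a story summary and want to calculate ROUGE scores. We will calculate ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L for it.

Human Summary (reference): "Alice falls into a rabbit hole and discovers a magical world."
AI Generated Summary: "Alice falls down a rabbit hole and finds a magical land."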

ROUGE-1 (Unigram): We count each word individually and calculate Recall, Precision, and F1 score.

Matched Unigrams: [“Alice”, “falls”, “a”, “rabbit”, “hole”, “and”, “a”, “magical”]
Human Summary Words/Unigrams Count: 11
AI Generated Summary Words/Unigrams Count: 11
Matched Unigrams Count: 8

Recall = 8 / 11 ≈ 0.727
Precision = 8 / 11 ≈ 0.727
F1 = 2 × 0.727 × 0.727 / (0.727 + 0.727) ≈ 0.727

ROUGE-2 (Bigram): We group words into pairs (bigrams) and calculate Recall, Precision, and F1 score.

Matched Bigrams: [“Alice falls”, “a rabbit”, “rabbit hole”, “hole and”, “a magical”]
Matched Bigrams Count: 5
Bigrams Count per Summary: 10 (each summary has 11 words)

Recall = 5 / 10 = 0.5
Precision = 5 / 10 = 0.5
F1 = 0.5

ROUGE-L (Longest Common Subsequence): It measures the longest common subsequence (LCS) between the reference and the model’s output. Unlike ROUGE-1 or ROUGE-2, which only look at individual words or word pairs, it takes word order into account without requiring the matching words to be consecutive.

LCS = "Alice falls a rabbit hole and a magical"
Length of LCS = 8 words

Recall = 8 / 11 ≈ 0.727
Precision = 8 / 11 ≈ 0.727
F1 ≈ 0.727
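The counts above can be reproduced in a few lines of plain Python. This is a simplified sketch, not the official ROUGE implementation: it uses the same two sentences as the Hugging Face example in this article, and the tokenization (lowercasing, dropping punctuation) is an assumption, so library scores may differ slightly.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep only word characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches: int, ref_total: int, cand_total: int) -> tuple[float, float, float]:
    recall = matches / ref_total if ref_total else 0.0
    precision = matches / cand_total if cand_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


def rouge_n(reference: str, candidate: str, n: int) -> tuple[float, float, float]:
    ref, cand = ngrams(tokenize(reference), n), ngrams(tokenize(candidate), n)
    matches = sum((ref & cand).values())  # overlap, clipped by reference counts
    return prf(matches, sum(ref.values()), sum(cand.values()))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(reference: str, candidate: str) -> tuple[float, float, float]:
    ref, cand = tokenize(reference), tokenize(candidate)
    return prf(lcs_length(ref, cand), len(ref), len(cand))


reference = "Alice falls into a rabbit hole and discovers a magical world."
candidate = "Alice falls down a rabbit hole and finds a magical land."
print("ROUGE-1 (P, R, F1):", rouge_n(reference, candidate, 1))
print("ROUGE-2 (P, R, F1):", rouge_n(reference, candidate, 2))
print("ROUGE-L (P, R, F1):", rouge_l(reference, candidate))
```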

Additionally, we can use the Hugging Face evaluate library to calculate ROUGE scores.

import evaluate

# Load the ROUGE metric from the Hugging Face evaluate library
rouge = evaluate.load("rouge")
predictions = ["Alice falls down a rabbit hole and finds a magical land."]
references = ["Alice falls into a rabbit hole and discovers a magical world."]
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE-1:", results['rouge1'])
print("ROUGE-2:", results['rouge2'])
print("ROUGE-L:", results['rougeL'])

🚀 BLEU (Bilingual Evaluation Understudy)
It measures the overlap of n-grams between the AI-generated response and the human reference text. It is similar to ROUGE, but BLEU focuses on precision. The score ranges from 0 to 1, where 1 means a perfect match. It is usually used for translation tasks.

BLEU counts each n-gram match only up to the maximum number it appears in the reference, so repeating words in the model’s output won’t unfairly increase the score.

Matched unigrams (clipped by reference counts):
[“the”, “weather”, “is”, “sunny”, “and”, “warm”] → 6 matches
Total candidate unigrams: 7

Unigram precision = 6 / 7 ≈ 0.857

Brevity Penalty (BP) reduces the BLEU score when the model’s output is shorter than the reference. It prevents very short outputs from getting unfairly high scores.

BP = 1 if c > r
BP = exp(1 - r / c) if c ≤ r

c = length of the model’s output, r = length of the reference
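To illustrate clipping and the brevity penalty, here is a minimal sketch in plain Python. It covers only a single n-gram order and a single reference, not full BLEU. The helper names are hypothetical and the example sentences are made up to show why clipping matters.

```python
import math
from collections import Counter


def clipped_precision(candidate: list[str], reference: list[str], n: int = 1) -> float:
    """n-gram precision where each match is capped at its count in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """1.0 for outputs longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


# Repeating "the" seven times would score 7/7 without clipping;
# with clipping only 2 matches count, because "the" appears twice in the reference.
candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print("Clipped unigram precision:", clipped_precision(candidate, reference))
print("Brevity penalty (c=5, r=7):", brevity_penalty(5, 7))
```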

This code calculates the BLEU score for AI translations compared to human references using Hugging Face’s evaluate library.

import evaluate

# Load the BLEU metric from the Hugging Face evaluate library
bleu = evaluate.load("bleu")
original_text = "Le temps aujourd'hui est ensoleillé et chaud."  # Source (French)
reference_translation = ["The weather today is sunny and warm."]  # Human Translation
ai_translation = ["Today the weather is sunny and warm."]        # AI Translation
# Calculate BLEU
results = bleu.compute(predictions=ai_translation, references=reference_translation)
print(results)

In the result, the precisions field lists the unigram, bigram, trigram, and 4-gram precision scores.

Note: BLEU scores can vary between different libraries because they handle tokenization, punctuation, and lowercasing differently. SacreBLEU solves this problem by standardizing these steps, making the scores consistent and reliable.

In A/B testing on OpenAI’s large language models, researchers used BLEU scores together with the Wilcoxon signed-rank test to determine statistical significance.
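To show how a paired test like the Wilcoxon signed-rank test mentioned above fits into A/B testing, here is a simplified, dependency-free sketch. It uses the normal approximation without tie or continuity corrections, so treat the p-value as approximate; for real analyses a tested implementation such as `scipy.stats.wilcoxon` is the safer choice. The per-example scores are made up.

```python
import math


def wilcoxon_signed_rank(scores_a: list[float], scores_b: list[float]):
    """Return (w_plus, w_minus, two-sided p-value) for paired scores.

    Differences are computed as B - A, so a large w_plus favors variant B.
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b) if b != a]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank absolute differences; tied values share the average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Normal approximation of W+ under the null hypothesis of no difference.
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p_value


# Made-up BLEU scores for the same 10 test inputs under prompt A and prompt B
scores_a = [0.41, 0.38, 0.52, 0.47, 0.33, 0.45, 0.50, 0.39, 0.44, 0.36]
scores_b = [0.45, 0.44, 0.53, 0.55, 0.38, 0.47, 0.59, 0.46, 0.47, 0.46]
w_plus, w_minus, p_value = wilcoxon_signed_rank(scores_a, scores_b)
print(f"W+={w_plus}, W-={w_minus}, p≈{p_value:.4f}")
```

The test is paired because both variants are scored on the same inputs, which removes per-example difficulty from the comparison.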

🎯 Model-Based Evaluation Methods

Model-based evaluation methods measure how similar the meaning of the model’s output is to the reference text. They are useful for tasks like summarization, translation, or question answering, where the exact words may differ but the overall meaning should match.

🚀 BERTScore
BERTScore evaluates semantic similarity between AI outputs and reference texts using token embeddings. Unlike BLEU or ROUGE, it can capture meaning even when the exact words differ.
To calculate BERTScore, we follow these three steps:

Extract token embeddings using a model like BERT, RoBERTa, or DeBERTa.
Compute cosine similarity between the AI-generated output and reference token embeddings.
Calculate precision, recall, and F1 scores based on these similarities.
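The three steps above can be sketched with toy vectors. This is only an illustration of the greedy matching idea: the 2-D vectors stand in for real contextual embeddings (an assumption made so the example runs without a model), and features of the real metric such as IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token."""
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy "token embeddings" (hypothetical values, one vector per token)
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```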

In this example, we show how to calculate BERTScore using the bert_score library, which loads its models from Hugging Face.

from bert_score import score

# Example texts
candidates = ["Today the weather is sunny and warm."]  # AI-generated output
references = ["The weather today is sunny and warm."]  # Human reference
# Calculate BERTScore
P, R, F1 = score(
    candidates, 
    references, 
    lang="en", 
    model_type="bert-base-uncased"
)
# Print results
print("BERTScore Precision:", P.mean().item())
print("BERTScore Recall:", R.mean().item())
print("BERTScore F1:", F1.mean().item())

🎯 G-Eval & LLM-as-a-Judge

G-Eval and LLM-as-a-Judge use a language model to check the quality of AI-generated text. They look at fluency, correctness, relevance, and overall usefulness. G-Eval scores outputs against explicit criteria in a structured, repeatable way. LLM-as-a-Judge lets any LLM act like a human evaluator. These methods catch errors and meaning problems that automatic metrics like BLEU or BERTScore might miss.

✨ What is Langfuse?

Langfuse is a platform for monitoring, evaluating, and analyzing large language models in production. It helps teams see what their models are doing, find problems like hallucinations or toxic outputs, and track performance over time. You can run A/B tests, follow key metrics, and view trends, making model evaluation easier and more practical.

✨ Creating and Managing Prompt Versions

You can add your system prompts in Langfuse by creating a new prompt in the UI.

After creating a new prompt, you can add labels such as development or QA to its versions. This allows you to use different prompt versions in production and development environments. When a new version is ready, you just need to assign the production label to it to use it in production.

Now you just need to create a Langfuse client to fetch prompts from Langfuse. When fetching a prompt, we provide the prompt name and a label (or version).

import os

from langfuse import Langfuse


def get_langfuse_client() -> Langfuse:
    """Create a Langfuse client with explicit configuration.

    Required env vars: LANGFUSE_SECRET_KEY, LANGFUSE_PUBLIC_KEY.
    Optional: LANGFUSE_HOST (defaults to Langfuse Cloud).
    """
    secret_key = os.environ.get("LANGFUSE_SECRET_KEY", "")
    public_key = os.environ.get("LANGFUSE_PUBLIC_KEY", "")
    host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
    if not secret_key or not public_key:
        raise EnvironmentError(
            "Missing LANGFUSE_SECRET_KEY or LANGFUSE_PUBLIC_KEY. "
            "Copy .env.example to .env and fill in the values."
        )
    return Langfuse(
        secret_key=secret_key,
        public_key=public_key,
        host=host,
    )

def get_prompt(name: str, *, label: str = "production", **variables) -> str:
    """Fetch a prompt template from Langfuse by name.

    Args:
        name: Prompt name in Langfuse (e.g. "classifier-prompt").
        label: Prompt label to fetch (default: "production").
        **variables: Template variables to compile (e.g. input="some text").
    Returns:
        The compiled prompt string.
    """
    langfuse = get_langfuse_client()
    prompt = langfuse.get_prompt(name, type="text", label=label)
    return prompt.compile(**variables)
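To make the compile step easier to picture, here is a small local stand-in that fills `{{variable}}` placeholders in a template string. It is not Langfuse's implementation, only an imitation of the substitution step. The function name is hypothetical, the template text is made up, and leaving unknown placeholders untouched is a choice made for this sketch.

```python
import re


def compile_template(template: str, **variables) -> str:
    """Replace {{name}} placeholders with the matching keyword argument."""
    def replace(match: re.Match) -> str:
        key = match.group(1)
        # Unknown placeholders are left as-is so missing variables stay visible.
        return str(variables[key]) if key in variables else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)


template = "Classify the sentiment of this text: {{input}}"
print(compile_template(template, input="I absolutely love this product!"))
```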

✨ Observing Prompts

Let’s imagine we have a classification LLM application. This application classifies a given text into four categories: positive, negative, neutral, and harmful. In Langfuse, you can use the @observe decorator to automatically observe your application. Let’s look at an example.

from langfuse import observe


@observe()
def classify_text(text: str) -> str:
    """Full classification pipeline with nested spans."""
    prompt = get_prompt("classifier-prompt", input=text)
    # get_openrouter_client() is a helper that is not shown in this article
    client = get_openrouter_client()
    response = client.chat.completions.create(
        model="google/gemini-2.5-flash",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=50,
    )
    return response.choices[0].message.content
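The snippet above calls get_openrouter_client(), which the article does not define. One plausible version is sketched below, assuming the `openai` package is installed and that the key is stored in an environment variable named OPENROUTER_API_KEY (both are assumptions). OpenRouter exposes an OpenAI-compatible API, so the standard OpenAI client can be pointed at its base URL.

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_settings(env=os.environ) -> dict:
    """Collect client settings; the env var name is an assumption."""
    api_key = env.get("OPENROUTER_API_KEY", "")
    if not api_key:
        raise EnvironmentError("Missing OPENROUTER_API_KEY.")
    return {"base_url": OPENROUTER_BASE_URL, "api_key": api_key}


def get_openrouter_client():
    """Create an OpenAI-compatible client that talks to OpenRouter."""
    # Imported here so the settings helper works even without the package.
    from openai import OpenAI

    return OpenAI(**openrouter_settings())
```

Keeping the settings in a separate function makes the configuration check easy to test without sending any request.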

When we run our example, we can see the inputs and outputs in the Langfuse tracing section. In this section, we can see inputs, outputs, latency, token usage, and more.

If the application raises an error, we can also see the error reason here.

The @observe decorator is really useful. However, sometimes we need more detailed and specific logs. For this, we can update observation spans manually.

Now let’s look at a different example. In this application, our model extracts information from user text. Let’s start with the ExtractedEntities class, and then write the extract_entities function.

from pydantic import BaseModel, ValidationError


class ExtractedEntities(BaseModel):
    persons: list[str]
    locations: list[str]
    dates: list[str]
    organizations: list[str]

def extract_entities(text: str) -> ExtractedEntities:
    """Full extraction pipeline using manual `with` spans.

    Trace structure:
        extract-entities (root span)
        ├── llm-call (generation)
        └── postprocess (span) + scores
    """
    langfuse = get_langfuse_client()
    with langfuse.start_as_current_observation(
        name="extract-entities",
        as_type="span",
        model="google/gemini-2.5-flash",
    ) as root_span:
        root_span.update(input=text)
        with root_span.start_as_current_observation(
            name="llm-call",
            as_type="generation",
            model="google/gemini-2.5-flash",
            model_parameters={"temperature": 0},
        ) as generation_span:
            system_prompt = get_prompt("extractor-prompt")
            client = get_openrouter_client()
            response = client.chat.completions.create(
                model="google/gemini-2.5-flash",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
                temperature=0,
                max_tokens=500,
                response_format={"type": "json_object"},
            )
            raw_json = response.choices[0].message.content
            generation_span.update(
                input=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": text},
                ],
                output=raw_json,
                usage_details={
                    "input": response.usage.prompt_tokens,
                    "output": response.usage.completion_tokens,
                },
            )
        with root_span.start_as_current_observation(
            name="postprocess", as_type="span"
        ) as postprocess_span:
            postprocess_span.update(input=raw_json)
            try:
                entities = ExtractedEntities.model_validate_json(raw_json)
                postprocess_span.score(
                    name="valid_json",
                    value=1.0,
                    data_type="BOOLEAN",
                    comment="LLM returned valid JSON matching schema",
                )
            except ValidationError as e:
                postprocess_span.score(
                    name="valid_json",
                    value=0.0,
                    data_type="BOOLEAN",
                    comment=f"Invalid JSON: {e}",
                )
                entities = ExtractedEntities(
                    persons=[], locations=[], dates=[], organizations=[]
                )
            # Hallucination check
            text_lower = text.lower()
            hallucinated = []
            for key in ["persons", "locations", "organizations"]:
                for entity in getattr(entities, key):
                    if entity.lower() not in text_lower:
                        hallucinated.append(f"{key}: {entity}")
            postprocess_span.score(
                name="no_hallucination",
                value=1.0 if not hallucinated else 0.0,
                data_type="BOOLEAN",
                comment="Clean" if not hallucinated else f"Hallucinated: {hallucinated}",
            )
            postprocess_span.update(output=entities.model_dump())
        root_span.update(output=entities.model_dump())
    return entities

In this example, we show how to track an extraction pipeline using Langfuse spans. The pipeline has three main spans and scores to check output quality:

Root Span (extract-entities): This span covers the whole pipeline. It records the input text and acts as the parent for all other spans.
LLM Call Span (llm-call): This span tracks the call to the language model. It logs the system prompt, user input, and raw model output.
Postprocess Span (postprocess): This span handles converting the raw output into structured entities. It also validates the JSON and updates the final results.
Scores (valid_json & no_hallucination): Scores check the quality of the output. valid_json ensures the output matches the expected format, and no_hallucination checks that entities appear in the input text.

✨ Langfuse Scores

Langfuse allows you to add scores to your spans to measure the quality of model outputs. Scores can be used for things like:

Validation checks: whether the output matches the expected JSON schema.
Hallucination detection: checking if the entities in the output actually appear in the input text.
Custom metrics: you can define your own scoring logic depending on your application.

Scores have a name, value, data type, and optional comment, which makes it easy to track and analyze them in the Langfuse dashboard.

You can view your scores in the Prompt Tracing or Scores tab under the Evaluation section.

In the Scores section, Langfuse provides dashboards to analyze your metrics.

If you select a single score, you can view its distribution and trends over time.
If you select two scores, you can compare them using heatmaps, correlation analysis, and other statistical metrics.

This allows you to monitor model performance, detect issues, and track improvements across different runs.

✨ Langfuse Dataset

A dataset is a collection of inputs and their expected outputs. In this section, you can create your own test datasets and run experiments with your prompts. Langfuse allows you to create datasets either in the UI or in code via the SDK.

Suppose we have many example inputs and expected outputs for the classifier and extractor applications. Let’s create a dataset for these examples using the SDK.

# example for classifier
[
  {"input": "I absolutely love this product!", "expected_output": "positive"},
  {"input": "Best experience I've ever had!", "expected_output": "positive"}
  ...
]

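import json
from pathlib import Path

# Assumption: the golden datasets live in a local "datasets" folder
DATASETS_DIR = Path("datasets")


def create_classifier_dataset():
    langfuse = get_langfuse_client()
    dataset_name = "sentiment-classifier-v1"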
    langfuse.create_dataset(name=dataset_name)
    items = json.loads((DATASETS_DIR / "classifier_golden.json").read_text())
    for item in items:
        langfuse.create_dataset_item(
            dataset_name=dataset_name,
            input=item["input"],
            expected_output=item["expected_output"],
        )

With the create_dataset function, we create a dataset. Each dataset must have a unique name. Using the create_dataset_item function, we can add items to the dataset. After running this code, you should see your dataset in the Langfuse UI.
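Before uploading items, it can help to check that every entry in the JSON file has the fields the upload code reads. The sketch below is one possible check. The function name is hypothetical, and the required keys are taken from the classifier example in this article.

```python
import json

REQUIRED_KEYS = {"input", "expected_output"}


def validate_items(raw_json: str) -> list[dict]:
    """Parse a JSON array of dataset items and verify the required keys."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("Dataset file must contain a JSON array of items.")
    for index, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"Item {index} is missing keys: {sorted(missing)}")
    return items


sample = '[{"input": "I absolutely love this product!", "expected_output": "positive"}]'
items = validate_items(sample)
print(f"{len(items)} valid item(s)")
```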

Now we can run experiments for our prompts. There are two ways to run experiments: in the UI or via the API SDK.

Run Experiments via UI
The experiment begins with two key choices: selecting the prompt version and assigning the model. Here, you decide which version of the prompt to test and which model will run it (e.g., gemini-2.5-flash). This setup keeps your tests consistent and easy to repeat.

We should add an LLM connection here so that we can choose models.

Then we need to choose the dataset for the experiment. You should make sure the dataset matches the variables in your prompt. Langfuse helps with this automatically. The green “Valid configuration” box shows that the system has checked your dataset against your prompt placeholders (like input). This ensures your experiment runs smoothly without missing data.

Now it’s time to decide how we measure success. Here, you pick Evaluators to automatically score your experiment results. Langfuse provides several managed evaluators like Correctness, Relevance, and Hallucination Detection.

For our case, Correctness is the most suitable evaluator because our app classifies text into four categories: Positive, Negative, Harmful, and Neutral. Choosing the right metric ensures the results are meaningful and consistent.

After setting up the evaluator, we can view the experiment overview, which shows the prompt, model, dataset name, and more.

When the experiment is finished, we can see the evaluator score, model output, and expected output in the dataset’s experiments section. This is how you can manage your datasets and experiments in the UI. Next, let’s explore an API SDK example.

Run Experiments via API SDK

To run an experiment, first get the dataset from Langfuse and pass it to the run_experiment function. This function takes the experiment name, data, task, and evaluators.

The task is the function that runs on each dataset item; in practice, it is your LLM application. In our example, it could be classify_text or extract_entities.

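model = "nvidia/nemotron-3-nano-30b-a3b:free"
# Assumption: short_name is a readable label derived from the model ID
short_name = model.split("/")[-1]
langfuse = get_langfuse_client()
extractor_dataset = langfuse.get_dataset("entity-extractor-v1")


def make_extractor_task(model: str):
    def task(*, item, **kwargs):
        # Assumes extract_entities accepts a `model` argument, unlike the version shown earlier
        result = extract_entities(item.input, model=model)
        return result.model_dump()
    return task


print(f"\n=== Extractor: {model} ===")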
ext_result = langfuse.run_experiment(
    name=f"extractor-{short_name}",
    data=extractor_dataset.items,
    task=make_extractor_task(model),
    evaluators=[entity_recall_evaluator, hallucination_evaluator],
)
print(ext_result.format())

Evaluators are functions that compare the task’s output with the expected results. Langfuse comes with built-in evaluators, but you can also write your own.

A custom evaluator is a function that receives keyword arguments such as input, output, and expected_output. The example below takes the input and the output, checks whether every extracted entity actually appears in the source text, and assigns a score of 1 if nothing is hallucinated and 0 otherwise. The result is returned as an Evaluation object, which includes the score and a comment listing any hallucinated entities.

from langfuse import Evaluation


def hallucination_evaluator(*, output, input, **kwargs):
    """Check if any extracted entity is NOT in the source text."""
    text_lower = input.lower()
    hallucinated = []
    for key in ["persons", "locations", "organizations"]:
        for entity in output.get(key, []):
            if entity.lower() not in text_lower:
                hallucinated.append(f"{key}: {entity}")
    score = 1.0 if not hallucinated else 0.0
    return Evaluation(
        name="no_hallucination",
        value=score,
        data_type="BOOLEAN",
        comment="Clean" if not hallucinated else f"Hallucinated: {hallucinated}",
    )
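The experiment above also passes an entity_recall_evaluator, which is not shown in the article. A minimal sketch of the scoring logic it might use is given below. The function name, the case-insensitive matching, and the assumption that expected_output is a dict with the same keys as ExtractedEntities are all guesses made for illustration.

```python
ENTITY_KEYS = ["persons", "locations", "dates", "organizations"]


def entity_recall(output: dict, expected_output: dict) -> float:
    """Share of expected entities that the model actually extracted."""
    expected = {
        (key, entity.lower())
        for key in ENTITY_KEYS
        for entity in expected_output.get(key, [])
    }
    found = {
        (key, entity.lower())
        for key in ENTITY_KEYS
        for entity in output.get(key, [])
    }
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(expected & found) / len(expected)


expected = {"persons": ["Ada Lovelace"], "locations": ["London", "Paris"]}
output = {"persons": ["ada lovelace"], "locations": ["London"]}
print(f"Entity recall: {entity_recall(output, expected):.2f}")
```

To use this in an experiment, the returned value would be wrapped in an Evaluation object in the same way hallucination_evaluator does.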

When we run these experiments, we can see the outputs directly in the terminal. In Langfuse, if we go to the dataset section, we should see our experiments listed. Clicking on each experiment shows detailed results, letting us review how the model performed on every item.

The experiment results show zero cost because I used the model nvidia/nemotron-3-nano-30b-a3b:free through OpenRouter. Since this is a free model, no usage cost is recorded in Langfuse.

You can compare different experiment results by clicking the Compare button in the top right corner. This view lets you compare outputs, latency, cost, and your evaluator results side by side.

✨ Langfuse LLM-as-a-Judge

We mentioned G-Eval and the LLM-as-a-judge methodology in the model-based evaluation section. LLM-as-a-judge is a simple idea: instead of writing functions to evaluate a model’s output, we use another LLM to do the evaluation.

For example, instead of checking if two outputs match exactly, we can ask a model questions like “Is this answer correct?” or “Does this output match the expected result?” The model then acts as a judge and gives a decision.
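Outside of Langfuse, the same idea can be sketched in a few lines. The code below is an illustration only: the prompt wording, the CORRECT/INCORRECT protocol, and all function names are assumptions, and the judge model is passed in as a plain callable so the example runs with a stand-in instead of a real LLM.

```python
from typing import Callable, Optional

JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Expected answer: {expected}\n"
    "Model answer: {answer}\n"
    "Reply with CORRECT or INCORRECT, followed by a short reason."
)


def build_judge_prompt(question: str, answer: str, expected: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, answer=answer, expected=expected)


def parse_verdict(reply: str) -> Optional[float]:
    """Map the judge's first word to a score; None if the reply is unusable."""
    words = reply.strip().split()
    if not words:
        return None
    first_word = words[0].strip(".,:").upper()
    return {"CORRECT": 1.0, "INCORRECT": 0.0}.get(first_word)


def judge(question: str, answer: str, expected: str,
          call_llm: Callable[[str], str]) -> Optional[float]:
    """Ask the judge model for a verdict and convert it to a score."""
    return parse_verdict(call_llm(build_judge_prompt(question, answer, expected)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "CORRECT. The labels match."


score = judge("Classify: 'I love this product!'", "positive", "positive", fake_llm)
print("Judge score:", score)
```

Returning None for unparseable replies keeps bad judge outputs from being silently counted as failures.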

To use this in Langfuse, you need to set up an evaluator and define a default model in the LLM connections section. This model will be used as the judge.

Wait a minute… we already did this.

If you’ve run experiments in the UI, you’ve probably used it without even noticing. When you select the “Correctness” evaluator in Langfuse, it already works as an LLM-as-a-judge. Instead of doing a strict comparison, it uses another model to decide whether the output is correct based on the expected result.

In the LLM-as-a-Judge section, you can see how many times it was used, which evaluator was used, etc.

You can also filter LLM-as-a-Judge traces in the Tracing section and view inputs, outputs, costs, etc., when LLM-as-a-Judge runs.

✨ Langfuse Dashboards

In the Home section, you can explore dashboards showing your metrics, such as traces, model costs, and scores. Use charts here to analyze and track your LLM’s performance.

In the Dashboard section, you can create custom dashboards for your needs.
