Talk "Beyond Accuracy: Rethinking Evaluation for LLM Classifiers"
Framework for evaluating LLMs in production classification systems using three components: golden datasets for ground truth benchmarking, LLMs as judge for semantic validation, and human feedback for real-world performance tracking.
The talk was presented at Munich Datageeks - November Edition 2025.
Abstract
This talk presents a practical framework for evaluating Large Language Models (LLMs) in real-world classification scenarios. Using IT support ticket classification as a primary example, the speaker addresses the deceptive simplicity of LLM-based classification and the significant challenges that arise in production environments. The presentation introduces a three-component evaluation toolbox consisting of golden datasets, LLMs as judge, and human feedback mechanisms. Golden datasets provide technical ground truth and enable reproducibility checks during development. LLMs as judge offer semantic evaluation and robustness testing in production. Human feedback captures real-world system performance through user interactions. The framework addresses unique LLM challenges including ambiguity, inconsistency, and drift while mapping technical metrics to business impact. The methodology extends beyond classification to other LLM applications such as information extraction and Retrieval Augmented Generation (RAG), providing a comprehensive approach to continuous evaluation throughout the project lifecycle from development to production deployment.
About the Speaker
Alisa works as a data scientist at EON Digital Technology in the data and AI department. She is part of an internal consultancy unit with over 150 data scientists and data engineers that provides data and AI services across the entire EON group. Her work focuses on various use cases including grid efficiency and stability, as well as customer service improvements in the energy retail sector, which is her primary area of expertise. In this presentation, she shares insights from her experience implementing and evaluating LLM-based classification systems in production environments.
Transcript Summary
The Deceptive Simplicity of LLM Classification
LLM-based classification appears straightforward compared to traditional machine learning workflows from five years ago. The basic process involves defining inputs, writing a prompt, running it through a model, and obtaining labels. This eliminates the need for extensive data labeling, model architecture selection, hyperparameter tuning, and train-test splits. However, this apparent simplicity is misleading when applied to realistic scenarios.
The speaker emphasizes that real-world classification problems involve differentiating among 50 to 150 classes that are often ambiguous and semantically similar. Using the analogy of classifying 150 siblings rather than cats and dogs, she illustrates the inherent difficulty. Misclassifications in IT support scenarios lead to tickets being routed to the wrong teams, causing delays and increased costs.
A critical insight is that while LLM predictions are becoming increasingly cost-effective, obtaining reliable evidence of their performance remains expensive. The speaker estimates that only 20% of time is spent developing LLM solutions, while 80% is dedicated to evaluation.
Challenges in Real-World Classification Scenarios
General Classification Challenges
Real-world classification problems present several inherent difficulties:
- High cardinality: Systems typically need to handle 50 to 200 different classes, making the problem inherently complex
- Hierarchical structure: Classes often exist in multiple levels with different teams responsible for different groupings, making flat classification inadequate
- Overlapping semantics: Classes may have ambiguous boundaries, such as email clients potentially belonging to both email and application categories
- Multi-label scenarios: Users frequently experience multiple simultaneous issues, requiring the system to identify several valid concerns in a single support request
- Long-tail class imbalance: Common issues like password resets occur frequently while critical issues like firewall problems or certificate expirations remain rare
LLM-Specific Challenges
LLMs introduce three additional dimensions of complexity:
Ambiguity: While LLMs excel at understanding textual nuance, their effectiveness depends on well-defined classes. The problem is compounded when human annotators themselves disagree on classifications. The speaker suggests measuring human-to-human Cohen's kappa to establish a ceiling on achievable accuracy, as low human agreement indicates inherent dataset ambiguity.
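As one way to quantify this ceiling, the sketch below computes Cohen's kappa between two annotators with scikit-learn; the label names and sample are illustrative, not taken from the talk.

```python
# A minimal sketch: estimating the human agreement ceiling with Cohen's kappa,
# assuming two annotators have labeled the same sample of tickets.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["vpn", "email", "hardware", "vpn", "password"]
annotator_b = ["vpn", "application", "hardware", "network", "password"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Human-to-human Cohen's kappa: {kappa:.2f}")
# A low kappa suggests the taxonomy itself is ambiguous, so the LLM's
# achievable accuracy is capped accordingly.
```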
Inconsistency: LLMs are inherently non-deterministic, meaning identical inputs may produce different outputs. This includes both natural output variation and prompt sensitivity, where minor prompt modifications can significantly alter predictions. The goal is optimizing the probability that the LLM consistently predicts the same class for identical inputs.
Drift: Systems experience instability over time through multiple mechanisms. Data drift occurs when new ticket types emerge. Concept drift happens when class definitions evolve. Model drift is particularly pronounced with black-box LLMs, as providers like OpenAI can update models without transparency. The speaker recommends measuring Jensen-Shannon divergence between distributions to detect shifts, though specific threshold values must be empirically determined for each use case.
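A minimal sketch of the drift check, assuming we compare the predicted class distribution of a baseline window against the current one with SciPy; the class names, probabilities, and alert threshold are illustrative.

```python
# A minimal sketch: detecting prediction drift with Jensen-Shannon divergence,
# comparing the class distributions of two time windows (or two model versions).
import numpy as np
from scipy.spatial.distance import jensenshannon

baseline = np.array([0.55, 0.25, 0.18, 0.02])  # reference distribution: password, vpn, email, firewall
current = np.array([0.40, 0.30, 0.22, 0.08])   # distribution observed this week

# SciPy returns the JS *distance* (square root of the divergence); base=2 keeps it in [0, 1]
js_distance = jensenshannon(baseline, current, base=2)
print(f"Jensen-Shannon distance: {js_distance:.3f}")

# The alert threshold is use-case specific and must be tuned empirically
DRIFT_THRESHOLD = 0.1  # illustrative value only
if js_distance > DRIFT_THRESHOLD:
    print("Distribution shift detected - investigate data, concept, or model drift.")
```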
The Importance of Business-Aligned Evaluation
The speaker emphasizes moving beyond purely technical metrics to understand real-world impact. High accuracy can be misleading when class distributions are imbalanced. For example, achieving 80% accuracy by correctly classifying common, easily-resolved network issues while failing on rare but costly firewall problems (representing only 1% of data) creates negative business outcomes.
Poor classification performance manifests in concrete business metrics including increased mean time to resolution, higher bounce rates between agents, and elevated operational costs. Critical error classes must be identified and weighted appropriately during evaluation to align technical performance with business objectives.
Component 1: Golden Datasets
Golden datasets serve as the technical ground truth, consisting of labeled cases curated by multiple human annotators to reduce bias. These datasets enable systematic model benchmarking across different configurations, including various model types, zero-shot versus few-shot approaches, and different prompt formulations.
Confusion Matrix Analysis
Confusion matrices provide powerful diagnostic capabilities by revealing:
- Which classes are poorly predicted
- Which classes are frequently confused with each other
- Performance patterns for rare but important classes
All elements outside the diagonal indicate misclassifications requiring investigation.
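A minimal sketch of this diagnostic, assuming golden-dataset labels and model predictions are available as parallel lists; it ranks the most frequent off-diagonal confusions with scikit-learn, using invented labels.

```python
# A minimal sketch: surfacing the most frequent off-diagonal confusions
# on a golden dataset. Labels and data are illustrative.
from collections import Counter
from sklearn.metrics import confusion_matrix

y_true = ["vpn", "email", "firewall", "vpn", "email", "network"]
y_pred = ["vpn", "application", "network", "vpn", "email", "vpn"]

labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Count every off-diagonal cell: (true class, predicted class) -> frequency
confusions = Counter()
for i, true_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        if i != j and cm[i, j] > 0:
            confusions[(true_label, pred_label)] = cm[i, j]

for (true_label, pred_label), count in confusions.most_common(5):
    print(f"{true_label} misclassified as {pred_label}: {count}x")
```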
Measurable Metrics
Golden datasets enable calculation of standard classification metrics including per-class and overall precision, recall, and various F1 score variants (macro, micro, weighted). These provide quantitative performance baselines.
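A minimal sketch of these metrics with scikit-learn, reusing the same illustrative labels; macro, micro, and weighted F1 differ mainly in how much weight they give to rare classes.

```python
# A minimal sketch: standard classification metrics on a golden dataset.
from sklearn.metrics import classification_report, f1_score

y_true = ["vpn", "email", "firewall", "vpn", "email", "network"]
y_pred = ["vpn", "application", "network", "vpn", "email", "vpn"]

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, zero_division=0))

# Aggregate F1 variants: macro treats all classes equally (important for long tails),
# micro is dominated by frequent classes, weighted sits in between.
for avg in ("macro", "micro", "weighted"):
    print(f"F1 ({avg}): {f1_score(y_true, y_pred, average=avg, zero_division=0):.2f}")
```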
System Stability Assessment
Running different model types on identical golden datasets reveals distribution shifts in predictions. This comparison helps determine which models are more favorable and whether new models maintain or improve upon baseline performance. Versioned golden datasets support regression testing, ensuring new models achieve at least equivalent performance before production deployment.
When LLMs output confidence scores alongside labels, golden datasets enable monitoring confidence degradation across model versions, providing early warning signals of performance issues.
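A minimal sketch of such a regression gate, assuming accuracy and self-reported confidence scores have already been computed for both model versions on the same versioned golden dataset; the function name and thresholds are hypothetical, not from the talk.

```python
# A minimal sketch: regression-testing a candidate model against the current one
# on a versioned golden dataset, including a check on self-reported confidence.
import numpy as np

def regression_check(accuracy_current, accuracy_candidate,
                     confidences_current, confidences_candidate,
                     max_accuracy_drop=0.0, max_confidence_drop=0.05):
    """Flag the candidate model if accuracy or mean confidence degrades."""
    accuracy_ok = accuracy_candidate >= accuracy_current - max_accuracy_drop
    confidence_ok = (np.mean(confidences_candidate)
                     >= np.mean(confidences_current) - max_confidence_drop)
    return accuracy_ok and confidence_ok

# Illustrative numbers only
passed = regression_check(
    accuracy_current=0.86, accuracy_candidate=0.87,
    confidences_current=[0.90, 0.80, 0.85], confidences_candidate=[0.88, 0.79, 0.81],
)
print("Candidate cleared for deployment" if passed else "Candidate blocked: regression detected")
```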
Business Impact Mapping
Golden datasets support translating technical metrics into business value. By incorporating information about class severity and resolution costs, teams can estimate:
- Overall cost savings from automation
- Reductions in average handling time
- Projected decreases in handover rates
This translation provides stakeholders with concrete evidence of system value and justifies continued investment.
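A minimal sketch of this translation, assuming a hypothetical table of per-class misrouting costs and monthly error counts taken from the confusion matrix; all numbers are illustrative.

```python
# A minimal sketch: translating per-class errors into an estimated business cost.
misrouting_cost_eur = {"password": 5, "network": 20, "firewall": 500}  # illustrative costs
errors_per_month = {"password": 40, "network": 15, "firewall": 3}      # from the confusion matrix

total_cost = sum(errors_per_month[c] * misrouting_cost_eur[c] for c in errors_per_month)
print(f"Estimated monthly cost of misclassification: {total_cost} EUR")
# Note how 3 firewall errors (1,500 EUR) outweigh 40 password errors (200 EUR):
# this is why accuracy alone is a poor proxy for business impact.
```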
Component 2: LLMs as Judge
LLMs as judge provides semantic evaluation by using one LLM to assess another's predictions. The process involves running initial classification with one model (e.g., Gemini), then using a different model family (e.g., GPT-4) to evaluate prediction quality on the given text. The judge outputs a score from zero to one indicating correctness, along with reasoning.
In the example provided, when an email mentions both VPN issues and rejected credentials, the judge might score the VPN-only prediction at 0.8, recognizing partial correctness while noting the missed credential problem.
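A minimal sketch of a judge call, assuming a generic call_llm helper that wraps whichever provider SDK is used for the second model family; the prompt wording and JSON schema are illustrative, not the speaker's exact setup.

```python
# A minimal sketch of an LLM-as-judge call; call_llm is a hypothetical helper
# that sends a prompt to a second model family and returns its text response.
import json

JUDGE_PROMPT = """You are evaluating a ticket classifier.
Ticket text:
{ticket}

Predicted class: {prediction}

Return JSON with:
- "score": a number from 0 to 1 indicating how correct the prediction is
- "reasoning": a short explanation, including any missed concerns
"""

def judge_prediction(ticket: str, prediction: str, call_llm) -> dict:
    """Ask a second LLM to score the first model's prediction."""
    response = call_llm(JUDGE_PROMPT.format(ticket=ticket, prediction=prediction))
    return json.loads(response)  # e.g. {"score": 0.8, "reasoning": "Missed the rejected credentials."}
```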
Production Application
LLMs as judge proves particularly valuable for production or staging data where ground truth labels are unavailable. Implementation can involve the following sampling strategies, sketched in code after this list:
- Random sampling: Evaluating a subset of predictions to manage costs, as each judgment requires an additional LLM call
- Selective sampling: Focusing evaluation on critical classes like firewall issues, especially important given long-tail distributions where random sampling would rarely capture rare classes
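A minimal sketch combining both strategies, assuming predictions arrive as dictionaries with a predicted_class field; the critical class names and sampling rate are illustrative.

```python
# A minimal sketch: choosing which predictions to send to the judge.
import random

CRITICAL_CLASSES = {"firewall", "certificate"}

def sample_for_judging(predictions, random_rate=0.05):
    """Judge every prediction in a critical class, plus a random slice of the rest."""
    critical = [p for p in predictions if p["predicted_class"] in CRITICAL_CLASSES]
    rest = [p for p in predictions if p["predicted_class"] not in CRITICAL_CLASSES]
    randomly_sampled = random.sample(rest, k=max(1, int(len(rest) * random_rate))) if rest else []
    return critical + randomly_sampled
```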
Judge-to-Model Agreement Rate
This metric tracks how frequently the judge agrees with the original model's predictions. A baseline agreement rate (e.g., 94%) establishes normal system behavior. Declining agreement rates signal system degradation without diagnosing specific causes, prompting deeper investigation before business metrics are impacted.
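A minimal sketch of this tracking, assuming judge scores above 0.5 count as agreement; the baseline value and alert margin are illustrative.

```python
# A minimal sketch: monitoring judge-to-model agreement against a baseline.
def agreement_rate(judge_scores, agree_above=0.5):
    """Share of judged predictions the judge considers correct."""
    return sum(score > agree_above for score in judge_scores) / len(judge_scores)

BASELINE_AGREEMENT = 0.94  # established during a healthy reference period (illustrative)

weekly_scores = [0.9, 1.0, 0.3, 0.8, 1.0, 0.7, 0.2, 1.0]
rate = agreement_rate(weekly_scores)
if rate < BASELINE_AGREEMENT - 0.05:
    print(f"Agreement dropped to {rate:.0%} - investigate before business metrics degrade.")
```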
Rejected Class Analysis
Examining which classes receive frequent rejections from the judge reveals systematic confusion patterns. For instance, discovering that network issues are often rejected and identified as hardware issues highlights specific areas requiring prompt refinement or class definition clarification.
Emerging Class Detection
When the judge consistently indicates uncertainty about predictions, responding that it cannot classify the input into existing categories, this signals emerging ticket types not covered by the current taxonomy. This early warning enables proactive prompt updates and taxonomy expansion before significant misclassification volumes accumulate.
Component 3: Human Feedback
Human feedback captures genuine real-world system performance by tracking how end users interact with predictions. In IT support scenarios, agents working in the CRM system may correct automatically assigned labels when they disagree with the model's classification.
Feedback Capture Mechanism
The system logs label corrections by tracking:
- Which ticket was modified
- What the original predicted label was
- What the corrected label is
This creates an audit trail enabling subsequent analysis and metric calculation.
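A minimal sketch of such an audit trail as an append-only JSONL log; the field names and file layout are assumptions, not the speaker's CRM schema.

```python
# A minimal sketch: logging one human label correction for later analysis.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class LabelCorrection:
    ticket_id: str
    predicted_label: str
    corrected_label: str
    corrected_at: str

def log_correction(ticket_id: str, predicted: str, corrected: str, log_path="corrections.jsonl"):
    """Append one correction record to an audit log."""
    record = LabelCorrection(ticket_id, predicted, corrected,
                             datetime.now(timezone.utc).isoformat())
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_correction("TCK-1042", predicted="network", corrected="firewall")
```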
Correction Pattern Analysis
Clustering corrected issues reveals systematic gaps in model performance. For example, discovering that the system frequently misses multi-factor authentication problems suggests either adding a dedicated class for MFA issues or refining prompts to better recognize this pattern. This analysis creates a virtuous cycle where human corrections improve the system.
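A minimal sketch of this clustering step using TF-IDF vectors and k-means; the talk does not prescribe a specific method, and the ticket texts are invented examples.

```python
# A minimal sketch: clustering the text of corrected tickets to surface systematic gaps.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corrected_tickets = [
    "Authenticator app code not accepted when logging in",
    "Push notification for MFA never arrives",
    "Cannot approve sign-in request on my phone",
    "Printer on 3rd floor jams on every duplex job",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(corrected_tickets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Print one representative ticket per cluster; a large MFA-heavy cluster would
# suggest adding a dedicated class or refining the prompt.
for cluster_id in sorted(set(labels)):
    examples = [t for t, l in zip(corrected_tickets, labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {examples[0]} (+{len(examples) - 1} more)")
```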
Building Enhanced Golden Datasets
Human-corrected labels represent high-quality, production-aligned ground truth. These corrections can be selectively incorporated back into golden datasets, ensuring evaluation data remains representative of real-world usage patterns and emerging ticket types.
Misclassification Rate Tracking
Monitoring how frequently agents correct predictions provides a direct performance indicator. Lower misclassification rates correlate with reduced mean time to resolution and decreased operational costs. Rising misclassification rates trigger investigation and potential interventions such as:
- Prompt refinement based on error patterns
- Switching to few-shot classification with human-labeled examples
- Architectural changes to the classification pipeline
Rerouted Ticket Monitoring
Tracking ticket bounce rates between agents measures the impact on service level agreements. High handover rates indicate classification errors forcing tickets through multiple teams before resolution. Steep increases suggest a need for escalation mechanisms or improved routing rules to ensure faster resolution.
Critical Issue Performance
Analyzing which issue types receive corrections identifies performance gaps in high-severity classes. For critical classes like firewall issues, targeted improvements may include:
- Specialized prompts for specific problem domains
- More detailed class descriptions
- Splitting overly broad classes into more granular categories
Extension to Other LLM Applications
The three-component framework generalizes beyond classification to other LLM use cases, demonstrating its versatility.
Information Extraction
For extracting fields from invoices:
- Golden datasets: Hand-labeled document collections mapping documents to expected extracted fields
- LLMs as judge: Scoring extraction quality from zero to one in production, evaluating whether extracted fields are correct
- Human feedback: Tracking how frequently the finance team corrects extracted values
Retrieval Augmented Generation (RAG)
For question-answering systems:
- Golden datasets: Collections of questions paired with expected retrieved documents, assessing retrieval quality
- LLMs as judge: Fact-checking generated answers for completeness and hallucination detection
- Human feedback: Monitoring rejection signals such as thumbs-down ratings or users rephrasing questions due to unsatisfactory initial answers
These patterns indicate the framework's applicability across diverse LLM problem types while maintaining consistent evaluation principles.
Continuous Evaluation Framework
The three components work synergistically to answer complementary questions throughout the system lifecycle:
- Golden datasets: How good is our model on known ground truth defined by human annotators?
- LLMs as judge: Are we still performing well? This functions as a guardrail indicating whether the system moves in the right direction
- Human feedback: How does our model behave in reality with actual users?
Combining these metrics in dashboards targeted at different stakeholder levels—from developers to business decision-makers—enables comprehensive system monitoring. This approach supports stress testing, assumption validation, learning from production behavior, and iterative improvement.
The Driving Analogy
The speaker concludes with an intuitive analogy comparing the evaluation framework to driving from point A to point B:
- Golden datasets function as the navigation system, indicating the intended trajectory, speed, and progress toward the destination
- LLMs as judge act like road barriers preventing the car from steering into the woods and maintaining course
- Human feedback resembles a child in the back seat asking "are we there yet?"—sometimes brutally honest but equally valuable for understanding journey progress
This analogy emphasizes how the three components provide different but essential perspectives on system performance, working together to ensure successful deployment and operation.