Talk "Beyond Accuracy: Rethinking Evaluation for LLM Classifiers"
Framework for evaluating LLMs in production classification systems using three components: golden datasets for ground truth benchmarking, LLMs as judge for semantic validation, and human feedback for real-world performance tracking.
The talk was presented at Munich Datageeks - November Edition 2025.
Abstract
This talk presents a practical framework for evaluating Large Language Models (LLMs) in real-world classification scenarios. Using IT support ticket classification as a primary example, the speaker addresses the deceptive simplicity of LLM-based classification and the significant challenges that arise in production environments. The presentation introduces a three-component evaluation toolbox consisting of golden datasets, LLMs as judge, and human feedback mechanisms. Golden datasets provide technical ground truth and enable reproducibility checks during development. LLMs as judge offer semantic evaluation and robustness testing in production. Human feedback captures real-world system performance through user interactions. The framework addresses unique LLM challenges including ambiguity, inconsistency, and drift while mapping technical metrics to business impact. The methodology extends beyond classification to other LLM applications such as information extraction and Retrieval Augmented Generation (RAG), providing a comprehensive approach to continuous evaluation throughout the project lifecycle from development to production deployment.
About the Speaker
Alisa works as a data scientist at EON Digital Technology in the data and AI department. She is part of an internal consultancy unit with over 150 data scientists and data engineers that provides data and AI services across the entire EON group. Her work focuses on various use cases including grid efficiency and stability, as well as customer service improvements in the energy retail sector, which is her primary area of expertise. In this presentation, she shares insights from her experience implementing and evaluating LLM-based classification systems in production environments.
Transcript Summary
The Deceptive Simplicity of LLM Classification
LLM-based classification appears straightforward compared to traditional machine learning workflows from five years ago. The basic process involves defining inputs, writing a prompt, running it through a model, and obtaining labels. This eliminates the need for extensive data labeling, model architecture selection, hyperparameter tuning, and train-test splits. However, this apparent simplicity is misleading when applied to realistic scenarios.
The speaker emphasizes that real-world classification problems involve differentiating among 50 to 150 classes that are often ambiguous and semantically similar. Using the analogy of classifying 150 siblings rather than cats and dogs, she illustrates the inherent difficulty. Misclassifications in IT support scenarios lead to tickets being routed to the wrong teams, causing delays and increased costs.
A critical insight is that while LLM predictions are becoming increasingly cost-effective, obtaining reliable evidence of their performance remains expensive. The speaker estimates that only 20% of time is spent developing LLM solutions, while 80% is dedicated to evaluation.
Challenges in Real-World Classification Scenarios
General Classification Challenges
Real-world classification problems present several inherent difficulties:
- High cardinality: Systems typically need to handle 50 to 200 different classes, making the problem inherently complex
- Hierarchical structure: Classes often exist in multiple levels with different teams responsible for different groupings, making flat classification inadequate
- Overlapping semantics: Classes may have ambiguous boundaries, such as email clients potentially belonging to both email and application categories
- Multi-label scenarios: Users frequently experience multiple simultaneous issues, requiring the system to identify several valid concerns in a single support request
- Long-tail class imbalance: Common issues like password resets occur frequently while critical issues like firewall problems or certificate expirations remain rare
LLM-Specific Challenges
LLMs introduce three additional dimensions of complexity:
Ambiguity: While LLMs excel at understanding textual nuance, their effectiveness depends on well-defined classes. The problem is compounded when human annotators themselves disagree on classifications. The speaker suggests measuring human-to-human Cohen's kappa to establish a ceiling on achievable accuracy, as low human agreement indicates inherent dataset ambiguity.
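As one way to quantify this ceiling, the sketch below computes Cohen's kappa between two annotators with scikit-learn; the label names and sample are illustrative, not taken from the talk.

```python
# A minimal sketch: estimating the human agreement ceiling with Cohen's kappa,
# assuming two annotators have labeled the same sample of tickets.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["vpn", "email", "hardware", "vpn", "password"]
annotator_b = ["vpn", "application", "hardware", "network", "password"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Human-to-human Cohen's kappa: {kappa:.2f}")
# A low kappa suggests the taxonomy itself is ambiguous, so the LLM's
# achievable accuracy is capped accordingly.
```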
Inconsistency: LLMs are inherently non-deterministic, meaning identical inputs may produce different outputs. This includes both natural output variation and prompt sensitivity, where minor prompt modifications can significantly alter predictions. The goal is optimizing the probability that the LLM consistently predicts the same class for identical inputs.
Drift: Systems experience instability over time through multiple mechanisms. Data drift occurs when new ticket types emerge. Concept drift happens when class definitions evolve. Model drift is particularly pronounced with black-box LLMs, as providers like OpenAI can update models without transparency. The speaker recommends measuring Jensen-Shannon divergence between distributions to detect shifts, though specific threshold values must be empirically determined for each use case.
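A minimal sketch of the drift check, assuming we compare the predicted class distribution of a baseline window against the current one with SciPy; the class names, probabilities, and alert threshold are illustrative.

```python
# A minimal sketch: detecting prediction drift with Jensen-Shannon divergence,
# comparing the class distributions of two time windows (or two model versions).
import numpy as np
from scipy.spatial.distance import jensenshannon

baseline = np.array([0.55, 0.25, 0.18, 0.02])  # reference distribution: password, vpn, email, firewall
current = np.array([0.40, 0.30, 0.22, 0.08])   # distribution observed this week

# SciPy returns the JS *distance* (square root of the divergence); base=2 keeps it in [0, 1]
js_distance = jensenshannon(baseline, current, base=2)
print(f"Jensen-Shannon distance: {js_distance:.3f}")

# The alert threshold is use-case specific and must be tuned empirically
DRIFT_THRESHOLD = 0.1  # illustrative value only
if js_distance > DRIFT_THRESHOLD:
    print("Distribution shift detected - investigate data, concept, or model drift.")
```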
The Importance of Business-Aligned Evaluation
The speaker emphasizes moving beyond purely technical metrics to understand real-world impact. High accuracy can be misleading when class distributions are imbalanced. For example, achieving 80% accuracy by correctly classifying common, easily-resolved network issues while failing on rare but costly firewall problems (representing only 1% of data) creates negative business outcomes.
Poor classification performance manifests in concrete business metrics including increased mean time to resolution, higher bounce rates between agents, and elevated operational costs. Critical error classes must be identified and weighted appropriately during evaluation to align technical performance with business objectives.
Component 1: Golden Datasets
Golden datasets serve as the technical ground truth, consisting of labeled cases curated by multiple human annotators to reduce bias. These datasets enable systematic model benchmarking across different configurations, including various model types, zero-shot versus few-shot approaches, and different prompt formulations.
Confusion Matrix Analysis
Confusion matrices provide powerful diagnostic capabilities by revealing:
- Which classes are poorly predicted
- Which classes are frequently confused with each other
- Performance patterns for rare but important classes
All elements outside the diagonal indicate misclassifications requiring investigation.
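A minimal sketch of this diagnostic, assuming golden-dataset labels and model predictions are available as parallel lists; it ranks the most frequent off-diagonal confusions with scikit-learn, using invented labels.

```python
# A minimal sketch: surfacing the most frequent off-diagonal confusions
# on a golden dataset. Labels and data are illustrative.
from collections import Counter
from sklearn.metrics import confusion_matrix

y_true = ["vpn", "email", "firewall", "vpn", "email", "network"]
y_pred = ["vpn", "application", "network", "vpn", "email", "vpn"]

labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Count every off-diagonal cell: (true class, predicted class) -> frequency
confusions = Counter()
for i, true_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        if i != j and cm[i, j] > 0:
            confusions[(true_label, pred_label)] = cm[i, j]

for (true_label, pred_label), count in confusions.most_common(5):
    print(f"{true_label} misclassified as {pred_label}: {count}x")
```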
Measurable Metrics
Golden datasets enable calculation of standard classification metrics including per-class and overall precision, recall, and various F1 score variants (macro, micro, weighted). These provide quantitative performance baselines.
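A minimal sketch of these metrics with scikit-learn, reusing the same illustrative labels; macro, micro, and weighted F1 differ mainly in how much weight they give to rare classes.

```python
# A minimal sketch: standard classification metrics on a golden dataset.
from sklearn.metrics import classification_report, f1_score

y_true = ["vpn", "email", "firewall", "vpn", "email", "network"]
y_pred = ["vpn", "application", "network", "vpn", "email", "vpn"]

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, zero_division=0))

# Aggregate F1 variants: macro treats all classes equally (important for long tails),
# micro is dominated by frequent classes, weighted sits in between.
for avg in ("macro", "micro", "weighted"):
    print(f"F1 ({avg}): {f1_score(y_true, y_pred, average=avg, zero_division=0):.2f}")
```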
System Stability Assessment
Running different model types on identical golden datasets reveals distribution shifts in predictions. This comparison helps determine which models are more favorable and whether new models maintain or improve upon baseline performance. Versioned golden datasets support regression testing, ensuring new models achieve at least equivalent performance before production deployment.
When LLMs output confidence scores alongside labels, golden datasets enable monitoring confidence degradation across model versions, providing early warning signals of performance issues.
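A minimal sketch of such a regression gate, assuming accuracy and self-reported confidence scores have already been computed for both model versions on the same versioned golden dataset; the function name and thresholds are hypothetical, not from the talk.

```python
# A minimal sketch: regression-testing a candidate model against the current one
# on a versioned golden dataset, including a check on self-reported confidence.
import numpy as np

def regression_check(accuracy_current, accuracy_candidate,
                     confidences_current, confidences_candidate,
                     max_accuracy_drop=0.0, max_confidence_drop=0.05):
    """Flag the candidate model if accuracy or mean confidence degrades."""
    accuracy_ok = accuracy_candidate >= accuracy_current - max_accuracy_drop
    confidence_ok = (np.mean(confidences_candidate)
                     >= np.mean(confidences_current) - max_confidence_drop)
    return accuracy_ok and confidence_ok

# Illustrative numbers only
passed = regression_check(
    accuracy_current=0.86, accuracy_candidate=0.87,
    confidences_current=[0.90, 0.80, 0.85], confidences_candidate=[0.88, 0.79, 0.81],
)
print("Candidate cleared for deployment" if passed else "Candidate blocked: regression detected")
```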
Business Impact Mapping
Golden datasets support translating technical metrics into business value. By incorporating information about class severity and resolution costs, teams can estimate:
- Overall cost savings from automation
- Reductions in average handling time
- Projected decreases in handover rates
This translation provides stakeholders with concrete evidence of system value and justifies continued investment.
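A minimal sketch of this translation, assuming a hypothetical table of per-class misrouting costs and monthly error counts taken from the confusion matrix; all numbers are illustrative.

```python
# A minimal sketch: translating per-class errors into an estimated business cost.
misrouting_cost_eur = {"password": 5, "network": 20, "firewall": 500}  # illustrative costs
errors_per_month = {"password": 40, "network": 15, "firewall": 3}      # from the confusion matrix

total_cost = sum(errors_per_month[c] * misrouting_cost_eur[c] for c in errors_per_month)
print(f"Estimated monthly cost of misclassification: {total_cost} EUR")
# Note how 3 firewall errors (1,500 EUR) outweigh 40 password errors (200 EUR):
# this is why accuracy alone is a poor proxy for business impact.
```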
Component 2: LLMs as Judge
LLMs as judge provides semantic evaluation by using one LLM to assess another's predictions. The process involves running initial classification with one model (e.g., Gemini), then using a different model family (e.g., GPT-4) to evaluate prediction quality on the given text. The judge outputs a score from zero to one indicating correctness, along with reasoning.
In the example provided, when an email mentions both VPN issues and rejected credentials, the judge might score the VPN-only prediction at 0.8, recognizing partial correctness while noting the missed credential problem.
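A minimal sketch of a judge call, assuming a generic call_llm helper that wraps whichever provider SDK is used for the second model family; the prompt wording and JSON schema are illustrative, not the speaker's exact setup.

```python
# A minimal sketch of an LLM-as-judge call; call_llm is a hypothetical helper
# that sends a prompt to a second model family and returns its text response.
import json

JUDGE_PROMPT = """You are evaluating a ticket classifier.
Ticket text:
{ticket}

Predicted class: {prediction}

Return JSON with:
- "score": a number from 0 to 1 indicating how correct the prediction is
- "reasoning": a short explanation, including any missed concerns
"""

def judge_prediction(ticket: str, prediction: str, call_llm) -> dict:
    """Ask a second LLM to score the first model's prediction."""
    response = call_llm(JUDGE_PROMPT.format(ticket=ticket, prediction=prediction))
    return json.loads(response)  # e.g. {"score": 0.8, "reasoning": "Missed the rejected credentials."}
```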
Production Application
LLMs as judge proves particularly valuable for production or staging data where ground truth labels are unavailable. Implementation can involve the following sampling strategies, sketched in code after this list:
- Random sampling: Evaluating a subset of predictions to manage costs, as each judgment requires an additional LLM call
- Selective sampling: Focusing evaluation on critical classes like firewall issues, especially important given long-tail distributions where random sampling would rarely capture rare classes
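A minimal sketch combining both strategies, assuming predictions arrive as dictionaries with a predicted_class field; the critical class names and sampling rate are illustrative.

```python
# A minimal sketch: choosing which predictions to send to the judge.
import random

CRITICAL_CLASSES = {"firewall", "certificate"}

def sample_for_judging(predictions, random_rate=0.05):
    """Judge every prediction in a critical class, plus a random slice of the rest."""
    critical = [p for p in predictions if p["predicted_class"] in CRITICAL_CLASSES]
    rest = [p for p in predictions if p["predicted_class"] not in CRITICAL_CLASSES]
    randomly_sampled = random.sample(rest, k=max(1, int(len(rest) * random_rate))) if rest else []
    return critical + randomly_sampled
```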
Judge-to-Model Agreement Rate
This metric tracks how frequently the judge agrees with the original model's predictions. A baseline agreement rate (e.g., 94%) establishes normal system behavior. Declining agreement rates signal system degradation without diagnosing specific causes, prompting deeper investigation before business metrics are impacted.
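A minimal sketch of this tracking, assuming judge scores above 0.5 count as agreement; the baseline value and alert margin are illustrative.

```python
# A minimal sketch: monitoring judge-to-model agreement against a baseline.
def agreement_rate(judge_scores, agree_above=0.5):
    """Share of judged predictions the judge considers correct."""
    return sum(score > agree_above for score in judge_scores) / len(judge_scores)

BASELINE_AGREEMENT = 0.94  # established during a healthy reference period (illustrative)

weekly_scores = [0.9, 1.0, 0.3, 0.8, 1.0, 0.7, 0.2, 1.0]
rate = agreement_rate(weekly_scores)
if rate < BASELINE_AGREEMENT - 0.05:
    print(f"Agreement dropped to {rate:.0%} - investigate before business metrics degrade.")
```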
Rejected Class Analysis
Examining which classes receive frequent rejections from the judge reveals systematic confusion patterns. For instance, discovering that network issues are often rejected and identified as hardware issues highlights specific areas requiring prompt refinement or class definition clarification.
Emerging Class Detection
When the judge consistently indicates uncertainty about predictions, responding that it cannot classify the input into existing categories, this signals emerging ticket types not covered by the current taxonomy. This early warning enables proactive prompt updates and taxonomy expansion before significant misclassification volumes accumulate.
Component 3: Human Feedback
Human feedback captures genuine real-world system performance by tracking how end users interact with predictions. In IT support scenarios, agents working in the CRM system may correct automatically assigned labels when they disagree with the model's classification.
Feedback Capture Mechanism
The system logs label corrections by tracking:
- Which ticket was modified
- What the original predicted label was
- What the corrected label is
This creates an audit trail enabling subsequent analysis and metric calculation.
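A minimal sketch of such an audit trail as an append-only JSONL log; the field names and file layout are assumptions, not the speaker's CRM schema.

```python
# A minimal sketch: logging one human label correction for later analysis.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class LabelCorrection:
    ticket_id: str
    predicted_label: str
    corrected_label: str
    corrected_at: str

def log_correction(ticket_id: str, predicted: str, corrected: str, log_path="corrections.jsonl"):
    """Append one correction record to an audit log."""
    record = LabelCorrection(ticket_id, predicted, corrected,
                             datetime.now(timezone.utc).isoformat())
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_correction("TCK-1042", predicted="network", corrected="firewall")
```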
Correction Pattern Analysis
Clustering corrected issues reveals systematic gaps in model performance. For example, discovering that the system frequently misses multi-factor authentication problems suggests either adding a dedicated class for MFA issues or refining prompts to better recognize this pattern. This analysis creates a virtuous cycle where human corrections improve the system.
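A minimal sketch of this clustering step using TF-IDF vectors and k-means; the talk does not prescribe a specific method, and the ticket texts are invented examples.

```python
# A minimal sketch: clustering the text of corrected tickets to surface systematic gaps.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corrected_tickets = [
    "Authenticator app code not accepted when logging in",
    "Push notification for MFA never arrives",
    "Cannot approve sign-in request on my phone",
    "Printer on 3rd floor jams on every duplex job",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(corrected_tickets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Print one representative ticket per cluster; a large MFA-heavy cluster would
# suggest adding a dedicated class or refining the prompt.
for cluster_id in sorted(set(labels)):
    examples = [t for t, l in zip(corrected_tickets, labels) if l == cluster_id]
    print(f"Cluster {cluster_id}: {examples[0]} (+{len(examples) - 1} more)")
```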
Building Enhanced Golden Datasets
Human-corrected labels represent high-quality, production-aligned ground truth. These corrections can be selectively incorporated back into golden datasets, ensuring evaluation data remains representative of real-world usage patterns and emerging ticket types.
Misclassification Rate Tracking
Monitoring how frequently agents correct predictions provides a direct performance indicator. Lower misclassification rates correlate with reduced mean time to resolution and decreased operational costs. Rising misclassification rates trigger investigation and potential interventions such as:
- Prompt refinement based on error patterns
- Switching to few-shot classification with human-labeled examples
- Architectural changes to the classification pipeline
Rerouted Ticket Monitoring
Tracking ticket bounce rates between agents measures the impact on service level agreements. High handover rates indicate classification errors forcing tickets through multiple teams before resolution. Steep increases suggest a need for escalation mechanisms or improved routing rules to ensure faster resolution.
Critical Issue Performance
Analyzing which issue types receive corrections identifies performance gaps in high-severity classes. For critical classes like firewall issues, targeted improvements may include:
- Specialized prompts for specific problem domains
- More detailed class descriptions
- Splitting overly broad classes into more granular categories
Extension to Other LLM Applications
The three-component framework generalizes beyond classification to other LLM use cases, demonstrating its versatility.
Information Extraction
For extracting fields from invoices:
- Golden datasets: Hand-labeled document collections mapping documents to expected extracted fields
- LLMs as judge: Scoring extraction quality from zero to one in production, evaluating whether extracted fields are correct
- Human feedback: Tracking how frequently the finance team corrects extracted values
Retrieval Augmented Generation (RAG)
For question-answering systems:
- Golden datasets: Collections of questions paired with expected retrieved documents, assessing retrieval quality
- LLMs as judge: Fact-checking generated answers for completeness and hallucination detection
- Human feedback: Monitoring rejection signals such as thumbs-down ratings or users rephrasing questions due to unsatisfactory initial answers
These patterns indicate the framework's applicability across diverse LLM problem types while maintaining consistent evaluation principles.
Continuous Evaluation Framework
The three components work synergistically to answer complementary questions throughout the system lifecycle:
- Golden datasets: How good is our model on known ground truth defined by human annotators?
- LLMs as judge: Are we still performing well? This functions as a guardrail indicating whether the system moves in the right direction
- Human feedback: How does our model behave in reality with actual users?
Combining these metrics in dashboards targeted at different stakeholder levels—from developers to business decision-makers—enables comprehensive system monitoring. This approach supports stress testing, assumption validation, learning from production behavior, and iterative improvement.
The Driving Analogy
The speaker concludes with an intuitive analogy comparing the evaluation framework to driving from point A to point B:
- Golden datasets function as the navigation system, indicating the intended trajectory, speed, and progress toward the destination
- LLMs as judge act like road barriers preventing the car from steering into the woods and maintaining course
- Human feedback resembles a child in the back seat asking "are we there yet?"—sometimes brutally honest but equally valuable for understanding journey progress
This analogy emphasizes how the three components provide different but essential perspectives on system performance, working together to ensure successful deployment and operation.