COMPARATIVE EVALUATION OF LARGE LANGUAGE MODELS FOR AUTOMATED SALES DATA STORYTELLING
DOI:
https://doi.org/10.25215/8194288770.49

Abstract

This study presents a comprehensive evaluation of Large Language Models (LLMs) for automated data storytelling in the sales analytics domain. We develop an end-to-end GenAI-powered pipeline that ingests structured sales data, performs statistical analysis, assembles contextual prompts, and generates executive-level business narratives using multiple LLMs. The system includes evaluation modules that measure output quality across readability, actionability, factual accuracy, and completeness. Four state-of-the-art LLMs were incorporated (Gemini, Cohere, Groq, and Hugging Face), though one model failed during execution due to API authentication limitations. The remaining models were assessed both quantitatively and visually using comparative bar charts, composite scores, and ranking plots. Cohere demonstrated the highest overall performance, excelling in accuracy and completeness, followed by Groq and Gemini. Observed variations underscore the influence of decoding randomness on LLM outputs. Additionally, the pipeline incorporates response validation routines, metric aggregation, and robust error logging. Limitations include restricted API usage and output variability across executions. This work contributes a reproducible methodology for enterprise narrative generation, offering practical guidance on selecting and evaluating LLMs for real-world business intelligence. Future directions include incorporating model ensembles, fine-tuning with domain-specific data, and expanding to multimodal storytelling formats.

Published
2026-03-11
Section
Articles
