Source: https://www.evidentlyai.com/ranking-metrics/evaluating-recommender-systems
Overview
This EvidentlyAI article walks through how to align ranking metrics with the business goals of a recommender system. It highlights the difference between product KPIs (click-through, revenue, watch time) and offline proxies such as precision, recall, MAP, nDCG, coverage, novelty, and serendipity.
Key points
- Start with product intent – determine whether the recommender should drive clicks, conversions, dwell time, or other KPIs. Design experiments and datasets that mimic the production distribution, and define what “relevance” means for your catalog.
- Core ranking metrics (a scoring sketch follows this list):
  - *Precision@K / Recall@K*: Precision@K is the share of relevant items within the top-K shortlist; Recall@K is the share of all relevant items that the shortlist captures.
  - *MRR / MAP*: reward relevant items ranked near the top; MRR averages the reciprocal rank of the first hit across users, while MAP averages precision taken at the position of each hit.
  - *nDCG*: applies a logarithmic discount to lower-ranked hits and normalizes each user's score by the ideal ordering, so scores are comparable across users.
  - *Coverage*: measures how much of the catalog and user base the system actually recommends.
- Beyond accuracy – track diversity, novelty, and serendipity to avoid echo-chamber recommendations and to surface long-tail items; monitor business constraints such as price ranges or brand mix if merchandising requires it (a coverage and novelty sketch follows this list).
- Evaluation workflow – combine offline validation (holdout sets, cross-validation) with online A/B tests; continually monitor logs for drift, cold-start behavior, and catalog changes (a time-based split sketch follows this list).
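The core ranking metrics above reduce to a few per-user scores that are then averaged across users. Below is a minimal sketch with binary relevance, assuming `recommended` is an ordered list of item ids produced by the model and `relevant` is the set of items the user actually interacted with in the holdout data; the function names are illustrative, not taken from the article.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k


def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that made it into the top-k shortlist."""
    if not relevant:
        return 0.0
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / len(relevant)


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item (0.0 if none is found)."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(recommended, relevant, k):
    """Mean of precision@i taken at every position i that holds a relevant item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0


def ndcg_at_k(recommended, relevant, k):
    """DCG with a log2 position discount, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```

Averaging `reciprocal_rank` and `average_precision_at_k` over all users yields MRR and MAP@K; averaging `ndcg_at_k` the same way gives the system-level nDCG@K.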
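The beyond-accuracy checks lend themselves to the same treatment. The sketch below computes catalog coverage (the share of items that appear in at least one user's recommendations) and a popularity-based novelty score; the self-information definition of novelty and all names here are illustrative assumptions, not definitions quoted from the article.

```python
import math


def catalog_coverage(recommendations_per_user, catalog):
    """Share of catalog items that appear in at least one user's top-K list."""
    recommended_items = {item
                         for recs in recommendations_per_user.values()
                         for item in recs}
    return len(recommended_items & set(catalog)) / len(catalog)


def mean_novelty(recommendations_per_user, item_popularity, n_users):
    """Average self-information -log2(p) of recommended items; rarer items score higher."""
    scores = []
    for recs in recommendations_per_user.values():
        for item in recs:
            # p: share of users who previously interacted with this item
            p = item_popularity.get(item, 0) / n_users
            if p > 0:
                scores.append(-math.log2(p))
    return sum(scores) / len(scores) if scores else 0.0
```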
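For the offline part of the workflow, a time-based holdout usually mimics production better than a random split, because the model only sees interactions that happened before the evaluation window. A minimal pandas sketch, assuming an interaction log with `user_id`, `item_id`, and datetime `timestamp` columns (the column names and the 7-day window are assumptions):

```python
import pandas as pd


def temporal_split(interactions: pd.DataFrame, holdout_days: int = 7):
    """Split an interaction log into train/holdout sets by a timestamp cutoff."""
    cutoff = interactions["timestamp"].max() - pd.Timedelta(days=holdout_days)
    train = interactions[interactions["timestamp"] <= cutoff]
    holdout = interactions[interactions["timestamp"] > cutoff]
    # Evaluate only users present in both windows, so ranking quality is not
    # conflated with pure cold-start cases.
    known_users = set(train["user_id"]) & set(holdout["user_id"])
    return train, holdout[holdout["user_id"].isin(known_users)]
```

The cold-start users excluded here can be monitored as a separate segment, in line with the workflow point above.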
Takeaways
- There is no single “best” metric; teams need a balanced scorecard that connects offline ranking metrics with online business KPIs.
- High-quality logging (user interactions, impressions, context) is the foundation for meaningful evaluation.
- Balancing accuracy with coverage and diversity is critical to maintaining discovery, especially in marketplaces with long-tail inventory.