Engineering

How to Know If Your Recommendations Are Working: NDCG, Hit Rate, and Coverage

Most teams track CTR and call it done. Here's what NDCG, hit rate, and catalogue coverage actually measure — and how to use them before you ship.

9 min read

Most teams measure recommendation quality by watching click-through rate in production. That is like testing a bridge by driving a truck over it. It works right up until it doesn't, and by then the damage is done.

Offline evaluation — measuring quality against a held-out test set before you ship — lets you compare models, tune hyperparameters, and catch regressions without risking your live conversion rate. But offline metrics are only useful if you know what they measure and what their limits are.

This post covers the three metrics NeuronSearchLab tracks as standard: NDCG@K, Hit Rate@K, and Catalogue Coverage@K. All three are implemented in our open evaluation harness and run as part of every model promotion.

How Offline Evaluation Works

The basic setup: take your interaction history, hold out a fraction of each user's interactions as the "test set," train your model on the remaining data, then ask the model to predict what each user will interact with next.

NeuronSearchLab uses a per-user random hold-out split. For each user with at least 2 interactions, we hold out ceil(0.2 × n_interactions) items at random as ground truth:

train_matrix, test_matrix = train_test_split(
    interactions,
    test_fraction=0.2,
    min_interactions=2,
    seed=42,
)

Users with fewer than 2 interactions go entirely into training and are excluded from evaluation — you cannot hold out items from someone who barely has any.

For each evaluated user, the model generates a ranked list of K recommendations (excluding items in the training set). We then compare that list to the ground truth held-out items.
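To make the split concrete, here is a minimal sketch of the per-user hold-out logic described above. It assumes interactions are given as a mapping from user id to a list of item ids; the real train_test_split in the harness operates on interaction matrices, so treat `split_per_user` and its signature as illustrative only.

```python
import math
import random

def split_per_user(
    interactions: dict[int, list[int]],
    test_fraction: float = 0.2,
    min_interactions: int = 2,
    seed: int = 42,
) -> tuple[dict[int, list[int]], dict[int, list[int]]]:
    """Hold out ceil(test_fraction * n) random items per eligible user."""
    rng = random.Random(seed)
    train: dict[int, list[int]] = {}
    test: dict[int, list[int]] = {}
    for user, items in interactions.items():
        if len(items) < min_interactions:
            # Too few interactions: everything goes to training,
            # and the user is excluded from evaluation.
            train[user] = list(items)
            continue
        n_test = math.ceil(test_fraction * len(items))
        shuffled = items[:]
        rng.shuffle(shuffled)
        test[user] = shuffled[:n_test]
        train[user] = shuffled[n_test:]
    return train, test
```

With 5 interactions and test_fraction=0.2, exactly ceil(1.0) = 1 item is held out; a user with a single interaction lands entirely in training.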

NDCG@K: Does Ranking Order Matter?

NDCG stands for Normalised Discounted Cumulative Gain. It measures how well the model ranks relevant items — specifically, whether it puts them near the top of the list.

The intuition: a relevant item at position 1 is worth more than a relevant item at position 5. NDCG captures this by applying a logarithmic discount to each position.

DCG (Discounted Cumulative Gain) for a single user:

DCG@K = Σ_{rank=1}^{K} (1 if item_at_rank is relevant else 0) / log₂(rank + 1)

Position 1 contributes 1/log₂(2) = 1.0. Position 2 contributes 1/log₂(3) ≈ 0.63. Position 5 contributes 1/log₂(6) ≈ 0.39.

Ideal DCG (IDCG) is what you'd get if the model returned all relevant items in the top positions — the theoretical maximum for that user.

NDCG normalises DCG by IDCG so the score is always in [0, 1]:

NDCG@K = DCG@K / IDCG@K

From our implementation:

import math

def ndcg_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    # DCG: each relevant item contributes 1/log2(rank + 1),
    # so hits near the top of the list are worth more.
    dcg = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            dcg += 1.0 / math.log2(rank + 1)

    # IDCG: the score if all relevant items filled the top positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return dcg / idcg if idcg > 0 else 0.0
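A quick worked example, using the position weights from earlier. Suppose the user has a single held-out item and the model ranks it second:

```python
import math

# Recommendations [7, 42, 13]; the only relevant item, 42, sits at rank 2.
dcg = 1.0 / math.log2(2 + 1)   # contribution of a hit at rank 2, ≈ 0.631
idcg = 1.0 / math.log2(1 + 1)  # ideal: the relevant item at rank 1, = 1.0
ndcg = dcg / idcg
print(round(ndcg, 3))  # 0.631
```

Had the item been ranked first instead, DCG would equal IDCG and the score would be a perfect 1.0.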

What good looks like: NDCG@10 values for collaborative filtering on e-commerce data typically range from 0.05 to 0.25. Higher is better. A jump from 0.08 to 0.12 is meaningful. Comparing models on the same dataset is more useful than the absolute number.

What NDCG misses: It only rewards getting the right items into the list in the right order. It says nothing about whether your recommendations are diverse, novel, or covering the full catalogue.

Hit Rate@K: The Simplest Useful Metric

Hit Rate@K is the fraction of users for whom at least one held-out item appears in the top-K recommendations. Binary per user: 0 if none of the held-out items are in the top-K, 1 if at least one is.

def hit_rate_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    return float(any(item in relevant for item in recommended[:k]))

Averaged across all evaluated users, it tells you: "For what percentage of users did the model surface at least one thing they would have interacted with?"

Hit Rate@10 of 0.4 means the model gets a relevant item into the top 10 for 40% of users. Hit Rate@20 for the same model will always be >= Hit Rate@10.

Why Hit Rate alongside NDCG? NDCG measures quality of ranking; Hit Rate measures reach. A model can have good NDCG (it ranks correctly when it is right) but poor Hit Rate (it is wrong for most users). Tracking both gives you a more complete picture.

Practical use: Hit Rate is easy to explain to non-technical stakeholders. "The model finds at least one relevant item for 43% of users in the top 10" is concrete and actionable.
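The averaging step looks like this in practice. The per-user data below is made up for illustration; each pair is a top-K list and that user's held-out ground truth:

```python
def hit_rate_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    return float(any(item in relevant for item in recommended[:k]))

# Hypothetical per-user data: (top-K recommendations, held-out items).
users = [
    ([1, 2, 3], {2}),      # hit at rank 2
    ([4, 5, 6], {9}),      # miss
    ([7, 8, 9], {8, 9}),   # hit
    ([1, 4, 7], {3}),      # miss
]
mean_hit_rate = sum(hit_rate_at_k(r, t, k=3) for r, t in users) / len(users)
print(mean_hit_rate)  # 0.5: the model reached 2 of 4 users
```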

Catalogue Coverage@K: Are You Recommending the Same 50 Items to Everyone?

Coverage@K measures what fraction of the full item catalogue appears in at least one user's top-K recommendation list.

def coverage_at_k(all_recommendations: list[list[int]], n_items: int, k: int) -> float:
    surfaced: set[int] = set()
    for recs in all_recommendations:
        surfaced.update(recs[:k])
    return len(surfaced) / n_items

This is a system-level metric, not a per-user metric. A coverage of 0.10 means that across all users, only 10% of items are ever recommended to anyone.
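A toy run makes the system-level nature obvious. Three users and a 10-item catalogue, with the model leaning heavily on two popular items:

```python
def coverage_at_k(all_recommendations: list[list[int]], n_items: int, k: int) -> float:
    surfaced: set[int] = set()
    for recs in all_recommendations:
        surfaced.update(recs[:k])
    return len(surfaced) / n_items

# Everyone gets nearly the same popular items (1 and 2).
recs = [[1, 2], [1, 2], [1, 3]]
print(coverage_at_k(recs, n_items=10, k=2))  # 0.3: only items 1, 2, 3 ever surface
```

Each user's list might look reasonable in isolation; only the aggregate reveals that 7 of 10 items are never shown to anyone.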

Why this matters: A model optimised purely for accuracy (NDCG, Hit Rate) will converge on recommending popular items. Popular items have the most interactions in training, so the model learns they are "good." But recommending the same 50 items to every user is not personalisation — and it is bad for business. Long-tail items often have better margins, and discovering niche items creates a different kind of engagement than surfacing the obvious.

Coverage is also a proxy for fairness. If your platform has 10,000 sellers, do 9,950 of them never appear in recommendations?

What good looks like: Coverage depends heavily on catalogue size and K. For a 10k-item catalogue with K=10, coverage of 0.15–0.30 is typical for collaborative filtering. Significantly lower suggests your model has collapsed onto popular items. Significantly higher suggests you might be sacrificing relevance for diversity.

Running the Evaluation Harness

NeuronSearchLab ships an evaluation CLI that computes all three metrics in one pass:

python -m src.evaluate \
  --interactions data/sample/interactions.csv \
  --model-path models/als_baseline.pkl \
  --k 10 \
  --test-fraction 0.2

Output:

{
  "ndcg": 0.142,
  "hit_rate": 0.387,
  "recall": 0.091,
  "precision": 0.043,
  "coverage": 0.213,
  "n_users_evaluated": 847
}

Run this before promoting any model to production. We track all three headline metrics (NDCG, Hit Rate, Coverage) across training runs and reject a promotion if any metric regresses by more than 5% relative to the current live model.
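The 5% regression rule is simple enough to sketch. The helper below is hypothetical (not part of the harness), but it captures the gate: a candidate is rejected if any headline metric drops more than 5% relative to the live model.

```python
def passes_promotion_gate(
    candidate: dict[str, float],
    live: dict[str, float],
    max_regression: float = 0.05,
) -> bool:
    """Reject if any headline metric regresses > 5% relative to live."""
    for metric in ("ndcg", "hit_rate", "coverage"):
        if candidate[metric] < live[metric] * (1 - max_regression):
            return False
    return True

live = {"ndcg": 0.142, "hit_rate": 0.387, "coverage": 0.213}
candidate = {"ndcg": 0.150, "hit_rate": 0.380, "coverage": 0.190}
print(passes_promotion_gate(candidate, live))  # False: coverage fell > 5%
```

Note the threshold is relative, not absolute: a drop from 0.387 to 0.380 in Hit Rate passes (under 2% relative), while 0.213 to 0.190 in coverage fails (nearly 11% relative).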

Offline Metrics vs. Online Metrics: The Gap

Offline evaluation is necessary but not sufficient. The held-out test items are sampled from historical interactions — which means the model is evaluated on its ability to predict past behaviour, not future behaviour.

The gap between offline and online performance is real and sometimes large. A model with better NDCG@10 in offline eval does not always win an A/B test. Common reasons:

  • Exposure bias — users only interacted with items they were shown. The held-out test set reflects historical recommendations, not true user preferences.
  • Temporal shift — user tastes change; held-out items are from the recent past, which may not represent tomorrow.
  • Feedback loops — a model that recommends popular items gets more interactions on popular items, which reinforces their popularity in the next training run.

Offline metrics narrow the candidate set. A/B tests determine the winner. Never skip the A/B test.

A Practical Evaluation Checklist

Before promoting a new model:

  • [ ] NDCG@10 equal to or better than current model (no regression)
  • [ ] Hit Rate@10 equal to or better than current model
  • [ ] Coverage@10 not significantly worse — a drop in coverage warrants investigation
  • [ ] n_users_evaluated is sufficient — fewer than 100 evaluated users makes the results too noisy to trust
  • [ ] Evaluation was run on the same data split as the current model (same seed, same test_fraction)
  • [ ] Cold-start users (not in training data) are excluded from this evaluation and tracked separately

Summary

  • NDCG@K: ranking quality, i.e., whether relevant items are near the top. Weakness: ignores diversity and novelty.
  • Hit Rate@K: reach, i.e., whether at least one relevant item appears. Weakness: binary metric with no ranking credit.
  • Coverage@K: catalogue breadth across users. Weakness: does not directly measure per-user relevance.

None of these metrics alone tells the full story. Together, they cover the three dimensions that matter most for a recommendation system: are you ranking well, are you helping most users, and are you using the full catalogue?

NeuronSearchLab computes all three in every evaluation run. If you want to run the harness against your own data, the evaluation module is available through the API.

This is the third post in NeuronSearchLab's technical series on recommendation systems. Previous posts: How ALS Collaborative Filtering Works and The Cold-Start Problem.