How We Measure Recommendation Quality — And Why We're Open-Sourcing It

Most recommendation vendors give you a black box. We built an open evaluation harness so you can score any system — including ours — with your own data.

Most recommendation vendors give you a black box. You upload your data, they return a list, and you're supposed to trust that it's good. We don't think that's good enough — and we built a tool to prove it.

Today we're open-sourcing nsl-eval: the same offline evaluation harness we use internally to measure and improve NeuronSearchLab's recommendation engine. You can run it against any recommendation system — including ours and our competitors'.

Why we're doing this

The honest answer is that we think transparent evaluation is good for the whole field, and good for us.

Recommender systems are hard to evaluate objectively. Vendors cherry-pick metrics, benchmark on curated datasets, and compare against weak baselines. Buyers end up making six-figure decisions based on demos and sales promises.

If we give you the tools to run your own evaluation on your own data with your own hold-out split, the results are harder to spin. That's a constraint we're comfortable with because we think we'll win a fair fight. And it's a commitment we're making publicly: use nsl-eval to score NeuronSearchLab. If we're not the best option for your workload, the harness will tell you.

There's a secondary reason: better tooling raises the floor for everyone building on top of recommendation systems. Teams that can run rigorous offline evals ship better models. That's good for the ecosystem and, eventually, good for our customers.

Why offline evaluation matters

Online A/B testing is the gold standard for measuring the business impact of recommendations. But it's expensive:

  • You need real production traffic.
  • Errors are visible to users.
  • A rigorous test takes weeks to reach statistical significance.

Offline evaluation lets you move fast. Hold out 20% of each user's history, ask your model to predict what they'd have interacted with, and score the result against the held-out ground truth. It's not a perfect proxy for live performance — the gap between offline and online metrics is real — but it's a reliable filter that's cheap enough to run on every pull request.

The practical flow: use offline eval to eliminate bad models quickly, then confirm winners with an A/B test.
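That loop — hold out a slice of each user's history, recommend, score against it — fits in a few lines. The sketch below is illustrative only: the helper names and the popularity stand-in model are ours, not part of nsl-eval.

```python
import numpy as np

rng = np.random.default_rng(42)

def holdout_split(user_items, frac=0.2):
    """Hold out `frac` of each user's interactions as ground truth."""
    train, test = [], []
    for items in user_items:
        items = list(items)
        rng.shuffle(items)
        cut = max(1, int(len(items) * frac))
        test.append(set(items[:cut]))
        train.append(set(items[cut:]))
    return train, test

def hit_rate_at_k(recommend, train, test, k=10):
    """Share of users with at least one held-out item in their top-k."""
    hits = [bool(set(recommend(u, train[u], k)) & test[u])
            for u in range(len(train))]
    return sum(hits) / len(hits)

# Toy data: 50 users, 100 items, 8 interactions each.
user_items = [rng.choice(100, size=8, replace=False).tolist() for _ in range(50)]
train, test = holdout_split(user_items)

# Popularity stand-in: recommend the globally most common unseen items.
counts = np.bincount([i for items in user_items for i in items], minlength=100)
popular = counts.argsort()[::-1]
recommend = lambda u, seen, k: [i for i in popular if i not in seen][:k]

hr = hit_rate_at_k(recommend, train, test, k=10)
```

Cheap enough that swapping in a real model and re-running on every commit is practical — which is the whole point.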

What we measure

Five metrics, all at cutoff K (typically K=10):

  • NDCG@K: are the most relevant items ranked highest? Penalises relevant items buried deep in the list.
  • Hit Rate@K: did we surface at least one relevant item for this user? Easy to explain to stakeholders.
  • Recall@K: how much of the user's held-out history did we recover in the top K?
  • Precision@K: how many of our top-K recommendations were actually relevant?
  • Coverage@K: what fraction of the catalogue is recommended across all users?
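For concreteness, here is one standard per-user formulation of the first four metrics, using binary relevance and the usual log2 discount for DCG. This is a sketch of the definitions, not nsl-eval's internals:

```python
import numpy as np

def metrics_at_k(ranked, relevant, k=10):
    """Score one user's ranked list against their held-out items."""
    topk = ranked[:k]
    rel = np.array([item in relevant for item in topk], dtype=float)
    # DCG discounts: position i (1-indexed) is weighted 1 / log2(i + 1).
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float(rel @ discounts)
    # Ideal DCG: all recoverable relevant items packed at the top.
    idcg = float(discounts[:min(len(relevant), k)].sum())
    return {
        "ndcg": dcg / idcg if idcg else 0.0,
        "hit_rate": float(rel.any()),
        "recall": rel.sum() / len(relevant),
        "precision": rel.sum() / k,
    }

# One user: items 3 and 4 from their hold-out appear at ranks 2 and 5.
m = metrics_at_k(ranked=[7, 3, 9, 1, 4], relevant={3, 4, 8}, k=5)
```

Averaging these per-user dictionaries across all evaluated users gives the aggregate numbers the harness reports.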

Coverage is the metric most teams skip. A model that recommends the same 50 popular items to everyone can score well on NDCG and Hit Rate — but it is commercially useless, and it is unfair to the long-tail sellers or content creators on your platform.
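Coverage itself is cheap to compute: pool every user's top-K and divide by catalogue size. A popularity-only model that hands everyone the same list makes the failure mode obvious (toy numbers, not output from the harness):

```python
def coverage_at_k(all_recs, n_items):
    """Fraction of the catalogue appearing in at least one user's top-k."""
    recommended = set()
    for recs in all_recs:
        recommended.update(recs)
    return len(recommended) / n_items

# Worst case: 200 users all get the same 10 popular items.
same_list = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]] * 200
cov = coverage_at_k(same_list, n_items=482)
```

Here coverage collapses to 10/482 ≈ 2% no matter how well the list ranks — which is exactly why it belongs alongside the accuracy metrics.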

Our baseline numbers

We trained an ALS model (32 factors, 10 iterations) on a synthetic e-commerce dataset with 200 users, 482 items, and ~4,238 interactions — the kind of scale you'd see on a mid-size platform in early growth.
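For intuition about what that model is doing: ALS fits user and item factor matrices by alternating ridge regressions. The sketch below is the plain dense least-squares variant on random data at the same shapes — real implicit-feedback ALS (e.g. the implicit library) adds per-interaction confidence weights, so treat this as a shape-level illustration, not our training code:

```python
import numpy as np

rng = np.random.default_rng(7)
R = (rng.random((200, 482)) < 0.04).astype(float)  # toy 200x482 interactions

n_factors, n_iters, reg = 32, 10, 0.1
U = rng.normal(scale=0.1, size=(200, n_factors))   # user factors
V = rng.normal(scale=0.1, size=(482, n_factors))   # item factors

for _ in range(n_iters):
    # Fix V, solve the regularised least-squares problem for U; then swap.
    A = V.T @ V + reg * np.eye(n_factors)
    U = np.linalg.solve(A, V.T @ R.T).T
    A = U.T @ U + reg * np.eye(n_factors)
    V = np.linalg.solve(A, U.T @ R).T

scores = U @ V.T  # rank each user's items by predicted affinity
```

Recommendation is then just sorting each user's row of `scores`, excluding items they've already seen.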

nsl-eval \
  --interactions data/sample/interactions.csv \
  --model-path   models/als_baseline.pkl \
  --k            10

{
  "ndcg":              0.3241,
  "hit_rate":          0.5820,
  "recall":            0.2901,
  "precision":         0.1045,
  "coverage":          0.4712,
  "n_users_evaluated": 160
}

In plain English: for 58% of users, we surfaced at least one item they would have interacted with. We recovered 29% of their held-out interactions. We surfaced 47% of the full catalogue — well above the popularity-bias baseline of roughly 12%.

The catalogue coverage number is the one we're most proud of. Popularity bias is the default failure mode for collaborative filtering at this scale. Getting to 47% on a small synthetic dataset required deliberate tuning; it doesn't happen automatically.

How to run it on your own data

pip install nsl-eval

The Python API:

from nsl_eval import train_test_split, evaluate_model

# interactions is a scipy.sparse.csr_matrix (users × items)
train, test = train_test_split(interactions, test_fraction=0.2, seed=42)
metrics = evaluate_model(model, train, test, k=10)
print(metrics)

Any object with a .recommend(user_idx, interactions, n) method is compatible — including models from the implicit library. The README includes a full adapter example for evaluating NeuronSearchLab's live API on your own held-out split.
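An adapter is just a thin shim onto that interface. The sketch below wraps a hypothetical remote client (stubbed out here) and maps external item ids back to matrix column indices — the class and method names are ours for illustration, not the README's:

```python
class APIRecommenderAdapter:
    """Exposes a remote recommender through the .recommend(user_idx,
    interactions, n) interface the harness expects. `client` is anything
    with a get_recs(user_id, n) method; in practice it would call the
    vendor's HTTP API instead of this local stub."""

    def __init__(self, client, id_map):
        self.client = client
        self.id_map = id_map  # external item id -> internal column index

    def recommend(self, user_idx, interactions, n):
        external = self.client.get_recs(user_id=user_idx, n=n)
        # Drop ids outside the evaluation catalogue, keep ranking order.
        return [self.id_map[i] for i in external if i in self.id_map][:n]

class StubClient:
    """Stand-in for a real API client: returns deterministic fake SKUs."""
    def get_recs(self, user_id, n):
        return [f"sku-{(user_id + j) % 5}" for j in range(n)]

id_map = {f"sku-{j}": j for j in range(5)}
adapter = APIRecommenderAdapter(StubClient(), id_map)
recs = adapter.recommend(user_idx=2, interactions=None, n=3)
```

The id-mapping step matters: offline metrics are only comparable if every system is scored against the same internal item indexing.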

A sample dataset and pre-trained baseline model are bundled in the repo. You can get real output in under two minutes.

Benchmarking against other systems

The harness is designed to evaluate any system that returns a ranked list. The repo includes an adapter example for Recombee. Running the same train/test split against multiple systems gives you a direct apples-to-apples comparison on your own data.
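Mechanically, a side-by-side run is a loop: one adapter per system, one shared split, one metric table. A toy version with two hand-rolled recommenders shows the shape (nothing here calls a real API):

```python
def hit_rate(recommend, test_sets, k=10):
    """Fraction of users whose held-out item appears in their top-k."""
    hits = [bool(set(recommend(u, k)) & test_sets[u])
            for u in range(len(test_sets))]
    return sum(hits) / len(hits)

# Shared split: one held-out item per user, 20 users, 7 items.
test_sets = [{u % 7} for u in range(20)]

systems = {
    "always_popular": lambda u, k: list(range(k)),              # same list for all
    "personalised": lambda u, k: [u % 7] + list(range(k - 1)),  # user-aware
}

rows = {name: hit_rate(rec, test_sets, k=5) for name, rec in systems.items()}
for name, hr in rows.items():
    print(f"{name:15s} hit_rate@5 = {hr:.2f}")
```

Because both systems were scored on the identical `test_sets`, the gap between the two rows is attributable to the models, not the data.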

Here's what that comparison looks like on the bundled synthetic dataset (200 users, 482 items):

Metric        | NeuronSearchLab | Recombee
------------- | --------------- | --------
ndcg          | 0.3241          | 0.2718
hit_rate      | 0.5820          | 0.4903
recall        | 0.2901          | 0.2344
precision     | 0.1045          | 0.0891
coverage      | 0.4712          | 0.3985

These numbers are from the synthetic dataset in the repo — use them to validate the methodology works, not to make purchasing decisions. The only number that matters is what you get when you run this against your own data.

What's next

This harness is the foundation for three things we're building:

  1. Continuous evaluation in CI — every PR runs nsl-eval; NDCG regressions block merge.
  2. Neural re-ranking — a lightweight two-tower model to replace ALS on catalogues >10k items. The harness is how we'll measure whether it actually improves things.
  3. Cold-start metrics — a separate evaluation path for new users and new items, where collaborative filtering breaks down and different approaches are needed.
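For the CI gate in (1), the simplest shape is a check that compares fresh metrics to a committed baseline, with a small tolerance for run-to-run noise. A hypothetical sketch — the threshold and baseline value are placeholders, not our pipeline config:

```python
# Fail the build if NDCG drops more than 1% relative to the baseline.
# In CI, `current` would be parsed from nsl-eval's JSON output.
BASELINE = {"ndcg": 0.3241}
TOLERANCE = 0.01  # 1% relative slack for evaluation noise

def check_no_regression(current, baseline=BASELINE, tol=TOLERANCE):
    floor = baseline["ndcg"] * (1 - tol)
    if current["ndcg"] < floor:
        raise SystemExit(
            f"NDCG regression: {current['ndcg']:.4f} < floor {floor:.4f}"
        )
    return True

ok = check_no_regression({"ndcg": 0.3250})  # passes: above the floor
```

The tolerance matters: a zero-slack gate turns ordinary sampling noise into spurious red builds, which teaches people to ignore the gate.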

Try it

The repo is at github.com/neuronsearchlab/nsl-eval. It is MIT-licensed. Stars, issues, and adapter contributions are welcome.

If you want to run this against a live NeuronSearchLab account, start a free trial — the API adapter example in the README works out of the box.

This is the fourth post in NeuronSearchLab's technical series on recommendation systems. Previous posts: How ALS Collaborative Filtering Works, The Cold-Start Problem, and NDCG, Hit Rate, and Coverage Explained.