7. Production Concerns

The model taxonomy covers method selection well, but real systems also require the decisions below: pipeline architecture, label design, evaluation protocol, re-ranking policy, and monitoring.

7.1 Retrieval + ranking architecture

Most large systems are two-stage:

  1. Candidate generation (fast, high recall)
  2. Ranking (slower, richer features/objective)

Without this separation, serving cost or latency becomes prohibitive.
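
The two-stage pattern can be sketched in a few lines. This is a toy in-memory version with made-up embeddings and a deliberately trivial "ranker"; the point is the shape of the pipeline, not the models:

```python
import numpy as np

# Toy two-stage pipeline: cheap dot-product retrieval over the full catalog,
# then a richer (here, deliberately simple) ranker over the small candidate set.
# All data and scoring functions are illustrative placeholders.

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(10_000, 32))   # catalog item embeddings
user_emb = rng.normal(size=32)             # one user's embedding

def retrieve(user_vec, k=500):
    """Stage 1: fast, high-recall candidate generation."""
    scores = item_emb @ user_vec            # one dot product per catalog item
    return np.argpartition(-scores, k)[:k]  # top-k candidate ids, unordered

def rank(user_vec, candidate_ids, k=10):
    """Stage 2: slower scoring, but only over ~500 items, not 10,000."""
    feats = item_emb[candidate_ids]
    # A real ranker would use richer features and a trained objective here.
    scores = feats @ user_vec
    order = np.argsort(-scores)[:k]
    return candidate_ids[order]

top10 = rank(user_emb, retrieve(user_emb))
print(len(top10))  # 10 items reach the user; the ranker never saw the full catalog
```

The expensive model only ever scores the retrieved candidates, which is exactly why the separation keeps serving cost bounded.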

Google’s course extends this into a practical three-stage view:

  1. Candidate generation
  2. Scoring or ranking
  3. Re-ranking

The extra re-ranking stage matters because the best ranked list for raw engagement is often not the best final surface once you account for freshness, diversity, fairness, or business constraints.

Recommendation process architecture

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

In production, candidate generation is usually itself a mixture of sources:

  • Embedding nearest neighbors from a two-tower or matrix-factorization model
  • Co-visitation or graph-based retrieval
  • Popularity or trending backfills
  • Rule-based inventory or policy constraints

One key Google point is that scores from different candidate generators are usually not comparable. That is why a separate scorer or ranker is often necessary after retrieval.

For neural retrieval, Google also stresses approximate nearest-neighbor search rather than exact brute-force scoring over the full catalog. Libraries such as ScaNN are used to make this practical at large scale.
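
To see what ANN search is replacing, here is the exact brute-force baseline in NumPy. A library such as ScaNN trades a small amount of recall for large speedups by partitioning and quantizing this search; the function below is a simplified stand-in, not the ScaNN API:

```python
import numpy as np

# Exact (brute-force) nearest-neighbor retrieval: always correct, but
# O(catalog size) work per query, which is what ANN libraries avoid.

rng = np.random.default_rng(1)
catalog = rng.normal(size=(50_000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)

def exact_top_k(q, items, k=100):
    scores = items @ q                       # one dot product per catalog item
    top = np.argpartition(-scores, k)[:k]    # unordered top-k
    return top[np.argsort(-scores[top])]     # sorted best-first

ids = exact_top_k(query, catalog)
print(ids.shape)  # (100,)
```

At tens of millions of items this per-query full scan is the bottleneck ANN methods exist to remove.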

Approximate nearest-neighbor retrieval in embedding space

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

7.2 Label design and negatives

For implicit-feedback data, a non-click is not necessarily a negative signal. You need:

  • Exposure-aware negatives
  • Position-bias-aware training
  • Time-windowed labels matching product goals
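
The first two bullets can be made concrete with a small labeling sketch. The log schema and the per-slot click propensities here are hypothetical; the idea is that only exposed items may become negatives, and each example is reweighted against position bias:

```python
# Exposure-aware labeling: only items the user was actually shown can become
# negatives; unexposed items are treated as unknown, not as dislikes.
# The log schema and propensity values below are illustrative placeholders.

logs = [
    {"user": 1, "item": "a", "exposed": True,  "clicked": True,  "position": 1},
    {"user": 1, "item": "b", "exposed": True,  "clicked": False, "position": 2},
    {"user": 1, "item": "c", "exposed": False, "clicked": False, "position": None},
]

def make_examples(logs, position_propensity):
    """Positives from clicks; negatives only from exposed non-clicks.
    Each example carries an inverse-propensity weight to counter position bias."""
    examples = []
    for row in logs:
        if not row["exposed"]:
            continue  # never seen -> no label, not a negative
        label = 1 if row["clicked"] else 0
        weight = 1.0 / position_propensity.get(row["position"], 1.0)
        examples.append((row["user"], row["item"], label, weight))
    return examples

# Hypothetical per-slot click propensities, e.g. estimated from randomized traffic.
propensity = {1: 0.9, 2: 0.5}
print(make_examples(logs, propensity))
```

Item "c" never produces a training example at all: the user never saw it, so its non-click tells us nothing.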

Google’s scoring module makes a related point: you need to be explicit about what you are optimizing. A model trained for click probability can converge to clickbait. A model trained for watch time may overserve long items. A model trained for immediate conversion can hurt long-term trust or retention.

In other words, score definition is part of the product design, not just a modeling choice.

For feed-style or slate recommendation, Google also recommends distinguishing between:

  • Position-dependent models, which estimate utility at a fixed slot
  • Position-independent models, which try to estimate intrinsic relevance before layout effects

That distinction matters because position bias can make top slots look artificially better even when the item itself is not more relevant.

7.3 Evaluating recommender and ranking systems

D2L’s NeuMF evaluator is a good starting point for implicit-feedback ranking evaluation. The protocol uses a chronological split, holds out a future ground-truth item $g_u$ for each user $u$, and ranks that item against items the user has not interacted with.

Two core metrics in that setup are:

$$\mathrm{Hit@}K = \frac{1}{|U|} \sum_{u \in U} \mathbf{1}\left(\mathrm{rank}_{u,g_u} \le K\right)$$

and

$$\mathrm{AUC} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|I \setminus S_u|} \sum_{j \in I \setminus S_u} \mathbf{1}\left(\mathrm{rank}_{u,g_u} < \mathrm{rank}_{u,j}\right)$$

where $U$ is the user set, $I$ is the item set, and $S_u$ is the set of items already associated with user $u$.

This evaluator is useful because it respects time order and measures whether the held-out future item is surfaced near the top. But for production recommendation systems, you usually need a wider evaluation stack than Hit@K and AUC alone.
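
The Hit@K and AUC definitions above translate directly into code. A minimal sketch, assuming the rank of each user's held-out item among its candidates has already been computed:

```python
# Hit@K and AUC for the held-out-item protocol described above.
# ranks[u] is the 1-based rank of user u's held-out item g_u among the
# candidates; n_candidates[u] is |I \ S_u| for that user (toy data below).

def hit_at_k(ranks, k):
    """Fraction of users whose held-out item lands in the top k."""
    return sum(r <= k for r in ranks.values()) / len(ranks)

def auc(ranks, n_candidates):
    """Per the formula: fraction of candidates ranked strictly below g_u,
    averaged over users. The held-out item itself contributes zero."""
    total = 0.0
    for u, r in ranks.items():
        n = n_candidates[u]        # |I \ S_u|
        total += (n - r) / n       # candidates ranked below g_u
    return total / len(ranks)

ranks = {"u1": 1, "u2": 4, "u3": 12}
n_cands = {"u1": 100, "u2": 100, "u3": 100}
print(hit_at_k(ranks, 10))  # 2 of 3 users hit the top 10
```

Note that both numbers depend entirely on how the candidate set was drawn, which is why the protocol choices below matter so much.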

Stage-specific offline metrics

  • Retrieval: use Recall@M or candidate hit rate to verify that the candidate generator is not dropping relevant items before the ranker sees them.
  • Ranking: use NDCG@K, Recall@K, and MRR for top-of-list quality. If you have multiple relevant held-out items per user, MAP is also useful.
  • Rating prediction: use RMSE or MAE only when explicit rating prediction is the real product task. These metrics are much less informative for feed ranking or item recommendation.

Let $G_u$ denote the relevant items for user $u$, let $C_u(M)$ be the top-$M$ retrieval candidate set, and let $L_u(K)$ be the top-$K$ ranked list.

For retrieval, a standard candidate-stage metric is:

$$\mathrm{Recall@}M = \frac{1}{|U|} \sum_{u \in U} \frac{|G_u \cap C_u(M)|}{|G_u|}$$

For the final ranked list, the analogous metric is:

$$\mathrm{Recall@}K = \frac{1}{|U|} \sum_{u \in U} \frac{|G_u \cap L_u(K)|}{|G_u|}$$

To reward correct ordering near the top, define

$$\mathrm{DCG@}K(u) = \sum_{j=1}^{K} \frac{2^{\mathrm{rel}_{u,j}} - 1}{\log_2(j+1)}$$

and then normalize by the ideal ordering:

$$\mathrm{NDCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@}K(u)}{\mathrm{IDCG@}K(u)}$$

where $\mathrm{rel}_{u,j}$ is the relevance label of the item at position $j$ for user $u$.

If you care about the position of the first relevant result, use mean reciprocal rank:

$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{r_u}$$

where $r_u$ is the rank position of the first relevant item for user $u$, with the reciprocal rank taken as 0 if no relevant item appears.

If multiple relevant items can appear in the list, average precision is also useful:

$$\mathrm{AP@}K(u) = \frac{1}{\min(|G_u|, K)} \sum_{j=1}^{K} \mathrm{Precision@}j(u) \cdot \mathbf{1}\left(i_{u,j} \in G_u\right)$$

$$\mathrm{MAP@}K = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP@}K(u)$$

where $i_{u,j}$ is the item shown at rank $j$ to user $u$.

Among these, NDCG@K is often the strongest single ranking metric because it rewards putting the most relevant items near the top rather than merely somewhere in the top K.
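
For one user with binary relevance labels, these list metrics are only a few lines each. A minimal sketch, where `ranked` is the ordered list shown to the user and `relevant` plays the role of $G_u$:

```python
import math

# NDCG@K, MRR, and AP@K for a single user, with binary relevance
# (so the 2^rel - 1 gain in DCG is simply 1 for relevant items).

def ndcg_at_k(ranked, relevant, k):
    dcg = sum(1.0 / math.log2(j + 2)             # position j is 0-based here
              for j, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)           # best case: all hits up top
    idcg = sum(1.0 / math.log2(j + 2) for j in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked, relevant):
    for j, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / j
    return 0.0                                   # no relevant item surfaced

def ap_at_k(ranked, relevant, k):
    hits, total = 0, 0.0
    for j, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / j                    # Precision@j at each hit
    return total / min(len(relevant), k) if relevant else 0.0

ranked, relevant = ["a", "b", "c", "d"], {"b", "d"}
print(mrr(ranked, relevant))  # 0.5: the first relevant item sits at rank 2
```

Averaging each function over users gives the corpus-level NDCG@K, MRR, and MAP@K defined above.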

Protocol choices matter as much as the metric

  • Use chronological splits for implicit and sequence-aware tasks. Random splits can leak future information.
  • State clearly whether evaluation is full-catalog, sampled-negative, or candidate-set based. Numbers are not comparable across these protocols.
  • Evaluate on exposed or eligible items when possible. Treating every unclicked item in the full catalog as a negative can distort results.
  • Report by segment: new users, power users, new items, head items, and long-tail items often behave very differently.
  • When the system has multiple stages, evaluate each stage separately and end-to-end.
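
The first bullet, a chronological split, is easy to get wrong with generic train/test utilities. A minimal sketch of the leave-last-out protocol used by the D2L evaluator, assuming a simple `(user, item, timestamp)` log format:

```python
# Chronological split: train on each user's earlier interactions and hold out
# the most recent one as ground truth. A random split here would leak the future.

def chrono_split(interactions):
    """interactions: list of (user, item, timestamp). Returns a train list and
    a {user: held-out item} dict for users with at least two interactions."""
    by_user = {}
    for u, i, t in interactions:
        by_user.setdefault(u, []).append((t, i))
    train, heldout = [], {}
    for u, events in by_user.items():
        events.sort()                          # oldest first
        if len(events) < 2:
            train += [(u, i) for _, i in events]
            continue                           # too little history to hold out
        heldout[u] = events[-1][1]             # most recent item is the label
        train += [(u, i) for _, i in events[:-1]]
    return train, heldout

logs = [(1, "a", 10), (1, "b", 20), (1, "c", 30), (2, "x", 5)]
train, heldout = chrono_split(logs)
print(train, heldout)
```

Users with a single interaction contribute training data but no test label, which is one concrete reason to report cold-start segments separately.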

Beyond ranking accuracy

Accuracy metrics alone can produce a recommender that is brittle or bad for the product.

  • Coverage: how much of the catalog is ever recommended
  • Diversity: how different the recommended items are from one another
  • Novelty and serendipity: whether the system only repeats obvious items
  • Calibration: whether recommendations match the user’s current intent, not just their historical average
  • Fairness and marketplace health: whether some suppliers, creators, or item groups are systematically suppressed

These matter because a system with slightly lower NDCG@K can still be better for long-term engagement if it improves diversity, catalog health, or repeat-user satisfaction.

Online evaluation

Offline metrics are necessary but insufficient. You still need A/B tests with:

  • Primary metrics: CTR, conversion, watch time, retention, revenue, or long-term value depending on the product
  • Guardrails: latency, page-load impact, complaint rate, hide/block rate, unsafe-content rate
  • Diagnostic cuts: new vs returning users, cold-start items, geography, device, heavy-user cohorts

For ranking changes, it is also useful to monitor the full funnel:

  • Candidate recall
  • Ranker win rate on exposed impressions
  • Final-surface engagement
  • Downstream business outcomes

Offline-to-online recommender evaluation flow

7.4 Re-ranking, freshness, diversity, and exploration

Pure exploitation can collapse catalog diversity. You need controlled exploration:

  • Epsilon-greedy or Thompson-style policies
  • Re-ranking for diversity/novelty
  • Periodic calibration checks
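
An epsilon-greedy policy at the slate level can be sketched in a few lines. This is a simplified illustration, not a production bandit; the item names are placeholders:

```python
import random

# Epsilon-greedy slate construction: mostly exploit the ranker's order, but
# with probability eps per slot, swap in a random eligible item to keep
# gathering signal on the rest of the catalog.

def epsilon_greedy_slate(ranked_items, eligible_pool, k=10, eps=0.05, rng=random):
    slate = list(ranked_items[:k])
    for pos in range(len(slate)):
        if rng.random() < eps:
            candidate = rng.choice(eligible_pool)
            if candidate not in slate:
                slate[pos] = candidate      # explore: show a non-ranked item
    return slate

ranked = [f"item{i}" for i in range(100)]
slate = epsilon_greedy_slate(ranked, eligible_pool=ranked, k=10, eps=0.1)
print(len(slate))  # still a 10-item slate
```

Thompson-style policies replace the uniform random swap with sampling from a posterior over item value, which concentrates exploration where uncertainty is highest.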

Google’s reranking material is especially useful here. In practice, re-ranking is where you inject constraints that the base ranker usually misses:

  • Freshness so the feed does not go stale
  • Diversity so near-duplicate items do not dominate
  • Fairness or marketplace balance so one creator, seller, or provider is not systematically overexposed
  • Local policy constraints such as demotions, blocks, maturity filters, or legal limits

This stage is often simpler than the main ranker, but it has outsized product impact because it controls the final list actually seen by the user.
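
One common diversity re-ranker is maximal marginal relevance (MMR): greedily pick the next item to balance the ranker's score against similarity to what is already on the slate. A minimal sketch with toy scores and embeddings:

```python
import numpy as np

# MMR re-ranking: trade off ranker score against similarity to already-chosen
# items. lam=1 reproduces the ranker's order; lower lam pushes diversity harder.
# Scores and embeddings below are toy data.

def mmr_rerank(scores, embeddings, k=5, lam=0.7):
    """scores: (n,) ranker scores; embeddings: (n, d) unit-normalized vectors."""
    chosen = []
    remaining = list(range(len(scores)))
    while remaining and len(chosen) < k:
        def mmr_score(i):
            # Max cosine similarity to any item already on the slate.
            sim = max((embeddings[i] @ embeddings[j] for j in chosen), default=0.0)
            return lam * scores[i] - (1 - lam) * sim
        best = max(remaining, key=mmr_score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(2)
emb = rng.normal(size=(20, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
order = mmr_rerank(rng.random(20), emb, k=5, lam=0.7)
print(order)
```

The same greedy loop extends naturally to the other re-ranking constraints above by adding penalty terms for staleness, overexposed providers, or policy demotions.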

7.5 Reliability and monitoring

Data scientists should treat recommenders as continuously monitored systems:

  • Feature drift and embedding drift
  • Candidate recall degradation
  • Online metric drift and alerting
  • Safe fallback policies
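
A common drift alarm for the first bullet is the population stability index (PSI) over a feature's binned distribution. A minimal sketch; the 0.1 "watch" and 0.25 "alert" thresholds are widespread rules of thumb, not standards:

```python
import numpy as np

# Population stability index (PSI): compare serving traffic's distribution of a
# feature (or embedding norm, score, etc.) against a training-time baseline.

def psi(expected, actual, bins=10, eps=1e-6):
    """Bin edges come from the baseline; eps guards against empty bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(3)
baseline = rng.normal(0.0, 1.0, 10_000)     # feature at training time
shifted = rng.normal(1.0, 1.0, 10_000)      # simulated drifted serving traffic

print(round(psi(baseline, baseline[:5000]), 3))  # near zero: no drift
print(round(psi(baseline, shifted), 3))          # well above 0.25: alert-worthy
```

The same check applied to retrieval scores or candidate-set overlap gives an early warning for the candidate-recall degradation mentioned above.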

If you want a broader production-systems reference beyond recommender-specific papers, Chip Huyen’s Designing Machine Learning Systems is a strong complement to this chapter.

It is not a recommender-systems book specifically, but it is useful for recommender practitioners because it covers the operational side of production ML: data pipelines, iterative development, deployment tradeoffs, monitoring, feedback loops, and the gap between offline model quality and deployed system behavior. That makes it a good companion once you move from method selection into long-term system ownership.
