7. Production Concerns
The model taxonomy is excellent, but real systems also require these decisions.
7.1 Retrieval + ranking architecture
Most large systems are two-stage:
- Candidate generation (fast, high recall)
- Ranking (slower, richer features/objective)
Without this separation, serving cost or latency becomes prohibitive.
Google’s course extends this into a practical three-stage view:
- Candidate generation
- Scoring or ranking
- Re-ranking
The extra re-ranking stage matters because the best ranked list for raw engagement is often not the best final surface once you account for freshness, diversity, fairness, or business constraints.
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
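The three-stage flow above can be sketched end to end. This is a toy NumPy sketch, not a production design: the "ranker" is just another dot product standing in for a heavier model, and a hypothetical freshness limit plays the role of a re-ranking constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy catalog: item embeddings plus per-item freshness (days since publish).
n_items, dim = 1000, 16
item_emb = rng.normal(size=(n_items, dim))
freshness_days = rng.integers(0, 30, size=n_items)

def generate_candidates(user_emb, k=100):
    """Stage 1: fast, high-recall retrieval by inner product."""
    scores = item_emb @ user_emb
    return np.argpartition(-scores, k)[:k]

def rank(user_emb, candidates):
    """Stage 2: richer scoring, but only over the small candidate set."""
    scores = item_emb[candidates] @ user_emb  # stand-in for a heavier model
    return candidates[np.argsort(-scores)]

def rerank(ranked, top_n=10, max_age_days=14):
    """Stage 3: apply a constraint the base ranker ignores (freshness here)."""
    fresh = [i for i in ranked if freshness_days[i] <= max_age_days]
    return fresh[:top_n]

user_emb = rng.normal(size=dim)
slate = rerank(rank(user_emb, generate_candidates(user_emb)))
```

The key structural point is that the expensive stage only ever sees the candidate set, never the full catalog.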
In production, candidate generation is usually itself a mixture of sources:
- Embedding nearest neighbors from a two-tower or matrix-factorization model
- Co-visitation or graph-based retrieval
- Popularity or trending backfills
- Rule-based inventory or policy constraints
A key point from the Google course is that scores from different candidate generators are usually not comparable. That is why a separate scorer or ranker is often necessary after retrieval.
For neural retrieval, Google also stresses approximate nearest-neighbor search rather than exact brute-force scoring over the full catalog. Libraries such as ScaNN are used to make this practical at large scale.
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
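To see what ANN search is buying you, here is the exact brute-force top-k search it approximates: every query scores every catalog item, which is O(N·d) per query. The catalog below is made up; at real scale this inner loop is what libraries such as ScaNN replace with partitioning and quantization.

```python
import numpy as np

rng = np.random.default_rng(1)
catalog = rng.normal(size=(50_000, 32)).astype(np.float32)

def exact_top_k(query, k=10):
    # Brute force: one dot product per catalog item. ANN libraries avoid
    # scoring the full catalog and return approximate (not exact) top-k.
    scores = catalog @ query
    top = np.argpartition(-scores, k)[:k]       # unordered top-k
    return top[np.argsort(-scores[top])]        # sort just those k

query = rng.normal(size=32).astype(np.float32)
hits = exact_top_k(query)
```

When you switch to ANN, offline evaluation should include recall of the ANN index against this exact search, since the approximation itself can drop relevant items.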
7.2 Label design and negatives
For implicit data, non-click is not always negative. You need:
- Exposure-aware negatives
- Position-bias-aware training
- Time-windowed labels matching product goals
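A minimal sketch of exposure-aware, time-windowed labeling, assuming a hypothetical impression log of (user, item, clicked, timestamp) rows. The point is what it does *not* do: items the user never saw are left unlabeled rather than treated as negatives.

```python
from datetime import datetime, timedelta

# Hypothetical impression log: (user, item, clicked, timestamp).
log = [
    ("u1", "a", True,  datetime(2024, 1, 1)),
    ("u1", "b", False, datetime(2024, 1, 1)),
    ("u1", "c", False, datetime(2023, 6, 1)),  # outside the label window
    ("u2", "a", False, datetime(2024, 1, 2)),
    ("u2", "d", True,  datetime(2024, 1, 2)),
]

def build_labels(log, window_end, window_days=30):
    """Positives = clicks; negatives = items the user actually saw but did
    not click, inside the label window. Unexposed catalog items are left
    unlabeled rather than assumed negative."""
    start = window_end - timedelta(days=window_days)
    labels = []
    for user, item, clicked, ts in log:
        if start <= ts <= window_end:
            labels.append((user, item, 1 if clicked else 0))
    return labels

labels = build_labels(log, window_end=datetime(2024, 1, 31))
```

The window length should match the product goal: a 30-day window here is an arbitrary choice, not a recommendation.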
Google’s scoring module makes a related point: you need to be explicit about what you are optimizing. A model trained for click probability can converge to clickbait. A model trained for watch time may overserve long items. A model trained for immediate conversion can hurt long-term trust or retention.
In other words, score definition is part of the product design, not just a modeling choice.
For feed-style or slate recommendation, Google also recommends distinguishing between:
- Position-dependent models, which estimate utility at a fixed slot
- Position-independent models, which try to estimate intrinsic relevance before layout effects
That distinction matters because position bias can make top slots look artificially better even when the item itself is not more relevant.
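One common correction for position bias is inverse propensity weighting: a click is up-weighted by the inverse of the estimated probability that its slot was examined at all. The propensities below are made up for illustration; in practice they come from slot randomization or a fitted position-bias model.

```python
# Illustrative per-slot examination propensities (made-up values; real ones
# come from randomization experiments or a click model).
propensity = {1: 0.9, 2: 0.6, 3: 0.4, 4: 0.25, 5: 0.15}

def ipw_click_value(position, clicked):
    """Inverse-propensity weight for a training example: a click at a
    rarely-examined slot counts more, so the model does not simply learn
    that slot 1 gets clicked."""
    if not clicked:
        return 0.0
    return 1.0 / propensity[position]
```

Under this weighting, a click at slot 5 contributes several times the training weight of a click at slot 1, which is exactly the correction a position-independent relevance model needs.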
7.3 Evaluating recommender and ranking systems
D2L’s NeuMF evaluator is a good starting point for implicit-feedback ranking evaluation. The protocol uses a chronological split, holds out one future ground-truth item per user, and ranks that item against the items the user has not interacted with.
Two core metrics in that setup are:

$$\text{Hit}@\ell = \frac{1}{m} \sum_{u \in \mathcal{U}} \mathbf{1}(rank_{u, g_u} \le \ell)$$

and

$$\text{AUC} = \frac{1}{m} \sum_{u \in \mathcal{U}} \frac{1}{|\mathcal{I} \setminus S_u|} \sum_{j \in \mathcal{I} \setminus S_u} \mathbf{1}(rank_{u, g_u} < rank_{u, j})$$

where $m$ is the number of users, $g_u$ is the held-out ground-truth item for user $u$, $rank_{u, g_u}$ is its position in the ranked list, $\mathcal{I}$ is the full item set, and $S_u$ is the set of items user $u$ interacted with during training.
This evaluator is useful because it respects time order and measures whether the held-out future item is surfaced near the top. But for production recommendation systems, you usually need a wider evaluation stack than hit rate and AUC on a single protocol.
Stage-specific offline metrics
- Retrieval: use $\text{Recall}@K$ or candidate hit rate to verify that the candidate generator is not dropping relevant items before the ranker sees them.
- Ranking: use $\text{NDCG}@K$, $\text{MRR}$, and $\text{Hit}@K$ for top-of-list quality. If you have multiple relevant held-out items per user, $\text{MAP}@K$ is also useful.
- Rating prediction: use RMSE or MAE only when explicit rating prediction is the real product task. These metrics are much less informative for feed ranking or item recommendation.
Let $R_u$ be the set of held-out relevant items for user $u$, let $C_u$ be the candidate set produced by retrieval, let $L_u^K$ be the final top-$K$ ranked list, and let $rel_i \in \{0, 1\}$ indicate whether the item at position $i$ of that list is relevant.

For retrieval, a standard candidate-stage metric is:

$$\text{Recall}@K = \frac{|R_u \cap C_u|}{|R_u|}$$

For the final ranked list, the analogous metric is:

$$\text{Recall}@K = \frac{|R_u \cap L_u^K|}{|R_u|}$$

To reward correct ordering near the top, define

$$\text{DCG}@K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}$$

and then normalize by the ideal ordering:

$$\text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}$$

where $\text{IDCG}@K$ is the value of $\text{DCG}@K$ for the ideal list that places all relevant items first.

If you care about the position of the first relevant result, use mean reciprocal rank:

$$\text{MRR} = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{rank_u^{*}}$$

where $rank_u^{*}$ is the position of the first relevant item in user $u$'s list.

If multiple relevant items can appear in the list, average precision is also useful:

$$\text{AP}@K = \frac{1}{\min(|R_u|, K)} \sum_{k=1}^{K} \text{Precision}@k \cdot rel_k$$

where $\text{Precision}@k$ is the fraction of the top $k$ positions that are relevant, and $\text{MAP}@K$ averages $\text{AP}@K$ over users.

Among these, $\text{Recall}@K$ is the workhorse for the retrieval stage and $\text{NDCG}@K$ for final ranking quality, with $\text{MRR}$ and $\text{MAP}@K$ as complements when first-hit position or multiple relevant items matter.
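These metrics are short to implement. A self-contained sketch for binary relevance and a single ranked list per user:

```python
import math

def hit_at_k(ranked, relevant, k):
    """1 if any relevant item appears in the top k, else 0."""
    return int(any(item in relevant for item in ranked[:k]))

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items recovered in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def dcg_at_k(ranked, relevant, k):
    # i is 0-based here, so position i+1 gets discount 1/log2(i+2).
    return sum(1.0 / math.log2(i + 2)
               for i, item in enumerate(ranked[:k]) if item in relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG normalized by the ideal list (all relevant items first)."""
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg_at_k(ranked, relevant, k) / ideal

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item; 0 if none found."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant, k):
    """Mean of Precision@k over the positions holding relevant items."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / min(len(relevant), k)

ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
```

Averaging each metric over users gives the reported numbers (e.g. MRR and MAP@K).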
Protocol choices matter as much as the metric
- Use chronological splits for implicit and sequence-aware tasks. Random splits can leak future information.
- State clearly whether evaluation is full-catalog, sampled-negative, or candidate-set based. Numbers are not comparable across these protocols.
- Evaluate on exposed or eligible items when possible. Treating every unclicked item in the full catalog as a negative can distort results.
- Report by segment: new users, power users, new items, head items, and long-tail items often behave very differently.
- When the system has multiple stages, evaluate each stage separately and end-to-end.
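The first protocol point, a per-user chronological split, is a few lines of code. A minimal sketch assuming interactions arrive as (user, item, timestamp) tuples:

```python
def chronological_split(interactions, holdout_fraction=0.2):
    """Split each user's interactions by time: the most recent fraction
    becomes the test set, so no future information leaks into training."""
    by_user = {}
    for user, item, ts in interactions:
        by_user.setdefault(user, []).append((ts, item))
    train, test = [], []
    for user, events in by_user.items():
        events.sort()  # oldest first
        cut = max(1, int(len(events) * (1 - holdout_fraction)))
        train += [(user, item) for _, item in events[:cut]]
        test += [(user, item) for _, item in events[cut:]]
    return train, test

interactions = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
                ("u1", "d", 4), ("u1", "e", 5)]
train, test = chronological_split(interactions)
```

A global time cutoff (one date for all users) is the other common variant; which one you use should be stated alongside the metric numbers.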
Beyond ranking accuracy
Accuracy metrics alone can produce a recommender that is brittle or bad for the product.
- Coverage: how much of the catalog is ever recommended
- Diversity: how different the recommended items are from one another
- Novelty and serendipity: whether the system only repeats obvious items
- Calibration: whether recommendations match the user’s current intent, not just their historical average
- Fairness and marketplace health: whether some suppliers, creators, or item groups are systematically suppressed
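Coverage and diversity are cheap to measure offline. A sketch using a toy category-based similarity; a real system would use embedding similarity instead.

```python
import itertools

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that is ever recommended to anyone."""
    seen = set(itertools.chain.from_iterable(recommendation_lists))
    return len(seen) / catalog_size

def intra_list_distance(items, similarity):
    """Mean pairwise dissimilarity inside one list (higher = more diverse)."""
    pairs = list(itertools.combinations(items, 2))
    return sum(1 - similarity(a, b) for a, b in pairs) / len(pairs)

# Toy similarity: same category = identical, different category = unrelated.
category = {"a": 0, "b": 0, "c": 1, "d": 2}
sim = lambda a, b: 1.0 if category[a] == category[b] else 0.0

lists = [["a", "b"], ["a", "c"]]
```

Tracking these alongside accuracy metrics makes exploitation-driven collapse visible before it shows up in long-term engagement.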
These matter because a system with slightly lower offline ranking accuracy can still be the better product if it keeps catalog coverage, creator health, and long-term user trust intact.
Online evaluation
Offline metrics are necessary but insufficient. You still need A/B tests with:
- Primary metrics: CTR, conversion, watch time, retention, revenue, or long-term value depending on the product
- Guardrails: latency, page-load impact, complaint rate, hide/block rate, unsafe-content rate
- Diagnostic cuts: new vs returning users, cold-start items, geography, device, heavy-user cohorts
For ranking changes, it is also useful to monitor the full funnel:
- Candidate recall
- Ranker win rate on exposed impressions
- Final-surface engagement
- Downstream business outcomes
7.4 Re-ranking, freshness, diversity, and exploration
Pure exploitation can collapse catalog diversity. You need controlled exploration:
- Epsilon-greedy or Thompson-style policies
- Re-ranking for diversity/novelty
- Periodic calibration checks
Google’s reranking material is especially useful here. In practice, re-ranking is where you inject constraints that the base ranker usually misses:
- Freshness so the feed does not go stale
- Diversity so near-duplicate items do not dominate
- Fairness or marketplace balance so one creator, seller, or provider is not systematically overexposed
- Local policy constraints such as demotions, blocks, maturity filters, or legal limits
This stage is often simpler than the main ranker, but it has outsized product impact because it controls the final list actually seen by the user.
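One standard mechanism for the diversity part of re-ranking is Maximal Marginal Relevance (MMR): greedily pick the item with the best trade-off between base score and similarity to items already on the slate. A sketch with a toy similarity function:

```python
def mmr_rerank(ranked_scores, similarity, lam=0.7, top_n=5):
    """Maximal Marginal Relevance: lam trades off base relevance against
    redundancy with already-selected items (lam=1 ignores diversity)."""
    remaining = dict(ranked_scores)   # item -> base relevance score
    selected = []
    while remaining and len(selected) < top_n:
        def mmr(item):
            redundancy = max((similarity(item, s) for s in selected),
                             default=0.0)
            return lam * remaining[item] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

# Toy similarity: items "a" and "b" are near-duplicates, "c" is distinct.
category = {"a": 0, "b": 0, "c": 1}
sim = lambda x, y: 1.0 if category[x] == category[y] else 0.0
slate = mmr_rerank([("a", 1.0), ("b", 0.95), ("c", 0.8)], sim,
                   lam=0.5, top_n=2)
```

Here the near-duplicate "b" is skipped in favor of the lower-scored but distinct "c", which is exactly the behavior a diversity re-ranker is meant to produce.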
7.5 Reliability and monitoring
Data scientists should treat recommenders as continuously monitored systems:
- Feature drift and embedding drift
- Candidate recall degradation
- Online metric drift and alerting
- Safe fallback policies
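For feature and embedding drift, one simple and widely used monitor is the Population Stability Index between a reference window and the live window. A sketch; the equal-width binning is one choice among several, and the 0.1 / 0.25 thresholds are a common rule of thumb rather than a standard.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference distribution and a live one over equal-width
    bins. Rule of thumb (convention, not a guarantee): < 0.1 stable,
    > 0.25 significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def frac(values, b):
        count = sum(1 for v in values
                    if lo + b * width <= v < lo + (b + 1) * width)
        return max(count / len(values), 1e-6)   # floor avoids log(0)
    psi = 0.0
    for b in range(bins):
        e, a = frac(expected, b), frac(actual, b)
        psi += (a - e) * math.log(a / e)
    return psi

ref = [i / 100 for i in range(100)]            # reference window
same = [i / 100 for i in range(100)]           # unchanged distribution
shifted = [0.5 + i / 200 for i in range(100)]  # live window drifted upward
```

Running this per feature (or per embedding dimension, or on score distributions) and alerting on the threshold gives a cheap first line of drift monitoring.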
7.6 Recommended systems reading
If you want a broader production-systems reference beyond recommender-specific papers, Chip Huyen’s Designing Machine Learning Systems is a strong complement to this chapter.
It is not a recommender-systems book specifically, but it is useful for recommender practitioners because it covers the operational side of production ML: data pipelines, iterative development, deployment tradeoffs, monitoring, feedback loops, and the gap between offline model quality and deployed system behavior. That makes it a good companion once you move from method selection into long-term system ownership.