Understanding Recommender Systems

Single-page version for printing or saving as PDF.

Use the handbook version if you want chapter navigation and the left sidebar.

Chapter Guide

  1. Why Recommender Systems Matter
  2. Explicit vs. Implicit Feedback
  3. Model Families
  4. Matrix Factorization
  5. Feature-Rich Recommendation
  6. Deep Models
  7. Production Concerns
  8. Build Sequence
  9. Summary
  10. References
  11. Survey Papers and Further Reading

1. Why Recommender Systems Matter

Recommender systems help users navigate very large item catalogs (videos, products, courses, jobs, music) by ranking items likely to be relevant to each user. See the background overview in Wikipedia: Recommender system.

Recommendation process illustration

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

For data scientists, this is usually not a pure prediction task. It is a ranking and decision problem with constraints:

  • Relevance and personalization
  • Diversity and novelty
  • Latency and serving cost
  • Business goals (retention, conversion, long-term value)

Two-stage recommender architecture

1.1 Common applications

  • E-commerce and retail: cross-sell, upsell, “complete the look”, and basket expansion
  • Media and entertainment: personalized ranking of video, music, articles, and ads
  • Banking and financial services: product recommendations, offers, and next-best action

1.2 Business value

  • Helps users discover items they would not have found through search alone
  • Increases engagement, session depth, and content consumption
  • Improves conversion, basket size, and retention when recommendations are well-targeted

Google’s recommender course also makes a useful product distinction between two common surfaces:

  • Homepage recommendations, where the query is the user or session context
  • Related-item recommendations, where the query is the current item being viewed

That distinction matters because homepage recommendation usually starts from a user or context embedding, while related-item recommendation often starts from the item embedding itself and retrieves nearby items in embedding space.

2. Explicit vs. Implicit Feedback

As in the reference article, the first key split is the type of supervision.

2.1 Explicit feedback

Examples:

  • Star ratings
  • Like/dislike labels
  • Written reviews with sentiment scores

Pros:

  • Direct preference signal
  • Easier to define regression-style losses

Cons:

  • Sparse in most real products
  • Selection bias (only some users rate)

2.2 Implicit feedback

Examples:

  • Clicks
  • Watch time
  • Purchases
  • Add-to-cart, save, dwell

Pros:

  • High volume
  • Better behavioral coverage

Cons:

  • No explicit negative signal: a non-interaction may simply mean the item was never shown
  • Noisy labels (accidental clicks, autoplay, very short dwell)
  • Exposure and position bias inherited from the current ranking policy

In both cases, interactions define a sparse user-item matrix with entries indexed by user-item pairs $(u, i)$.

Explicit versus implicit feedback comparison

User-item matrix examples for explicit and implicit data

2.3 Recommendation tasks

Following D2L Chapter 21, it helps to separate recommendation work by task:

  • Rating prediction: estimate a user’s explicit rating for an item
  • Top-n recommendation: rank candidate items and return a personalized list
  • Sequence-aware recommendation: use ordered behavior and timestamps
  • Click-through rate prediction: predict whether a shown item or ad will be clicked
  • Cold-start recommendation: serve new users or new items when history is limited

These tasks overlap, but they drive different labels, evaluation protocols, and model choices.

2.4 Benchmark datasets and split strategy

The MovieLens 100K dataset remains the standard conceptual benchmark for explicit-feedback recommendation.

  • 100,000 ratings
  • 943 users
  • 1,682 movies
  • Ratings from 1 to 5
  • Approximate matrix sparsity of 93.7%

Two split strategies from D2L are especially useful in practice:

  1. Random split for rating prediction and general offline evaluation
  2. Sequence-aware split, where the most recent interaction is held out per user

This distinction matters because sequence-aware recommendation should be evaluated with a chronological split, not a random one.
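
The two strategies can be sketched with pandas. The toy log and its column names (`user_id`, `item_id`, `rating`, `ts`) are illustrative assumptions, not part of any particular dataset loader:

```python
import pandas as pd

# Toy interaction log; columns are illustrative.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "item_id": [10, 11, 12, 10, 13, 11, 12, 14],
    "rating":  [4, 5, 3, 2, 5, 4, 4, 1],
    "ts":      [1, 2, 3, 1, 2, 1, 2, 3],
})

# 1) Random split: acceptable for rating prediction and offline evaluation.
test = df.sample(frac=0.25, random_state=0)
train = df.drop(test.index)

# 2) Sequence-aware split: hold out each user's most recent interaction.
df = df.sort_values("ts")
last = df.groupby("user_id").tail(1)   # one held-out row per user
hist = df.drop(last.index)             # everything earlier is training data
```

With this toy log, `last` contains exactly one row per user, and it is always that user's latest timestamp.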

3. Model Families

Model choice depends heavily on what data you have. If you only observe interactions, collaborative filtering is usually the first serious approach. If you also have user and item attributes, content-based or hybrid models become more useful. If the current situation matters, such as device, country, time, or within-session behavior, then contextual models become important.

3.1 Content-based filtering

Use user/item attributes and metadata.

  • Item vectors from text/category/tags/embeddings
  • User representation from demographics and/or consumed item profiles
  • Similarity models (cosine, k-NN) or supervised models over features

Strength:

  • Better cold-start for new items (and sometimes new users)

Limitation:

  • Limited collaborative signal; can over-specialize
  • Can become overly narrow if the feature space does not capture richer or emerging interests

Google’s course also emphasizes that content-based systems are often easier to explain and easier to cold-start for new items, but they tend to be weaker at serendipity than collaborative models.

| Aspect | Advantages | Disadvantages |
|---|---|---|
| User specificity | Does not require interaction data from other users, so it can scale cleanly across many users and preserve privacy better | Quality depends heavily on having good item features and user profiles |
| Discovery pattern | Can serve niche items that match a user’s known interests very well | Usually expands less well beyond existing interests, so serendipity is weaker |
| Modeling burden | Easier to explain because recommendations can be tied back to item attributes | Requires substantial domain knowledge and hand-engineered or high-quality learned features |

Content-based filtering feature matrix illustration

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

3.2 Collaborative filtering

Use interaction patterns across all users/items.

  • Neighborhood methods (user-user, item-item)
  • Latent-factor methods (matrix factorization)

Strength:

  • Often strong personalization with enough interaction history

Limitation:

  • Cold-start if no history exists

| Aspect | Advantages | Disadvantages |
|---|---|---|
| Feature engineering | Embeddings are learned automatically, so little prior domain knowledge is required | Harder to incorporate side features such as demographics, metadata, or context without model extensions |
| Discovery pattern | Can introduce serendipity because similar users can pull an item into a user’s candidate set | Suffers from cold-start for fresh users and fresh items without interactions |
| Practical role | Strong starting point because a feedback matrix alone can power a usable candidate generator | Production systems usually need extra machinery such as WALS projection heuristics, side-feature augmentation, or hybrid models to fill the gaps |

3.3 Contextual filtering

Contextual filtering incorporates information about the current situation into the recommendation process.

  • Examples of context: device, country, date, time, session state, or recent action sequence
  • Useful when the same user may want different items under different circumstances
  • Often framed as next-action or next-item prediction rather than only long-run preference estimation

Contextual recommendation diagram

3.4 Hybrid models

Combine metadata with interaction learning.

  • Best default choice in many production systems
  • Handles cold-start better than pure collaborative filtering
  • Usually outperforms pure content-based methods once enough interactions accumulate

Content-based, collaborative filtering, and hybrid model comparison

3.5 Embedding spaces and similarity measures for candidate generation

The Google Developers course sharpens an important operational point: candidate generation is usually a nearest-neighbor search problem in an embedding space. Given a query embedding q and item embedding x, the retrieval stage depends heavily on the similarity measure you choose.

Common choices are:

$$ s_{\cos}(q, x) = \frac{\langle q, x \rangle}{\|q\|_2 \, \|x\|_2} $$

$$ s_{\text{dot}}(q, x) = \langle q, x \rangle $$

$$ d_{L_2}(q, x) = \|q - x\|_2 $$

If the embeddings are normalized, cosine, dot product, and Euclidean distance all induce the same ranking, since $\|q - x\|_2^2 = 2 - 2\langle q, x \rangle$ for unit vectors. Without normalization, however, they behave differently:

  • Dot product favors larger embedding norms, which often correlates with popular or frequent items
  • Cosine focuses more on angular alignment, which can be better for semantic similarity
  • Euclidean distance emphasizes physical closeness in the embedding space

Google also suggests a useful interpolation between pure cosine and pure dot product:

$$ s_\alpha(q, x) = \|q\|_2^{\alpha} \, \|x\|_2^{\alpha} \, \cos(q, x), \quad \alpha \in (0, 1) $$

This lets you keep some popularity signal without letting large-norm items dominate retrieval.
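
These differences are easy to see numerically. The sketch below uses hand-picked vectors so that dot product and cosine disagree about which item to retrieve, and implements the interpolated similarity; `s_alpha` is an illustrative name, not a library function:

```python
import numpy as np

q = np.array([1.0, 0.0])
x_small = np.array([0.9, 0.1])   # well aligned with q, small norm
x_big   = np.array([2.0, 2.0])   # less aligned, large norm

def s_dot(q, x):
    return float(q @ x)

def s_cos(q, x):
    return float(q @ x / (np.linalg.norm(q) * np.linalg.norm(x)))

def d_l2(q, x):
    return float(np.linalg.norm(q - x))

# The measures disagree: dot product favors the large-norm item,
# cosine favors the better-aligned one.
assert s_dot(q, x_big) > s_dot(q, x_small)
assert s_cos(q, x_small) > s_cos(q, x_big)

def s_alpha(q, x, alpha=0.5):
    """Interpolation from the text: ||q||^a * ||x||^a * cos(q, x)."""
    return (np.linalg.norm(q) * np.linalg.norm(x)) ** alpha * s_cos(q, x)
```

At the limits, `alpha=1` recovers the dot product and `alpha=0` recovers cosine similarity.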

Candidate retrieval in embedding space

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

Different similarity choices induce different rankings

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

4. Matrix Factorization

The reference article emphasizes matrix factorization variants. This remains foundational for data scientists.

4.1 PMF / latent factors (explicit feedback)

Model:

$$ \hat r_{ui} = p_u^\top q_i $$

where user and item embeddings $p_u, q_i \in \mathbb{R}^f$.

Regularized loss over observed pairs $\Omega$:

$$ \min_{P, Q} \sum_{(u,i) \in \Omega} \left( r_{ui} - p_u^\top q_i \right)^2 + \lambda \left( \|p_u\|_2^2 + \|q_i\|_2^2 \right) $$

Optimization:

  • SGD (simple, flexible)
  • ALS (efficient for large sparse systems)
  • Practical implementations are available in the Surprise library and its documentation

With ALS, you alternate between solving for user factors while holding item factors fixed and solving for item factors while holding user factors fixed. That makes large sparse factorization problems easier to optimize in practice.
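
A minimal numpy sketch of that alternation, assuming a small explicit-ratings matrix where 0 marks unobserved entries. Each alternating step reduces to a per-row ridge regression against the fixed side:

```python
import numpy as np

# Toy explicit-ratings matrix; 0 marks unobserved entries.
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 0],
    [0, 1, 5, 4, 0],
], dtype=float)
mask = R > 0
f, lam = 2, 0.1                       # latent dimension, L2 regularization

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], f))   # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], f))   # item factors

def solve_side(R, mask, fixed, lam, f):
    """Closed-form ridge solve per row, using only observed entries."""
    out = np.zeros((R.shape[0], f))
    for u in range(R.shape[0]):
        idx = mask[u]
        A = fixed[idx].T @ fixed[idx] + lam * np.eye(f)
        out[u] = np.linalg.solve(A, fixed[idx].T @ R[u, idx])
    return out

err0 = ((R - P @ Q.T)[mask] ** 2).mean()
for _ in range(20):
    P = solve_side(R, mask, Q, lam, f)        # users, item factors fixed
    Q = solve_side(R.T, mask.T, P, lam, f)    # items, user factors fixed
err = ((R - P @ Q.T)[mask] ** 2).mean()
```

Each half-step has a closed-form solution, which is why ALS parallelizes well across rows in large sparse problems.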

Illustration of matrix factorization model

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

Alternating least squares optimization cycle

4.2 SVD-style bias terms

A common extension adds global/user/item bias terms:

$$ \hat r_{ui} = \mu + b_u + b_i + p_u^\top q_i $$

Biases capture broad effects (strict users, broadly popular items) and usually improve quality.
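
One SGD update for this biased model can be sketched as follows; the learning rate and regularization values are illustrative defaults, not tuned settings:

```python
import numpy as np

def sgd_step(mu, b_u, b_i, p_u, q_i, r_ui, lr=0.01, lam=0.02):
    """One SGD update for r_hat = mu + b_u + b_i + <p_u, q_i>."""
    err = r_ui - (mu + b_u + b_i + p_u @ q_i)
    p_new = p_u + lr * (err * q_i - lam * p_u)   # update from old q_i
    q_new = q_i + lr * (err * p_u - lam * q_i)   # update from old p_u
    b_u_new = b_u + lr * (err - lam * b_u)
    b_i_new = b_i + lr * (err - lam * b_i)
    return b_u_new, b_i_new, p_new, q_new

# Fit a single observed rating of 5 as a smoke test.
rng = np.random.default_rng(0)
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u = rng.normal(scale=0.1, size=8)
q_i = rng.normal(scale=0.1, size=8)
for _ in range(500):
    b_u, b_i, p_u, q_i = sgd_step(mu, b_u, b_i, p_u, q_i, r_ui=5.0)
pred = mu + b_u + b_i + p_u @ q_i
```

The regularization keeps the prediction slightly below the target, which is exactly the shrinkage behavior you want on sparse data.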

4.3 Implicit-feedback factorization

Following the article’s logic, implicit events are treated as preference plus confidence.

One common setup:

  • Preference: $p_{ui} \in \{0, 1\}$ from interaction presence
  • Confidence: $c_{ui} = 1 + \alpha t_{ui}$, where $t_{ui}$ is interaction strength

Objective:

$$ \min_{X, Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \|x_u\|_2^2 + \|y_i\|_2^2 \right) $$

This is the core weighted-implicit matrix factorization approach used in large-scale recommenders.

The Google course adds an important weighted-matrix-factorization view that is especially useful in industrial retrieval systems. Let A be the feedback matrix and let obs denote observed interactions. A common weighted objective is:

$$ \min_{U, V} \sum_{(u,i) \in \text{obs}} \left( A_{ui} - \langle U_u, V_i \rangle \right)^2 + w_0 \sum_{(u,i) \notin \text{obs}} \langle U_u, V_i \rangle^2 $$

Here $w_0$ controls how strongly the model treats unobserved pairs as weak negatives. In practice, this matters a lot: too little weight on unobserved pairs can make the embedding space collapse, while too much weight can wash out true positives. Google also notes that frequent users or popular items can dominate the objective, so observed pairs are often reweighted by user or item frequency.
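
The weighted objective can be written directly. The sketch below assumes a small dense toy matrix where positive entries mark observed interactions; a real sparse corpus would need sparse bookkeeping instead:

```python
import numpy as np

def weighted_mf_loss(A, U, V, w0):
    """Observed squared error plus a w0-weighted pull toward zero
    on unobserved entries (treated as weak negatives)."""
    S = U @ V.T                  # predicted affinity scores
    obs = A > 0                  # assumption: positives mark observed pairs
    return float(((A - S)[obs] ** 2).sum() + w0 * (S[~obs] ** 2).sum())

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
U = np.eye(2)
V = np.array([[0.5, 0.2],
              [0.2, 0.5]])
loss = weighted_mf_loss(A, U, V, w0=0.1)
```

Setting `w0=0` removes the unobserved term entirely, which is the regime where the embedding space is free to collapse.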

4.4 Evaluation for rating prediction

For explicit-feedback recommendation, D2L’s matrix factorization section uses RMSE as the primary evaluation measure:

$$ \mathrm{RMSE} = \sqrt{ \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left( r_{ui} - \hat r_{ui} \right)^2 } $$

where $\mathcal{T}$ is the evaluation set of observed user-item pairs.

RMSE is appropriate for rating prediction, but it is not sufficient for top-n recommendation because it does not evaluate rank order.
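
As a one-function reference implementation (alignment of true and predicted ratings over the evaluation pairs is assumed to be handled upstream):

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean squared error over an evaluation set of (u, i) pairs."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))
```

For example, `rmse([4, 3, 5], [3.5, 3, 4])` is about 0.645.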

4.5 AutoRec for nonlinear rating prediction

AutoRec extends collaborative filtering with an autoencoder-style reconstruction objective.

  • Input is a partially observed user vector or item vector from the rating matrix
  • The network reconstructs missing entries through a hidden representation
  • Only observed ratings should contribute to the training loss

For item-based AutoRec, D2L writes the input as the $i$-th column $R_{\cdot i}$ of the rating matrix and reconstructs it with a nonlinear network:

$$ h(R_{\cdot i}) = f\left( W \cdot g\left( V R_{\cdot i} + \mu \right) + b \right) $$

The learning objective minimizes reconstruction error over observed entries only:

$$ \min_{W, V, \mu, b} \sum_{i=1}^{M} \left\| R_{\cdot i} - h(R_{\cdot i}) \right\|_{\mathcal{O}}^2 + \lambda \left( \|W\|_F^2 + \|V\|_F^2 \right) $$

Conceptually, AutoRec matters because it is one of the earliest examples in D2L of moving from linear collaborative filtering to nonlinear neural reconstruction for rating prediction.
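
A minimal numpy sketch of the item-based forward pass and the observed-entries-only loss, assuming a sigmoid $g$, an identity $f$, and a toy rating column where 0 marks missing entries:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 2                          # ratings per column, hidden units
V = rng.normal(scale=0.5, size=(k, m))
mu = rng.normal(scale=0.5, size=k)
W = rng.normal(scale=0.5, size=(m, k))
b = rng.normal(scale=0.5, size=m)

def autorec_forward(r):
    """h(r) = f(W . g(V r + mu) + b) with sigmoid g and identity f."""
    hidden = 1.0 / (1.0 + np.exp(-(V @ r + mu)))
    return W @ hidden + b

def autorec_loss(r, observed, lam=0.01):
    # Only observed entries contribute to the reconstruction term.
    rec = ((r - autorec_forward(r))[observed] ** 2).sum()
    return float(rec + lam * ((W ** 2).sum() + (V ** 2).sum()))

r = np.array([5.0, 0.0, 3.0, 0.0])   # 0 marks unobserved ratings
observed = r > 0
```

Masking matters: scoring all entries would penalize the network for reconstructing ratings that were never observed.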

4.6 Personalized ranking objectives

D2L makes an important distinction between rating prediction objectives and ranking objectives.

  • Pointwise objectives model one user-item interaction at a time
  • Pairwise objectives model relative preference between a positive and a negative item
  • Listwise objectives optimize properties of an entire ranked list

| Objective family | Training signal | Pros | Cons | Typical use |
|---|---|---|---|---|
| Pointwise | One labeled user-item example at a time | Simple to implement, works with standard regression or classification losses, easy to calibrate as a score or probability | Does not optimize ordering directly, sensitive to label noise and exposure bias, can overfocus on absolute score accuracy | CTR prediction, rating prediction, coarse ranking baselines |
| Pairwise | Positive item compared against a sampled negative item | Better aligned with top-n ranking, efficient for implicit feedback, usually easier to train than full listwise methods | Quality depends heavily on negative sampling, does not model full-list effects, can miss business constraints beyond pair comparisons | Candidate generation, implicit-feedback retrieval, pre-ranking |
| Listwise | Entire ranked list or slate | Best conceptual match to ranking metrics such as NDCG, can optimize position effects and whole-list quality | More complex objectives, heavier computation, harder data construction and serving alignment | Final-stage ranking, search ranking, slate optimization |

For top-n recommendation from implicit feedback, pairwise objectives are often a better match to the task.

The two core D2L losses are:

  1. Bayesian Personalized Ranking (BPR), which encourages the positive item to score above a sampled negative item:

$$ \sum_{(u,i,j) \in D} \ln \sigma\left( \hat y_{ui} - \hat y_{uj} \right) - \lambda_\Theta \|\Theta\|^2 $$

  2. Hinge ranking loss, which pushes the positive item above the negative item by a margin $m$:

$$ \sum_{(u,i,j) \in D} \max\left( m - \hat y_{ui} + \hat y_{uj}, 0 \right) $$

These are central for implicit-feedback recommendation because they optimize relative ordering rather than absolute score accuracy.
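
Both losses reduce to one-liners for a single (positive, negative) score pair. Here `bpr_loss` is the negative of the BPR summand (regularization omitted), so that minimizing it maximizes the objective above:

```python
import numpy as np

def bpr_loss(s_pos, s_neg):
    """Negative log-sigmoid of the score margin; small when s_pos >> s_neg."""
    return float(-np.log(1.0 / (1.0 + np.exp(-(s_pos - s_neg)))))

def hinge_loss(s_pos, s_neg, m=1.0):
    """Zero once the positive item clears the negative by margin m."""
    return float(max(m - s_pos + s_neg, 0.0))
```

Note that both depend only on the score difference, not on absolute score values, which is the defining property of pairwise ranking objectives.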

4.7 SVD++ intuition

SVD++ augments user representation with signals from interacted items, helping when explicit feedback is sparse but interaction history exists.

5. Feature-Rich Recommendation

As D2L section 21.8 emphasizes, interaction data is often sparse and noisy. In many production settings, recommendation is better framed as impression-level prediction with rich side features.

5.1 Feature-rich recommendation and CTR

Feature-rich recommendation is common in ads, feeds, and product surfaces.

  • Labels are often binary, such as click vs no click
  • Inputs include many categorical fields rather than only user and item IDs
  • The D2L advertising example uses 34 fields, with the first column as the click label and the remaining columns as categorical features

This setting is different from classic matrix factorization because the goal is often click-through rate prediction over impression-level examples rather than rating reconstruction.

CTR is defined as:

$$ \mathrm{CTR} = \frac{\#\text{clicks}}{\#\text{impressions}} \times 100 $$

5.2 Factorization machines

Factorization machines are one of the most important bridges between collaborative filtering and feature-rich prediction.

For a feature vector $x \in \mathbb{R}^d$, the two-way FM model is:

$$ \hat y(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j $$

Interpretation:

  • The first two terms are linear
  • The last term models pairwise feature interactions
  • If one feature encodes user identity and another encodes item identity, the interaction term reduces to a collaborative-filtering-style embedding interaction

D2L also highlights the computational trick that reduces FM interaction cost from $\mathcal{O}(kd^2)$ to $\mathcal{O}(kd)$, which is why FM remains practical on high-dimensional sparse data.
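
The identity behind that trick is easy to verify numerically: the pairwise sum equals $\frac{1}{2}\sum_f \left[ (\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2 \right]$. The sketch below compares the naive double loop against the fast form:

```python
import numpy as np

def fm_pairwise(x, V):
    """Naive O(k d^2) sum: sum_{i<j} <v_i, v_j> x_i x_j (V rows are v_i)."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return float(total)

def fm_pairwise_fast(x, V):
    """O(k d) reformulation via the (sum^2 - sum-of-squares) identity."""
    s1 = (V.T @ x) ** 2           # (sum_i v_if x_i)^2, per factor f
    s2 = (V.T ** 2) @ (x ** 2)    # sum_i v_if^2 x_i^2, per factor f
    return 0.5 * float((s1 - s2).sum())

rng = np.random.default_rng(1)
x = rng.normal(size=6)
V = rng.normal(size=(6, 3))
```

On sparse inputs the fast form is even cheaper, since both inner sums run only over the nonzero features.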

5.3 DeepFM

DeepFM extends FM by combining low-order feature interactions from FM with high-order nonlinear interactions from a deep network.

  • The FM branch captures low-order interactions
  • The deep branch uses shared embeddings and an MLP to learn higher-order interactions
  • Both outputs are combined into a final prediction

D2L presents the DeepFM prediction as:

$$ \hat y = \sigma\left( \hat y^{(FM)} + \hat y^{(DNN)} \right) $$

DeepFM is especially useful when simple pairwise interactions are not expressive enough, but you still want the inductive bias of factorization-based feature interaction.
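
A toy forward pass can make the shared-embedding structure concrete. This sketch assumes one categorical index per field (so the one-hot feature values are all 1) and a single hidden layer; all sizes and initializations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, vocab, k = 3, 10, 4

E = rng.normal(scale=0.1, size=(vocab, k))         # shared embedding table
w = rng.normal(scale=0.1, size=vocab)              # first-order FM weights
W1 = rng.normal(scale=0.1, size=(n_fields * k, 8)) # MLP weights (toy sizes)
W2 = rng.normal(scale=0.1, size=8)

def deepfm_forward(idx):
    """idx holds one categorical id per field."""
    emb = E[idx]                                   # (n_fields, k), shared
    # FM branch: first-order terms plus pairwise interactions via the
    # (sum^2 - sum-of-squares) identity over field embeddings.
    y_fm = w[idx].sum() + 0.5 * (emb.sum(0) ** 2 - (emb ** 2).sum(0)).sum()
    # Deep branch: MLP over the same concatenated embeddings.
    hidden = np.maximum(emb.reshape(-1) @ W1, 0.0)  # ReLU layer
    y_dnn = hidden @ W2
    return 1.0 / (1.0 + np.exp(-(y_fm + y_dnn)))    # sigma(y_FM + y_DNN)

p = deepfm_forward(np.array([1, 5, 9]))
```

The key design point is that both branches read the same embedding table, so the low-order and high-order interaction signals share representations.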

DeepFM architecture

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

5.4 Hybrid factorization with features (LightFM-style)

  • User embedding = sum of user-feature embeddings
  • Item embedding = sum of item-feature embeddings
  • Score uses dot product (+ optional biases)

Why data scientists use this:

  • Stronger cold-start behavior
  • Smooth path between collaborative and content-based modeling
  • Practical when metadata quality is reasonable
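
The feature-sum idea can be sketched in a few lines; the string-keyed feature tables and feature names below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

# One embedding per feature; the ID is just another feature, so an item
# with zero interactions still gets an embedding from its metadata.
user_feat_emb = {"id:42": rng.normal(size=k),
                 "country:DE": rng.normal(size=k)}
item_feat_emb = {"genre:jazz": rng.normal(size=k),
                 "decade:1960s": rng.normal(size=k)}

def embed(features, table):
    """User/item embedding = sum of its feature embeddings."""
    return sum(table[f] for f in features)

def score(user_features, item_features):
    return float(embed(user_features, user_feat_emb)
                 @ embed(item_features, item_feat_emb))

# Cold-start item: scored from metadata features alone.
s = score(["id:42", "country:DE"], ["genre:jazz", "decade:1960s"])
```

Because the embedding is a sum, a brand-new item inherits a sensible position in the space from its metadata features before any interactions exist.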

The Google course makes the same idea concrete from a matrix-factorization angle: you can augment the original interaction matrix with user-feature and item-feature blocks, then factorize the augmented matrix so that side features learn embeddings alongside users and items. Conceptually, this is one of the cleanest bridges between classic WALS-style recommender systems and modern hybrid feature-based models.

5.5 Industrial ads CTR architectures

If you want a more industrial view of feature-rich ranking, two recent papers are worth reading after the material above.

These are useful extensions of the chapter because they show what happens when feature-rich recommendation is pushed into industrial ads ranking. They should not be read as broad recommender-system blueprints or as the state of the art for recommender systems in general. They are narrower and more specific: both are ads CTR architecture papers shaped by impression-level prediction, extreme scale, and production serving constraints.

6. Deep Models

The NVIDIA glossary adds an important extension: deep learning recommenders build on embeddings and factorization ideas, but replace simple linear interactions with more expressive neural architectures.

Useful model families include:

  • Feedforward networks and multilayer perceptrons for flexible nonlinear scoring
  • Convolutional models when image content matters
  • Recurrent networks and transformers for sequential, session-based behavior

6.1 Two-tower retrieval models

The Google course gives the retrieval intuition, and the two blog references sharpen how that intuition gets productionized. A two-tower model is a dual-encoder architecture: one tower maps the query side into an embedding, and the other maps the item side into the same embedding space. The interaction is deliberately delayed until the very end.

If $x_q$ is the query-side input and $x_i$ is the item-side input, the core score is:

$$ s(x_q, x_i) = \langle \psi(x_q), \phi(x_i) \rangle $$

where $\psi(\cdot)$ is the query tower and $\phi(\cdot)$ is the item tower.

This late-interaction design is the key reason two-tower models dominate retrieval and pre-ranking. The towers can be trained jointly, but item embeddings can then be precomputed and indexed, which makes large-scale ANN retrieval practical.

Two-tower retrieval architecture

Training objectives and the softmax view

Instead of only factorizing a user-item matrix, you can map a query context $x$ through a neural network to a dense representation $\psi(x)$ and score the catalog with a softmax layer:

$$ z(x) = \psi(x)^\top V $$

$$ p(i \mid x) = \frac{\exp(z_i)}{\sum_{j=1}^{|I|} \exp(z_j)} $$

where $V$ contains the learned item representations.

In practice, exact softmax over a large catalog is too expensive, so industrial systems usually rely on sampled softmax, negative sampling, hard negatives, BPR-style pairwise losses, or contrastive objectives such as InfoNCE. Google’s negative-sampling subsection is worth reading because it gives a concrete explanation of folding: if you train only on positive pairs, embeddings from unrelated categories can collapse into the same region and produce spurious recommendations. The Shaped deep dive also notes that pointwise log loss is still common when the retriever is trained as a coarse candidate generator ahead of a stronger ranker.
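
One common practical variant is in-batch sampled softmax, where each query's positive item is scored against the other positives in the batch as negatives. The sketch below assumes that setup rather than full-catalog softmax:

```python
import numpy as np

def in_batch_softmax_loss(Q, X):
    """Q: (B, k) query embeddings; X: (B, k) embeddings of each query's
    positive item. Other rows in the batch act as sampled negatives."""
    logits = Q @ X.T                               # (B, B) score matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_p).mean())           # positives on the diagonal

Q = np.eye(3) * 5.0
loss_aligned = in_batch_softmax_loss(Q, Q.copy())           # positives match
loss_shuffled = in_batch_softmax_loss(Q, np.roll(Q, 1, 0))  # positives misaligned
```

In-batch negatives are popularity-biased (popular items appear in more batches), which is one reason production systems often add corrections or mix in hard negatives.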

Training a softmax recommendation model

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

| Aspect | Matrix factorization | Softmax DNN / dual-encoder training |
|---|---|---|
| Query and side features | Not easy to include directly | Can incorporate richer query, context, and side features |
| Cold start | Weak by default, though heuristics and projection tricks can help | Handles new queries more naturally when query features are available |
| Folding risk | Less prone to folding; WALS-style weighting can help control it | More prone to folding and usually needs negative sampling or related regularization |
| Training scalability | Easier to scale to very large sparse corpora | Harder to scale; often needs sampling, hashing, or other approximations |
| Serving cost | Very cheap when user and item embeddings are static or cheaply updated | Item embeddings can be cached, but query embeddings often need to be computed online |

Google’s summary judgment is useful: matrix factorization is usually the better retrieval choice for very large corpora, while DNN-based retrieval becomes attractive when you need richer query features and more personalized relevance modeling.

Training versus serving

This is where the architecture becomes operationally attractive:

  • During training, the two towers are optimized jointly so that relevant query-item pairs are close in the embedding space and irrelevant pairs are pushed apart.
  • During serving, the item tower is run offline over the full catalog and its embeddings are stored in an ANN index.
  • At request time, the system computes only the query embedding online, queries the ANN index, and returns a top-K candidate set for downstream ranking.

This decoupling is why two-tower models are common in candidate generation, related-item retrieval, and pre-ranking stages with strict latency budgets.
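
The offline/online split can be sketched as follows, with brute-force dot-product scoring standing in for a real ANN index such as ScaNN or FAISS:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_items = 8, 1000

# Offline: run the item tower over the full catalog and store embeddings.
# (A real system would load these into an ANN index; brute force stands
# in for the index here.)
item_emb = rng.normal(size=(n_items, k))

def retrieve_top_k(query_emb, item_emb, top_k=10):
    """Online: compute only the query embedding, then fetch candidates."""
    scores = item_emb @ query_emb                 # dot-product scores
    idx = np.argpartition(-scores, top_k)[:top_k] # unordered top-k
    return idx[np.argsort(-scores[idx])]          # candidates, best first

q = rng.normal(size=k)           # stand-in for the online query tower output
candidates = retrieve_top_k(q, item_emb)
```

`argpartition` keeps the retrieval cost dominated by the single matrix-vector product rather than a full sort of the catalog.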

Two-tower training and serving workflow

Tower design choices

The towers do not have to be simple MLPs. As the Shaped article emphasizes, the query tower may consume user IDs, demographics, session state, search context, or sequential behavior, while the item tower may consume item IDs, metadata, text, image embeddings, or other modality-specific features.

Common choices include:

  • MLPs over concatenated embeddings and dense features
  • Sequence models or transformers on the query side for recent behavior
  • Text or multimodal encoders on the item side for semantic retrieval
  • Symmetric dual encoders when both sides have similar modalities
  • Asymmetric dual encoders when the query and item spaces are very different

For smaller catalogs, the raw two-tower score may be enough to rank directly. For very large catalogs, it is almost always used as a retrieval or pre-ranking model ahead of a richer scorer.

Limitations and promising extensions

The main weakness is also the reason the model is fast: user-item interaction is restricted to the final dot product. This creates an information bottleneck.

In practice, that means:

  • Fine-grained cross-feature interactions are not modeled explicitly
  • Subtle conditional preferences can be missed
  • The retriever usually needs a downstream ranker to recover accuracy

The Reach Sumit survey is useful here because it covers several extensions aimed at reducing this bottleneck while keeping most of the serving efficiency:

  • DAT (Dual Augmented Two-Tower): augments each tower with cross-side historical interaction signals
  • IntTower: adds feature-importance calibration, fine-grained early interaction, and contrastive interaction regularization
  • ColBERT-style late interaction: preserves query-item decoupling better than a full cross-encoder while keeping richer token-level matching than a pure dot product

These models live in the space between pure representation-based retrieval and full interaction-heavy ranking models.

Interaction-enhanced two-tower variants

6.2 Neural collaborative filtering

Neural collaborative filtering keeps the collaborative setup of user-item interactions, but learns the interaction function with a neural network instead of relying only on a dot product.

  • In NeuMF from D2L, a generalized matrix factorization (GMF) path is combined with an MLP path
  • This can capture more complex nonlinear relationships than matrix factorization alone
  • It is most useful when interaction volume is high enough to support a richer model

NeuMF also fits naturally with pairwise ranking and negative sampling, rather than only explicit rating prediction.

NeuMF architecture

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

6.3 Variational autoencoders for collaborative filtering

Variational autoencoder approaches learn a compressed latent representation of a user’s interaction history and then reconstruct likely missing interactions.

  • Useful for implicit-feedback recommendation
  • Helps capture nonlinear structure in sparse user-item behavior
  • Often treated as a reconstruction problem over interaction vectors

VAE-style collaborative filtering architecture

6.4 Contextual sequence learning

Session-based recommenders often care less about static preference and more about what the user is likely to do next.

  • In D2L’s sequence-aware recommendation section, the featured model is Caser, which uses horizontal and vertical convolutions over the recent interaction matrix
  • Horizontal filters capture union-level patterns across multiple recent actions
  • Vertical filters capture point-level effects of individual recent actions
  • RNN, LSTM, GRU, and transformer models are also widely used for this setting
  • Inputs can include both ordered actions and contextual features such as time, device, or location
  • This is especially relevant in streaming, shopping, and short-session products

Caser architecture

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

D2L also provides a useful view of how sequence-aware samples are constructed from chronological user histories, including the held-out next item and sampled negatives:

Sequence-aware data generation

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

6.5 Wide-and-deep style models

Wide-and-deep architectures combine memorization and generalization.

  • The wide component captures simpler feature interactions that may occur rarely
  • The deep component learns richer nonlinear structure through embeddings and dense layers
  • This pattern is effective when recommendation quality depends on both handcrafted cross-features and learned representations

Wide-and-deep recommendation architecture

6.6 DLRM-style models

DLRM-style models are designed for recommendation data with many categorical features and some numerical features.

  • Embeddings handle sparse categorical inputs
  • MLP layers process dense features
  • Explicit pairwise feature interactions are then modeled before final prediction

DLRM-style recommendation architecture

These models are widely used in large-scale ranking and click-through prediction systems.

7. Production Concerns

The model taxonomy is excellent, but real systems also require these decisions.

7.1 Retrieval + ranking architecture

Most large systems are two-stage:

  1. Candidate generation (fast, high recall)
  2. Ranking (slower, richer features/objective)

Without this separation, serving cost or latency becomes prohibitive.

Google’s course extends this into a practical three-stage view:

  1. Candidate generation
  2. Scoring or ranking
  3. Re-ranking

The extra re-ranking stage matters because the best ranked list for raw engagement is often not the best final surface once you account for freshness, diversity, fairness, or business constraints.

Recommendation process architecture

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

In production, candidate generation is usually itself a mixture of sources:

  • Embedding nearest neighbors from a two-tower or matrix-factorization model
  • Co-visitation or graph-based retrieval
  • Popularity or trending backfills
  • Rule-based inventory or policy constraints

One key Google point is that scores from different candidate generators are usually not comparable. That is why a separate scorer or ranker is often necessary after retrieval.

For neural retrieval, Google also stresses approximate nearest-neighbor search rather than exact brute-force scoring over the full catalog. Libraries such as ScaNN are used to make this practical at large scale.

Approximate nearest-neighbor retrieval in embedding space

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

7.2 Label design and negatives

For implicit data, non-click is not always negative. You need:

  • Exposure-aware negatives
  • Position-bias-aware training
  • Time-windowed labels matching product goals

Google’s scoring module makes a related point: you need to be explicit about what you are optimizing. A model trained for click probability can converge to clickbait. A model trained for watch time may overserve long items. A model trained for immediate conversion can hurt long-term trust or retention.

In other words, score definition is part of the product design, not just a modeling choice.

For feed-style or slate recommendation, Google also recommends distinguishing between:

  • Position-dependent models, which estimate utility at a fixed slot
  • Position-independent models, which try to estimate intrinsic relevance before layout effects

That distinction matters because position bias can make top slots look artificially better even when the item itself is not more relevant.

7.3 Evaluating recommender and ranking systems

D2L’s NeuMF evaluator is a good starting point for implicit-feedback ranking evaluation. The protocol uses a chronological split, holds out a future ground-truth item $g_u$ for each user $u$, and ranks that item against items the user has not interacted with.

Two core metrics in that setup are:

$$ \text{Hit@}K = \frac{1}{|U|} \sum_{u \in U} \mathbf{1}\left( \text{rank}_{u, g_u} \le K \right) $$

and

$$ \mathrm{AUC} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|I \setminus S_u|} \sum_{j \in I \setminus S_u} \mathbf{1}\left( \text{rank}_{u, g_u} < \text{rank}_{u, j} \right) $$

where $U$ is the user set, $I$ is the item set, and $S_u$ is the set of items already associated with user $u$.

This evaluator is useful because it respects time order and measures whether the held-out future item is surfaced near the top. But for production recommendation systems, you usually need a wider evaluation stack than Hit@K and AUC alone.

Stage-specific offline metrics

  • Retrieval: use Recall@M or candidate hit rate to verify that the candidate generator is not dropping relevant items before the ranker sees them.
  • Ranking: use NDCG@K, Recall@K, and MRR for top-of-list quality. If you have multiple relevant held-out items per user, MAP is also useful.
  • Rating prediction: use RMSE or MAE only when explicit rating prediction is the real product task. These metrics are much less informative for feed ranking or item recommendation.

Let $G_u$ denote the relevant items for user $u$, let $C_u(M)$ be the top-$M$ retrieval candidate set, and let $L_u(K)$ be the top-$K$ ranked list.

For retrieval, a standard candidate-stage metric is:

$$ \text{Recall@}M = \frac{1}{|U|} \sum_{u \in U} \frac{|G_u \cap C_u(M)|}{|G_u|} $$

For the final ranked list, the analogous metric is:

$$ \text{Recall@}K = \frac{1}{|U|} \sum_{u \in U} \frac{|G_u \cap L_u(K)|}{|G_u|} $$

To reward correct ordering near the top, define

$$ \text{DCG@}K(u) = \sum_{j=1}^{K} \frac{2^{\text{rel}_{u,j}} - 1}{\log_2(j + 1)} $$

and then normalize by the ideal ordering:

$$ \text{NDCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{\text{DCG@}K(u)}{\text{IDCG@}K(u)} $$

where $\text{rel}_{u,j}$ is the relevance label of the item at position $j$ for user $u$.

If you care about the position of the first relevant result, use mean reciprocal rank:

MRR=1|U|uU1ru

where ru is the rank position of the first relevant item for user u, with reciprocal rank taken as 0 if no relevant item appears.

If multiple relevant items can appear in the list, average precision is also useful:

$$\text{AP@}K(u) = \frac{1}{\min(|G_u|, K)} \sum_{j=1}^{K} \text{Precision@}j(u) \cdot \mathbf{1}\left(i_{u,j} \in G_u\right)$$

$$\text{MAP@}K = \frac{1}{|U|} \sum_{u \in U} \text{AP@}K(u)$$

where $i_{u,j}$ is the item shown at rank $j$ to user $u$.
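MRR and AP@K follow the same pattern as the sketches above. A minimal per-user version, assuming `relevant` is a set of relevant item ids (names are illustrative):

```python
def reciprocal_rank(relevant, ranked_list):
    """1 / r_u for the first relevant item; 0 if none appears."""
    for j, item in enumerate(ranked_list, start=1):
        if item in relevant:
            return 1.0 / j
    return 0.0

def ap_at_k(relevant, ranked_list, k):
    """AP@K: precision accumulated only at the positions of relevant items,
    normalized by min(|G_u|, K)."""
    hits, precision_sum = 0, 0.0
    for j, item in enumerate(ranked_list[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / j
    return precision_sum / min(len(relevant), k)
```

Averaging these per-user values over $U$ gives MRR and MAP@K respectively.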

Among these, NDCG@K is often the strongest single ranking metric because it rewards putting the most relevant items near the top rather than merely somewhere in the top K.

Protocol choices matter as much as the metric

  • Use chronological splits for implicit and sequence-aware tasks. Random splits can leak future information.
  • State clearly whether evaluation is full-catalog, sampled-negative, or candidate-set based. Numbers are not comparable across these protocols.
  • Evaluate on exposed or eligible items when possible. Treating every unclicked item in the full catalog as a negative can distort results.
  • Report by segment: new users, power users, new items, head items, and long-tail items often behave very differently.
  • When the system has multiple stages, evaluate each stage separately and end-to-end.
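The first bullet, a chronological split, is often implemented as leave-one-out: each user's most recent interaction becomes the held-out future ground-truth item. A minimal sketch, assuming interactions arrive as `(user, item, timestamp)` tuples (an illustrative schema, not a fixed API):

```python
from collections import defaultdict

def chronological_split(interactions):
    """Leave-one-out chronological split.

    interactions: iterable of (user, item, timestamp) tuples.
    Returns (train_interactions, test_item_per_user).
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, test = [], {}
    for user, events in by_user.items():
        events.sort()  # chronological order per user
        for ts, item in events[:-1]:
            train.append((user, item, ts))
        test[user] = events[-1][1]  # the held-out future ground-truth item
    return train, test
```

Because the split point is per-user and time-ordered, the model never trains on interactions that occur after the item it is asked to predict, which is exactly the leakage a random split would introduce.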

Beyond ranking accuracy

Accuracy metrics alone can produce a recommender that is brittle or bad for the product.

  • Coverage: how much of the catalog is ever recommended
  • Diversity: how different the recommended items are from one another
  • Novelty and serendipity: whether the system only repeats obvious items
  • Calibration: whether recommendations match the user’s current intent, not just their historical average
  • Fairness and marketplace health: whether some suppliers, creators, or item groups are systematically suppressed

These matter because a system with slightly lower NDCG@K can still be better for long-term engagement if it improves diversity, catalog health, or repeat-user satisfaction.

Online evaluation

Offline metrics are necessary but insufficient. You still need A/B tests with:

  • Primary metrics: CTR, conversion, watch time, retention, revenue, or long-term value depending on the product
  • Guardrails: latency, page-load impact, complaint rate, hide/block rate, unsafe-content rate
  • Diagnostic cuts: new vs returning users, cold-start items, geography, device, heavy-user cohorts

For ranking changes, it is also useful to monitor the full funnel:

  • Candidate recall
  • Ranker win rate on exposed impressions
  • Final-surface engagement
  • Downstream business outcomes

Offline-to-online recommender evaluation flow

7.4 Re-ranking, freshness, diversity, and exploration

Pure exploitation can collapse catalog diversity. You need controlled exploration:

  • Epsilon-greedy or Thompson-style policies
  • Re-ranking for diversity/novelty
  • Periodic calibration checks

Google’s reranking material is especially useful here. In practice, re-ranking is where you inject constraints that the base ranker usually misses:

  • Freshness so the feed does not go stale
  • Diversity so near-duplicate items do not dominate
  • Fairness or marketplace balance so one creator, seller, or provider is not systematically overexposed
  • Local policy constraints such as demotions, blocks, maturity filters, or legal limits

This stage is often simpler than the main ranker, but it has outsized product impact because it controls the final list actually seen by the user.
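One common way to implement the diversity part of this stage is a greedy maximal-marginal-relevance (MMR) style re-ranker, which trades the base ranker's score against redundancy with items already selected. A minimal sketch, with illustrative score and similarity interfaces (the source does not prescribe this particular method):

```python
def mmr_rerank(candidates, scores, similarity, k, lam=0.7):
    """Greedy MMR-style re-ranking.

    candidates: item ids; scores: item -> base ranker score;
    similarity: (item, item) -> similarity in [0, 1].
    lam near 1 favors relevance, near 0 favors diversity.
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(item):
            # penalize items too similar to anything already selected
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * scores[item] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

The same greedy loop is a natural place to bolt on the other constraints listed above, such as freshness boosts or per-provider exposure caps, since it touches every position of the final list.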

7.5 Reliability and monitoring

Data scientists should treat recommenders as continuously monitored systems:

  • Feature drift and embedding drift
  • Candidate recall degradation
  • Online metric drift and alerting
  • Safe fallback policies

If you want a broader production-systems reference beyond recommender-specific papers, Chip Huyen’s Designing Machine Learning Systems is a strong complement to this chapter.

It is not a recommender-systems book specifically, but it is useful for recommender practitioners because it covers the operational side of production ML: data pipelines, iterative development, deployment tradeoffs, monitoring, feedback loops, and the gap between offline model quality and deployed system behavior. That makes it a good companion once you move from method selection into long-term system ownership.

8. Practical Build Sequence for Data Scientists

Use this as a practical order of operations rather than a rigid recipe. The point is to add complexity only when the simpler stage has already been validated.

8.1 Define the objective hierarchy

Start by making the optimization target explicit.

  • Clarify whether the system is optimizing short-term CTR, conversion, watch time, retention, revenue, or long-term value
  • Decide which goals are primary and which are guardrails
  • Make sure the metric definition matches the actual user and business objective

This step matters because the wrong target can make even a technically strong ranker harmful in production.

8.2 Build strong non-ML baselines

Before training complex models, establish hard-to-beat baselines:

  • popularity
  • recency
  • co-visitation
  • simple item-to-item similarity

These baselines are useful for debugging, launch safety, and calibration. If a more complex model cannot beat them offline and online, the model is probably not production-ready.
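To make the co-visitation baseline concrete, here is a minimal sketch over session logs (the schema and function names are illustrative):

```python
from collections import Counter, defaultdict

def covisitation_counts(sessions):
    """Count how often each item pair co-occurs within the same session.

    sessions: iterable of lists of item ids.
    """
    co = defaultdict(Counter)
    for session in sessions:
        unique = set(session)
        for a in unique:
            for b in unique:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(co, item, n=5):
    """Recommend the items most often co-visited with `item`."""
    return [b for b, _ in co[item].most_common(n)]
```

Despite its simplicity, this kind of counting baseline is often surprisingly hard to beat on related-item surfaces, which is exactly why it belongs in the launch-safety toolkit.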

8.3 Add collaborative filtering

Once you have meaningful interaction data, collaborative filtering is usually the first serious model family to try.

  • start with matrix factorization or neighborhood approaches
  • use this stage to learn whether interaction data alone is already enough to support useful personalization
  • evaluate retrieval quality separately from final ranking quality

This is often the point where recommendation becomes genuinely personalized rather than mostly heuristic.

8.4 Add metadata for hybrid robustness

After collaborative filtering is working, bring in user, item, and context features.

  • item metadata for cold-start items
  • user/context features for sparse users or context-sensitive surfaces
  • hybrid factorization or feature-rich ranking models

This stage improves robustness, especially when the system has to handle new items, changing inventory, or sparse users.

8.5 Introduce two-stage retrieval and ranking

As the catalog grows, a single heavy model over the full candidate space becomes impractical.

  • add a fast, high-recall candidate generator
  • follow with a richer ranker using better features and objectives
  • add re-ranking if you need freshness, diversity, fairness, or policy constraints

This is usually the architectural step that turns a workable recommender into a scalable one.

8.6 Establish experiment and monitoring standards

Once the system is live, treat it as a continuously evaluated product system.

  • define offline and online success criteria
  • use A/B testing with guardrails
  • monitor candidate recall, latency, drift, and business impact
  • keep safe fallback policies available

At this point, the core challenge is no longer just model training. It is maintaining quality under changing data, product goals, and operational constraints.

9. Summary

The current handbook path is meant to give data scientists a compact but practical map of the field:

  • Explicit vs implicit feedback
  • Recommendation tasks such as rating prediction, top-N ranking, sequence-aware recommendation, CTR prediction, and cold-start
  • Benchmark data practices such as MovieLens sparsity analysis and chronological evaluation splits
  • Content-based, collaborative, contextual, and hybrid filtering
  • Embedding-space candidate generation, similarity design, and matrix factorization variants
  • AutoRec and ranking objectives such as BPR and hinge loss
  • Feature-rich recommendation with factorization machines and DeepFM
  • Deep recommenders such as two-tower retrieval, interaction-enhanced dual encoders, NCF, VAE-style models, wide-and-deep models, and DLRM-style architectures
  • Three-stage production design with retrieval, scoring, and re-ranking for freshness, diversity, and fairness

For practicing data scientists, the differentiator is not only model choice. It is operational quality: robust labeling, unbiased evaluation, scalable serving, and disciplined online experimentation.

The next two chapters extend the handbook in two directions:

  • References groups the main materials behind the handbook.
  • Survey Papers and Further Reading points to broader literature when you want to go beyond a chapter-level guide.

10. References

Use this chapter as the source map for the handbook. The list is grouped by how the material is most useful in practice.

10.1 Guides and courses

10.2 Core modeling papers

10.3 Tools and implementations

10.4 Industrial ads CTR papers

11. Survey Papers and Further Reading

This chapter is not meant to be exhaustive. The goal is to help you widen your map of the field without losing the thread of the handbook.

Before jumping into the reading list, it helps to be explicit about how practitioners can use survey papers efficiently.

11.1 How to use survey papers efficiently

For practitioners, the best way to use a survey paper is usually not to read every cited method in order.

  1. Read the taxonomy first and decide which branch actually matches your product surface.
  2. Focus on the evaluation section to see which metrics and data assumptions are standard in that subfield.
  3. Pull out the open-problems section to understand what still breaks in real systems.
  4. Use the references to identify a small number of landmark papers rather than trying to read the full citation graph.

If you are building production systems rather than writing a paper, the main value of surveys is not completeness. It is faster problem framing and better judgment about which methods are mature enough to operationalize.

With that reading strategy in mind, here is a compact starting set.

11.2 Start here

If you only want a small reading set after finishing this handbook, start with these:

  1. A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation (2021)
  2. A Survey on Session-based Recommender Systems (2022)
  3. Bias and Debias in Recommender System: A Survey and Future Directions (2020)
  4. Multimodal Recommender Systems: A Survey (2023; updated 2024)
  5. How Can Recommender Systems Benefit from Large Language Models: A Survey (2025)
  6. A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives (2025)

This subset gives you coverage of classical-to-neural evolution, sequence recommendation, bias, multimodality, LLM-era recommender work, and industrial deployment constraints.

11.3 General and industry-facing surveys

A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives (2025). This survey is unusually valuable for practitioners because it centers deployment constraints rather than only benchmark performance. It discusses the industrial tradeoffs around latency, retraining cadence, marketplace constraints, product surfaces, organizational limitations, and evaluation gaps between offline results and business outcomes. It is relevant to this handbook because it is the closest survey match to the practical orientation of Production Concerns and Build Sequence.

11.4 Neural and deep-learning recommender surveys

A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation (2021). This survey is one of the most useful bridges between classic collaborative filtering and the modern neural recommendation literature. It does not just list architectures; it organizes the field by how recommendation moves from pure user-item interaction modeling toward richer inputs such as context, content, knowledge, and multi-behavior signals. It is relevant to this handbook because it helps place the Model Families, Feature-Rich Recommendation, and Deep Models chapters into one coherent research map instead of treating them as separate toolkits.

11.5 Sequential, session-based, and decision-oriented surveys

A Survey on Session-based Recommender Systems (2022). This survey focuses on settings where long-run user histories are weak, absent, or less useful than short-horizon intent. It lays out the data structure, problem formulation, and method families for session-based recommendation, including the shift from simple Markov-style methods to GRU-, attention-, and transformer-based models. It is relevant to this handbook because the Explicit vs. Implicit Feedback, Model Families, and Deep Models chapters only introduce sequence-aware recommendation at a high level; this paper is the right follow-on when your product is driven by recency, intent shifts, and within-session behavior.

11.6 Responsible and robust recommendation

Bias and Debias in Recommender System: A Survey and Future Directions (2020). This survey is useful because it reframes recommendation quality as partly a data-generation problem rather than only a model-design problem. It catalogs major bias sources such as exposure bias, selection bias, position bias, and popularity bias, then reviews algorithmic and evaluation-side responses. It is relevant to this handbook because many of the production issues discussed in Production Concerns and the evaluation caveats in the ranking chapters become much easier to reason about once you view recommendation pipelines through a bias-and-debias lens.

11.7 Emerging directions

Multimodal Recommender Systems: A Survey (2023; updated 2024). This survey covers recommendation systems that use more than IDs and tabular metadata, such as text, images, audio, and video. It is especially useful for understanding how representation learning changes when item understanding itself becomes a multimodal problem instead of a pure interaction problem. It is relevant to this handbook because it extends the Feature-Rich Recommendation and Deep Models chapters into the part of the field where content understanding and recommendation become tightly coupled.

How Can Recommender Systems Benefit from Large Language Models: A Survey (2025). This survey is a strong next read if you want to understand where LLMs fit into recommendation without collapsing everything into hype. It organizes the space around representation, reasoning, generation, user understanding, and agentic or interactive recommendation settings, while also discussing limits such as latency, hallucination, and evaluation mismatch. It is relevant to this handbook because it helps extend the Deep Models and Production Concerns chapters into the current wave of LLM-assisted retrieval, ranking, explanation, and recommendation workflows.

11.8 Foundational pre-2020 surveys worth keeping

These are older than the 2020+ focus of this section, but they are still worth keeping because they remain useful orientation documents.