Understanding Recommender Systems

Single-page version for printing or saving as PDF.

Use the handbook version if you want chapter navigation and the left sidebar.

Chapter Guide

  1. Why Recommender Systems Matter
  2. Explicit vs. Implicit Feedback
  3. Model Families
  4. Matrix Factorization
  5. Feature-Rich Recommendation
  6. Deep Models
  7. Production Concerns
  8. Build Sequence
  9. Summary
  10. References
  11. Survey Papers and Further Reading

1. Why Recommender Systems Matter

Recommender systems help users navigate very large item catalogs (videos, products, courses, jobs, music) by ranking items likely to be relevant to each user. See the background overview in Wikipedia: Recommender system.

Recommendation process illustration

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

For data scientists, this is usually not a pure prediction task. It is a ranking and decision problem with constraints:

  • Relevance and personalization
  • Diversity and novelty
  • Latency and serving cost
  • Business goals (retention, conversion, long-term value)

Two-stage recommender architecture

1.1 Common applications

  • E-commerce and retail: cross-sell, upsell, “complete the look”, and basket expansion
  • Media and entertainment: personalized ranking of video, music, articles, and ads
  • Banking and financial services: product recommendations, offers, and next-best action

1.2 Business value

  • Helps users discover items they would not have found through search alone
  • Increases engagement, session depth, and content consumption
  • Improves conversion, basket size, and retention when recommendations are well-targeted

Google’s recommender course also makes a useful product distinction between two common surfaces:

  • Homepage recommendations, where the query is the user or session context
  • Related-item recommendations, where the query is the current item being viewed

That distinction matters because homepage recommendation usually starts from a user or context embedding, while related-item recommendation often starts from the item embedding itself and retrieves nearby items in embedding space.

2. Explicit vs. Implicit Feedback

As in the reference article, the first key split is the type of supervision.

2.1 Explicit feedback

Examples:

  • Star ratings
  • Like/dislike labels
  • Written reviews with sentiment scores

Pros:

  • Direct preference signal
  • Easier to define regression-style losses

Cons:

  • Sparse in most real products
  • Selection bias (only some users rate)

2.2 Implicit feedback

Examples:

  • Clicks
  • Watch time
  • Purchases
  • Add-to-cart, save, dwell

Pros:

  • High volume
  • Better behavioral coverage

Cons:

  • No explicit negative signal: a non-interaction may simply mean the item was never shown
  • Noisy labels (accidental clicks, autoplay, very short dwell)
  • Exposure and position bias inherited from the current ranking policy

In both cases, interactions define a sparse user-item matrix with entries indexed by user-item pairs $(u, i)$.

Explicit versus implicit feedback comparison

User-item matrix examples for explicit and implicit data

2.3 Recommendation tasks

Following D2L Chapter 21, it helps to separate recommendation work by task:

  • Rating prediction: estimate a user’s explicit rating for an item
  • Top-n recommendation: rank candidate items and return a personalized list
  • Sequence-aware recommendation: use ordered behavior and timestamps
  • Click-through rate prediction: predict whether a shown item or ad will be clicked
  • Cold-start recommendation: serve new users or new items when history is limited

These tasks overlap, but they drive different labels, evaluation protocols, and model choices.

2.4 Benchmark datasets and split strategy

The MovieLens 100K dataset remains the standard conceptual benchmark for explicit-feedback recommendation.

  • 100,000 ratings
  • 943 users
  • 1,682 movies
  • Ratings from 1 to 5
  • Approximate matrix sparsity of 93.7%

Two split strategies from D2L are especially useful in practice:

  1. Random split for rating prediction and general offline evaluation
  2. Sequence-aware split, where the most recent interaction is held out per user

This distinction matters because sequence-aware recommendation should be evaluated with a chronological split, not a random one.
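
The two strategies can be sketched with pandas. The toy log and its column names (`user_id`, `item_id`, `rating`, `ts`) are illustrative assumptions, not part of any particular dataset loader:

```python
import pandas as pd

# Toy interaction log; columns are illustrative.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "item_id": [10, 11, 12, 10, 13, 11, 12, 14],
    "rating":  [4, 5, 3, 2, 5, 4, 4, 1],
    "ts":      [1, 2, 3, 1, 2, 1, 2, 3],
})

# 1) Random split: acceptable for rating prediction and offline evaluation.
test = df.sample(frac=0.25, random_state=0)
train = df.drop(test.index)

# 2) Sequence-aware split: hold out each user's most recent interaction.
df = df.sort_values("ts")
last = df.groupby("user_id").tail(1)   # one held-out row per user
hist = df.drop(last.index)             # everything earlier is training data
```

With this toy log, `last` contains exactly one row per user, and it is always that user's latest timestamp.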

3. Model Families

Model choice depends heavily on what data you have. If you only observe interactions, collaborative filtering is usually the first serious approach. If you also have user and item attributes, content-based or hybrid models become more useful. If the current situation matters, such as device, country, time, or within-session behavior, then contextual models become important.

3.1 Content-based filtering

Use user/item attributes and metadata.

  • Item vectors from text/category/tags/embeddings
  • User representation from demographics and/or consumed item profiles
  • Similarity models (cosine, k-NN) or supervised models over features

Strength:

  • Better cold-start for new items (and sometimes new users)

Limitation:

  • Limited collaborative signal; can over-specialize
  • Can become overly narrow if the feature space does not capture richer or emerging interests

Google’s course also emphasizes that content-based systems are often easier to explain and easier to cold-start for new items, but they tend to be weaker at serendipity than collaborative models.

| Aspect | Advantages | Disadvantages |
|---|---|---|
| User specificity | Does not require interaction data from other users, so it can scale cleanly across many users and preserve privacy better | Quality depends heavily on having good item features and user profiles |
| Discovery pattern | Can serve niche items that match a user’s known interests very well | Usually expands less well beyond existing interests, so serendipity is weaker |
| Modeling burden | Easier to explain because recommendations can be tied back to item attributes | Requires substantial domain knowledge and hand-engineered or high-quality learned features |

Content-based filtering feature matrix illustration

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

3.2 Collaborative filtering

Use interaction patterns across all users/items.

  • Neighborhood methods (user-user, item-item)
  • Latent-factor methods (matrix factorization)

Strength:

  • Often strong personalization with enough interaction history

Limitation:

  • Cold-start if no history exists

| Aspect | Advantages | Disadvantages |
|---|---|---|
| Feature engineering | Embeddings are learned automatically, so little prior domain knowledge is required | Harder to incorporate side features such as demographics, metadata, or context without model extensions |
| Discovery pattern | Can introduce serendipity because similar users can pull an item into a user’s candidate set | Suffers from cold-start for fresh users and fresh items without interactions |
| Practical role | Strong starting point because a feedback matrix alone can power a usable candidate generator | Production systems usually need extra machinery such as WALS projection heuristics, side-feature augmentation, or hybrid models to fill the gaps |

3.3 Contextual filtering

Contextual filtering incorporates information about the current situation into the recommendation process.

  • Examples of context: device, country, date, time, session state, or recent action sequence
  • Useful when the same user may want different items under different circumstances
  • Often framed as next-action or next-item prediction rather than only long-run preference estimation

Contextual recommendation diagram

3.4 Hybrid models

Combine metadata with interaction learning.

  • Best default choice in many production systems
  • Handles cold-start better than pure collaborative filtering
  • Usually outperforms pure content-based methods once enough interactions accumulate

Content-based, collaborative filtering, and hybrid model comparison

3.5 Embedding spaces and similarity measures for candidate generation

The Google Developers course sharpens an important operational point: candidate generation is usually a nearest-neighbor search problem in an embedding space. Given a query embedding q and item embedding x, the retrieval stage depends heavily on the similarity measure you choose.

Common choices are:

$$ s_{\cos}(q, x) = \frac{\langle q, x \rangle}{\|q\|_2 \, \|x\|_2} $$

$$ s_{\text{dot}}(q, x) = \langle q, x \rangle $$

$$ d_{L_2}(q, x) = \|q - x\|_2 $$

If the embeddings are normalized, cosine, dot product, and Euclidean distance all induce the same ranking, since $\|q - x\|_2^2 = 2 - 2\langle q, x \rangle$ for unit vectors. Without normalization, however, they behave differently:

  • Dot product favors larger embedding norms, which often correlates with popular or frequent items
  • Cosine focuses more on angular alignment, which can be better for semantic similarity
  • Euclidean distance emphasizes physical closeness in the embedding space

Google also suggests a useful interpolation between pure cosine and pure dot product:

$$ s_\alpha(q, x) = \|q\|_2^{\alpha} \, \|x\|_2^{\alpha} \, \cos(q, x), \quad \alpha \in (0, 1) $$

This lets you keep some popularity signal without letting large-norm items dominate retrieval.
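
These differences are easy to see numerically. The sketch below uses hand-picked vectors so that dot product and cosine disagree about which item to retrieve, and implements the interpolated similarity; `s_alpha` is an illustrative name, not a library function:

```python
import numpy as np

q = np.array([1.0, 0.0])
x_small = np.array([0.9, 0.1])   # well aligned with q, small norm
x_big   = np.array([2.0, 2.0])   # less aligned, large norm

def s_dot(q, x):
    return float(q @ x)

def s_cos(q, x):
    return float(q @ x / (np.linalg.norm(q) * np.linalg.norm(x)))

def d_l2(q, x):
    return float(np.linalg.norm(q - x))

# The measures disagree: dot product favors the large-norm item,
# cosine favors the better-aligned one.
assert s_dot(q, x_big) > s_dot(q, x_small)
assert s_cos(q, x_small) > s_cos(q, x_big)

def s_alpha(q, x, alpha=0.5):
    """Interpolation from the text: ||q||^a * ||x||^a * cos(q, x)."""
    return (np.linalg.norm(q) * np.linalg.norm(x)) ** alpha * s_cos(q, x)
```

At the limits, `alpha=1` recovers the dot product and `alpha=0` recovers cosine similarity.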

Candidate retrieval in embedding space

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

Different similarity choices induce different rankings

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

4. Matrix Factorization

The reference article emphasizes matrix factorization variants. This remains foundational for data scientists.

4.1 PMF / latent factors (explicit feedback)

Model:

$$ \hat r_{ui} = p_u^\top q_i $$

where user and item embeddings $p_u, q_i \in \mathbb{R}^f$.

Regularized loss over observed pairs $\Omega$:

$$ \min_{P, Q} \sum_{(u,i) \in \Omega} \left( r_{ui} - p_u^\top q_i \right)^2 + \lambda \left( \|p_u\|_2^2 + \|q_i\|_2^2 \right) $$

Optimization:

  • SGD (simple, flexible)
  • ALS (efficient for large sparse systems)
  • Practical implementations are available in the Surprise library and its documentation

With ALS, you alternate between solving for user factors while holding item factors fixed and solving for item factors while holding user factors fixed. That makes large sparse factorization problems easier to optimize in practice.
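
A minimal numpy sketch of that alternation, assuming a small explicit-ratings matrix where 0 marks unobserved entries. Each alternating step reduces to a per-row ridge regression against the fixed side:

```python
import numpy as np

# Toy explicit-ratings matrix; 0 marks unobserved entries.
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 0],
    [0, 1, 5, 4, 0],
], dtype=float)
mask = R > 0
f, lam = 2, 0.1                       # latent dimension, L2 regularization

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(R.shape[0], f))   # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], f))   # item factors

def solve_side(R, mask, fixed, lam, f):
    """Closed-form ridge solve per row, using only observed entries."""
    out = np.zeros((R.shape[0], f))
    for u in range(R.shape[0]):
        idx = mask[u]
        A = fixed[idx].T @ fixed[idx] + lam * np.eye(f)
        out[u] = np.linalg.solve(A, fixed[idx].T @ R[u, idx])
    return out

err0 = ((R - P @ Q.T)[mask] ** 2).mean()
for _ in range(20):
    P = solve_side(R, mask, Q, lam, f)        # users, item factors fixed
    Q = solve_side(R.T, mask.T, P, lam, f)    # items, user factors fixed
err = ((R - P @ Q.T)[mask] ** 2).mean()
```

Each half-step has a closed-form solution, which is why ALS parallelizes well across rows in large sparse problems.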

Illustration of matrix factorization model

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

Alternating least squares optimization cycle

4.2 SVD-style bias terms

A common extension adds global/user/item bias terms:

$$ \hat r_{ui} = \mu + b_u + b_i + p_u^\top q_i $$

Biases capture broad effects (strict users, broadly popular items) and usually improve quality.
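
One SGD update for this biased model can be sketched as follows; the learning rate and regularization values are illustrative defaults, not tuned settings:

```python
import numpy as np

def sgd_step(mu, b_u, b_i, p_u, q_i, r_ui, lr=0.01, lam=0.02):
    """One SGD update for r_hat = mu + b_u + b_i + <p_u, q_i>."""
    err = r_ui - (mu + b_u + b_i + p_u @ q_i)
    p_new = p_u + lr * (err * q_i - lam * p_u)   # update from old q_i
    q_new = q_i + lr * (err * p_u - lam * q_i)   # update from old p_u
    b_u_new = b_u + lr * (err - lam * b_u)
    b_i_new = b_i + lr * (err - lam * b_i)
    return b_u_new, b_i_new, p_new, q_new

# Fit a single observed rating of 5 as a smoke test.
rng = np.random.default_rng(0)
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u = rng.normal(scale=0.1, size=8)
q_i = rng.normal(scale=0.1, size=8)
for _ in range(500):
    b_u, b_i, p_u, q_i = sgd_step(mu, b_u, b_i, p_u, q_i, r_ui=5.0)
pred = mu + b_u + b_i + p_u @ q_i
```

The regularization keeps the prediction slightly below the target, which is exactly the shrinkage behavior you want on sparse data.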

4.3 Implicit-feedback factorization

Following the article’s logic, implicit events are treated as preference plus confidence.

One common setup:

  • Preference: $p_{ui} \in \{0, 1\}$ from interaction presence
  • Confidence: $c_{ui} = 1 + \alpha t_{ui}$, where $t_{ui}$ is interaction strength

Objective:

$$ \min_{X, Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \|x_u\|_2^2 + \|y_i\|_2^2 \right) $$

This is the core weighted-implicit matrix factorization approach used in large-scale recommenders.

The Google course adds an important weighted-matrix-factorization view that is especially useful in industrial retrieval systems. Let A be the feedback matrix and let obs denote observed interactions. A common weighted objective is:

$$ \min_{U, V} \sum_{(u,i) \in \text{obs}} \left( A_{ui} - \langle U_u, V_i \rangle \right)^2 + w_0 \sum_{(u,i) \notin \text{obs}} \langle U_u, V_i \rangle^2 $$

Here $w_0$ controls how strongly the model treats unobserved pairs as weak negatives. In practice, this matters a lot: too little weight on unobserved pairs can make the embedding space collapse, while too much weight can wash out true positives. Google also notes that frequent users or popular items can dominate the objective, so observed pairs are often reweighted by user or item frequency.
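
The weighted objective can be written directly. The sketch below assumes a small dense toy matrix where positive entries mark observed interactions; a real sparse corpus would need sparse bookkeeping instead:

```python
import numpy as np

def weighted_mf_loss(A, U, V, w0):
    """Observed squared error plus a w0-weighted pull toward zero
    on unobserved entries (treated as weak negatives)."""
    S = U @ V.T                  # predicted affinity scores
    obs = A > 0                  # assumption: positives mark observed pairs
    return float(((A - S)[obs] ** 2).sum() + w0 * (S[~obs] ** 2).sum())

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
U = np.eye(2)
V = np.array([[0.5, 0.2],
              [0.2, 0.5]])
loss = weighted_mf_loss(A, U, V, w0=0.1)
```

Setting `w0=0` removes the unobserved term entirely, which is the regime where the embedding space is free to collapse.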

4.4 Evaluation for rating prediction

For explicit-feedback recommendation, D2L’s matrix factorization section uses RMSE as the primary evaluation measure:

$$ \mathrm{RMSE} = \sqrt{ \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left( r_{ui} - \hat r_{ui} \right)^2 } $$

where $\mathcal{T}$ is the evaluation set of observed user-item pairs.

RMSE is appropriate for rating prediction, but it is not sufficient for top-n recommendation because it does not evaluate rank order.
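
As a one-function reference implementation (alignment of true and predicted ratings over the evaluation pairs is assumed to be handled upstream):

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root mean squared error over an evaluation set of (u, i) pairs."""
    r_true = np.asarray(r_true, dtype=float)
    r_pred = np.asarray(r_pred, dtype=float)
    return float(np.sqrt(np.mean((r_true - r_pred) ** 2)))
```

For example, `rmse([4, 3, 5], [3.5, 3, 4])` is about 0.645.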

4.5 AutoRec for nonlinear rating prediction

AutoRec extends collaborative filtering with an autoencoder-style reconstruction objective.

  • Input is a partially observed user vector or item vector from the rating matrix
  • The network reconstructs missing entries through a hidden representation
  • Only observed ratings should contribute to the training loss

For item-based AutoRec, D2L writes the input as the $i$-th column $R_{\cdot i}$ of the rating matrix and reconstructs it with a nonlinear network:

$$ h(R_{\cdot i}) = f\left( W \cdot g\left( V R_{\cdot i} + \mu \right) + b \right) $$

The learning objective minimizes reconstruction error over observed entries only:

$$ \min_{W, V, \mu, b} \sum_{i=1}^{M} \left\| R_{\cdot i} - h(R_{\cdot i}) \right\|_{\mathcal{O}}^2 + \lambda \left( \|W\|_F^2 + \|V\|_F^2 \right) $$

Conceptually, AutoRec matters because it is one of the earliest examples in D2L of moving from linear collaborative filtering to nonlinear neural reconstruction for rating prediction.
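
A minimal numpy sketch of the item-based forward pass and the observed-entries-only loss, assuming a sigmoid $g$, an identity $f$, and a toy rating column where 0 marks missing entries:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 2                          # ratings per column, hidden units
V = rng.normal(scale=0.5, size=(k, m))
mu = rng.normal(scale=0.5, size=k)
W = rng.normal(scale=0.5, size=(m, k))
b = rng.normal(scale=0.5, size=m)

def autorec_forward(r):
    """h(r) = f(W . g(V r + mu) + b) with sigmoid g and identity f."""
    hidden = 1.0 / (1.0 + np.exp(-(V @ r + mu)))
    return W @ hidden + b

def autorec_loss(r, observed, lam=0.01):
    # Only observed entries contribute to the reconstruction term.
    rec = ((r - autorec_forward(r))[observed] ** 2).sum()
    return float(rec + lam * ((W ** 2).sum() + (V ** 2).sum()))

r = np.array([5.0, 0.0, 3.0, 0.0])   # 0 marks unobserved ratings
observed = r > 0
```

Masking matters: scoring all entries would penalize the network for reconstructing ratings that were never observed.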

4.6 Personalized ranking objectives

D2L makes an important distinction between rating prediction objectives and ranking objectives.

  • Pointwise objectives model one user-item interaction at a time
  • Pairwise objectives model relative preference between a positive and a negative item
  • Listwise objectives optimize properties of an entire ranked list

| Objective family | Training signal | Pros | Cons | Typical use |
|---|---|---|---|---|
| Pointwise | One labeled user-item example at a time | Simple to implement, works with standard regression or classification losses, easy to calibrate as a score or probability | Does not optimize ordering directly, sensitive to label noise and exposure bias, can overfocus on absolute score accuracy | CTR prediction, rating prediction, coarse ranking baselines |
| Pairwise | Positive item compared against a sampled negative item | Better aligned with top-n ranking, efficient for implicit feedback, usually easier to train than full listwise methods | Quality depends heavily on negative sampling, does not model full-list effects, can miss business constraints beyond pair comparisons | Candidate generation, implicit-feedback retrieval, pre-ranking |
| Listwise | Entire ranked list or slate | Best conceptual match to ranking metrics such as NDCG, can optimize position effects and whole-list quality | More complex objectives, heavier computation, harder data construction and serving alignment | Final-stage ranking, search ranking, slate optimization |

For top-n recommendation from implicit feedback, pairwise objectives are often a better match to the task.

The two core D2L losses are:

  1. Bayesian Personalized Ranking (BPR), which encourages the positive item to score above a sampled negative item:

$$ \sum_{(u,i,j) \in D} \ln \sigma\left( \hat y_{ui} - \hat y_{uj} \right) - \lambda_\Theta \|\Theta\|^2 $$

  2. Hinge ranking loss, which pushes the positive item above the negative item by a margin $m$:

$$ \sum_{(u,i,j) \in D} \max\left( m - \hat y_{ui} + \hat y_{uj}, 0 \right) $$

These are central for implicit-feedback recommendation because they optimize relative ordering rather than absolute score accuracy.
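
Both losses reduce to one-liners for a single (positive, negative) score pair. Here `bpr_loss` is the negative of the BPR summand (regularization omitted), so that minimizing it maximizes the objective above:

```python
import numpy as np

def bpr_loss(s_pos, s_neg):
    """Negative log-sigmoid of the score margin; small when s_pos >> s_neg."""
    return float(-np.log(1.0 / (1.0 + np.exp(-(s_pos - s_neg)))))

def hinge_loss(s_pos, s_neg, m=1.0):
    """Zero once the positive item clears the negative by margin m."""
    return float(max(m - s_pos + s_neg, 0.0))
```

Note that both depend only on the score difference, not on absolute score values, which is the defining property of pairwise ranking objectives.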

4.7 SVD++ intuition

SVD++ augments user representation with signals from interacted items, helping when explicit feedback is sparse but interaction history exists.

5. Feature-Rich Recommendation

As D2L section 21.8 emphasizes, interaction data is often sparse and noisy. In many production settings, recommendation is better framed as impression-level prediction with rich side features.

5.1 Feature-rich recommendation and CTR

Feature-rich recommendation is common in ads, feeds, and product surfaces.

  • Labels are often binary, such as click vs no click
  • Inputs include many categorical fields rather than only user and item IDs
  • The D2L advertising example uses 34 fields, with the first column as the click label and the remaining columns as categorical features

This setting is different from classic matrix factorization because the goal is often click-through rate prediction over impression-level examples rather than rating reconstruction.

CTR is defined as:

$$ \mathrm{CTR} = \frac{\#\text{clicks}}{\#\text{impressions}} \times 100 $$

5.2 Factorization machines

Factorization machines are one of the most important bridges between collaborative filtering and feature-rich prediction.

For a feature vector $x \in \mathbb{R}^d$, the two-way FM model is:

$$ \hat y(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j $$

Interpretation:

  • The first two terms are linear
  • The last term models pairwise feature interactions
  • If one feature encodes user identity and another encodes item identity, the interaction term reduces to a collaborative-filtering-style embedding interaction

D2L also highlights the computational trick that reduces FM interaction cost from $\mathcal{O}(kd^2)$ to $\mathcal{O}(kd)$, which is why FM remains practical on high-dimensional sparse data.
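
The identity behind that trick is easy to verify numerically: the pairwise sum equals $\frac{1}{2}\sum_f \left[ (\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2 \right]$. The sketch below compares the naive double loop against the fast form:

```python
import numpy as np

def fm_pairwise(x, V):
    """Naive O(k d^2) sum: sum_{i<j} <v_i, v_j> x_i x_j (V rows are v_i)."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            total += (V[i] @ V[j]) * x[i] * x[j]
    return float(total)

def fm_pairwise_fast(x, V):
    """O(k d) reformulation via the (sum^2 - sum-of-squares) identity."""
    s1 = (V.T @ x) ** 2           # (sum_i v_if x_i)^2, per factor f
    s2 = (V.T ** 2) @ (x ** 2)    # sum_i v_if^2 x_i^2, per factor f
    return 0.5 * float((s1 - s2).sum())

rng = np.random.default_rng(1)
x = rng.normal(size=6)
V = rng.normal(size=(6, 3))
```

On sparse inputs the fast form is even cheaper, since both inner sums run only over the nonzero features.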

5.3 DeepFM

DeepFM extends FM by combining low-order feature interactions from FM with high-order nonlinear interactions from a deep network.

  • The FM branch captures low-order interactions
  • The deep branch uses shared embeddings and an MLP to learn higher-order interactions
  • Both outputs are combined into a final prediction

D2L presents the DeepFM prediction as:

$$ \hat y = \sigma\left( \hat y^{(FM)} + \hat y^{(DNN)} \right) $$

DeepFM is especially useful when simple pairwise interactions are not expressive enough, but you still want the inductive bias of factorization-based feature interaction.
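
A toy forward pass can make the shared-embedding structure concrete. This sketch assumes one categorical index per field (so the one-hot feature values are all 1) and a single hidden layer; all sizes and initializations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_fields, vocab, k = 3, 10, 4

E = rng.normal(scale=0.1, size=(vocab, k))         # shared embedding table
w = rng.normal(scale=0.1, size=vocab)              # first-order FM weights
W1 = rng.normal(scale=0.1, size=(n_fields * k, 8)) # MLP weights (toy sizes)
W2 = rng.normal(scale=0.1, size=8)

def deepfm_forward(idx):
    """idx holds one categorical id per field."""
    emb = E[idx]                                   # (n_fields, k), shared
    # FM branch: first-order terms plus pairwise interactions via the
    # (sum^2 - sum-of-squares) identity over field embeddings.
    y_fm = w[idx].sum() + 0.5 * (emb.sum(0) ** 2 - (emb ** 2).sum(0)).sum()
    # Deep branch: MLP over the same concatenated embeddings.
    hidden = np.maximum(emb.reshape(-1) @ W1, 0.0)  # ReLU layer
    y_dnn = hidden @ W2
    return 1.0 / (1.0 + np.exp(-(y_fm + y_dnn)))    # sigma(y_FM + y_DNN)

p = deepfm_forward(np.array([1, 5, 9]))
```

The key design point is that both branches read the same embedding table, so the low-order and high-order interaction signals share representations.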

DeepFM architecture

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

5.4 Hybrid factorization with features (LightFM-style)

  • User embedding = sum of user-feature embeddings
  • Item embedding = sum of item-feature embeddings
  • Score uses dot product (+ optional biases)

Why data scientists use this:

  • Stronger cold-start behavior
  • Smooth path between collaborative and content-based modeling
  • Practical when metadata quality is reasonable
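
The feature-sum idea can be sketched in a few lines; the string-keyed feature tables and feature names below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

# One embedding per feature; the ID is just another feature, so an item
# with zero interactions still gets an embedding from its metadata.
user_feat_emb = {"id:42": rng.normal(size=k),
                 "country:DE": rng.normal(size=k)}
item_feat_emb = {"genre:jazz": rng.normal(size=k),
                 "decade:1960s": rng.normal(size=k)}

def embed(features, table):
    """User/item embedding = sum of its feature embeddings."""
    return sum(table[f] for f in features)

def score(user_features, item_features):
    return float(embed(user_features, user_feat_emb)
                 @ embed(item_features, item_feat_emb))

# Cold-start item: scored from metadata features alone.
s = score(["id:42", "country:DE"], ["genre:jazz", "decade:1960s"])
```

Because the embedding is a sum, a brand-new item inherits a sensible position in the space from its metadata features before any interactions exist.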

The Google course makes the same idea concrete from a matrix-factorization angle: you can augment the original interaction matrix with user-feature and item-feature blocks, then factorize the augmented matrix so that side features learn embeddings alongside users and items. Conceptually, this is one of the cleanest bridges between classic WALS-style recommender systems and modern hybrid feature-based models.

5.5 Industrial ads CTR architectures

If you want a more industrial view of feature-rich ranking, two recent papers are worth reading after the material above.

These are useful extensions of the chapter because they show what happens when feature-rich recommendation is pushed into industrial ads ranking. They should not be read as broad recommender-system blueprints or as the state of the art for recommender systems in general. They are narrower and more specific: both are ads CTR architecture papers shaped by impression-level prediction, extreme scale, and production serving constraints.

6. Deep Models

The NVIDIA glossary adds an important extension: deep learning recommenders build on embeddings and factorization ideas, but replace simple linear interactions with more expressive neural architectures.

Useful model families include:

  • Feedforward networks and multilayer perceptrons for flexible nonlinear scoring
  • Convolutional models when image content matters
  • Recurrent networks and transformers for sequential, session-based behavior

6.1 Two-tower retrieval models

The Google course gives the retrieval intuition, and the two blog references sharpen how that intuition gets productionized. A two-tower model is a dual-encoder architecture: one tower maps the query side into an embedding, and the other maps the item side into the same embedding space. The interaction is deliberately delayed until the very end.

If $x_q$ is the query-side input and $x_i$ is the item-side input, the core score is:

$$ s(x_q, x_i) = \langle \psi(x_q), \phi(x_i) \rangle $$

where $\psi(\cdot)$ is the query tower and $\phi(\cdot)$ is the item tower.

This late-interaction design is the key reason two-tower models dominate retrieval and pre-ranking. The towers can be trained jointly, but item embeddings can then be precomputed and indexed, which makes large-scale ANN retrieval practical.

Two-tower retrieval architecture

Training objectives and the softmax view

Instead of only factorizing a user-item matrix, you can map a query context $x$ through a neural network to a dense representation $\psi(x)$ and score the catalog with a softmax layer:

$$ z(x) = \psi(x)^\top V $$

$$ p(i \mid x) = \frac{\exp(z_i)}{\sum_{j=1}^{|I|} \exp(z_j)} $$

where $V$ contains the learned item representations.

In practice, exact softmax over a large catalog is too expensive, so industrial systems usually rely on sampled softmax, negative sampling, hard negatives, BPR-style pairwise losses, or contrastive objectives such as InfoNCE. Google’s negative-sampling subsection is worth reading because it gives a concrete explanation of folding: if you train only on positive pairs, embeddings from unrelated categories can collapse into the same region and produce spurious recommendations. The Shaped deep dive also notes that pointwise log loss is still common when the retriever is trained as a coarse candidate generator ahead of a stronger ranker.
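
One common practical variant is in-batch sampled softmax, where each query's positive item is scored against the other positives in the batch as negatives. The sketch below assumes that setup rather than full-catalog softmax:

```python
import numpy as np

def in_batch_softmax_loss(Q, X):
    """Q: (B, k) query embeddings; X: (B, k) embeddings of each query's
    positive item. Other rows in the batch act as sampled negatives."""
    logits = Q @ X.T                               # (B, B) score matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_p).mean())           # positives on the diagonal

Q = np.eye(3) * 5.0
loss_aligned = in_batch_softmax_loss(Q, Q.copy())           # positives match
loss_shuffled = in_batch_softmax_loss(Q, np.roll(Q, 1, 0))  # positives misaligned
```

In-batch negatives are popularity-biased (popular items appear in more batches), which is one reason production systems often add corrections or mix in hard negatives.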

Training a softmax recommendation model

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

| Aspect | Matrix factorization | Softmax DNN / dual-encoder training |
|---|---|---|
| Query and side features | Not easy to include directly | Can incorporate richer query, context, and side features |
| Cold start | Weak by default, though heuristics and projection tricks can help | Handles new queries more naturally when query features are available |
| Folding risk | Less prone to folding; WALS-style weighting can help control it | More prone to folding and usually needs negative sampling or related regularization |
| Training scalability | Easier to scale to very large sparse corpora | Harder to scale; often needs sampling, hashing, or other approximations |
| Serving cost | Very cheap when user and item embeddings are static or cheaply updated | Item embeddings can be cached, but query embeddings often need to be computed online |

Google’s summary judgment is useful: matrix factorization is usually the better retrieval choice for very large corpora, while DNN-based retrieval becomes attractive when you need richer query features and more personalized relevance modeling.

Training versus serving

This is where the architecture becomes operationally attractive:

  • During training, the two towers are optimized jointly so that relevant query-item pairs are close in the embedding space and irrelevant pairs are pushed apart.
  • During serving, the item tower is run offline over the full catalog and its embeddings are stored in an ANN index.
  • At request time, the system computes only the query embedding online, queries the ANN index, and returns a top-K candidate set for downstream ranking.

This decoupling is why two-tower models are common in candidate generation, related-item retrieval, and pre-ranking stages with strict latency budgets.
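
The offline/online split can be sketched as follows, with brute-force dot-product scoring standing in for a real ANN index such as ScaNN or FAISS:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_items = 8, 1000

# Offline: run the item tower over the full catalog and store embeddings.
# (A real system would load these into an ANN index; brute force stands
# in for the index here.)
item_emb = rng.normal(size=(n_items, k))

def retrieve_top_k(query_emb, item_emb, top_k=10):
    """Online: compute only the query embedding, then fetch candidates."""
    scores = item_emb @ query_emb                 # dot-product scores
    idx = np.argpartition(-scores, top_k)[:top_k] # unordered top-k
    return idx[np.argsort(-scores[idx])]          # candidates, best first

q = rng.normal(size=k)           # stand-in for the online query tower output
candidates = retrieve_top_k(q, item_emb)
```

`argpartition` keeps the retrieval cost dominated by the single matrix-vector product rather than a full sort of the catalog.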

Two-tower training and serving workflow

Tower design choices

The towers do not have to be simple MLPs. As the Shaped article emphasizes, the query tower may consume user IDs, demographics, session state, search context, or sequential behavior, while the item tower may consume item IDs, metadata, text, image embeddings, or other modality-specific features.

Common choices include:

  • MLPs over concatenated embeddings and dense features
  • Sequence models or transformers on the query side for recent behavior
  • Text or multimodal encoders on the item side for semantic retrieval
  • Symmetric dual encoders when both sides have similar modalities
  • Asymmetric dual encoders when the query and item spaces are very different

For smaller catalogs, the raw two-tower score may be enough to rank directly. For very large catalogs, it is almost always used as a retrieval or pre-ranking model ahead of a richer scorer.

Limitations and promising extensions

The main weakness is also the reason the model is fast: user-item interaction is restricted to the final dot product. This creates an information bottleneck.

In practice, that means:

  • Fine-grained cross-feature interactions are not modeled explicitly
  • Subtle conditional preferences can be missed
  • The retriever usually needs a downstream ranker to recover accuracy

The Reach Sumit survey is useful here because it covers several extensions aimed at reducing this bottleneck while keeping most of the serving efficiency:

  • DAT (Dual Augmented Two-Tower): augments each tower with cross-side historical interaction signals
  • IntTower: adds feature-importance calibration, fine-grained early interaction, and contrastive interaction regularization
  • ColBERT-style late interaction: preserves query-item decoupling better than a full cross-encoder while keeping richer token-level matching than a pure dot product

These models live in the space between pure representation-based retrieval and full interaction-heavy ranking models.

Interaction-enhanced two-tower variants

6.2 Neural collaborative filtering

Neural collaborative filtering keeps the collaborative setup of user-item interactions, but learns the interaction function with a neural network instead of relying only on a dot product.

  • In NeuMF from D2L, a generalized matrix factorization (GMF) path is combined with an MLP path
  • This can capture more complex nonlinear relationships than matrix factorization alone
  • It is most useful when interaction volume is high enough to support a richer model

NeuMF also fits naturally with pairwise ranking and negative sampling, rather than only explicit rating prediction.

NeuMF architecture

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

6.3 Variational autoencoders for collaborative filtering

Variational autoencoder approaches learn a compressed latent representation of a user’s interaction history and then reconstruct likely missing interactions.

  • Useful for implicit-feedback recommendation
  • Helps capture nonlinear structure in sparse user-item behavior
  • Often treated as a reconstruction problem over interaction vectors

VAE-style collaborative filtering architecture

6.4 Contextual sequence learning

Session-based recommenders often care less about static preference and more about what the user is likely to do next.

  • In D2L’s sequence-aware recommendation section, the featured model is Caser, which uses horizontal and vertical convolutions over the recent interaction matrix
  • Horizontal filters capture union-level patterns across multiple recent actions
  • Vertical filters capture point-level effects of individual recent actions
  • RNN, LSTM, GRU, and transformer models are also widely used for this setting
  • Inputs can include both ordered actions and contextual features such as time, device, or location
  • This is especially relevant in streaming, shopping, and short-session products

Caser architecture

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

D2L also provides a useful view of how sequence-aware samples are constructed from chronological user histories, including the held-out next item and sampled negatives:

Sequence-aware data generation

Image credit: Dive into Deep Learning, CC BY-SA 4.0.

6.5 Wide-and-deep style models

Wide-and-deep architectures combine memorization and generalization.

  • The wide component captures simpler feature interactions that may occur rarely
  • The deep component learns richer nonlinear structure through embeddings and dense layers
  • This pattern is effective when recommendation quality depends on both handcrafted cross-features and learned representations

Wide-and-deep recommendation architecture

6.6 DLRM-style models

DLRM-style models are designed for recommendation data with many categorical features and some numerical features.

  • Embeddings handle sparse categorical inputs
  • MLP layers process dense features
  • Explicit pairwise feature interactions are then modeled before final prediction

DLRM-style recommendation architecture

These models are widely used in large-scale ranking and click-through prediction systems.

7. Production Concerns

The model taxonomy is excellent, but real systems also require these decisions.

7.1 Retrieval + ranking architecture

Most large systems are two-stage:

  1. Candidate generation (fast, high recall)
  2. Ranking (slower, richer features/objective)

Without this separation, serving cost or latency becomes prohibitive.

Google’s course extends this into a practical three-stage view:

  1. Candidate generation
  2. Scoring or ranking
  3. Re-ranking

The extra re-ranking stage matters because the best ranked list for raw engagement is often not the best final surface once you account for freshness, diversity, fairness, or business constraints.

Recommendation process architecture

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

In production, candidate generation is usually itself a mixture of sources:

  • Embedding nearest neighbors from a two-tower or matrix-factorization model
  • Co-visitation or graph-based retrieval
  • Popularity or trending backfills
  • Rule-based inventory or policy constraints

One key Google point is that scores from different candidate generators are usually not comparable. That is why a separate scorer or ranker is often necessary after retrieval.

For neural retrieval, Google also stresses approximate nearest-neighbor search rather than exact brute-force scoring over the full catalog. Libraries such as ScaNN are used to make this practical at large scale.

Approximate nearest-neighbor retrieval in embedding space

Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.

7.2 Label design and negatives

For implicit data, non-click is not always negative. You need:

  • Exposure-aware negatives
  • Position-bias-aware training
  • Time-windowed labels matching product goals

Google’s scoring module makes a related point: you need to be explicit about what you are optimizing. A model trained for click probability can converge to clickbait. A model trained for watch time may overserve long items. A model trained for immediate conversion can hurt long-term trust or retention.

In other words, score definition is part of the product design, not just a modeling choice.

For feed-style or slate recommendation, Google also recommends distinguishing between:

  • Position-dependent models, which estimate utility at a fixed slot
  • Position-independent models, which try to estimate intrinsic relevance before layout effects

That distinction matters because position bias can make top slots look artificially better even when the item itself is not more relevant.

7.3 Evaluating recommender and ranking systems

D2L’s NeuMF evaluator is a good starting point for implicit-feedback ranking evaluation. The protocol uses a chronological split, holds out a future ground-truth item $g_u$ for each user $u$, and ranks that item against items the user has not interacted with.

Two core metrics in that setup are:

$$ \text{Hit@}K = \frac{1}{|U|} \sum_{u \in U} \mathbf{1}\left( \text{rank}_{u, g_u} \le K \right) $$

and

$$ \mathrm{AUC} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|I \setminus S_u|} \sum_{j \in I \setminus S_u} \mathbf{1}\left( \text{rank}_{u, g_u} < \text{rank}_{u, j} \right) $$

where $U$ is the user set, $I$ is the item set, and $S_u$ is the set of items already associated with user $u$.

This evaluator is useful because it respects time order and measures whether the held-out future item is surfaced near the top. But for production recommendation systems, you usually need a wider evaluation stack than Hit@K and AUC alone.

Stage-specific offline metrics

  • Retrieval: use Recall@M or candidate hit rate to verify that the candidate generator is not dropping relevant items before the ranker sees them.
  • Ranking: use NDCG@K, Recall@K, and MRR for top-of-list quality. If you have multiple relevant held-out items per user, MAP is also useful.
  • Rating prediction: use RMSE or MAE only when explicit rating prediction is the real product task. These metrics are much less informative for feed ranking or item recommendation.

Let $G_u$ denote the relevant items for user $u$, let $C_u(M)$ be the top-$M$ retrieval candidate set, and let $L_u(K)$ be the top-$K$ ranked list.

For retrieval, a standard candidate-stage metric is:

$$ \text{Recall@}M = \frac{1}{|U|} \sum_{u \in U} \frac{|G_u \cap C_u(M)|}{|G_u|} $$

For the final ranked list, the analogous metric is:

$$ \text{Recall@}K = \frac{1}{|U|} \sum_{u \in U} \frac{|G_u \cap L_u(K)|}{|G_u|} $$

To reward correct ordering near the top, define

$$ \text{DCG@}K(u) = \sum_{j=1}^{K} \frac{2^{\text{rel}_{u,j}} - 1}{\log_2(j + 1)} $$

and then normalize by the ideal ordering:

$$ \text{NDCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{\text{DCG@}K(u)}{\text{IDCG@}K(u)} $$

where $\text{rel}_{u,j}$ is the relevance label of the item at position $j$ for user $u$.

If you care about the position of the first relevant result, use mean reciprocal rank:

MRR=1|U|uU1ru

where ru is the rank position of the first relevant item for user u, with reciprocal rank taken as 0 if no relevant item appears.

If multiple relevant items can appear in the list, average precision is also useful:

$$\text{AP@}K(u) = \frac{1}{\min(|G_u|, K)} \sum_{j=1}^{K} \text{Precision@}j(u) \cdot \mathbf{1}\left(i_{u,j} \in G_u\right)$$

$$\text{MAP@}K = \frac{1}{|U|} \sum_{u \in U} \text{AP@}K(u)$$

where $i_{u,j}$ is the item shown at rank $j$ to user $u$.
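MRR and AP@K follow the same pattern as the sketches above. A minimal per-user version, assuming `relevant` is a set of relevant item ids (names are illustrative):

```python
def reciprocal_rank(relevant, ranked_list):
    """1 / r_u for the first relevant item; 0 if none appears."""
    for j, item in enumerate(ranked_list, start=1):
        if item in relevant:
            return 1.0 / j
    return 0.0

def ap_at_k(relevant, ranked_list, k):
    """AP@K: precision accumulated only at the positions of relevant items,
    normalized by min(|G_u|, K)."""
    hits, precision_sum = 0, 0.0
    for j, item in enumerate(ranked_list[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / j
    return precision_sum / min(len(relevant), k)
```

Averaging these per-user values over $U$ gives MRR and MAP@K respectively.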

Among these, NDCG@K is often the strongest single ranking metric because it rewards putting the most relevant items near the top rather than merely somewhere in the top K.

Protocol choices matter as much as the metric

  • Use chronological splits for implicit and sequence-aware tasks. Random splits can leak future information.
  • State clearly whether evaluation is full-catalog, sampled-negative, or candidate-set based. Numbers are not comparable across these protocols.
  • Evaluate on exposed or eligible items when possible. Treating every unclicked item in the full catalog as a negative can distort results.
  • Report by segment: new users, power users, new items, head items, and long-tail items often behave very differently.
  • When the system has multiple stages, evaluate each stage separately and end-to-end.
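The first bullet, a chronological split, is often implemented as leave-one-out: each user's most recent interaction becomes the held-out future ground-truth item. A minimal sketch, assuming interactions arrive as `(user, item, timestamp)` tuples (an illustrative schema, not a fixed API):

```python
from collections import defaultdict

def chronological_split(interactions):
    """Leave-one-out chronological split.

    interactions: iterable of (user, item, timestamp) tuples.
    Returns (train_interactions, test_item_per_user).
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, test = [], {}
    for user, events in by_user.items():
        events.sort()  # chronological order per user
        for ts, item in events[:-1]:
            train.append((user, item, ts))
        test[user] = events[-1][1]  # the held-out future ground-truth item
    return train, test
```

Because the split point is per-user and time-ordered, the model never trains on interactions that occur after the item it is asked to predict, which is exactly the leakage a random split would introduce.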

Beyond ranking accuracy

Accuracy metrics alone can produce a recommender that is brittle or bad for the product.

  • Coverage: how much of the catalog is ever recommended
  • Diversity: how different the recommended items are from one another
  • Novelty and serendipity: whether the system only repeats obvious items
  • Calibration: whether recommendations match the user’s current intent, not just their historical average
  • Fairness and marketplace health: whether some suppliers, creators, or item groups are systematically suppressed

These matter because a system with slightly lower NDCG@K can still be better for long-term engagement if it improves diversity, catalog health, or repeat-user satisfaction.

Online evaluation

Offline metrics are necessary but insufficient. You still need A/B tests with:

  • Primary metrics: CTR, conversion, watch time, retention, revenue, or long-term value depending on the product
  • Guardrails: latency, page-load impact, complaint rate, hide/block rate, unsafe-content rate
  • Diagnostic cuts: new vs returning users, cold-start items, geography, device, heavy-user cohorts

For ranking changes, it is also useful to monitor the full funnel:

  • Candidate recall
  • Ranker win rate on exposed impressions
  • Final-surface engagement
  • Downstream business outcomes

Offline-to-online recommender evaluation flow

7.4 Re-ranking, freshness, diversity, and exploration

Pure exploitation can collapse catalog diversity. You need controlled exploration:

  • Epsilon-greedy or Thompson-style policies
  • Re-ranking for diversity/novelty
  • Periodic calibration checks

Google’s reranking material is especially useful here. In practice, re-ranking is where you inject constraints that the base ranker usually misses:

  • Freshness so the feed does not go stale
  • Diversity so near-duplicate items do not dominate
  • Fairness or marketplace balance so one creator, seller, or provider is not systematically overexposed
  • Local policy constraints such as demotions, blocks, maturity filters, or legal limits

This stage is often simpler than the main ranker, but it has outsized product impact because it controls the final list actually seen by the user.
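One common way to implement the diversity part of this stage is a greedy maximal-marginal-relevance (MMR) style re-ranker, which trades the base ranker's score against redundancy with items already selected. A minimal sketch, with illustrative score and similarity interfaces (the source does not prescribe this particular method):

```python
def mmr_rerank(candidates, scores, similarity, k, lam=0.7):
    """Greedy MMR-style re-ranking.

    candidates: item ids; scores: item -> base ranker score;
    similarity: (item, item) -> similarity in [0, 1].
    lam near 1 favors relevance, near 0 favors diversity.
    """
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(item):
            # penalize items too similar to anything already selected
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * scores[item] - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

The same greedy loop is a natural place to bolt on the other constraints listed above, such as freshness boosts or per-provider exposure caps, since it touches every position of the final list.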

7.5 Reliability and monitoring

Data scientists should treat recommenders as continuously monitored systems:

  • Feature drift and embedding drift
  • Candidate recall degradation
  • Online metric drift and alerting
  • Safe fallback policies

If you want a broader production-systems reference beyond recommender-specific papers, Chip Huyen’s Designing Machine Learning Systems is a strong complement to this chapter.

It is not a recommender-systems book specifically, but it is useful for recommender practitioners because it covers the operational side of production ML: data pipelines, iterative development, deployment tradeoffs, monitoring, feedback loops, and the gap between offline model quality and deployed system behavior. That makes it a good companion once you move from method selection into long-term system ownership.

8. Practical Build Sequence for Data Scientists

Use this as a practical order of operations rather than a rigid recipe. The point is to add complexity only when the simpler stage has already been validated.

8.1 Define the objective hierarchy

Start by making the optimization target explicit.

  • Clarify whether the system is optimizing short-term CTR, conversion, watch time, retention, revenue, or long-term value
  • Decide which goals are primary and which are guardrails
  • Make sure the metric definition matches the actual user and business objective

This step matters because the wrong target can make even a technically strong ranker harmful in production.

8.2 Build strong non-ML baselines

Before training complex models, establish hard-to-beat baselines:

  • popularity
  • recency
  • co-visitation
  • simple item-to-item similarity

These baselines are useful for debugging, launch safety, and calibration. If a more complex model cannot beat them offline and online, the model is probably not production-ready.
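To make the co-visitation baseline concrete, here is a minimal sketch over session logs (the schema and function names are illustrative):

```python
from collections import Counter, defaultdict

def covisitation_counts(sessions):
    """Count how often each item pair co-occurs within the same session.

    sessions: iterable of lists of item ids.
    """
    co = defaultdict(Counter)
    for session in sessions:
        unique = set(session)
        for a in unique:
            for b in unique:
                if a != b:
                    co[a][b] += 1
    return co

def recommend(co, item, n=5):
    """Recommend the items most often co-visited with `item`."""
    return [b for b, _ in co[item].most_common(n)]
```

Despite its simplicity, this kind of counting baseline is often surprisingly hard to beat on related-item surfaces, which is exactly why it belongs in the launch-safety toolkit.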

8.3 Add collaborative filtering

Once you have meaningful interaction data, collaborative filtering is usually the first serious model family to try.

  • start with matrix factorization or neighborhood approaches
  • use this stage to learn whether interaction data alone is already enough to support useful personalization
  • evaluate retrieval quality separately from final ranking quality

This is often the point where recommendation becomes genuinely personalized rather than mostly heuristic.

8.4 Add metadata for hybrid robustness

After collaborative filtering is working, bring in user, item, and context features.

  • item metadata for cold-start items
  • user/context features for sparse users or context-sensitive surfaces
  • hybrid factorization or feature-rich ranking models

This stage improves robustness, especially when the system has to handle new items, changing inventory, or sparse users.

8.5 Introduce two-stage retrieval and ranking

As the catalog grows, a single heavy model over the full candidate space becomes impractical.

  • add a fast, high-recall candidate generator
  • follow with a richer ranker using better features and objectives
  • add re-ranking if you need freshness, diversity, fairness, or policy constraints

This is usually the architectural step that turns a workable recommender into a scalable one.

8.6 Establish experiment and monitoring standards

Once the system is live, treat it as a continuously evaluated product system.

  • define offline and online success criteria
  • use A/B testing with guardrails
  • monitor candidate recall, latency, drift, and business impact
  • keep safe fallback policies available

At this point, the core challenge is no longer just model training. It is maintaining quality under changing data, product goals, and operational constraints.

9. Summary

The current handbook path is meant to give data scientists a compact but practical map of the field:

  • Explicit vs implicit feedback
  • Recommendation tasks such as rating prediction, top-N ranking, sequence-aware recommendation, CTR prediction, and cold-start
  • Benchmark data practices such as MovieLens sparsity analysis and chronological evaluation splits
  • Content-based, collaborative, contextual, and hybrid filtering
  • Embedding-space candidate generation, similarity design, and matrix factorization variants
  • AutoRec and ranking objectives such as BPR and hinge loss
  • Feature-rich recommendation with factorization machines and DeepFM
  • Deep recommenders such as two-tower retrieval, interaction-enhanced dual encoders, NCF, VAE-style models, wide-and-deep models, and DLRM-style architectures
  • Three-stage production design with retrieval, scoring, and re-ranking for freshness, diversity, and fairness

For practicing data scientists, the differentiator is not only model choice. It is operational quality: robust labeling, unbiased evaluation, scalable serving, and disciplined online experimentation.

The next two chapters extend the handbook in two directions:

  • References groups the main materials behind the handbook.
  • Survey Papers and Further Reading points to broader literature when you want to go beyond a chapter-level guide.

10. References

Use this chapter as the source map for the handbook. The list is grouped by how the material is most useful in practice.

10.1 Guides and courses

10.2 Core modeling papers

10.3 Tools and implementations

10.4 Industrial ads CTR papers

11. Survey Papers and Further Reading

This chapter is not meant to be exhaustive. The goal is to help you widen your map of the field without losing the thread of the handbook.

Before jumping into the reading list, it helps to be explicit about how practitioners can use survey papers efficiently.

11.1 How to use survey papers efficiently

For practitioners, the best way to use a survey paper is usually not to read every cited method in order.

  1. Read the taxonomy first and decide which branch actually matches your product surface.
  2. Focus on the evaluation section to see which metrics and data assumptions are standard in that subfield.
  3. Pull out the open-problems section to understand what still breaks in real systems.
  4. Use the references to identify a small number of landmark papers rather than trying to read the full citation graph.

If you are building production systems rather than writing a paper, the main value of surveys is not completeness. It is faster problem framing and better judgment about which methods are mature enough to operationalize.

With that reading strategy in mind, here is a compact starting set.

11.2 Start here

If you only want a small reading set after finishing this handbook, start with these:

  1. A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation (2021)
  2. A Survey on Session-based Recommender Systems (2022)
  3. Bias and Debias in Recommender System: A Survey and Future Directions (2020)
  4. Multimodal Recommender Systems: A Survey (2023; updated 2024)
  5. How Can Recommender Systems Benefit from Large Language Models: A Survey (2025)
  6. A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives (2025)

This subset gives you coverage of classical-to-neural evolution, sequence recommendation, bias, multimodality, LLM-era recommender work, and industrial deployment constraints.

11.3 General and industry-facing surveys

A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives (2025). This survey is unusually valuable for practitioners because it centers deployment constraints rather than only benchmark performance. It discusses the industrial tradeoffs around latency, retraining cadence, marketplace constraints, product surfaces, organizational limitations, and evaluation gaps between offline results and business outcomes. It is relevant to this handbook because it is the closest survey match to the practical orientation of Production Concerns and Build Sequence.

11.4 Neural and deep-learning recommender surveys

A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation (2021). This survey is one of the most useful bridges between classic collaborative filtering and the modern neural recommendation literature. It does not just list architectures; it organizes the field by how recommendation moves from pure user-item interaction modeling toward richer inputs such as context, content, knowledge, and multi-behavior signals. It is relevant to this handbook because it helps place the Model Families, Feature-Rich Recommendation, and Deep Models chapters into one coherent research map instead of treating them as separate toolkits.

11.5 Sequential, session-based, and decision-oriented surveys

A Survey on Session-based Recommender Systems (2022). This survey focuses on settings where long-run user histories are weak, absent, or less useful than short-horizon intent. It lays out the data structure, problem formulation, and method families for session-based recommendation, including the shift from simple Markov-style methods to GRU-, attention-, and transformer-based models. It is relevant to this handbook because the Explicit vs. Implicit Feedback, Model Families, and Deep Models chapters only introduce sequence-aware recommendation at a high level; this paper is the right follow-on when your product is driven by recency, intent shifts, and within-session behavior.

11.6 Responsible and robust recommendation

Bias and Debias in Recommender System: A Survey and Future Directions (2020). This survey is useful because it reframes recommendation quality as partly a data-generation problem rather than only a model-design problem. It catalogs major bias sources such as exposure bias, selection bias, position bias, and popularity bias, then reviews algorithmic and evaluation-side responses. It is relevant to this handbook because many of the production issues discussed in Production Concerns and the evaluation caveats in the ranking chapters become much easier to reason about once you view recommendation pipelines through a bias-and-debias lens.

11.7 Emerging directions

Multimodal Recommender Systems: A Survey (2023; updated 2024). This survey covers recommendation systems that use more than IDs and tabular metadata, such as text, images, audio, and video. It is especially useful for understanding how representation learning changes when item understanding itself becomes a multimodal problem instead of a pure interaction problem. It is relevant to this handbook because it extends the Feature-Rich Recommendation and Deep Models chapters into the part of the field where content understanding and recommendation become tightly coupled.

How Can Recommender Systems Benefit from Large Language Models: A Survey (2025). This survey is a strong next read if you want to understand where LLMs fit into recommendation without collapsing everything into hype. It organizes the space around representation, reasoning, generation, user understanding, and agentic or interactive recommendation settings, while also discussing limits such as latency, hallucination, and evaluation mismatch. It is relevant to this handbook because it helps extend the Deep Models and Production Concerns chapters into the current wave of LLM-assisted retrieval, ranking, explanation, and recommendation workflows.

11.8 Foundational pre-2020 surveys worth keeping

These are older than the 2020+ focus of this section, but they are still worth keeping because they remain useful orientation documents.