Understanding Recommender Systems
Chapter Guide
- Why Recommender Systems Matter
- Explicit vs. Implicit Feedback
- Model Families
- Matrix Factorization
- Feature-Rich Recommendation
- Deep Models
- Production Concerns
- Build Sequence
- Summary
- References
- Survey Papers and Further Reading
1. Why Recommender Systems Matter
Recommender systems help users navigate very large item catalogs (videos, products, courses, jobs, music) by ranking items likely to be relevant to each user. See the background overview in Wikipedia: Recommender system.
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
For data scientists, this is usually not a pure prediction task. It is a ranking and decision problem with constraints:
- Relevance and personalization
- Diversity and novelty
- Latency and serving cost
- Business goals (retention, conversion, long-term value)
1.1 Common applications
- E-commerce and retail: cross-sell, upsell, “complete the look”, and basket expansion
- Media and entertainment: personalized ranking of video, music, articles, and ads
- Banking and financial services: product recommendations, offers, and next-best action
1.2 Business value
- Helps users discover items they would not have found through search alone
- Increases engagement, session depth, and content consumption
- Improves conversion, basket size, and retention when recommendations are well-targeted
Google’s recommender course also makes a useful product distinction between two common surfaces:
- Homepage recommendations, where the query is the user or session context
- Related-item recommendations, where the query is the current item being viewed
That distinction matters because homepage recommendation usually starts from a user or context embedding, while related-item recommendation often starts from the item embedding itself and retrieves nearby items in embedding space.
2. Explicit vs. Implicit Feedback
As in the reference article, the first key split is the type of supervision.
2.1 Explicit feedback
Examples:
- Star ratings
- Like/dislike labels
- Written reviews with sentiment scores
Pros:
- Direct preference signal
- Easier to define regression-style losses
Cons:
- Sparse in most real products
- Selection bias (only some users rate)
2.2 Implicit feedback
Examples:
- Clicks
- Watch time
- Purchases
- Add-to-cart, save, dwell
Pros:
- High volume
- Better behavioral coverage
Cons:
- Noisy preference proxy
- Requires careful negative sampling and weighting
In both cases, interactions define a sparse user-item matrix $\mathbf{R} \in \mathbb{R}^{m \times n}$, with observed entries over only a small fraction of user-item pairs.
2.3 Recommendation tasks
Following D2L Chapter 21, it helps to separate recommendation work by task:
- Rating prediction: estimate a user’s explicit rating for an item
- Top-$K$ recommendation: rank candidate items and return a personalized list
- Sequence-aware recommendation: use ordered behavior and timestamps
- Click-through rate prediction: predict whether a shown item or ad will be clicked
- Cold-start recommendation: serve new users or new items when history is limited
These tasks overlap, but they drive different labels, evaluation protocols, and model choices.
2.4 Benchmark datasets and split strategy
The MovieLens 100K dataset remains the standard conceptual benchmark for explicit-feedback recommendation.
- 100,000 ratings
- 943 users
- 1,682 movies
- Ratings from 1 to 5
- Approximate matrix sparsity of 93.7%
Two split strategies from D2L are especially useful in practice:
- Random split for rating prediction and general offline evaluation
- Sequence-aware split, where the most recent interaction is held out per user
This distinction matters because sequence-aware recommendation should be evaluated with a chronological split, not a random one.
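The sequence-aware split can be sketched in a few lines. This is a minimal leave-latest-out split in the spirit of D2L's protocol; the plain `(user, item, timestamp)` tuple format and function name are illustrative assumptions, not a fixed API.

```python
from collections import defaultdict

def leave_latest_out(interactions):
    """Hold out each user's most recent interaction as the test item.

    interactions: iterable of (user, item, timestamp) tuples.
    Returns (train, test) lists of (user, item) pairs.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, test = [], []
    for user, events in by_user.items():
        events.sort()                      # chronological order
        *history, latest = events
        train += [(user, item) for _, item in history]
        test.append((user, latest[1]))     # most recent interaction
    return train, test
```

A random split would instead sample test pairs uniformly, which is fine for rating prediction but leaks future behavior into training for sequence-aware tasks.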
3. Model Families
Model choice depends heavily on what data you have. If you only observe interactions, collaborative filtering is usually the first serious approach. If you also have user and item attributes, content-based or hybrid models become more useful. If the current situation matters, such as device, country, time, or within-session behavior, then contextual models become important.
3.1 Content-based filtering
Use user/item attributes and metadata.
- Item vectors from text/category/tags/embeddings
- User representation from demographics and/or consumed item profiles
- Similarity models (cosine, k-NN) or supervised models over features
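A minimal sketch of the similarity-model approach above, assuming item feature vectors (for example TF-IDF or embedding features) and a user profile built as the mean of consumed item vectors; the function name and feature construction are illustrative.

```python
import numpy as np

def recommend_content_based(user_profile, item_features, k=3):
    """Rank items by cosine similarity between a user profile vector
    and item feature vectors; return the indices of the top-k items."""
    norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(user_profile)
    scores = item_features @ user_profile / np.maximum(norms, 1e-12)
    return np.argsort(-scores)[:k]

# Illustrative features: the user consumed items 0 and 1,
# so the profile is the mean of those item vectors.
items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
profile = items[:2].mean(axis=0)
```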
Strength:
- Better cold-start for new items (and sometimes new users)
Limitation:
- Limited collaborative signal; can over-specialize
- Can become overly narrow if the feature space does not capture richer or emerging interests
Google’s course also emphasizes that content-based systems are often easier to explain and easier to cold-start for new items, but they tend to be weaker at serendipity than collaborative models.
| Aspect | Advantages | Disadvantages |
|---|---|---|
| User specificity | Does not require interaction data from other users, so it can scale cleanly across many users and preserve privacy better | Quality depends heavily on having good item features and user profiles |
| Discovery pattern | Can serve niche items that match a user’s known interests very well | Usually expands less well beyond existing interests, so serendipity is weaker |
| Modeling burden | Easier to explain because recommendations can be tied back to item attributes | Requires substantial domain knowledge and hand-engineered or high-quality learned features |
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
3.2 Collaborative filtering
Use interaction patterns across all users/items.
- Neighborhood methods (user-user, item-item)
- Latent-factor methods (matrix factorization)
Strength:
- Often strong personalization with enough interaction history
Limitation:
- Cold-start if no history exists
| Aspect | Advantages | Disadvantages |
|---|---|---|
| Feature engineering | Embeddings are learned automatically, so little prior domain knowledge is required | Harder to incorporate side features such as demographics, metadata, or context without model extensions |
| Discovery pattern | Can introduce serendipity because similar users can pull an item into a user’s candidate set | Suffers from cold-start for fresh users and fresh items without interactions |
| Practical role | Strong starting point because a feedback matrix alone can power a usable candidate generator | Production systems usually need extra machinery such as WALS projection heuristics, side-feature augmentation, or hybrid models to fill the gaps |
3.3 Contextual filtering
Contextual filtering incorporates information about the current situation into the recommendation process.
- Examples of context: device, country, date, time, session state, or recent action sequence
- Useful when the same user may want different items under different circumstances
- Often framed as next-action or next-item prediction rather than only long-run preference estimation
3.4 Hybrid models
Combine metadata with interaction learning.
- Best default choice in many production systems
- Handles cold-start better than pure collaborative filtering
- Usually outperforms pure content-based methods once enough interactions accumulate
3.5 Embedding spaces and similarity measures for candidate generation
The Google Developers course sharpens an important operational point: candidate generation is usually a nearest-neighbor search problem in an embedding space. Given a query embedding $\mathbf{q}$, the system scores each candidate item embedding $\mathbf{x}$ with a similarity measure and returns the top-scoring items.
Common choices are:
- Dot product: $s(\mathbf{q}, \mathbf{x}) = \langle\mathbf{q}, \mathbf{x}\rangle$
- Cosine similarity: $s(\mathbf{q}, \mathbf{x}) = \frac{\langle\mathbf{q}, \mathbf{x}\rangle}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{x}\rVert}$
- Euclidean distance: $d(\mathbf{q}, \mathbf{x}) = \lVert\mathbf{q} - \mathbf{x}\rVert$, where smaller distance means higher similarity
If the embeddings are normalized, cosine, dot product, and squared Euclidean distance induce closely related rankings. Without normalization, however, they behave differently:
- Dot product favors larger embedding norms, which often correlates with popular or frequent items
- Cosine focuses more on angular alignment, which can be better for semantic similarity
- Euclidean distance emphasizes physical closeness in the embedding space
Google also suggests a useful interpolation between pure cosine and pure dot product:

$$s(\mathbf{q}, \mathbf{x}) = \lVert\mathbf{q}\rVert^{\alpha}\,\lVert\mathbf{x}\rVert^{\alpha}\,\cos(\mathbf{q}, \mathbf{x}), \quad \alpha \in (0, 1)$$

This lets you keep some popularity signal without letting large-norm items dominate retrieval.
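The behavioral difference between these measures is easy to demonstrate. The sketch below implements the norm-damped similarity from Google's course (the `alpha` knob is the interpolation parameter): `alpha=1` recovers the dot product, `alpha=0` recovers cosine. The example vectors are illustrative.

```python
import numpy as np

def scores(query, items, alpha=1.0):
    """Score items with s(q, x) = ||q||^a * ||x||^a * cos(q, x).
    alpha=1.0 is the plain dot product; alpha=0.0 is pure cosine."""
    qn = np.linalg.norm(query)
    xn = np.linalg.norm(items, axis=1)
    cos = items @ query / (qn * xn)
    return (qn ** alpha) * (xn ** alpha) * cos

q = np.array([1.0, 0.0])
items = np.array([[3.0, 3.0],   # "popular": large norm, 45 degrees off-query
                  [0.8, 0.1]])  # "niche": small norm, well aligned with q
```

Under the dot product the large-norm item wins; under cosine the well-aligned item wins. Intermediate `alpha` values trade the two effects off.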
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
4. Matrix Factorization
The reference article emphasizes matrix factorization variants. This remains foundational for data scientists.
4.1 PMF / latent factors (explicit feedback)
Model:

$$\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i$$

where user and item embeddings $\mathbf{p}_u, \mathbf{q}_i \in \mathbb{R}^k$ are learned from the observed ratings.
Regularized loss over observed pairs $(u, i) \in \mathcal{K}$:

$$\min_{P, Q} \sum_{(u,i)\in\mathcal{K}} \left(r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i\right)^2 + \lambda\left(\lVert\mathbf{p}_u\rVert^2 + \lVert\mathbf{q}_i\rVert^2\right)$$
Optimization:
- SGD (simple, flexible)
- ALS (efficient for large sparse systems)
- Practical implementations are available in the Surprise library and its documentation
With ALS, you alternate between solving for user factors while holding item factors fixed and solving for item factors while holding user factors fixed. That makes large sparse factorization problems easier to optimize in practice.
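The alternation described above can be sketched directly: with the item factors fixed, each user's factor vector is the solution of a small ridge regression, and symmetrically for items. This is a minimal dense sketch for illustration, not a scalable implementation; `als_step` and its arguments are illustrative names.

```python
import numpy as np

def als_step(R, observed, U, V, lam=0.1):
    """One ALS sweep: solve for each user row with V fixed, then for
    each item row with U fixed. R is the rating matrix, `observed` a
    boolean mask of known entries, lam the L2 regularization weight."""
    k = U.shape[1]
    for u in range(R.shape[0]):
        idx = observed[u]
        Vo = V[idx]                                  # items rated by user u
        U[u] = np.linalg.solve(Vo.T @ Vo + lam * np.eye(k), Vo.T @ R[u, idx])
    for i in range(R.shape[1]):
        idx = observed[:, i]
        Uo = U[idx]                                  # users who rated item i
        V[i] = np.linalg.solve(Uo.T @ Uo + lam * np.eye(k), Uo.T @ R[idx, i])
    return U, V
```

Each sweep decreases the regularized loss monotonically, which is why ALS is stable even without learning-rate tuning.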
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
4.2 SVD-style bias terms
A common extension adds global/user/item bias terms:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i$$
Biases capture broad effects (strict users, broadly popular items) and usually improve quality.
4.3 Implicit-feedback factorization
Following the article’s logic, implicit events are treated as preference plus confidence.
One common setup:
- Preference: $p_{ui} = \mathbf{1}(r_{ui} > 0)$, from interaction presence
- Confidence: $c_{ui} = 1 + \alpha r_{ui}$, where $r_{ui}$ is interaction strength

Objective:

$$\min_{X, Y} \sum_{u,i} c_{ui}\left(p_{ui} - \mathbf{x}_u^\top \mathbf{y}_i\right)^2 + \lambda\left(\sum_u \lVert\mathbf{x}_u\rVert^2 + \sum_i \lVert\mathbf{y}_i\rVert^2\right)$$

This is the core weighted-implicit matrix factorization approach used in large-scale recommenders.
The Google course adds an important weighted-matrix-factorization view that is especially useful in industrial retrieval systems. Let $A$ be the sparse feedback matrix with user factors $\mathbf{u}_i$ and item factors $\mathbf{v}_j$. The weighted objective treats unobserved entries as weak zeros:

$$\min_{U, V} \sum_{(i,j)\in\text{obs}} \left(A_{ij} - \mathbf{u}_i^\top \mathbf{v}_j\right)^2 + w_0 \sum_{(i,j)\notin\text{obs}} \left(\mathbf{u}_i^\top \mathbf{v}_j\right)^2$$

Here $w_0$ controls how strongly unobserved entries are pulled toward zero. Without that term, the factorization could score every pair highly and still fit the observed data, which is one source of folding.
4.4 Evaluation for rating prediction
For explicit-feedback recommendation, D2L's matrix factorization section uses RMSE as the primary evaluation measure:

$$\text{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}}\left(r_{ui} - \hat{r}_{ui}\right)^2}$$

where $\mathcal{T}$ is the set of held-out test pairs.
RMSE is appropriate for rating prediction, but it is not sufficient for top-$K$ ranking, because it measures score accuracy rather than item ordering.
4.5 AutoRec for nonlinear rating prediction
AutoRec extends collaborative filtering with an autoencoder-style reconstruction objective.
- Input is a partially observed user vector or item vector from the rating matrix
- The network reconstructs missing entries through a hidden representation
- Only observed ratings should contribute to the training loss
For item-based AutoRec, D2L writes the input as the $i$-th column $\mathbf{R}_{*i}$ of the rating matrix, reconstructed through a single hidden layer: $h(\mathbf{r}) = f\left(\mathbf{W}\, g(\mathbf{V}\mathbf{r} + \boldsymbol{\mu}) + \mathbf{b}\right)$.
The learning objective minimizes reconstruction error over observed entries only:

$$\min_{\mathbf{W}, \mathbf{V}} \sum_{i=1}^{n} \left\lVert \mathbf{R}_{*i} - h(\mathbf{R}_{*i}) \right\rVert_{\mathcal{O}}^2 + \lambda\left(\lVert\mathbf{W}\rVert_F^2 + \lVert\mathbf{V}\rVert_F^2\right)$$

where $\lVert\cdot\rVert_{\mathcal{O}}$ means only observed ratings contribute to the loss.
Conceptually, AutoRec matters because it is one of the earliest examples in D2L of moving from linear collaborative filtering to nonlinear neural reconstruction for rating prediction.
4.6 Personalized ranking objectives
D2L makes an important distinction between rating prediction objectives and ranking objectives.
- Pointwise objectives model one user-item interaction at a time
- Pairwise objectives model relative preference between a positive and a negative item
- Listwise objectives optimize properties of an entire ranked list
| Objective family | Training signal | Pros | Cons | Typical use |
|---|---|---|---|---|
| Pointwise | One labeled user-item example at a time | Simple to implement, works with standard regression or classification losses, easy to calibrate as a score or probability | Does not optimize ordering directly, sensitive to label noise and exposure bias, can overfocus on absolute score accuracy | CTR prediction, rating prediction, coarse ranking baselines |
| Pairwise | Positive item compared against a sampled negative item | Better aligned with top-$K$ ranking metrics, natural fit for implicit feedback and negative sampling | Quality depends heavily on negative sampling, does not model full-list effects, can miss business constraints beyond pair comparisons | Candidate generation, implicit-feedback retrieval, pre-ranking |
| Listwise | Entire ranked list or slate | Best conceptual match to ranking metrics such as NDCG, can optimize position effects and whole-list quality | More complex objectives, heavier computation, harder data construction and serving alignment | Final-stage ranking, search ranking, slate optimization |
For top-$K$ recommendation with implicit feedback, pairwise objectives are usually the practical default.
The two core D2L losses are:
- Bayesian Personalized Ranking (BPR), which encourages the positive item $i$ to score above a sampled negative item $j$:

$$\mathcal{L}_{\text{BPR}} = -\sum_{(u,i,j)} \ln \sigma\left(\hat{y}_{ui} - \hat{y}_{uj}\right)$$

- Hinge ranking loss, which pushes the positive item away from the negative item by a margin $m$:

$$\mathcal{L}_{\text{Hinge}} = \sum_{(u,i,j)} \max\left(0,\; m - \hat{y}_{ui} + \hat{y}_{uj}\right)$$
These are central for implicit-feedback recommendation because they optimize relative ordering rather than absolute score accuracy.
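The BPR loss above can be computed in a few lines. This sketch scores user-item pairs with a plain dot product over embedding batches; the function name and the `(batch, k)` array layout are illustrative assumptions.

```python
import numpy as np

def bpr_loss(user_emb, pos_emb, neg_emb):
    """Mean BPR loss -ln sigmoid(y_ui - y_uj) over a batch of
    (user, positive item, negative item) embedding triples.
    All arrays have shape (batch, k)."""
    y_ui = np.sum(user_emb * pos_emb, axis=1)   # positive scores
    y_uj = np.sum(user_emb * neg_emb, axis=1)   # negative scores
    # -ln sigmoid(d) = ln(1 + exp(-d)), computed stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -(y_ui - y_uj))))
```

When the positive item scores above the negative, the loss is small; when the ordering is inverted, the loss grows roughly linearly in the score gap, which is exactly the relative-ordering pressure described above.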
4.7 SVD++ intuition
SVD++ augments user representation with signals from interacted items, helping when explicit feedback is sparse but interaction history exists.
5. Feature-Rich Recommendation
As D2L section 21.8 emphasizes, interaction data is often sparse and noisy. In many production settings, recommendation is better framed as impression-level prediction with rich side features.
5.1 Feature-rich recommendation and CTR
Feature-rich recommendation is common in ads, feeds, and product surfaces.
- Labels are often binary, such as click vs no click
- Inputs include many categorical fields rather than only user and item IDs
- The D2L advertising example uses 34 fields, with the first column as the click label and the remaining columns as categorical features
This setting is different from classic matrix factorization because the goal is often click-through rate prediction over impression-level examples rather than rating reconstruction.
CTR is defined as:

$$\text{CTR} = \frac{\#\,\text{clicks}}{\#\,\text{impressions}} \times 100\%$$
5.2 Factorization machines
Factorization machines are one of the most important bridges between collaborative filtering and feature-rich prediction.
For a feature vector $\mathbf{x} \in \mathbb{R}^d$, the degree-2 factorization machine predicts:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=i+1}^{d} \langle\mathbf{v}_i, \mathbf{v}_j\rangle\, x_i x_j$$

where each feature $i$ has a latent vector $\mathbf{v}_i \in \mathbb{R}^k$.
Interpretation:
- The first two terms are linear
- The last term models pairwise feature interactions
- If one feature encodes user identity and another encodes item identity, the interaction term reduces to a collaborative-filtering-style embedding interaction
D2L also highlights the computational trick that reduces FM interaction cost from $\mathcal{O}(kd^2)$ to $\mathcal{O}(kd)$:

$$\sum_{i=1}^{d}\sum_{j=i+1}^{d}\langle\mathbf{v}_i,\mathbf{v}_j\rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{d} v_{i,f}\, x_i\right)^2 - \sum_{i=1}^{d} v_{i,f}^2\, x_i^2\right]$$
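The reformulation can be verified numerically. The sketch below computes the pairwise interaction term both naively and with the linear-time identity; the function names are illustrative.

```python
import numpy as np

def fm_pairwise(x, V):
    """FM pairwise interaction term in O(kd) via
    0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ].
    x: (d,) feature vector; V: (d, k) factor matrix."""
    s = V.T @ x                                   # (k,) per-factor weighted sums
    return 0.5 * float(np.sum(s ** 2 - (V ** 2).T @ (x ** 2)))

def fm_pairwise_naive(x, V):
    """Reference O(kd^2) double loop over feature pairs i < j."""
    d = len(x)
    return float(sum((V[i] @ V[j]) * x[i] * x[j]
                     for i in range(d) for j in range(i + 1, d)))
```

For sparse inputs the fast form is even better than it looks, since the inner sums only run over nonzero features.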
5.3 DeepFM
DeepFM extends FM by combining low-order feature interactions from FM with high-order nonlinear interactions from a deep network.
- The FM branch captures low-order interactions
- The deep branch uses shared embeddings and an MLP to learn higher-order interactions
- Both outputs are combined into a final prediction
D2L presents the DeepFM prediction as:

$$\hat{y} = \sigma\left(\hat{y}^{(\text{FM})} + \hat{y}^{(\text{DNN})}\right)$$
DeepFM is especially useful when simple pairwise interactions are not expressive enough, but you still want the inductive bias of factorization-based feature interaction.
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
5.4 Hybrid factorization with features (LightFM-style)
- User embedding = sum of user-feature embeddings
- Item embedding = sum of item-feature embeddings
- Score uses dot product (+ optional biases)
Why data scientists use this:
- Stronger cold-start behavior
- Smooth path between collaborative and content-based modeling
- Practical when metadata quality is reasonable
The Google course makes the same idea concrete from a matrix-factorization angle: you can augment the original interaction matrix with user-feature and item-feature blocks, then factorize the augmented matrix so that side features learn embeddings alongside users and items. Conceptually, this is one of the cleanest bridges between classic WALS-style recommender systems and modern hybrid feature-based models.
5.5 Industrial ads CTR architectures
If you want a more industrial view of feature-rich ranking, two recent papers are worth reading after the material above.
- DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction (2022) is an ads CTR paper about feature interaction modeling at online advertising scale. Its main point is that different interaction modules capture different useful signals, so a hierarchical ensemble can outperform committing to a single interaction design.
- InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction (2025) is a Meta Ads CTR paper. It focuses on learning stronger interactions between heterogeneous signals such as profile features, context features, and behavior sequences.
These are useful extensions of the chapter because they show what happens when feature-rich recommendation is pushed into industrial ads ranking. They should not be read as broad recommender-system blueprints or as the state of the art for recommender systems in general. They are narrower and more specific: both are ads CTR architecture papers shaped by impression-level prediction, extreme scale, and production serving constraints.
6. Deep Models
The NVIDIA glossary adds an important extension: deep learning recommenders build on embeddings and factorization ideas, but replace simple linear interactions with more expressive neural architectures.
Useful model families include:
- Feedforward networks and multilayer perceptrons for flexible nonlinear scoring
- Convolutional models when image content matters
- Recurrent networks and transformers for sequential, session-based behavior
6.1 Two-tower retrieval models
The Google course gives the retrieval intuition, and the two blog references sharpen how that intuition gets productionized. A two-tower model is a dual-encoder architecture: one tower maps the query side into an embedding, and the other maps the item side into the same embedding space. The interaction is deliberately delayed until the very end.
If $\mathbf{u}(q)$ is the output of the query tower and $\mathbf{v}(x)$ the output of the item tower, the score is

$$s(q, x) = \langle \mathbf{u}(q), \mathbf{v}(x)\rangle$$

where the dot product (or cosine similarity) is the only point at which the two sides interact.
This late-interaction design is the key reason two-tower models dominate retrieval and pre-ranking. The towers can be trained jointly, but item embeddings can then be precomputed and indexed, which makes large-scale ANN retrieval practical.
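The precompute-then-retrieve pattern can be sketched without any framework. The brute-force index below stands in for a real ANN index such as ScaNN or FAISS; the function names are illustrative.

```python
import numpy as np

def build_item_index(item_embs):
    """Offline step: L2-normalize item-tower embeddings and cache them.
    A brute-force stand-in for an ANN index."""
    norms = np.linalg.norm(item_embs, axis=1, keepdims=True)
    return item_embs / np.maximum(norms, 1e-12)

def retrieve(query_emb, index, k=10):
    """Online step: score the query embedding against the cached index
    with a dot product and return the top-k item ids, best first."""
    scores = index @ query_emb
    top = np.argpartition(-scores, min(k, len(scores) - 1))[:k]
    return top[np.argsort(-scores[top])]
```

Only `retrieve` runs at request time; everything the item tower produces is computed offline, which is the operational payoff of late interaction.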
Training objectives and the softmax view
Instead of only factorizing a user-item matrix, you can map a query context $x$ to an embedding $\psi(x)$ and treat retrieval as a softmax over the catalog:

$$p(j \mid x) = \frac{\exp\left(\langle \psi(x), \mathbf{v}_j \rangle\right)}{\sum_{j' \in \mathcal{I}} \exp\left(\langle \psi(x), \mathbf{v}_{j'} \rangle\right)}$$

where $\mathbf{v}_j$ is the learned embedding of item $j$ and $\mathcal{I}$ is the item catalog.
In practice, exact softmax over a large catalog is too expensive, so industrial systems usually rely on sampled softmax, negative sampling, hard negatives, BPR-style pairwise losses, or contrastive objectives such as InfoNCE. Google’s negative-sampling subsection is worth reading because it gives a concrete explanation of folding: if you train only on positive pairs, embeddings from unrelated categories can collapse into the same region and produce spurious recommendations. The Shaped deep dive also notes that pointwise log loss is still common when the retriever is trained as a coarse candidate generator ahead of a stronger ranker.
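A common practical shortcut is the in-batch variant of sampled softmax: for each query in a batch, its paired item is the positive and the other items in the same batch serve as negatives. This is a minimal numpy sketch of that loss; real systems additionally correct for item-frequency bias in the in-batch negatives.

```python
import numpy as np

def in_batch_softmax_loss(query_embs, item_embs):
    """Contrastive retrieval loss with in-batch negatives.
    Row i of each array is one (query, positive item) pair; the
    positives therefore sit on the diagonal of the score matrix."""
    logits = query_embs @ item_embs.T              # (B, B) pair scores
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When the towers place each query close to its own item and far from the others, the diagonal dominates each row and the loss goes to zero; folding shows up as off-diagonal scores that stay high.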
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
| Aspect | Matrix factorization | Softmax DNN / dual-encoder training |
|---|---|---|
| Query and side features | Not easy to include directly | Can incorporate richer query, context, and side features |
| Cold start | Weak by default, though heuristics and projection tricks can help | Handles new queries more naturally when query features are available |
| Folding risk | Less prone to folding; WALS-style weighting can help control it | More prone to folding and usually needs negative sampling or related regularization |
| Training scalability | Easier to scale to very large sparse corpora | Harder to scale; often needs sampling, hashing, or other approximations |
| Serving cost | Very cheap when user and item embeddings are static or cheaply updated | Item embeddings can be cached, but query embeddings often need to be computed online |
Google’s summary judgment is useful: matrix factorization is usually the better retrieval choice for very large corpora, while DNN-based retrieval becomes attractive when you need richer query features and more personalized relevance modeling.
Training versus serving
This is where the architecture becomes operationally attractive:
- During training, the two towers are optimized jointly so that relevant query-item pairs are close in the embedding space and irrelevant pairs are pushed apart.
- During serving, the item tower is run offline over the full catalog and its embeddings are stored in an ANN index.
- At request time, the system computes only the query embedding online, queries the ANN index, and returns a top-$K$ candidate set for downstream ranking.
This decoupling is why two-tower models are common in candidate generation, related-item retrieval, and pre-ranking stages with strict latency budgets.
Tower design choices
The towers do not have to be simple MLPs. As the Shaped article emphasizes, the query tower may consume user IDs, demographics, session state, search context, or sequential behavior, while the item tower may consume item IDs, metadata, text, image embeddings, or other modality-specific features.
Common choices include:
- MLPs over concatenated embeddings and dense features
- Sequence models or transformers on the query side for recent behavior
- Text or multimodal encoders on the item side for semantic retrieval
- Symmetric dual encoders when both sides have similar modalities
- Asymmetric dual encoders when the query and item spaces are very different
For smaller catalogs, the raw two-tower score may be enough to rank directly. For very large catalogs, it is almost always used as a retrieval or pre-ranking model ahead of a richer scorer.
Limitations and promising extensions
The main weakness is also the reason the model is fast: user-item interaction is restricted to the final dot product. This creates an information bottleneck.
In practice, that means:
- Fine-grained cross-feature interactions are not modeled explicitly
- Subtle conditional preferences can be missed
- The retriever usually needs a downstream ranker to recover accuracy
The Reach Sumit survey is useful here because it covers several extensions aimed at reducing this bottleneck while keeping most of the serving efficiency:
- DAT (Dual Augmented Two-Tower): augments each tower with cross-side historical interaction signals
- IntTower: adds feature-importance calibration, fine-grained early interaction, and contrastive interaction regularization
- ColBERT-style late interaction: preserves query-item decoupling better than a full cross-encoder while keeping richer token-level matching than a pure dot product
These models live in the space between pure representation-based retrieval and full interaction-heavy ranking models.
6.2 Neural collaborative filtering
Neural collaborative filtering keeps the collaborative setup of user-item interactions, but learns the interaction function with a neural network instead of relying only on a dot product.
- In NeuMF from D2L, a generalized matrix factorization (GMF) path is combined with an MLP path
- This can capture more complex nonlinear relationships than matrix factorization alone
- It is most useful when interaction volume is high enough to support a richer model
NeuMF also fits naturally with pairwise ranking and negative sampling, rather than only explicit rating prediction.
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
6.3 Variational autoencoders for collaborative filtering
Variational autoencoder approaches learn a compressed latent representation of a user’s interaction history and then reconstruct likely missing interactions.
- Useful for implicit-feedback recommendation
- Helps capture nonlinear structure in sparse user-item behavior
- Often treated as a reconstruction problem over interaction vectors
6.4 Contextual sequence learning
Session-based recommenders often care less about static preference and more about what the user is likely to do next.
- In D2L’s sequence-aware recommendation section, the featured model is Caser, which uses horizontal and vertical convolutions over the recent interaction matrix
- Horizontal filters capture union-level patterns across multiple recent actions
- Vertical filters capture point-level effects of individual recent actions
- RNN, LSTM, GRU, and transformer models are also widely used for this setting
- Inputs can include both ordered actions and contextual features such as time, device, or location
- This is especially relevant in streaming, shopping, and short-session products
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
D2L also provides a useful view of how sequence-aware samples are constructed from chronological user histories, including the held-out next item and sampled negatives:
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
6.5 Wide-and-deep style models
Wide-and-deep architectures combine memorization and generalization.
- The wide component captures simpler feature interactions that may occur rarely
- The deep component learns richer nonlinear structure through embeddings and dense layers
- This pattern is effective when recommendation quality depends on both handcrafted cross-features and learned representations
6.6 DLRM-style models
DLRM-style models are designed for recommendation data with many categorical features and some numerical features.
- Embeddings handle sparse categorical inputs
- MLP layers process dense features
- Explicit pairwise feature interactions are then modeled before final prediction
These models are widely used in large-scale ranking and click-through prediction systems.
7. Production Concerns
The model taxonomy is excellent, but real systems also require these decisions.
7.1 Retrieval + ranking architecture
Most large systems are two-stage:
- Candidate generation (fast, high recall)
- Ranking (slower, richer features/objective)
Without this separation, serving cost or latency becomes prohibitive.
Google’s course extends this into a practical three-stage view:
- Candidate generation
- Scoring or ranking
- Re-ranking
The extra re-ranking stage matters because the best ranked list for raw engagement is often not the best final surface once you account for freshness, diversity, fairness, or business constraints.
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
In production, candidate generation is usually itself a mixture of sources:
- Embedding nearest neighbors from a two-tower or matrix-factorization model
- Co-visitation or graph-based retrieval
- Popularity or trending backfills
- Rule-based inventory or policy constraints
One key Google point is that scores from different candidate generators are usually not comparable. That is why a separate scorer or ranker is often necessary after retrieval.
For neural retrieval, Google also stresses approximate nearest-neighbor search rather than exact brute-force scoring over the full catalog. Libraries such as ScaNN are used to make this practical at large scale.
Image credit: Google for Developers Recommendation Systems course, CC BY 4.0.
7.2 Label design and negatives
For implicit data, non-click is not always negative. You need:
- Exposure-aware negatives
- Position-bias-aware training
- Time-windowed labels matching product goals
Google’s scoring module makes a related point: you need to be explicit about what you are optimizing. A model trained for click probability can converge to clickbait. A model trained for watch time may overserve long items. A model trained for immediate conversion can hurt long-term trust or retention.
In other words, score definition is part of the product design, not just a modeling choice.
For feed-style or slate recommendation, Google also recommends distinguishing between:
- Position-dependent models, which estimate utility at a fixed slot
- Position-independent models, which try to estimate intrinsic relevance before layout effects
That distinction matters because position bias can make top slots look artificially better even when the item itself is not more relevant.
7.3 Evaluating recommender and ranking systems
D2L's NeuMF evaluator is a good starting point for implicit-feedback ranking evaluation. The protocol uses a chronological split, holds out a future ground-truth item $g_u$ for each user, and ranks it within the candidate set at evaluation time.
Two core metrics in that setup are:

$$\text{Hit@}\ell = \frac{1}{m}\sum_{u \in \mathcal{U}} \mathbf{1}\left(\text{rank}_{u, g_u} \le \ell\right)$$

and

$$\text{AUC} = \frac{1}{m}\sum_{u \in \mathcal{U}} \frac{1}{|\mathcal{I} \setminus S_u|}\sum_{j \in \mathcal{I} \setminus S_u} \mathbf{1}\left(\text{rank}_{u, g_u} < \text{rank}_{u, j}\right)$$

where $m$ is the number of users, $g_u$ is user $u$'s held-out ground-truth item, $\mathcal{I}$ is the item set, and $S_u$ is the set of items user $u$ interacted with in training.
This evaluator is useful because it respects time order and measures whether the held-out future item is surfaced near the top. But for production recommendation systems, you usually need a wider evaluation stack than hit rate and AUC alone.
Stage-specific offline metrics
- Retrieval: use Recall@$K$ or candidate hit rate to verify that the candidate generator is not dropping relevant items before the ranker sees them.
- Ranking: use Precision@$K$, Recall@$K$, and NDCG@$K$ for top-of-list quality. If you have multiple relevant held-out items per user, MAP is also useful.
- Rating prediction: use RMSE or MAE only when explicit rating prediction is the real product task. These metrics are much less informative for feed ranking or item recommendation.

Let $\text{Rel}_u$ be the set of held-out relevant items for user $u$ and $\text{Top-}K(u)$ the list of $K$ items the system returns for that user.
For retrieval, a standard candidate-stage metric is:

$$\text{Recall@}K = \frac{|\text{Top-}K(u) \cap \text{Rel}_u|}{|\text{Rel}_u|}$$

For the final ranked list, the analogous metric is:

$$\text{Precision@}K = \frac{|\text{Top-}K(u) \cap \text{Rel}_u|}{K}$$

To reward correct ordering near the top, define

$$\text{DCG@}K = \sum_{k=1}^{K} \frac{rel_k}{\log_2(k + 1)}$$

and then normalize by the ideal ordering:

$$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$

where $rel_k$ is the relevance of the item at position $k$ and $\text{IDCG@}K$ is the DCG of the best possible ordering.
If you care about the position of the first relevant result, use mean reciprocal rank:

$$\text{MRR} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{1}{\text{rank}_u}$$

where $\text{rank}_u$ is the position of the first relevant item for user $u$.
If multiple relevant items can appear in the list, average precision is also useful:

$$\text{AP@}K = \frac{1}{|\text{Rel}_u|}\sum_{k=1}^{K} \text{Precision@}k \cdot \mathbf{1}\left(\text{item at position } k \in \text{Rel}_u\right)$$

where the indicator keeps only positions that hold a relevant item.
Among these, NDCG@$K$ is usually the most informative single ranking metric because it accounts for both relevance and position.
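The per-user ranking metrics discussed in this subsection are short enough to implement directly. The sketch below uses binary relevance; the function names are illustrative.

```python
import numpy as np

def recall_at_k(ranked_items, relevant, k):
    """Fraction of the relevant set recovered in the top-k list."""
    hits = len(set(ranked_items[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(ranked_items, relevant):
    """Reciprocal rank of the first relevant item (0 if none appear)."""
    for i, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG@k for one user's ranked list."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

Averaging each of these over users gives the system-level numbers reported offline; the averaging and any graded-relevance weighting are protocol choices you should state explicitly.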
Protocol choices matter as much as the metric
- Use chronological splits for implicit and sequence-aware tasks. Random splits can leak future information.
- State clearly whether evaluation is full-catalog, sampled-negative, or candidate-set based. Numbers are not comparable across these protocols.
- Evaluate on exposed or eligible items when possible. Treating every unclicked item in the full catalog as a negative can distort results.
- Report by segment: new users, power users, new items, head items, and long-tail items often behave very differently.
- When the system has multiple stages, evaluate each stage separately and end-to-end.
Beyond ranking accuracy
Accuracy metrics alone can produce a recommender that is brittle or bad for the product.
- Coverage: how much of the catalog is ever recommended
- Diversity: how different the recommended items are from one another
- Novelty and serendipity: whether the system only repeats obvious items
- Calibration: whether recommendations match the user’s current intent, not just their historical average
- Fairness and marketplace health: whether some suppliers, creators, or item groups are systematically suppressed
These matter because a system with slightly lower NDCG@$K$ but better coverage, diversity, and calibration is often the better product.
Online evaluation
Offline metrics are necessary but insufficient. You still need A/B tests with:
- Primary metrics: CTR, conversion, watch time, retention, revenue, or long-term value depending on the product
- Guardrails: latency, page-load impact, complaint rate, hide/block rate, unsafe-content rate
- Diagnostic cuts: new vs returning users, cold-start items, geography, device, heavy-user cohorts
For ranking changes, it is also useful to monitor the full funnel:
- Candidate recall
- Ranker win rate on exposed impressions
- Final-surface engagement
- Downstream business outcomes
7.4 Re-ranking, freshness, diversity, and exploration
Pure exploitation can collapse catalog diversity. You need controlled exploration:
- Epsilon-greedy or Thompson-style policies
- Re-ranking for diversity/novelty
- Periodic calibration checks
Google’s reranking material is especially useful here. In practice, re-ranking is where you inject constraints that the base ranker usually misses:
- Freshness so the feed does not go stale
- Diversity so near-duplicate items do not dominate
- Fairness or marketplace balance so one creator, seller, or provider is not systematically overexposed
- Local policy constraints such as demotions, blocks, maturity filters, or legal limits
This stage is often simpler than the main ranker, but it has outsized product impact because it controls the final list actually seen by the user.
7.5 Reliability and monitoring
Data scientists should treat recommenders as continuously monitored systems:
- Feature drift and embedding drift
- Candidate recall degradation
- Online metric drift and alerting
- Safe fallback policies
7.6 Recommended systems reading
If you want a broader production-systems reference beyond recommender-specific papers, Chip Huyen’s Designing Machine Learning Systems is a strong complement to this chapter.
It is not a recommender-systems book specifically, but it is useful for recommender practitioners because it covers the operational side of production ML: data pipelines, iterative development, deployment tradeoffs, monitoring, feedback loops, and the gap between offline model quality and deployed system behavior. That makes it a good companion once you move from method selection into long-term system ownership.
8. Practical Build Sequence for Data Scientists
Use this as a practical order of operations rather than a rigid recipe. The point is to add complexity only when the simpler stage has already been validated.
8.1 Define the objective hierarchy
Start by making the optimization target explicit.
- Clarify whether the system is optimizing short-term CTR, conversion, watch time, retention, revenue, or long-term value
- Decide which goals are primary and which are guardrails
- Make sure the metric definition matches the actual user and business objective
This step matters because the wrong target can make even a technically strong ranker harmful in production.
8.2 Build strong non-ML baselines
Before training complex models, establish hard-to-beat baselines:
- popularity
- recency
- co-visitation
- simple item-to-item similarity
These baselines are useful for debugging, launch safety, and calibration. If a more complex model cannot beat them offline and online, the model is probably not production-ready.
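As a concrete illustration of how little machinery these baselines need, a co-visitation recommender can be built from session logs with nothing beyond co-occurrence counts. The session format and item names below are toy assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_covisitation(sessions):
    """Count how often each pair of items appears in the same session."""
    covis = defaultdict(Counter)
    for session in sessions:
        for a, b in combinations(set(session), 2):
            covis[a][b] += 1
            covis[b][a] += 1
    return covis

def recommend(covis, item, k=3):
    """Most frequently co-visited items for a query item."""
    return [other for other, _ in covis[item].most_common(k)]

sessions = [["milk", "bread", "eggs"], ["milk", "bread"], ["bread", "jam"]]
covis = build_covisitation(sessions)
recs = recommend(covis, "bread", k=2)  # "milk" ranks first (co-visited twice)
```

At production scale the same idea survives almost unchanged; the counting just moves into a batch job with time-decay and minimum-support thresholds.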
8.3 Add collaborative filtering
Once you have meaningful interaction data, collaborative filtering is usually the first serious model family to try.
- start with matrix factorization or neighborhood approaches
- use this stage to learn whether interaction data alone is already enough to support useful personalization
- evaluate retrieval quality separately from final ranking quality
This is often the point where recommendation becomes genuinely personalized rather than mostly heuristic.
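The matrix-factorization starting point can be very small. Below is a minimal Funk-style sketch, assuming explicit ratings and plain SGD with L2 regularization; the latent dimension, learning rate, regularization strength, and epoch count are illustrative, and a real system would use a library implementation instead.

```python
import random

def train_mf(ratings, n_users, n_items, k=8, lr=0.02, reg=0.05, epochs=50, seed=0):
    """Factorize ratings ~ P @ Q.T with per-observation SGD.

    ratings: list of (user_index, item_index, rating) triples
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 shrinkage on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 5.0)]
P, Q = train_mf(ratings, n_users=3, n_items=2, k=4, epochs=200)
```

Even this toy version makes the key evaluation point visible: you can score any (user, item) pair with a dot product, which is what later makes fast retrieval over the whole catalog possible.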
8.4 Add metadata for hybrid robustness
After collaborative filtering is working, bring in user, item, and context features.
- item metadata for cold-start items
- user/context features for sparse users or context-sensitive surfaces
- hybrid factorization or feature-rich ranking models
This stage improves robustness, especially when the system has to handle new items, changing inventory, or sparse users.
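One way to picture the hybrid step, in the spirit of LightFM-style models: represent each item as the sum of its metadata-feature embeddings, so a brand-new item gets a usable vector from its features alone. The feature names and embedding values below are toy assumptions, not learned parameters.

```python
def item_vector(features, feature_embeddings):
    """Sum the embeddings of an item's metadata features."""
    dim = len(next(iter(feature_embeddings.values())))
    vec = [0.0] * dim
    for f in features:
        for d in range(dim):
            vec[d] += feature_embeddings[f][d]
    return vec

def score(user_vec, item_vec):
    """Dot-product affinity between a user vector and an item vector."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

# Toy "learned" feature embeddings; in practice these come from training.
feature_embeddings = {
    "genre:scifi":   [0.8, 0.1],
    "genre:romance": [0.1, 0.9],
    "decade:2020s":  [0.2, 0.2],
}
user = [1.0, 0.0]  # this user leans sci-fi
cold_item = item_vector(["genre:scifi", "decade:2020s"], feature_embeddings)
other_item = item_vector(["genre:romance", "decade:2020s"], feature_embeddings)
```

The cold-start benefit is the whole point: `cold_item` has never been interacted with, yet it still scores higher for this user than the off-taste item because its metadata places it in the right region of the embedding space.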
8.5 Introduce two-stage retrieval and ranking
As the catalog grows, a single heavy model over the full candidate space becomes impractical.
- add a fast, high-recall candidate generator
- follow with a richer ranker using better features and objectives
- add re-ranking if you need freshness, diversity, fairness, or policy constraints
This is usually the architectural step that turns a workable recommender into a scalable one.
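The two-stage shape can be sketched end to end: a cheap dot-product retrieval over the full catalog, followed by a heavier ranker that only ever sees the shortlist. All names are illustrative, and the "rich" scorer is mocked with a lookup where a real system would run a feature-heavy model.

```python
import heapq

def retrieve(user_vec, item_vecs, n=100):
    """Stage 1: fast, high-recall candidate generation by dot product."""
    def dot(v):
        return sum(a * b for a, b in zip(user_vec, v))
    return heapq.nlargest(n, item_vecs, key=lambda i: dot(item_vecs[i]))

def rank(user_ctx, candidates, rich_score):
    """Stage 2: heavier scoring with richer features, on the shortlist only."""
    return sorted(candidates, key=lambda i: rich_score(user_ctx, i), reverse=True)

item_vecs = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.7, 0.3], "d": [-1.0, 0.0]}
candidates = retrieve([1.0, 0.0], item_vecs, n=3)          # drops "d" cheaply
final = rank({"hour": 20}, candidates,
             lambda ctx, i: {"a": 0.2, "b": 0.9, "c": 0.5}[i])
```

The structural point is that the expensive scorer never touches "d": retrieval bounds the ranker's workload regardless of catalog size, which is what an ANN library like ScaNN provides at scale in place of the brute-force loop here.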
8.6 Establish experiment and monitoring standards
Once the system is live, treat it as a continuously evaluated product system.
- define offline and online success criteria
- use A/B testing with guardrails
- monitor candidate recall, latency, drift, and business impact
- keep safe fallback policies available
At this point, the core challenge is no longer just model training. It is maintaining quality under changing data, product goals, and operational constraints.
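Of the monitoring signals above, candidate recall is the easiest to make concrete. A minimal recall@k check against a held-out set of known-relevant items (the function name and toy data are ours):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of truly relevant items that appear in the top-k candidates."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Only "b" of the three relevant items made the top-3 candidate list.
r = recall_at_k(["a", "b", "c", "d"], ["b", "d", "e"], k=3)
```

Tracked daily per cohort (new items, new users, long-tail items), a drop in this number usually surfaces retrieval regressions well before the business metrics move.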
9. Summary
The current handbook path is meant to give data scientists a compact but practical map of the field:
- Explicit vs implicit feedback
- Recommendation tasks such as rating prediction, top-n ranking, sequence-aware recommendation, CTR prediction, and cold-start
- Benchmark data practices such as MovieLens sparsity analysis and chronological evaluation splits
- Content-based, collaborative, contextual, and hybrid filtering
- Embedding-space candidate generation, similarity design, and matrix factorization variants
- AutoRec and ranking objectives such as BPR and hinge loss
- Feature-rich recommendation with factorization machines and DeepFM
- Deep recommenders such as two-tower retrieval, interaction-enhanced dual encoders, NCF, VAE-style models, wide-and-deep models, and DLRM-style architectures
- Three-stage production design with retrieval, scoring, and re-ranking for freshness, diversity, and fairness
For practicing data scientists, the differentiator is not only model choice. It is operational quality: robust labeling, unbiased evaluation, scalable serving, and disciplined online experimentation.
The next two chapters extend the handbook in two directions:
- References groups the main materials behind the handbook.
- Survey Papers and Further Reading points to broader literature when you want to go beyond a chapter-level guide.
10. References
Use this chapter as the source map for the handbook. The list is grouped by how the material is most useful in practice.
10.1 Guides and courses
- Recommender Systems — A Complete Guide to Machine Learning Models
- 21. Recommender Systems
- Google for Developers: Recommendation Systems course
- NVIDIA Glossary: Recommendation System
- Wikipedia: Recommender system
10.2 Core modeling papers
- Mnih and Salakhutdinov (2007): Probabilistic Matrix Factorization
- Hu, Koren, Volinsky (2008): Collaborative Filtering for Implicit Feedback Datasets
- Koren, Bell, Volinsky (2009): Matrix Factorization Techniques for Recommender Systems
- Koren (2008): Factorization Meets the Neighborhood (SVD++)
- Kula (2015): Metadata Embeddings for User and Item Cold-start Recommendations (LightFM)
- Shaped: The Two-Tower Model for Recommendation Systems: A Deep Dive
- Sumit’s Diary: Two Tower Model Architecture: Current State and Promising Extensions
10.3 Tools and implementations
- Surprise Python package
- Surprise documentation
- Simon Funk (2006): Netflix Update: Try This at Home
- ScaNN
10.4 Industrial ads CTR papers
- DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction (2022)
- InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction (2025, Meta Ads)
11. Survey Papers and Further Reading
This chapter is not meant to be exhaustive. The goal is to help you widen your map of the field without losing the thread of the handbook.
Before jumping into the reading list, it helps to be explicit about how practitioners can use survey papers efficiently.
11.1 How to use survey papers efficiently
For practitioners, the best way to use a survey paper is usually not to read every cited method in order.
- Read the taxonomy first and decide which branch actually matches your product surface.
- Focus on the evaluation section to see which metrics and data assumptions are standard in that subfield.
- Pull out the open-problems section to understand what still breaks in real systems.
- Use the references to identify a small number of landmark papers rather than trying to read the full citation graph.
If you are building production systems rather than writing a paper, the main value of surveys is not completeness. It is faster problem framing and better judgment about which methods are mature enough to operationalize.
With that reading strategy in mind, here is a compact starting set.
11.2 Start here
If you only want a small reading set after finishing this handbook, start with these:
- A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation (2021)
- A Survey on Session-based Recommender Systems (2022)
- Bias and Debias in Recommender System: A Survey and Future Directions (2020)
- Multimodal Recommender Systems: A Survey (2023; updated 2024)
- How Can Recommender Systems Benefit from Large Language Models: A Survey (2025)
- A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives (2025)
This subset gives you coverage of classical-to-neural evolution, sequence recommendation, bias, multimodality, LLM-era recommender work, and industrial deployment constraints.
11.3 General and industry-facing surveys
A Survey of Real-World Recommender Systems: Challenges, Constraints, and Industrial Perspectives (2025). This survey is unusually valuable for practitioners because it centers deployment constraints rather than only benchmark performance. It discusses the industrial tradeoffs around latency, retraining cadence, marketplace constraints, product surfaces, organizational limitations, and evaluation gaps between offline results and business outcomes. It is relevant to this handbook because it is the closest survey match to the practical orientation of Production Concerns and Build Sequence.
- Contemporary Recommendation Systems on Big Data and Their Applications: A Survey (first posted 2022; updated 2024). Broad and application-heavy survey that is more useful for field scanning than for deep method selection.
- A Comprehensive Review of Recommender Systems: Transitioning from Theory to Practice (2024). Useful when you want a paper explicitly framed around the gap between elegant methods and deployable systems.
- Recent Developments in Recommender Systems: A Survey (2023). Broad recent-literature snapshot that is useful when you want a compact update on post-classic recommender directions.
11.4 Neural and deep-learning recommender surveys
A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation (2021). This survey is one of the most useful bridges between classic collaborative filtering and the modern neural recommendation literature. It does not just list architectures; it organizes the field by how recommendation moves from pure user-item interaction modeling toward richer inputs such as context, content, knowledge, and multi-behavior signals. It is relevant to this handbook because it helps place the Model Families, Feature-Rich Recommendation, and Deep Models chapters into one coherent research map instead of treating them as separate toolkits.
- A Comprehensive Survey of Recommender Systems Based on Deep Learning (2023). Good catalog of deep-learning recommendation settings, especially if you want a wider taxonomy after reading the core deep chapters.
- Graph Learning based Recommender Systems: A Review (2021). Useful when graph-based message passing, neighborhood propagation, and graph-enhanced collaborative filtering are central to your method selection.
11.5 Sequential, session-based, and decision-oriented surveys
A Survey on Session-based Recommender Systems (2022). This survey focuses on settings where long-run user histories are weak, absent, or less useful than short-horizon intent. It lays out the data structure, problem formulation, and method families for session-based recommendation, including the shift from simple Markov-style methods to GRU-, attention-, and transformer-based models. It is relevant to this handbook because the Explicit vs. Implicit Feedback, Model Families, and Deep Models chapters only introduce sequence-aware recommendation at a high level; this paper is the right follow-on when your product is driven by recency, intent shifts, and within-session behavior.
- Reinforcement Learning based Recommender Systems: A Survey (2021). Best for long-horizon optimization, exploration, delayed rewards, and policy-learning formulations of recommendation.
11.6 Responsible and robust recommendation
Bias and Debias in Recommender System: A Survey and Future Directions (2020). This survey is useful because it reframes recommendation quality as partly a data-generation problem rather than only a model-design problem. It catalogs major bias sources such as exposure bias, selection bias, position bias, and popularity bias, then reviews algorithmic and evaluation-side responses. It is relevant to this handbook because many of the production issues discussed in Production Concerns and the evaluation caveats in the ranking chapters become much easier to reason about once you view recommendation pipelines through a bias-and-debias lens.
- Fairness and Diversity in Recommender Systems: A Survey (published online in 2024; ACM TIST issue in 2025). Strong choice if your concern is multi-objective ranking quality rather than only relevance or engagement lift.
11.7 Emerging directions
Multimodal Recommender Systems: A Survey (2023; updated 2024). This survey covers recommendation systems that use more than IDs and tabular metadata, such as text, images, audio, and video. It is especially useful for understanding how representation learning changes when item understanding itself becomes a multimodal problem instead of a pure interaction problem. It is relevant to this handbook because it extends the Feature-Rich Recommendation and Deep Models chapters into the part of the field where content understanding and recommendation become tightly coupled.
How Can Recommender Systems Benefit from Large Language Models: A Survey (2025). This survey is a strong next read if you want to understand where LLMs fit into recommendation without collapsing everything into hype. It organizes the space around representation, reasoning, generation, user understanding, and agentic or interactive recommendation settings, while also discussing limits such as latency, hallucination, and evaluation mismatch. It is relevant to this handbook because it helps extend the Deep Models and Production Concerns chapters into the current wave of LLM-assisted retrieval, ranking, explanation, and recommendation workflows.
11.8 Foundational pre-2020 surveys worth keeping
These are older than the 2020+ focus of this section, but they are still worth keeping because they remain useful orientation documents.
- Deep Learning based Recommender System: A Survey and New Perspectives (2017; later journal publication in 2019). Still one of the clearest early maps of neural recommender systems.
- Deep Learning Based Recommender System: A Survey and New Perspectives (2019). The ACM Computing Surveys publication version of the same work, useful if you prefer the journal reference.
- Recommender system application developments: a survey (2015). Older and broader than the rest, but still useful if you want historical perspective on application areas and early design patterns.