5. Feature-Rich Recommendation
As D2L section 21.8 emphasizes, interaction data is often sparse and noisy. In many production settings, recommendation is better framed as impression-level prediction with rich side features.
5.1 Feature-rich recommendation and CTR
Feature-rich recommendation is common in ads, feeds, and product surfaces.
- Labels are often binary, such as click vs no click
- Inputs include many categorical fields rather than only user and item IDs
- The D2L advertising example uses 34 fields, with the first column as the click label and the remaining columns as categorical features
This setting is different from classic matrix factorization because the goal is often click-through rate prediction over impression-level examples rather than rating reconstruction.
CTR is defined as:

$$\mathrm{CTR} = \frac{\#\,\text{Clicks}}{\#\,\text{Impressions}} \times 100\%$$
5.2 Factorization machines
Factorization machines are one of the most important bridges between collaborative filtering and feature-rich prediction.
For a feature vector $\mathbf{x} \in \mathbb{R}^d$, the degree-2 FM model is:

$$\hat{y}(\mathbf{x}) = \mathbf{w}_0 + \sum_{i=1}^d \mathbf{w}_i x_i + \sum_{i=1}^d \sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$

where $\mathbf{w}_0$ is a global bias, $\mathbf{w} \in \mathbb{R}^d$ holds the linear weights, and $\mathbf{v}_i \in \mathbb{R}^k$ is the latent embedding of feature $i$.
Interpretation:
- The first two terms are linear
- The last term models pairwise feature interactions
- If one feature encodes user identity and another encodes item identity, the interaction term reduces to a collaborative-filtering-style embedding interaction
D2L also highlights the computational trick that reduces the FM interaction cost from $\mathcal{O}(kd^2)$ to $\mathcal{O}(kd)$ by rewriting the pairwise term:

$$\sum_{i=1}^d \sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{l=1}^k \left[ \left( \sum_{i=1}^d v_{i,l} x_i \right)^2 - \sum_{i=1}^d v_{i,l}^2 x_i^2 \right]$$
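The identity above can be checked numerically. This is a minimal sketch with made-up dimensions and random parameters; it compares the naive double loop against the linearized form:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 4                      # number of features, embedding size (toy values)
x = rng.normal(size=d)           # dense feature vector
V = rng.normal(size=(d, k))      # latent factor matrix, row i = v_i

# Naive pairwise interaction term: O(k * d^2)
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(d) for j in range(i + 1, d))

# Linearized form: O(k * d)
s = V.T @ x                      # for each l: sum_i v_{i,l} x_i
fast = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

assert np.isclose(naive, fast)
```

Because sparse feature vectors have few nonzero entries, the linearized form also lets FM skip zero features entirely, which is what makes it practical at CTR scale.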
5.3 DeepFM
DeepFM extends FM by combining low-order feature interactions from FM with high-order nonlinear interactions from a deep network.
- The FM branch captures low-order interactions
- The deep branch uses shared embeddings and an MLP to learn higher-order interactions
- Both outputs are combined into a final prediction
D2L presents the DeepFM prediction as:

$$\hat{y} = \sigma\!\left(\hat{y}^{(FM)} + \hat{y}^{(DNN)}\right)$$

where $\sigma$ is the sigmoid function and the two terms are the outputs of the FM branch and the deep branch.
DeepFM is especially useful when simple pairwise interactions are not expressive enough, but you still want the inductive bias of factorization-based feature interaction.
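To make the two-branch structure concrete, here is a minimal numpy sketch of a DeepFM forward pass. The field sizes, embedding dimension, and MLP shape are illustrative assumptions, and all parameters are random (untrained); the point is only how the shared embeddings feed both branches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 categorical fields, one active value per field (assumed).
field_sizes = [10, 20, 5]                  # vocabulary size per field
k = 4                                      # embedding dimension
offsets = np.cumsum([0] + field_sizes[:-1])
n = sum(field_sizes)

V = rng.normal(scale=0.1, size=(n, k))     # shared embedding table
w = rng.normal(scale=0.1, size=n)          # linear weights
w0 = 0.0                                   # global bias

# One hidden layer for the deep branch (untrained, for illustration).
W1 = rng.normal(scale=0.1, size=(len(field_sizes) * k, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=8)
b2 = 0.0

def deepfm_predict(field_values):
    idx = offsets + np.asarray(field_values)  # global embedding indices
    emb = V[idx]                              # (num_fields, k), shared by both branches

    # FM branch: bias + linear + pairwise interactions (linearized form).
    s = emb.sum(axis=0)
    fm = w0 + w[idx].sum() + 0.5 * np.sum(s ** 2 - (emb ** 2).sum(axis=0))

    # Deep branch: concatenated embeddings through a small MLP.
    h = np.maximum(emb.reshape(-1) @ W1 + b1, 0.0)  # ReLU
    dnn = h @ W2 + b2

    return 1.0 / (1.0 + np.exp(-(fm + dnn)))        # sigmoid

p = deepfm_predict([3, 17, 2])
```

Note that the deep branch reads the same embedding rows as the FM branch, so no extra feature engineering is needed to get higher-order interactions.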
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
5.4 Hybrid factorization with features (LightFM-style)
In a LightFM-style hybrid model, users and items are represented through their feature sets rather than only through raw IDs:
- User embedding = sum of user-feature embeddings
- Item embedding = sum of item-feature embeddings
- Score uses dot product (+ optional biases)
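The scoring rule above is small enough to sketch directly. Shapes, feature IDs, and the random initialization here are illustrative assumptions, not LightFM's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8                                      # embedding dimension (assumed)
n_user_feats, n_item_feats = 50, 100       # feature vocabulary sizes (assumed)

U = rng.normal(scale=0.1, size=(n_user_feats, k))  # user-feature embeddings
I = rng.normal(scale=0.1, size=(n_item_feats, k))  # item-feature embeddings
bu = np.zeros(n_user_feats)                        # optional per-feature biases
bi = np.zeros(n_item_feats)

def score(user_feats, item_feats):
    """Dot product of summed feature embeddings, plus optional biases."""
    u = U[user_feats].sum(axis=0)   # user embedding = sum of its feature embeddings
    v = I[item_feats].sum(axis=0)   # item embedding = sum of its feature embeddings
    return u @ v + bu[user_feats].sum() + bi[item_feats].sum()

# A brand-new user described only by metadata features still gets a score,
# which is where the cold-start advantage comes from.
s = score(user_feats=[2, 7], item_feats=[0, 42, 99])
```

If each user and item also gets a dedicated identity feature, the model smoothly degrades to plain matrix factorization when metadata is absent.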
Why data scientists use this:
- Stronger cold-start behavior
- Smooth path between collaborative and content-based modeling
- Practical when metadata quality is reasonable
The Google course makes the same idea concrete from a matrix-factorization angle: you can augment the original interaction matrix with user-feature and item-feature blocks, then factorize the augmented matrix so that side features learn embeddings alongside users and items. Conceptually, this is one of the cleanest bridges between classic WALS-style recommender systems and modern hybrid feature-based models.
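The block structure of that augmented matrix can be written down explicitly. The layout below (feature indicator blocks bordering the interaction matrix, zeros in the corner) is one plausible arrangement under assumed toy shapes, not the course's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 4 users, 5 items, 2 user features, 3 item features (assumed).
R  = rng.integers(0, 2, size=(4, 5)).astype(float)  # user-item interactions
FU = rng.integers(0, 2, size=(4, 2)).astype(float)  # user-feature indicators
FI = rng.integers(0, 2, size=(3, 5)).astype(float)  # item-feature indicators

# Augmented matrix: feature rows/columns sit alongside users and items,
# so factorizing A assigns every side feature its own embedding.
A = np.block([
    [R,  FU],
    [FI, np.zeros((3, 2))],
])
```

Running any standard factorization (e.g. WALS) on `A` then yields embeddings for users, items, and features in one shared space.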
5.5 Industrial ads CTR architectures
If you want a more industrial view of feature-rich ranking, two recent papers are worth reading after the material above.
- DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction (2022) is an ads CTR paper about feature interaction modeling at online advertising scale. Its main point is that different interaction modules capture different useful signals, so a hierarchical ensemble can outperform committing to a single interaction design.
- InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction (2025) is a Meta Ads CTR paper. It focuses on learning stronger interactions between heterogeneous signals such as profile features, context features, and behavior sequences.
These are useful extensions of the chapter because they show what happens when feature-rich recommendation is pushed into industrial ads ranking. They should not be read as broad recommender-system blueprints or as the state of the art for recommender systems in general. They are narrower and more specific: both are ads CTR architecture papers shaped by impression-level prediction, extreme scale, and production serving constraints.