5. Feature-Rich Recommendation
As D2L section 21.8 emphasizes, interaction data is often sparse and noisy. In many production settings, recommendation is better framed as impression-level prediction with rich side features.
5.1 Feature-rich recommendation and CTR
Feature-rich recommendation is common in ads, feeds, and product surfaces.
- Labels are often binary, such as click vs no click
- Inputs include many categorical fields rather than only user and item IDs
- The D2L advertising example uses 34 fields, with the first column as the click label and the remaining columns as categorical features
This setting is different from classic matrix factorization because the goal is often click-through rate prediction over impression-level examples rather than rating reconstruction.
CTR is defined as:

$$\mathrm{CTR} = \frac{\#\,\text{Clicks}}{\#\,\text{Impressions}} \times 100\%$$
5.2 Factorization machines
Factorization machines are one of the most important bridges between collaborative filtering and feature-rich prediction.
For a feature vector $\mathbf{x} \in \mathbb{R}^d$, the degree-2 FM model is:

$$\hat{y}(\mathbf{x}) = \mathbf{w}_0 + \sum_{i=1}^d \mathbf{w}_i x_i + \sum_{i=1}^d \sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$

where $\mathbf{w}_0$ is a global bias, $\mathbf{w} \in \mathbb{R}^d$ holds the linear weights, and $\mathbf{v}_i \in \mathbb{R}^k$ is the latent embedding of feature $i$.
Interpretation:
- The first two terms are linear
- The last term models pairwise feature interactions
- If one feature encodes user identity and another encodes item identity, the interaction term reduces to a collaborative-filtering-style embedding interaction
D2L also highlights the computational trick that reduces the FM interaction cost from $\mathcal{O}(kd^2)$ to $\mathcal{O}(kd)$ by rewriting the pairwise term:

$$\sum_{i=1}^d \sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{l=1}^k \left[ \left( \sum_{i=1}^d v_{i,l} x_i \right)^2 - \sum_{i=1}^d v_{i,l}^2 x_i^2 \right]$$
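The identity above can be checked numerically. This is a minimal sketch with made-up dimensions and random parameters; it compares the naive double loop against the linearized form:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 4                      # number of features, embedding size (toy values)
x = rng.normal(size=d)           # dense feature vector
V = rng.normal(size=(d, k))      # latent factor matrix, row i = v_i

# Naive pairwise interaction term: O(k * d^2)
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(d) for j in range(i + 1, d))

# Linearized form: O(k * d)
s = V.T @ x                      # for each l: sum_i v_{i,l} x_i
fast = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))

assert np.isclose(naive, fast)
```

Because sparse feature vectors have few nonzero entries, the linearized form also lets FM skip zero features entirely, which is what makes it practical at CTR scale.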
5.3 DeepFM
DeepFM extends FM by combining low-order feature interactions from FM with high-order nonlinear interactions from a deep network.
- The FM branch captures low-order interactions
- The deep branch uses shared embeddings and an MLP to learn higher-order interactions
- Both outputs are combined into a final prediction
D2L presents the DeepFM prediction as:

$$\hat{y} = \sigma\!\left(\hat{y}^{(FM)} + \hat{y}^{(DNN)}\right)$$

where $\sigma$ is the sigmoid function and the two terms are the outputs of the FM branch and the deep branch.
DeepFM is especially useful when simple pairwise interactions are not expressive enough, but you still want the inductive bias of factorization-based feature interaction.
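To make the two-branch structure concrete, here is a minimal numpy sketch of a DeepFM forward pass. The field sizes, embedding dimension, and MLP shape are illustrative assumptions, and all parameters are random (untrained); the point is only how the shared embeddings feed both branches:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 categorical fields, one active value per field (assumed).
field_sizes = [10, 20, 5]                  # vocabulary size per field
k = 4                                      # embedding dimension
offsets = np.cumsum([0] + field_sizes[:-1])
n = sum(field_sizes)

V = rng.normal(scale=0.1, size=(n, k))     # shared embedding table
w = rng.normal(scale=0.1, size=n)          # linear weights
w0 = 0.0                                   # global bias

# One hidden layer for the deep branch (untrained, for illustration).
W1 = rng.normal(scale=0.1, size=(len(field_sizes) * k, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=8)
b2 = 0.0

def deepfm_predict(field_values):
    idx = offsets + np.asarray(field_values)  # global embedding indices
    emb = V[idx]                              # (num_fields, k), shared by both branches

    # FM branch: bias + linear + pairwise interactions (linearized form).
    s = emb.sum(axis=0)
    fm = w0 + w[idx].sum() + 0.5 * np.sum(s ** 2 - (emb ** 2).sum(axis=0))

    # Deep branch: concatenated embeddings through a small MLP.
    h = np.maximum(emb.reshape(-1) @ W1 + b1, 0.0)  # ReLU
    dnn = h @ W2 + b2

    return 1.0 / (1.0 + np.exp(-(fm + dnn)))        # sigmoid

p = deepfm_predict([3, 17, 2])
```

Note that the deep branch reads the same embedding rows as the FM branch, so no extra feature engineering is needed to get higher-order interactions.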
Image credit: Dive into Deep Learning, CC BY-SA 4.0.
5.4 Hybrid factorization with features (LightFM-style)
In a LightFM-style hybrid model, users and items are represented through their feature sets rather than only through raw IDs:
- User embedding = sum of user-feature embeddings
- Item embedding = sum of item-feature embeddings
- Score uses dot product (+ optional biases)
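The scoring rule above is small enough to sketch directly. Shapes, feature IDs, and the random initialization here are illustrative assumptions, not LightFM's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8                                      # embedding dimension (assumed)
n_user_feats, n_item_feats = 50, 100       # feature vocabulary sizes (assumed)

U = rng.normal(scale=0.1, size=(n_user_feats, k))  # user-feature embeddings
I = rng.normal(scale=0.1, size=(n_item_feats, k))  # item-feature embeddings
bu = np.zeros(n_user_feats)                        # optional per-feature biases
bi = np.zeros(n_item_feats)

def score(user_feats, item_feats):
    """Dot product of summed feature embeddings, plus optional biases."""
    u = U[user_feats].sum(axis=0)   # user embedding = sum of its feature embeddings
    v = I[item_feats].sum(axis=0)   # item embedding = sum of its feature embeddings
    return u @ v + bu[user_feats].sum() + bi[item_feats].sum()

# A brand-new user described only by metadata features still gets a score,
# which is where the cold-start advantage comes from.
s = score(user_feats=[2, 7], item_feats=[0, 42, 99])
```

If each user and item also gets a dedicated identity feature, the model smoothly degrades to plain matrix factorization when metadata is absent.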
Why data scientists use this:
- Stronger cold-start behavior
- Smooth path between collaborative and content-based modeling
- Practical when metadata quality is reasonable
The Google course makes the same idea concrete from a matrix-factorization angle: you can augment the original interaction matrix with user-feature and item-feature blocks, then factorize the augmented matrix so that side features learn embeddings alongside users and items. Conceptually, this is one of the cleanest bridges between classic WALS-style recommender systems and modern hybrid feature-based models.
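The block structure of that augmented matrix can be written down explicitly. The layout below (feature indicator blocks bordering the interaction matrix, zeros in the corner) is one plausible arrangement under assumed toy shapes, not the course's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 4 users, 5 items, 2 user features, 3 item features (assumed).
R  = rng.integers(0, 2, size=(4, 5)).astype(float)  # user-item interactions
FU = rng.integers(0, 2, size=(4, 2)).astype(float)  # user-feature indicators
FI = rng.integers(0, 2, size=(3, 5)).astype(float)  # item-feature indicators

# Augmented matrix: feature rows/columns sit alongside users and items,
# so factorizing A assigns every side feature its own embedding.
A = np.block([
    [R,  FU],
    [FI, np.zeros((3, 2))],
])
```

Running any standard factorization (e.g. WALS) on `A` then yields embeddings for users, items, and features in one shared space.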
5.5 Industrial ads CTR architectures
If you want a more industrial view of feature-rich ranking, two recent papers are worth reading after the material above.
- DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction (2022) is an ads CTR paper about feature interaction modeling at online advertising scale. Its main point is that different interaction modules capture different useful signals, so a hierarchical ensemble can outperform committing to a single interaction design.
- InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction (2025) is a Meta Ads CTR paper. It focuses on learning stronger interactions between heterogeneous signals such as profile features, context features, and behavior sequences.
These are useful extensions of the chapter because they show what happens when feature-rich recommendation is pushed into industrial ads ranking. They should not be read as broad recommender-system blueprints or as the state of the art for recommender systems in general. They are narrower and more specific: both are ads CTR architecture papers shaped by impression-level prediction, extreme scale, and production serving constraints.