6. Optimization and Linear Models

Optimization is the engine underneath most machine learning. Even if you never derive gradients by hand, you should still understand what is being minimized and why.

Learning goals

  • understand objective functions and gradient descent at a conceptual level
  • connect optimization to linear and logistic regression
  • see why regularization matters

Objective functions

Training a model usually means choosing parameters that minimize an error or loss function.

In compact form, many supervised-learning problems look like:

$$ \min_{\theta} L(\theta) \;=\; \min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \ell\left(y_i, f_{\theta}(x_i)\right) $$

Conceptually, the model asks:

  • How wrong am I right now?
  • In which direction should I adjust my parameters to become less wrong?

This turns model fitting into an optimization problem.
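The objective above can be made concrete with a minimal sketch: mean squared error for a toy one-parameter model $f_\theta(x) = \theta x$. The function and variable names here are illustrative, not from any particular library.

```python
def loss(theta, xs, ys):
    """Average squared error of the prediction theta * x over the data."""
    n = len(xs)
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by the "true" parameter theta = 2

print(loss(2.0, xs, ys))  # the true parameter gives zero loss
print(loss(1.0, xs, ys))  # a worse parameter scores higher
```

Answering "how wrong am I right now?" is just evaluating this function at the current parameters; the rest of the chapter is about how to move them.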

Gradient descent intuition

Gradient descent is an iterative way to reduce the loss.

The rough picture is:

  1. start with an initial parameter setting
  2. measure the slope of the loss surface
  3. move in the direction that lowers the loss
  4. repeat until improvement becomes small or a stopping rule is reached

That is the core idea behind a wide range of models, not just neural networks.

The canonical update is:

$$ \theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} L\left(\theta^{(t)}\right) $$

where η is the learning rate.
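The update rule can be sketched on a one-parameter loss whose gradient is known in closed form. Here $L(\theta) = (\theta - 3)^2$, so the gradient is $2(\theta - 3)$; the step size and iteration count are arbitrary illustrative choices.

```python
def gradient(theta):
    # Gradient of the toy loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0  # initial parameter setting
eta = 0.1    # learning rate

for _ in range(100):
    # theta_{t+1} = theta_t - eta * grad L(theta_t)
    theta = theta - eta * gradient(theta)

print(theta)  # converges toward the minimizer, theta = 3
```

With a learning rate this small relative to the curvature, each step shrinks the error by a constant factor; too large a rate would overshoot and diverge, which is why η is a tuning knob rather than a free lunch.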

Linear regression

Linear regression predicts a numeric outcome as a weighted combination of input features.

It is most useful when:

  • interpretability matters
  • the signal is reasonably smooth
  • you want a strong baseline for numerical prediction

Even when the world is not perfectly linear, linear regression is often worth fitting because it clarifies direction, magnitude, and baseline difficulty.

The basic model and squared-error loss are:

$$ \hat{y} = \beta_0 + x^{\top}\beta, \qquad L(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 $$
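Putting the two previous ideas together, the squared-error objective can be minimized by gradient descent on the intercept and slope. This is a minimal single-feature sketch; the data, step size, and iteration count are illustrative.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 1 + 2x
n = len(xs)

b0, b1 = 0.0, 0.0  # intercept and slope
eta = 0.05

for _ in range(5000):
    # Gradients of (1/n) * sum (y_i - b0 - b1 * x_i)^2
    g0 = sum(-2.0 * (y - b0 - b1 * x) for x, y in zip(xs, ys)) / n
    g1 = sum(-2.0 * (y - b0 - b1 * x) * x for x, y in zip(xs, ys)) / n
    b0, b1 = b0 - eta * g0, b1 - eta * g1

print(round(b0, 3), round(b1, 3))  # recovers roughly 1 and 2
```

In practice this problem has a closed-form solution and libraries solve it directly, but the gradient-descent view is the one that generalizes to models without closed forms.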

From linear to logistic regression

Classification needs outputs that behave like probabilities or decisions, not unrestricted numbers. Logistic regression adapts the linear idea by passing the linear combination through the sigmoid (logistic) function and optimizing a classification-appropriate loss.

This makes logistic regression a foundational model for binary classification:

  • simple
  • interpretable
  • often surprisingly competitive

The usual probability mapping is:

$$ P(y = 1 \mid x) = \sigma\left(\beta_0 + x^{\top}\beta\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} $$
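The probability mapping above is short enough to sketch directly. The coefficients here are made up for illustration, not fitted to data.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, b0, b):
    """P(y = 1 | x) for a single feature value x."""
    return sigmoid(b0 + b * x)

print(sigmoid(0.0))                   # 0.5: the decision boundary
print(predict_proba(2.0, -1.0, 1.5))  # a probability strictly between 0 and 1
```

Note that the decision boundary sits where the linear part equals zero, which is why logistic regression still draws a linear boundary in feature space even though its output is curved.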

Regularization

Regularization adds a penalty for overly large or overly flexible parameter values.

Its main job is to reduce overfitting and improve generalization.

A useful intuition:

  • without regularization, a model may memorize quirks
  • with too much regularization, a model may become too rigid

Good regularization is not about making a model smaller for its own sake. It is about trading a little training fit for better out-of-sample behavior.

Two classic regularized objectives are:

$$ L_{\text{ridge}}(\beta) = L(\beta) + \lambda \lVert \beta \rVert_2^2, \qquad L_{\text{lasso}}(\beta) = L(\beta) + \lambda \lVert \beta \rVert_1 $$
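The two penalties can be compared side by side with a minimal sketch. Here `base_loss` stands in for $L(\beta)$, and the coefficient values and λ are illustrative numbers, not fitted quantities.

```python
def ridge_penalty(beta, lam):
    # lambda * ||beta||_2^2: squared magnitudes, punishes large coefficients hard
    return lam * sum(b ** 2 for b in beta)

def lasso_penalty(beta, lam):
    # lambda * ||beta||_1: absolute magnitudes, can drive coefficients to zero
    return lam * sum(abs(b) for b in beta)

beta = [3.0, -0.5, 0.0]
base_loss = 1.2
lam = 0.1

print(base_loss + ridge_penalty(beta, lam))  # the ridge objective
print(base_loss + lasso_penalty(beta, lam))  # the lasso objective
```

The design difference shows up in the math: the squared penalty grows steeply for large coefficients (shrinkage), while the absolute penalty charges every nonzero coefficient at the same marginal rate, which is what pushes lasso solutions toward exact zeros (sparsity).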

Why these models still matter

Linear and logistic regression remain valuable because they:

  • train quickly
  • establish strong baselines
  • offer interpretability
  • force clean thinking about features and assumptions

They also provide the conceptual bridge to more advanced optimization-based models.

| Model | Core output | Strength | Typical limitation |
| --- | --- | --- | --- |
| Linear regression | numeric prediction | fast, interpretable baseline | misses nonlinear structure |
| Logistic regression | class probability | strong, calibrated baseline for binary tasks | decision boundary is linear in feature space |
| Ridge regression | shrunk linear coefficients | stable when features are correlated | does not do feature selection directly |
| Lasso regression | sparse linear coefficients | can simplify wide feature sets | unstable when strong predictors are highly correlated |

Chapter takeaway

Optimization is easier to understand when you connect it to familiar models first. Linear and logistic regression are ideal for building that intuition.

Practice

Write down one problem where you would begin with logistic regression before trying a tree or neural network. Explain why.

Then continue to Boosting, Neural Networks, and AutoML.
