6. Optimization and Linear Models
Optimization is the engine underneath most machine learning. Even if you never derive gradients by hand, you should still understand what is being minimized and why.
Learning goals
- understand objective functions and gradient descent at a conceptual level
- connect optimization to linear and logistic regression
- see why regularization matters
Objective functions
Training a model usually means choosing parameters that minimize an error or loss function.
In compact form, many supervised-learning problems look like:
$$ \min_{\theta} L(\theta) = \min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \ell\left(y_i, f_{\theta}(x_i)\right) $$
Conceptually, the model asks:
- How wrong am I right now?
- In which direction should I adjust my parameters to become less wrong?
This turns model fitting into an optimization problem.
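As a minimal sketch of evaluating such an objective (the data, the helper name, and the choice of squared error are illustrative), the average loss over a dataset can be computed directly:

```python
import numpy as np

def mean_squared_loss(theta, X, y):
    """Average loss L(theta) = (1/n) * sum over i of (y_i - theta . x_i)^2."""
    preds = X @ theta              # f_theta(x_i) = theta . x_i for each row of X
    return np.mean((y - preds) ** 2)

# Tiny illustrative data: three examples, two features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])       # this theta happens to fit the data exactly
print(mean_squared_loss(theta, X, y))  # 0.0 for a perfect fit
```

A different theta gives a larger value, and "training" means searching for the theta that makes this number as small as possible.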
Gradient descent intuition
Gradient descent is an iterative way to reduce the loss.
The rough picture is:
- start with an initial parameter setting
- measure the slope of the loss surface
- move in the direction that lowers the loss
- repeat until improvement becomes small or a stopping rule is reached
That is the core idea behind a wide range of models, not just neural networks.
The canonical update is:
$$ \theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t) $$
where $\theta_t$ is the parameter vector at step $t$, $\eta$ is the learning rate (step size), and $\nabla L(\theta_t)$ is the gradient of the loss at the current parameters.
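The loop described above can be sketched in a few lines; the quadratic loss, starting point, and learning rate here are all illustrative:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
grad = lambda t: 2.0 * (t - 3.0)
theta_star = gradient_descent(grad, theta0=[0.0])
print(theta_star)  # approaches 3.0, the minimizer
```

Each iteration shrinks the distance to the minimum by a constant factor here, which is why a fixed small step size is enough for this toy loss.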
Linear regression
Linear regression predicts a numeric outcome as a weighted combination of input features.
It is most useful when:
- interpretability matters
- the signal is reasonably smooth
- you want a strong baseline for numerical prediction
Even when the world is not perfectly linear, linear regression is often worth fitting because it clarifies direction, magnitude, and baseline difficulty.
The basic model and squared-error loss are:
$$ f_{\theta}(x) = \theta^{\top} x, \qquad L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \theta^{\top} x_i\right)^2 $$
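For squared error, the optimal weights can be found without iterating at all. A hedged sketch using NumPy's least-squares solver on made-up, noise-free data:

```python
import numpy as np

# Synthetic data generated from a known linear rule: y = 2*x + 1 (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([x, np.ones_like(x)])  # feature column plus intercept column
y = 2.0 * x + 1.0

# lstsq solves min_theta ||X theta - y||^2 directly (no gradient steps needed)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # close to [2.0, 1.0], the slope and intercept that generated y
```

This is the same objective gradient descent would minimize; the closed form is simply available for this particular loss.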
From linear to logistic regression
Classification needs outputs that behave like probabilities or decisions, not unrestricted numbers. Logistic regression adapts the linear idea by transforming the output through a sigmoid-like mapping and optimizing a classification-appropriate loss.
This makes logistic regression a foundational model for binary classification:
- simple
- interpretable
- often surprisingly competitive
The usual probability mapping is:
$$ P(y = 1 \mid x) = \sigma\!\left(\theta^{\top} x\right) = \frac{1}{1 + e^{-\theta^{\top} x}} $$
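The sigmoid mapping itself is a one-liner; the example scores below are illustrative stand-ins for linear outputs $\theta^{\top} x$:

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into (0, 1): sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-4.0, 0.0, 4.0])  # hypothetical linear scores for three inputs
probs = sigmoid(scores)
print(probs)  # near 0, exactly 0.5, near 1: monotone in the linear score
```

Because the mapping is monotone, the model still draws a linear decision boundary; the sigmoid only converts the score into a usable probability.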
Regularization
Regularization adds a penalty that discourages overly large parameter values, and with them overly flexible models.
Its main job is to reduce overfitting and improve generalization.
A useful intuition:
- without regularization, a model may memorize quirks
- with too much regularization, a model may become too rigid
Good regularization is not about making a model smaller for its own sake. It is about trading a little training fit for better out-of-sample behavior.
Two classic regularized objectives are:
$$ \text{Ridge:} \quad \min_{\theta} \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \theta^{\top} x_i\right)^2 + \lambda \lVert \theta \rVert_2^2 $$
$$ \text{Lasso:} \quad \min_{\theta} \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \theta^{\top} x_i\right)^2 + \lambda \lVert \theta \rVert_1 $$
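The two penalty terms are easy to compare directly; a minimal sketch in which the coefficient vector and the strength lambda are both made up:

```python
import numpy as np

def ridge_penalty(theta, lam):
    """L2 penalty: lam * sum of squared coefficients (shrinks large weights hard)."""
    return lam * np.sum(theta ** 2)

def lasso_penalty(theta, lam):
    """L1 penalty: lam * sum of absolute coefficients (can drive weights to zero)."""
    return lam * np.sum(np.abs(theta))

theta = np.array([3.0, -0.1, 0.0])
print(ridge_penalty(theta, lam=0.5))  # dominated by the single large coefficient
print(lasso_penalty(theta, lam=0.5))  # weights large and small coefficients more evenly
```

The squared penalty punishes the large coefficient far more than the small one, while the absolute-value penalty treats them proportionally, which is one intuition for why lasso tends to zero out weak coefficients and ridge merely shrinks them.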
Why these models still matter
Linear and logistic regression remain valuable because they:
- train quickly
- establish strong baselines
- offer interpretability
- force clean thinking about features and assumptions
They also provide the conceptual bridge to more advanced optimization-based models.
| Model | Core output | Strength | Typical limitation |
|---|---|---|---|
| Linear regression | numeric prediction | fast, interpretable baseline | misses nonlinear structure |
| Logistic regression | class probability | strong calibrated baseline for binary tasks | decision boundary is linear in feature space |
| Ridge regression | shrunk linear coefficients | stable when features are correlated | does not do feature selection directly |
| Lasso regression | sparse linear coefficients | can simplify wide feature sets | unstable when strong predictors are highly correlated |
Chapter takeaway
Optimization is easier to understand when you connect it to familiar models first. Linear and logistic regression are ideal for building that intuition.
Practice
Write down one problem where you would begin with logistic regression before trying a tree or neural network. Explain why.
Then continue to Boosting, Neural Networks, and AutoML.