3. Machine Learning Essentials
Machine learning is one tool inside data science, not the whole job. The minimum useful skill is not knowing every model family. It is knowing how to frame a prediction problem, build a strong baseline, evaluate honestly, and recognize when a model is not trustworthy.
Start with the problem type
The first distinction is usually the problem type:
- regression: predict a continuous value
- classification: predict a category or probability
- ranking: order items by relevance or value
Before picking a model, make sure the target is defined clearly and available at prediction time.
The minimum viable modeling workflow
- define the target, decision, and metric
- choose a split strategy that matches reality
- build a simple baseline
- compare stronger models only after the baseline is solid
- inspect errors, calibration, and leakage risks
- monitor after deployment because data changes over time
The baseline matters more than many new practitioners expect. A well-defined baseline exposes whether additional complexity is actually earning its keep.
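A baseline check can be as small as a few lines. The sketch below compares a mean-value predictor against held-out data using MAE; the numbers are invented for illustration.

```python
# A minimal baseline check: a mean-value predictor evaluated with MAE.
# All numbers are illustrative.

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

train_y = [10, 12, 11, 13, 14]
test_y = [12, 15, 11]

# Predict the training mean for every test case.
baseline_pred = [sum(train_y) / len(train_y)] * len(test_y)
baseline_mae = mae(test_y, baseline_pred)

# A candidate model only earns its keep if it clearly beats baseline_mae.
```

Any model that cannot clearly beat this number is not adding value, no matter how sophisticated it is.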
General objective: fit the data without memorizing noise
A broad way to write many supervised learning problems is:

    minimize over f:   (1/n) * Σᵢ L(yᵢ, f(xᵢ))  +  λ * Ω(f)

where L is a loss function, Ω measures model complexity, and λ controls the trade-off. The first term rewards fit to the training data. The second term penalizes complexity.
That trade-off is the center of practical ML.
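The trade-off is easiest to see in the smallest possible case: one-parameter ridge regression, where the objective `sum((y - w*x)^2) + lam * w^2` has a closed-form minimizer. The data below is invented to make the arithmetic clean.

```python
# One-parameter ridge regression: minimize sum((y - w*x)^2) + lam * w^2.
# Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + lam).

def ridge_1d(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # true relation is y = 2x

w_unregularized = ridge_1d(xs, ys, lam=0.0)   # fits the data exactly: 2.0
w_regularized = ridge_1d(xs, ys, lam=14.0)    # penalty shrinks w toward 0: 1.0
```

With `lam = 0` the fit term dominates and the model recovers the true slope; increasing `lam` pulls the coefficient toward zero, trading fit for simplicity.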
Bias, variance, and overfitting
- high bias: model is too rigid and misses important structure
- high variance: model is too sensitive to quirks of the training sample
- overfitting: training performance looks strong but generalization degrades
This is why you should not judge a model only by how well it fits the training set.
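An extreme form of overfitting makes the point concrete: a lookup table that memorizes training pairs is perfect on the training set and useless on anything unseen. The data here is made up for the example.

```python
# Overfitting in its purest form: a "model" that memorizes training pairs.
# It scores perfectly on training data and fails on unseen inputs.

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

train = [(1, 10.0), (2, 12.0), (3, 11.0)]
test = [(4, 11.5), (5, 12.5)]

memorized = dict(train)

def lookup_model(x):
    return memorized.get(x, 0.0)  # unseen inputs get a useless default

train_mae = mae([y for _, y in train], [lookup_model(x) for x, _ in train])  # 0.0
test_mae = mae([y for _, y in test], [lookup_model(x) for x, _ in test])     # 12.0
```

Real overfitting is subtler, but the signature is the same: training error near zero, generalization error far worse.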
Model families worth knowing first
| Model family | Why it matters | Main caution |
|---|---|---|
| linear and logistic regression | fast, interpretable, strong baseline | misses nonlinear structure unless features help |
| decision trees and tree ensembles | strong default for many tabular problems | can overfit or hide leakage behind high accuracy |
| nearest neighbors | builds intuition for similarity-based prediction | sensitive to scaling and irrelevant features |
| clustering and PCA | useful for structure discovery and compression | easy to over-interpret unsupervised outputs |
For many structured-data tasks, tree ensembles and regularized linear models are better defaults than deep learning.
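The scaling caution for nearest neighbors is easy to demonstrate. In the sketch below, with both features on comparable scales the nearest neighbor is decided by both; expressing one feature in raw units lets it swamp the distance entirely. The points are invented.

```python
import math

# Why nearest neighbors is sensitive to feature scaling: the same query
# gets a different nearest neighbor once one feature dominates the distance.

def nearest_label(points, query):
    return min(points, key=lambda p: math.dist(p[0], query))[1]

# features: (age in decades, income in $10k) — comparable scales
scaled = [((2.0, 5.0), "A"), ((9.0, 5.2), "B")]
label_scaled = nearest_label(scaled, (2.1, 5.2))  # age matters: "A"

# same data with income in raw dollars: income swamps age
raw = [((2.0, 50000.0), "A"), ((9.0, 52000.0), "B")]
label_raw = nearest_label(raw, (2.1, 52000.0))    # flips to "B"
```

This is why standardizing features is a near-mandatory preprocessing step for distance-based methods.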
Evaluation is part of modeling
Pick metrics that match the decision:
- MAE or RMSE for regression
- log loss, precision, recall, ROC-AUC, or PR-AUC for classification depending on cost structure
- ranking metrics for ordering problems
Also choose the right split:
- random split for stable iid-style data
- time-based split for forecasting or delayed outcomes
- group-aware split when related entities would otherwise leak across folds
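The last two split strategies can be sketched in a few lines. The rows below are invented, with each row carrying a timestamp and a group id.

```python
# Sketches of time-based and group-aware splits. Each row is
# (timestamp, group_id); the data is invented for illustration.

def time_split(rows, cutoff):
    """Train on the past, evaluate on the future."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

def group_split(rows, test_groups):
    """Keep all rows of an entity on the same side of the split."""
    train = [r for r in rows if r[1] not in test_groups]
    test = [r for r in rows if r[1] in test_groups]
    return train, test

rows = [(1, "u1"), (2, "u2"), (3, "u1"), (4, "u3")]
past, future = time_split(rows, cutoff=3)
seen, held_out = group_split(rows, test_groups={"u1"})
```

A random split of `rows` could put user `u1`'s early rows in train and its later rows in test, which is exactly the leakage the group-aware split prevents.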
Cross-validation and tuning
Cross-validation helps estimate how sensitive performance is to the exact split. Hyperparameter tuning helps search model settings. Neither rescues a badly framed problem.
Treat both as support tools, not substitutes for problem understanding.
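The mechanics of k-fold cross-validation amount to partitioning indices into folds and rotating which fold is held out. A minimal index generator, assuming sequential fold assignment (shuffle first when row order carries no meaning):

```python
# A minimal k-fold index generator. Fold assignment is sequential;
# shuffle the data first when it has no meaningful order.

def kfold_indices(n, k):
    folds = []
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i not in set(test_idx)]
        folds.append((train_idx, test_idx))
        start += size
    return folds

splits = kfold_indices(n=10, k=5)  # five (train, test) index pairs
```

Each index appears in exactly one test fold, so every row is used for evaluation exactly once across the k iterations.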
Interpretability, calibration, and error analysis
A model that predicts probabilities should also be reasonably calibrated: among cases where it predicts roughly 70%, roughly 70% should actually turn out positive.
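A rough calibration check bins predictions by predicted probability and compares the average prediction in each bin to the observed positive rate. The probabilities and labels below are invented, chosen so the toy model is well calibrated.

```python
# A rough calibration check: bucket predictions by predicted probability
# and compare average predicted probability to the observed positive rate.

def calibration_table(probs, labels, n_bins=2):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for b in bins:
        if b:
            avg_pred = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            table.append((avg_pred, observed))
    return table

probs = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
report = calibration_table(probs, labels)  # avg prediction ≈ observed rate
```

A large gap between the two columns in any bin means the raw scores should not be read as probabilities without recalibration.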
Error analysis should ask:
- which segments perform poorly
- whether the target or features are stale
- whether the model is exploiting leakage or proxies
- whether class imbalance is distorting the headline metric
What to remove from your mental checklist
You do not need to memorize every derivation, every kernel trick, or every neural architecture to be effective early on.
You do need to know how to answer these questions:
- What is the baseline?
- What would leakage look like here?
- How could this fail after deployment?
- Is a simpler model already good enough?
Chapter takeaway
Good machine learning practice is usually conservative before it is clever.
From here, the best next step is depth in a subdomain. If you want a broader modeling workflow for structured data, continue with Applied Machine Learning for Tabular Data. If you want deeper coverage of tree models, try Decision Trees and Ensemble Methods in Machine Learning. If you want a domain-specific example of end-to-end modeling systems, try Understanding Recommender Systems.
Next: SQL and Data Modeling.