2. Evaluation and Data Quality

Evaluation is where many tabular ML projects quietly go wrong. A model can look strong on paper and still be misleading if the split strategy, metric, or data-quality assumptions are poor.

Learning goals

  • choose metrics that match the task
  • use train, validation, and test data correctly
  • recognize overfitting, leakage, imbalance, and missing-data issues

Match the metric to the task

For regression, common choices include:

  • mean absolute error when average miss size is easy to interpret
  • mean squared error when larger mistakes should be penalized more strongly
  • root mean squared error when you want the metric back on the original unit scale

For classification, common choices include:

  • accuracy when classes are balanced and mistakes have similar cost
  • precision when false positives are expensive
  • recall when false negatives are expensive
  • F1 when precision and recall both matter and you want a single summary

The main discipline is not to choose a metric because it is familiar. Choose it because it reflects the downstream cost of being wrong.

A few core formulas are worth recognizing:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
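The formulas above can be checked directly in plain Python. The values here are hypothetical toy numbers chosen only to make the arithmetic easy to follow:

```python
import math

# Toy regression values (hypothetical, for illustration only).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy binary classification labels: count TP, FP, FN by hand.
labels = [1, 1, 0, 0, 1, 0]
preds  = [1, 0, 0, 1, 1, 0]
tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

In practice you would use `sklearn.metrics`, but writing the formulas out once makes clear that F1 is just the harmonic mean of precision and recall.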

| Metric | Best fit | Main caution |
|---|---|---|
| Accuracy | balanced classification with similar error costs | can look good on imbalanced tasks for the wrong reasons |
| Precision | costly false positives | may hide missed positives |
| Recall | costly false negatives | may encourage too many alerts |
| F1 | need one summary for precision and recall | hides the threshold trade-off and business context |
| MAE | easy-to-interpret average miss size | treats all errors linearly |
| RMSE | larger misses should hurt more | can be dominated by a few large outliers |

Train, validation, and test

A useful mental model is:

  • training set: fit the model
  • validation set: compare modeling choices
  • test set: estimate final generalization quality only after the workflow is fixed

If the same data keeps influencing every modeling decision, the test set stops behaving like truly unseen data.
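The three-way split above can be sketched with two calls to scikit-learn's `train_test_split`. The data here is synthetic and the 60/20/20 ratio is one reasonable convention, not a rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic toy data, purely for illustration.
X = np.arange(100).reshape(-1, 1)
y = (X.ravel() > 50).astype(int)

# First carve off a held-out test set (20%), then split the rest
# into train and validation (0.25 of 80% = 20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
```

The key discipline is the order: the test set is set aside first and not touched again until every modeling choice is final.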

Cross-validation is especially useful when data is limited or when model performance is sensitive to the specific split.
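A minimal cross-validation sketch, using scikit-learn's built-in iris dataset as a stand-in for limited data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Five folds: each fold serves once as validation data,
# so the estimate does not hinge on a single lucky split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

Looking at the spread of `scores` (not just the mean) is what tells you whether performance is sensitive to the split.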

[Figure: roles of train, validation, and test data in one workflow]

Underfitting and overfitting

Underfitting means the model is too simple or poorly specified to capture meaningful structure.

Overfitting means the model has adapted too much to quirks of the training data and no longer generalizes well.

Common warning signs of overfitting:

  • training performance far better than validation performance
  • unstable results across folds or random seeds
  • large apparent gains from increasingly specific feature tweaks that do not translate into out-of-sample improvement
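The first warning sign is easy to reproduce. This sketch fits an unconstrained decision tree to a deliberately noisy synthetic target; the tree memorizes the training data, and the train/validation gap exposes it:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a weak signal buried in heavy noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=300) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# No depth limit: the tree can memorize every training point.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = deep.score(X_tr, y_tr)   # essentially perfect
val_acc = deep.score(X_va, y_va)     # far lower: the overfitting gap
```

Capping `max_depth` (or any other complexity control) shrinks the gap, usually by trading a little training accuracy for better validation accuracy.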

Exploratory data analysis

Before training, inspect the data directly.

Useful first passes include:

  • descriptive statistics
  • feature distributions
  • scatterplots for important numeric relationships
  • correlation matrices
  • target balance checks

EDA is not a separate academic ritual. It is how you discover scale differences, outliers, impossible values, and target imbalance before those issues become modeling mistakes.
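The first passes listed above are a few lines of pandas. The frame below is a hypothetical stand-in for a real dataset, with made-up column names:

```python
import pandas as pd

# Hypothetical toy data standing in for a real table.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 43, 36],
    "income": [30_000, 42_000, 80_000, 75_000, 90_000, 28_000, 60_000, 52_000],
    "churned": [0, 0, 1, 0, 1, 0, 1, 0],
})

summary = df.describe()                                # descriptive statistics
balance = df["churned"].value_counts(normalize=True)   # target balance check
corr = df.corr()                                       # correlation matrix
```

Scale differences (`age` in tens, `income` in tens of thousands) jump out of `describe()` immediately, which is exactly the kind of issue that matters for the KNN and scaling material in the next chapter.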

Missing data

Missingness is rarely just a technical nuisance. Sometimes the fact that a value is missing is itself informative.

A practical workflow is:

  • quantify how much data is missing and where
  • decide whether to drop, impute, or explicitly encode missingness
  • keep the choice consistent inside the model pipeline

Simple methods such as mean, median, or most-frequent imputation are often strong enough to start with. What matters most is whether the method is applied correctly and reproducibly.
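One way to keep the choice consistent, as a sketch: put a `SimpleImputer` inside a scikit-learn `Pipeline`, so the fill values are learned from training data only and reapplied identically everywhere. The tiny array below is hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with missing values (hypothetical).
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

# The imputer lives inside the pipeline, so fit() learns the medians
# on training data and transform() reuses them, reproducibly.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression()),
])
model.fit(X, y)
```

Imputing before splitting, by contrast, would let statistics from test rows leak into training, which is exactly the kind of subtle leakage this chapter warns about.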

Imbalanced data

If one class is rare, accuracy can become actively misleading. A model that predicts the majority class almost all the time may still look numerically strong while being operationally useless.

In those settings:

  • track precision and recall
  • inspect confusion matrices
  • compare against naive baselines
  • think carefully about threshold choice
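The accuracy trap above is easy to demonstrate with a majority-class baseline on synthetic imbalanced labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Imbalanced toy labels: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# A "model" that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = baseline.predict(X)

acc = accuracy_score(y_true, y_pred)   # looks strong
rec = recall_score(y_true, y_pred)     # reveals the failure
cm = confusion_matrix(y_true, y_pred)  # bottom-right cell: zero true positives
```

Any real model should be required to beat this baseline on the metric that matches the downstream cost, not on accuracy.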

Chapter takeaway

Reliable evaluation is not a reporting step. It is part of model design.

Practice

Take a real or hypothetical classification problem and answer:

  • Which mistake is worse: a false positive or a false negative?
  • Which metric should matter most?
  • What would a bad split strategy accidentally leak?

Then continue to KNN, Scaling, and Pipelines.
