Calibration plots give us a snapshot of how well a model's confidence in its predictions corresponds to the probability that those predictions are correct.
Many binary classification models output a value called confidence or probability. This allows for flexibility in the classification, because a working point (threshold) decides the final prediction. Although confidence is often interpreted as the probability that the sample belongs to the predicted class, or as the model's uncertainty, in many cases the value is not calibrated. This means that the confidence does not match the observed frequency of correct predictions. As a simple example, if we collect 100 outputs of a model that all have a confidence of 0.7, we would expect about 70 of the predictions to be correct. If the model was correct in only 30 of the predictions, or in all of them, we say that the model is not calibrated.
To measure calibration we use accuracy: the number of correct predictions divided by the total number of predictions. The samples are divided into bins by their output confidence, and the accuracy is calculated for each bin. Note that for a binary classifier that outputs a single value, the lowest possible confidence is set by the working point th (in our case it is 0.5, see the figure above). Any output below th results in the opposite class. To include such samples in the calibration plot, we use the complementary probability (1 - probability) as their confidence.
The results are plotted on the calibration plot (as shown above).
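The binning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the plot: the function name `binned_accuracy`, the use of equal-width bins, and the default of 5 bins are assumptions made for the example.

```python
import numpy as np

def binned_accuracy(probs, labels, n_bins=5, th=0.5):
    """Per-bin confidence and accuracy for a single-output binary classifier.

    Hypothetical helper: equal-width bins over [0, 1] are an assumption.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # The working point th decides the final prediction.
    preds = (probs >= th).astype(int)
    # Confidence of the predicted class: p for class 1, (1 - p) for class 0.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = preds == labels

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per sample; clip so that conf == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)

    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # skip empty bins
            rows.append((
                float(edges[b]),             # bin lower edge
                float(edges[b + 1]),         # bin upper edge
                int(mask.sum()),             # number of samples in the bin
                float(conf[mask].mean()),    # mean confidence in the bin
                float(correct[mask].mean())  # accuracy in the bin
            ))
    return rows
```

Each returned row holds the bin edges, the sample count, the mean confidence, and the accuracy. Plotting accuracy against mean confidence gives the points of the calibration plot.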
For a well-calibrated model we would expect the confidence in each bin to match the accuracy in the same bin, so that the points simply follow the dashed line. From the calibration curve we can compute the Expected Calibration Error (ECE), which summarizes the calibration quality in a single number (lower error is better).
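To make the ECE concrete, here is a small sketch of the commonly used binned definition: the average gap between accuracy and mean confidence per bin, weighted by the fraction of samples in each bin. The function name, the equal-width bins, and the default of 10 bins are assumptions for illustration. The tool that produced the plot may compute it differently.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: sum over bins of (n_bin / N) * |accuracy - mean confidence|.

    Hypothetical helper: equal-width bins over [0, 1] are an assumption.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1.0 if prediction was right

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per sample; clip so that conf == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # empty bins contribute nothing
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return float(ece)

# The example from the text: 100 predictions, all with confidence 0.7.
conf = [0.7] * 100
only_30_right = [1] * 30 + [0] * 70
exactly_70_right = [1] * 70 + [0] * 30
print(expected_calibration_error(conf, only_30_right))     # about 0.4
print(expected_calibration_error(conf, exactly_70_right))  # about 0.0
```

With 30 correct predictions the gap between confidence (0.7) and accuracy (0.3) is large. With 70 correct predictions the two match, and the error is essentially zero.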
To learn more about calibration plots, please refer to https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/