Scoring Rules
1) Introduction
Scoring rules are used to evaluate the performance of probabilistic forecasts. Roughly, a scoring rule is a function that maps a forecast and an observation to a real number, interpreted as a measure of the quality of the forecast. Under a proper scoring rule (defined below), a rational agent who wants to maximize the expected score does best by reporting its honest belief about the distribution of the observation. In this post, we will first go through a couple of definitions related to scoring rules. Then we will examine some of the proper scoring rules commonly used in the literature. At the end, we will briefly discuss the relationship between scoring rules and predictive certainty.
2) Definitions
2.1) Scoring Rule
A scoring rule is any extended real-valued function \(\mathbf{S}: \mathcal{Q} \times \Omega \rightarrow \mathbb{R}\), where \(\mathcal{Q}\) is a family of probability distributions over the space \(\Omega\), such that \(\mathbf{S}(Q, \cdot)\) is \(\mathcal{Q}\)-quasi-integrable* for all \(Q \in \mathcal{Q}\). The output \(\mathbf{S}(Q, y)\) is the score assigned when the forecast \(Q \in \mathcal{Q}\) is issued and the observation \(y \in \Omega\) is realized. In this post we use the positively oriented convention, so higher scores are better; a negatively oriented scoring rule would instead be read as a loss or penalty.
*Here, “quasi-integrable” means that the integral of \(\mathbf{S}(Q, \cdot)\) with respect to each distribution in \(\mathcal{Q}\) exists in the extended sense (it may be \(\pm\infty\)). This is just a technical condition that ensures the expected score is well defined. So, nothing to worry about here.
2.2) Proper Scoring Rule
A scoring rule \(\mathbf{S}\) is called a proper scoring rule if and only if
\[\label{eq:psr} \max_{Q \in \mathcal{Q}} \mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] = \mathbb{E}_{Y \sim P}[\mathbf{S}(P, Y)] \quad \text{for all } P \in \mathcal{Q} \; .\]In other words, the score function \(\mathbf{S}\) is a proper scoring rule if, for every ground-truth distribution \(P \in \mathcal{Q}\), the expected score is maximized when the forecaster reports exactly \(P\) as its probabilistic forecast \(Q \in \mathcal{Q}\).
2.3) Strictly Proper Scoring Rule
A scoring rule \(\mathbf{S}\) is called a strictly proper scoring rule if and only if
- \(\mathbf{S}\) is a proper scoring rule, and
- \(P = \operatorname*{arg\,max}_{Q \in \mathcal{Q}} \; \mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)]\) is the unique maximizer, i.e. no other \(Q \in \mathcal{Q}\) attains the maximum expected score.
In other words, a strictly proper scoring rule is maximized in expectation only when the forecaster gives exactly the ground truth distribution \(P(Y)\) as its probabilistic forecast \(Q \in \mathcal{Q}\).
So, there is a slight difference between proper scoring rules and strictly proper scoring rules. In an ideal scenario, we want our scoring rule to be strictly proper so that it awards the maximum reward only to the forecaster who gives the true distribution as its forecast.
3) Scoring Rules
3.1) Log Probability
The log (probability) scoring rule \(S(q, y)\) measures the quality of a probabilistic forecast and is one of the most widely used scoring rules in decision theory. Formally, it can be defined in discrete or continuous form as follows:
1) Log scoring rule for binary classification:
\[S(q, y) = \left\{ \begin{array}{rl} \log q \; , & \text{if} \; y = 1 \\ \log(1-q) \; , & \text{if} \; y = 0 \end{array} \right.\]which can be expressed as
\[\label{eq:binary-lpsr} S(q, y) = y \log q + (1-y) \log (1-q)\]Note that the expressions given above have slightly different domains. For the first equation, the domain is \(D_1 = ([0,1) \times \left\lbrace 0 \right\rbrace) \cup ((0, 1] \times \left\lbrace 1 \right\rbrace)\), while for the second equation, the domain is \(D_2 = (0,1) \times \left\lbrace 0,1 \right\rbrace\).
2) Log scoring rule for multiclass classification:
\[\label{eq:multiclass-lpsr} S(q, y) = \sum_k y_k \log q_k(x) = \log q_{y^*}(x)\]where \(x\) is the input, \(y^*\) is the true class, and \(q(x)\) is the predicted probability distribution over the classes. We have \(y_k = 1\) if the true class is \(k\), and \(y_k = 0\) otherwise.
3) Log scoring rule for regression (continuous case):
\[\label{eq:regression-lpsr} S(q, y) = \log q(y)\]where \(q\) is the predicted probability density over the continuous outcome space and \(y\) is the observed value.
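To make the three forms concrete, here is a minimal NumPy/SciPy sketch (the function names are ours, and the continuous case assumes a Gaussian predictive density purely for illustration):

```python
import numpy as np
from scipy.stats import norm

def log_score_binary(q, y):
    # q: predicted probability of the positive class, y: observed label in {0, 1}
    return y * np.log(q) + (1 - y) * np.log(1 - q)

def log_score_multiclass(q, y_star):
    # q: predicted probability vector over the classes, y_star: index of the true class
    return np.log(q[y_star])

def log_score_gaussian(mu, sigma, y):
    # continuous case, assuming a Gaussian predictive density q = N(mu, sigma^2)
    return norm.logpdf(y, loc=mu, scale=sigma)

print(log_score_binary(0.8, 1))                             # log(0.8) ~ -0.223
print(log_score_multiclass(np.array([0.1, 0.7, 0.2]), 1))   # log(0.7) ~ -0.357
print(log_score_gaussian(0.0, 1.0, 0.5))                    # ~ -1.044
```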
3.2) Brier Score
1) Brier score for binary classification:
\[S(q, y) = -(q - y)^2\]\(q\) represents the predicted probability of the positive class (\(Y = 1\)) and \(y\) is the true class label. Since we want the output of the scoring rule to be maximized when the predicted probability is close to the true class label, we use the negative of the squared difference between the predicted probability and the true class label.
2) Brier score for multiclass classification:
\[S(q, y) = -\sum_k (q_k - y_k)^2 = -(q_{y^*} - 1)^2 -\sum_{k \neq y^*} q_k^2\]where \(q_k\) is the predicted probability of class \(k\) and \(y^*\) is the true class. Similar to the log probability score, we have \(y_k = 1\) if the true class is \(k\), and \(y_k = 0\) otherwise.
Although there is no direct version of the Brier score for regression, the negative squared error can be used in an analogous way to score point predictions in regression problems.
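A corresponding sketch for the Brier score, under the same sign convention (higher is better); the helper names are again ours:

```python
import numpy as np

def brier_score_binary(q, y):
    # q: predicted probability of the positive class, y: observed label in {0, 1}
    return -(q - y) ** 2

def brier_score_multiclass(q, y_star):
    # q: predicted probability vector over the classes, y_star: index of the true class
    y = np.zeros_like(q)
    y[y_star] = 1.0
    return -np.sum((q - y) ** 2)

print(brier_score_binary(0.8, 1))                             # -0.04
print(brier_score_multiclass(np.array([0.1, 0.7, 0.2]), 1))   # -(0.01 + 0.09 + 0.04) = -0.14
```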
4) Proofs
Having defined the scoring rules and given examples for them, we can now prove that those scoring rules are strictly proper. We will start with the log probability scoring rule.
4.1) Theorem (Log Probability Scoring Rule)
Theorem: The log (probability) scoring rule is a strictly proper scoring rule.
Proof: We will show that all versions of the log probability scoring rule (binary/multiclass/regression) are strictly proper scoring rules.
1) Binary log probability scoring rule:
\[\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] = P(Y = 1) \log q + P(Y = 0) \log (1 - q)\]Let \(p\) be the true probability of the event \(Y = 1\). Then, the expected score is:
\[\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] = p \log q + (1 - p) \log (1 - q)\]To find the maxima, take the derivative with respect to \(q\) and set it to zero:
\[\begin{split} \frac{\partial}{\partial q}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= \frac{p}{q} - \frac{1-p}{1 - q} \\ 0 &= \frac{p - pq - q + pq}{q (1-q)} \\ 0 &= \frac{p - q}{q (1-q)} \\ \Rightarrow p - q &= 0 \\ \Rightarrow p &= q \end{split}\]Now, we need to check the second derivative to confirm that this critical point is a maximum (properness) and that it is the only maximizer (strictness):
\[\begin{split} \frac{\partial^2}{\partial q^2}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= -\frac{p}{q^2} - \frac{1-p}{(1 - q)^2} \\ &= - \left( \underbrace{\frac{p}{q^2}}_\textrm{> 0} + \underbrace{\frac{1-p}{(1 - q)^2}}_\textrm{> 0} \right) < 0 \end{split}\]Except for the boundary cases \(q=0\) and \(q=1\), the second derivative is always negative, which means that the expected score is strictly concave on \((0,1)\) and the maximizer is unique. For \(q = 1\), the maximum is achieved only if \(p = 1\), and similarly for \(q = 0\), the maximum is achieved only if \(p = 0\). Therefore, \(q = p\) is the only maximizer, and the log probability scoring rule for binary classification is strictly proper.
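As a quick sanity check of this result, the following sketch evaluates the expected binary log score on a grid of forecasts for an arbitrarily chosen \(p\) and confirms that the maximum sits at \(q = p\):

```python
import numpy as np

p = 0.3                                  # true probability of Y = 1 (arbitrary choice)
q = np.linspace(0.001, 0.999, 999)       # grid of candidate forecasts
expected_score = p * np.log(q) + (1 - p) * np.log(1 - q)
print(q[np.argmax(expected_score)])      # ~0.3, i.e. the expected score peaks at q = p
```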
2a) Multiclass log probability scoring rule (Proof #1):
\[S(q, y) = \sum_{k=1}^{K} y_k \log q_k(x)\]Let \(p_k\) be the true probability of the event \(Y = k\). Since \(q_k\) is the predicted probability for class \(k\), we know that \(\sum_i q_i = 1\). Then, the expected score is:
\[\begin{split} \mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= \sum_k P(Y = k|x) \log(q_k(x)) \\ &= p_1 \log(q_1(x)) + p_2\log(q_2(x)) + ... + p_K \log(q_K(x)) \\ &= p_1 \log(q_1(x)) + p_2\log(q_2(x)) + ... + p_K \log(1 - \sum_{i \neq K} q_i(x)) \end{split}\]Taking the derivative with respect to \(q_j\) and setting it to zero:
\[\begin{split} \frac{\partial}{\partial q_j}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= \frac{p_j}{q_j} -\frac{p_K}{1- \sum_{i \neq K} q_i(x)} \\ 0 &= \frac{p_j}{q_j} - \frac{p_K}{q_K} \\ \Rightarrow \frac{p_j}{q_j} &= \frac{p_K}{q_K} \end{split}\]This equality holds for any \(j\):
\[\frac{p_1}{q_1} = \frac{p_2}{q_2} = ... = \frac{p_K}{q_K} = \lambda\]Each \(q_i\) can therefore be written as a constant multiple of \(p_i\): \(q_i = p_i / \lambda\).
\[\begin{split} \sum_i q_i &= 1 \\ \sum_i \frac{p_i}{\lambda} &= 1 \\ \frac{1}{\lambda} \sum_i p_i &= 1 \\ \lambda &= 1 \end{split}\]Since \(\lambda = 1\), we have \(q_i = p_i\) for all \(i\). Now, we need to check the second derivative to confirm that this critical point is a maximum (properness) and that it is the only maximizer (strictness):
\[\begin{split} \frac{\partial^2}{\partial q_j^2}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= -\frac{p_j}{q_j^2} - \frac{p_K}{(1- \sum_{i \neq K} q_i(x))^2} \\ &= - \left( \underbrace{\frac{p_j}{q_j^2}}_\textrm{> 0} + \underbrace{\frac{p_K}{(1- \sum_{i \neq K} q_i(x))^2}}_\textrm{> 0} \right) < 0 \end{split}\]Except for the boundary cases \(q_j=0\) and \(q_j=1\), the second derivative is always negative, which means that the expected score is strictly concave in \(q_j\) and the maximizer is unique. For \(q_j = 1\), the maximum is achieved only if \(p_j = 1\), and similarly for \(q_j = 0\), the maximum is achieved only if \(p_j = 0\). Therefore, \(q_j = p_j\) is the only maximizer, and the log probability scoring rule for multiclass classification is strictly proper.
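The multiclass result can also be verified numerically; the sketch below maximizes the expected log score over the probability simplex with SciPy (the true distribution \(p\) is an arbitrary choice) and recovers \(q = p\):

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.3, 0.2])            # assumed true class distribution

def neg_expected_log_score(q):
    # negative of E_{Y~p}[log q_Y], to be minimized
    return -np.sum(p * np.log(q))

constraints = ({"type": "eq", "fun": lambda q: np.sum(q) - 1.0},)
result = minimize(neg_expected_log_score, x0=np.ones(3) / 3,
                  bounds=[(1e-9, 1.0)] * 3, constraints=constraints)
print(result.x)                          # ~[0.5 0.3 0.2], i.e. q = p
```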
2b) Multiclass log probability scoring rule (Proof #2):
Alternatively, we can solve the optimization problem with Lagrange multipliers. The Lagrangian is:
\[\mathcal{L}(q, \lambda) = \sum_k P(Y = k|x) \log(q_k(x)) + \lambda \left(1 - \sum_k q_k(x)\right)\]Taking the derivative with respect to \(q_j\) and setting it to zero:
\[\begin{split} \frac{\partial}{\partial q_j}\mathcal{L}(q, \lambda) &= \frac{p_j}{q_j} - \lambda \\ 0 &= \frac{p_j}{q_j} - \lambda \\ \Rightarrow \frac{p_j}{q_j} &= \lambda \end{split}\]The rest of the proof follows as in the first proof.
3) Continuous log probability scoring rule:
\[S(q, y) = \log q(y)\]Let \(p\) be the true probability density function of \(Y\). Then, the expected score is:
\[\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] = \int p(y) \log q(y) \, dy\]To compare this with the expected score of the true density, consider the difference \(\int p(y) \log \frac{q(y)}{p(y)} \, dy\). Let \(X = \frac{q(Y)}{p(Y)}\) with \(Y \sim p\), and recall that \(\log(\cdot)\) is concave. By Jensen’s inequality, \(\mathbb{E}[f(X)] \leq f(\mathbb{E}[X])\) for concave \(f\). Therefore, we have:
\[\begin{split} \int p(y) \log \frac{q(y)}{p(y)} dy &\leq \log \int p(y)\frac{q(y)}{p(y)} dy \\ \int p(y) \log \frac{q(y)}{p(y)} dy &\leq \log \int q(y) dy \\ \int p(y) \log \frac{q(y)}{p(y)} dy &\leq \log(1) \\ \int p(y) \log \frac{q(y)}{p(y)} dy &\leq 0 \end{split}\]The same result can be obtained via the Kullback-Leibler divergence: the expression on the left is exactly \(-\mathrm{KL}(p \,\|\, q)\), and since \(\mathrm{KL}(p \,\|\, q) = \mathrm{CE}(p, q) - \mathrm{H}(p) \geq 0\) (cross-entropy minus entropy), the expression is always non-positive. It is maximized (equal to zero) only when \(q(y) = p(y)\) almost everywhere, which means that the log probability scoring rule for the continuous case is strictly proper.
An alternative way to see the uniqueness of the maximizer: since \(\log\) is strictly concave, equality in Jensen’s inequality holds if and only if \(X = \frac{q(Y)}{p(Y)}\) is almost surely constant. Because \(\mathbb{E}[X] = \int q(y) \, dy = 1\), this constant must be \(1\), so \(q = p\) almost everywhere. Therefore, the maximum is achieved if and only if \(q = p\).
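A small numerical illustration of the continuous case: restricting the forecasts to Gaussians \(q = \mathcal{N}(\mu, 1)\) and taking the true density \(p = \mathcal{N}(0, 1)\), the expected log score is maximized at \(\mu = 0\), i.e. by the forecast that matches \(p\) (the grid and integration bounds below are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# True density p = N(0, 1); candidate forecasts q = N(mu, 1) over a grid of means
mus = np.linspace(-2.0, 2.0, 81)
expected_scores = [
    quad(lambda y, m=m: norm.pdf(y) * norm.logpdf(y, loc=m), -10, 10)[0] for m in mus
]
print(mus[np.argmax(expected_scores)])   # 0.0: the forecast that matches p wins
```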
4.2) Theorem (Brier Scoring Rule)
Theorem: The Brier scoring rule is a strictly proper scoring rule.
Proof: We will show that the Brier scoring rule is strictly proper for both binary and multiclass classification.
1) Binary Brier Scoring rule:
\[\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] = - P(Y = 1) (q - 1)^2 - P(Y = 0) \, q^2\]Let \(p\) be the true probability of the event \(Y = 1\). Then, the expected score is:
\[\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] = - p (q - 1)^2 - (1 - p) q^2\]To find the maxima, take the derivative with respect to \(q\) and set it to zero:
\[\begin{split} \frac{\partial}{\partial q}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= -2p(q - 1) - 2(1 - p)q \\ &= -2pq + 2p - 2q + 2pq \\ &= 2p - 2q \\ 0 &= 2p - 2q \\ \Rightarrow p &= q \end{split}\]We need to check the second derivative to confirm that this critical point is a maximum (properness) and that it is the only maximizer (strictness):
\[\begin{split} \frac{\partial^2}{\partial q^2}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= -2 < 0 \\ \end{split}\]The second derivative is always equal to \(-2 < 0\), which means that the expected score is strictly concave and the maximizer is unique. Therefore, \(q = p\) is the only maximizer, and the Brier scoring rule for binary classification is strictly proper.
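The same grid check as in the binary log score case, now for the expected Brier score (again with an arbitrary \(p\)):

```python
import numpy as np

p = 0.3                                  # true probability of Y = 1 (arbitrary choice)
q = np.linspace(0.0, 1.0, 1001)          # grid of candidate forecasts
expected_score = -p * (q - 1) ** 2 - (1 - p) * q ** 2
print(q[np.argmax(expected_score)])      # 0.3: the expected Brier score peaks at q = p
```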
2) Multiclass Brier Scoring rule:
\[\begin{split} \mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= \sum_k P(Y = k) \bigg[ -\sum_i (q_i - y_i)^2 \bigg]\\ &= \sum_k P(Y = k) \bigg[ -(q_{k} - 1)^2 -\sum_{i \neq k} q_i^2 \bigg] \\ &= \sum_k P(Y = k) \bigg[ -(q_{k} - 1)^2 + q_k^2 -\sum_{i} q_i^2 \bigg] \\ &= \sum_k P(Y = k) \bigg[ -q_{k}^2 - 1 + 2q_k + q_k^2 -\sum_{i} q_i^2 \bigg] \\ &= \sum_k P(Y = k) \bigg[ 2q_k - 1 -\sum_{i} q_i^2 \bigg] \\ &= \sum_k P(Y = k)(2q_k - 1) - \sum_k P(Y = k) \bigg(\sum_i q_i^2\bigg) \\ &= \sum_k P(Y = k)(2q_k - 1) - \sum_i q_i^2 \bigg(\underbrace{\sum_k P(Y = k)}_\textrm{1}\bigg) \\ &= \sum_k P(Y = k)(2q_k - 1) - \sum_i q_i^2 \\ &= \sum_k \Big[ P(Y = k)(2q_k - 1) - q_k^2 \Big] \end{split}\]Similar to what we did for the log probability score, this expression can be rewritten by replacing \(q_K\) with \(1 - \sum_{i \neq K} q_i\):
\[\begin{split} \mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= p_1(2q_1 - 1) - q_1^2 + p_2(2q_2 - 1) - q_2^2 + ... + p_K(2q_K - 1) - q_K^2 \\ &= p_1(2q_1 - 1) -q_1^2 + p_2(2q_2 - 1) -q_2^2 + ... + p_K(1 - 2\sum_{i \neq K} q_i) - (1 - \sum_{i \neq K} q_i)^2\\ \end{split}\]Taking the derivative with respect to \(q_j\) and setting it to zero:
\[\begin{split} \frac{\partial}{\partial q_j}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= 2p_j - 2q_j - 2p_K + 2 \Big(1 - \sum_{i \neq K} q_i\Big) \\ &= 2p_j - 2q_j - 2p_K + 2 q_K \\ &= 2\big[(p_j - q_j) + (q_K - p_K)\big] \\ 0 = (p_j - q_j) + (q_K - p_K) \;\Rightarrow\; (p_j - q_j) &= (p_K - q_K) \end{split}\]For all the partial derivatives, the following equality holds:
\[p_1 - q_1 = p_2 - q_2 = ... = p_K - q_K = \lambda\]We know that \(\sum_i q_i = 1\) and \(\sum_i p_i = 1\), therefore:
\[\begin{split} \sum_i (p_i - q_i) &= K \cdot \lambda = 0 \\ \Rightarrow \lambda &= 0\\ \Rightarrow p_i &= q_i \end{split}\]Now, we need to check the second derivative to confirm that this critical point is a maximum (properness) and that it is the only maximizer (strictness):
\[\begin{split} \frac{\partial^2}{\partial q_j^2}\mathbb{E}_{Y \sim P}[\mathbf{S}(Q, Y)] &= -2 - 2 = -4 < 0 \\ \end{split}\]The second derivative is always negative, which means that the expected score is strictly concave and the maximizer is unique. Therefore, \(q = p\) is the only maximizer, and the Brier scoring rule for multiclass classification is strictly proper.
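And a numerical check of the multiclass case, analogous to the log score verification above: maximizing the expected Brier score over the simplex recovers \(q = p\) (the distribution \(p\) is again an arbitrary choice):

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.3, 0.2])            # assumed true class distribution

def neg_expected_brier(q):
    # negative of E_Y[ -sum_k (q_k - y_k)^2 ], to be minimized
    return sum(p[k] * ((q[k] - 1) ** 2 + np.sum(q ** 2) - q[k] ** 2) for k in range(len(p)))

constraints = ({"type": "eq", "fun": lambda q: np.sum(q) - 1.0},)
result = minimize(neg_expected_brier, x0=np.ones(3) / 3,
                  bounds=[(0.0, 1.0)] * 3, constraints=constraints)
print(result.x)                          # ~[0.5 0.3 0.2], i.e. q = p
```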
5) Application: Predictive Certainty
Ok, so far so good. We have seen that there is a concept called scoring rule, and we have seen that log probability and Brier scoring rules are strictly proper. But, why do we care about scoring rules? There are lots of different application areas for scoring rules, but among them, one of the most important ones is predictive certainty. Predictive certainty is the measure of how certain the model is about its predictions. In other words, it is the measure of how much the model trusts its predictions. Scoring rules can be used to measure the predictive certainty of a model.
5.1) Max-Prob Confidence
How can one measure predictive certainty? Well, let’s say we are dealing with a binary classification problem, where the predicted output \(f(x)\) can take values between 0 and 1. Intuitively, the model is certain about its prediction (as positive) when \(f(x)\) is very close to 1, and certain about its prediction (as negative) when \(f(x)\) is very close to 0. Therefore, we can use the probability assigned to the predicted class, \(c(x) = \max(f(x), 1 - f(x))\), as a measure of how much the model trusts its prediction. This is called max-prob confidence.
Let’s define a random variable \(L\) which represents the correctness of our prediction.
\[\begin{equation*} L = \begin{cases} 1 &\text{if prediction is correct}\\ 0 &\text{otherwise} \end{cases} \end{equation*}\]Then \(P(L = 1)\) corresponds to the probability that our prediction is correct. In the deployment environment, we don’t really know whether our prediction is wrong or not (otherwise we wouldn’t need to make a prediction at all). Our aim is to get the most reliable “estimate” of \(P(L=1)\) so that we really know how confident our model is about its predictions. For a real-life scenario, think about a doctor who is trying to diagnose a patient. Diagnosing a patient who is not really sick (a false positive) may not be a huge problem (apart from the stress and the cost). But failing to diagnose a patient who is really sick (a false negative) is a serious problem. Therefore, the doctor should be very confident about his/her diagnosis. An ML algorithm tailored for analyzing medical documents might conclude that the patient is in good health. If the doctor is provided with the confidence level of the algorithm, he/she can decide whether to trust the algorithm or not. If the confidence level is high, the doctor may trust the algorithm. If the confidence level is low, the doctor may attempt to run some additional tests to make sure that the patient is not sick.
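To make these quantities concrete, here is a small NumPy sketch (with made-up predicted probabilities and labels) that computes the hard predictions, the correctness variable \(L\), and the max-prob confidence \(c(x) = \max(f(x), 1 - f(x))\):

```python
import numpy as np

f = np.array([0.92, 0.35, 0.60, 0.08])   # hypothetical predicted P(Y = 1 | x) for four inputs
y = np.array([1, 0, 0, 0])               # true labels

y_hat = (f >= 0.5).astype(int)           # hard predictions
L = (y_hat == y).astype(int)             # correctness variable: 1 if correct, 0 otherwise
c = np.maximum(f, 1 - f)                 # max-prob confidence c(x) = max(f(x), 1 - f(x))

print(y_hat)                             # [1 0 1 0]
print(L)                                 # [1 1 0 1]
print(c)                                 # [0.92 0.65 0.6  0.92]
```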
We want to reward the “reporter” of the confidence level if the reported confidence is close to the real confidence \(P(L=1)\). Does it sound familiar? This is where scoring rules come into play: we can use a proper scoring rule to encourage the reporter to provide confidence levels as close as possible to the real (ground-truth) confidence level \(P(L=1)\).
5.2) BCE Loss
Here’s the good news! The majority of the traditional loss functions we employ during the training of our models are already (negatives of) proper scoring rules for predicting the confidence level via max-prob confidence. In a sense, they aim not only at the highest accuracy, but also at the truthfulness of the max-prob confidence. So, if an algorithm reports a confidence of 0.9 for a prediction, we have reason to expect that prediction to be correct more often than one reported with a lower confidence. Here, it’s important not to conflate the concepts of “accuracy” and “confidence.” Our algorithm might demonstrate high accuracy but lack confidence in its predictions, perhaps consistently outputting probabilities around 0.55 for positive cases and 0.45 for negative cases. Conversely, it could exhibit high confidence despite lower accuracy; for instance, consistently outputting around 0.9 for negative samples, despite this leading to incorrect predictions. We want our model to be neither underconfident nor overconfident.
Now, let’s show that one of the most commonly used loss functions, the negative of Binary Cross-Entropy (BCE) loss, is also a proper scoring rule for max-prob confidence. Since loss is usually minimized and score is maximized, we use the negative of BCE loss as a scoring rule.
Careful! Using BCE loss does not mean we are trying to maximize the reported confidence level; both overconfidence and underconfidence are undesirable calibration errors. The aim is to minimize the loss, i.e. maximize the score, which encourages the reporter to provide confidence levels as close as possible to the real confidence level \(P(L=1)\).
Let’s start from the beginning. We want to set up a game in which a reporter announces a confidence level that is scored against the ground-truth confidence \(P(L=1)\), and we would like the reported confidence to be as close as possible to the real one. To achieve this, we have to design a scoring rule that rewards the reporter appropriately. We know that the log probability score is a strictly proper scoring rule, so its expected value is maximized only when the reporter gives the true distribution as its probabilistic forecast. Let \(c(x) = \max(f(x), 1 - f(x))\), where \(f(x)\) is the predicted probability of the positive class, and recall the binary log probability scoring rule:
\[\begin{equation*} S(c, L) = \begin{cases} \log(c(x)) &\text{if $L$ = 1}\\ \log(1 - c(x)) &\text{if $L$ = 0} \end{cases} \end{equation*}\]We know that the expected score is uniquely maximized when \(c(x) = P(L = 1)\). Let’s evaluate the cases:
- \(Y = 0\) and \(\hat{Y} = 0\): Our prediction is correct (\(L = 1\)) and \(c(x) = 1 - f(x)\), so the scoring function outputs \(\log(c(x)) = \log(1-f(x))\)
- \(Y = 0\) and \(\hat{Y} = 1\): Our prediction is wrong (\(L = 0\)) and \(c(x) = f(x)\), so the scoring function outputs \(\log(1 - c(x)) = \log(1 - f(x))\)
- \(Y = 1\) and \(\hat{Y} = 0\): Our prediction is wrong (\(L = 0\)) and \(c(x) = 1 - f(x)\), so the scoring function outputs \(\log(1 - c(x)) = \log(f(x))\)
- \(Y = 1\) and \(\hat{Y} = 1\): Our prediction is correct (\(L = 1\)) and \(c(x) = f(x)\), so the scoring function outputs \(\log(c(x)) = \log(f(x))\)
Let’s re-formulate the scoring function:
\[S(f, Y) = \begin{cases} \log (f(x)) & \text{if}\; Y = 1 \\ \log(1 - f(x)) & \text{if} \; Y = 0 \end{cases}\]where \(f(x)\) is the predicted probability of the positive class and \(x\) is the input data.
This is negative BCE! As claimed, it turns out that minimization of binary cross-entropy loss (or maximization of negative BCE) encourages not only the correctness of classification \(f(x)\), but also the truthfulness of the max-prob confidence \(c(x) = \max (f(x), 1 - f(x))\).
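As a concrete check of this identity, the following sketch (with made-up probabilities and labels) computes negative BCE from \((f, y)\) and compares it with the binary log score applied to the max-prob confidence \(c\) and the correctness variable \(L\); the two coincide elementwise:

```python
import numpy as np

f = np.array([0.92, 0.35, 0.60, 0.08])   # hypothetical predicted P(Y = 1 | x)
y = np.array([1, 0, 0, 0])               # true labels

# Negative BCE, i.e. the binary log score evaluated at (f, y)
neg_bce = y * np.log(f) + (1 - y) * np.log(1 - f)

# The same quantity expressed through max-prob confidence c and correctness L
y_hat = (f >= 0.5).astype(int)
L = (y_hat == y).astype(int)
c = np.maximum(f, 1 - f)
score_c_L = L * np.log(c) + (1 - L) * np.log(1 - c)

print(np.allclose(neg_bce, score_c_L))   # True: the two formulations coincide
```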
6) Conclusion
In this post, we have discussed the concept of scoring rules and their importance in evaluating the performance of probabilistic forecasts. We have defined scoring rules and given examples for proper scoring rules. We have shown that log probability and Brier scoring rules are strictly proper. We have also discussed the relationship between scoring rules and predictive certainty. We have shown that the negative of Binary Cross-Entropy (BCE) loss is a proper scoring rule for max-prob confidence. As a result, we have seen that the minimization of BCE loss encourages not only the correctness of classification, but also the truthfulness of the max-prob confidence. Negative Log-Likelihood (NLL), which is the same as Cross Entropy Loss (they have slightly different input expectations in PyTorch, so be careful), is also a proper scoring rule for max-prob confidence. It is a popular metric for evaluating predictive uncertainty as well (Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles).
References: Bálint Mucsányi, Michael Kirchhof, Elisa Nguyen, Alexander Rubinstein, Seong Joon Oh (2023): “Proper/Strictly Proper Scoring Rule”; in: Trustworthy Machine Learning; URL: https://trustworthyml.io/; DOI: 10.48550/arXiv.2310.08215.