
python - Interpreting AUC, accuracy and f1-score on an unbalanced dataset

I am trying to understand why AUC is a better metric than classification accuracy when the dataset is unbalanced.
Suppose a dataset contains 1000 examples of 3 classes, as follows:

a = [[1.0, 0, 0]]*950 + [[0, 1.0, 0]]*30 + [[0, 0, 1.0]]*20

Clearly, this data is unbalanced.
A naive strategy is to predict every point belonging to the first class.
Suppose we have a classifier with the following predictions:

b = [[0.7, 0.1, 0.2]]*1000

With the true labels in the list a and the predictions in the list b, the classification accuracy is 0.95.
So one might believe the model is doing really well on the classification task, but it is not: it assigns every point to a single class.
Therefore, the AUC metric is suggested for evaluating an unbalanced dataset.
If we compute the AUC using the TF Keras AUC metric, we obtain ~0.96.
If we compute the f1-score using sklearn's f1_score by setting b = [[1, 0, 0]]*1000, we obtain 0.95.
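For reference, this is roughly how I obtain these numbers (a minimal sketch assuming TensorFlow 2.x and scikit-learn; accuracy is taken on the argmax labels, and tf.keras.metrics.AUC flattens the one-hot labels and scores by default, so the exact value depends slightly on its threshold binning):

import numpy as np
import tensorflow as tf
from sklearn.metrics import accuracy_score

a = np.array([[1.0, 0, 0]]*950 + [[0, 1.0, 0]]*30 + [[0, 0, 1.0]]*20)
b = np.array([[0.7, 0.1, 0.2]]*1000)

# accuracy on the hard (argmax) labels
print(accuracy_score(a.argmax(axis=1), b.argmax(axis=1)))  # 0.95

# tf.keras.metrics.AUC flattens the one-hot labels and scores
auc = tf.keras.metrics.AUC()
auc.update_state(a, b)
print(auc.result().numpy())  # ~0.96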

Now I am a little confused, because all the metrics (accuracy, AUC and f1-score) show high values, suggesting that the model is really good at the prediction task (which is not the case here).

What am I missing here, and how should these values be interpreted?
Thanks.


1 Answer


You are very likely using the average='micro' parameter to calculate the F1-score. According to the docs, specifying 'micro' as the averaging strategy will:

Calculate metrics globally by counting the total true positives, false negatives and false positives.

In classification tasks where every test case is guaranteed to be assigned to exactly one class, computing a micro F1-score is equivalent to computing the accuracy score. Just check it out:

from sklearn.metrics import accuracy_score, f1_score

y_true = [[1, 0, 0]]*950 + [[0, 1, 0]]*30 + [[0, 0, 1]]*20
y_pred = [[1, 0, 0]]*1000

print(accuracy_score(y_true, y_pred)) # 0.95

print(f1_score(y_true, y_pred, average='micro')) # 0.9500000000000001

You basically computed the same metric twice. By specifying average='macro' instead, the F1-score will be computed for each label independently first, and then averaged:

print(f1_score(y_true, y_pred, average='macro')) # 0.3247863247863248

As you can see, the overall F1-score depends on the averaging strategy, and a macro F1-score of less than 0.33 is a clear indicator of a model's deficiency in the prediction task.
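
To see that "compute per label, then average" is literally what happens, a short sketch: ask f1_score for the per-label scores with average=None and take their unweighted mean yourself (zero_division=0 just silences the warning for the two classes that are never predicted):

import numpy as np
from sklearn.metrics import f1_score

y_true = [[1, 0, 0]]*950 + [[0, 1, 0]]*30 + [[0, 0, 1]]*20
y_pred = [[1, 0, 0]]*1000

per_label = f1_score(y_true, y_pred, average=None, zero_division=0)
print(per_label)           # [0.97435897 0.         0.        ]
print(np.mean(per_label))  # 0.3247863247863248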


EDIT:

Since the OP asked when to choose which strategy, and I think it might be useful for others as well, I will try to elaborate a bit on this issue.

scikit-learn actually implements four different averaging strategies for metrics that support averaging in multiclass and multilabel classification tasks. Conveniently, classification_report returns all of the strategies that apply to a given classification task for precision, recall and F1-score:

from sklearn.metrics import classification_report

# The same example, but with integer class labels instead of nested lists.
# This prevents sklearn from interpreting the problem as a multilabel task.
y_true = [0]*950 + [1]*30 + [2]*20
y_pred = [0]*1000

print(classification_report(y_true, y_pred, zero_division=0))

######################### output ####################

              precision    recall  f1-score   support

           0       0.95      1.00      0.97       950
           1       0.00      0.00      0.00        30
           2       0.00      0.00      0.00        20

    accuracy                           0.95      1000
   macro avg       0.32      0.33      0.32      1000
weighted avg       0.90      0.95      0.93      1000

Each of them provides a different perspective, depending on how much emphasis one puts on the class distribution.

  1. micro average is a global strategy that essentially ignores the distinction between classes. This might be useful or justified if one is really just interested in overall performance in terms of true positives, false negatives and false positives, and is not concerned about differences between the classes. As hinted before, if the underlying problem is not a multilabel classification task, this actually equals the accuracy score (which is also why classification_report returns accuracy instead of micro avg here); see the sketch after this list.

  2. macro average as a strategy will calculate each metric for each label separately and return their unweighted mean. This is suitable if each class is of equal importance and the result should not be skewed in favor of any of the classes in the dataset.

  3. weighted average will also first calculate each metric for each label separately, but the average is weighted by the classes' support. This is desirable if the importance of a class is proportional to its frequency, i.e. an underrepresented class is considered less important; see the sketch after this list.

  4. samples average is only meaningful for multilabel classification and therefore not returned by classification_report in this example and also not discussed here ;)
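
To make points 1 and 3 concrete, a short sketch using the integer-label example from above: the micro F1-score coincides with accuracy (0.95), and the weighted average is just the per-label F1-scores weighted by class support (which reproduces the 0.93 shown in the report, before rounding):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = [0]*950 + [1]*30 + [2]*20
y_pred = [0]*1000

# 1. micro average equals accuracy for a single-label problem
print(accuracy_score(y_true, y_pred))                              # 0.95
print(f1_score(y_true, y_pred, average='micro', zero_division=0))  # 0.95

# 3. weighted average = per-label F1 weighted by class support
per_label = f1_score(y_true, y_pred, average=None, zero_division=0)
print(np.average(per_label, weights=[950, 30, 20]))                   # ~0.9256
print(f1_score(y_true, y_pred, average='weighted', zero_division=0))  # ~0.9256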

So the choice of averaging strategy, and which resulting number to trust, really depends on the importance of the classes: do you care about class differences at all (if not --> micro average)? If so, are all classes equally important (if yes --> macro average), or are classes with higher support more important (--> weighted average)?

