Performance metrics are a part of every machine learning pipeline. Which of the following are not performance metrics used in machine learning?
Most supervised machine learning tasks can be framed as either regression or classification, and the performance metrics split the same way.
Metrics are used to monitor and measure the performance of a model (during training and testing) and, unlike loss functions, do not need to be differentiable.
Regression metrics
Regression models have continuous output, so we need a metric based on some measure of distance between the predicted values and the ground truth.
In order to evaluate Regression models, we'll discuss these metrics in detail:
* Mean Absolute Error (MAE),
* Mean Squared Error (MSE),
* Root Mean Squared Error (RMSE),
* R² (R-Squared).
Mean Squared Error (MSE)
Mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model.
A few key points related to MSE:
* It's differentiable, so it can also serve directly as a loss function for gradient-based optimization.
* It squares every error, so errors larger than 1 are magnified (and errors smaller than 1 are shrunk), which can overstate how bad the model is when a few predictions are far off.
* Error interpretation has to be done with the squaring factor (scale) in mind. For example, in our Boston Housing regression problem, we got MSE=21.89, which is expressed in (Prices)² rather than in the units of the price itself.
* Due to the squaring factor, it's fundamentally more sensitive to outliers than other metrics.
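The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper name `mse_score` and the sample values are assumptions made for this example (they are not the Boston Housing data).

```python
def mse_score(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mse_score(y_true, y_pred)
print(mse)  # 0.875

# Replacing one prediction with a far-off value shows the outlier sensitivity.
y_pred_outlier = [2.5, 5.0, 4.0, 17.0]
mse_outlier = mse_score(y_true, y_pred_outlier)
print(mse_outlier)  # 25.625
```

A single prediction that is off by 10 moves the score from 0.875 to 25.625, which is the outlier sensitivity described in the last bullet.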
Mean Absolute Error (MAE)
Mean Absolute Error is the average of the absolute difference between the ground truth and the predicted values.
A few key points for MAE:
* It's more robust towards outliers than MSE, since it doesn't exaggerate errors.
* It gives us a measure of how far the predictions were from the actual output. However, since MAE uses the absolute value of the residual, it doesn't give us an idea of the direction of the error, i.e. whether we're under-predicting or over-predicting the data.
* Error interpretation needs no second thoughts, as the error is expressed in the original units of the variable.
* MAE is not differentiable at zero, as opposed to MSE, which is differentiable everywhere.
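A minimal sketch of MAE in plain Python follows. The helper names and sample values are assumptions made for illustration. The second helper, a signed mean error, is not part of MAE; it is included only to show the direction information that the absolute value discards.

```python
def mae_score(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_signed_error(y_true, y_pred):
    """Average of (prediction - target); positive means over-prediction."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative values only.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mae_score(y_true, y_pred)
bias = mean_signed_error(y_true, y_pred)
print(mae)   # 0.75
print(bias)  # 0.5
```

Here the MAE of 0.75 is in the same units as the target, while the positive signed error suggests the predictions are, on average, too high.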
Root Mean Squared Error (RMSE)
Root Mean Squared Error corresponds to the square root of the average of the squared difference between the target value and the value predicted by the regression model.
A few key points related to RMSE:
* It retains the differentiable property of MSE.
* It tempers the distortion that squaring introduces in MSE by taking the square root.
* Error interpretation can be done smoothly, since the scale is now the same as that of the target variable.
* Since the square root brings the error back to the original scale, outliers inflate it less dramatically than they inflate MSE, although RMSE is still more sensitive to outliers than MAE.
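As a rough sketch, RMSE is just the square root of the mean of the squared errors. The helper name `rmse_score` and the sample values below are assumptions made for this example.

```python
import math

def rmse_score(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)

# Illustrative values only.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

rmse = rmse_score(y_true, y_pred)
print(round(rmse, 4))  # 0.9354
```

The result is back in the units of the target, so it can be read alongside the raw values in a way the squared score cannot.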
R² Coefficient of determination
R² Coefficient of determination actually works as a post metric, meaning it's a metric that's calculated using other metrics.
The point of even calculating this coefficient is to answer the question "How much (what %) of the total variation in Y (target) is explained by the variation in X (regression line)?"
A few intuitions related to R² results:
If the sum of squared errors of the regression line is small => R² will be close to 1 (ideal), meaning the regression was able to capture nearly all of the variance in the target variable.
Conversely, if the sum of squared errors of the regression line is high (close to that of simply predicting the mean) => R² will be close to 0, meaning the regression wasn't able to capture any variance in the target variable.
You might think that the range of R² is [0, 1], but it's actually (-∞, 1], because the ratio of the squared errors of the regression line and of the mean can surpass the value 1 if the squared error of the regression line is too high (> squared error of the mean).
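The ratio described above corresponds to the usual definition R² = 1 - SS_res / SS_tot, where SS_res is the squared error of the regression line and SS_tot is the squared error of always predicting the mean. A minimal sketch follows; the helper name `r2_score_sketch` and the sample values are assumptions made for illustration.

```python
def r2_score_sketch(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, using the mean of y_true as the baseline."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = [3.0, 5.0, 2.5, 7.0]

r2_good = r2_score_sketch(y_true, [2.5, 5.0, 4.0, 8.0])
r2_mean = r2_score_sketch(y_true, [4.375] * 4)          # always predict the mean
r2_bad = r2_score_sketch(y_true, [7.0, 2.5, 5.0, 3.0])  # worse than the mean

print(round(r2_good, 4))  # 0.7241
print(r2_mean)            # 0.0
print(r2_bad < 0)         # True
```

The three cases mirror the intuitions above: a reasonable fit lands near 1, predicting the mean gives 0, and a model worse than the mean goes negative.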
Classification metrics
Classification problems are one of the world's most widely researched areas. Use cases are present in almost all production and industrial environments. Speech recognition, face recognition, text classification -- the list is endless.
Classification models have discrete output, so we need a metric that compares discrete classes in some form. Classification Metrics evaluate a model's performance and tell you how good or bad the classification is, but each of them evaluates it in a different way.
So in order to evaluate Classification models, we'll discuss these metrics in detail:
* Accuracy
* Confusion Matrix (not a metric but fundamental to others)
* Precision and Recall
* F1-score
* AU-ROC
Accuracy
Classification accuracy is perhaps the simplest metric to use and implement and is defined as the number of correct predictions divided by the total number of predictions, multiplied by 100.
We can implement this by comparing ground truth and predicted values in a loop or simply utilizing the scikit-learn module to do the heavy lifting for us (not so heavy in this case).
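The loop-based version mentioned above might look like the sketch below. The helper name `accuracy_percent` and the labels are assumptions made for illustration; in scikit-learn, `sklearn.metrics.accuracy_score` returns the same quantity as a fraction rather than a percentage.

```python
def accuracy_percent(y_true, y_pred):
    """Correct predictions divided by total predictions, times 100."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_percent(y_true, y_pred)
print(acc)  # 75.0
```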
Confusion Matrix
A Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions. In the convention used here, each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class; some libraries, such as scikit-learn, use the transposed layout (rows are actual classes, columns are predicted classes). The Confusion Matrix is not exactly a performance metric but rather a basis on which other metrics evaluate the results.
Each cell in the confusion matrix represents an evaluation factor. Let's understand these factors one by one:
* True Positive (TP) signifies how many positive class samples your model predicted correctly.
* True Negative (TN) signifies how many negative class samples your model predicted correctly.
* False Positive (FP) signifies how many negative class samples your model predicted incorrectly. This factor represents Type-I error in statistical nomenclature. Where this error sits in the confusion matrix depends on the choice of the null hypothesis.
* False Negative (FN) signifies how many positive class samples your model predicted incorrectly. This factor represents Type-II error in statistical nomenclature. Where this error sits in the confusion matrix also depends on the choice of the null hypothesis.
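The four factors above can be counted directly from the labels. The sketch below assumes a binary problem with the positive class encoded as 1 and the negative class as 0; the helper name `confusion_counts` and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            # Predicted positive: correct only if the truth is positive too.
            counts["TP" if t == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the truth is actually positive.
            counts["FN" if t == positive else "TN"] += 1
    return counts

# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```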
Precision
Precision is the ratio of true positives to the total number of positives predicted: Precision = TP / (TP + FP).
Recall/Sensitivity/Hit-Rate
Recall is essentially the ratio of true positives to all the positives in the ground truth: Recall = TP / (TP + FN).
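Both ratios can be computed from confusion-matrix counts, as in the sketch below. The helper names and counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience, not a universal convention.

```python
def precision_score_sketch(tp, fp):
    """TP / (TP + FP); returns 0.0 if nothing was predicted positive (assumption)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

def recall_score_sketch(tp, fn):
    """TP / (TP + FN); returns 0.0 if there are no actual positives (assumption)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0

# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision = precision_score_sketch(tp=30, fp=10)
recall = recall_score_sketch(tp=30, fn=20)
print(precision)  # 0.75
print(recall)     # 0.6
```

With these counts, three out of four positive predictions are right (precision), but only 60% of the actual positives are found (recall).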
Precision-Recall tradeoff
To improve your model, you can usually raise either precision or recall, but raising one tends to come at the expense of the other. If you try to reduce cases of non-cancerous patients being labeled as cancerous (FP/Type-I), no direct effect will take place on cancerous patients being labeled as non-cancerous (FN/Type-II).
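One way to see the tradeoff is to sweep the decision threshold of a classifier that outputs scores. The sketch below uses made-up labels and scores, and the helper `precision_recall_at` is hypothetical; it treats a score at or above the threshold as a positive prediction.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 is an assumption
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # 0.0 is an assumption
    return precision, recall

# Illustrative labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]

results = {thr: precision_recall_at(labels, scores, thr) for thr in (0.3, 0.5, 0.75)}
for thr, (p, r) in results.items():
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.75: precision=1.00, recall=0.50
```

In this toy data, raising the threshold removes false positives (precision rises) but also drops true positives (recall falls).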
F1-score
The F1-score metric uses a combination of precision and recall. In fact, the F1 score is the harmonic mean of the two.
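The harmonic mean can be written as F1 = 2 * P * R / (P + R). A minimal sketch follows, with a hypothetical helper name and made-up precision and recall values; returning 0.0 when both inputs are zero is an assumption.

```python
def f1_score_sketch(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)

f1_balanced = f1_score_sketch(0.75, 0.6)
f1_lopsided = f1_score_sketch(1.0, 0.1)
print(round(f1_balanced, 4))  # 0.6667
print(round(f1_lopsided, 4))  # 0.1818
```

Unlike an arithmetic mean (which would give 0.55 in the second case), the harmonic mean stays close to the weaker of the two values, so a model cannot score well on F1 by maximizing only one of them.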
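AUROC (Area Under the Receiver Operating Characteristic curve)
Better known as the AUC-ROC score/curve. It makes use of true positive rates (TPR) and false positive rates (FPR): the ROC curve plots TPR against FPR as the decision threshold varies, and AUROC is the area under that curve.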
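As a rough sketch of how TPR and FPR combine into a single score, the code below builds ROC points by sweeping every distinct score as a threshold and then integrates with the trapezoidal rule. The helper names, labels, and scores are assumptions made for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, sweeping thresholds from high to low."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]

points = roc_points(labels, scores)
auroc = area_under_curve(points)
print(auroc)  # 0.8125
```

A score of 0.5 would correspond to random ranking and 1.0 to a perfect ranking of positives above negatives; here 13 of the 16 positive/negative pairs are ordered correctly, which matches 0.8125.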