Comparing Regression vs Classification Losses on a Simple Low-Dimensional Example
A simple XOR network with one hidden layer has a known loss surface: https://arxiv.org/abs/1804.02411
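For reference, here is a minimal sketch of such a network in PyTorch. The hidden width, activation, loss, and learning rate are my own choices for illustration, not necessarily the configuration analyzed in the paper:

```python
import torch
import torch.nn as nn

# A minimal 2-2-1 XOR network with one hidden layer.
# Width/activation here are assumptions for illustration only.
xor_net = nn.Sequential(
    nn.Linear(2, 2),   # hidden layer
    nn.Tanh(),
    nn.Linear(2, 1),   # output layer
)

# The four XOR input/output pairs are the entire "dataset".
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

opt = torch.optim.SGD(xor_net.parameters(), lr=0.5)
loss_fn = nn.MSELoss()
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(xor_net(X), y)
    loss.backward()
    opt.step()
print(loss.item())
```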
In this post, we will explore how regression and classification loss functions affect the geometry of the loss surface we wish to optimize. More specifically, how does the choice of loss function affect the density/sparsity of the gradient in the last layer, and how does that affect the learning dynamics?
How does the loss function affect the optimal batch size and learning rate? First, we must have some intuition on how batch size and learning rate affect each other.
In standard SGD, the gradient of the loss with respect to the weights tells us the direction to update the weights, and the learning rate tells us how far to go.
Our goal with every gradient update is to make the weights a better fit to the dataset, as judged by the loss function. The larger the batch size, the more likely it is that the datapoints inside will be representative of the whole dataset, and thus we are more confident that the gradient of the batch represents the right direction for fitting the whole dataset. The more confidence we have in the direction we should go, the bigger the steps we can take to get there (at least, initially).
Offsetting this is the need for the learning rate to be sufficiently small to prevent overshooting the minima. Small batch sizes are better suited to small learning rates, and this works because they make more updates per epoch: it turns out that learning a dataset while getting feedback/updates after every small batch is easier than trying to learn the whole dataset before getting any feedback. Large batch sizes, on the other hand, can use higher learning rates initially, but they still need the learning rate to shrink as training approaches a minimum to avoid overshooting it.
The two extremes of batch size are when: 1) The batch contains just a single example (vanilla SGD). This one datapoint could be noisy or an outlier compared to the rest of the dataset, so we have little confidence in the direction of the update and would want a smaller learning rate.
2) The batch contains the whole training dataset. Since all training points are considered for a single gradient update, we are the most confident that the direction of the update will be useful for fitting the whole dataset, so we would want a larger learning rate. The simulation sketched below illustrates how the spread of batch gradients shrinks as the batch grows.
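Here is a small numpy sketch of that effect; the toy regression dataset, batch sizes, and naming are my own choices for illustration. It samples many batches of each size and measures how much their gradients scatter around the full-dataset gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression dataset: y = 3x + noise, and a deliberately bad weight w.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=1000)
w = 0.0

def grad(idx):
    """Gradient of the MSE loss w.r.t. w over the examples in idx."""
    err = w * X[idx, 0] - y[idx]
    return np.mean(2 * err * X[idx, 0])

full = grad(np.arange(len(y)))  # the "whole dataset" gradient
for bs in (1, 10, 100, 1000):
    # Sample many batches of this size and see how far their gradients
    # scatter around the full-batch gradient.
    g = [grad(rng.choice(len(y), size=bs, replace=False)) for _ in range(500)]
    print(f"batch={bs:5d}  mean={np.mean(g):+.3f}  std={np.std(g):.3f}  (full={full:+.3f})")
```

The spread shrinks roughly like one over the square root of the batch size, which is the intuition behind pairing larger batches with larger learning rates.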
How do different loss functions used for multiclass classification affect the loss landscape?
Regression Losses for Classification
There is no reason why you can't use MSE or MAE for a classification problem as long as you throw a softmax on there. A classification problem assumes that there is only a single instance of a single class present in each example, so we want a probability distribution over the predicted classes. For example, if an image is either a plane, a train, or a car, it wouldn't make any sense to predict that the object in the image is 50% plane, 50% train, and 25% car. Rather, you want the probabilities to add up to 100%.
Since MAE is linear and MSE is quadratic in the error, in general the loss will be some power of the error (side note: the p in p-norm refers to exactly this exponent).
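As a quick sketch of what that looks like in PyTorch (the logits, class index, and variable names here are arbitrary examples of mine): cross-entropy consumes the raw logits directly, while the MSE/MAE variants compare softmaxed probabilities against the one-hot target.

```python
import torch
import torch.nn.functional as F

# One example over 3 classes; the correct class is index 0.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])
one_hot = F.one_hot(target, num_classes=3).float()

probs = F.softmax(logits, dim=1)

ce  = F.cross_entropy(logits, target)  # softmax + negative log-likelihood in one call
mse = F.mse_loss(probs, one_hot)       # squared error on the probabilities
mae = F.l1_loss(probs, one_hot)        # absolute error on the probabilities

print(ce.item(), mse.item(), mae.item())
```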
A similar paper uses external memory to cache states of a transformer and distills them down so that you can save on training data: https://arxiv.org/abs/2303.15553. “MoViT leverages an external memory structure to cache history attention snapshots during the training stage. To prevent overfitting, we incorporate an innovative memory update scheme, attention temporal moving average, to update the stored external memories with the historical moving average. For inference speedup, we design a prototypical attention learning method to distill the external memory into smaller representative subsets.”
Q&A with ChatGPT
“Categorical cross-entropy is used in classification tasks because it measures the difference between two probability distributions—predicted class probabilities and actual class labels. This aligns with the goal of classification, which is to predict the probability distribution over classes accurately. MSE or MAE are better suited for regression tasks where the objective is to minimize the difference between predicted continuous values and actual values. Using MSE or MAE in classification would not capture the probabilistic nature of the problem effectively.”
What if we did a softmax before doing MAE or MSE? “Applying a softmax before using MSE or MAE would ensure that the output is a probability distribution. However, MSE or MAE would still be suboptimal for classification because they don’t penalize incorrect class predictions as effectively as categorical cross-entropy. Categorical cross-entropy directly measures the log likelihood of the correct class and better captures the difference between predicted and actual class probabilities, leading to better convergence and performance in classification tasks.”
Why would it not penalize incorrect predictions? “MSE and MAE measure the average difference between predicted and actual values. When applied to classification, even after a softmax, these metrics do not emphasize the correct class as strongly as categorical cross-entropy does. Categorical cross-entropy increases the penalty exponentially for incorrect predictions, focusing more on maximizing the probability of the correct class, which aligns better with the objectives of classification. This exponential penalty helps the model to learn faster and more accurately in distinguishing between classes.”
So here is the key to understanding cross entropy. If we have 100 classes to predict over, then for any given input it doesn’t matter what the predictions for the 99 wrong classes are! Cross entropy only ends up weighing your prediction for the correct class, so think about what that does to the weights.
When using one-hot encoded labels, the incorrect probabilities are multiplied by 0. The key aspect of categorical cross-entropy is the log function on the predicted probability of the correct class.
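A quick numpy check of that claim with a hypothetical 4-class example: two predictions that assign the same probability to the correct class, but very different probabilities to the wrong classes, receive exactly the same cross-entropy.

```python
import numpy as np

def cce(y_one_hot, p):
    """Categorical cross-entropy; only the correct-class term survives the sum."""
    return -np.sum(y_one_hot * np.log(p))

y = np.array([0.0, 1.0, 0.0, 0.0])  # one-hot label: the correct class is index 1

# Same probability on the correct class (0.4), very different spread over the rest.
p_a = np.array([0.58, 0.4, 0.01, 0.01])
p_b = np.array([0.20, 0.4, 0.20, 0.20])

print(cce(y, p_a), cce(y, p_b))  # identical: both equal -log(0.4) ≈ 0.916
```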
Given a one-hot encoded true label $\mathbf{y}$ and a predicted probability distribution $\mathbf{p}$, categorical cross-entropy is defined as:
\[\text{CCE}(\mathbf{y}, \mathbf{p}) = -\sum_{i} y_i \log(p_i)\]
Since $\mathbf{y}$ is one-hot encoded, only the term corresponding to the correct class $j$ remains:
\[\text{CCE} = -\log(p_j)\]
This means:
- When the predicted probability $p_j$ of the correct class is close to 1, $\log(p_j)$ is close to 0, resulting in a small penalty.
- When $p_j$ is close to 0, $\log(p_j)$ becomes a large negative number, resulting in a large positive penalty when multiplied by -1.
Because $-\log(p_j)$ grows without bound as $p_j$ approaches 0, incorrect predictions (where $p_j$ is low) are penalized much more heavily than correct predictions (where $p_j$ is high), which effectively drives the learning process to favor high confidence in the correct class. MSE and MAE do not have this property; they penalize deviations only quadratically or linearly, which is why they are often less effective for classification tasks.
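To see how much harder the log pushes back as $p_j$ falls, here is a small numpy sketch comparing only the correct-class term of each loss (the one-hot target for that class is 1; the wrong-class terms that MSE/MAE also include are ignored here):

```python
import numpy as np

# Per-example penalty on the correct class as its predicted probability p_j drops.
for p_j in (0.9, 0.5, 0.1, 0.01, 0.001):
    cce = -np.log(p_j)      # cross-entropy term for the correct class
    mse = (1.0 - p_j) ** 2  # squared error against the one-hot target of 1
    mae = abs(1.0 - p_j)    # absolute error against the one-hot target of 1
    print(f"p_j={p_j:6.3f}  CCE={cce:7.3f}  MSE={mse:.3f}  MAE={mae:.3f}")
```

The CCE column blows up as $p_j$ goes to 0, while the MSE and MAE penalties saturate at 1.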