7.5. Batch Normalization

Training deep neural nets is difficult. And getting them to converge in a reasonable amount of time can be tricky. In this section, we describe batch normalization (BN), a popular and effective technique that consistently accelerates the convergence of deep nets. Together with residual blocks—covered in Section 7.6—BN has made it possible for practitioners to routinely train networks with over 100 layers.

7.5.1. Training Deep Networks

To motivate batch normalization, let us review a few practical challenges that arise when training ML models and neural nets in particular.

  1. Choices regarding data preprocessing often make an enormous difference in the final results. Recall our application of multilayer perceptrons to predicting house prices (sec_kaggle_house). Our first step when working with real data was to standardize our input features to each have a mean of zero and variance of one. Intuitively, this standardization plays nicely with our optimizers because it puts the parameters a priori at a similar scale.

  2. For a typical MLP or CNN, as we train, the activations in intermediate layers may take values with widely varying magnitudes—both along the layers from the input to the output, across nodes in the same layer, and over time due to our updates to the model’s parameters. The inventors of batch normalization postulated informally that this drift in the distribution of activations could hamper the convergence of the network. Intuitively, we might conjecture that if one layer has activation values that are 100x those of another layer, this might necessitate compensatory adjustments in the learning rates.

  3. Deeper networks are complex and easily capable of overfitting. This means that regularization becomes more critical.

Batch normalization is applied to individual layers (optionally, to all of them) and works as follows: In each training iteration, we first normalize the inputs (of batch normalization) by subtracting their mean and dividing by their standard deviation, where both are estimated based on the statistics of the current minibatch. Next, we apply a scaling coefficient and a scaling offset. It is precisely due to this normalization based on batch statistics that batch normalization derives its name.

Note that if we tried to apply BN with minibatches of size \(1\), we would not be able to learn anything. That is because after subtracting the means, each hidden node would take value \(0\)! As you might guess, since we are devoting a whole section to BN, with large enough minibatches the approach proves effective and stable. One takeaway here is that when applying BN, the choice of minibatch size may be even more significant than without BN.
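
To see the degenerate case concretely: with a single example \(\mathbf{x}\) in the minibatch, the sample mean is the example itself, so

\[\hat{\mathbf{\mu}} = \mathbf{x} \quad \text{and hence} \quad \mathbf{x} - \hat{\mathbf{\mu}} = \mathbf{0},\]

no matter what the input was. Only with several examples per minibatch do the normalized activations carry any information about the data.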

Formally, BN transforms the activations at a given layer \(\mathbf{x}\) according to the following expression:

(7.5.1)\[\mathrm{BN}(\mathbf{x}) = \mathbf{\gamma} \odot \frac{\mathbf{x} - \hat{\mathbf{\mu}}}{\hat{\mathbf{\sigma}}} + \mathbf{\beta}\]

Here, \(\hat{\mathbf{\mu}}\) is the minibatch sample mean and \(\hat{\mathbf{\sigma}}\) is the minibatch sample standard deviation. After applying BN, the resulting minibatch of activations has zero mean and unit variance. Because the choice of unit variance (vs. some other magic number) is arbitrary, we commonly include coordinate-wise scaling coefficients \(\mathbf{\gamma}\) and offsets \(\mathbf{\beta}\). Consequently, the activation magnitudes for intermediate layers cannot diverge during training because BN actively centers and rescales them back to a given mean and size (via \(\hat{\mathbf{\mu}}\) and \(\hat{\mathbf{\sigma}}\)). One piece of practitioner’s intuition/wisdom is that BN seems to allow for more aggressive learning rates.
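
Note, in addition, that because \(\mathbf{\gamma}\) and \(\mathbf{\beta}\) are learned jointly with the rest of the network, the layer can in principle undo the normalization entirely, since plugging the following choice into (7.5.1) recovers the identity mapping:

\[\mathbf{\gamma} = \hat{\mathbf{\sigma}} \quad \text{and} \quad \mathbf{\beta} = \hat{\mathbf{\mu}} \quad \Longrightarrow \quad \mathrm{BN}(\mathbf{x}) = \mathbf{x}.\]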

Formally, denoting a particular minibatch by \(\mathcal{B}\), we calculate \(\hat{\mathbf{\mu}}_\mathcal{B}\) and \(\hat{\mathbf{\sigma}}_\mathcal{B}\) as follows:

(7.5.2)\[\hat{\mathbf{\mu}}_\mathcal{B} \leftarrow \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}\text{ and }\hat{\mathbf{\sigma}}_\mathcal{B}^2 \leftarrow \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\mathbf{\mu}}_{\mathcal{B}})^2 + \epsilon\]

Note that we add a small constant \(\epsilon > 0\) to the variance estimate to ensure that we never attempt division by zero, even in cases where the empirical variance estimate might vanish. The estimates \(\hat{\mathbf{\mu}}_\mathcal{B}\) and \(\hat{\mathbf{\sigma}}_\mathcal{B}\) counteract the scaling issue by using noisy estimates of mean and variance. You might think that this noisiness should be a problem. As it turns out, it is actually beneficial.
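
As a tiny worked example (with made-up numbers), suppose that for a single scalar activation a minibatch contains the three values \(1\), \(2\), and \(6\). Then (7.5.2) gives

\[\hat{\mu}_\mathcal{B} = \frac{1 + 2 + 6}{3} = 3 \quad \text{and} \quad \hat{\sigma}_\mathcal{B}^2 = \frac{(1-3)^2 + (2-3)^2 + (6-3)^2}{3} + \epsilon \approx 4.67,\]

so the normalized values are approximately \(-0.93\), \(-0.46\), and \(1.39\): zero mean and unit variance, before the scale \(\mathbf{\gamma}\) and offset \(\mathbf{\beta}\) are applied.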

This beneficial effect of noise turns out to be a recurring theme in deep learning. For reasons that are not yet well characterized theoretically, various sources of noise in optimization often lead to faster training and less overfitting. While traditional machine learning theorists might buckle at this characterization, this variation appears to act as a form of regularization. In some preliminary research, [Teye et al., 2018] and [Luo et al., 2018] relate the properties of BN to Bayesian priors and penalties, respectively. In particular, this sheds some light on the puzzle of why BN works best for moderate minibatch sizes in the \(50\) to \(100\) range.

Fixing a trained model, you might (rightly) think that we would prefer to use the entire dataset to estimate the mean and variance. Once training is complete, why would we want the same image to be classified differently depending on the batch in which it happens to reside? During training, such exact calculation is infeasible because the activations for all data points change every time we update our model. However, once the model is trained, we can calculate the means and variances of each layer’s activations based on the entire dataset. Indeed, this is standard practice for models employing batch normalization, and thus BN layers function differently in training mode (normalizing by minibatch statistics) and in prediction mode (normalizing by dataset statistics).
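
In practice, rather than making a separate pass over the dataset after training, the implementation below keeps exponential moving averages of the minibatch statistics during training and uses those fixed estimates at prediction time. Writing the momentum as \(\lambda\) (our code uses \(\lambda = 0.9\)), the running estimates are updated after each minibatch as

\[\hat{\mathbf{\mu}} \leftarrow \lambda\, \hat{\mathbf{\mu}} + (1 - \lambda)\, \hat{\mathbf{\mu}}_\mathcal{B} \quad \text{and} \quad \hat{\mathbf{\sigma}}^2 \leftarrow \lambda\, \hat{\mathbf{\sigma}}^2 + (1 - \lambda)\, \hat{\mathbf{\sigma}}_\mathcal{B}^2.\]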

We are now ready to take a look at how batch normalization works in practice.

7.5.2. Batch Normalization Layers

Batch normalization implementations for fully-connected layers and convolutional layers are slightly different. We discuss both cases below. Recall that one key difference between BN and other layers is that because BN operates on a full minibatch at a time, we cannot just ignore the batch dimension as we did before when introducing other layers.

7.5.2.1. Fully-Connected Layers

When applying BN to fully-connected layers, we usually insert BN after the affine transformation and before the nonlinear activation function. Denoting the input to the layer by \(\mathbf{x}\), the linear transform (with weights \(\mathbf{\theta}\)) by \(f_{\mathbf{\theta}}(\cdot)\), the activation function by \(\phi(\cdot)\), and the BN operation with parameters \(\mathbf{\beta}\) and \(\mathbf{\gamma}\) by \(\mathrm{BN}_{\mathbf{\beta}, \mathbf{\gamma}}\), we can express the computation of a BN-enabled, fully-connected layer \(\mathbf{h}\) as follows:

(7.5.3)\[\mathbf{h} = \phi(\mathrm{BN}_{\mathbf{\beta}, \mathbf{\gamma}}(f_{\mathbf{\theta}}(\mathbf{x})))\]

Recall that mean and variance are computed on the same minibatch \(\mathcal{B}\) on which the transformation is applied. Also recall that the scaling coefficient \(\mathbf{\gamma}\) and the offset \(\mathbf{\beta}\) are parameters that need to be learned jointly with the more familiar parameters \(\mathbf{\theta}\).
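
As a minimal sketch of this ordering in DJL (using the built-in BatchNorm block that we return to later in this section; the width of 120 units is arbitrary), a BN-enabled fully-connected layer can be assembled as follows. The LeNet variants in the remainder of this section repeat exactly this Linear, BatchNorm, activation pattern.

import ai.djl.nn.Activation;
import ai.djl.nn.SequentialBlock;
import ai.djl.nn.core.Linear;
import ai.djl.nn.norm.BatchNorm;

// Affine transformation, then batch normalization, then the nonlinearity,
// matching h = phi(BN(f_theta(x))) in (7.5.3)
SequentialBlock bnDenseLayer = new SequentialBlock()
        .add(Linear.builder().setUnits(120).build()) // f_theta
        .add(BatchNorm.builder().build())            // BN_{beta, gamma}
        .add(Activation::sigmoid);                   // phi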

7.5.2.2. Convolutional Layers

Similarly, with convolutional layers, we typically apply BN after the convolution and before the nonlinear activation function. When the convolution has multiple output channels, we need to carry out batch normalization for each of the outputs of these channels, and each channel has its own scale and shift parameters, both of which are scalars. Assume that our minibatches contain \(m\) examples each and that for each channel, the output of the convolution has height \(p\) and width \(q\). For convolutional layers, we carry out each batch normalization over the \(m \cdot p \cdot q\) elements per output channel simultaneously. Thus we collect the values over all spatial locations when computing the mean and variance and consequently (within a given channel) apply the same \(\hat{\mathbf{\mu}}\) and \(\hat{\mathbf{\sigma}}\) to normalize the values at each spatial location.
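
To make the axes concrete, the following throwaway sketch (with arbitrary toy dimensions) reduces over the batch axis and both spatial axes, leaving one mean and one variance per channel, which is exactly what the from-scratch implementation below does for 4-D inputs.

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;

try (NDManager manager = NDManager.newBaseManager()) {
    // Toy minibatch: m = 8 examples, 16 channels, 28 x 28 feature maps
    NDArray X = manager.randomNormal(new Shape(8, 16, 28, 28));

    // Reduce over the batch axis (0) and the spatial axes (2, 3), keeping
    // dimensions so the statistics broadcast back; shape is (1, 16, 1, 1)
    NDArray mean = X.mean(new int[]{0, 2, 3}, true);
    NDArray var = X.sub(mean).pow(2).mean(new int[]{0, 2, 3}, true);

    // Every one of the 8 * 28 * 28 positions within a channel is normalized
    // by that channel's statistics
    NDArray xHat = X.sub(mean).div(var.add(1e-5f).sqrt());
    System.out.println(xHat.getShape()); // (8, 16, 28, 28)
}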

7.5.2.3. Batch Normalization During Prediction

As we mentioned earlier, BN typically behaves differently in training mode and prediction mode. First, the noise in \(\hat{\mathbf{\mu}}\) and \(\hat{\mathbf{\sigma}}\) arising from estimating each on minibatches is no longer desirable once we have trained the model. Second, we might not have the luxury of computing per-batch normalization statistics, e.g., we might need to apply our model to make one prediction at a time.

Typically, after training, we use the entire dataset to compute stable estimates of the activation statistics and then fix them at prediction time. Consequently, BN behaves differently during training and at test time. Recall that dropout also exhibits this characteristic.

7.5.3. Implementation from Scratch

Below, we first load the relevant utilities and libraries needed to implement BatchNorm. After that, we implement a batch normalization layer with NDArrays from scratch:

%load ../utils/djl-imports
%load ../utils/plot-utils
%load ../utils/Training.java
%load ../utils/Accumulator.java

import ai.djl.basicdataset.cv.classification.*;
import org.apache.commons.lang3.ArrayUtils;

public NDList batchNormUpdate(NDArray X, NDArray gamma, NDArray beta,
                              NDArray movingMean, NDArray movingVar,
                              float eps, float momentum, boolean isTraining) {
    // Attach movingMean and movingVar to a sub-manager so that intermediate
    // computation values are closed at the end, avoiding a memory leak
    try (NDManager subManager = movingMean.getManager().newSubManager()) {
        movingMean.attach(subManager);
        movingVar.attach(subManager);
        NDArray xHat;
        NDArray mean;
        NDArray var;
        if (!isTraining) {
            // In prediction mode, directly use the mean and variance
            // obtained from the incoming moving average
            xHat = X.sub(movingMean).div(movingVar.add(eps).sqrt());
        } else {
            if (X.getShape().dimension() == 2) {
                // When using a fully-connected layer, calculate the mean and
                // variance on the feature dimension
                mean = X.mean(new int[]{0}, true);
                var = X.sub(mean).pow(2).mean(new int[]{0}, true);
            } else {
                // When using a two-dimensional convolutional layer, calculate
                // the mean and variance on the channel dimension (axis=1).
                // Here we need to maintain the shape of `X`, so that the
                // broadcast operation can be carried out later
                mean = X.mean(new int[]{0, 2, 3}, true);
                var = X.sub(mean).pow(2).mean(new int[]{0, 2, 3}, true);
            }
            // In training mode, the current mean and variance are used for the
            // standardization
            xHat = X.sub(mean).div(var.add(eps).sqrt());
            // Update the mean and variance of the moving average
            movingMean = movingMean.mul(momentum).add(mean.mul(1.0f - momentum));
            movingVar = movingVar.mul(momentum).add(var.mul(1.0f - momentum));
        }
        // Scale and shift
        NDArray Y = xHat.mul(gamma).add(beta);
        // Attach movingMean and movingVar back to the original manager
        // to keep their values
        movingMean.attach(subManager.getParentManager());
        movingVar.attach(subManager.getParentManager());
        return new NDList(Y, movingMean, movingVar);
    }
}

We can now create a proper BatchNorm layer. Our layer will maintain proper parameters corresponding to the scale gamma and shift beta, both of which will be updated in the course of training. Additionally, our layer will maintain a moving average of the means and variances for subsequent use during model prediction. The numFeatures parameter required by the BatchNorm instance is the number of outputs for a fully-connected layer and the number of output channels for a convolutional layer. The numDimensions parameter also required by this instance is 2 for a fully-connected layer and 4 for a convolutional layer.
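
For instance, the LeNet variant assembled in Section 7.5.4 below creates its batch normalization blocks as follows (the numbers simply mirror that network's layer shapes):

// After a convolution with 6 output channels: 4-D activations (batch, channel, height, width)
BatchNormBlock convBN = new BatchNormBlock(6, 4);

// After a fully-connected layer with 120 outputs: 2-D activations (batch, feature)
BatchNormBlock denseBN = new BatchNormBlock(120, 2);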

Putting aside the algorithmic details, note the design pattern underlying our implementation of the layer. Typically, we define the math in a separate function, say batchNormUpdate. We then integrate this functionality into a custom layer, whose code mostly addresses bookkeeping matters, such as moving data to the right device context, allocating and initializing any required variables, keeping track of running averages (here for mean and variance), etc. This pattern enables a clean separation of math from boilerplate code. Also note that for the sake of convenience we did not worry about automatically inferring the input shape here, thus we need to specify the number of features throughout. Do not worry, the DJL BatchNorm layer will take care of this for us.

public class BatchNormBlock extends AbstractBlock {

    private NDArray movingMean;
    private NDArray movingVar;
    private Parameter gamma;
    private Parameter beta;
    private Shape shape;

    // numFeatures: the number of outputs for a fully-connected layer
    // or the number of output channels for a convolutional layer.
    // numDimensions: 2 for a fully-connected layer and 4 for a convolutional layer.
    public BatchNormBlock(int numFeatures, int numDimensions) {
        if (numDimensions == 2) {
            shape = new Shape(1, numFeatures);
        } else {
            shape = new Shape(1, numFeatures, 1, 1);
        }
        // The scale parameter and the shift parameter involved in gradient
        // finding and iteration are initialized to 1 and 0 respectively
        gamma = addParameter(
                Parameter.builder()
                        .setName("gamma")
                        .setType(Parameter.Type.GAMMA)
                        .optShape(shape)
                        .build());
        beta = addParameter(
                Parameter.builder()
                        .setName("beta")
                        .setType(Parameter.Type.BETA)
                        .optShape(shape)
                        .build());
        // All the variables not involved in gradient finding and iteration are
        // initialized to 0. Create a base manager to maintain their values
        // throughout the entire training process
        NDManager manager = NDManager.newBaseManager();
        movingMean = manager.zeros(shape);
        movingVar = manager.zeros(shape);
    }

    @Override
    public String toString() {
        return "BatchNormBlock()";
    }

    @Override
    protected NDList forwardInternal(
            ParameterStore parameterStore,
            NDList inputs,
            boolean training,
            PairList<String, Object> params) {
        NDList result = batchNormUpdate(inputs.singletonOrThrow(),
                gamma.getArray(), beta.getArray(), this.movingMean, this.movingVar,
                1e-12f, 0.9f, training);
        // Close the previous NDArrays before assigning the new values
        if (training) {
            this.movingMean.close();
            this.movingVar.close();
        }
        // Save the updated `movingMean` and `movingVar`
        this.movingMean = result.get(1);
        this.movingVar = result.get(2);
        return new NDList(result.get(0));
    }

    @Override
    public Shape[] getOutputShapes(Shape[] inputs) {
        Shape[] current = inputs;
        for (Block block : children.values()) {
            current = block.getOutputShapes(current);
        }
        return current;
    }
}

7.5.4. Using a Batch Normalization LeNet

To see how to apply BatchNorm in context, below we apply it to a traditional LeNet model (Section 6.6). Recall that BN is typically applied after the convolutional layers and fully-connected layers but before the corresponding activation functions.

SequentialBlock net = new SequentialBlock()
        .add(Conv2d.builder()
                .setKernelShape(new Shape(5, 5))
                .setFilters(6)
                .build())
        .add(new BatchNormBlock(6, 4))
        .add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)))
        .add(Conv2d.builder()
                .setKernelShape(new Shape(5, 5))
                .setFilters(16)
                .build())
        .add(new BatchNormBlock(16, 4))
        .add(Activation::sigmoid)
        .add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)))
        .add(Blocks.batchFlattenBlock())
        .add(Linear.builder().setUnits(120).build())
        .add(new BatchNormBlock(120, 2))
        .add(Activation::sigmoid)
        .add(Blocks.batchFlattenBlock())
        .add(Linear.builder().setUnits(84).build())
        .add(new BatchNormBlock(84, 2))
        .add(Activation::sigmoid)
        .add(Linear.builder().setUnits(10).build());

Let’s initialize the batchSize, numEpochs and the relevant arrays to store the data from the training function.

int batchSize = 256;
int numEpochs = Integer.getInteger("MAX_EPOCH", 10);

double[] trainLoss;
double[] testAccuracy;
double[] epochCount;
double[] trainAccuracy;

epochCount = new double[numEpochs];
for (int i = 0; i < epochCount.length; i++) {
    epochCount[i] = i + 1;
}

As before, we will train our network on the Fashion-MNIST dataset. This code is virtually identical to that when we first trained LeNet (Section 6.6). The main difference is the considerably larger learning rate.

FashionMnist trainIter = FashionMnist.builder()
        .optUsage(Dataset.Usage.TRAIN)
        .setSampling(batchSize, true)
        .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
        .build();

FashionMnist testIter = FashionMnist.builder()
        .optUsage(Dataset.Usage.TEST)
        .setSampling(batchSize, true)
        .optLimit(Long.getLong("DATASET_LIMIT", Long.MAX_VALUE))
        .build();

trainIter.prepare();
testIter.prepare();
float lr = 1.0f;

Loss loss = Loss.softmaxCrossEntropyLoss();
Tracker lrt = Tracker.fixed(lr);
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();

DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
        .optOptimizer(sgd) // Optimizer
        .optDevices(Engine.getInstance().getDevices(1)) // single GPU
        .addEvaluator(new Accuracy()) // Model Accuracy
        .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging

Model model = Model.newInstance("batch-norm");
model.setBlock(net);

Trainer trainer = model.newTrainer(config);
trainer.initialize(new Shape(1, 1, 28, 28));

Map<String, double[]> evaluatorMetrics = new HashMap<>();
double avgTrainTimePerEpoch = Training.trainingChapter6(trainIter, testIter, numEpochs, trainer, evaluatorMetrics);
INFO Training on: 1 GPUs.
INFO Load MXNet Engine Version 1.8.0 in 0.080 ms.

INFO Epoch 1 finished.
INFO Train: Accuracy: 0.78, SoftmaxCrossEntropyLoss: 0.62
INFO Validate: Accuracy: 0.77, SoftmaxCrossEntropyLoss: 0.69
INFO Epoch 2 finished.
INFO Train: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.39
INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.41
INFO Epoch 3 finished.
INFO Train: Accuracy: 0.87, SoftmaxCrossEntropyLoss: 0.35
INFO Validate: Accuracy: 0.81, SoftmaxCrossEntropyLoss: 0.55
INFO Epoch 4 finished.
INFO Train: Accuracy: 0.88, SoftmaxCrossEntropyLoss: 0.33
INFO Validate: Accuracy: 0.88, SoftmaxCrossEntropyLoss: 0.34
INFO Epoch 5 finished.
INFO Train: Accuracy: 0.89, SoftmaxCrossEntropyLoss: 0.31
INFO Validate: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.38
INFO Epoch 6 finished.
INFO Train: Accuracy: 0.89, SoftmaxCrossEntropyLoss: 0.29
INFO Validate: Accuracy: 0.81, SoftmaxCrossEntropyLoss: 0.54
INFO Epoch 7 finished.
INFO Train: Accuracy: 0.90, SoftmaxCrossEntropyLoss: 0.28
INFO Validate: Accuracy: 0.78, SoftmaxCrossEntropyLoss: 0.57
INFO Epoch 8 finished.
INFO Train: Accuracy: 0.90, SoftmaxCrossEntropyLoss: 0.27
INFO Validate: Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.50
INFO Epoch 9 finished.
INFO Train: Accuracy: 0.90, SoftmaxCrossEntropyLoss: 0.26
INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.42
INFO Epoch 10 finished.
INFO Train: Accuracy: 0.91, SoftmaxCrossEntropyLoss: 0.25
INFO Validate: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.39
trainLoss = evaluatorMetrics.get("train_epoch_SoftmaxCrossEntropyLoss");
trainAccuracy = evaluatorMetrics.get("train_epoch_Accuracy");
testAccuracy = evaluatorMetrics.get("validate_epoch_Accuracy");

System.out.printf("loss %.3f,", trainLoss[numEpochs - 1]);
System.out.printf(" train acc %.3f,", trainAccuracy[numEpochs - 1]);
System.out.printf(" test acc %.3f\n", testAccuracy[numEpochs - 1]);
System.out.printf("%.1f examples/sec", trainIter.size() / (avgTrainTimePerEpoch / Math.pow(10, 9)));
System.out.println();
loss 0.254, train acc 0.907, test acc 0.860
2434.5 examples/sec

Let us have a look at the scale parameter gamma and the shift parameter beta learned from the first batch normalization layer.

// Printing the value of gamma and beta in the first BatchNorm layer
List<Parameter> batchNormFirstParams = net.getChildren().values().get(1).getParameters().values();
System.out.println("gamma " + batchNormFirstParams.get(0).getArray().reshape(-1));
System.out.println("beta " + batchNormFirstParams.get(1).getArray().reshape(-1));
gamma ND: (6) gpu(0) float32
[1.227 , 1.3049, 1.142 , 1.6613, 1.4366, 1.0348]
beta ND: (6) gpu(0) float32
[-1.83557404e-07, -1.13574046e-07, 2.08738555e-08, -9.70295346e-08, -3.63770596e-08, -5.06048607e-08]


String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];

Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
        trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");

Table data = Table.create("Data").addColumns(
        DoubleColumn.create("epoch", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
        DoubleColumn.create("metrics", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
        StringColumn.create("lossLabel", lossLabel));

render(LinePlot.create("", data, "epoch", "metrics", "lossLabel"), "text/html");

Plot: train loss, train accuracy, and test accuracy versus epoch for the from-scratch BatchNorm LeNet.

7.5.5. Concise Implementation

Compared with the BatchNorm class, which we just defined ourselves, the BatchNorm class provided by DJL's nn package is easier to use. In DJL, we do not have to worry about numFeatures or numDimensions. Instead, these parameter values will be inferred automatically via delayed initialization. Otherwise, the code looks virtually identical to the application of our implementation above.

SequentialBlock block = new SequentialBlock()
        .add(Conv2d.builder()
                .setKernelShape(new Shape(5, 5))
                .setFilters(6)
                .build())
        .add(BatchNorm.builder().build())
        .add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)))
        .add(Conv2d.builder()
                .setKernelShape(new Shape(5, 5))
                .setFilters(16)
                .build())
        .add(BatchNorm.builder().build())
        .add(Activation::sigmoid)
        .add(Pool.maxPool2dBlock(new Shape(2, 2), new Shape(2, 2)))
        .add(Blocks.batchFlattenBlock())
        .add(Linear.builder().setUnits(120).build())
        .add(BatchNorm.builder().build())
        .add(Activation::sigmoid)
        .add(Blocks.batchFlattenBlock())
        .add(Linear.builder().setUnits(84).build())
        .add(BatchNorm.builder().build())
        .add(Activation::sigmoid)
        .add(Linear.builder().setUnits(10).build());

Below, we use the same hyperparameters to train our model. Note that, as usual, the high-level API variant runs much faster because its computation is carried out by the engine's compiled C++/CUDA kernels, while our custom implementation must dispatch many small operations individually from Java.

Loss loss = Loss.softmaxCrossEntropyLoss();
Tracker lrt = Tracker.fixed(1.0f);
Optimizer sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();

Model model = Model.newInstance("batch-norm");
model.setBlock(block);

DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
        .optOptimizer(sgd) // Optimizer
        .addEvaluator(new Accuracy()) // Model Accuracy
        .addTrainingListeners(TrainingListener.Defaults.logging()); // Logging

Trainer trainer = model.newTrainer(config);
trainer.initialize(new Shape(1, 1, 28, 28));

Map<String, double[]> evaluatorMetrics = new HashMap<>();
double avgTrainTimePerEpoch = 0;
INFO Training on: 4 GPUs.
INFO Load MXNet Engine Version 1.8.0 in 0.022 ms.
avgTrainTimePerEpoch = Training.trainingChapter6(trainIter, testIter, numEpochs, trainer, evaluatorMetrics);
INFO Epoch 1 finished.
INFO Train: Accuracy: 0.73, SoftmaxCrossEntropyLoss: 0.96
INFO Validate: Accuracy: 0.70, SoftmaxCrossEntropyLoss: 0.86
INFO Epoch 2 finished.
INFO Train: Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.44
INFO Validate: Accuracy: 0.83, SoftmaxCrossEntropyLoss: 0.45
INFO Epoch 3 finished.
INFO Train: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.38
INFO Validate: Accuracy: 0.74, SoftmaxCrossEntropyLoss: 0.63
INFO Epoch 4 finished.
INFO Train: Accuracy: 0.87, SoftmaxCrossEntropyLoss: 0.35
INFO Validate: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.37
INFO Epoch 5 finished.
INFO Train: Accuracy: 0.88, SoftmaxCrossEntropyLoss: 0.33
INFO Validate: Accuracy: 0.79, SoftmaxCrossEntropyLoss: 0.66
INFO Epoch 6 finished.
INFO Train: Accuracy: 0.88, SoftmaxCrossEntropyLoss: 0.32
INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.40
INFO Epoch 7 finished.
INFO Train: Accuracy: 0.89, SoftmaxCrossEntropyLoss: 0.30
INFO Validate: Accuracy: 0.84, SoftmaxCrossEntropyLoss: 0.45
INFO Epoch 8 finished.
INFO Train: Accuracy: 0.89, SoftmaxCrossEntropyLoss: 0.29
INFO Validate: Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.39
INFO Epoch 9 finished.
INFO Train: Accuracy: 0.90, SoftmaxCrossEntropyLoss: 0.28
INFO Validate: Accuracy: 0.85, SoftmaxCrossEntropyLoss: 0.43
INFO Epoch 10 finished.
INFO Train: Accuracy: 0.90, SoftmaxCrossEntropyLoss: 0.28
INFO Validate: Accuracy: 0.88, SoftmaxCrossEntropyLoss: 0.33
trainLoss = evaluatorMetrics.get("train_epoch_SoftmaxCrossEntropyLoss");
trainAccuracy = evaluatorMetrics.get("train_epoch_Accuracy");
testAccuracy = evaluatorMetrics.get("validate_epoch_Accuracy");

System.out.printf("loss %.3f,", trainLoss[numEpochs - 1]);
System.out.printf(" train acc %.3f,", trainAccuracy[numEpochs - 1]);
System.out.printf(" test acc %.3f\n", testAccuracy[numEpochs - 1]);
System.out.printf("%.1f examples/sec", trainIter.size() / (avgTrainTimePerEpoch / Math.pow(10, 9)));
System.out.println();
loss 0.275, train acc 0.897, test acc 0.880
2103.4 examples/sec


String[] lossLabel = new String[trainLoss.length + testAccuracy.length + trainAccuracy.length];

Arrays.fill(lossLabel, 0, trainLoss.length, "train loss");
Arrays.fill(lossLabel, trainAccuracy.length, trainLoss.length + trainAccuracy.length, "train acc");
Arrays.fill(lossLabel, trainLoss.length + trainAccuracy.length,
        trainLoss.length + testAccuracy.length + trainAccuracy.length, "test acc");

Table data = Table.create("Data").addColumns(
        DoubleColumn.create("epoch", ArrayUtils.addAll(epochCount, ArrayUtils.addAll(epochCount, epochCount))),
        DoubleColumn.create("metrics", ArrayUtils.addAll(trainLoss, ArrayUtils.addAll(trainAccuracy, testAccuracy))),
        StringColumn.create("lossLabel", lossLabel));

render(LinePlot.create("", data, "epoch", "metrics", "lossLabel"), "text/html");

Plot: train loss, train accuracy, and test accuracy versus epoch for the concise BatchNorm LeNet.

7.5.6. Controversy

Intuitively, batch normalization is thought to make the optimization landscape smoother. However, we must be careful to distinguish between speculative intuitions and true explanations for the phenomena that we observe when training deep models. Recall that we do not even know why simpler deep neural networks (MLPs and conventional CNNs) generalize well in the first place. Even with dropout and \(L_2\) regularization, they remain so flexible that their ability to generalize to unseen data cannot be explained via conventional learning-theoretic generalization guarantees.

In the original paper proposing batch normalization, the authors, in addition to introducing a powerful and useful tool, offered an explanation for why it works: by reducing internal covariate shift. Presumably, by internal covariate shift the authors meant something like the intuition expressed above—the notion that the distribution of activations changes over the course of training. However, there were two problems with this explanation: (1) This drift is very different from covariate shift, rendering the name a misnomer. (2) The explanation offers an under-specified intuition but leaves the question of precisely why this technique works open, wanting a rigorous explanation. Throughout this book, we aim to convey the intuitions that practitioners use to guide their development of deep neural networks. However, we believe that it is important to separate these guiding intuitions from established scientific fact. Eventually, when you master this material and start writing your own research papers, you will want to clearly delineate between technical claims and hunches.

Following the success of batch normalization, its explanation in terms of internal covariate shift has repeatedly surfaced in debates in the technical literature and broader discourse about how to present machine learning research. In a memorable speech given while accepting a Test of Time Award at the 2017 NeurIPS conference, Ali Rahimi used internal covariate shift as a focal point in an argument likening the modern practice of deep learning to alchemy. Subsequently, the example was revisited in detail in a position paper outlining troubling trends in machine learning. In the technical literature, other authors ([Santurkar et al., 2018]) have proposed alternative explanations for the success of BN, some claiming that BN's success comes despite exhibiting behavior that is in some ways opposite to those claimed in the original paper.

We note that internal covariate shift is no more worthy of criticism than any of thousands of similarly vague claims made every year in the technical ML literature. Likely, its resonance as a focal point of these debates owes to its broad recognizability to the target audience. Batch normalization has proven an indispensable method, applied in nearly all deployed image classifiers, earning the paper that introduced the technique tens of thousands of citations.

7.5.7. Summary

  • During model training, batch normalization continuously adjusts the intermediate output of the neural network by utilizing the mean and standard deviation of the minibatch, so that the values of the intermediate output in each layer throughout the neural network are more stable.

  • The batch normalization methods for fully-connected layers and convolutional layers are slightly different.

  • Like a dropout layer, batch normalization layers have different computation results in training mode and prediction mode.

  • Batch normalization has many beneficial side effects, primarily that of regularization. On the other hand, the original motivation of reducing internal covariate shift seems not to be a valid explanation.

7.5.8. Exercises

  1. Can we remove the fully-connected affine transformation before the batch normalization or the bias parameter in the convolution computation?

    • Find an equivalent transformation that applies prior to the fully-connected layer.

    • Is this reformulation effective? Why (not)?

  2. Compare the learning rates for LeNet with and without batch normalization.

    • Plot the decrease in training and test error.

    • What about the region of convergence? How large can you make the learning rate?

  3. Do we need batch normalization in every layer? Experiment with it.

  4. Can you replace dropout by batch normalization? How does the behavior change?

  5. Fix the coefficients beta and gamma, and observe and analyze the results.

  6. Review the online documentation for BatchNorm to see the other applications for batch normalization.

  7. Research ideas: think of other normalization transforms that you could apply. Can you apply the probability integral transform? How about a full-rank covariance estimate?
