Effect of Normalization in Training Algorithms
- Tung San
- Aug 14, 2021
- 1 min read

Min-max feature scaling (often simply called "normalization")
Let [x1, x2, ..., xn] with max(xi) = M and min(xi) = m.
Just transform each xi into (xi - m) / (M - m), which rescales the values into the range [0, 1].
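A minimal NumPy sketch of this rescaling, using made-up example values:

import numpy as np

x = np.array([2.0, 5.0, 9.0, 20.0])              # hypothetical example values
x_minmax = (x - x.min()) / (x.max() - x.min())   # rescale into the range [0, 1]
print(x_minmax)                                  # -> approximately [0., 0.167, 0.389, 1.]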
Standardization (standard score, or z-score)
Let [x1, x2, ..., xn] ~ X with mean(X) = m and s.d.(X) = s.
Just transform each xi into (xi - m) / s, so the result has mean 0 and standard deviation 1.
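The same made-up values under the standard score:

import numpy as np

x = np.array([2.0, 5.0, 9.0, 20.0])     # same hypothetical values as above
x_scored = (x - x.mean()) / x.std()     # standard score: mean 0, s.d. 1
print(x_scored.mean(), x_scored.std())  # -> approximately 0.0 and 1.0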
Effect of Normalization
Normalization can bring the optimal model parameters closer to the starting point, which is usually 0, so the optimization process can take fewer iterations to converge.
Normalizing each numeric variable makes ratios such as delta(y)/delta(x) less extreme in scale. This makes the step size easier to choose and, to some extent, avoids overshooting and undershooting (see the toy sketch below).
When the optimizer uses an iterative algorithm such as gradient descent, normalization is practically a must.
Features on a far smaller scale than the other features may be neglected by the trained model, even though a small numeric value does not imply small importance.
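A toy sketch of the step-size problem (synthetic data, not the dog dataset below): fitting a one-parameter linear model with plain gradient descent, the same learning rate that diverges on a raw large-scale feature converges once the feature is standardized.

import numpy as np

# Synthetic data: the feature lives in the hundreds, the target is small.
rng = np.random.default_rng(0)
x_raw = rng.uniform(100, 200, size=50)
y = 0.03 * x_raw + rng.normal(0, 0.5, size=50)

def fit_slope(x, y, learning_rate, iterations):
    """Fit y ≈ w * x (no intercept, for simplicity) by gradient descent on the MSE."""
    w = 0.0
    for _ in range(iterations):
        grad = 2 * np.mean((w * x - y) * x)   # d(MSE)/dw
        w -= learning_rate * grad
    return w

# Same learning rate, raw vs standardized feature.
x_std = (x_raw - x_raw.mean()) / x_raw.std()
print(fit_slope(x_raw, y, 1e-2, 100))   # overshoots and diverges: |w| blows up
print(fit_slope(x_std, y, 1e-2, 100))   # converges toward the least-squares slope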
Illustration using avalanche-rescue dog dataset
import pandas
import wget

# Download the graphing helper, the dataset, and the gradient-descent helper
url1 = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py"
url2 = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/dog-training.csv"
url3 = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/m1b_gradient_descent.py"
list(map(wget.download, [url1, url2, url3]))

# Load the tab-delimited dataset and drop the columns not used in this example
data = pandas.read_csv("dog-training.csv", delimiter="\t")
del data["rescues_last_year"]
del data["age_last_year"]
del data["weight_last_year"]

print(data.shape)
data.head()

Train a simple linear regression using gradient descent
# Without normalization
from m1b_gradient_descent import gradient_descent
import numpy
import graphing
model = gradient_descent(data.month_old_when_trained, data.mean_rescues_per_year, learning_rate=5E-4, number_of_iterations=8000)

# With normalization: standardize the age feature (standard score)
data["normalized_age_when_trained"] = (data.month_old_when_trained - numpy.mean(data.month_old_when_trained)) / numpy.std(data.month_old_when_trained)
model_norm = gradient_descent(data.normalized_age_when_trained, data.mean_rescues_per_year, learning_rate=5E-4, number_of_iterations=8000)

Compare training cost histories
cost1 = model.cost_history
cost2 = model_norm.cost_history
# Create a dataframe with the cost history of each model
df1 = pandas.DataFrame({"cost": cost1, "Model":"No feature scaling"})
df1["number of iterations"] = df1.index + 1
df2 = pandas.DataFrame({"cost": cost2, "Model":"With feature scaling"})
df2["number of iterations"] = df2.index + 1
# Concatenate dataframes into a single one that we can use in our plot
df = pandas.concat([df1, df2])
# Plot cost history for both models
fig = graphing.scatter_2D(df, label_x="number of iterations", label_y="cost", title="Training Cost vs Iterations", label_colour="Model")
fig.update_traces(mode='lines')
fig.show()
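
As a quick numeric check to go with the plot, the last entry of each cost history (the same cost1 and cost2 defined above, assuming they are ordinary lists or arrays) can be compared directly:

# Compare the final training cost reached by each model
print("Final cost, no feature scaling:  ", cost1[-1])
print("Final cost, with feature scaling:", cost2[-1])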
