
Effect of Normalization on the Training Algorithm

  • Writer: Tung San
  • Aug 14, 2021
  • 1 min read



Min-max Feature Scaling

Given [x1, x2, ..., xn], let max(xi) = M and min(xi) = m.

Transform each xi into (xi - m)/(M - m), so the rescaled values lie in [0, 1].

(Also known as rescaling)
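A minimal NumPy sketch of this transform (the feature values below are made up for illustration):

import numpy as np

x = np.array([30.0, 45.0, 60.0, 120.0])         # made-up feature values, e.g. ages in months
x_minmax = (x - x.min()) / (x.max() - x.min())  # rescale to the range [0, 1]
print(x_minmax)                                 # -> 0, 0.1667, 0.3333, 1 (approximately)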



Normalization (in statistics)

Given samples [x1, x2, ..., xn] ~ X, let mean(X) = m and s.d.(X) = s.

Transform each xi into (xi - m)/s, so the result has zero mean and unit standard deviation.

(Standard score, also known as the z-score or standardization)
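The same transform as a minimal NumPy sketch (again with made-up values):

import numpy as np

x = np.array([30.0, 45.0, 60.0, 120.0])   # made-up feature values
z = (x - x.mean()) / x.std()              # standard score: zero mean, unit standard deviation
print(z.mean(), z.std())                  # approximately 0 and 1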



Effect of Normalization

Normalization can bring the optimal model parameters closer to the starting point, which is usually the origin (0). The optimization process can therefore take fewer iterations to converge.


Normalizing each numeric variable makes ratios such as delta(y)/delta(x) less extreme in scale. This makes it easier to choose a step size for the optimization process and avoids overshooting and undershooting to some extent.


When the optimizer adopts an iterative algorithm, e.g. gradient descent, normalization is a must.


Features on a far smaller scale than the other features may be neglected by the trained model. However, being small in numeric value does not mean being small in importance.
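To make the step-size point concrete, here is a small sketch (with made-up numbers) of the squared-error gradient for a one-feature linear model y_hat = w * x: the gradient with respect to w scales with the feature value x, so unscaled features of very different magnitudes demand very different learning rates.

# Hypothetical 1-D model y_hat = w * x with squared-error loss L = (w*x - y)**2
# dL/dw = 2 * (w*x - y) * x, so the gradient grows with the scale of x
def grad(w, x, y):
    return 2 * (w * x - y) * x

w = 0.0
print(grad(w, x=2.0, y=10.0))      # feature on a small scale -> gradient -40.0
print(grad(w, x=2000.0, y=10.0))   # feature on a large scale -> gradient -40000.0
# A learning rate tuned for the first gradient overshoots badly on the second.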



Illustration using avalanche-rescue dog dataset

# Fetch the helper scripts and the dataset from the Microsoft Learn repository
import pandas
import wget

url1 = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py"
url2 = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/dog-training.csv"
url3 = "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/m1b_gradient_descent.py"
list(map(wget.download, [url1, url2, url3]))

# Load the tab-separated dataset
data = pandas.read_csv("dog-training.csv", delimiter="\t")

# Drop the columns that are not needed for this illustration
del data["rescues_last_year"]
del data["age_last_year"]
del data["weight_last_year"]

print(data.shape)
data.head()



Train simple linear regression using gradient descent

# Without normalization
from m1b_gradient_descent import gradient_descent
import numpy
import graphing

model = gradient_descent(data.month_old_when_trained, data.mean_rescues_per_year, learning_rate=5E-4, number_of_iterations=8000)

# With normalization (standard score of the age feature)
data["normalized_age_when_trained"] = (data.month_old_when_trained - numpy.mean(data.month_old_when_trained)) / numpy.std(data.month_old_when_trained)

model_norm = gradient_descent(data.normalized_age_when_trained, data.mean_rescues_per_year, learning_rate=5E-4, number_of_iterations=8000)



Compare training cost history

cost1 = model.cost_history
cost2 = model_norm.cost_history

# Create dataframes with the cost history for each model
df1 = pandas.DataFrame({"cost": cost1, "Model": "No feature scaling"})
df1["number of iterations"] = df1.index + 1
df2 = pandas.DataFrame({"cost": cost2, "Model": "With feature scaling"})
df2["number of iterations"] = df2.index + 1

# Concatenate the dataframes into a single one that we can use in our plot
df = pandas.concat([df1, df2])

# Plot the cost history for both models
fig = graphing.scatter_2D(df, label_x="number of iterations", label_y="cost", title="Training Cost vs Iterations", label_colour="Model")
fig.update_traces(mode='lines')
fig.show()


The training procedure converges much faster when the feature is scaled.
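As a follow-up sketch (not from the original notebook), the model fitted on the normalized feature can be mapped back to the original scale: if the fitted line is y = a*z + b with z = (x - m)/s, then on the raw feature it is y = (a/s)*x + (b - a*m/s). The slope and intercept values below are placeholders, since the attributes exposed by the gradient_descent helper may differ:

m = numpy.mean(data.month_old_when_trained)   # mean used in the normalization above
s = numpy.std(data.month_old_when_trained)    # standard deviation used above

a, b = 1.5, 10.0   # placeholder slope/intercept fitted on the normalized feature

slope_original = a / s                 # y = a*(x - m)/s + b
intercept_original = b - a * m / s     #   = (a/s)*x + (b - a*m/s)
print(slope_original, intercept_original)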












