
Machine Translation using RNN with and without Attention - A detailed tracing of information

  • Writer: Tung San
  • Jun 28, 2022
  • 11 min read

Updated: Jul 4, 2022



Background

The focus of this study is to trace the information flow, i.e., the path of the data from the very beginning, as raw input, all the way to the corresponding prediction.


Machine translation using the attention mechanism has been adopted by Google. In 2016, a paper released by Google’s research team [1] discussed its rationale and principles, and the architecture has shown good translation performance. Similar implementations of this architecture are discussed and accessible over the internet, but usually only with the ideas roughly explained. The model, especially with attention, involves three networks: an Encoder network, an Attention network, and a Decoder network. Beginners in NLP with neural networks may find such complexity difficult to understand and may treat the model as a black box. The best way to demystify it is to trace closely how the information flows through the whole pipeline.




Data preview

The dataset used in this project contains 10,000 lines of English sentences with their corresponding French translations. The complexity of the sentences increases with the line number. Note that the translation is not one-to-one, as an English sentence usually has more than one appropriate French translation. A sample of 3,000 lines is drawn randomly for building the model without attention, and a sample of 2,000 lines for building the model with attention; the training problem is kept smaller in the latter case because computing attention significantly increases the computational cost.



Text processing

A function named preprocess_sentence() is used to transform each line read from the source dataset, with regular expressions as the main tool. For example, the line

Hi! Here are the first few lines of data. Check: (some newlines here)->>

will be transformed into

hi ! here are the first few lines of data . check some newlines here.

Here only the three signs ‘.’, ‘!’, and ‘?’ are kept, because they may indicate emotion. Every word, including the signs, is separated by a single white space.
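As a concrete illustration, here is a minimal sketch of such a preprocessing helper (the exact regular expressions and the accent-stripping step are assumptions reconstructed from the examples in this post, not necessarily the author’s original code):

import re
import unicodedata

def preprocess_sentence(line: str) -> str:
    # Strip accents so that, e.g., "à" becomes "a", as in the samples below.
    line = "".join(c for c in unicodedata.normalize("NFD", line)
                   if unicodedata.category(c) != "Mn")
    line = line.lower().strip()
    # Keep only letters and the three emotion-bearing signs . ! ?
    line = re.sub(r"[^a-z.!?]+", " ", line)
    # Surround each kept sign with spaces so every token is space-separated.
    line = re.sub(r"([.!?])", r" \1 ", line)
    # Collapse repeated white space into a single space.
    return re.sub(r"\s+", " ", line).strip()

print(preprocess_sentence("Hi! Here are the first few lines of data."))
# hi ! here are the first few lines of data .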


Every French sentence produces two truncated French sequences. One is truncated at the end, with a “BOS” token inserted at the beginning; the other is truncated at the beginning, with an “EOS” token appended at the end. “BOS” stands for beginning of sentence; it acts as the head that initiates the generation of a French sentence. “EOS” stands for end of sentence; generation of a French sentence halts once this token is generated. Generation also halts if the sentence length exceeds a certain limit, e.g., two times the maximum French sentence length in the sample. This avoids the infinite prediction that is possible in the very early stage of training, when the under-trained networks may fail to generate “EOS”. A short sketch of this construction follows the examples below.


For example,

['BOS', 'courez', '!']

['BOS', 'prenez', 'vos', 'jambes', 'a', 'vos', 'cous', '!']

and

['courez', '!', 'EOS']

['prenez', 'vos', 'jambes', 'a', 'vos', 'cous', '!', 'EOS']

are generated from

“Courez !” and “Prenez vos jambes à vos cous !”.
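A minimal sketch of how the two truncated sequences can be built (the helper name make_decoder_pair is an assumption used here for illustration):

def make_decoder_pair(french_sentence: str):
    tokens = french_sentence.split()           # already preprocessed
    full = ["BOS"] + tokens + ["EOS"]
    decoder_in = full[:-1]                     # truncated at the end
    decoder_out = full[1:]                     # truncated at the beginning
    return decoder_in, decoder_out

print(make_decoder_pair("courez !"))
# (['BOS', 'courez', '!'], ['courez', '!', 'EOS'])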


Vocabulary lists for the English and French words found in the sample are generated and used to convert sentence entries from word to word-index, and from word-index back to word. The following vocab lists are generated from the first 10 lines of data; a sketch of how such lists can be built follows them.

Check : word2idx_eng

{'.': 1, 'run': 2, '!': 3, 'go': 4, 'hi': 5}

Check : idx2word_eng

{1: '.', 2: 'run', 3: '!', 4: 'go', 5: 'hi'}

Check : word2idx_fren

{'!': 1, 'BOS': 2, 'EOS': 3, '.': 4, 'salut': 5, 'vos': 6, 'va': 7, 'marche': 8, 'bouge': 9, 'cours': 10, 'courez': 11, 'prenez': 12, 'jambes': 13, 'a': 14, 'cous': 15, 'file': 16, 'filez': 17}

Check : idx2word_fren

{1: '!', 2: 'BOS', 3: 'EOS', 4: '.', 5: 'salut', 6: 'vos', 7: 'va', 8: 'marche', 9: 'bouge', 10: 'cours', 11: 'courez', 12: 'prenez', 13: 'jambes', 14: 'a', 15: 'cous', 16: 'file', 17: 'filez'}
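A minimal sketch of building such vocab maps, assuming index 0 is reserved for the padding token (the exact ordering of the author’s lists may differ, e.g., if words were sorted by frequency):

def build_vocab(sentences):
    word2idx, idx2word = {}, {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2idx:
                idx = len(word2idx) + 1        # index 0 is kept for padding
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

word2idx_eng, idx2word_eng = build_vocab(["hi !", "run .", "go ."])
print(word2idx_eng)    # {'hi': 1, '!': 2, 'run': 3, '.': 4, 'go': 5}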


Data batches for training and testing are generated in an 80%-to-20% ratio; the following illustrates this with the first 10 lines of data. All English sentences, and likewise all French sentences, are zero-padded so that each language has sentences of equal length. In this example, the English sentences pass into a network that accepts 2 time-steps, while the network dealing with the French sentences accepts 8 time-steps, and so do their state tensors: the state information has shape (batch_size, 2, hidden_dim) and (batch_size, 8, hidden_dim), respectively. Dimension 0 and dimension 2 of the two state tensors must agree with each other, because the latter network will inherit the state information from the former network as its initial state.
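A minimal sketch of the zero-padding and 80/20 batching described above, using tf.keras utilities (the variable names, the illustrative word indices, and the use of tf.data are assumptions):

import tensorflow as tf

# Word-index sentences from the first lines of data (illustrative only).
eng_idx      = [[5, 1], [4, 1]]
fren_in_idx  = [[2, 12, 6, 13, 14, 6, 15, 1], [2, 5, 1]]
fren_out_idx = [[12, 6, 13, 14, 6, 15, 1, 3], [5, 1, 3]]

# Zero-pad every sentence to the longest one in its own language.
eng_pad  = tf.keras.preprocessing.sequence.pad_sequences(eng_idx, padding="post")
fin_pad  = tf.keras.preprocessing.sequence.pad_sequences(fren_in_idx, padding="post")
fout_pad = tf.keras.preprocessing.sequence.pad_sequences(fren_out_idx, padding="post")

# 80%-to-20% train/test split, then batch for training.
n_train  = int(0.8 * len(eng_pad))
train_ds = (tf.data.Dataset
            .from_tensor_slices((eng_pad[:n_train], fin_pad[:n_train], fout_pad[:n_train]))
            .shuffle(1000)
            .batch(2))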



Typical Encoder-Decoder Architecture

A simple architecture, i.e., a single network taking an English sentence as input and producing a French sentence as output, is not applicable. An input sequence is processed sequentially to generate an output sequence, but in natural language a sentence’s meaning is often significantly incomplete until the last few words have been read. Punctuation can also matter in translation: a sentence ending with ‘!’ may express anger while one ending with ‘.’ may express only a plain emotion, and the two can be rendered very differently in the other language. English users may also be unfamiliar with gender rules: in French, a male speaker and a female speaker may pronounce and spell the same word differently, so this gender context, i.e., the gender of the speaker, affects the translation significantly.


So, the context of the input sentence needs to be extracted before translation begins. The first network, also known as the Encoder, handles this. Every time the Encoder processes a word of the input sentence (encoder_in), a state vector is generated: a vector of real numbers whose dimension equals the encoder’s hidden dimension (encoder_hidden_dim). An Encoder handling input sentences of length L1, e.g., 5, therefore produces L1-many, e.g., 5, hidden states once all words of the input sentence have been processed. The last hidden state is passed to the Decoder as its initial state; it encloses information about the whole sentence just processed by the Encoder.
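A minimal Encoder sketch consistent with the shapes traced below (the class name, the embedding size, and the use of a GRU cell are assumptions; the author’s exact RNN cell is not stated):

import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size_eng, embedding_dim, encoder_hidden_dim):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size_eng + 1, embedding_dim)
        self.rnn = tf.keras.layers.GRU(encoder_hidden_dim,
                                       return_sequences=True,   # one state per English word
                                       return_state=True)       # plus the final state

    def call(self, eng_input):
        x = self.embedding(eng_input)               # (batch, L1, embedding_dim)
        seq_of_states, final_state = self.rnn(x)    # (batch, L1, hidden), (batch, hidden)
        return seq_of_states, final_state

encoder = Encoder(vocab_size_eng=5, embedding_dim=4, encoder_hidden_dim=5)
seq_of_states, final_state = encoder(tf.constant([[5, 1], [4, 1]]))
print(seq_of_states.shape, final_state.shape)       # (2, 2, 5) (2, 5)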


The Decoder then reads a word of its own input (decoder_in) and uses the state information just passed to generate, as output, a vector of real numbers whose dimension equals the pre-defined hidden dimension (decoder_hidden_dim). Note that a typical RNN node returns both an output and a state; only the returned state is used in our model.


The new state generated is then passed to a dense layer whose output dimension equals the French vocabulary size, e.g., 17, so as to predict a word from the most likely guess. One extra dimension is needed on top of the French vocabulary size, because the padding token “0” is added after the vocab is formed and does exist in the data; in total, e.g., 18 dimensions are needed. Otherwise, an under-trained model that effectively intends to predict the padding token “0”, which is an invalid prediction, would be forced onto some other word-index and the error would pass silently. One may terminate further prediction for the current decoder input sentence when “0” is predicted, or when the predicted sequence grows too long, i.e., when words other than “EOS” keep being predicted and the prediction would not otherwise halt.
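A minimal sketch of this plain (attention-free) Decoder, again with a GRU cell assumed and the class name chosen here for illustration:

import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size_fren, embedding_dim, decoder_hidden_dim):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size_fren + 1, embedding_dim)
        self.rnn = tf.keras.layers.GRU(decoder_hidden_dim,
                                       return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size_fren + 1)   # e.g. 17 + 1 = 18 logits

    def call(self, decoder_in, initial_state):
        x = self.embedding(decoder_in)                                          # (batch, L2, emb)
        seq_of_states, final_state = self.rnn(x, initial_state=initial_state)
        logits = self.fc(seq_of_states)                                         # (batch, L2, 18)
        return logits, final_state

enc_final_state = tf.zeros((2, 5))       # stand-in for the Encoder's final state
decoder = Decoder(vocab_size_fren=17, embedding_dim=4, decoder_hidden_dim=5)
logits, _ = decoder(tf.constant([[2, 12, 6, 13, 14, 6, 15, 1],
                                 [2,  5, 1,  0,  0, 0,  0, 0]]),
                    initial_state=enc_final_state)
print(logits.shape)                       # (2, 8, 18)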


As a prediction is made, another new state is generated by the Decoder and used, together with the next word of the decoder’s input, to generate the next predicted word.


Once the prediction of a sentence is completed, i.e., terminated by the “EOS” token being predicted, the whole sentence is compared with the label, i.e., the correct output. For example,

Input: tom just died .

Label: tom vient de mourir . EOS

Predicted: tom l a dementi . EOS.


The loss is calculated with sparse cross-entropy loss, because the French vocabulary can be very large while only a few of its words occur in any given sentence. The model parameters are then updated according to the calculated loss, optimized with Adam.
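A minimal sketch of such a loss, with padded positions masked out (whether the author masked the padding token is an assumption):

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
optimizer = tf.keras.optimizers.Adam()

def masked_loss(labels, logits):
    # labels: (batch, L2) word indices; logits: (batch, L2, vocab_size_fren + 1)
    per_token = loss_object(labels, logits)
    mask = tf.cast(labels != 0, per_token.dtype)   # ignore zero-padded positions
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)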


This architecture, however, has a limitation when the prediction sequence is long. Suppose the following is the final state produced by the Encoder when the last word of its English input sentence is processed. There are 2 sentences in the batch, so [5 1] is the 1st English sentence and [4 1] is the 2nd. The first vector, [-0.00151575 0.00067186 -0.0088729 -0.00694382 -0.00245908], has dimension 5 because encoder_hidden_dim is set to 5, and is produced when the ‘1’ in [5 1] is processed. This state is passed to the Decoder; check the Decoder’s initial states below. The initial state is used along with the first entry [2 2] to give the first states in the Decoder, [-0.01462262 0.01395725 0.00325191 -0.00523921 0.00773933] and [-0.01881059 0.01788215 -0.00122766 -0.00630105 0.00780798]. The problem is that as the Decoder moves along its input, from [2 2] eventually to [1 0], the original state given by the Encoder, which encloses important contextual information about the English sentence to be translated, becomes less and less influential on the Decoder’s predictions. For a long sentence, the Encoder-Decoder may behave as if no context information had been given by the Encoder at all. The attention mechanism is employed to address this limitation.

Check Encoder Eng_input:

[[5 1]

[4 1]]

(2, 5)

Check Encoder final_state:

[[-0.00151575 0.00067186 -0.0088729 -0.00694382 -0.00245908]

[-0.01021546 0.01021688 -0.01940843 -0.00739546 -0.00296227]]

(2, 5)

Check Decoder_French_input:

[[ 2 12 6 13 14 6 15 1]

[ 2 5 1 0 0 0 0 0]]

(2, 8)

Check Decoder initial_states:

[[-0.00151575 0.00067186 -0.0088729 -0.00694382 -0.00245908]

[-0.01021546 0.01021688 -0.01940843 -0.00739546 -0.00296227]]

(2, 5)

Check Decoder seq_of_states:

[[[-0.01462262 0.01395725 0.00325191 -0.00523921 0.00773933]

[-0.01628003 0.00927662 0.00377389 -0.01390181 0.01857551]

[-0.00371529 -0.00506332 -0.00538249 -0.00254834 0.00981565]

[ 0.00759869 -0.00075043 -0.00525092 -0.00098067 -0.00509327]

[ 0.01383088 0.00012096 -0.00659386 -0.00150179 -0.01170597]

[ 0.01721347 -0.00015377 -0.00797081 -0.0020302 -0.01465107]

[ 0.01900439 -0.00067868 -0.00900003 -0.00227526 -0.0159403 ]

[ 0.01992344 -0.00114775 -0.00967174 -0.00232605 -0.01648146]]

[[-0.01881059 0.01788215 -0.00122766 -0.00630105 0.00780798]

[-0.00550694 0.00845043 -0.00286086 0.0044791 -0.00225887]

[ 0.00193014 -0.00468826 -0.00506912 0.00763552 -0.00479838]

[ 0.01028105 -0.00114804 -0.00574334 0.00213741 -0.01122396]

[ 0.01514361 -0.00032801 -0.00714831 -0.00055831 -0.01435578]

[ 0.01786109 -0.00049488 -0.00839208 -0.00173699 -0.01581654]

[ 0.01932136 -0.00090681 -0.00927618 -0.0021719 -0.0164557 ]

[ 0.02007534 -0.00129074 -0.00983838 -0.00227919 -0.01670802]]]

(2, 8, 5)

Check Decoder final_states:

[[ 0.01992344 -0.00114775 -0.00967174 -0.00232605 -0.01648146]

[ 0.02007534 -0.00129074 -0.00983838 -0.00227919 -0.01670802]]

(2, 5)



Encoder-Decoder Architecture with attention

For the attention mechanism to work, the Decoder needs an extra piece of information: the sequence of states produced by the Encoder as it processes every entry of the English input sentence. If, for example, the English input has length 2, then the Encoder yields 2 state vectors, each of dimension equal to the encoder’s hidden dimension. This sequence of states, one per English input entry, is used to compute a “correspondence score” for the Decoder to look up whenever it is about to predict a word.


Recall that the Decoder needs a state and a word from its own input (decoder_in) to predict a word. The calculated “correspondence score” is multiplied into the sequence of Encoder states; this amounts to a weighted sum over that set of vectors. The resulting vector is passed to the Decoder and concatenated to the (embedded) decoder input, so that the Decoder uses this longer input, along with its own state, to make a prediction. Check with the following.

vocab_size_eng = 5. seqlen_eng = 2.

vocab_size_fren = 17. seqlen_fren = 8.

decoder_embedding_dim = 4. decoder_hidden_dim = 5.

batch size = 2.

decoder input (The current batch):

[[2 5 1 0 0 0 0 0]

[2 8 4 0 0 0 0 0]]

(2,8)

Using entry-0 over the inputted batch

[2 2]

(2,)

Check Encoder final_state:

[[ 0.00695505 0.00925171 0.00850129 -0.00365019 0.01062863]

[ 0.00849607 -0.0015159 0.00437575 0.005831 0.01335546]]

Check Decoder initial state:

[[ 0.00695505 0.00925171 0.00850129 -0.00365019 0.01062863]

[ 0.00849607 -0.0015159 0.00437575 0.005831 0.01335546]]

(2, 5)

Check Decoder Embedded_input:

[[-0.00109055 0.03185408 -0.02998426 -0.00805385]

[-0.00109055 0.03185408 -0.02998426 -0.00805385]]

(2, 4)


Following this, the Decoder’s state (in the 1st step, inherited from the Encoder) and the sequence of states inherited from the Encoder are passed to the Attention network. The two tensors go through a series of dense and activation layers to produce the alignment, i.e., the “correspondence score”, and thus an extra context signal for the Decoder’s prediction of a word. The corresponding description and details are commented below for readability. Additive attention is employed here: the two transformed state tensors are summed together, which is what the name “additive” signifies.
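A minimal sketch of such an additive attention layer, written to reproduce the shapes in the trace below (the layer names W1, W2, V and the attention size of 4 are assumptions matched to the printed tensors, not necessarily the author’s exact code):

import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    def __init__(self, attention_units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(attention_units)   # transforms the decoder state
        self.W2 = tf.keras.layers.Dense(attention_units)   # transforms the encoder state sequence
        self.V  = tf.keras.layers.Dense(1)                  # scores each English position

    def call(self, decoder_state_prev, encoder_state_seq):
        # decoder_state_prev: (batch, hidden); encoder_state_seq: (batch, L1, hidden)
        query = tf.expand_dims(decoder_state_prev, axis=1)                       # (batch, 1, hidden)
        score = self.V(tf.keras.activations.tanh(
            self.W1(query) + self.W2(encoder_state_seq)))                        # (batch, L1, 1), "additive"
        alignment = tf.nn.softmax(score, axis=1)                                 # attention per English word
        # Weighted sum of the encoder states = the extra context signal.
        context = tf.matmul(tf.transpose(alignment, [0, 2, 1]), encoder_state_seq)   # (batch, 1, hidden)
        return context, alignment

attention = AdditiveAttention(attention_units=4)
context, alignment = attention(tf.zeros((2, 5)), tf.zeros((2, 2, 5)))
print(context.shape, alignment.shape)    # (2, 1, 5) (2, 2, 1)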


>>>>>AdditiveAttention:

Check Decoder state:

[[ 0.00695505 0.00925171 0.00850129 -0.00365019 0.01062863]

[ 0.00849607 -0.0015159 0.00437575 0.005831 0.01335546]]

(2, 5)

Check Encoder_seq_states:

[[[-0.01080218 0.01989299 0.00204788 0.00671714 -0.00268245]

[ 0.00695505 0.00925171 0.00850129 -0.00365019 0.01062863]]

[[-0.00388632 -0.00092694 -0.00081464 0.01778765 0.00470874]

[ 0.00849607 -0.0015159 0.00437575 0.005831 0.01335546]]]

(2, 2, 5)

## A series of calculation

## 1)

Check Decoder state expand dims.

[[[ 0.00695505 0.00925171 0.00850129 -0.00365019 0.01062863]]

[[ 0.00849607 -0.0015159 0.00437575 0.005831 0.01335546]]]

(2, 1, 5)

## 2)

The two tensors will be *linearly transformed* to agree with the decoder’s embedding dimension.

Check self.W1(decoder_state_prev):

[[[-0.00065386 0.00596439 0.01028331 0.01475203]]

[[ 0.00082657 -0.00710913 0.00535248 0.00980083]]]

(2, 1, 4)

Check self.W2(encoder_state_seq):

[[[ 0.01787243 -0.00579923 -0.00917218 -0.00235289]

[ 0.00085506 0.01127228 -0.01400615 -0.00238137]]

[[ 0.0068425 0.00306441 0.00146474 -0.00294555]

[-0.00294493 0.01512635 -0.00711599 -0.00288397]]]

(2, 2, 4)

## 3)

and then *summed* together.

Comment: Each innermost vector in self.W1(decoder_state_prev) of shape (2, 1, 4) is added to every innermost vector in self.W2(encoder_state_seq) of shape (2, 2, 4)

Check SUM:

[[[ 0.01721858 0.00016515 0.00111113 0.01239914]

[ 0.0002012 0.01723667 -0.00372284 0.01237066]]

[[ 0.00766907 -0.00404472 0.00681722 0.00685528]

[-0.00211836 0.00801722 -0.00176351 0.00691686]]]

(2, 2, 4)

## 4)

and then apply activations.

tf.keras.activations.tanh(SUM):

tf.Tensor(

[[[ 0.01721688 0.00016515 0.00111113 0.01239851]

[ 0.0002012 0.01723496 -0.00372282 0.01237003]]

[[ 0.00766891 -0.0040447 0.00681712 0.00685518]

[-0.00211836 0.00801704 -0.0017635 0.00691675]]], shape=(2, 2, 4), dtype=float32)

## 5)

and then passed to a Dense layer V of output dimension 1.

Check score=self.V(tf.keras.activations.tanh(SUM):

[[[-0.01240916]

[-0.0232123 ]]

[[-0.00430589]

[-0.01085638]]]

(2, 2, 1)

## 6)

and then apply softmax to get alignment.


Check alignment:

[[[0.50270075]

[0.4972992 ]]

[[0.50163764]

[0.4983624 ]]]

(2, 2, 1)

## 7)

Further, obtain an extra context signal from alignment

Check alignment_Transposed:

[[[0.50270075 0.4972992 ]]

[[0.50163764 0.4983624 ]]]

(2, 1, 2)

Check Encoder_seq_states:

[[[-0.01080218 0.01989299 0.00204788 0.00671714 -0.00268245]

[ 0.00695505 0.00925171 0.00850129 -0.00365019 0.01062863]]

[[-0.00388632 -0.00092694 -0.00081464 0.01778765 0.00470874]

[ 0.00849607 -0.0015159 0.00437575 0.005831 0.01335546]]]

(2, 2, 5)

Check (M). matmul between the two tensors:

[[[-0.00197152 0.01460109 0.00525715 0.00156147 0.00393714]]

[[ 0.0022846 -0.00122046 0.00177206 0.0118289 0.00901794]]]

(2, 1, 5)

Check take tf.reduce_sum(M, axis=1)):

[[-0.00197152 0.01460109 0.00525715 0.00156147 0.00393714]

[ 0.0022846 -0.00122046 0.00177206 0.0118289 0.00901794]]

(2, 5)

Check expand dims. This is the extra context signal:

[[[-0.00197152 0.01460109 0.00525715 0.00156147 0.00393714]]

[[ 0.0022846 -0.00122046 0.00177206 0.0118289 0.00901794]]]

(2, 1, 5)

The alignment calculated in this step, where the Decoder uses the 1st entry of the French input, i.e., [2 2] taken column-wise from [[2 5 1 0 0 0 0 0], [2 8 4 0 0 0 0 0]], to give the 1st entry of the prediction sequence, is

[[[0.50270075]

[0.4972992 ]]

[[0.50163764]

[0.4983624 ]]].


It suggests that the Decoder needs to pay 0.50270075 and 0.50163764 attention to the 1st entry of the English input, which is the 5 in [5 1] and the 4 in [4 1], when giving the 1st entry of the prediction sequence, and 0.4972992 and 0.4983624 attention to the 2nd entry of the English input, which is the 1 in [5 1] and the 1 in [4 1]. Check the English input to the Encoder below.


Check Encoder Eng_input:

[[5 1]

[4 1]]

(2, 5)

Further, the extra context signal is passed back to the Decoder. It is concatenated with the embedded input for [2 2], the input in this run, to give a prediction. The final result from the dense layer of output dimension 18, which equals 1 + the vocabulary size of French, is used to give the predicted word.
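A minimal sketch of one such decoding step, mirroring the trace that follows (the class name and the use of a GRU cell are assumptions):

import tensorflow as tf

class DecoderStepWithAttention(tf.keras.Model):
    def __init__(self, vocab_size_fren, embedding_dim, decoder_hidden_dim):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size_fren + 1, embedding_dim)
        self.rnn = tf.keras.layers.GRU(decoder_hidden_dim,
                                       return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size_fren + 1)    # 17 + 1 = 18 logits

    def call(self, word_idx, state, context_signal):
        # word_idx: (batch,) one French word per sentence, e.g. [2 2]
        x  = self.embedding(word_idx)                                       # (batch, emb)
        x2 = tf.concat([x, tf.squeeze(context_signal, axis=1)], axis=1)     # (batch, emb + hidden)
        x2 = tf.expand_dims(x2, axis=1)                                     # (batch, 1, emb + hidden)
        seq, new_state = self.rnn(x2, initial_state=state)                  # (batch, 1, hidden), (batch, hidden)
        logits = self.fc(tf.keras.activations.tanh(seq))                    # (batch, 1, 18)
        return logits, new_state

step = DecoderStepWithAttention(vocab_size_fren=17, embedding_dim=4, decoder_hidden_dim=5)
logits, state = step(tf.constant([2, 2]), tf.zeros((2, 5)), tf.zeros((2, 1, 5)))
print(logits.shape, state.shape)          # (2, 1, 18) (2, 5)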

>>>Decoder Continue:

## 0) Double-checking

Check again Embedded_input:

[[-0.00109055 0.03185408 -0.02998426 -0.00805385]

[-0.00109055 0.03185408 -0.02998426 -0.00805385]]

(2, 4)

## 1) Concatenating the two innermost vectors together. The result will be used as the rnn input.

Check tf.squeeze(context_signal, axis=1):

[[-0.00197151 0.01460108 0.00525716 0.00156147 0.00393715]

[ 0.0022846 -0.00122046 0.00177206 0.0118289 0.00901794]]

(2, 5)

Check tf.concat([x, tf.squeeze(context_signal, axis=1)], axis=1):

[[-0.00109055 0.03185408 -0.02998426 -0.00805385 -0.00197151 0.01460108

0.00525716 0.00156147 0.00393715]

[-0.00109055 0.03185408 -0.02998426 -0.00805385 0.0022846 -0.00122046

0.00177206 0.0118289 0.00901794]]

(2, 9)

Check expand dims...:

[[[-0.00109055 0.03185408 -0.02998426 -0.00805385 -0.00197151

0.01460108 0.00525716 0.00156147 0.00393715]]

[[-0.00109055 0.03185408 -0.02998426 -0.00805385 0.0022846

-0.00122046 0.00177206 0.0118289 0.00901794]]]

(2, 1, 9)

## 2) The tensor above (named x2 in the code) is used as the rnn’s input

Check (x) self.rnn(x2, state):

[[[-0.01085174 0.02083068 0.00200731 0.00069363 0.00350785]]

[[-0.01653891 0.01090383 -0.00105983 0.00169392 0.00912278]]]

(2, 1, 5)

## 3) activation and dense layer

Check (x) self.tanh(x):

[[[ 1.3665126e-02 1.4264538e-02 -2.8624907e-03 -8.8535046e-05

-1.6955875e-02]]

[[ 1.5753986e-02 1.1237667e-02 -6.1049550e-03 5.1309071e-05

-1.7799601e-02]]]

(2, 1, 5)

Check (x) self.fc(x):

[[[ 0.01327891 -0.00648213 0.0005606 0.0078405 -0.00375937

0.00481374 0.00105885 -0.00722548 0.01121659 0.00686908

-0.00817349 0.01147187 0.00433722 0.00689595 -0.00725744

0.01580662 0.00481296 0.00117604]]

[[ 0.01196476 -0.0062236 0.00032258 0.00892573 -0.0023132

0.00738119 0.00298696 -0.00759445 0.00916394 0.00695748

-0.00797199 0.0110859 0.0016316 0.00476087 -0.00610589

0.01749316 0.00492798 0.00013863]]]

(2, 1, 18)

## 4) This state will be passed to Decoder as the state information for the next entry’s prediction.

Check current state:

[[-0.01085174 0.02083068 0.00200731 0.00069363 0.00350785]

[-0.01653891 0.01090383 -0.00105983 0.00169392 0.00912278]]

(2, 5)

>>>Decoder Ends:

The remaining loops proceed as follows. In each batch, if the French sequence length is, e.g., 8, then 8 loops are run, so that 8 word predictions are made. The predicted sequence is compared with the labelled French output to compute the loss, but with this mechanism the loss is computed word-by-word within each of the 8 loops, instead of sentence-by-sentence as in the typical Encoder-Decoder case. The losses over the 8 loops accumulate and are averaged, e.g., by 8, to give an average loss. The remaining steps to update the model parameters, however, are exactly the same in both mechanisms.
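A minimal sketch of that word-by-word training loop, reusing the Encoder, AdditiveAttention, and DecoderStepWithAttention sketches from earlier (all names are assumptions, and padding positions are not masked here for brevity):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def train_step(eng_batch, fren_in_batch, fren_out_batch):
    seqlen_fren = fren_in_batch.shape[1]          # e.g. 8
    loss = 0.0
    with tf.GradientTape() as tape:
        enc_seq_states, state = encoder(eng_batch)
        for t in range(seqlen_fren):              # one word prediction per loop
            context, _ = attention(state, enc_seq_states)
            logits, state = step(fren_in_batch[:, t], state, context)
            loss += tf.reduce_mean(loss_object(fren_out_batch[:, t], logits[:, 0, :]))
        loss = loss / seqlen_fren                 # average over the e.g. 8 loops
    variables = (encoder.trainable_variables + attention.trainable_variables
                 + step.trainable_variables)
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

The trace below then continues with the remaining entries of the batch.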

Using entry-1 over the inputted batch

[5 8]

Check alignment:

[[[0.5026997 ]

[0.49730027]]

[[0.501637 ]

[0.49836302]]]

(2, 2, 1)

Check context_signal_from_attention:

[[[-0.0019715 0.01460108 0.00525716 0.00156146 0.00393715]]

[[ 0.0022846 -0.00122046 0.00177206 0.01182889 0.00901794]]]

(2, 1, 5)

… (Skipped)

Using entry-2 over the inputted batch

[1 4]

Check alignment:

[[[0.50270087]

[0.4972991 ]]

[[0.5016375 ]

[0.49836245]]]

(2, 2, 1)

… (Skipped)



Model results

The latter model performs better in the sense that the decrease in loss is smoother and the BLEU score (BiLingual Evaluation Understudy), a metric for evaluating machine translation performance, obtained after every epoch increases roughly linearly, while the former model’s does not. The overall score is also better. The captures below show actual predictions of sentences made by the model at around 100, 200, and 300 epochs.
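For reference, a minimal sketch of a BLEU computation using NLTK (the author’s exact evaluation code is not shown, so the use of nltk and the smoothing choice are assumptions):

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["tom", "vient", "de", "mourir", "."]]]   # one list of reference translations per sentence
hypotheses = [["tom", "l", "a", "dementi", "."]]         # model predictions, with EOS stripped
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(round(score, 4))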




