If we do character-level tokenization, we get a very small vocabulary, but the encoded sequences become very long. Whereas with word-level tokenization, we can encode long sentences with few tokens, but the vocabulary size becomes huge. Therefore, intermediate sub-word tokenization techniques such as BPE (byte-pair encoding) were introduced.
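A minimal sketch of the two extremes on a toy sentence (the sentence and the resulting vocabularies here are only illustrative, not from a real tokenizer):

```python
sentence = "Alexander the Great was the king of Macedon"

# Character-level: tiny vocabulary, long encoded sequence.
char_vocab = sorted(set(sentence))
char_stoi = {ch: i for i, ch in enumerate(char_vocab)}
char_encoding = [char_stoi[ch] for ch in sentence]

# Word-level: short encoded sequence, but the vocabulary grows with every new word.
word_vocab = sorted(set(sentence.split()))
word_stoi = {w: i for i, w in enumerate(word_vocab)}
word_encoding = [word_stoi[w] for w in sentence.split()]

print(len(char_vocab), len(char_encoding))   # small vocab, long sequence
print(len(word_vocab), len(word_encoding))   # larger vocab, short sequence
```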
Decoder
GPT is trained on the next-token prediction task: given some context, the model tries to predict the next token.
Sentence - "Alexander the Great was the king of the kingdom of Macedon"
Context - "Alexander the __"
GPT - "Great"
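Mechanically, the model outputs a score (logit) for every token in the vocabulary, and the next token is chosen from the resulting probability distribution. A minimal sketch, using random logits as a stand-in for a trained model's output:

```python
import torch

torch.manual_seed(0)
vocab_size = 1000                         # assumed toy vocabulary size

# In a real GPT these logits come from the network given the context
# "Alexander the"; here they are random placeholders.
logits = torch.randn(vocab_size)

# Turn logits into a probability distribution over the vocabulary and pick the
# most likely next token (greedy decoding); torch.multinomial would sample instead.
probs = torch.softmax(logits, dim=-1)
next_token_id = torch.argmax(probs).item()
print(next_token_id)
```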
Self-attention
It's a novel idea introduced in the "Attention Is All You Need" paper: each token interacts with every other token in its context to generate an affinity, i.e. a measure of how likely the tokens are to appear together.
Given that token X has occurred somewhere in the context of token Y, the model uses this affinity to decide which token Z to generate later in the sequence.
In the case of the decoder, self-attention is applied to past tokens only. If a token were allowed to interact with future tokens, the model could simply look up the answer and achieve high accuracy by memorization. Therefore, in the decoder unit, attention to future positions is masked out: their attention scores are set to -inf before the softmax, so they contribute nothing.
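A minimal sketch of a single masked self-attention head in PyTorch, assuming toy dimensions (B, T, C); scores above the diagonal correspond to future tokens and become zero after the softmax:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 4, 8                      # assumed toy batch size, sequence length, channels
head_size = 8

x = torch.randn(B, T, C)
key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)

# Affinities between every pair of tokens, scaled as in "Attention Is All You Need".
wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T, T)

# Causal mask: positions above the diagonal are future tokens, so their scores
# are set to -inf and turn into 0 after the softmax.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

out = wei @ v                                        # (B, T, head_size)
```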
Implementation
Basic Implementation
Here, we use the tokenized sequence X to create (context, target) data pairs.
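A minimal sketch, assuming a toy list of token ids and a context length (block_size) of 4; the targets are simply the inputs shifted one position to the left:

```python
# Assumed toy tokenized sentence (token ids) and context length.
X = [5, 17, 42, 8, 23, 4, 99, 1]
block_size = 4

x = X[:block_size]        # inputs
y = X[1:block_size + 1]   # targets: the same sequence shifted by one

for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"when the context is {context}, the target is {target}")
```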
Now, if we want each token to interact with the other tokens, for example by averaging all the tokens from its past, we can do the following.
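A minimal sketch, assuming PyTorch and toy shapes: a lower-triangular weight matrix with normalized rows gives, for each position, the average of the current and all past tokens in one matrix multiply.

```python
import torch

torch.manual_seed(0)
B, T, C = 1, 4, 2                  # assumed toy batch, time, channels
x = torch.randn(B, T, C)

# Lower-triangular weights: row t holds equal weights over positions 0..t,
# so each output row is the running average of the current and past tokens.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

xbow = wei @ x                     # (T, T) @ (B, T, C) -> (B, T, C)
```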