GPT (OpenAI)
Brief info about GPTs
Tokenization
The process of converting raw text into a numeric representation. Different tokenization schemes can be used, for example:
Character level (a, b, c, d, e, f, ...).
Word level - the vocabulary is usually too large for training to converge.
Sub-word level - e.g. WordPiece, or SentencePiece made by Google.
For more info, refer to this.
Vocabulary Size vs Encoding Size trade-off
With character-level tokenization we get a very small vocabulary, but the encoded sequences become very long. With word-level tokenization, long sentences encode into only a few tokens, but the vocabulary size becomes huge. Intermediate techniques such as BPE and other sub-word tokenizers were therefore introduced to balance the two.
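To make the trade-off concrete, here is a tiny, self-contained Python sketch (not from the original notes; the sentence and the naive index-based encoding are purely illustrative) comparing the two extremes:

```python
# Toy comparison of character-level vs word-level tokenization on one sentence.
sentence = "Alexander the Great was the king of the kingdom of Macedon"

# Character level: tiny vocabulary, long encoded sequence.
char_vocab = sorted(set(sentence))
char_encoding = [char_vocab.index(c) for c in sentence]

# Word level: short encoded sequence, but the vocabulary grows with the corpus.
word_vocab = sorted(set(sentence.split()))
word_encoding = [word_vocab.index(w) for w in sentence.split()]

print(f"char level: vocab={len(char_vocab)}, encoding length={len(char_encoding)}")
print(f"word level: vocab={len(word_vocab)}, encoding length={len(word_encoding)}")
# Sub-word tokenizers (BPE, WordPiece, SentencePiece) sit between these two extremes.
```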
Decoder
GPT is trained on the next-token prediction task: given some context, the model tries to predict the next token.
Sentence - "Alexander the Great was the king of the kingdom of Macedon"
Context - "Alexander the __"
GPT - Great
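As an illustration only, the sketch below runs next-token prediction with the pretrained GPT-2 model from the Hugging Face transformers library; the library, the model name, and the printed continuation are assumptions here, not part of the original notes, and the actual output may differ.

```python
# Minimal next-token prediction sketch with a pretrained GPT-2 (assumed setup).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "Alexander the"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits        # (batch, seq_len, vocab_size)

next_token_id = logits[0, -1].argmax().item()   # most likely next token
print(tokenizer.decode(next_token_id))          # e.g. " Great" (output may vary)
```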
Self-attention
It is the core idea introduced in the "Attention Is All You Need" paper: each token interacts with every other token in its context to produce an affinity, i.e. a score for how likely the tokens are to appear together.
Given that token X has occurred somewhere in the context of token Y, it will generate token Z in the future context.
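These affinities can be computed as scaled dot products between learned query and key projections of the tokens. Below is a minimal single-head sketch in PyTorch; the sizes (B, T, C, head_size) are arbitrary assumed values, not taken from the notes.

```python
# Minimal single-head self-attention sketch (illustrative only).
import torch
import torch.nn.functional as F

B, T, C = 1, 8, 32                 # batch, sequence length, embedding size (assumed)
head_size = 16

x = torch.randn(B, T, C)           # token embeddings
query = torch.nn.Linear(C, head_size, bias=False)
key = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)                     # (B, T, head_size)
affinity = q @ k.transpose(-2, -1) / head_size ** 0.5    # (B, T, T) scaled dot products
weights = F.softmax(affinity, dim=-1)                    # each row sums to 1
out = weights @ v                                        # weighted sum of values
```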
In the case of the decoder, self-attention is applied to past tokens only. If a token were allowed to interact with future tokens, the model could simply memorize the answer and achieve artificially high accuracy. Therefore, in the decoder unit, the future positions are masked out.
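A minimal sketch of this causal masking in PyTorch (tensor sizes are assumed): the affinities for future positions are set to -inf before the softmax, so their attention weights become zero.

```python
# Causal (decoder) masking of an attention affinity matrix.
import torch
import torch.nn.functional as F

T = 8
affinity = torch.randn(T, T)                 # stand-in for q @ k.T / sqrt(head_size)
mask = torch.tril(torch.ones(T, T))          # lower-triangular: 1 = allowed position
affinity = affinity.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(affinity, dim=-1)
print(weights[0])                            # the first token can only attend to itself
```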
Implementation
Basic Implementation
Here, we use the tokenized sentence X to create (context, target) data pairs, as sketched below.
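A possible sketch of this pairing, assuming X is a 1-D tensor of token ids and block_size is a hypothetical context-window length (both made up for illustration):

```python
# Build (context, target) pairs from a tokenized sequence X.
import torch

X = torch.tensor([15, 3, 27, 8, 42, 5, 19, 11])   # example token ids (made up)
block_size = 4

xb = X[:block_size]          # inputs: the context
yb = X[1:block_size + 1]     # targets: the same sequence shifted by one token

for t in range(block_size):
    context = xb[:t + 1]
    target = yb[t]
    print(f"context {context.tolist()} -> target {target.item()}")
```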
Now, if we want each token to interact with the other tokens, for example by averaging all the tokens from its past, we can do the following.
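A straightforward (but slow) way to do this is with an explicit loop; the toy sizes below are assumptions:

```python
# Naive version: for each position t, average the embeddings of all tokens
# up to and including t (a simple "bag of previous tokens").
import torch

B, T, C = 1, 8, 2            # batch, time, channels (assumed toy sizes)
x = torch.randn(B, T, C)

xbow = torch.zeros(B, T, C)  # running averages of the past
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)
```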
Maths Trick
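The same running averages can be computed with a single matrix multiplication against a row-normalised lower-triangular matrix; the softmax form is the one that generalises to learned attention weights. This is a sketch with assumed toy sizes, in the spirit of the trick referred to above:

```python
# Matrix-multiplication trick for averaging over past tokens.
import torch
import torch.nn.functional as F

B, T, C = 1, 8, 2
x = torch.randn(B, T, C)

wei = torch.tril(torch.ones(T, T))            # lower-triangular ones
wei = wei / wei.sum(dim=1, keepdim=True)      # each row averages over the past
xbow2 = wei @ x                               # (T, T) @ (B, T, C) -> (B, T, C)

# Equivalent softmax form: -inf on future positions, then normalise.
mask = torch.tril(torch.ones(T, T))
wei2 = torch.zeros(T, T).masked_fill(mask == 0, float("-inf"))
wei2 = F.softmax(wei2, dim=-1)
xbow3 = wei2 @ x                              # same result as xbow2
```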