GPT OpenAI
Brief info about GPTs
Tokenization
The process of converting raw text into a numeric representation. Different kinds of schemes can be used, for example:
Character level (a, b, c, d, e, f, ...).
Word level - the vocabulary is usually too large for training to converge well.
Wordpiece / sub-word level - e.g. SentencePiece, made by Google.
```python
import sentencepiece as spm

# Load a trained SentencePiece model (the path here is illustrative).
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("This is a test"))
# e.g. [284, 47, 11, 4, 15, 400] (the ids depend on the trained model)
```

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
# True
```

For more info, refer to the respective library documentation.
Vocabulary Size vs Encoding Size trade-off
With character-level tokenization we get a very small vocabulary, but the encoded sequences become very long. With word-level tokenization we can encode long sentences with few tokens, but the vocabulary becomes huge. Therefore, intermediate techniques such as byte-pair encoding (BPE) and other sub-word tokenization schemes were introduced.
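As a rough illustration of the trade-off (assuming tiktoken is installed; the sentence is just an example), the same text produces far more tokens at the character level than under GPT-2's BPE vocabulary:

```python
import tiktoken

text = "Alexander the Great was the king of Macedon"

# Character-level: tiny vocabulary, but one token per character.
char_tokens = list(text)

# BPE (GPT-2): ~50k-entry vocabulary, far fewer tokens per sentence.
enc = tiktoken.get_encoding("gpt2")
bpe_tokens = enc.encode(text)

print(len(char_tokens), len(bpe_tokens))   # character count vs. BPE token count
```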
Decoder
GPT is trained on a next-token prediction task: given some context, the model tries to predict the next token.
Sentence - "Alexander the Great was the king of the kingdom of Macedon"
Context - "Alexander the ___"
GPT - Great
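A minimal sketch of this idea using the pretrained GPT-2 model from the Hugging Face transformers library (not part of the original text; the actual completion depends on the model, so it is not guaranteed to be "Great"):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

context = "Alexander the"
ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits            # (1, seq_len, vocab_size)

# The logits at the last position score every vocabulary entry as a
# candidate next token; greedily pick the most likely one.
next_id = logits[0, -1].argmax()
print(tokenizer.decode(next_id.item()))
```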
Self-attention
It is a novel idea introduced in the "Attention Is All You Need" paper: each token interacts with every other token in its context to compute an affinity, i.e. how likely the tokens are to appear together.
Given that token X has occurred somewhere in the context of token Y, it will generate token Z in a future context.
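A sketch of how such affinities are computed in a single self-attention head (the dimensions and layer names below are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

B, T, C = 1, 4, 8                     # batch, tokens, embedding size (illustrative)
head_size = 4
x = torch.randn(B, T, C)              # token embeddings

# Each token emits a query ("what am I looking for?") and a key
# ("what do I contain?"); their scaled dot product is the affinity.
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)

q, k = query(x), key(x)                                # (B, T, head_size)
affinity = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T)
print(affinity.shape)
```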
In the decoder, self-attention is applied over past tokens only. If a token could interact with future tokens, the model could simply look up the answer and reach a trivially high accuracy instead of learning to predict. Therefore, in the decoder block, attention to future positions is masked out: their scores are set to -inf before the softmax.
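A minimal, self-contained sketch of that causal masking (here the raw scores are just toy dot products of the embeddings with themselves, rather than learned query-key affinities):

```python
import torch
import torch.nn.functional as F

B, T, C = 1, 4, 8
x = torch.randn(B, T, C)
scores = x @ x.transpose(-2, -1)                 # raw token-to-token scores (B, T, T)

# Causal mask: position t may only attend to positions <= t.
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))

weights = F.softmax(scores, dim=-1)              # each row sums to 1 over allowed positions
out = weights @ x                                # (B, T, C): aggregated past information
```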
Implementation
Basic Implementation
Here, we use the tokenized sentence X to create (input, target) training pairs.
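A minimal sketch of building such pairs from a tokenized sequence (the token ids and the context length block_size below are made up for illustration):

```python
import torch

X = torch.tensor([15, 47, 11, 4, 284, 400, 31, 9])   # hypothetical token ids
block_size = 4                                         # maximum context length

x = X[:block_size]          # inputs
y = X[1:block_size + 1]     # targets, shifted by one position

# Every prefix of the block is a training example: given the context,
# predict the next token.
for t in range(block_size):
    context = x[: t + 1]
    target = y[t]
    print(f"{context.tolist()} -> {target.item()}")
```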
Now, if we want each token to interact with the other tokens, for example by averaging over all past tokens, we can do the following.
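For example, a straightforward (but slow) double loop that replaces each token with the average of itself and every earlier token (dimensions are illustrative):

```python
import torch

B, T, C = 2, 4, 3                  # batch, time, channels (illustrative)
x = torch.randn(B, T, C)

xbow = torch.zeros(B, T, C)        # running average of the past at each position
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)
```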
Maths Trick
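The content of this section is missing here; presumably it refers to the observation that the same past-averaging can be done with a single matrix multiplication by a row-normalised lower-triangular matrix, which is exactly the shape masked self-attention takes. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

B, T, C = 2, 4, 3
x = torch.randn(B, T, C)

# Loop version, for reference: average of all past tokens at each position.
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Trick: a row-normalised lower-triangular matrix does the same averaging
# in one matrix multiplication.
tril = torch.tril(torch.ones(T, T))
weights = tril / tril.sum(dim=1, keepdim=True)        # each row sums to 1
xbow2 = weights @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# Equivalent softmax form: this is what self-attention uses, with learned
# affinities in place of the zeros.
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
xbow3 = F.softmax(scores, dim=-1) @ x

assert torch.allclose(xbow, xbow2, atol=1e-6) and torch.allclose(xbow, xbow3, atol=1e-6)
```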