🧙GPT OpenAI
Brief info about GPTs
Tokenization
The process of converting a wall of text into its numeric representation. We can use different types of schemes, for example:
Character level (abcdef...).
Word level - mostly the vocabulary is too huge for the model to converge.
Wordpiece (subword) level - e.g. SentencePiece, made by Google.
import sentencepiece as spm
# assumes a trained SentencePiece model file (path is illustrative)
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
print(sp.encode('This is a test'))
# e.g. [284, 47, 11, 4, 15, 400]  (ids depend on the trained model)
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
# no error: decode(encode(s)) recovers the original string
For more info, refer to this.
Vocabulary Size vs Encoding Size trade-off
With character-level tokenization we get a very small vocabulary, but the encoded sequences are very long. With word-level tokenization we can encode long sentences in few tokens, but the vocabulary becomes huge. Therefore intermediate, sub-word tokenization techniques such as BPE were introduced.
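As a rough illustration of the trade-off (the sample text below is arbitrary), we can compare a character-level encoding against GPT-2's BPE encoding via tiktoken:
import tiktoken

text = "Tokenization converts text into numbers."

# Character level: tiny vocabulary, but one id per character
char_vocab = sorted(set(text))
char_ids = [char_vocab.index(c) for c in text]
print("char:", len(char_vocab), "vocab,", len(char_ids), "ids")

# BPE (GPT-2): ~50k vocabulary, far fewer ids per sentence
enc = tiktoken.get_encoding("gpt2")
bpe_ids = enc.encode(text)
print("bpe: ", enc.n_vocab, "vocab,", len(bpe_ids), "ids")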
Decoder
GPT is trained on the next-token prediction task: given some context, the model tries to predict the next token.
Sentence - "Alexander the Great was the king of the kingdom of Macedon"; context - "Alexander the __"
GPT - Great
Self-attention
It's a novel idea introduced in the "Attention Is All You Need" paper: each token interacts with every other token in its context to generate an affinity, i.e. a measure of how likely the tokens are to appear together.
Given that token X has occurred somewhere in the context of token Y, it will generate token Z in the future context.
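As a rough sketch of how these affinities can be computed (shapes and variable names here are illustrative, following the scaled dot-product formulation from the paper), each token emits a query and a key, and the dot product of every query with every key gives the pairwise affinity matrix:
import torch

T, C, head_size = 4, 8, 16            # toy sequence length, embedding size, head size
x = torch.randn(T, C)                  # token embeddings
query = torch.nn.Linear(C, head_size, bias=False)
key = torch.nn.Linear(C, head_size, bias=False)
q = query(x)                           # what each token is looking for
k = key(x)                             # what each token contains
wei = q @ k.T / head_size ** 0.5       # (T, T) affinity of every token with every other token
print(wei.shape)                       # torch.Size([4, 4])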
In the case of the decoder, self-attention is applied to the past tokens only. If a token could interact with future tokens, the model could simply memorize the answer and report an artificially high accuracy. Therefore, in the decoder unit, the future positions are masked out.
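A minimal sketch of that masking (toy shapes, illustrative names): scores to future positions are set to -inf so that the softmax assigns them zero weight.
import torch
import torch.nn.functional as F

T = 4                                   # toy sequence length
scores = torch.randn(T, T)              # raw attention scores between all token pairs
mask = torch.tril(torch.ones(T, T))     # 1 where attending is allowed (self and past)
scores = scores.masked_fill(mask == 0, float('-inf'))   # block future positions
wei = F.softmax(scores, dim=-1)         # each row now sums to 1 over past tokens only
print(wei)                              # upper triangle is all zeros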
Implementation
Basic Implementation
Here, we use the tokenized sentence X to create input/output data pairs for next-token prediction.
X = [1, 2, 3, 4, 5, 6]   # tokenized sentence

for i in range(len(X) - 1):
    print(f"\nData pair {i+1}")
    print("Input:", *X[:i+1])   # the context: all tokens seen so far
    print("Output:", X[i+1])    # the target: the next token

# Data pair 1
# Input: 1
# Output: 2
#
# Data pair 2
# Input: 1 2
# Output: 3
#
# Data pair 3
# Input: 1 2 3
# Output: 4
#
# ... and so on up to Data pair 5
Now, if we want each token to interact with the other tokens, for example by averaging over all of the past tokens, we can do the following.
Maths Trick
import torch

wei = torch.ones((2, 2))
print(wei)
# 1 1
# 1 1

wei = torch.tril(wei)   # keep only the lower triangle: a token cannot see the future
print(wei)
# 1 0
# 1 1

wei = wei / wei.sum(dim=1, keepdim=True)   # normalize each row to sum to 1
print(wei)
# 1  0
# .5 .5
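To actually perform the averaging, these weights can be multiplied with a toy batch of token embeddings (shapes and names below are illustrative): row i of wei then gives position i the mean of all tokens up to and including i.
import torch

torch.manual_seed(0)
T, C = 4, 2                               # toy sequence length and embedding size
x = torch.randn(T, C)                     # token embeddings

wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)  # row i averages tokens 0..i

xbow = wei @ x                            # (T, T) @ (T, C) -> (T, C)
print(torch.allclose(xbow[2], x[:3].mean(dim=0)))   # True: running average of the past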