GPT OpenAI
Brief info about GPTs
Tokenization
The process of converting raw text into a numeric representation. Different kinds of schemes can be used, for example:
Character level (a, b, c, d, e, f, ...).
Word level - the vocabulary is usually too large for training to converge well.
Wordpiece / sub-word level - e.g. SentencePiece, made by Google.
```python
import sentencepiece as spm

# Load a trained SentencePiece model (the path here is illustrative).
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("This is a test"))
# e.g. [284, 47, 11, 4, 15, 400] (the ids depend on the trained model)
```

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
# True
```

For more info, refer to the respective library documentation.
Vocabulary Size vs Encoding Size trade-off
With character-level tokenization we get a very small vocabulary, but the encoded sequences become very long. With word-level tokenization we can encode long sentences with few tokens, but the vocabulary becomes huge. Therefore, intermediate techniques such as byte-pair encoding (BPE) and other sub-word tokenization schemes were introduced.
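As a rough illustration of the trade-off (assuming tiktoken is installed; the sentence is just an example), the same text produces far more tokens at the character level than under GPT-2's BPE vocabulary:

```python
import tiktoken

text = "Alexander the Great was the king of Macedon"

# Character-level: tiny vocabulary, but one token per character.
char_tokens = list(text)

# BPE (GPT-2): ~50k-entry vocabulary, far fewer tokens per sentence.
enc = tiktoken.get_encoding("gpt2")
bpe_tokens = enc.encode(text)

print(len(char_tokens), len(bpe_tokens))   # character count vs. BPE token count
```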
Decoder
GPT is trained on a next-token prediction task: given some context, the model tries to predict the next token.
Sentence - "Alexander the Great was the king of the kingdom of Macedon"
Context - "Alexander the ___"
GPT - Great
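A minimal sketch of this idea using the pretrained GPT-2 model from the Hugging Face transformers library (not part of the original text; the actual completion depends on the model, so it is not guaranteed to be "Great"):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

context = "Alexander the"
ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits            # (1, seq_len, vocab_size)

# The logits at the last position score every vocabulary entry as a
# candidate next token; greedily pick the most likely one.
next_id = logits[0, -1].argmax()
print(tokenizer.decode(next_id.item()))
```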
Self-attention
It is a novel idea introduced in the "Attention Is All You Need" paper: each token interacts with every other token in its context to compute an affinity, i.e. how likely the tokens are to appear together.
Given that token X has occurred somewhere in the context of token Y, it will generate token Z in a future context.
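A sketch of how such affinities are computed in a single self-attention head (the dimensions and layer names below are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

B, T, C = 1, 4, 8                     # batch, tokens, embedding size (illustrative)
head_size = 4
x = torch.randn(B, T, C)              # token embeddings

# Each token emits a query ("what am I looking for?") and a key
# ("what do I contain?"); their scaled dot product is the affinity.
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)

q, k = query(x), key(x)                                # (B, T, head_size)
affinity = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T)
print(affinity.shape)
```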
In the decoder, self-attention is applied over past tokens only. If a token could interact with future tokens, the model could simply look up the answer and reach a trivially high accuracy instead of learning to predict. Therefore, in the decoder block, attention to future positions is masked out: their scores are set to -inf before the softmax.
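A minimal, self-contained sketch of that causal masking (here the raw scores are just toy dot products of the embeddings with themselves, rather than learned query-key affinities):

```python
import torch
import torch.nn.functional as F

B, T, C = 1, 4, 8
x = torch.randn(B, T, C)
scores = x @ x.transpose(-2, -1)                 # raw token-to-token scores (B, T, T)

# Causal mask: position t may only attend to positions <= t.
tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))

weights = F.softmax(scores, dim=-1)              # each row sums to 1 over allowed positions
out = weights @ x                                # (B, T, C): aggregated past information
```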
Implementation
Basic Implementation
Here, we use the tokenized sentence X to create (input, target) training pairs.
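A minimal sketch of building such pairs from a tokenized sequence (the token ids and the context length block_size below are made up for illustration):

```python
import torch

X = torch.tensor([15, 47, 11, 4, 284, 400, 31, 9])   # hypothetical token ids
block_size = 4                                         # maximum context length

x = X[:block_size]          # inputs
y = X[1:block_size + 1]     # targets, shifted by one position

# Every prefix of the block is a training example: given the context,
# predict the next token.
for t in range(block_size):
    context = x[: t + 1]
    target = y[t]
    print(f"{context.tolist()} -> {target.item()}")
```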
Now, if we want each token to interact with the other tokens, for example by averaging over all past tokens, we can do the following.
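For example, a straightforward (but slow) double loop that replaces each token with the average of itself and every earlier token (dimensions are illustrative):

```python
import torch

B, T, C = 2, 4, 3                  # batch, time, channels (illustrative)
x = torch.randn(B, T, C)

xbow = torch.zeros(B, T, C)        # running average of the past at each position
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)
```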
Maths Trick
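The content of this section is missing here; presumably it refers to the observation that the same past-averaging can be done with a single matrix multiplication by a row-normalised lower-triangular matrix, which is exactly the shape masked self-attention takes. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

B, T, C = 2, 4, 3
x = torch.randn(B, T, C)

# Loop version, for reference: average of all past tokens at each position.
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Trick: a row-normalised lower-triangular matrix does the same averaging
# in one matrix multiplication.
tril = torch.tril(torch.ones(T, T))
weights = tril / tril.sum(dim=1, keepdim=True)        # each row sums to 1
xbow2 = weights @ x                                   # (T, T) @ (B, T, C) -> (B, T, C)

# Equivalent softmax form: this is what self-attention uses, with learned
# affinities in place of the zeros.
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
xbow3 = F.softmax(scores, dim=-1) @ x

assert torch.allclose(xbow, xbow2, atol=1e-6) and torch.allclose(xbow, xbow3, atol=1e-6)
```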