# GPT OpenAI

## Tokenization

Tokenization is the process of converting a wall of text into its numeric representation. We can use different types of schemes, for example:

* Character level (abcdef...).
* Word level - the vocabulary is usually too large for training to converge.
* Subword level - [SentencePiece](https://github.com/google/sentencepiece) made by [Google](https://research.google/).

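```py
import sentencepiece as spm

# Load a trained SentencePiece model; the path is the test model shipped
# in the SentencePiece repository, so adjust it to your own model file.
sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
sp.encode('This is a test')
# [284, 47, 11, 4, 15, 400]  (the exact ids depend on the model loaded)
```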

* Byte-pair level - [tiktoken](https://github.com/openai/tiktoken) made by [OpenAI](https://openai.com/).

```py
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
# No output: the assertion passes because decoding reverses encoding.
```

For more information, refer to the [Hugging Face tokenizers documentation](https://huggingface.co/docs/tokenizers/components).

### Vocabulary Size vs Encoding Size trade-off

With character-level tokenization, we get a very small vocabulary, but each encoded sequence is very long. With word-level tokenization, long sentences encode into short sequences, but the vocabulary becomes huge. Intermediate sub-word techniques such as BPE were introduced to balance the two.
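To make the trade-off concrete, here is a minimal sketch in plain Python. The sample text is made up for illustration and deliberately uses only a handful of letters, so the effect is visible even on one line of text; on a real corpus the character vocabulary stays bounded by the alphabet while the word vocabulary keeps growing.

```python
# Made-up sample text: many distinct words built from very few letters.
text = "ant eat tea ate tan net ten neat seat east sat set nest sent"

# Character level: every distinct character is one vocabulary entry.
char_vocab = sorted(set(text))
char_encoding = [char_vocab.index(ch) for ch in text]

# Word level: every distinct whitespace-separated word is one vocabulary entry.
words = text.split()
word_vocab = sorted(set(words))
word_encoding = [word_vocab.index(w) for w in words]

print("char level: vocab", len(char_vocab), "encoding length", len(char_encoding))
print("word level: vocab", len(word_vocab), "encoding length", len(word_encoding))
# char level: vocab 6 encoding length 60
# word level: vocab 14 encoding length 14
```

The character-level encoding is more than four times longer, while the word-level vocabulary is already more than twice as large.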

## Decoder

GPT is trained on the next-token prediction task: given some context, the model tries to predict the token that comes next.

> *Sentence* - "Alexander the Great was the king of the kingdom of Macedon" *context* - "Alexander the \_\_"
>
> **GPT** - Great

### Self-attention

It is an idea introduced in the "Attention Is All You Need" [paper](https://arxiv.org/abs/1706.03762), which says that each token should interact with every other token in its context to generate an affinity, or likelihood of appearing together.

> Given token `X` has occurred somewhere in the context of the token `Y`, it will generate token `Z` in future context.

In the decoder, self-attention is applied to past tokens only. If a token could interact with future tokens, the model could simply look up the answer and reach high accuracy without learning anything. Therefore, in the decoder unit, the future positions are masked out of the attention computation (a causal mask), so each token attends only to itself and the tokens before it.
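One common way to implement this masking is to set the affinity scores of future positions to negative infinity before the softmax, so they receive exactly zero weight. The sketch below does this in plain Python; the function name `causal_attention_weights` and the score values are hypothetical, chosen only for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over `scores`, ignoring future positions (j > i)."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf, which exp() turns into exactly 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical affinity scores between 3 tokens (row = current token).
scores = [
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

The first token can only attend to itself, and the large score of 3.0 in the second row has no effect because it belongs to a future position.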

### Implementation

#### Basic Implementation

Here, we use the tokenized sentence X to create data-pairs.

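```py
X = [1, 2, 3, 4, 5, 6]  # tokenized sentence

for i in range(len(X) - 1):
    print(f"\nData pair {i + 1}")
    print("Input:", X[:i + 1])   # all tokens up to and including position i
    print("Output:", X[i + 1])   # the next token to predict

# Data pair 1
# Input: [1]
# Output: 2
#
# Data pair 2
# Input: [1, 2]
# Output: 3
#
# Data pair 3
# Input: [1, 2, 3]
# Output: 4
#
# ... and so on, 5 pairs in total
```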

Now, suppose we want each token to interact with the other tokens, for example by averaging all the tokens from the past. We can do the following.

#### Maths Trick

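```py
import torch

# Start from a matrix of ones.
wei = torch.ones((2, 2))
print(wei)
# 1 1
# 1 1

# Keep the lower triangle so each row only sees itself and the past.
wei = torch.tril(wei)
print(wei)
# 1 0
# 1 1

# Normalize each row to sum to 1, turning it into averaging weights.
wei = wei / wei.sum(dim=1, keepdim=True)
print(wei)
#  1  0
# .5 .5
```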
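To see why this matrix is useful, multiply it with the token features: row `i` of the result is the average of tokens `0..i`. The sketch below spells out the matrix product in plain Python so the arithmetic is visible; the feature values in `x` are hypothetical, and in practice the same product would be a single matrix multiplication on tensors.

```python
# Hypothetical token features: 3 tokens, 2 features each.
x = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
T = len(x)

# Lower-triangular matrix of ones, with each row normalized to sum to 1.
wei = [[1.0 if j <= i else 0.0 for j in range(T)] for i in range(T)]
wei = [[w / sum(row) for w in row] for row in wei]

# Matrix product wei @ x: row i becomes the mean of x[0], ..., x[i].
out = [
    [sum(wei[i][k] * x[k][c] for k in range(T)) for c in range(len(x[0]))]
    for i in range(T)
]
for row in out:
    print([round(v, 3) for v in row])
```

The first row is the first token unchanged, the second row is the mean of the first two tokens, and the last row is the mean of all three.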

