📜 NLP Cleaning Pipeline

This post collects the snippets I use most often in my natural language processing adventures.

Imports

import re
import os
from bs4 import BeautifulSoup as beautifulsoup
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer as wordnetlemmatizer

Lowercase

def lowercase(text):
    """
    Return the text converted to lowercase.
    """
    return text.lower()

Remove Punctuation

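def remove_punctuation(text):
    """
    Return the text with ASCII punctuation and digits removed.
    """
    # str.translate deletes every character listed in the third maketrans argument
    return text.translate(str.maketrans("", "", string.punctuation + string.digits))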

Remove Whitespace
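
One possible version of this step, as a minimal sketch: it assumes "removing whitespace" means trimming the ends and collapsing every run of spaces, tabs, and newlines into a single space. The function name `remove_whitespace` is my own choice for illustration.

```python
def remove_whitespace(text):
    """
    Return the text with leading/trailing whitespace stripped and
    internal runs of whitespace collapsed to a single space.
    """
    # str.split() with no arguments splits on any whitespace run
    # and drops empty strings, so joining with " " normalizes spacing.
    return " ".join(text.split())


print(remove_whitespace("  hello \n\t world  "))  # hello world
```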

Remove Emoji
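
A rough sketch of emoji removal with a regular expression. The Unicode ranges below are an assumption about which characters count as emoji; they cover the common blocks but are not exhaustive, and the names `EMOJI_PATTERN` and `remove_emoji` are illustrative.

```python
import re

# Character class built from common emoji blocks (not a complete list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)


def remove_emoji(text):
    """
    Return the text with characters in the emoji ranges above removed.
    """
    return EMOJI_PATTERN.sub("", text)
```

Removing an emoji leaves the surrounding spaces in place, so this step pairs well with a whitespace cleanup afterwards.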

Remove HTML
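
A minimal sketch of HTML stripping with Beautiful Soup, assuming the goal is to keep only the visible text. It uses Python's built-in `"html.parser"` backend so no extra parser needs to be installed; the function name `remove_html` is illustrative.

```python
from bs4 import BeautifulSoup


def remove_html(text):
    """
    Return only the text content of an HTML string, with tags dropped.
    """
    # get_text() concatenates all text nodes in document order.
    return BeautifulSoup(text, "html.parser").get_text()


print(remove_html("<p>Hello <b>world</b></p>"))  # Hello world
```

This step probably belongs before punctuation removal, since stripping `<` and `>` first would leave tag names behind as ordinary words.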

Tokenize into words
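
A small sketch of word tokenization built around NLTK's `word_tokenize`. The optional `tokenizer` argument is my own addition for illustration: it lets you swap in another callable, which is handy when the NLTK Punkt tokenizer data has not been downloaded yet.

```python
def tokenize(text, tokenizer=None):
    """
    Return a list of word tokens for the given text.
    """
    if tokenizer is None:
        # Imported lazily; word_tokenize needs the NLTK Punkt data installed.
        from nltk.tokenize import word_tokenize
        tokenizer = word_tokenize
    return tokenizer(text)


# Usage with a plain whitespace splitter standing in for NLTK:
print(tokenize("hello brave new world", tokenizer=str.split))
```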

Remove Stop Words
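
A minimal sketch of stop word filtering on a list of tokens. By default it assumes the English list from NLTK's `stopwords` corpus (which has to be downloaded once with `nltk.download("stopwords")`); the `stop_words` parameter is an illustrative addition for passing a custom set.

```python
def remove_stop_words(tokens, stop_words=None):
    """
    Return the tokens that are not in the stop word set.
    """
    if stop_words is None:
        # Imported lazily; requires the NLTK "stopwords" corpus.
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words("english"))
    # Compare case-insensitively so "The" is treated like "the".
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["This", "is", "a", "test"], stop_words={"this", "is", "a"}))
# ['test']
```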

Lemmatize
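
A short sketch of lemmatizing a token list with NLTK's `WordNetLemmatizer`, which needs the WordNet corpus (`nltk.download("wordnet")`). The `lemmatizer` parameter is an assumption of mine so that one instance can be reused across calls; without part-of-speech tags, `lemmatize` treats every token as a noun.

```python
def lemmatize(tokens, lemmatizer=None):
    """
    Return the lemma of each token, in the same order.
    """
    if lemmatizer is None:
        # Imported lazily; requires the NLTK "wordnet" corpus.
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
    # Any object with a lemmatize(word) method works here.
    return [lemmatizer.lemmatize(token) for token in tokens]
```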

Total Pipeline

Putting all the modules defined above together.
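
A sketch of how the steps might be chained, limited to the string-to-string steps so it runs with the standard library alone. The helpers are repeated here only to keep the block self-contained; the `clean` function, its `steps` parameter, and the order of the steps are assumptions for illustration.

```python
import string


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Deletes ASCII punctuation and digits.
    return text.translate(str.maketrans("", "", string.punctuation + string.digits))


def remove_whitespace(text):
    # Collapses whitespace runs and trims the ends.
    return " ".join(text.split())


def clean(text, steps=(lowercase, remove_punctuation, remove_whitespace)):
    """
    Apply each cleaning step to the text, in order, and return the result.
    """
    for step in steps:
        text = step(text)
    return text


print(clean("  Hello, World! 123  "))  # hello world
```

Token-level steps such as stop word removal and lemmatization would follow a tokenization step, since they operate on lists of tokens rather than on a string.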