NLP Cleaning Pipeline
This post collects the snippets I use most in my natural language processing adventures.
Imports
import re
import os
from bs4 import BeautifulSoup as beautifulsoup
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer as wordnetlemmatizer
Lowercase
def lowercase(text):
    """
    returns lowercase text
    """
    # parameter named `text` so it does not shadow the built-in input()
    return text.lower()
Remove Punctuation
def remove_punctuation(text):
    """
    returns text without punctuation or digits
    """
    # the translation table deletes every ASCII punctuation character and the digits 0-9
    return text.translate(str.maketrans("", "", string.punctuation + "1234567890"))
Remove White space
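One simple way to handle this step, sketched under the assumption that "white space" means leading, trailing, and repeated whitespace (spaces, tabs, newlines). The function name `remove_whitespace` is my own choice for illustration.

```python
def remove_whitespace(text):
    """
    returns text with leading/trailing whitespace stripped
    and inner runs of whitespace collapsed to a single space
    """
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())


print(remove_whitespace("  hello \n\t world  "))  # prints: hello world
```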
Remove Emoji
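A rough sketch of emoji removal with a regular expression. The Unicode ranges below are an assumption about which characters count as emoji and are not exhaustive; the names `EMOJI_PATTERN` and `remove_emoji` are illustrative.

```python
import re

# Character class covering some common emoji blocks (not a complete list)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def remove_emoji(text):
    """
    returns text without characters matched by EMOJI_PATTERN
    """
    return EMOJI_PATTERN.sub("", text)
```

Removing an emoji can leave extra spaces behind, so it probably makes sense to run this before the whitespace step.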
Remove HTML
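A minimal sketch of HTML removal using BeautifulSoup's `get_text()` with the built-in `"html.parser"` backend. The fallback branch (a crude regex strip plus `html.unescape`) is my own addition for environments where `bs4` is not installed, and it is far less robust than a real parser.

```python
import html
import re


def remove_html(text):
    """
    returns text with HTML tags removed
    """
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        # assumption: bs4 is unavailable, so fall back to a crude tag strip
        return html.unescape(re.sub(r"<[^>]+>", "", text))
    return BeautifulSoup(text, "html.parser").get_text()


print(remove_html("<p>Hello <b>world</b></p>"))  # prints: Hello world
```

Since `string.punctuation` contains `<`, `>` and `/`, this step most likely needs to run before punctuation removal, otherwise the tags are mangled into plain words.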
Tokenize into words
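A sketch of word tokenization built on NLTK's `word_tokenize`, which needs the "punkt" tokenizer data to be downloaded first. The regex fallback is an assumption of mine for when NLTK or its data is missing, and it gives cruder results (for example, it drops punctuation tokens).

```python
import re


def tokenize(text):
    """
    returns a list of word tokens
    """
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        # NLTK or its tokenizer data is not available: split on word characters
        return re.findall(r"\w+", text)


print(tokenize("hello world"))  # prints: ['hello', 'world']
```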
Remove Stop Words
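A small sketch of stop word filtering. The stop word set is passed in as a parameter; with NLTK it would typically be built as `set(stopwords.words("english"))` after downloading the "stopwords" corpus. The function name and signature are illustrative.

```python
def remove_stop_words(tokens, stop_words):
    """
    returns the tokens that are not in stop_words (case-insensitive)
    """
    return [token for token in tokens if token.lower() not in stop_words]


# tiny hand-written set for demonstration only
demo_stop_words = {"this", "is", "a"}
print(remove_stop_words(["This", "is", "a", "test"], demo_stop_words))  # prints: ['test']
```

Converting the stop word list to a set keeps each membership check constant-time, which matters on large corpora.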
Lemmatize
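A sketch of lemmatization, assuming NLTK's `WordNetLemmatizer` (which needs the "wordnet" corpus downloaded). The optional `lemmatizer` parameter is my own addition so that any object with a `lemmatize(word)` method can be swapped in.

```python
def lemmatize(tokens, lemmatizer=None):
    """
    returns the lemma of every token
    """
    if lemmatizer is None:
        # imported lazily so the function can be defined without NLTK installed
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]
```

Without a part-of-speech argument, `WordNetLemmatizer.lemmatize` treats every word as a noun, so verbs such as "running" are left unchanged.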
Total Pipeline
Putting all the modules defined above together.
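One way the composition could look, sketched with only the standard library so it runs on its own. `run_pipeline` and `STEPS` are hypothetical names, and `str.split` stands in for the real tokenizer; the other steps (emoji, HTML, stop words, lemmatization) would slot into the list the same way.

```python
import string


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation + "1234567890"))


def remove_whitespace(text):
    return " ".join(text.split())


def run_pipeline(text, steps):
    """
    applies each step in order, feeding the output of one into the next
    """
    for step in steps:
        text = step(text)
    return text


# string-to-string steps come first, the tokenizer stand-in comes last
STEPS = [lowercase, remove_punctuation, remove_whitespace, str.split]

print(run_pipeline("  Hello, World! 123  ", STEPS))  # prints: ['hello', 'world']
```

Keeping the steps in a plain list makes the order explicit and easy to change, which matters because some steps depend on running before others (HTML removal before punctuation removal, for instance).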