Text-Cleaning

I have chosen Bumble Application Reviews as my task. Here, I have showcased the step-by-step process involved in textual data analysis.

The first step is to take the textual data from the review column and turn it into a list so that you can see everything that needs to be changed and observed.

After that you start the processing work: first make the sentences lower case, then remove special characters, emojis and punctuation. Then remove the stop words, and after completing all the above steps do the tokenization, followed by either stemming or lemmatization.

Then, for analysis, you can go ahead with n-grams or build a word cloud, depending on the need.

The steps are written below in bullet-point format for clear understanding.

  1. Import the libraries initially (pandas, numpy, re).

  2. Load the dataset.

  3. Review data.head(), then drop the redundant features.

  4. Check for missing records.

  5. Fill the missing records accordingly, or drop them if needed.

  6. Check the feature that has the textual content; if it is a review, you can go ahead with the code below:

    for index, text in enumerate(data['content']):
        print('Review %d:\n' % (index + 1), text)

          or

    ' '.join(data['content'].tolist())

This line of code is the one used most often. If the textual content is very large, I would suggest doing this in a separate notebook, as doing it in the same notebook will make the notebook size too large.
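Putting steps 1 to 5 together, here is a minimal sketch. The file name and the dropped column names are assumptions; replace them with whatever your dataset actually uses.

    import pandas as pd
    import numpy as np
    import re

    # Load the dataset (file name is an assumption -- point it at your own CSV).
    data = pd.read_csv('bumble_reviews.csv')

    # Review the data, then drop redundant features (column names are assumptions).
    print(data.head())
    data = data.drop(columns=['reviewId', 'userName'], errors='ignore')

    # Check for missing records.
    print(data.isnull().sum())

    # Fill missing review text with an empty string, or drop those rows instead.
    data['content'] = data['content'].fillna('')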

  7. Check the number (len) of words, special characters and stop words each textual record contains.

  8. If you want to see the words themselves, just remove the len function from the command and you will see the list of words, the list of special characters and the list of stop words each record has (see the sketch below).
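A minimal sketch of those counts, assuming the review text lives in data['content'] and that the NLTK stop-word corpus has already been downloaded:

    import re
    from nltk.corpus import stopwords
    # import nltk; nltk.download('stopwords')  # run once if the corpus is missing

    stop = set(stopwords.words('english'))

    # Word count per review.
    data['word_count'] = data['content'].apply(lambda x: len(str(x).split()))

    # Special-character count (anything that is not a letter, digit or whitespace).
    data['special_char_count'] = data['content'].apply(
        lambda x: len(re.findall(r'[^A-Za-z0-9\s]', str(x))))

    # Stop-word count per review; drop the len(...) to see the words themselves.
    data['stopword_count'] = data['content'].apply(
        lambda x: len([w for w in str(x).lower().split() if w in stop]))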

  9. Now comes the processing work.

  10. First, we need to make all the statements look uniform, that is, convert them to lower case. For example:
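A one-line sketch, assuming the text column is data['content']:

    # Convert every review to lower case so the text is uniform.
    data['content'] = data['content'].astype(str).str.lower()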

  11. Remove the punctuation, numbers, special characters and emojis from the records, for example:
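A minimal regex-based sketch; it assumes the text has already been lower-cased and simply keeps ASCII letters and spaces, which drops punctuation, numbers, special characters and emojis in one pass:

    import re

    def clean_text(text):
        # Keep only lower-case letters and spaces.
        text = re.sub(r'[^a-z\s]', ' ', text)
        # Collapse the extra whitespace left behind.
        return re.sub(r'\s+', ' ', text).strip()

    data['content'] = data['content'].apply(clean_text)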

  12. Now, let's remove the stop words from our text data. To remove the stop words, we need to import the library: from nltk.corpus import stopwords and then stop = stopwords.words('english'). A sketch is shown below.
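A minimal sketch of the stop-word removal, assuming data['content'] has already been cleaned:

    from nltk.corpus import stopwords
    # import nltk; nltk.download('stopwords')  # run once if the corpus is missing

    stop = set(stopwords.words('english'))

    # Keep only the words that are not in the stop-word list.
    data['content'] = data['content'].apply(
        lambda x: ' '.join(w for w in x.split() if w not in stop))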

  13. Next, we can check for spelling mistakes. I used the textblob library, which helped me rectify spelling errors, but at the same time it also changed already-correct spellings into meaningless words, which shows the limits of the library. However, if the data has a lot of spelling errors it can still be useful, and note that there are other libraries, such as cleantext, that can also be used to check spelling. I used it here to show how the library works, but dropped it later as it was hurting the quality of the text.
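A minimal sketch of spelling correction with TextBlob, applied to a small sample first so you can judge whether the corrections help or hurt (correct() can also be slow on large datasets):

    from textblob import TextBlob

    # Correct spelling on a small sample and inspect the result before
    # deciding whether to apply it to the whole column.
    sample = data['content'].head(5)
    print(sample.apply(lambda x: str(TextBlob(x).correct())))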

  14. Coming to the three major pillars of textual pre-processing:

    1. Tokenization -> Tokenization divides the sentence into a sequence of words.
    2. Stemming -> Stemming removes suffixes from words, for example 'ing' and 'ly', and tries to get to the root word.
    3. Lemmatization -> Lemmatization is similar to Stemming, but there are many instances where Lemmatization outperforms Stemming in quality, and that is the reason we prefer Lemmatization.
  15. Libraries for tokenization: import nltk and from nltk.tokenize import WhitespaceTokenizer.

  16. Libraries for stemming: first import the library with from nltk.stem import PorterStemmer (other stemmers such as SnowballStemmer can also be used), then st = PorterStemmer().

  17. Libraries for lemmatization: from textblob import Word. A combined sketch of all three steps is shown below.
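A minimal sketch that runs tokenization, stemming and lemmatization on the cleaned reviews; the new column names are assumptions, and lemmatization via TextBlob needs the NLTK WordNet corpus downloaded once:

    from nltk.tokenize import WhitespaceTokenizer
    from nltk.stem import PorterStemmer
    from textblob import Word
    # import nltk; nltk.download('wordnet')  # run once for lemmatization

    tk = WhitespaceTokenizer()
    st = PorterStemmer()

    # Tokenization: split each review into a sequence of words.
    data['tokens'] = data['content'].apply(tk.tokenize)

    # Stemming: strip suffixes such as 'ing' / 'ly' to approximate the root word.
    data['stemmed'] = data['tokens'].apply(lambda words: [st.stem(w) for w in words])

    # Lemmatization: usually preferred over stemming because it returns real words.
    data['lemmatized'] = data['content'].apply(
        lambda x: ' '.join(Word(w).lemmatize() for w in x.split()))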

  18. We will plot our analysis. We can use a word cloud: from wordcloud import WordCloud and import matplotlib.pyplot as plt. For example:
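A minimal word-cloud sketch, assuming the cleaned text is still in data['content']:

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # Join every review into one string and generate the word cloud from it.
    all_text = ' '.join(data['content'].tolist())
    wc = WordCloud(width=800, height=400, background_color='white').generate(all_text)

    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()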

  19. Then, if you feel there are words that bring no insights to your analysis, you can add those words to the stop-word list.

    from nltk.corpus import stopwords

    stop = stopwords.words('english')
    morestopwords = ['bumble', 'dating app', 'match']
    stop.extend(morestopwords)

  20. Then do n-grams for further analysis, to look at the data from a different angle; a sketch is shown below.
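A minimal bigram-frequency sketch using NLTK's ngrams helper; change the 2 to 3 for trigrams:

    from collections import Counter
    from nltk import ngrams

    # Count the most frequent bigrams across all reviews.
    tokens = ' '.join(data['content'].tolist()).split()
    bigram_counts = Counter(ngrams(tokens, 2))
    print(bigram_counts.most_common(20))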