Notebooks are text documents stored as JSON that contain code, markup text and other graphical elements (images, videos, plots, widgets).
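To make the JSON structure concrete, here is a minimal sketch that builds a tiny notebook by hand and inspects it with the standard library. The cell contents are invented for illustration; the top-level keys (`cells`, `metadata`, `nbformat`, `nbformat_minor`) follow the version 4 notebook format.

```python
import json

# A hand-written, illustrative notebook in the version 4 JSON layout.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Text preprocessing"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Round-trip through text, as happens when an .ipynb file is saved and reopened.
as_text = json.dumps(notebook, indent=1)
loaded = json.loads(as_text)

# Each cell records its own type, so code and markup can be told apart.
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because the file is plain JSON, any tool that reads JSON can inspect or edit a notebook without running it.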
Notebooks are composed of:
Example of text preprocessing for topic modelling with consumer complaints data.
%%bash
if [ -d data/ ]; then
    echo "Data directory exists"
else
    mkdir data
fi
if test -f data/complaints.csv; then
    echo "Data file exists"
else
    # download the zipped CSV, move it into data/ and unpack it there
    curl -LO http://files.consumerfinance.gov/ccdb/complaints.csv.zip
    mv complaints.csv.zip data/
    unzip data/complaints.csv.zip -d data/
fi
Data directory exists
Data file exists
# import the dataset
import pandas as pd
ticket_data = pd.read_csv('data/complaints.csv')
ticket_data.dropna(subset=["Consumer complaint narrative"], inplace=True)
print(ticket_data.shape)
ticket_data.head()
/tmp/ipykernel_8714/1604837686.py:4: DtypeWarning: Columns (9,16) have mixed types. Specify dtype option on import or set low_memory=False. ticket_data = pd.read_csv('data/complaints.csv')
(1158384, 18)
| | Date received | Product | Sub-product | Issue | Sub-issue | Consumer complaint narrative | Company public response | Company | State | ZIP code | Tags | Consumer consent provided? | Submitted via | Date sent to company | Company response to consumer | Timely response? | Consumer disputed? | Complaint ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 2022-12-29 | Debt collection | I do not know | Attempts to collect debt not owed | Debt is not yours | I declare under penalty of perjury ( under the... | Company has responded to the consumer and the ... | Convergent Resources, Inc. | HI | 96818.0 | Servicemember | Consent provided | Web | 2022-12-29 | Closed with explanation | Yes | NaN | 6375521 |
10 | 2022-12-24 | Checking or savings account | Checking account | Managing an account | Deposits and withdrawals | I opened up a account online XXXX weeks ago an... | Company has responded to the consumer and the ... | BMO HARRIS BANK NATIONAL ASSOCIATION | AZ | 85301.0 | Servicemember | Consent provided | Web | 2022-12-24 | Closed with explanation | Yes | NaN | 6358144 |
13 | 2022-12-16 | Credit card or prepaid card | Store credit card | Fees or interest | Unexpected increase in interest rate | When signing up with the card they never tell ... | Company has responded to the consumer and the ... | SYNCHRONY FINANCIAL | NY | 11421.0 | NaN | Consent provided | Web | 2022-12-16 | Closed with explanation | Yes | NaN | 6329064 |
15 | 2022-12-20 | Credit card or prepaid card | General-purpose credit card or charge card | Problem with a purchase shown on your statement | Credit card company isn't resolving a dispute ... | Received credit card statement dated XX/XX/22 ... | Company has responded to the consumer and the ... | U.S. BANCORP | OH | 45377.0 | NaN | Consent provided | Web | 2022-12-20 | Closed with non-monetary relief | Yes | NaN | 6338237 |
17 | 2022-12-16 | Credit reporting, credit repair services, or o... | Credit reporting | Problem with a credit reporting company's inve... | Their investigation did not fix an error on yo... | I reviewed my Consumer Reports and noticed tha... | Company has responded to the consumer and the ... | Experian Information Solutions Inc. | CA | 93727.0 | NaN | Consent provided | Web | 2022-12-16 | Closed with explanation | Yes | NaN | 6323220 |
import numpy as np
# a quick look at the average length (in characters) of the complaints in each category
ticket_data.groupby('Product')['Consumer complaint narrative'].apply(lambda x: np.mean([len(narrative) for narrative in x]))
Product
Bank account or service                                                         1243.543769
Checking or savings account                                                     1326.154371
Consumer Loan                                                                   1109.715945
Credit card                                                                     1127.125438
Credit card or prepaid card                                                     1260.682314
Credit reporting                                                                 750.135087
Credit reporting, credit repair services, or other personal consumer reports     846.862221
Debt collection                                                                  957.419525
Money transfer, virtual currency, or money service                              1217.202314
Money transfers                                                                 1153.176353
Mortgage                                                                        1651.228443
Other financial service                                                         1233.157534
Payday loan                                                                      747.893471
Payday loan, title loan, or personal loan                                       1151.683676
Prepaid card                                                                     963.180000
Student loan                                                                    1283.149795
Vehicle loan or lease                                                           1360.902611
Virtual currency                                                                 940.187500
Name: Consumer complaint narrative, dtype: float64
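The figures above are character counts, because `len` applied to a string returns its number of characters. A minimal sketch of the difference, using two made-up narratives and whitespace splitting as a rough stand-in for word tokenisation (the helper names are hypothetical):

```python
# Two invented narratives standing in for the real complaint text.
narratives = [
    "I was charged a late fee twice.",
    "The bank reopened my closed account.",
]

def mean_chars(texts):
    """Average number of characters per text."""
    return sum(len(text) for text in texts) / len(texts)

def mean_words(texts):
    """Average number of whitespace-separated words per text."""
    return sum(len(text.split()) for text in texts) / len(texts)

print(mean_chars(narratives), mean_words(narratives))
```

In the pandas version, the word-based equivalent would replace `len(narrative)` with `len(narrative.split())` inside the lambda.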
ticket_data = ticket_data[ticket_data['Product'] == 'Credit card']
# let's peek at what the narratives look like
ticket_data['Consumer complaint narrative'].iloc[:3].tolist()
['Last month I started receiving calls from unknown numbers. They did leave a voicemail to call a number back or log on to citicards.com and they could help me. I XXXX the numbers and there were multiple people suspecting the number of fraud. So I logged on to citicards.com and there were no alerts. I sent a secure message about the calls and they gave me this reply, " Dear XXXX, Thank you for contacting us. We appreciate each and every opportunity to serve you. \n\nOur records do not show that we have called you regarding your account. \n\nIf you think that the card information is at risk, please call Customer Service immediately. Once your closure request is processed, the current card is closed and a new card number is established. \nXXXX. \nIf there is any way we can be of further assistance, please feel free to contact us. \n\nSincerely, Account Specialist South Dakota \'\' So I assumed it was fraud, but the calls continued. I finally was able to answer XXXX and it said that it was a citicards account that was past due. Because of the additional time that lapsed, they reported my account as delinquent. This seems unfair since I took the exact step from their phone call and was told by Citi that they were n\'t trying to get a hold of me. Now I have a late payment on my credit report from Citi because Citi told me they were n\'t trying to contact me.', "I was misled into thinking that I could have a XXXX alternative from XXXX in XXXX GA. They set me up with a Comenity Bank account for a debt of over {$5500.00}. After XXXX weeks I let XXXX know that I had zero results. They had me wait until a full 3 months had passed to see me, all the while I was making payments to the account. \n\nAfter 12 weeks XXXX got me in, took my XXXX, and took after pictures. I almost cried, I had gained XXXX and there were no differences in my pictures. I was clearly upset with the results and wanted to speak to someone. 
The nurse said she was sorry, suggested some natural ideas help with XXXX and said a manager would call me. \n\nOnce I finally heard from a manager they tried to sell me on a second round of the procedure for another $ XXXX. I refused and contacted Comenity Bank for help. \n\nThe service from XXXX was fraud. The pictures on the sales ads were deceptive, and it should be illegal to take advantage of consumers like this. They advertise XXXX alternative and show unrealistic before and after pictures. I had zero results and informed the company after 4 weeks to let them know. \n\nAfter I filed the dispute, Comenity Bank instructed me to work my issues out with XXXX. I contacted the XXXX corporate office and was told they would offer me a credit and that a manager in GA would reach out to me. \n\nThe manager in Georgia did reach out to me. She agreed that the service was NOT successful and offered me a store credit for future services of {$2800.00}, still wanting me to pay Comenity $ XXXX. I refused this offer and went back to Comenity bank to let them know we could not resolve this ourselves. \n\nI was told by the XXXX customer service that the dispute would be refiled on XX/XX/2017. I have written on the company XXXX and client portal for updates and help with no reply other than call customer service. Today I called customer service and they said they had no record of me contacting XXXX to resolve and thought they had sent a letter to inform me. The rep said she was n't sure why the letter was n't mailed and would reopen the case. \n\nI feel like I am getting the run around from the bank and have definitely been scammed by XXXX. PLEASE HELP ME! I have two other accounts with Comenity funded ( XXXX and XXXX ) that are paid on time and XXXX is actually paid in full. \n\nI have no problem paying my bills but do have a problem being taken advantage of. \nWhen I complained on the Comenity XXXX page a consumer referred me to contact your site for help. 
I hope that you can help me and shut companies like this down. I wish I would have done more research before using XXXX. I see that many thousands of other people are going through this too. \n\nKind regards, XXXX", 'I once had a credit card with Fifth Third bank, which was closed bank in XX/XX/XXXX. In XX/XX/XXXX a charge that was stored on an online account automatically tried charging the card for the renewed subscription. Instead of Fifth Third bank declining the transaction they reopened my closed credit card without my permission. Finally, after three months of not contacting me via phone, email, or mail, I received a letter in the mail saying I owed the charge of {$77.00} ( for the service ) and an extra {$100.00} for late fees. After countless hours on the phone with them, they acknowledged my credit card was closed and they reopened it. I offered to pay the initial {$77.00} fee if they would waive the late fee for not informing me, or without reopening it without my permission. They said they would look into the dispute into a better solution. After about another month they sent back a letter saying the dispute was denied and I owed the entire fee. When I called back to ask how it could have been denied, since they reopened a closed account, they said the 120 day period has passed and there is nothing they could do about it. I have since discarded the credit card because I had no use for it after I had closed the account in XX/XX/XXXX.']
from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short, stem_text
def basic_preprocess(list_of_strings):
    """
    A basic function that takes a list of strings and runs some basic
    gensim preprocessing to tokenise each string.
    Operations:
    - convert to lowercase
    - remove html tags
    - remove punctuation
    - remove numbers
    - remove stopwords
    - remove short tokens (less than 3 characters)
    Outputs a list of lists
    """
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_numeric, remove_stopwords, strip_short]
    preproc_text = [preprocess_string(doc, CUSTOM_FILTERS) for doc in list_of_strings]
    return preproc_text
import re
def remove_twitterisms(list_of_strings):
    """
    Some regular expression statements to remove twitter-isms
    Operations:
    - remove links
    - remove @tag
    - remove #tag
    Returns list of strings with the above removed
    """
    # removing some standard twitter-isms
    list_of_strings = [re.sub(r"http\S+", "", doc) for doc in list_of_strings]
    list_of_strings = [re.sub(r"@\S+", "", doc) for doc in list_of_strings]
    list_of_strings = [re.sub(r"#\S+", "", doc) for doc in list_of_strings]
    return list_of_strings
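A quick way to see what these three patterns do is to run them on a single made-up string. The sketch below applies the same regular expressions to one document at a time; the function name `strip_twitterisms` and the sample text are illustrative only.

```python
import re

def strip_twitterisms(doc):
    """Apply the link, @tag and #tag patterns to a single string."""
    doc = re.sub(r"http\S+", "", doc)  # links
    doc = re.sub(r"@\S+", "", doc)     # @tags
    doc = re.sub(r"#\S+", "", doc)     # #tags
    return doc

sample = "Thanks @mybank for nothing #fees see https://example.com/complaint now"
cleaned = strip_twitterisms(sample)
print(cleaned.split())

# The @ pattern also trims the domain part of an email address.
email_example = strip_twitterisms("email me@example.com today")
print(email_example)
```

Note that the removed spans leave their surrounding whitespace behind, which is harmless here because the later tokenisation step splits on whitespace anyway.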
# removing emojis
# taken from https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b#gistcomment-3315605
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range: also matches CJK characters
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)
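def remove_redacted(list_of_strings):
    """Remove redaction placeholders (runs of two or more x/X, e.g. XXXX) from each string."""
    list_of_strings = [re.sub(r"(x|X){2,}", "", doc) for doc in list_of_strings]
    return list_of_strings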
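One caveat with matching any run of two or more `x` characters is that it also cuts into ordinary words containing "xx". The sketch below shows a possible stricter variant that only matches whole tokens, using word boundaries; the name `remove_redacted_tokens` and the sample sentence are made up for illustration, and whether the stricter behaviour is wanted depends on the data.

```python
import re

# Only match runs of x/X that form a complete token.
REDACTED = re.compile(r"\b[xX]{2,}\b")

def remove_redacted_tokens(doc):
    """Remove standalone redaction placeholders such as XXXX or XX."""
    return REDACTED.sub("", doc)

text = "I called XXXX about my Exxon card on XX/XX/2017"

loose = re.sub(r"(x|X){2,}", "", text)   # pattern used in the notebook
strict = remove_redacted_tokens(text)    # word-boundary variant

print(loose)
print(strict)
```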
from gensim.models.phrases import Phrases
def n_gram(tokens):
    """Identifies common two/three word phrases using gensim module."""
    # Add bigrams and trigrams to docs (bigrams must appear 10 times or more;
    # the trigram model uses gensim's default min_count).
    # includes threshold kwarg (threshold score required by bigram)
    bigram = Phrases(tokens, min_count=10, threshold=100)
    trigram = Phrases(bigram[tokens], threshold=100)
    for idx, doc in enumerate(tokens):
        # apply both models before appending, so they see the original tokens;
        # the trigram model is applied to bigram output, matching how it was trained
        bigram_doc = bigram[doc]
        trigram_doc = trigram[bigram_doc]
        for token in bigram_doc:
            if '_' in token and token not in doc:
                # Token is a bigram, add to document.
                doc.append(token)
        for token in trigram_doc:
            if '_' in token and token not in doc:
                # Token is a trigram, add to document.
                doc.append(token)
    return tokens
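`Phrases` decides which adjacent token pairs to join by scoring each pair and keeping those whose score exceeds `threshold`. The sketch below is a simplified pure-Python illustration of that idea, assuming a scoring formula of the form `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. It is not gensim's implementation (for example, it counts only single tokens in the vocabulary), and the function name and toy corpus are made up.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score adjacent token pairs; higher means the pair co-occurs more than chance."""
    unigrams = Counter(token for doc in docs for token in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too rare to be considered at all
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores

docs = [
    ["credit", "card", "late", "fee"],
    ["credit", "card", "closed"],
    ["late", "fee", "credit", "card"],
    ["card", "closed", "late"],
]
scores = score_bigrams(docs, min_count=1)

# Pairs sorted from most to least phrase-like.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs that appear only `min_count` times score zero, which is why a high `min_count` and `threshold` on a small corpus can leave no phrases at all.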
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
def lemmatise(words):
    """
    Convert words to their lemma or root using WordNet lemmatizer
    (every token is lemmatised as a verb, so plural nouns are left unchanged)
    """
    lemma = WordNetLemmatizer()
    # this function takes a list of lists of tokens
    return [[lemma.lemmatize(token,'v') for token in tokens] for tokens in words]
[nltk_data] Downloading package wordnet to /home/medacola/nltk_data... [nltk_data] Package wordnet is already up-to-date!
# let's slice out the text data from our dataframe
subsample_text = ticket_data['Consumer complaint narrative'].tolist()
# next we implement the preprocessing functions on our data
preprocessed_corpus = remove_twitterisms(subsample_text)
preprocessed_corpus = remove_redacted(preprocessed_corpus)
preprocessed_corpus = [remove_emoji(doc) for doc in preprocessed_corpus]
preprocessed_corpus = basic_preprocess(preprocessed_corpus)
preprocessed_corpus = lemmatise(preprocessed_corpus)
# let's compare the original strings to the preprocessed strings
print(subsample_text[0])
print("-------------------------")
print(preprocessed_corpus[0])
Last month I started receiving calls from unknown numbers. They did leave a voicemail to call a number back or log on to citicards.com and they could help me. I XXXX the numbers and there were multiple people suspecting the number of fraud. So I logged on to citicards.com and there were no alerts. I sent a secure message about the calls and they gave me this reply, " Dear XXXX, Thank you for contacting us. We appreciate each and every opportunity to serve you. Our records do not show that we have called you regarding your account. If you think that the card information is at risk, please call Customer Service immediately. Once your closure request is processed, the current card is closed and a new card number is established. XXXX. If there is any way we can be of further assistance, please feel free to contact us. Sincerely, Account Specialist South Dakota '' So I assumed it was fraud, but the calls continued. I finally was able to answer XXXX and it said that it was a citicards account that was past due. Because of the additional time that lapsed, they reported my account as delinquent. This seems unfair since I took the exact step from their phone call and was told by Citi that they were n't trying to get a hold of me. Now I have a late payment on my credit report from Citi because Citi told me they were n't trying to contact me. 
------------------------- ['month', 'start', 'receive', 'call', 'unknown', 'number', 'leave', 'voicemail', 'number', 'log', 'citicards', 'com', 'help', 'number', 'multiple', 'people', 'suspect', 'number', 'fraud', 'log', 'citicards', 'com', 'alert', 'send', 'secure', 'message', 'call', 'give', 'reply', 'dear', 'thank', 'contact', 'appreciate', 'opportunity', 'serve', 'record', 'call', 'account', 'think', 'card', 'information', 'risk', 'customer', 'service', 'immediately', 'closure', 'request', 'process', 'current', 'card', 'close', 'new', 'card', 'number', 'establish', 'way', 'assistance', 'feel', 'free', 'contact', 'sincerely', 'account', 'specialist', 'south', 'dakota', 'assume', 'fraud', 'call', 'continue', 'finally', 'able', 'answer', 'say', 'citicards', 'account', 'past', 'additional', 'time', 'lapse', 'report', 'account', 'delinquent', 'unfair', 'take', 'exact', 'step', 'phone', 'tell', 'citi', 'try', 'hold', 'late', 'payment', 'credit', 'report', 'citi', 'citi', 'tell', 'try', 'contact']
print(subsample_text[2])
print("-------------------------")
print(preprocessed_corpus[2])
I once had a credit card with Fifth Third bank, which was closed bank in XX/XX/XXXX. In XX/XX/XXXX a charge that was stored on an online account automatically tried charging the card for the renewed subscription. Instead of Fifth Third bank declining the transaction they reopened my closed credit card without my permission. Finally, after three months of not contacting me via phone, email, or mail, I received a letter in the mail saying I owed the charge of {$77.00} ( for the service ) and an extra {$100.00} for late fees. After countless hours on the phone with them, they acknowledged my credit card was closed and they reopened it. I offered to pay the initial {$77.00} fee if they would waive the late fee for not informing me, or without reopening it without my permission. They said they would look into the dispute into a better solution. After about another month they sent back a letter saying the dispute was denied and I owed the entire fee. When I called back to ask how it could have been denied, since they reopened a closed account, they said the 120 day period has passed and there is nothing they could do about it. I have since discarded the credit card because I had no use for it after I had closed the account in XX/XX/XXXX. 
------------------------- ['credit', 'card', 'fifth', 'bank', 'close', 'bank', 'charge', 'store', 'online', 'account', 'automatically', 'try', 'charge', 'card', 'renew', 'subscription', 'instead', 'fifth', 'bank', 'decline', 'transaction', 'reopen', 'close', 'credit', 'card', 'permission', 'finally', 'months', 'contact', 'phone', 'email', 'mail', 'receive', 'letter', 'mail', 'say', 'owe', 'charge', 'service', 'extra', 'late', 'fee', 'countless', 'hours', 'phone', 'acknowledge', 'credit', 'card', 'close', 'reopen', 'offer', 'pay', 'initial', 'fee', 'waive', 'late', 'fee', 'inform', 'reopen', 'permission', 'say', 'look', 'dispute', 'better', 'solution', 'month', 'send', 'letter', 'say', 'dispute', 'deny', 'owe', 'entire', 'fee', 'call', 'ask', 'deny', 'reopen', 'close', 'account', 'say', 'day', 'period', 'pass', 'discard', 'credit', 'card', 'use', 'close', 'account']
import nltk
import matplotlib.pyplot as plt
flat_list = [item for sublist in preprocessed_corpus for item in sublist]
text = nltk.Text(flat_list)
fdist = nltk.FreqDist(text)
plt.figure(figsize=(10,6))
fdist.plot(50)
<AxesSubplot: xlabel='Samples', ylabel='Counts'>
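The counts behind the plot can also be inspected as plain numbers. A minimal sketch with `collections.Counter` (which `nltk.FreqDist` subclasses), using a made-up token list standing in for `preprocessed_corpus`:

```python
from collections import Counter

# Invented tokenised documents standing in for the preprocessed corpus.
toy_corpus = [
    ["credit", "card", "late", "fee"],
    ["credit", "card", "close", "account"],
    ["card", "fee", "dispute"],
]

# Flatten the list of lists into one long list of tokens, then count them.
flat_tokens = [token for doc in toy_corpus for token in doc]
counts = Counter(flat_tokens)

# The single most frequent token and its count.
top_token = counts.most_common(1)
print(top_token)
```

Looking at the top of this list is a cheap way to spot domain-specific stopwords (such as "card" in credit card complaints) before fitting a topic model.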
def sum(x): return x + x
xx = sum(2)
xx == 4
False