Introduction
Merchants selling products through ecommerce often received a high amount of customers reviews too large in scale for human processing. These reviews often have important business insights that can be leveraged to perform actions that can improve profits. In this project we analyze ~400,000 mobile phone reviews from Amazon.com aiming to find trends and patterns to determine which product characteristics are mentioned most by customers and with what sentiment. Our task is performed in six steps: (1) pre-processing to prepare the data for analysis including tokenization and part-of-speech tagging, (2) product names standardization, (3) characteristics extraction, (4) reviews filtering to remove reviews considered as outliers, unbalanced or meaningless, (5) sentiment extraction for each product-characteristic and (6) performance analysis to determine the accuracy of the model where we evaluate characteristic extraction separately from sentiment scores.
Methodology
A flowchart of the project, including the approach, performance and final business analysis is presented below:
1. Pre-procesing
This part includes:
1.1 Tokenization
Applied to both product names and reviews. It involves removal of stopwords, treating stemming of words, case-folding, removing characters that are not alphanumeric and breaking at whitespace.
Synonyms
Synonyms were grouped together as a means of dimensionality reduction, with manually inputted gazetteer with most common synonyms (for example the words “camera”, “video”, “display” are all transformed into “camera”).
Negation
It was important to handle negation for sentiment analysis so that negated opinion words could be reversed when computing its score. This method comes from Das and Chen 2001 - basically appending the suffix '_NEG' to every word appearing between a negation and a clause-level punctuation mark (such as comma). The built-in function sentiment.util.mark_negation from NLTK package was used without considering double negation.
Spelling correction Because reviews are hand-typed the function 'spell' from the 'autocorrect package' was used to treat misspellings but also considering a manually inputted gazetteer to ignore special cases (for example the word “microsd” was incorrectly being transformed into “micros”).
1.2 Part of Speech tagging
POS tagging was critical for three reasons.
(1) To find adjectives which were all considered as opinion words (as well as others exceptions that will be discussed in next sections),
(2) to extract it’s sentiment score since words have different polarity depending on their POS tag and
(3) to extract products characteristics where Nouns (NN) and Noun-phrases (NNP) were considered as potential candidates.
The function pos_tag from NLTK package was used for this task.
1.3 Vector Space Model and TF * IDF transformation
Vector Space Model
A vector space model was created based on a normalized (by euclidean distance) Term-Document-Matrix via bags-of-words for both product names as well as reviews in preparation for clustering purposes. For the first to standardize product names and for the latter to filter reviews.
Inverse Document Frequency
Another normalized TDM was constructed this time using TF*IDF weightings for each product name term. Its purpose was to determine which potential terms could be considered as standardized product names. The higher the IDF value the more important to be a potential part of the standardized name since the most commons words such as “unlocked”, “black” or “dual-core” should be avoided (and they have low IDF scores).
Load Libraries
from IPython.display import display
import timeit
from collections import defaultdict
import math
import numpy as np
import pandas as pd
import random
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.dates as md
%matplotlib inline
import operator
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from bs4 import BeautifulSoup
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk import sentiment
from autocorrect import spell # For spelling correction
from urllib import request
Load alternative for WordNet
url_pos = r'https://raw.githubusercontent.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/master/data/opinion-lexicon-English/positive-words.txt'
url_neg = r'https://raw.githubusercontent.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/master/data/opinion-lexicon-English/negative-words.txt'
pos_list = request.urlopen(url_pos).read().decode('utf-8')[1:]
pos_list = pos_list[pos_list.find("a+"):].split("\n")
neg_list = request.urlopen(url_neg).read().decode('ISO-8859-1')[1:]
neg_list = neg_list[neg_list.find("2-faced"):].split("\n")
Load and correct Test Data
# The initial format of he annotated test_set is difficult to read
# as a dataframe, transformation to .csv format is computed first
# with regular expressions.
test = open('data/annotated_test_set.txt','r', encoding='utf8')
test_file = test.read()
test.close()
test_file[:200]
test_file = re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), test_file)
test_file = test_file.replace(';', "%")
test_file = test_file.replace(',', ";")
test_file = test_file.replace('%', ",")
test_file = test_file.replace('{', "{'")
test_file = test_file.replace(',', ",'")
test_file = test_file.replace(':', "':")
test_file = test_file.replace("},'", "}")
# Once fixed, save and load:
text_file = open("data/annotated_test_set_corrected.csv", "w")
for row in test_file.split(",\n"):
text_file.write(row)
text_file.write("\n")
text_file.close()
test = open('data/annotated_test_set_corrected.csv','r', encoding='utf8')
test_file = test.read()
test.close()
test = pd.read_csv('data/annotated_test_set_corrected.csv', delimiter = ";")
test.columns = ['review_id', 'Product', 'Sentiments_test']
Load Amazon Reviews Data
df = pd.read_csv('data/Amazon_Unlocked_Mobile.csv', delimiter = ",")
n = len(df)
df.columns = ['Product', 'Brand', 'Price', 'Rating', 'Review', 'Votes']
df['id_col'] = range(0, n)
n_reviews = 1000 # Let's get a sample
keep = sorted(random.sample(range(1,n),n_reviews))
keep += list(set(test.review_id)) # this are the reviews annotated for test
df = df[df.id_col.isin(keep)]
n_reviews = len(df)
df['id_new_col'] = range(0, n_reviews)
df.head()
Out[8]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | |
---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 |
73 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ... | 0.0 | 73 | 3 |
75 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 4 | The keys are a little hard to hit, and I didn'... | 0.0 | 75 | 4 |
Sample review:
id_prod = 69
for val in df[df.id_col == id_prod].Review:
print(val)
Nokia Asha 302 Unlocked GSM Phone with 3.2MP Camera, Video, QWERTYDependableTraditional Nokia Menu'sNot Complicated like 'Smart Phones'DurableEasy to use on Straighttalk, Internet, WiFi, Bluetooth.
Create functions
def get_tokens(df, stem = False, negation = False):
stemmer = PorterStemmer()
stop = set(stopwords.words('english'))
reviews = []
i = 1
for review in df["Review"]:
tokenized_review = []
review = str(review).lower() # lowercase
# Remove every character except A-Z, a-z,space
# and punctuation (we'll need it for negation)
review = re.sub(r'[^A-Za-z /.]','',review)
# mark_negation needs punctuation separated by white space.
review = review.replace(".", " .")
tokens = word_tokenize(review)
for token in tokens:
# Remove single characters and stop words
if (len(token)>1 or token == ".") and token not in stop:
if stem:
tokenized_review.append(stemmer.stem(get_synonym(token)))
else:
tokenized_review.append(get_synonym(token))
if negation:
tokenized_review = sentiment.util.mark_negation(tokenized_review)
# Now we can get rid of punctuation and also let's fix some spellings:
tokenized_review = [correction(x) for x in tokenized_review if x != "." ]
reviews.append(tokenized_review)
if i%100 == 0:
print('progress: ', (i/len(df["Review"]))*100, "%")
i = i + 1
return reviews
def get_pos(tokenized_reviews):
tokenized_pos = []
for review in tokenized_reviews:
tokenized_pos.append(nltk.pos_tag(review))
return tokenized_pos
def get_frequency(tokens):
term_freqs = defaultdict(int)
for token in tokens:
term_freqs[token] += 1
return term_freqs
def get_tdm(tokenized_reviews):
tdm = []
for tokens in tokenized_reviews:
tdm.append(get_frequency(tokens))
return tdm
def normalize_tdm(tdm):
tdm_normalized = []
for review in tdm:
den = 0
review_normalized = defaultdict(int)
for k,v in review.items():
den += v**2
den = math.sqrt(den)
for k,v in review.items():
review_normalized[k] = v/den
tdm_normalized.append(review_normalized)
return tdm_normalized
def get_all_terms(tokenized_reviews):
all_terms = []
for tokens in tokenized_reviews:
for token in tokens:
all_terms.append(token)
return(set(all_terms))
def get_all_terms_dft(tokenized_reviews, all_terms):
terms_dft = defaultdict(int)
for term in all_terms:
for review in tokenized_reviews:
if term in review:
terms_dft[term] += 1
return terms_dft
def get_tf_idf_transform(tokenized_reviews, tdm, n_reviews):
tf_idf = []
all_terms = get_all_terms(tokenized_reviews)
terms_dft = get_all_terms_dft(tokenized_reviews, all_terms)
for review in tdm:
review_tf_idf = defaultdict(int)
for k,v in review.items():
review_tf_idf[k] = v * math.log(n_reviews / terms_dft[k], 2)
tf_idf.append(review_tf_idf)
return tf_idf
def get_idf_transform(tokenized_reviews, tdm, n_reviews):
idf = []
terms_dft = defaultdict(int)
all_terms = get_all_terms(tokenized_reviews)
for term in all_terms:
for review in tokenized_reviews:
if term in review:
terms_dft[term] += 1
for review in tdm:
review_idf = defaultdict(int)
for k,v in review.items():
review_idf[k] = math.log(n_reviews / terms_dft[k], 2)
idf.append(review_idf)
return idf
def correction(x):
ok_words = ["microsd"]
if x.find("_NEG") == -1 and x not in ok_words: # Don't correct if they are negated words or exceptions
return spell(x)
else:
return x
def get_synonym(word):
synonyms = [["camera","video", "display"],
["phone", "cellphone", "smartphone", "phones"],
["setting", "settings"],
["feature", "features"],
["pictures", "photos"],
["speakers", "speaker"]]
synonyms_parent = ["camera", "phone", "settings", "features", "photos", "speakers"]
for i in range(len(synonyms)):
if word in synonyms[i]:
return synonyms_parent[i]
return word
def get_similarity_matrix(similarity, tokenized_reviews):
similarity_matrix = []
all_terms = get_all_terms(tokenized_reviews)
for review in similarity:
similarity_matrix_row = []
for term in all_terms:
similarity_matrix_row.append(review[term])
similarity_matrix.append(similarity_matrix_row)
return similarity_matrix
# EXECUTE
tic=timeit.default_timer()
tokenized_reviews = get_tokens(df, stem = False, negation = False)
tokenized_pos = get_pos(tokenized_reviews)
tdm = get_tdm(tokenized_reviews)
vsm = normalize_tdm(tdm)
tf_idf = get_tf_idf_transform(tokenized_reviews, tdm, n_reviews)
toc=timeit.default_timer()
print("minutes: ", (toc - tic)/60)
progress: 8.865248226950355 % progress: 17.73049645390071 % progress: 26.595744680851062 % progress: 35.46099290780142 % progress: 44.32624113475177 % progress: 53.191489361702125 % progress: 62.056737588652474 % progress: 70.92198581560284 % progress: 79.7872340425532 % progress: 88.65248226950354 % progress: 97.51773049645391 % minutes: 2.6501673290627097
Let's see a sample of:
- Tokenized reviews
- Part of speech
- Term-Document-Matrix (TDM)
- TD*IDF transformation
lookup_review = 1
for val in df[df.id_new_col == lookup_review]["Review"]: print(val)
display(tokenized_reviews[lookup_review])
display(tokenized_pos[lookup_review])
display(tdm[lookup_review])
display(tf_idf[lookup_review])
Nokia Asha 302 Unlocked GSM Phone with 3.2MP Camera, Video, QWERTYDependableTraditional Nokia Menu'sNot Complicated like 'Smart Phones'DurableEasy to use on Straighttalk, Internet, WiFi, Bluetooth.
['Nokia', 'Asha', 'unlocked', 'grm', 'phone', 'imp', 'camera', 'camera', 'qwertydependabletraditional', 'Nokia', 'menusnot', 'complicated', 'like', 'smart', 'phonesdurableeasy', 'use', 'straighttalk', 'internet', 'WiFi', 'Bluetooth']
[('Nokia', 'NNP'), ('Asha', 'NNP'), ('unlocked', 'VBD'), ('grm', 'JJ'), ('phone', 'NN'), ('imp', 'NN'), ('camera', 'NN'), ('camera', 'NN'), ('qwertydependabletraditional', 'JJ'), ('Nokia', 'NNP'), ('menusnot', 'NN'), ('complicated', 'VBN'), ('like', 'IN'), ('smart', 'JJ'), ('phonesdurableeasy', 'NN'), ('use', 'NN'), ('straighttalk', 'NN'), ('internet', 'NN'), ('WiFi', 'NNP'), ('Bluetooth', 'NNP')]
defaultdict(int, {'Asha': 1, 'Bluetooth': 1, 'Nokia': 2, 'WiFi': 1, 'camera': 2, 'complicated': 1, 'grm': 1, 'imp': 1, 'internet': 1, 'like': 1, 'menusnot': 1, 'phone': 1, 'phonesdurableeasy': 1, 'qwertydependabletraditional': 1, 'smart': 1, 'straighttalk': 1, 'unlocked': 1, 'use': 1})
defaultdict(int, {'Asha': 9.139551352398794, 'Bluetooth': 5.969626350956481, 'Nokia': 12.66439286068238, 'WiFi': 5.010268335453827, 'camera': 7.393215713100131, 'complicated': 10.139551352398794, 'grm': 6.439111634257702, 'imp': 10.139551352398794, 'internet': 5.817623257511432, 'like': 3.0627357553479633, 'menusnot': 10.139551352398794, 'phone': 0.9179642311339886, 'phonesdurableeasy': 10.139551352398794, 'qwertydependabletraditional': 10.139551352398794, 'smart': 5.554588851677638, 'straighttalk': 8.554588851677638, 'unlocked': 4.554588851677638, 'use': 3.2940613014544184})
In [ ]:
1. Pre-procesing
This part includes:
1.4 Product Names Standardization
Merchants often name their products in different ways, for example “iPhone 4 32GB Black, AT&T” and “iPhone 4 16GB Gold, Verizon”. This tasks objective was to add a new standard name that for this case should simply be “iPhone 4”.
Three different approaches were combined,
(1) manually inputted set/gazetteer with words to be removed,
(2) IDF importance score and
(3) Clustering.
The first step cleaned the names through the (1) gazetteer, removing colors names and common terms (such as “unlocked”).
With the remaining terms (2) selected only the first 5 terms with the highest IDF in the group (none common terms).
Finally with the remaining terms performed the (3) Clustering using a VSM matrix with k=N°_reviews/2 clusters. This number was approximated through trial and error validating with visualization since did not know a priori how many product names were in the dataset. The number of clusters was a trade-off between having different products in the same standardized names (low number of clusters) which was highly undesirable and having too many standardized names that couldn’t standardize properly (for example having “iPhone 4 32GB” and “iPhone 4 16GB”).
A sample output can be seen in figure 2.
The approach attempted was taking advantage of POS tagging. The hypothesis was that nouns (NN and NNP) could be potential terms in a standardized product name, however NLTK tagging couldn’t grab some of the terms as NN which were the most important ones (for example model names such as “A850” were tagged as verbs).
Load Function
def get_product_tokens(df):
stop = set(stopwords.words('english'))
products = []
i = 1
for product in df["Product"]:
tokenized_product = []
product = product.lower() # lowercase
# Remove every character except A-Z, a-z,space
# and punctuation (we'll need it for negation)
product = re.sub(r'[^0-9A-Za-z \.]','',product)
# Only consider first 10 words of the product names
tokens = word_tokenize(product)[:11]
for token in tokens:
# Remove stop words
if token not in stop:
tokenized_product.append(token)
products.append(tokenized_product)
if i%100 == 0:
print('progress: ', (i/len(df["Product"]))*100, "%")
i = i + 1
return products
tokenized_products = get_product_tokens(df)
products_tokenized_pos = get_pos(tokenized_products)
products_tdm = get_tdm(tokenized_products)
products_tf_idf = get_tf_idf_transform(tokenized_products, products_tdm, n_reviews)
products_idf = get_idf_transform(tokenized_products, products_tdm, n_reviews)
progress: 8.865248226950355 % progress: 17.73049645390071 % progress: 26.595744680851062 % progress: 35.46099290780142 % progress: 44.32624113475177 % progress: 53.191489361702125 % progress: 62.056737588652474 % progress: 70.92198581560284 % progress: 79.7872340425532 % progress: 88.65248226950354 % progress: 97.51773049645391 %
-
Based on IDF we will get only the words that have the highest importance. The hypothesis is that the most common will be words typical in many smarphones such as "black, unlocked, dual, etc." **
-
Unfortunately we can't filter through POS, it fails to grab the most important words. For example A850 is tagged as Verb when in fact it's the model of a smartphone (the main word to have in the product name) **
-
We can assume that we will not lose the brand (which might not be grabed) since we do have it in a second column.
Visualization for analysis below:
lookup_product = 53
display(df[df.id_new_col== lookup_product]["Product"])
# we want to grab those with higher scores (least common terms)
display(sorted(products_idf[lookup_product].items(),
key=operator.itemgetter(1), reverse = True))
# Unfortunately we can't filter through POS
display(products_tokenized_pos[lookup_product])
10170 Apple iPhone 4 8GB, White, for Straight Talk, ... Name: Product, dtype: object
[('straight', 9.139551352398794), ('talk', 9.139551352398794), ('contract', 6.33219643034119), ('4', 4.8916238389552085), ('8gb', 4.4671260104272985), ('white', 3.052088511148454), ('apple', 2.2569083030369526), ('iphone', 2.202913413396223)]
[('apple', 'NN'), ('iphone', 'NN'), ('4', 'CD'), ('8gb', 'CD'), ('white', 'JJ'), ('straight', 'JJ'), ('talk', 'NN'), ('contract', 'NN')]
In [19]:
colors = ["black", "red", "blue", "white", "gray", "green","yellow", "pink", "gold"]
common_terms = ["smarthphone", "phone", "cellphone", "retail", "warranty",
"silver", "bluetooth", "wifi", "wireless", "keyboard", "gps",
"original", "unlocked", "camera", "certified", "international",
"actory", "packaging", "us", "usa", "international", "refurbished",
"phones", "att", "verizon", "-", "8gb", "16gb", "32gb", "64gb", "contract"]
def standardize_names(products_idf, colors, common_terms):
standard_names = []
brands = [str(x).lower() for x in set(df.Brand)]
for product in products_idf:
for k, v in product.items():
# Remove color and brand words
if k in colors or k in common_terms or k in brands:
product[k] = 0
# Grab the first 5 words with highest score
product = sorted(product.items(), key=operator.itemgetter(1), reverse = True)[:5]
standard_names.append(product)
tokenized_standard_product_names = []
for product in standard_names:
product_name = []
for word in product:
if word[1] > 0:
product_name.append(word[0])
tokenized_standard_product_names.append(product_name)
return tokenized_standard_product_names
standard_product_names = standardize_names(products_idf, colors, common_terms)
product_tdm = get_tdm(standard_product_names)
product_vsm = normalize_tdm(product_tdm)
product_vsm[1]
Out[20]:
defaultdict(int, {'3.2mp': 0.4472135954999579, '302': 0.4472135954999579, 'asha': 0.4472135954999579, 'qwerty': 0.4472135954999579, 'video': 0.4472135954999579})
CLUSTER PRODUCT NAMES
similarity = product_tdm
product_names_clusters = int(round(n_reviews/2,0))
similarity_matrix = pd.DataFrame(get_similarity_matrix(similarity, standard_product_names), columns = get_all_terms(standard_product_names))
kmeans = KMeans(n_clusters=product_names_clusters, random_state=0).fit(similarity_matrix)
clusters=kmeans.labels_.tolist()
clustered_matrix = similarity_matrix.copy()
clustered_matrix['product_name_cluster'] = clusters
clustered_matrix['id_col'] = range(0, n_reviews)
display(clustered_matrix[:5])
count_clusters = pd.DataFrame(clustered_matrix.product_name_cluster.value_counts())
display(count_clusters[:5])
s5 | blanco | purple | 1touch | duos | link | discontinued | lt30at | a887 | rose | ... | carrier | notification | beat | easy | 4100mah | internationally | easytouse | graul00 | product_name_cluster | id_col | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 1 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 2 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 3 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 4 |
5 rows × 924 columns
product_name_cluster | |
---|---|
28 | 17 |
30 | 14 |
167 | 13 |
27 | 12 |
9 | 12 |
ASSIGN CLUSTER PRODUCT NAMES
df["cluster_name"] = list(clustered_matrix.product_name_cluster)
def create_standard_name(df):
new_names = defaultdict(int)
current_names = df.groupby('cluster_name').first().Product
for i in set(clusters):
cluster_name = df[df.cluster_name == i].Product.value_counts().index[0]
new_name = []
for word in cluster_name.split():
temp_word= re.sub(r'[^0-9A-Za-z \.\-]','',word).lower()
if temp_word not in colors and temp_word not in common_terms :
new_name.append(word)
new_names[i] = ' '.join(new_name)
new_standard_names = []
for row in df.cluster_name:
new_standard_names.append(new_names[row])
df["Standard_Product_Name"] = new_standard_names
return df
df = create_standard_name(df)
df.head()
Out[23]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | |
---|---|---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
73 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ... | 0.0 | 73 | 3 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
75 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 4 | The keys are a little hard to hit, and I didn'... | 0.0 | 75 | 4 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
Sample for 'iPhone'
df[["Product","Standard_Product_Name"]][df['Product'].str.contains("iPhone")][:8]
Out[26]:
Product | Standard_Product_Name | |
---|---|---|
3968 | Apple A1533 Unlocked iPhone 5S Smart Phone, 16... | Apple A1533 iPhone 5S Smart 16 GB |
7237 | Apple iPhone 4 16GB (Black) - AT&T | Apple iPhone 4 |
7283 | Apple iPhone 4 16GB (Black) - AT&T | Apple iPhone 4 |
7592 | Apple iPhone 4 16GB (Black) - AT&T | Apple iPhone 4 |
8913 | Apple iPhone 4 32GB (Black) - Verizon | Apple iPhone 4 |
9022 | Apple iPhone 4 32GB (Black) - Verizon | Apple iPhone 4 |
9541 | Apple iPhone 4 32GB (White) - Verizon | Apple iPhone 4 |
9739 | Apple iPhone 4 8GB Unlocked- Black | Apple iPhone 4 |
2.1 Characteristics Extraction
Two steps were taken to extract the main characteristics from reviews:
(1) manually inputted set/gazetteer with words to be removed or included and
(2) identifying NN/NNP POS tagged terms that exceeded a specific threshold (set in 1%*N°_reviews) of reviews occurrences.
Load functions and shortcuts
def get_all_terms_pos_dft(all_terms, terms_dft):
all_terms_pos = nltk.pos_tag(all_terms)
i = 0
for k, v in terms_dft.items():
all_terms_pos[i] = all_terms_pos[i] + (v,)
i+=1
return all_terms_pos
def get_threshold_terms(all_terms_pos_dft, threshold = 20):
threshold_terms = []
for term in all_terms_pos_dft:
if term[0] in exceptions_to_consider or (term[2] >= threshold and term[1] in ["NN", "NNS", "NNP", "NNPS"] and term[0] not in exceptions_not_to_consider):
threshold_terms.append(term)
return threshold_terms
exceptions_to_consider = ["apps", "android", "buttons", "hardware", "wifi",
"audio", "speed", "settings", "charger", "design",
"price", "look", "trackball", "microsd", "speaker"]
exceptions_not_to_consider = ["phone", "cool", "love", "awesome", "tell", 'tell',
'feels',
'works',
'excelente',
'item',
'get',
'iPhone',
'dont',
'lot',
'let',
'money',
'brand',
'recommend',
'issues',
'cant',
'nothing',
'number',
'check',
'month',
'husband',
'need',
'note',
'venezuela',
'give',
'Samsung',
'see',
'turn',
'pocket',
'amazing',
'hands',
'couldnt',
'fast',
'condition',
'super',
'today',
'star',
'life',
'anyone',
'storage',
'speaker',
'internet',
'delivery',
'picture',
'games',
'hand',
'model',
'glass',
'case',
'micro',
'sound',
'mp',
'watch',
'grm',
'try',
'line',
'thing',
'isnt',
'thanks',
'Verizon',
'experience',
'box',
'scratches',
'problems',
'waste',
'bottom',
'company',
'bit',
'youre',
'lack',
'deal',
'pay',
'i',
'reason',
'issue',
'couple',
'option',
'beautiful',
'mobile',
'replacement',
'wasnt',
'way',
'days',
'loves',
'trouble',
'quick',
'someone',
'glad',
'weeks',
'ones',
'something',
'market',
'galaxy',
'apple',
'havent',
'download',
'time',
'lg',
'send',
'home',
'years',
'product',
'change',
'people',
'review',
'price',
'simple',
'person',
'lasts',
'user',
'hold',
'please',
'reviews',
'work',
'thats',
'text',
'im',
'end',
'thank',
'look',
'cost',
'months',
'buying',
'point',
'version',
'web',
'times',
'Nokia',
'problem',
'wouldnt',
'performance',
'products',
'minutes',
'customer',
'order',
'guess',
'things',
'everything',
'week',
'play',
'daughter',
'anything',
'purchase',
'ok',
'year',
'stars',
'day',
'wife',
'son',
'doesnt',
'blackberry',
'hours',
'return',
'use']
all_terms = get_all_terms(tokenized_reviews)
terms_dft = get_all_terms_dft(tokenized_reviews, all_terms)
all_terms_pos_dft = get_all_terms_pos_dft(all_terms, terms_dft)
threshold_terms = get_threshold_terms(all_terms_pos_dft, threshold = 0.01 * n_reviews)
threshold_terms[:10]
Out[31]:
[('voice', 'NN', 12), ('design', 'NN', 13), ('microsd', 'NN', 7), ('online', 'NN', 15), ('connection', 'NN', 13), ('service', 'NN', 36), ('warranty', 'NN', 22), ('os', 'NN', 13), ('WiFi', 'NNP', 35), ('network', 'NN', 22)]
characteristics = [x[0] for x in threshold_terms]
characteristics[:10]
Out[32]:
['voice', 'design', 'microsd', 'online', 'connection', 'service', 'warranty', 'os', 'WiFi', 'network']
2.2 Filtering
To effectively reduce the dataset size and improve performance, we need to filter out unusable, misleading and noisy reviews through 4 methods described below. In the end the dataset was reduced by 77% from an initial ~1500 reviews.
POS Filter Reviews without an adjective POS tag are removed since sentiment orientation is extracted only from adjectives.
Wordnet Filter Reviews with descriptive words not recognised by Wordnet or other sentiment lexicons are also pruned.
Rough Sentiment Analysis Filter To filter misleading reviews, we first conduct a rough sentiment analysis on individual opinion words, giving them a score of -1 or 1, and the overall score of a review, which is the sum of scores of all adjectives in the review. If there are less than three times the number of positive adjectives than that of negative reviews, or vice versa, then we assume the review is noisy and filter it out. Additionally, we assume that the sum of all reviews being positive, zero, or negative corresponds to a rating of >=3, 3 and below 3 respectively. Thus reviews not satisfying this equality condition against the rating are pruned.
Clustering Filter Performing clustering through a raw and normalized (VSM) TDM, the best results were obtained through VSM since it managed to obtain more diverse clusters - TDM was biased to create clusters based on the amount/frequency of words (clustering almost based exclusively on the lengths of the reviews).
Characteristics Filter The last step aims to keep only the reviews that have at least one characteristic. Since the objective of the project is determining why a product is good or bad through their characteristics sentiment instead of just computing the sentiment score of the review which can already be derived by the rating.
# first import 1000 rows in dataframe
to_prune = [i+1 for i in range(n_reviews)]
ratings = list(df['Rating'])
def get_wordnet_pos(pos):
for tag in [('J','ADJ'),('V','VERB'),('N','NOUN'),('R','ADV')]:
if pos.startswith(tag[0]):
return getattr(wordnet,tag[1])
else:
return 'None'
def get_adj(review):
with_adj = [tup for tup in review if tup[1] == 'JJ']
return with_adj
# score for each word
def senti(synset):
s = swn.senti_synset(synset).pos_score() - swn.senti_synset(synset).neg_score()
if s>=0:
return 1
else:
return -1
adjs = {x.name().split('.', 1)[0] for x in wn.all_synsets('a')}
### 1. prune reviews without adjectives recognised by wordnet
def prune_adj(tokenized_pos):
for k in [i for i in to_prune if i!=0]:
if not len(get_adj(tokenized_pos[k-1])) or not all(i[0] in adjs for i in get_adj(tokenized_pos[k-1])):
to_prune[k-1] = 0
return to_prune
### 2. prune by number of pos and neg adj
# list of scores for each review
def slist(tokenized_pos):
score = []
for k in [i for i in to_prune if i!=0]:
r = get_adj(tokenized_pos[k-1])
tag = [get_wordnet_pos(tuple[1]) for tuple in r]
synsets = [r[i][0] + '.' + tag[i] + '.01' for i in range(len(r))]
score.append([senti(i) for i in synsets])
return score
def balance(score_list):
m=-1
for k in [i for i in to_prune if i!=0]:
m+=1
s = score_list[m]
if 1 in s and -1 in s and max([s.count(1),s.count(-1)])/min([s.count(1),s.count(-1)]) <= 3:
to_prune[k-1] = 0
return to_prune
### 3. prune by average score compared to rating score
def average_score(score_list):
m = -1
for k in [i for i in to_prune if i!=0]:
m += 1
s = score_list[m]
#(sum >=0, then rating >=3)
if sum(s)>=0 and (sum(s)+1)*(ratings[k-1]-2.5)<=0:
to_prune[k-1] = 0
elif sum(s)<0 and (sum(s)+1)*(2.5-ratings[k-1])<=0:
to_prune[k-1] = 0
return to_prune
# initialise index to_prune = [1,2,3,...,1000]
to_prune = [i+1 for i in range(n_reviews)]
#to_prune = list(set(df.id_col))
to_prune = prune_adj(tokenized_pos)
score_list = slist(tokenized_pos)
to_prune = balance(score_list)
to_prune = average_score(score_list)
# len([i for i in to_prune if i!=0])
to_keep = [i for i in to_prune if i!=0]
to_keep += list(df[df.id_col.isin(list(set(test.review_id)))].id_new_col) # this are the reviews annotated for test
to_keep = list(set(to_keep))
df_filtered = df[df.id_new_col.isin(to_keep)]
df_filtered[:3]
Out[53]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | |
---|---|---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... |
len(list(df_filtered[df_filtered.id_col.isin(list(set(test.review_id)))].id_new_col))
Out[54]:
128
to_keep = list(df_filtered.id_new_col)
clustering filter
n_reviews = len(to_keep)
tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)
tokenized_pos = get_pos(tokenized_reviews)
tdm = get_tdm(tokenized_reviews)
vsm = normalize_tdm(tdm)
tf_idf = get_tf_idf_transform(tokenized_reviews, tdm, n_reviews)
similarity = vsm #vsm # tdm
similarity_matrix = pd.DataFrame(get_similarity_matrix(similarity, tokenized_reviews), columns = get_all_terms(tokenized_reviews))
similarity_matrix[:10]
progress: 48.54368932038835 % progress: 97.0873786407767 %
Out[56]:
wasnt | bien | ahead | blew | mind | delight | connecting | voice | drain | grm | ... | mp | calls | incoming | maneuver | telephone | Zoe | copy | around | wider | afterwards | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.204124 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.07785 | 0.0 |
6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00000 | 0.0 |
9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.102062 | 0.0 | 0.00000 | 0.0 |
10 rows × 1578 columns
kmeans = KMeans(n_clusters=int(round(math.sqrt(n_reviews),0)), random_state=0).fit(similarity_matrix)
clusters=kmeans.labels_.tolist()
# clustered_matrix = pd.DataFrame(tf_idf_matrix, clusters)
clustered_matrix = similarity_matrix.copy()
clustered_matrix['cluster'] = clusters
clustered_matrix['id_col'] = to_keep
display(len(clustered_matrix))
display(clustered_matrix[:5])
top_clusters = pd.DataFrame(clustered_matrix.cluster.value_counts())
display(top_clusters)
206
wasnt | bien | ahead | blew | mind | delight | connecting | voice | drain | grm | ... | incoming | maneuver | telephone | Zoe | copy | around | wider | afterwards | cluster | id_col | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.204124 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | 1 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 3 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 4 |
5 rows × 1580 columns
cluster | |
---|---|
13 | 59 |
0 | 28 |
5 | 25 |
3 | 17 |
1 | 17 |
11 | 11 |
10 | 10 |
12 | 8 |
7 | 6 |
6 | 6 |
4 | 6 |
8 | 5 |
2 | 5 |
9 | 3 |
limit = top_clusters.cluster.quantile(0.3)
cluster_filter = top_clusters[top_clusters.cluster > limit]
display(cluster_filter)
list(cluster_filter.index)
cluster | |
---|---|
13 | 59 |
0 | 28 |
5 | 25 |
3 | 17 |
1 | 17 |
11 | 11 |
10 | 10 |
12 | 8 |
Out[58]:
[13, 0, 5, 3, 1, 11, 10, 12]
df_filtered["cluster"] = list(clustered_matrix.cluster)
df_filtered[:3]
C:\Program Files\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """Entry point for launching an IPython kernel.
Out[59]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 5 |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 |
to_keep = list(df_filtered[df_filtered.cluster.isin(list(cluster_filter.index))].id_new_col)
to_keep += list(df[df.id_col.isin(list(set(test.review_id)))].id_new_col) # this are the reviews annotated for test
to_keep = list(set(to_keep))
df_filtered = df_filtered[df_filtered.id_new_col.isin(to_keep)]
df_filtered[:3]
Out[60]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 5 |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 |
Characteristic Filter
The idea is to consider only reviews that have at least one characteristic
def filter_with_characteristics(df_filtered, characteristics):
tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)
to_keep_in = []
j = 0
for i in df_filtered.id_col:
for token in tokenized_reviews[j]:
if token in characteristics:
to_keep_in.append(i)
break
j+=1
return to_keep_in
to_keep_in = filter_with_characteristics(df_filtered, characteristics)
len(to_keep_in)
progress: 52.083333333333336 %
Out[61]:
114
to_keep_in += list(set(test.review_id)) # this are the reviews annotated for test
df_filtered = df_filtered[df_filtered.id_col.isin(to_keep_in)]
df_filtered[:3]
Out[62]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 5 |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 |
2.3 Characteristics Sentiment Extraction
This task was approached with five combined methods.
(1) Manually inputting set/gazetteer to fix wordnet sentiments that should be positive/negative instead, to ignore certain opinion words (for example “unlocked”, “old”, “normal”, “yellow”, etc.) and to include words that are not tagged as adjectives (i.e. opinion words) such as “broken”, “love” or “cool”.
(2) Inverting the polarity of words when they were negated.
(3) Using “Minqing Hu and Bing Liu” lexicon when opinion words were not supported by wordnet (either missing or neutral).
(4) Extracting the nearests opinion words (with a maximum set at 2) considering token distance (with a maximum set at 5).
(5) Computing the final characteristic sentiment score weighting by their distance (the further apart the lesser its weight).
For (4) the procedure considered looking at opinion words before and after the characteristic found and always keeping the closest ones first (for example taking the opinion word at distance -1 before the one at distance +2) where distance refers to the numbers of tokens from the characteristic word. The maximum amount of opinion words was set at two since usually when a third is found it is because the application misses a new characteristic that was there and which that third opinion word was referring to, hence avoiding assigning incorrectly an opinion word to a characteristic. Furthermore the procedure also considers if an opinion word has already been assigned to a characteristic, which is an advantage (for example avoiding to assign an opinion word twice) as well as a limitation (an opinion word between two characteristic might end up being incorrectly assigned to the first characteristic found) and should be handled in further improvements through Relationship Extraction (RE).
Another limitation and challenge of this task was the fact that customers usually give a review comparing the product with another one. This is problematic since for example they could be talking positively about the screen of the product of importance and negatively about another one they previously had, giving a neutral sentiment in the end. This was not handled in the current project and should be further investigated using RE as well.
For (5) an altered formulation proposed by Ding et al. was used, where the sentiment score for one characteristic of one product is aggregated across all sentiment polarities , as follows:
- Challenges not taken care of: Sometimes customers explain they got rid of their old phone that had a "bad" camera, to get a new one. This algorithm considers that that bad camera as part of the new phone.
Load functions
positive_exceptions = ["high", "surprised"] # wordnet have it as negative, should be positive
negative_exceptions = ["old"] # wordnet have it as positive, should be negative.
ignore_exceptions = ["old", "new", "unlocked", "normal"]
ignore_exceptions += colors
word_exceptions = ["missing", "broken", "love", "awesome", "cool"] # They are not tagged as JJ sometimes, they should.
def compute_score(word, word_neg):
if word in ignore_exceptions:
return 0
if word in positive_exceptions:
if word_neg.find("_NEG") == -1:
return 1
else:
return -1
if word in negative_exceptions:
print(word)
if word_neg.find("_NEG") == -1:
return -1
else:
return 1
word2 = ''.join([word,".a.01"])
try:
pos_score = swn.senti_synset(word2).pos_score()
neg_score = swn.senti_synset(word2).neg_score()
except:
if word in pos_list:
pos_score = 1
neg_score = 0
elif word in neg_list:
pos_score = 0
neg_score = 1
else:
return 0
if pos_score > neg_score:
if word_neg.find("_NEG") == -1:
return 1
else:
return -1
elif neg_score > pos_score:
if word_neg.find("_NEG") == -1:
return -1
else:
return 1
else:
if word in pos_list:
return 1
elif word in neg_list:
return -1
else:
return 0
def extract_characteristic_opinion_words(review, review_neg, max_opinion_words = 2, max_distance = 5, use_distance = False):
review_charactetistics_sentiment = defaultdict(list)
i = 0
temp_review = []
for word in review:
word = word + ("free",)
temp_review.append(list(word))
for i in range(len(review)):
if review[i][0] in characteristics:
keep_forward = True
keep_backward = True
opinion_words = 0
for j in range(1,max_distance+1):
if i+j >= len(review):
keep_forward = False
if keep_forward:
if review[i+j][0] in characteristics or opinion_words >= max_opinion_words:
keep_forward = False
elif i+j < len(review) and (review[i+j][1] in ["JJ", "JJR", "JJS"] or review[i+j][0] in word_exceptions) and temp_review[i+j][2] == "free":
sentiment = defaultdict(int)
score = compute_score(review[i+j][0], review_neg[i+j][0])
if score == 0: continue
if use_distance:
distance = j
else:
distance = 1
sentiment[review[i+j][0]] = (score,distance)
review_charactetistics_sentiment[review[i][0]].append(sentiment)
temp_review[i+j][2] = "used"
opinion_words +=1
if i-j < 0:
keep_backward = False
if keep_backward:
if review[i-j][0] in characteristics or opinion_words >= max_opinion_words:
keep_backward = False
elif i-j > -1 and (review[i-j][1] in ["JJ", "JJR", "JJS"] or review[i-j][0] in word_exceptions) and temp_review[i-j][2] == "free":
sentiment = defaultdict(int)
score = compute_score(review[i-j][0], review_neg[i-j][0])
if score == 0: continue
if use_distance:
distance = j
else:
distance = 1
sentiment[review[i-j][0]] = (score,distance)
review_charactetistics_sentiment[review[i][0]].append(sentiment)
temp_review[i-j][2] = "used"
opinion_words +=1
if not keep_forward and not keep_backward:
break
return review_charactetistics_sentiment
def consolidate_score(characteristic_dict):
num = 0
den = 0
for opinion in characteristic_dict:
for k, v in opinion.items():
num += v[0]/v[1]
den += 1/v[1]
return num/den
def compute_sentiment_scores(tokenized_pos, tokenized_pos_neg, max_distance = 5, use_distance = True):
if len(tokenized_pos) != len(tokenized_pos_neg):
print("FATAL ERROR: Different lenght between tokenized_pos and tokenized_pos_neg")
return None
else:
reviews_sentiment_scores = []
for i in range(len(tokenized_pos)):
review_sentiment_score = defaultdict(int)
review_characteristics_opinion_words = extract_characteristic_opinion_words(tokenized_pos[i], tokenized_pos_neg[i], max_distance = max_distance, use_distance = use_distance)
for k, v in review_characteristics_opinion_words.items():
review_sentiment_score[k] = consolidate_score(v)
reviews_sentiment_scores.append(review_sentiment_score)
return reviews_sentiment_scores
def get_NN_count(tokenized_pos):
NN_count = []
for review in tokenized_pos:
review_NN_count = 0
for token in review:
if token[1] in ["NN", "NNS", "NNP"] or token[0] in characteristics:
review_NN_count += 1
NN_count.append(review_NN_count)
return NN_count
tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)
tokenized_pos = get_pos(tokenized_reviews)
tokenized_reviews_neg = get_tokens(df_filtered, stem = False, negation = True)
tokenized_pos_neg = get_pos(tokenized_reviews_neg)
NN_count = get_NN_count(tokenized_pos)
df_filtered['new_id'] = range(0, len(df_filtered))
progress: 59.171597633136095 % progress: 59.171597633136095 %
The following review as an example gives insight of the application capabilities and limitations:
lookup_product_id = 7
for val in df_filtered[df_filtered.new_id == lookup_product_id]["Review"]: print(val)
display(tokenized_pos[lookup_product_id])
review_characteristics_opinion_words = extract_characteristic_opinion_words(tokenized_pos[lookup_product_id], tokenized_pos_neg[lookup_product_id], max_distance = 5, use_distance = True)
display(review_characteristics_opinion_words)
This phone in an excellent phone at a great price. I was impressed with the features of this phone and would recommend this to anyone.
[('phone', 'NN'), ('excellent', 'JJ'), ('phone', 'NN'), ('great', 'JJ'), ('price', 'NN'), ('impressed', 'VBD'), ('features', 'NNS'), ('phone', 'NN'), ('would', 'MD'), ('recommend', 'VB'), ('anyone', 'NN')]
defaultdict(list, {'price': [defaultdict(int, {'great': (1, 1)}), defaultdict(int, {'excellent': (1, 3)})]})
On the sentiment dictionary we store the characteristic as well as its opinion words with the sentiment score {-1,1} and the distance from the characteristic. Because “impressed” was tagged as a Verb it was not included as an opinion word. The application deal with this kind of cases using a gazetteer (not implemented for “impressed” in this case). Furthermore since “features” does not have any opinion words after it, it was not included in the sentiment dictionary (and even if “impressed” was considered as an opinion word it would be given to “price” which comes first).
review_sentiment_scores = compute_sentiment_scores(tokenized_pos, tokenized_pos_neg, max_distance = 5, use_distance = True)
review_sentiment_scores[:6]
df_filtered["Sentiments"] = list(review_sentiment_scores)
df_filtered["NN_count"] = list(NN_count)
df_filtered[:3]
Out[68]:
Product | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | cluster | new_id | Sentiments | NN_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
53 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 | 0 | {} | 1 |
69 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 5 | 1 | {'WiFi': 1.0} | 14 |
71 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.0 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 | 2 | {} | 26 |
3. Performance
To determine how effective the application is performing we separated the measurements in two steps.
(1) Measure effectiveness of the mobile phones characteristics extraction and
(2) over the corrected characteristics extracted, measure how effective the sentiments were recorded.
For both steps we created a manually annotated test set with ~150 reviews chosen at random. The format of the test set is as follows:
Where the third column corresponds to manually inputted results.
For step (1) we compared the characteristics extracted by the application for the reviews annotated in the test set. The measurements computed for each review were:
● True_Positives: Correctly extracted characteristics
● True_Negatives: All potential characteristics (NN/NNPs) that were not considered and are not in test set
● False_Positives: Incorrectly extracted characteristics (i.e. not in the test set)
● False_Negatives: Missed characteristics that were considered in the test set
Based on those metrics aggregated on all reviews we calculated Specificity (0.773), Recall (0.070), F1_score (0.036) and Accuracy (0.720).
Our main focus is to have a high Recall, that is, to correctly extract characteristics which represent the main output of the business objective.
Currently it’s extremely low failing to produce relevant insights. Since in contrast Specificity is relatively high it further proves that the model is missing characteristics.
For step (2) using only the characteristics correctly extracted (Recall results) we compared their sentiment scored against those from the test set. The measurements computed for each review were:
● True_Positives: Characteristic correctly classified with positive score
● True_Negatives: Characteristic correctly classified with negative score
● False_Positive: Characteristic incorrectly classified with positive score
● False_Negatives: Characteristic incorrectly classified with negative score
Based on those metrics aggregated on all reviews we calculated Specificity (0.8), Recall (0.666), F1_score (0.666) and Accuracy (0.75). However, results are not statistically significant since the test set on this part was extremely low with only 7 reviews considered that had the correct characteristic extraction. Nonetheless it gives insights that assigning correct sentiment scores is performing better than characteristic extraction with higher Specificity and Recall.
Load and correct Test Data
# The initial format of he annotated test_set is difficult to read
# as a dataframe, transformation to .csv format is computed first
# with regular expressions.
test = open('data/annotated_test_set.txt','r', encoding='utf8')
test_file = test.read()
test.close()
test_file[:200]
test_file = re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), test_file)
test_file = test_file.replace(';', "%")
test_file = test_file.replace(',', ";")
test_file = test_file.replace('%', ",")
test_file = test_file.replace('{', "{'")
test_file = test_file.replace(',', ",'")
test_file = test_file.replace(':', "':")
test_file = test_file.replace("},'", "}")
# Once fixed, save and load:
text_file = open("data/annotated_test_set_corrected.csv", "w")
for row in test_file.split(",\n"):
text_file.write(row)
text_file.write("\n")
text_file.close()
test = open('data/annotated_test_set_corrected.csv','r', encoding='utf8')
test_file = test.read()
test.close()
test = pd.read_csv('data/annotated_test_set_corrected.csv', delimiter = ";")
test.columns = ['review_id', 'Product', 'Sentiments_test']
test[:3]
Out[70]:
review_id | Product | Sentiments_test | |
---|---|---|---|
0 | 1540 | BlackBerry Curve | {'Trackball':-1,'Battery':-1,'Micro-SD':-1} |
1 | 1554 | Acer Liquid E700 TRIO | {'Camera':-1,'Hardware':-1,'Buttons':-1} |
2 | 1697 | Alcatel OneTouch | {'Hardware':-1,'Charging Port':-1} |
df_merge = pd.merge(df_filtered, test, left_on='id_col', right_on='review_id', how = "left")
df_merge[df_merge.Sentiments_test.isnull()==False]
Out[71]:
Product_x | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | cluster | new_id | Sentiments | NN_count | review_id | Product_y | Sentiments_test | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | muy buen producto | 0.0 | 53 | 0 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 | 0 | {} | 1 | 53.0 | Asha 302 | {'sound': 1,' smart phone features': 1,' soft... |
1 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... | 13.0 | 69 | 1 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 5 | 1 | {'WiFi': 1.0} | 14 | 69.0 | Asha 302 | {'build': 1,' keyboard': 1,'sound': 1,' Xpres... |
2 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 1 | Hola, compramos dos teléfonos y vienieron tota... | 2.0 | 71 | 2 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 | 2 | {} | 26 | 71.0 | Asha 302 | {'build': 1,' reception': 1,' audio': 1,' key... |
3 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ... | 0.0 | 73 | 3 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 0 | 3 | {} | 8 | 73.0 | Asha 302 | {'apps': 1} |
4 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 4 | The keys are a little hard to hit, and I didn'... | 0.0 | 75 | 4 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 1 | 4 | {'didnt': -1.0, 'keyboard': 1.0} | 5 | 75.0 | Asha 302 | {'SMS': 1,' rings': 1,' body': 1,' freezes': -1} |
5 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | I bought this phone as a Christmas present for... | 3.0 | 78 | 5 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 13 | 5 | {'amazon': -0.14285714285714285, 'features': 0... | 57 | 78.0 | Asha 302 | {'ring tones': 1} |
6 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 4 | The Phone is pretty good. I am using it with a... | 2.0 | 79 | 6 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 13 | 6 | {} | 12 | 79.0 | Asha 302 | {'wi-fi': 1,' calendar': 1,' alarm clock': 1,... |
7 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | This phone in an excellent phone at a great pr... | 1.0 | 82 | 7 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 13 | 7 | {'price': 1.0} | 6 | 82.0 | Asha 302 | {'price': 1} |
8 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 4 | This is a good phone although it seems to have... | 1.0 | 84 | 8 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 1 | 8 | {} | 7 | 84.0 | Asha 302 | {'time': -1,' support': -1,' booklet': -1} |
9 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | I've been a long time user of the iPhone. I fi... | 2.0 | 85 | 9 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 13 | 9 | {'battery': 1.0, 'email': 1.0, 'etc': 1.0, 'ke... | 45 | 85.0 | Asha 302 | {'screen': -1,' calling': 1,' messaging': 1,'... |
10 | "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... | Nokia | 299.00 | 5 | This phone, like all of Nokia's feature phones... | 12.0 | 86 | 10 | 18 | "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... | 13 | 10 | {'features': 1.0, 'value': -1.0, 'keyboard': -... | 81 | 86.0 | Asha 302 | {'texting interface': 1,' battery': 1,' email... |
11 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 5 | Very nice.arrived on time. I love it. | 0.0 | 734 | 14 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 12 | 11 | {} | 1 | 734.0 | Lenovo A850 | {'screen':1,' audio': 1,' apps': -1,' speed':... |
12 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 3 | I sent the phone to Colombia, and They had to ... | 0.0 | 755 | 15 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 13 | 12 | {} | 9 | 755.0 | Lenovo A850 | {'language setting': -1,' battery': -1} |
13 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 2 | Didn't have the color that originally wanted w... | 0.0 | 773 | 16 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 3 | 13 | {} | 14 | 773.0 | Lenovo A850 | {'apps': 1,' price': 1} |
14 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 3 | sometimes the screen and home button are unres... | 0.0 | 774 | 17 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 6 | 14 | {'button': -0.6666666666666667} | 5 | 774.0 | Lenovo A850 | {'charger': -1} |
15 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 5 | Nice phone. Android gsm with 2sims great. | 0.0 | 776 | 18 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 13 | 15 | {'android': 1.0} | 3 | 776.0 | Lenovo A850 | {'screen': -1,' button': -1,' brand': 1} |
16 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 5 | I like this smartphone, good quality very very... | 0.0 | 828 | 19 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 13 | 16 | {'quality': 1.0, 'color': 1.0, 'amazon': 1.0} | 10 | 828.0 | Lenovo A850 | {'price': 1,' size': 1} |
17 | 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... | NaN | 161.06 | 5 | I have not reached the tlf | 1.0 | 943 | 20 | 9 | 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... | 0 | 17 | {} | 1 | 943.0 | Lenovo A850 | {'speed': 1,' size': 1,' screen': 1,' camera'... |
19 | 8330 BlackBerry Curve (US Cellular) Titanium P... | NaN | 29.95 | 1 | I recevied the phone with broken trackball, mi... | 4.0 | 1540 | 26 | 322 | 8330 BlackBerry Curve Cellular) Titanium | 13 | 19 | {'trackball': -1.0, 'microsd': -1.0} | 16 | 1540.0 | BlackBerry Curve | {'Trackball':-1,'Battery':-1,'Micro-SD':-1} |
20 | Acer Liquid Jade Z Andoid KitKat Unlocked Quad... | Acer | 129.99 | 2 | I had high hopes for this Acer phone based the... | 0.0 | 1554 | 27 | 179 | Acer Liquid Jade Z Andoid KitKat Quad-Core 5" ... | 13 | 20 | {'plastic': 0.6666666666666667, 'battery': 0.1... | 34 | 1554.0 | Acer Liquid E700 TRIO | {'Camera':-1,'Hardware':-1,'Buttons':-1} |
21 | ALCATEL OneTouch Idol 3 Global Unlocked 4G LTE... | Alcatel | 292.98 | 5 | It's good | 0.0 | 1697 | 28 | 112 | ALCATEL OneTouch Idol 3 Global 4G LTE Smartpho... | 1 | 21 | {} | 0 | 1697.0 | Alcatel OneTouch | {'Hardware':-1,'Charging Port':-1} |
22 | ALCATEL OneTouch Idol 3 Global Unlocked 4G LTE... | Alcatel | 129.00 | 1 | I am never one to write negative reviews but i... | 2.0 | 1930 | 29 | 444 | ALCATEL OneTouch Idol 3 Global 4G LTE Smartpho... | 13 | 22 | {'audio': 1.0} | 18 | 1930.0 | Alcatel OneTouch | {'Screen':1,'Size':-1} |
23 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 1 | The phone that I got doesnt work! | 0.0 | 3177 | 32 | 27 | Apple Iphone 5c A1532 16 GB Cell | 5 | 23 | {} | 2 | 3177.0 | iPhone 5c | {'size': 1,' charger': 1,' apps': 1,' headpho... |
24 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 5 | Received a great looking and working used phon... | 0.0 | 3270 | 33 | 27 | Apple Iphone 5c A1532 16 GB Cell | 13 | 24 | {} | 3 | 3270.0 | iPhone 5c | {'Wifi':-1} |
25 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 5 | Received a great looking and working used phon... | 0.0 | 3270 | 33 | 27 | Apple Iphone 5c A1532 16 GB Cell | 13 | 24 | {} | 3 | 3270.0 | iPhone 5c | {'wi-fi': -1} |
26 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 5 | good phone unlocked | 0.0 | 3274 | 34 | 27 | Apple Iphone 5c A1532 16 GB Cell | 1 | 25 | {} | 1 | 3274.0 | iPhone 5c | {'screen':1,'speed':1,'battery':-1} |
27 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 2 | The phone came with a bad speaker I could retu... | 0.0 | 3308 | 35 | 27 | Apple Iphone 5c A1532 16 GB Cell | 13 | 26 | {'speakers': -1.0} | 8 | 3308.0 | iPhone 5c | {'charging':1,'battery':1} |
28 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 1 | I did not receive a Verizon wireless,as stated... | 31.0 | 3310 | 36 | 27 | Apple Iphone 5c A1532 16 GB Cell | 8 | 27 | {} | 3 | 3310.0 | iPhone 5c | {'speaker':-1} |
29 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 5 | The phone was like new, and works perfect, tha... | 0.0 | 3323 | 37 | 27 | Apple Iphone 5c A1532 16 GB Cell | 10 | 28 | {} | 3 | 3323.0 | iPhone 5c | {'charger':-1} |
30 | Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... | Apple | 33.00 | 5 | quick delivery, product received as described,... | 0.0 | 3329 | 38 | 27 | Apple Iphone 5c A1532 16 GB Cell | 7 | 29 | {} | 3 | 3329.0 | iPhone 5c | {'battery':-1} |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
140 | HTC Rhyme 3G Android Smartphone Plum Verizon | HTC | 64.99 | 4 | This phone is what a smart stands for; the nav... | 0.0 | 198109 | 594 | 8 | HTC Rhyme 3G Android Smartphone Plum | 13 | 139 | {'system': 1.0, 'etc': 1.0} | 18 | 198109.0 | HTC Rhyme | {'charger':-1,'battery':-1} |
141 | HTC Rhyme 3G Android Smartphone Plum Verizon | HTC | 64.99 | 5 | It was a delightful surprise to find that this... | 0.0 | 198111 | 595 | 8 | HTC Rhyme 3G Android Smartphone Plum | 5 | 140 | {} | 4 | 198111.0 | HTC Rhyme | {'navigation system':1,'voice search':1,'spea... |
142 | HTC Rhyme 3G Android Smartphone Plum Verizon | HTC | 64.99 | 5 | just started having a few issues but it has wo... | 0.0 | 198115 | 596 | 8 | HTC Rhyme 3G Android Smartphone Plum | 0 | 141 | {} | 1 | 198115.0 | HTC Rhyme | {'size':1,'weight':1,'keyboard':-1,'SD card':... |
143 | Huawei Ascend P7 16G 5" Android 4.4 Quad Core ... | Huawei | 2066.00 | 5 | All very good, excellent product. | 0.0 | 199958 | 604 | 304 | Huawei Ascend P7 16G 5" Android 4.4 Quad Core ... | 4 | 142 | {} | 1 | 199958.0 | Huawei Ascend P7 | {'software':-1} |
144 | HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... | Huawei | 182.99 | 5 | t camera and design*update* still love the pho... | 1.0 | 199986 | 605 | 61 | HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone | 13 | 143 | {} | 10 | 199986.0 | Huawei Ascend P7 | {'image quality':-1,'coverage':-1} |
145 | HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... | Huawei | 182.99 | 4 | Network weak | 0.0 | 199992 | 606 | 61 | HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone | 3 | 144 | {'network': -1.0} | 1 | 199992.0 | Huawei Ascend P7 | {'wifi':-1} |
146 | HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... | Huawei | 182.99 | 4 | Great phone good price. It needs to come with ... | 6.0 | 200009 | 607 | 61 | HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone | 1 | 145 | {'price': 1.0} | 6 | 200009.0 | Huawei Ascend P7 | {'price':1,'hardware':1} |
147 | HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... | Huawei | 182.99 | 5 | excellent service and product as described | 0.0 | 200041 | 608 | 61 | HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone | 7 | 146 | {'service': 1.0} | 2 | 200041.0 | Huawei Ascend P7 | {'specs':1,'price':1,'software':1,'screen':1,... |
148 | Huawei GX8 Unlocked Smartphone (US Version: RI... | Huawei | 285.00 | 5 | Best phone I have had. Fingerprint sensor is e... | 1.0 | 200629 | 610 | 64 | Huawei GX8 Smartphone Version: RIO-L03) Horizon | 13 | 147 | {'device': 1.0, 'battery': -1.0} | 9 | 200629.0 | Huawei GX8 | {'battery': 1} |
149 | Huawei GX8 Unlocked Smartphone (US Version: RI... | Huawei | 285.00 | 5 | It's less than half the price of Galaxy 6, but... | 3.0 | 200641 | 611 | 64 | Huawei GX8 Smartphone Version: RIO-L03) Horizon | 11 | 148 | {'feel': 1.0} | 6 | 200641.0 | Huawei GX8 | {'price': 1,' security options': 1,' battery'... |
150 | Huawei GX8 Unlocked Smartphone (US Version: RI... | Huawei | 285.00 | 4 | Phone looks and feel nice. It is about the siz... | 13.0 | 200642 | 612 | 64 | Huawei GX8 Smartphone Version: RIO-L03) Horizon | 13 | 149 | {'feel': 1.0, 'size': 1.0, 'didnt': -1.0} | 11 | 200642.0 | Huawei GX8 | {'price': 1,' design': 1,' screen': 1,' finge... |
151 | Huawei GX8 Unlocked Smartphone (US Version: RI... | Huawei | 285.00 | 1 | Just bought a gx8 from Amazon. After one day o... | 1.0 | 200657 | 613 | 64 | Huawei GX8 Smartphone Version: RIO-L03) Horizon | 5 | 150 | {} | 11 | 200657.0 | Huawei GX8 | {'multi-tasking': -1,' fingerprint reader': 1... |
152 | Huawei GX8 Unlocked Smartphone (US Version: RI... | Huawei | 285.00 | 5 | Just switched from iphone 6s plus to this prod... | 4.0 | 200658 | 614 | 64 | Huawei GX8 Smartphone Version: RIO-L03) Horizon | 3 | 151 | {} | 17 | 200658.0 | Huawei GX8 | {'battery': 1,' camera': 1,' screen': 1,' pri... |
153 | Huawei Mate 2 - Factory Unlocked (Black) | Huawei | 229.99 | 2 | I bought the phone July 2014,It didn't work we... | 2.0 | 200946 | 615 | 102 | Huawei Mate 2 Factory | 5 | 152 | {'voice': -1.0, 'function': -1.0} | 30 | 200946.0 | Huawei Mate 2 | {'battery':1,'bluetootch':1,'size':1} |
154 | Huawei Mate 2 - Factory Unlocked (Black) | Huawei | 229.99 | 1 | This phone ages quickly. My previous phone was... | 2.0 | 200955 | 616 | 102 | Huawei Mate 2 Factory | 13 | 153 | {'look': -1.0, 'settings': -1.0, 'power': -1.0} | 48 | 200955.0 | Huawei Mate 2 | {'battery':1,'screen':1} |
155 | LG Optimus S Android Phone, Gray (Sprint) | LG | 69.98 | 4 | I was skeptical of buying this item because of... | 0.0 | 236742 | 697 | 24 | LG Optimus S Android (Sprint) | 0 | 154 | {'design': -0.3333333333333333} | 37 | 236742.0 | Optimus S | {'look': 1,' case': 1,' screen protector': 1} |
156 | LG Optimus S Android Phone, Gray (Sprint) | LG | 69.98 | 4 | Great rubberized case! Instead of being flimsy... | 0.0 | 236743 | 698 | 24 | LG Optimus S Android (Sprint) | 11 | 155 | {'feel': 1.0, 'look': 0.3333333333333333} | 21 | 236743.0 | Optimus S | {'case': 1,' design': -1} |
157 | LG Optimus S Android Phone, Gray (Sprint) | LG | 69.98 | 5 | Thank you | 0.0 | 236745 | 699 | 24 | LG Optimus S Android (Sprint) | 10 | 156 | {} | 1 | 236745.0 | Optimus S | {'case': 1,' screen protector': -1} |
158 | LG Optimus S Android Phone, Gray (Sprint) | LG | 69.98 | 5 | I purshes this ad on for my cell phone sprit s... | 0.0 | 236748 | 700 | 24 | LG Optimus S Android (Sprint) | 10 | 157 | {'cell': 1.0} | 15 | 236748.0 | Optimus S | {'design': 1,' case': -1} |
159 | LG Optimus S Android Phone, Gray (Sprint) | LG | 69.98 | 1 | They sent me the wrong case. I was so disappoi... | 0.0 | 236750 | 701 | 24 | LG Optimus S Android (Sprint) | 0 | 158 | {} | 9 | 236750.0 | Optimus S | {'build': -1} |
160 | LG Optimus S Android Phone, Gray (Sprint) | LG | 69.98 | 5 | This case is for an LG Optimus S but fits the ... | 0.0 | 236751 | 702 | 24 | LG Optimus S Android (Sprint) | 0 | 159 | {'design': -1.0} | 14 | 236751.0 | Optimus S | {'case': -1} |
161 | LG Xenon GR500 Unlocked Phone with QWERTY Keyb... | LG | 129.99 | 5 | This phone is easy to maneuver and user friend... | 2.0 | 238741 | 707 | 49 | LG Xenon GR500 with QWERTY 2MP and Touch Screen | 13 | 160 | {'speakers': 1.0, 'look': 1.0} | 15 | 238741.0 | LG Xenon GR500 | {'keyboard': 1,' color': 1} |
162 | LG Xenon GR500 Unlocked Phone with QWERTY Keyb... | LG | 129.99 | 4 | I bought this phone for my son, who has had lo... | 0.0 | 238859 | 708 | 49 | LG Xenon GR500 with QWERTY 2MP and Touch Screen | 13 | 161 | {'service': -1.0, 'didnt': -1.0, 'features': 1.0} | 31 | 238859.0 | LG Xenon GR500 | {'screen': -1,' ease of use': 1,' battery': -1} |
163 | LG Xenon GR500 Unlocked Phone with QWERTY Keyb... | LG | 129.99 | 5 | Great phone for the money. Easy to operate and... | 3.0 | 238891 | 709 | 49 | LG Xenon GR500 with QWERTY 2MP and Touch Screen | 13 | 162 | {} | 7 | 238891.0 | LG Xenon GR500 | {'ease of use': 1,' keyboard': 1} |
164 | Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... | Microsoft | 300.51 | 5 | The most important for me is that it let me wo... | 1.0 | 240859 | 715 | 90 | Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... | 5 | 163 | {'windows': 1.0} | 7 | 240859.0 | Microsoft Lumia 950 | {'price': 1,' camera': 1,' battery': 1,' weig... |
165 | Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... | Microsoft | 300.51 | 3 | Not too bad. | 1.0 | 240862 | 716 | 90 | Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... | 0 | 164 | {} | 0 | 240862.0 | Microsoft Lumia 950 | {'size': 1,' software': 1,' apps': 1} |
166 | Microsoft Lumia 950 XL RM-1085 32GB Black, Sin... | Microsoft | 328.41 | 5 | As a Tmobile customer, I was saddened when I l... | 4.0 | 240956 | 718 | 45 | Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... | 13 | 165 | {'windows': 1.0, 'cover': 0.3333333333333333} | 66 | 240956.0 | Microsoft Lumia 950 | {'speed': 1,' case': 1,' screen': 1} |
167 | Microsoft Lumia 950 XL RM-1085 32GB Black, Sin... | Microsoft | 328.41 | 5 | I can't even tell you how much I love this pho... | 1.0 | 241013 | 720 | 45 | Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... | 13 | 166 | {'online': -1.0, 'device': -0.3333333333333333... | 38 | 241013.0 | Microsoft Lumia 950 | {'ease of use': 1} |
168 | Microsoft Lumia 950 XL RM-1085 32GB White, Sin... | NaN | 333.41 | 4 | So,the phone arrived with no phone case and no... | 2.0 | 241146 | 722 | 45 | Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... | 13 | 167 | {} | 8 | 241146.0 | Microsoft Lumia 950 | {'internet': -1} |
169 | Microsoft Lumia 950 XL RM-1085 32GB White, Sin... | NaN | 333.41 | 5 | This is the single SIM card version, not the d... | 1.0 | 241214 | 723 | 45 | Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... | 5 | 168 | {'features': 1.0} | 52 | 241214.0 | Microsoft Lumia 950 | {'sim card': 1,' speed': 1,' camera': 1,' SMS... |
129 rows × 17 columns
lookup = 1540
for val in df_merge[df_merge.id_col == lookup].Review:
print(val)
df_merge[df_merge.id_col == lookup]
I recevied the phone with broken trackball, missing micro-sd and missing battery.The seller claimed that it is 100% working. i cannot see how such a phone can be workingwithout the internal sd and battery. It claimed that it is OEM and brand new.My obervations indicated this was a poorly attempted refurbished phone. They must berunnin out of second handed parts.
Out[72]:
Product_x | Brand | Price | Rating | Review | Votes | id_col | id_new_col | cluster_name | Standard_Product_Name | cluster | new_id | Sentiments | NN_count | review_id | Product_y | Sentiments_test | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19 | 8330 BlackBerry Curve (US Cellular) Titanium P... | NaN | 29.95 | 1 | I recevied the phone with broken trackball, mi... | 4.0 | 1540 | 26 | 322 | 8330 BlackBerry Curve Cellular) Titanium | 13 | 19 | {'trackball': -1.0, 'microsd': -1.0} | 16 | 1540.0 | BlackBerry Curve | {'Trackball':-1,'Battery':-1,'Micro-SD':-1} |
Load Functions (for characteristics extraction performance)
def characteristics_extraction_performance(NN_count, training, test):
TP = 0
TN = 0
FP = 0
FN = 0
temp_test = []
test = eval(test)
for test_characteristic in test.keys():
test_characteristic = str(test_characteristic).lower()
test_characteristic = re.sub(r'[^A-Za-z /.]','',test_characteristic)
temp_test.append(test_characteristic)
if test_characteristic in training.keys():
TP += 1
else:
FN += 1
TN = NN_count - len(training.keys()) - FN
for train_characteristic in training.keys():
if train_characteristic not in temp_test:
FP += 1
return TP, TN, FP, FN
def compute_characteristics_extraction_performance(df_merge):
total_TP = 0
total_TN = 0
total_FP = 0
total_FN = 0
for i in range(len(df_merge)):
NN_count = df_merge.NN_count[i]
training = df_merge.Sentiments[i]
test = df_merge.Sentiments_test[i]
if pd.isnull(test): continue
TP, TN, FP, FN = characteristics_extraction_performance(NN_count, training, test)
total_TP += TP
total_TN += TN
total_FP += FP
total_FN += FN
if total_TP + total_FP == 0:
TPR_RECALL = 0
else:
TPR_RECALL = total_TP / (total_TP + total_FP)
TNR_SPECIFICITY = total_TN / (total_TN + total_FN)
F1_Score = 2* total_TP / (2*total_TP + total_FP + total_FN)
Accuracy = (total_TP + total_TN) / (total_TP + total_TN + total_FP + total_FN)
fpr = total_FP / (total_FN + total_FP)
return TPR_RECALL, TNR_SPECIFICITY, F1_Score, Accuracy, fpr
Recall, Specificity, F1_Score, Accuracy, fpr= compute_characteristics_extraction_performance(df_merge)
print("Recall: ", Recall)
print("Specificity: ", Specificity)
print("F1_Score: ", F1_Score)
print("Accuracy: ", Accuracy)
Recall: 0.05128205128205128 Specificity: 0.777699364855 F1_Score: 0.0273972602739726 Accuracy: 0.722294654498
Load Functions (for sentiment analysis performance)
def characteristics_sentiment_performance(training, test):
TP = 0
TN = 0
FP = 0
FN = 0
test = eval(test)
for test_characteristic, test_score in test.items():
test_characteristic = str(test_characteristic).lower()
test_characteristic = re.sub(r'[^A-Za-z /.]','',test_characteristic)
if test_characteristic in training.keys():
if test_score == training[test_characteristic]:
if test_score > 0:
TP += 1
else:
TN += 1
else:
if test_score > 0:
FN += 1
else:
FP += 1
else:
continue
return TP, TN, FP, FN
def compute_characteristics_sentiment_performance(df_merge):
total_TP = 0
total_TN = 0
total_FP = 0
total_FN = 0
cases = 0
for i in range(len(df_merge)):
training = df_merge.Sentiments[i]
test = df_merge.Sentiments_test[i]
if pd.isnull(test): continue
TP, TN, FP, FN = characteristics_sentiment_performance(training, test)
if TP+ TN+ FP+ FN > 0:
cases+=1
total_TP += TP
total_TN += TN
total_FP += FP
total_FN += FN
if total_TP + total_FP == 0:
TPR_RECALL = 0
else:
TPR_RECALL = total_TP / (total_TP + total_FP)
TNR_SPECIFICITY = total_TN / (total_TN + total_FN)
F1_Score = 2* total_TP / (2*total_TP + total_FP + total_FN)
Accuracy = (total_TP + total_TN) / (total_TP + total_TN + total_FP + total_FN)
fpr = total_FP / (total_FN + total_FP)
return TPR_RECALL, TNR_SPECIFICITY, F1_Score, Accuracy, cases
Recall, Specificity, F1_Score, Accuracy, cases= compute_characteristics_sentiment_performance(df_merge)
print("Reviews Evaluated: ", cases)
print("Recall: ", Recall)
print("Specificity: ", Specificity)
print("F1_Score: ", F1_Score)
print("Accuracy: ", Accuracy)
Reviews Evaluated: 5 Recall: 0.6666666666666666 Specificity: 0.6666666666666666 F1_Score: 0.6666666666666666 Accuracy: 0.6666666666666666
4. Business Insights
By extracting the main characteristics that customers are reviewing and which rating (i.e sentiment score) they are giving to them the business will be able to understand what positively or negatively affects product reviews and what specifically users choose as highlights or pain points. From the output table with the sentiments scores assigned to each product name characteristics and simple reporting transformation a the following table can be obtained:
Flexible enough allowing to create further reports such as:
Which can then be used by manufacturers (i.e. Apple or Samsung) to improve the quality of their products based on a specific characteristic they are getting negative reviews, and also by sellers who can use this information to diversify their products (for example have one which is strong in screen quality and another in battery) or to stop buying products that have critical issues.
5. Discussion
5.1 Further Improvements
Businesses do not necessarily need to have a sentiment score for reviews, especially for ecommerce sites such as Amazon where a rating is also available. For manufactures in particular even if they have a score they would not know exactly where to prioritize their efforts to improve their products. Instead, by giving them the specifics characteristics where their products are failing or not they get valuable insights to tackle problems as they arise. Hence, the challenge of correctly extracting the products characteristics is of major importance. This application underperforms in the capabilities of extracting the characteristics and seems to perform fairly well in assigning the correct sentiments to them (although some exceptions need to be adjusted through gazetteers by using the domain knowledge of the business and industry). To further improve characteristics extraction an approach using topic modelling could be implemented, where assumptions are made on the probabilistic distribution of topics inside documents. An example of this would be the Latent Dirichlet Allocation that outputs word clusters. By extending the basic model of identifying topics, we can separate sentiment and features from each topic. As mentioned before, opinion word can be incorrectly assigned to characteristics when multiple characteristics are present, a task that could be tackled and improved with the usage of Name Entity Recognition (NER) and Relationship extraction (RE). Because of computational limitations we worked only on a subsample of the ~400,000 reviews. In the future using cloud computing as well as parallelization and improving the algorithm will allow to process an even larger amount of reviews. Finally to have statistically significant results a larger test set should be created with roughly at least 10% of the data (for this project only ~150 reviews were created).
5.2 Conclusion
In this project we analyzed the performance of measuring sentiment analysis on specific characteristics of mobile phones mentioned in customer reviews to provide manufacturers with actionable insights to improve their products and for sellers to improve their offerings. Results shows the worst performance on characteristic extraction where Recall is critically low. This topic is also the main challenge which could be further improved by implementing topic modelling. Sentiment scores on characteristics extraction revealed a good but not great performance suggesting that further improvements could be made using Relationship Extraction. However the test set was too small to have a clear statistical significance on the results.