image
VincentWei

天地间,浩然正气长存,为天地立心,为生民立命,为往圣继绝学,为万世开太平!

Amazon Reviews Analysis: Unlocked Mobile Phones

VincentWei    2020-03-02 15:16

Introduction

Merchants selling products through ecommerce often received a high amount of customers reviews too large in scale for human processing. These reviews often have important business insights that can be leveraged to perform actions that can improve profits. In this project we analyze ~400,000 mobile phone reviews from Amazon.com aiming to find trends and patterns to determine which product characteristics are mentioned most by customers and with what sentiment. Our task is performed in six steps: (1) pre-processing to prepare the data for analysis including tokenization and part-of-speech tagging, (2) product names standardization, (3) characteristics extraction, (4) reviews filtering to remove reviews considered as outliers, unbalanced or meaningless, (5) sentiment extraction for each product-characteristic and (6) performance analysis to determine the accuracy of the model where we evaluate characteristic extraction separately from sentiment scores.

Methodology

A flowchart of the project, including the approach, performance and final business analysis is presented below:

1. Pre-procesing

This part includes:

1.1 Tokenization

Applied to both product names and reviews. It involves removal of stopwords, treating stemming of words, case-folding, removing characters that are not alphanumeric and breaking at whitespace.

Synonyms

Synonyms were grouped together as a means of dimensionality reduction, with manually inputted gazetteer with most common synonyms (for example the words “camera”, “video”, “display” are all transformed into “camera”).

Negation

It was important to handle negation for sentiment analysis so that negated opinion words could be reversed when computing its score. This method comes from Das and Chen 2001 - basically appending the suffix '_NEG' to every word appearing between a negation and a clause-level punctuation mark (such as comma). The built-in function sentiment.util.mark_negation from NLTK package was used without considering double negation.

Spelling correction Because reviews are hand-typed the function 'spell' from the 'autocorrect package' was used to treat misspellings but also considering a manually inputted gazetteer to ignore special cases (for example the word “microsd” was incorrectly being transformed into “micros”).

1.2 Part of Speech tagging

POS tagging was critical for three reasons.

(1) To find adjectives which were all considered as opinion words (as well as others exceptions that will be discussed in next sections),

(2) to extract it’s sentiment score since words have different polarity depending on their POS tag and

(3) to extract products characteristics where Nouns (NN) and Noun-phrases (NNP) were considered as potential candidates.

The function pos_tag from NLTK package was used for this task.

1.3 Vector Space Model and TF * IDF transformation

Vector Space Model

A vector space model was created based on a normalized (by euclidean distance) Term-Document-Matrix via bags-of-words for both product names as well as reviews in preparation for clustering purposes. For the first to standardize product names and for the latter to filter reviews.

Inverse Document Frequency

Another normalized TDM was constructed this time using TF*IDF weightings for each product name term. Its purpose was to determine which potential terms could be considered as standardized product names. The higher the IDF value the more important to be a potential part of the standardized name since the most commons words such as “unlocked”, “black” or “dual-core” should be avoided (and they have low IDF scores).

Load Libraries

from IPython.display import display
import timeit
from collections import defaultdict
import math
import numpy as np
import pandas as pd
import random
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib.dates as md
%matplotlib inline
import operator 
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from bs4 import BeautifulSoup
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk import sentiment
from autocorrect import spell # For spelling correction
from urllib import request

Load alternative for WordNet  

url_pos = r'https://raw.githubusercontent.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/master/data/opinion-lexicon-English/positive-words.txt'
url_neg = r'https://raw.githubusercontent.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107/master/data/opinion-lexicon-English/negative-words.txt'
pos_list = request.urlopen(url_pos).read().decode('utf-8')[1:]
pos_list = pos_list[pos_list.find("a+"):].split("\n")
neg_list = request.urlopen(url_neg).read().decode('ISO-8859-1')[1:]
neg_list = neg_list[neg_list.find("2-faced"):].split("\n")

Load and correct Test Data

 
# The initial format of he annotated test_set is difficult to read
# as a dataframe, transformation to .csv format is computed first 
# with regular expressions.
test = open('data/annotated_test_set.txt','r', encoding='utf8')
test_file = test.read()
test.close()
test_file[:200]
test_file = re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), test_file)
test_file = test_file.replace(';', "%")
test_file = test_file.replace(',', ";")
test_file = test_file.replace('%', ",")
test_file = test_file.replace('{', "{'")
test_file = test_file.replace(',', ",'")
test_file = test_file.replace(':', "':")
test_file = test_file.replace("},'", "}")
# Once fixed, save and load:
text_file = open("data/annotated_test_set_corrected.csv", "w")
for row in test_file.split(",\n"):
    text_file.write(row)
    text_file.write("\n")
text_file.close()
test = open('data/annotated_test_set_corrected.csv','r', encoding='utf8')
test_file = test.read()
test.close()
test = pd.read_csv('data/annotated_test_set_corrected.csv', delimiter = ";")
test.columns = ['review_id', 'Product', 'Sentiments_test']

Load Amazon Reviews Data

https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/downloads/Amazon_Unlocked_Mobile.csv  

df = pd.read_csv('data/Amazon_Unlocked_Mobile.csv', delimiter = ",")
n = len(df)
df.columns = ['Product', 'Brand', 'Price', 'Rating', 'Review', 'Votes']
df['id_col'] = range(0, n)
n_reviews = 1000 # Let's get a sample
keep = sorted(random.sample(range(1,n),n_reviews))
keep += list(set(test.review_id)) # this are the reviews annotated for test
df = df[df.id_col.isin(keep)]
n_reviews = len(df)
df['id_new_col'] = range(0, n_reviews)
df.head()

Out[8]:
  Product Brand Price Rating Review Votes id_col id_new_col
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2
73 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ... 0.0 73 3
75 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 4 The keys are a little hard to hit, and I didn'... 0.0 75 4

Sample review:  

id_prod = 69
for val in df[df.id_col == id_prod].Review:
    print(val)

 
Nokia Asha 302 Unlocked GSM Phone with 3.2MP Camera, Video, QWERTYDependableTraditional Nokia Menu'sNot Complicated like 'Smart Phones'DurableEasy to use on Straighttalk, Internet, WiFi, Bluetooth.

Create functions  

def get_tokens(df, stem = False, negation = False):
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    reviews = []    
    i = 1

 
    for review in df["Review"]:
        tokenized_review = []      
        review = str(review).lower() # lowercase

 
        # Remove every character except A-Z, a-z,space 
        # and punctuation (we'll need it for negation)
        review = re.sub(r'[^A-Za-z /.]','',review) 

 
        # mark_negation needs punctuation separated by white space.
        review = review.replace(".", " .")   

 
        tokens = word_tokenize(review)

 

 
        for token in tokens:
            # Remove single characters and stop words
            if (len(token)>1 or token == ".") and token not in stop: 
                if stem:
                    tokenized_review.append(stemmer.stem(get_synonym(token)))            
                else:
                    tokenized_review.append(get_synonym(token))

 
        if negation:
            tokenized_review = sentiment.util.mark_negation(tokenized_review)   

 
        # Now we can get rid of punctuation and also let's fix some spellings:
        tokenized_review = [correction(x) for x in tokenized_review if x != "." ]

 

 
        reviews.append(tokenized_review)

 
        if i%100 == 0:
            print('progress: ', (i/len(df["Review"]))*100, "%")
        i = i + 1

 
    return reviews

 
def get_pos(tokenized_reviews):
    tokenized_pos = []

 
    for review in tokenized_reviews:
        tokenized_pos.append(nltk.pos_tag(review))

 
    return tokenized_pos

 

 
def get_frequency(tokens):    
    term_freqs = defaultdict(int)    

 
    for token in tokens:
        term_freqs[token] += 1 

 
    return term_freqs
def get_tdm(tokenized_reviews):
    tdm = []

 
    for tokens in tokenized_reviews:
        tdm.append(get_frequency(tokens))

 
    return tdm
def normalize_tdm(tdm):    
    tdm_normalized = []

 
    for review in tdm:
        den = 0
        review_normalized = defaultdict(int)

 
        for k,v in review.items():
            den += v**2
        den = math.sqrt(den)

 
        for k,v in review.items():
            review_normalized[k] = v/den

 
        tdm_normalized.append(review_normalized)

 
    return tdm_normalized
def get_all_terms(tokenized_reviews):
    all_terms = []

 
    for tokens in tokenized_reviews:
        for token in tokens:
            all_terms.append(token)

 
    return(set(all_terms))

 
def get_all_terms_dft(tokenized_reviews, all_terms):
    terms_dft = defaultdict(int)  

 
    for term in all_terms: 
        for review in tokenized_reviews:
            if term in review:
                terms_dft[term] += 1

 
    return terms_dft
def get_tf_idf_transform(tokenized_reviews, tdm, n_reviews):
    tf_idf = []        
    all_terms = get_all_terms(tokenized_reviews)    
    terms_dft = get_all_terms_dft(tokenized_reviews, all_terms)

 
    for review in tdm:
        review_tf_idf = defaultdict(int)
        for k,v in review.items():
            review_tf_idf[k] = v * math.log(n_reviews / terms_dft[k], 2)

 
        tf_idf.append(review_tf_idf)     

 
    return tf_idf
def get_idf_transform(tokenized_reviews, tdm, n_reviews):
    idf = []    
    terms_dft = defaultdict(int)    

 
    all_terms = get_all_terms(tokenized_reviews)

 
    for term in all_terms: 
        for review in tokenized_reviews:
            if term in review:
                terms_dft[term] += 1

 
    for review in tdm:
        review_idf = defaultdict(int)
        for k,v in review.items():
            review_idf[k] = math.log(n_reviews / terms_dft[k], 2)

 
        idf.append(review_idf)     

 
    return idf
def correction(x):
    ok_words = ["microsd"]

 
    if x.find("_NEG") == -1 and x not in ok_words: # Don't correct if they are negated words or exceptions
        return spell(x)
    else:
        return x
def get_synonym(word):
    synonyms = [["camera","video", "display"], 
                ["phone", "cellphone", "smartphone", "phones"],
               ["setting", "settings"],
               ["feature", "features"],
               ["pictures", "photos"],
               ["speakers", "speaker"]]
    synonyms_parent = ["camera", "phone", "settings", "features", "photos", "speakers"]

 
    for i in range(len(synonyms)):
        if word in synonyms[i]:
            return synonyms_parent[i]

 
    return word
def get_similarity_matrix(similarity, tokenized_reviews):
    similarity_matrix = []
    all_terms = get_all_terms(tokenized_reviews)

 
    for review in similarity:
        similarity_matrix_row = []
        for term in all_terms:
            similarity_matrix_row.append(review[term])

 
        similarity_matrix.append(similarity_matrix_row)

 
    return similarity_matrix

 

 
# EXECUTE
tic=timeit.default_timer()
tokenized_reviews = get_tokens(df, stem = False, negation = False)
tokenized_pos = get_pos(tokenized_reviews)
tdm = get_tdm(tokenized_reviews)
vsm = normalize_tdm(tdm)
tf_idf = get_tf_idf_transform(tokenized_reviews, tdm, n_reviews)
toc=timeit.default_timer()
print("minutes: ", (toc - tic)/60)
progress:  8.865248226950355 %
progress:  17.73049645390071 %
progress:  26.595744680851062 %
progress:  35.46099290780142 %
progress:  44.32624113475177 %
progress:  53.191489361702125 %
progress:  62.056737588652474 %
progress:  70.92198581560284 %
progress:  79.7872340425532 %
progress:  88.65248226950354 %
progress:  97.51773049645391 %
minutes:  2.6501673290627097

Let's see a sample of:

  • Tokenized reviews
  • Part of speech
  • Term-Document-Matrix (TDM)
  • TD*IDF transformation
 
lookup_review = 1
for val in df[df.id_new_col == lookup_review]["Review"]: print(val)
display(tokenized_reviews[lookup_review])
display(tokenized_pos[lookup_review])
display(tdm[lookup_review])
display(tf_idf[lookup_review])
Nokia Asha 302 Unlocked GSM Phone with 3.2MP Camera, Video, QWERTYDependableTraditional Nokia Menu'sNot Complicated like 'Smart Phones'DurableEasy to use on Straighttalk, Internet, WiFi, Bluetooth.
['Nokia',
 'Asha',
 'unlocked',
 'grm',
 'phone',
 'imp',
 'camera',
 'camera',
 'qwertydependabletraditional',
 'Nokia',
 'menusnot',
 'complicated',
 'like',
 'smart',
 'phonesdurableeasy',
 'use',
 'straighttalk',
 'internet',
 'WiFi',
 'Bluetooth']
[('Nokia', 'NNP'),
 ('Asha', 'NNP'),
 ('unlocked', 'VBD'),
 ('grm', 'JJ'),
 ('phone', 'NN'),
 ('imp', 'NN'),
 ('camera', 'NN'),
 ('camera', 'NN'),
 ('qwertydependabletraditional', 'JJ'),
 ('Nokia', 'NNP'),
 ('menusnot', 'NN'),
 ('complicated', 'VBN'),
 ('like', 'IN'),
 ('smart', 'JJ'),
 ('phonesdurableeasy', 'NN'),
 ('use', 'NN'),
 ('straighttalk', 'NN'),
 ('internet', 'NN'),
 ('WiFi', 'NNP'),
 ('Bluetooth', 'NNP')]
defaultdict(int,
            {'Asha': 1,
             'Bluetooth': 1,
             'Nokia': 2,
             'WiFi': 1,
             'camera': 2,
             'complicated': 1,
             'grm': 1,
             'imp': 1,
             'internet': 1,
             'like': 1,
             'menusnot': 1,
             'phone': 1,
             'phonesdurableeasy': 1,
             'qwertydependabletraditional': 1,
             'smart': 1,
             'straighttalk': 1,
             'unlocked': 1,
             'use': 1})
defaultdict(int,
            {'Asha': 9.139551352398794,
             'Bluetooth': 5.969626350956481,
             'Nokia': 12.66439286068238,
             'WiFi': 5.010268335453827,
             'camera': 7.393215713100131,
             'complicated': 10.139551352398794,
             'grm': 6.439111634257702,
             'imp': 10.139551352398794,
             'internet': 5.817623257511432,
             'like': 3.0627357553479633,
             'menusnot': 10.139551352398794,
             'phone': 0.9179642311339886,
             'phonesdurableeasy': 10.139551352398794,
             'qwertydependabletraditional': 10.139551352398794,
             'smart': 5.554588851677638,
             'straighttalk': 8.554588851677638,
             'unlocked': 4.554588851677638,
             'use': 3.2940613014544184})

In [ ]:

 

1. Pre-procesing

This part includes:

1.4 Product Names Standardization

Merchants often name their products in different ways, for example “iPhone 4 32GB Black, AT&T” and “iPhone 4 16GB Gold, Verizon”. This tasks objective was to add a new standard name that for this case should simply be “iPhone 4”.

Three different approaches were combined,

(1) manually inputted set/gazetteer with words to be removed,

(2) IDF importance score and

(3) Clustering.

The first step cleaned the names through the (1) gazetteer, removing colors names and common terms (such as “unlocked”).

With the remaining terms (2) selected only the first 5 terms with the highest IDF in the group (none common terms).

Finally with the remaining terms performed the (3) Clustering using a VSM matrix with k=N°_reviews/2 clusters. This number was approximated through trial and error validating with visualization since did not know a priori how many product names were in the dataset. The number of clusters was a trade-off between having different products in the same standardized names (low number of clusters) which was highly undesirable and having too many standardized names that couldn’t standardize properly (for example having “iPhone 4 32GB” and “iPhone 4 16GB”).

A sample output can be seen in figure 2.

The approach attempted was taking advantage of POS tagging. The hypothesis was that nouns (NN and NNP) could be potential terms in a standardized product name, however NLTK tagging couldn’t grab some of the terms as NN which were the most important ones (for example model names such as “A850” were tagged as verbs).

Load Function

 
def get_product_tokens(df):
    stop = set(stopwords.words('english'))
    products = []
    i = 1

 
    for product in df["Product"]:
        tokenized_product = []      
        product = product.lower() # lowercase

 
        # Remove every character except A-Z, a-z,space 
        # and punctuation (we'll need it for negation)
        product = re.sub(r'[^0-9A-Za-z \.]','',product)    

 
        # Only consider first 10 words of the product names
        tokens = word_tokenize(product)[:11]

 
        for token in tokens:
            # Remove stop words
            if token not in stop:
                tokenized_product.append(token)       

 
        products.append(tokenized_product)

 
        if i%100 == 0:
            print('progress: ', (i/len(df["Product"]))*100, "%")
        i = i + 1

 
    return products

tokenized_products = get_product_tokens(df)
products_tokenized_pos = get_pos(tokenized_products)
products_tdm = get_tdm(tokenized_products)
products_tf_idf = get_tf_idf_transform(tokenized_products, products_tdm, n_reviews)
products_idf = get_idf_transform(tokenized_products, products_tdm, n_reviews)
progress:  8.865248226950355 %
progress:  17.73049645390071 %
progress:  26.595744680851062 %
progress:  35.46099290780142 %
progress:  44.32624113475177 %
progress:  53.191489361702125 %
progress:  62.056737588652474 %
progress:  70.92198581560284 %
progress:  79.7872340425532 %
progress:  88.65248226950354 %
progress:  97.51773049645391 %
  • Based on IDF we will get only the words that have the highest importance. The hypothesis is that the most common will be words typical in many smarphones such as "black, unlocked, dual, etc." **

  • Unfortunately we can't filter through POS, it fails to grab the most important words. For example A850 is tagged as Verb when in fact it's the model of a smartphone (the main word to have in the product name) **

  • We can assume that we will not lose the brand (which might not be grabed) since we do have it in a second column.

Visualization for analysis below:

 
lookup_product = 53
display(df[df.id_new_col== lookup_product]["Product"])
# we want to grab those with higher scores (least common terms)
display(sorted(products_idf[lookup_product].items(), 
               key=operator.itemgetter(1), reverse = True)) 
# Unfortunately we can't filter through POS
display(products_tokenized_pos[lookup_product])
10170    Apple iPhone 4 8GB, White, for Straight Talk, ...
Name: Product, dtype: object
[('straight', 9.139551352398794),
 ('talk', 9.139551352398794),
 ('contract', 6.33219643034119),
 ('4', 4.8916238389552085),
 ('8gb', 4.4671260104272985),
 ('white', 3.052088511148454),
 ('apple', 2.2569083030369526),
 ('iphone', 2.202913413396223)]
[('apple', 'NN'),
 ('iphone', 'NN'),
 ('4', 'CD'),
 ('8gb', 'CD'),
 ('white', 'JJ'),
 ('straight', 'JJ'),
 ('talk', 'NN'),
 ('contract', 'NN')]

In [19]:

 
colors = ["black", "red", "blue", "white", "gray", "green","yellow", "pink", "gold"]
common_terms = ["smarthphone", "phone", "cellphone", "retail", "warranty", 
                "silver", "bluetooth", "wifi", "wireless", "keyboard", "gps",
               "original", "unlocked", "camera", "certified", "international",
               "actory", "packaging", "us", "usa", "international", "refurbished", 
               "phones", "att", "verizon", "-", "8gb", "16gb", "32gb", "64gb", "contract"]
def standardize_names(products_idf, colors, common_terms):
    standard_names = []
    brands = [str(x).lower() for x in set(df.Brand)]

 
    for product in products_idf:

 
        for k, v in product.items():
            # Remove color and brand words
            if k in colors or k in common_terms or k in brands:
                product[k] = 0

 
        # Grab the first 5 words with highest score
        product = sorted(product.items(), key=operator.itemgetter(1), reverse = True)[:5]

 
        standard_names.append(product)

 
        tokenized_standard_product_names = []

 
    for product in standard_names:
        product_name = []
        for word in product:
            if word[1] > 0:
                product_name.append(word[0])
        tokenized_standard_product_names.append(product_name)

 

 

 
    return tokenized_standard_product_names

 
standard_product_names = standardize_names(products_idf, colors, common_terms)
product_tdm = get_tdm(standard_product_names)
product_vsm = normalize_tdm(product_tdm)
product_vsm[1]

Out[20]:
defaultdict(int,
            {'3.2mp': 0.4472135954999579,
             '302': 0.4472135954999579,
             'asha': 0.4472135954999579,
             'qwerty': 0.4472135954999579,
             'video': 0.4472135954999579})

CLUSTER PRODUCT NAMES

 
similarity = product_tdm
product_names_clusters = int(round(n_reviews/2,0))
similarity_matrix = pd.DataFrame(get_similarity_matrix(similarity, standard_product_names), columns = get_all_terms(standard_product_names))
kmeans = KMeans(n_clusters=product_names_clusters, random_state=0).fit(similarity_matrix)
clusters=kmeans.labels_.tolist()
clustered_matrix = similarity_matrix.copy()
clustered_matrix['product_name_cluster'] = clusters
clustered_matrix['id_col'] = range(0, n_reviews)
display(clustered_matrix[:5])
count_clusters = pd.DataFrame(clustered_matrix.product_name_cluster.value_counts())
display(count_clusters[:5])
  s5 blanco purple 1touch duos link discontinued lt30at a887 rose ... carrier notification beat easy 4100mah internationally easytouse graul00 product_name_cluster id_col
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 18 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 18 1
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 18 2
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 18 3
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 18 4

5 rows × 924 columns

  product_name_cluster
28 17
30 14
167 13
27 12
9 12

ASSIGN CLUSTER PRODUCT NAMES

 
df["cluster_name"] = list(clustered_matrix.product_name_cluster)
def create_standard_name(df):
    new_names = defaultdict(int)

 
    current_names = df.groupby('cluster_name').first().Product

 

 
    for i in set(clusters):
        cluster_name = df[df.cluster_name == i].Product.value_counts().index[0]
        new_name = []

 
        for word in cluster_name.split():
            temp_word= re.sub(r'[^0-9A-Za-z \.\-]','',word).lower()
            if temp_word not in colors and temp_word not in common_terms :
                new_name.append(word)
        new_names[i] = ' '.join(new_name)

 
    new_standard_names = []

 
    for row in df.cluster_name:

 
        new_standard_names.append(new_names[row])

 
    df["Standard_Product_Name"] = new_standard_names

 
    return df
df = create_standard_name(df)         

 
df.head()    

Out[23]:
  Product Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
73 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ... 0.0 73 3 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
75 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 4 The keys are a little hard to hit, and I didn'... 0.0 75 4 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...

Sample for 'iPhone'


 
df[["Product","Standard_Product_Name"]][df['Product'].str.contains("iPhone")][:8]

Out[26]:
  Product Standard_Product_Name
3968 Apple A1533 Unlocked iPhone 5S Smart Phone, 16... Apple A1533 iPhone 5S Smart 16 GB
7237 Apple iPhone 4 16GB (Black) - AT&T Apple iPhone 4
7283 Apple iPhone 4 16GB (Black) - AT&T Apple iPhone 4
7592 Apple iPhone 4 16GB (Black) - AT&T Apple iPhone 4
8913 Apple iPhone 4 32GB (Black) - Verizon Apple iPhone 4
9022 Apple iPhone 4 32GB (Black) - Verizon Apple iPhone 4
9541 Apple iPhone 4 32GB (White) - Verizon Apple iPhone 4
9739 Apple iPhone 4 8GB Unlocked- Black Apple iPhone 4

2.1 Characteristics Extraction

Two steps were taken to extract the main characteristics from reviews:

(1) manually inputted set/gazetteer with words to be removed or included and

(2) identifying NN/NNP POS tagged terms that exceeded a specific threshold (set in 1%*N°_reviews) of reviews occurrences.

Load functions and shortcuts


 
def get_all_terms_pos_dft(all_terms, terms_dft):
    all_terms_pos = nltk.pos_tag(all_terms)

 
    i = 0
    for k, v in terms_dft.items():
        all_terms_pos[i] = all_terms_pos[i] + (v,)        
        i+=1

 
    return all_terms_pos
def get_threshold_terms(all_terms_pos_dft, threshold = 20):
    threshold_terms = []

 
    for term in all_terms_pos_dft:
        if term[0] in exceptions_to_consider or (term[2] >= threshold and term[1] in ["NN", "NNS", "NNP", "NNPS"] and term[0] not in exceptions_not_to_consider):
            threshold_terms.append(term)

 
    return threshold_terms

 

 
exceptions_to_consider = ["apps", "android", "buttons", "hardware", "wifi",
                         "audio", "speed", "settings", "charger", "design",
                         "price", "look", "trackball", "microsd", "speaker"]
exceptions_not_to_consider = ["phone", "cool", "love", "awesome", "tell",  'tell',
 'feels',
  'works',
 'excelente',
 'item',
 'get',
 'iPhone',
 'dont',
 'lot',
 'let',
 'money',
 'brand',
 'recommend',
 'issues',
 'cant',
 'nothing',
 'number',
 'check',
 'month',
 'husband',
 'need',
 'note',
 'venezuela',
 'give',
 'Samsung',
 'see',
 'turn',
 'pocket',
 'amazing',
 'hands',
 'couldnt',
 'fast',
 'condition',
 'super',                    
 'today',
 'star',
 'life',
 'anyone',
 'storage',
 'speaker',
 'internet',
 'delivery',
 'picture',
 'games',
 'hand',
 'model',
 'glass',
 'case',
 'micro',
 'sound',
 'mp',
 'watch',
 'grm',
 'try',
 'line',
 'thing',
 'isnt',
 'thanks',
 'Verizon',
 'experience',
 'box',
 'scratches',
 'problems',
 'waste',
 'bottom',
 'company',
 'bit',
 'youre',
 'lack',
 'deal',
 'pay',
 'i',
 'reason',
 'issue',
 'couple',
 'option',
 'beautiful',
 'mobile',
 'replacement',
 'wasnt',
 'way',
 'days',
 'loves',
 'trouble',
 'quick',
 'someone',
 'glad',
 'weeks',
 'ones',
 'something',
 'market',
 'galaxy',
 'apple',
 'havent',
 'download',
 'time',
 'lg',
 'send',
 'home',
 'years',
 'product',
 'change',
 'people',
 'review',
 'price',
 'simple',
 'person',
 'lasts',
 'user',
 'hold',
 'please',
 'reviews',
 'work',
 'thats',
 'text',
 'im',
 'end',
 'thank',
 'look',
 'cost',
 'months',
 'buying',
 'point',
 'version',
 'web',
 'times',
 'Nokia',
 'problem',
 'wouldnt',
 'performance',
 'products',
 'minutes',
 'customer',
 'order',
 'guess',
 'things',
 'everything',
 'week',
 'play',
 'daughter',
 'anything',
 'purchase',
 'ok',
 'year',
 'stars',
 'day',
 'wife',
 'son',
 'doesnt',
 'blackberry',
 'hours',
 'return',
 'use']

 
all_terms = get_all_terms(tokenized_reviews)    
terms_dft = get_all_terms_dft(tokenized_reviews, all_terms)
all_terms_pos_dft = get_all_terms_pos_dft(all_terms, terms_dft)
threshold_terms = get_threshold_terms(all_terms_pos_dft, threshold = 0.01 * n_reviews)
threshold_terms[:10]

Out[31]:
[('voice', 'NN', 12),
 ('design', 'NN', 13),
 ('microsd', 'NN', 7),
 ('online', 'NN', 15),
 ('connection', 'NN', 13),
 ('service', 'NN', 36),
 ('warranty', 'NN', 22),
 ('os', 'NN', 13),
 ('WiFi', 'NNP', 35),
 ('network', 'NN', 22)]
 
 
characteristics = [x[0] for x in threshold_terms]
characteristics[:10]

Out[32]:
['voice',
 'design',
 'microsd',
 'online',
 'connection',
 'service',
 'warranty',
 'os',
 'WiFi',
 'network']

2.2 Filtering

To effectively reduce the dataset size and improve performance, we need to filter out unusable, misleading and noisy reviews through 4 methods described below. In the end the dataset was reduced by 77% from an initial ~1500 reviews.

POS Filter Reviews without an adjective POS tag are removed since sentiment orientation is extracted only from adjectives.

Wordnet Filter Reviews with descriptive words not recognised by Wordnet or other sentiment lexicons are also pruned.

Rough Sentiment Analysis Filter To filter misleading reviews, we first conduct a rough sentiment analysis on individual opinion words, giving them a score of -1 or 1, and the overall score of a review, which is the sum of scores of all adjectives in the review. If there are less than three times the number of positive adjectives than that of negative reviews, or vice versa, then we assume the review is noisy and filter it out. Additionally, we assume that the sum of all reviews being positive, zero, or negative corresponds to a rating of >=3, 3 and below 3 respectively. Thus reviews not satisfying this equality condition against the rating are pruned.

Clustering Filter Performing clustering through a raw and normalized (VSM) TDM, the best results were obtained through VSM since it managed to obtain more diverse clusters - TDM was biased to create clusters based on the amount/frequency of words (clustering almost based exclusively on the lengths of the reviews).

Characteristics Filter The last step aims to keep only the reviews that have at least one characteristic. Since the objective of the project is determining why a product is good or bad through their characteristics sentiment instead of just computing the sentiment score of the review which can already be derived by the rating.  

 
# first import 1000 rows in dataframe
to_prune = [i+1 for i in range(n_reviews)]
ratings = list(df['Rating'])

def get_wordnet_pos(pos):   
    for tag in [('J','ADJ'),('V','VERB'),('N','NOUN'),('R','ADV')]:
        if pos.startswith(tag[0]):
            return getattr(wordnet,tag[1])
    else:
        return 'None'
def get_adj(review):
    with_adj = [tup for tup in review if tup[1] == 'JJ']
    return with_adj
# score for each word
def senti(synset):
    s = swn.senti_synset(synset).pos_score() - swn.senti_synset(synset).neg_score()
    if s>=0:
        return 1
    else:
        return -1
adjs = {x.name().split('.', 1)[0] for x in wn.all_synsets('a')}
### 1. prune reviews without adjectives recognised by wordnet
def prune_adj(tokenized_pos):    
    for k in [i for i in to_prune if i!=0]:
        if not len(get_adj(tokenized_pos[k-1])) or not all(i[0] in adjs for i in get_adj(tokenized_pos[k-1])):
                to_prune[k-1] = 0
    return to_prune
### 2. prune by number of pos and neg adj
# list of scores for each review
def slist(tokenized_pos):
    score = []
    for k in [i for i in to_prune if i!=0]:
        r = get_adj(tokenized_pos[k-1])
        tag = [get_wordnet_pos(tuple[1]) for tuple in r]
        synsets = [r[i][0] + '.' + tag[i] + '.01' for i in range(len(r))] 
        score.append([senti(i) for i in synsets])
    return score
def balance(score_list):
    m=-1
    for k in [i for i in to_prune if i!=0]:
        m+=1
        s = score_list[m]
        if 1 in s and -1 in s and max([s.count(1),s.count(-1)])/min([s.count(1),s.count(-1)]) <= 3:
            to_prune[k-1] = 0
    return to_prune
### 3. prune by average score compared to rating score
def average_score(score_list):
    m = -1
    for k in [i for i in to_prune if i!=0]:
        m += 1
        s = score_list[m]
        #(sum >=0, then rating >=3)
        if sum(s)>=0 and (sum(s)+1)*(ratings[k-1]-2.5)<=0:
            to_prune[k-1] = 0
        elif sum(s)<0 and (sum(s)+1)*(2.5-ratings[k-1])<=0:
            to_prune[k-1] = 0
    return to_prune
 
 
# initialise index to_prune = [1,2,3,...,1000]
to_prune = [i+1 for i in range(n_reviews)]
#to_prune = list(set(df.id_col))
to_prune = prune_adj(tokenized_pos)
score_list = slist(tokenized_pos)
to_prune = balance(score_list)
to_prune = average_score(score_list)
# len([i for i in to_prune if i!=0])
​
 
 
to_keep = [i for i in to_prune if i!=0]
to_keep += list(df[df.id_col.isin(list(set(test.review_id)))].id_new_col) # this are the reviews annotated for test
to_keep = list(set(to_keep))
df_filtered = df[df.id_new_col.isin(to_keep)]
df_filtered[:3]

Out[53]:
  Product Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W...

 
len(list(df_filtered[df_filtered.id_col.isin(list(set(test.review_id)))].id_new_col))

Out[54]:
128
 
 
to_keep = list(df_filtered.id_new_col)

clustering filter

 
n_reviews = len(to_keep)
tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)
tokenized_pos = get_pos(tokenized_reviews)
tdm = get_tdm(tokenized_reviews)
vsm = normalize_tdm(tdm)
tf_idf = get_tf_idf_transform(tokenized_reviews, tdm, n_reviews)
similarity = vsm #vsm # tdm
similarity_matrix = pd.DataFrame(get_similarity_matrix(similarity, tokenized_reviews), columns = get_all_terms(tokenized_reviews))
similarity_matrix[:10]
progress:  48.54368932038835 %
progress:  97.0873786407767 %

Out[56]:
  wasnt bien ahead blew mind delight connecting voice drain grm ... mp calls incoming maneuver telephone Zoe copy around wider afterwards
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.204124 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.07785 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.00000 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.102062 0.0 0.00000 0.0

10 rows × 1578 columns  

 
kmeans = KMeans(n_clusters=int(round(math.sqrt(n_reviews),0)), random_state=0).fit(similarity_matrix)
clusters=kmeans.labels_.tolist()
# clustered_matrix = pd.DataFrame(tf_idf_matrix, clusters)
clustered_matrix = similarity_matrix.copy()
clustered_matrix['cluster'] = clusters
clustered_matrix['id_col'] = to_keep
display(len(clustered_matrix))
display(clustered_matrix[:5])
top_clusters = pd.DataFrame(clustered_matrix.cluster.value_counts())
display(top_clusters)
206
  wasnt bien ahead blew mind delight connecting voice drain grm ... incoming maneuver telephone Zoe copy around wider afterwards cluster id_col
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.204124 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5 1
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 2
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 3
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 4

5 rows × 1580 columns

  cluster
13 59
0 28
5 25
3 17
1 17
11 11
10 10
12 8
7 6
6 6
4 6
8 5
2 5
9 3
 

 
limit = top_clusters.cluster.quantile(0.3)
cluster_filter = top_clusters[top_clusters.cluster > limit]
display(cluster_filter)
list(cluster_filter.index)
  cluster
13 59
0 28
5 25
3 17
1 17
11 11
10 10
12 8

Out[58]:
[13, 0, 5, 3, 1, 11, 10, 12]
 
 
df_filtered["cluster"] = list(clustered_matrix.cluster)
df_filtered[:3]
C:\Program Files\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Out[59]:
  Product Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name cluster
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 5
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0
 

 
to_keep = list(df_filtered[df_filtered.cluster.isin(list(cluster_filter.index))].id_new_col)
to_keep += list(df[df.id_col.isin(list(set(test.review_id)))].id_new_col) # this are the reviews annotated for test
to_keep = list(set(to_keep))
df_filtered = df_filtered[df_filtered.id_new_col.isin(to_keep)]
df_filtered[:3]

Out[60]:
  Product Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name cluster
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 5
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0

Characteristic Filter

The idea is to consider only reviews that have at least one characteristic

 

 
def filter_with_characteristics(df_filtered, characteristics):
    tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)
    to_keep_in = []
    j = 0

 
    for i in df_filtered.id_col: 
        for token in tokenized_reviews[j]:
            if token in characteristics:
                to_keep_in.append(i)
                break

 
        j+=1

 
    return to_keep_in

 
to_keep_in = filter_with_characteristics(df_filtered, characteristics)
len(to_keep_in)   

 

 
progress:  52.083333333333336 %

Out[61]:
114

 
to_keep_in += list(set(test.review_id)) # this are the reviews annotated for test
df_filtered = df_filtered[df_filtered.id_col.isin(to_keep_in)]
df_filtered[:3]

Out[62]:
  Product Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name cluster
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 5
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0

2.3 Characteristics Sentiment Extraction

This task was approached with five combined methods.

(1) Manually inputting set/gazetteer to fix wordnet sentiments that should be positive/negative instead, to ignore certain opinion words (for example “unlocked”, “old”, “normal”, “yellow”, etc.) and to include words that are not tagged as adjectives (i.e. opinion words) such as “broken”, “love” or “cool”.

(2) Inverting the polarity of words when they were negated.

(3) Using “Minqing Hu and Bing Liu” lexicon when opinion words were not supported by wordnet (either missing or neutral).

(4) Extracting the nearests opinion words (with a maximum set at 2) considering token distance (with a maximum set at 5).

(5) Computing the final characteristic sentiment score weighting by their distance (the further apart the lesser its weight).

For (4) the procedure considered looking at opinion words before and after the characteristic found and always keeping the closest ones first (for example taking the opinion word at distance -1 before the one at distance +2) where distance refers to the numbers of tokens from the characteristic word. The maximum amount of opinion words was set at two since usually when a third is found it is because the application misses a new characteristic that was there and which that third opinion word was referring to, hence avoiding assigning incorrectly an opinion word to a characteristic. Furthermore the procedure also considers if an opinion word has already been assigned to a characteristic, which is an advantage (for example avoiding to assign an opinion word twice) as well as a limitation (an opinion word between two characteristic might end up being incorrectly assigned to the first characteristic found) and should be handled in further improvements through Relationship Extraction (RE).

Another limitation and challenge of this task was the fact that customers usually give a review comparing the product with another one. This is problematic since for example they could be talking positively about the screen of the product of importance and negatively about another one they previously had, giving a neutral sentiment in the end. This was not handled in the current project and should be further investigated using RE as well.

For (5) an altered formulation proposed by Ding et al. was used, where the sentiment score for one characteristic of one product is aggregated across all sentiment polarities , as follows:

  • Challenges not taken care of: Sometimes customers explain they got rid of their old phone that had a "bad" camera, to get a new one. This algorithm considers that that bad camera as part of the new phone.

Load functions  

 
positive_exceptions = ["high", "surprised"] # wordnet have it as negative, should be positive
negative_exceptions = ["old"] # wordnet have it as positive, should be negative. 
ignore_exceptions = ["old", "new", "unlocked", "normal"]
ignore_exceptions += colors
word_exceptions = ["missing", "broken", "love", "awesome", "cool"] # They are not tagged as JJ sometimes, they should.
def compute_score(word, word_neg):
    if word in ignore_exceptions: 
        return 0

 
    if word in positive_exceptions:
        if word_neg.find("_NEG") == -1:
            return 1
        else:
            return -1

 
    if word in negative_exceptions:
        print(word)
        if word_neg.find("_NEG") == -1:
            return -1
        else:
            return 1

 
    word2 = ''.join([word,".a.01"])
    try:
        pos_score = swn.senti_synset(word2).pos_score()
        neg_score = swn.senti_synset(word2).neg_score()
    except:
        if word in pos_list:
            pos_score = 1
            neg_score = 0
        elif word in neg_list:
            pos_score = 0
            neg_score = 1
        else:
            return 0

 
    if pos_score > neg_score:
        if word_neg.find("_NEG") == -1:
            return 1
        else:

 
            return -1
    elif neg_score > pos_score:
        if word_neg.find("_NEG") == -1:            
            return -1
        else:

 
            return 1   
    else:
        if word in pos_list:
            return 1
        elif word in neg_list:
            return -1
        else:
            return 0

 
def extract_characteristic_opinion_words(review, review_neg, max_opinion_words = 2, max_distance = 5, use_distance = False):
    review_charactetistics_sentiment = defaultdict(list) 
    i = 0

 
    temp_review = []
    for word in review: 
        word = word + ("free",)
        temp_review.append(list(word))

 
    for i in range(len(review)):
        if review[i][0] in characteristics:
            keep_forward = True
            keep_backward = True
            opinion_words = 0

 
            for j in range(1,max_distance+1):

 
                if  i+j >= len(review):
                    keep_forward = False
                if keep_forward:
                    if  review[i+j][0] in characteristics or opinion_words >= max_opinion_words:
                        keep_forward = False
                    elif i+j < len(review) and (review[i+j][1] in ["JJ", "JJR", "JJS"] or review[i+j][0] in word_exceptions) and temp_review[i+j][2] == "free":
                        sentiment = defaultdict(int)
                        score = compute_score(review[i+j][0], review_neg[i+j][0])                   
                        if score == 0: continue
                        if use_distance:
                            distance = j
                        else:
                            distance = 1
                        sentiment[review[i+j][0]] = (score,distance)
                        review_charactetistics_sentiment[review[i][0]].append(sentiment)
                        temp_review[i+j][2] = "used"
                        opinion_words +=1

 

 
                if  i-j < 0:
                    keep_backward = False
                if keep_backward:
                    if  review[i-j][0] in characteristics or opinion_words >= max_opinion_words:
                        keep_backward = False
                    elif i-j > -1 and (review[i-j][1] in ["JJ", "JJR", "JJS"] or review[i-j][0] in word_exceptions) and temp_review[i-j][2] == "free":
                        sentiment = defaultdict(int)
                        score = compute_score(review[i-j][0], review_neg[i-j][0])         
                        if score == 0: continue
                        if use_distance:
                            distance = j
                        else:
                            distance = 1
                        sentiment[review[i-j][0]] = (score,distance)
                        review_charactetistics_sentiment[review[i][0]].append(sentiment)
                        temp_review[i-j][2] = "used"  
                        opinion_words +=1

 
                if not keep_forward and not keep_backward:
                    break

 
    return review_charactetistics_sentiment
def consolidate_score(characteristic_dict):
    num = 0
    den = 0

 
    for opinion in characteristic_dict:
        for k, v in opinion.items():
            num += v[0]/v[1]
            den += 1/v[1]
    return num/den
def compute_sentiment_scores(tokenized_pos, tokenized_pos_neg, max_distance = 5, use_distance = True):

 
    if len(tokenized_pos) != len(tokenized_pos_neg):
        print("FATAL ERROR: Different lenght between tokenized_pos and tokenized_pos_neg")
        return None

 
    else:

 
        reviews_sentiment_scores = []        

 
        for i in range(len(tokenized_pos)):
            review_sentiment_score = defaultdict(int)

 
            review_characteristics_opinion_words = extract_characteristic_opinion_words(tokenized_pos[i], tokenized_pos_neg[i], max_distance = max_distance, use_distance = use_distance)

 
            for k, v in review_characteristics_opinion_words.items():
                review_sentiment_score[k] = consolidate_score(v)

 
            reviews_sentiment_scores.append(review_sentiment_score)

 
        return reviews_sentiment_scores

 
def get_NN_count(tokenized_pos):
    NN_count = []

 
    for review in tokenized_pos:
        review_NN_count = 0
        for token in review: 
            if token[1] in ["NN", "NNS", "NNP"] or token[0] in characteristics:
                review_NN_count += 1
        NN_count.append(review_NN_count)

 
    return NN_count
​
 
tokenized_reviews = get_tokens(df_filtered, stem = False, negation = False)
tokenized_pos = get_pos(tokenized_reviews)
tokenized_reviews_neg = get_tokens(df_filtered, stem = False, negation = True)
tokenized_pos_neg = get_pos(tokenized_reviews_neg)
NN_count = get_NN_count(tokenized_pos)
df_filtered['new_id'] = range(0, len(df_filtered))
progress:  59.171597633136095 %
progress:  59.171597633136095 %

The following review as an example gives insight of the application capabilities and limitations:


lookup_product_id = 7
for val in df_filtered[df_filtered.new_id == lookup_product_id]["Review"]: print(val)
display(tokenized_pos[lookup_product_id])
review_characteristics_opinion_words = extract_characteristic_opinion_words(tokenized_pos[lookup_product_id], tokenized_pos_neg[lookup_product_id], max_distance = 5, use_distance = True)               
display(review_characteristics_opinion_words)
This phone in an excellent phone at a great price. I was impressed with the features of this phone and would recommend this to anyone.
[('phone', 'NN'),
 ('excellent', 'JJ'),
 ('phone', 'NN'),
 ('great', 'JJ'),
 ('price', 'NN'),
 ('impressed', 'VBD'),
 ('features', 'NNS'),
 ('phone', 'NN'),
 ('would', 'MD'),
 ('recommend', 'VB'),
 ('anyone', 'NN')]
defaultdict(list,
            {'price': [defaultdict(int, {'great': (1, 1)}),
              defaultdict(int, {'excellent': (1, 3)})]})

On the sentiment dictionary we store the characteristic as well as its opinion words with the sentiment score {-1,1} and the distance from the characteristic. Because “impressed” was tagged as a Verb it was not included as an opinion word. The application deal with this kind of cases using a gazetteer (not implemented for “impressed” in this case). Furthermore since “features” does not have any opinion words after it, it was not included in the sentiment dictionary (and even if “impressed” was considered as an opinion word it would be given to “price” which comes first).


review_sentiment_scores = compute_sentiment_scores(tokenized_pos, tokenized_pos_neg, max_distance = 5, use_distance = True)
review_sentiment_scores[:6]
df_filtered["Sentiments"] = list(review_sentiment_scores)
df_filtered["NN_count"] = list(NN_count)
df_filtered[:3]

Out[68]:
  Product Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name cluster new_id Sentiments NN_count
53 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0 0 {} 1
69 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 5 1 {'WiFi': 1.0} 14
71 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.0 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0 2 {} 26

3. Performance

To determine how effective the application is performing we separated the measurements in two steps.

(1) Measure effectiveness of the mobile phones characteristics extraction and

(2) over the corrected characteristics extracted, measure how effective the sentiments were recorded.

For both steps we created a manually annotated test set with ~150 reviews chosen at random. The format of the test set is as follows:

Where the third column corresponds to manually inputted results.

For step (1) we compared the characteristics extracted by the application for the reviews annotated in the test set. The measurements computed for each review were:

● True_Positives: Correctly extracted characteristics

● True_Negatives: All potential characteristics (NN/NNPs) that were not considered and are not in test set

● False_Positives: Incorrectly extracted characteristics (i.e. not in the test set)

● False_Negatives: Missed characteristics that were considered in the test set

Based on those metrics aggregated on all reviews we calculated Specificity (0.773), Recall (0.070), F1_score (0.036) and Accuracy (0.720).

Our main focus is to have a high Recall, that is, to correctly extract characteristics which represent the main output of the business objective.

Currently it’s extremely low failing to produce relevant insights. Since in contrast Specificity is relatively high it further proves that the model is missing characteristics.

For step (2) using only the characteristics correctly extracted (Recall results) we compared their sentiment scored against those from the test set. The measurements computed for each review were:

● True_Positives: Characteristic correctly classified with positive score

● True_Negatives: Characteristic correctly classified with negative score

● False_Positive: Characteristic incorrectly classified with positive score

● False_Negatives: Characteristic incorrectly classified with negative score

Based on those metrics aggregated on all reviews we calculated Specificity (0.8), Recall (0.666), F1_score (0.666) and Accuracy (0.75). However, results are not statistically significant since the test set on this part was extremely low with only 7 reviews considered that had the correct characteristic extraction. Nonetheless it gives insights that assigning correct sentiment scores is performing better than characteristic extraction with higher Specificity and Recall.

Load and correct Test Data

 

 
# The initial format of he annotated test_set is difficult to read
# as a dataframe, transformation to .csv format is computed first 
# with regular expressions.
test = open('data/annotated_test_set.txt','r', encoding='utf8')
test_file = test.read()
test.close()
test_file[:200]
test_file = re.sub(r"{[^{}]+}", lambda x: x.group(0).replace(",", ";"), test_file)
test_file = test_file.replace(';', "%")
test_file = test_file.replace(',', ";")
test_file = test_file.replace('%', ",")
test_file = test_file.replace('{', "{'")
test_file = test_file.replace(',', ",'")
test_file = test_file.replace(':', "':")
test_file = test_file.replace("},'", "}")
# Once fixed, save and load:
text_file = open("data/annotated_test_set_corrected.csv", "w")
for row in test_file.split(",\n"):
    text_file.write(row)
    text_file.write("\n")
text_file.close()
test = open('data/annotated_test_set_corrected.csv','r', encoding='utf8')
test_file = test.read()
test.close()
test = pd.read_csv('data/annotated_test_set_corrected.csv', delimiter = ";")
test.columns = ['review_id', 'Product', 'Sentiments_test']

 
test[:3]

Out[70]:
  review_id Product Sentiments_test
0 1540 BlackBerry Curve {'Trackball':-1,'Battery':-1,'Micro-SD':-1}
1 1554 Acer Liquid E700 TRIO {'Camera':-1,'Hardware':-1,'Buttons':-1}
2 1697 Alcatel OneTouch {'Hardware':-1,'Charging Port':-1}
 

 
df_merge = pd.merge(df_filtered, test, left_on='id_col', right_on='review_id', how = "left")
df_merge[df_merge.Sentiments_test.isnull()==False]

Out[71]:
  Product_x Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name cluster new_id Sentiments NN_count review_id Product_y Sentiments_test
0 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 muy buen producto 0.0 53 0 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0 0 {} 1 53.0 Asha 302 {'sound': 1,' smart phone features': 1,' soft...
1 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 Nokia Asha 302 Unlocked GSM Phone with 3.2MP C... 13.0 69 1 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 5 1 {'WiFi': 1.0} 14 69.0 Asha 302 {'build': 1,' keyboard': 1,'sound': 1,' Xpres...
2 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 1 Hola, compramos dos teléfonos y vienieron tota... 2.0 71 2 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0 2 {} 26 71.0 Asha 302 {'build': 1,' reception': 1,' audio': 1,' key...
3 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 GRACIAS ME LLEGO EL PROCTO QUE COMPRE Y LLEVO ... 0.0 73 3 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 0 3 {} 8 73.0 Asha 302 {'apps': 1}
4 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 4 The keys are a little hard to hit, and I didn'... 0.0 75 4 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 1 4 {'didnt': -1.0, 'keyboard': 1.0} 5 75.0 Asha 302 {'SMS': 1,' rings': 1,' body': 1,' freezes': -1}
5 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 I bought this phone as a Christmas present for... 3.0 78 5 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 13 5 {'amazon': -0.14285714285714285, 'features': 0... 57 78.0 Asha 302 {'ring tones': 1}
6 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 4 The Phone is pretty good. I am using it with a... 2.0 79 6 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 13 6 {} 12 79.0 Asha 302 {'wi-fi': 1,' calendar': 1,' alarm clock': 1,...
7 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 This phone in an excellent phone at a great pr... 1.0 82 7 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 13 7 {'price': 1.0} 6 82.0 Asha 302 {'price': 1}
8 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 4 This is a good phone although it seems to have... 1.0 84 8 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 1 8 {} 7 84.0 Asha 302 {'time': -1,' support': -1,' booklet': -1}
9 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 I've been a long time user of the iPhone. I fi... 2.0 85 9 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 13 9 {'battery': 1.0, 'email': 1.0, 'etc': 1.0, 'ke... 45 85.0 Asha 302 {'screen': -1,' calling': 1,' messaging': 1,'...
10 "Nokia Asha 302 Unlocked GSM Phone with 3.2MP ... Nokia 299.00 5 This phone, like all of Nokia's feature phones... 12.0 86 10 18 "Nokia Asha 302 GSM with 3.2MP Video, QWERTY W... 13 10 {'features': 1.0, 'value': -1.0, 'keyboard': -... 81 86.0 Asha 302 {'texting interface': 1,' battery': 1,' email...
11 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 5 Very nice.arrived on time. I love it. 0.0 734 14 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 12 11 {} 1 734.0 Lenovo A850 {'screen':1,' audio': 1,' apps': -1,' speed':...
12 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 3 I sent the phone to Colombia, and They had to ... 0.0 755 15 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 13 12 {} 9 755.0 Lenovo A850 {'language setting': -1,' battery': -1}
13 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 2 Didn't have the color that originally wanted w... 0.0 773 16 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 3 13 {} 14 773.0 Lenovo A850 {'apps': 1,' price': 1}
14 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 3 sometimes the screen and home button are unres... 0.0 774 17 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 6 14 {'button': -0.6666666666666667} 5 774.0 Lenovo A850 {'charger': -1}
15 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 5 Nice phone. Android gsm with 2sims great. 0.0 776 18 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 13 15 {'android': 1.0} 3 776.0 Lenovo A850 {'screen': -1,' button': -1,' brand': 1}
16 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 5 I like this smartphone, good quality very very... 0.0 828 19 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 13 16 {'quality': 1.0, 'color': 1.0, 'amazon': 1.0} 10 828.0 Lenovo A850 {'price': 1,' size': 1}
17 5.5-Inch Unlocked Lenovo A850 3G Smartphone-(9... NaN 161.06 5 I have not reached the tlf 1.0 943 20 9 5.5-Inch Lenovo A850 3G Smartphone-(960x540) Q... 0 17 {} 1 943.0 Lenovo A850 {'speed': 1,' size': 1,' screen': 1,' camera'...
19 8330 BlackBerry Curve (US Cellular) Titanium P... NaN 29.95 1 I recevied the phone with broken trackball, mi... 4.0 1540 26 322 8330 BlackBerry Curve Cellular) Titanium 13 19 {'trackball': -1.0, 'microsd': -1.0} 16 1540.0 BlackBerry Curve {'Trackball':-1,'Battery':-1,'Micro-SD':-1}
20 Acer Liquid Jade Z Andoid KitKat Unlocked Quad... Acer 129.99 2 I had high hopes for this Acer phone based the... 0.0 1554 27 179 Acer Liquid Jade Z Andoid KitKat Quad-Core 5" ... 13 20 {'plastic': 0.6666666666666667, 'battery': 0.1... 34 1554.0 Acer Liquid E700 TRIO {'Camera':-1,'Hardware':-1,'Buttons':-1}
21 ALCATEL OneTouch Idol 3 Global Unlocked 4G LTE... Alcatel 292.98 5 It's good 0.0 1697 28 112 ALCATEL OneTouch Idol 3 Global 4G LTE Smartpho... 1 21 {} 0 1697.0 Alcatel OneTouch {'Hardware':-1,'Charging Port':-1}
22 ALCATEL OneTouch Idol 3 Global Unlocked 4G LTE... Alcatel 129.00 1 I am never one to write negative reviews but i... 2.0 1930 29 444 ALCATEL OneTouch Idol 3 Global 4G LTE Smartpho... 13 22 {'audio': 1.0} 18 1930.0 Alcatel OneTouch {'Screen':1,'Size':-1}
23 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 1 The phone that I got doesnt work! 0.0 3177 32 27 Apple Iphone 5c A1532 16 GB Cell 5 23 {} 2 3177.0 iPhone 5c {'size': 1,' charger': 1,' apps': 1,' headpho...
24 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 5 Received a great looking and working used phon... 0.0 3270 33 27 Apple Iphone 5c A1532 16 GB Cell 13 24 {} 3 3270.0 iPhone 5c {'Wifi':-1}
25 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 5 Received a great looking and working used phon... 0.0 3270 33 27 Apple Iphone 5c A1532 16 GB Cell 13 24 {} 3 3270.0 iPhone 5c {'wi-fi': -1}
26 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 5 good phone unlocked 0.0 3274 34 27 Apple Iphone 5c A1532 16 GB Cell 1 25 {} 1 3274.0 iPhone 5c {'screen':1,'speed':1,'battery':-1}
27 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 2 The phone came with a bad speaker I could retu... 0.0 3308 35 27 Apple Iphone 5c A1532 16 GB Cell 13 26 {'speakers': -1.0} 8 3308.0 iPhone 5c {'charging':1,'battery':1}
28 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 1 I did not receive a Verizon wireless,as stated... 31.0 3310 36 27 Apple Iphone 5c A1532 16 GB Cell 8 27 {} 3 3310.0 iPhone 5c {'speaker':-1}
29 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 5 The phone was like new, and works perfect, tha... 0.0 3323 37 27 Apple Iphone 5c A1532 16 GB Cell 10 28 {} 3 3323.0 iPhone 5c {'charger':-1}
30 Apple - Iphone 5c A1532 Verizon 16 GB Cell Pho... Apple 33.00 5 quick delivery, product received as described,... 0.0 3329 38 27 Apple Iphone 5c A1532 16 GB Cell 7 29 {} 3 3329.0 iPhone 5c {'battery':-1}
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
140 HTC Rhyme 3G Android Smartphone Plum Verizon HTC 64.99 4 This phone is what a smart stands for; the nav... 0.0 198109 594 8 HTC Rhyme 3G Android Smartphone Plum 13 139 {'system': 1.0, 'etc': 1.0} 18 198109.0 HTC Rhyme {'charger':-1,'battery':-1}
141 HTC Rhyme 3G Android Smartphone Plum Verizon HTC 64.99 5 It was a delightful surprise to find that this... 0.0 198111 595 8 HTC Rhyme 3G Android Smartphone Plum 5 140 {} 4 198111.0 HTC Rhyme {'navigation system':1,'voice search':1,'spea...
142 HTC Rhyme 3G Android Smartphone Plum Verizon HTC 64.99 5 just started having a few issues but it has wo... 0.0 198115 596 8 HTC Rhyme 3G Android Smartphone Plum 0 141 {} 1 198115.0 HTC Rhyme {'size':1,'weight':1,'keyboard':-1,'SD card':...
143 Huawei Ascend P7 16G 5" Android 4.4 Quad Core ... Huawei 2066.00 5 All very good, excellent product. 0.0 199958 604 304 Huawei Ascend P7 16G 5" Android 4.4 Quad Core ... 4 142 {} 1 199958.0 Huawei Ascend P7 {'software':-1}
144 HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... Huawei 182.99 5 t camera and design*update* still love the pho... 1.0 199986 605 61 HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone 13 143 {} 10 199986.0 Huawei Ascend P7 {'image quality':-1,'coverage':-1}
145 HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... Huawei 182.99 4 Network weak 0.0 199992 606 61 HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone 3 144 {'network': -1.0} 1 199992.0 Huawei Ascend P7 {'wifi':-1}
146 HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... Huawei 182.99 4 Great phone good price. It needs to come with ... 6.0 200009 607 61 HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone 1 145 {'price': 1.0} 6 200009.0 Huawei Ascend P7 {'price':1,'hardware':1}
147 HUAWEI Ascend P7 P7-L10 16GB Unlocked GSM 4G L... Huawei 182.99 5 excellent service and product as described 0.0 200041 608 61 HUAWEI Ascend P7 P7-L10 GSM 4G LTE Smartphone 7 146 {'service': 1.0} 2 200041.0 Huawei Ascend P7 {'specs':1,'price':1,'software':1,'screen':1,...
148 Huawei GX8 Unlocked Smartphone (US Version: RI... Huawei 285.00 5 Best phone I have had. Fingerprint sensor is e... 1.0 200629 610 64 Huawei GX8 Smartphone Version: RIO-L03) Horizon 13 147 {'device': 1.0, 'battery': -1.0} 9 200629.0 Huawei GX8 {'battery': 1}
149 Huawei GX8 Unlocked Smartphone (US Version: RI... Huawei 285.00 5 It's less than half the price of Galaxy 6, but... 3.0 200641 611 64 Huawei GX8 Smartphone Version: RIO-L03) Horizon 11 148 {'feel': 1.0} 6 200641.0 Huawei GX8 {'price': 1,' security options': 1,' battery'...
150 Huawei GX8 Unlocked Smartphone (US Version: RI... Huawei 285.00 4 Phone looks and feel nice. It is about the siz... 13.0 200642 612 64 Huawei GX8 Smartphone Version: RIO-L03) Horizon 13 149 {'feel': 1.0, 'size': 1.0, 'didnt': -1.0} 11 200642.0 Huawei GX8 {'price': 1,' design': 1,' screen': 1,' finge...
151 Huawei GX8 Unlocked Smartphone (US Version: RI... Huawei 285.00 1 Just bought a gx8 from Amazon. After one day o... 1.0 200657 613 64 Huawei GX8 Smartphone Version: RIO-L03) Horizon 5 150 {} 11 200657.0 Huawei GX8 {'multi-tasking': -1,' fingerprint reader': 1...
152 Huawei GX8 Unlocked Smartphone (US Version: RI... Huawei 285.00 5 Just switched from iphone 6s plus to this prod... 4.0 200658 614 64 Huawei GX8 Smartphone Version: RIO-L03) Horizon 3 151 {} 17 200658.0 Huawei GX8 {'battery': 1,' camera': 1,' screen': 1,' pri...
153 Huawei Mate 2 - Factory Unlocked (Black) Huawei 229.99 2 I bought the phone July 2014,It didn't work we... 2.0 200946 615 102 Huawei Mate 2 Factory 5 152 {'voice': -1.0, 'function': -1.0} 30 200946.0 Huawei Mate 2 {'battery':1,'bluetootch':1,'size':1}
154 Huawei Mate 2 - Factory Unlocked (Black) Huawei 229.99 1 This phone ages quickly. My previous phone was... 2.0 200955 616 102 Huawei Mate 2 Factory 13 153 {'look': -1.0, 'settings': -1.0, 'power': -1.0} 48 200955.0 Huawei Mate 2 {'battery':1,'screen':1}
155 LG Optimus S Android Phone, Gray (Sprint) LG 69.98 4 I was skeptical of buying this item because of... 0.0 236742 697 24 LG Optimus S Android (Sprint) 0 154 {'design': -0.3333333333333333} 37 236742.0 Optimus S {'look': 1,' case': 1,' screen protector': 1}
156 LG Optimus S Android Phone, Gray (Sprint) LG 69.98 4 Great rubberized case! Instead of being flimsy... 0.0 236743 698 24 LG Optimus S Android (Sprint) 11 155 {'feel': 1.0, 'look': 0.3333333333333333} 21 236743.0 Optimus S {'case': 1,' design': -1}
157 LG Optimus S Android Phone, Gray (Sprint) LG 69.98 5 Thank you 0.0 236745 699 24 LG Optimus S Android (Sprint) 10 156 {} 1 236745.0 Optimus S {'case': 1,' screen protector': -1}
158 LG Optimus S Android Phone, Gray (Sprint) LG 69.98 5 I purshes this ad on for my cell phone sprit s... 0.0 236748 700 24 LG Optimus S Android (Sprint) 10 157 {'cell': 1.0} 15 236748.0 Optimus S {'design': 1,' case': -1}
159 LG Optimus S Android Phone, Gray (Sprint) LG 69.98 1 They sent me the wrong case. I was so disappoi... 0.0 236750 701 24 LG Optimus S Android (Sprint) 0 158 {} 9 236750.0 Optimus S {'build': -1}
160 LG Optimus S Android Phone, Gray (Sprint) LG 69.98 5 This case is for an LG Optimus S but fits the ... 0.0 236751 702 24 LG Optimus S Android (Sprint) 0 159 {'design': -1.0} 14 236751.0 Optimus S {'case': -1}
161 LG Xenon GR500 Unlocked Phone with QWERTY Keyb... LG 129.99 5 This phone is easy to maneuver and user friend... 2.0 238741 707 49 LG Xenon GR500 with QWERTY 2MP and Touch Screen 13 160 {'speakers': 1.0, 'look': 1.0} 15 238741.0 LG Xenon GR500 {'keyboard': 1,' color': 1}
162 LG Xenon GR500 Unlocked Phone with QWERTY Keyb... LG 129.99 4 I bought this phone for my son, who has had lo... 0.0 238859 708 49 LG Xenon GR500 with QWERTY 2MP and Touch Screen 13 161 {'service': -1.0, 'didnt': -1.0, 'features': 1.0} 31 238859.0 LG Xenon GR500 {'screen': -1,' ease of use': 1,' battery': -1}
163 LG Xenon GR500 Unlocked Phone with QWERTY Keyb... LG 129.99 5 Great phone for the money. Easy to operate and... 3.0 238891 709 49 LG Xenon GR500 with QWERTY 2MP and Touch Screen 13 162 {} 7 238891.0 LG Xenon GR500 {'ease of use': 1,' keyboard': 1}
164 Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... Microsoft 300.51 5 The most important for me is that it let me wo... 1.0 240859 715 90 Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... 5 163 {'windows': 1.0} 7 240859.0 Microsoft Lumia 950 {'price': 1,' camera': 1,' battery': 1,' weig...
165 Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... Microsoft 300.51 3 Not too bad. 1.0 240862 716 90 Microsoft Lumia 950 RM-1104 5.2" 20mp 3gb Ram ... 0 164 {} 0 240862.0 Microsoft Lumia 950 {'size': 1,' software': 1,' apps': 1}
166 Microsoft Lumia 950 XL RM-1085 32GB Black, Sin... Microsoft 328.41 5 As a Tmobile customer, I was saddened when I l... 4.0 240956 718 45 Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... 13 165 {'windows': 1.0, 'cover': 0.3333333333333333} 66 240956.0 Microsoft Lumia 950 {'speed': 1,' case': 1,' screen': 1}
167 Microsoft Lumia 950 XL RM-1085 32GB Black, Sin... Microsoft 328.41 5 I can't even tell you how much I love this pho... 1.0 241013 720 45 Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... 13 166 {'online': -1.0, 'device': -0.3333333333333333... 38 241013.0 Microsoft Lumia 950 {'ease of use': 1}
168 Microsoft Lumia 950 XL RM-1085 32GB White, Sin... NaN 333.41 4 So,the phone arrived with no phone case and no... 2.0 241146 722 45 Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... 13 167 {} 8 241146.0 Microsoft Lumia 950 {'internet': -1}
169 Microsoft Lumia 950 XL RM-1085 32GB White, Sin... NaN 333.41 5 This is the single SIM card version, not the d... 1.0 241214 723 45 Microsoft Lumia 950 XL RM-1085 Single Sim, 5.7... 5 168 {'features': 1.0} 52 241214.0 Microsoft Lumia 950 {'sim card': 1,' speed': 1,' camera': 1,' SMS...

129 rows × 17 columns
 

lookup = 1540
for val in df_merge[df_merge.id_col == lookup].Review:
    print(val)

 
df_merge[df_merge.id_col == lookup]
I recevied the phone with broken trackball, missing micro-sd and missing battery.The seller claimed that it is 100% working. i cannot see how such a phone can be workingwithout the internal sd and battery. It claimed that it is OEM and brand new.My obervations indicated this was a poorly attempted refurbished phone. They must berunnin out of second handed parts.

Out[72]:
  Product_x Brand Price Rating Review Votes id_col id_new_col cluster_name Standard_Product_Name cluster new_id Sentiments NN_count review_id Product_y Sentiments_test
19 8330 BlackBerry Curve (US Cellular) Titanium P... NaN 29.95 1 I recevied the phone with broken trackball, mi... 4.0 1540 26 322 8330 BlackBerry Curve Cellular) Titanium 13 19 {'trackball': -1.0, 'microsd': -1.0} 16 1540.0 BlackBerry Curve {'Trackball':-1,'Battery':-1,'Micro-SD':-1}

Load Functions (for characteristics extraction performance)


 
def characteristics_extraction_performance(NN_count, training, test):
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    temp_test = []
    test = eval(test)

 
    for test_characteristic in test.keys():
        test_characteristic = str(test_characteristic).lower()
        test_characteristic = re.sub(r'[^A-Za-z /.]','',test_characteristic)
        temp_test.append(test_characteristic)

 
        if test_characteristic in training.keys():
            TP += 1
        else: 
            FN += 1

 
    TN = NN_count - len(training.keys()) - FN

 
    for train_characteristic in training.keys():
        if train_characteristic not in temp_test:
            FP += 1

 
    return TP, TN, FP, FN

 
def compute_characteristics_extraction_performance(df_merge):
    total_TP = 0
    total_TN = 0
    total_FP = 0
    total_FN = 0

 
    for i in range(len(df_merge)):
        NN_count = df_merge.NN_count[i]
        training = df_merge.Sentiments[i]
        test = df_merge.Sentiments_test[i]
        if pd.isnull(test): continue
        TP, TN, FP, FN = characteristics_extraction_performance(NN_count, training, test)

 
        total_TP += TP
        total_TN += TN
        total_FP += FP
        total_FN += FN

 
    if total_TP + total_FP == 0:
        TPR_RECALL = 0
    else:
        TPR_RECALL =  total_TP / (total_TP + total_FP)

 
    TNR_SPECIFICITY = total_TN / (total_TN + total_FN)
    F1_Score = 2* total_TP / (2*total_TP + total_FP + total_FN)
    Accuracy = (total_TP + total_TN) / (total_TP + total_TN + total_FP + total_FN)
    fpr = total_FP / (total_FN + total_FP)

 
    return TPR_RECALL, TNR_SPECIFICITY, F1_Score, Accuracy, fpr

 

 
 
 
Recall, Specificity, F1_Score, Accuracy, fpr= compute_characteristics_extraction_performance(df_merge)
print("Recall: ", Recall)
print("Specificity: ", Specificity)
print("F1_Score: ", F1_Score)
print("Accuracy: ", Accuracy)
Recall:  0.05128205128205128
Specificity:  0.777699364855
F1_Score:  0.0273972602739726
Accuracy:  0.722294654498

Load Functions (for sentiment analysis performance)  

 
def characteristics_sentiment_performance(training, test):
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    test = eval(test)

 
    for test_characteristic, test_score in test.items():
        test_characteristic = str(test_characteristic).lower()
        test_characteristic = re.sub(r'[^A-Za-z /.]','',test_characteristic)

 
        if test_characteristic in training.keys():
            if test_score == training[test_characteristic]:
                if test_score > 0:
                    TP += 1
                else:
                    TN += 1
            else:
                if test_score > 0:
                    FN += 1
                else:
                    FP += 1            
        else: 
            continue    
    return TP, TN, FP, FN
def compute_characteristics_sentiment_performance(df_merge):
    total_TP = 0
    total_TN = 0
    total_FP = 0
    total_FN = 0
    cases = 0

 
    for i in range(len(df_merge)):
        training = df_merge.Sentiments[i]
        test = df_merge.Sentiments_test[i]
        if pd.isnull(test): continue
        TP, TN, FP, FN = characteristics_sentiment_performance(training, test)
        if TP+ TN+ FP+ FN > 0:
            cases+=1            

 
        total_TP += TP
        total_TN += TN
        total_FP += FP
        total_FN += FN

 
    if total_TP + total_FP == 0:
        TPR_RECALL = 0
    else:
        TPR_RECALL =  total_TP / (total_TP + total_FP)

 
    TNR_SPECIFICITY = total_TN / (total_TN + total_FN)
    F1_Score = 2* total_TP / (2*total_TP + total_FP + total_FN)
    Accuracy = (total_TP + total_TN) / (total_TP + total_TN + total_FP + total_FN)
    fpr = total_FP / (total_FN + total_FP)

 
    return TPR_RECALL, TNR_SPECIFICITY, F1_Score, Accuracy, cases

 

 
Recall, Specificity, F1_Score, Accuracy, cases= compute_characteristics_sentiment_performance(df_merge)
print("Reviews Evaluated: ", cases)
print("Recall: ", Recall)
print("Specificity: ", Specificity)
print("F1_Score: ", F1_Score)
print("Accuracy: ", Accuracy)
Reviews Evaluated:  5
Recall:  0.6666666666666666
Specificity:  0.6666666666666666
F1_Score:  0.6666666666666666
Accuracy:  0.6666666666666666

4. Business Insights

By extracting the main characteristics that customers are reviewing and which rating (i.e sentiment score) they are giving to them the business will be able to understand what positively or negatively affects product reviews and what specifically users choose as highlights or pain points. From the output table with the sentiments scores assigned to each product name characteristics and simple reporting transformation a the following table can be obtained:

Flexible enough allowing to create further reports such as:

Which can then be used by manufacturers (i.e. Apple or Samsung) to improve the quality of their products based on a specific characteristic they are getting negative reviews, and also by sellers who can use this information to diversify their products (for example have one which is strong in screen quality and another in battery) or to stop buying products that have critical issues.

5. Discussion

5.1 Further Improvements

Businesses do not necessarily need to have a sentiment score for reviews, especially for ecommerce sites such as Amazon where a rating is also available. For manufactures in particular even if they have a score they would not know exactly where to prioritize their efforts to improve their products. Instead, by giving them the specifics characteristics where their products are failing or not they get valuable insights to tackle problems as they arise. Hence, the challenge of correctly extracting the products characteristics is of major importance. This application underperforms in the capabilities of extracting the characteristics and seems to perform fairly well in assigning the correct sentiments to them (although some exceptions need to be adjusted through gazetteers by using the domain knowledge of the business and industry). To further improve characteristics extraction an approach using topic modelling could be implemented, where assumptions are made on the probabilistic distribution of topics inside documents. An example of this would be the Latent Dirichlet Allocation that outputs word clusters. By extending the basic model of identifying topics, we can separate sentiment and features from each topic. As mentioned before, opinion word can be incorrectly assigned to characteristics when multiple characteristics are present, a task that could be tackled and improved with the usage of Name Entity Recognition (NER) and Relationship extraction (RE). Because of computational limitations we worked only on a subsample of the ~400,000 reviews. In the future using cloud computing as well as parallelization and improving the algorithm will allow to process an even larger amount of reviews. Finally to have statistically significant results a larger test set should be created with roughly at least 10% of the data (for this project only ~150 reviews were created).

5.2 Conclusion

In this project we analyzed the performance of measuring sentiment analysis on specific characteristics of mobile phones mentioned in customer reviews to provide manufacturers with actionable insights to improve their products and for sellers to improve their offerings. Results shows the worst performance on characteristic extraction where Recall is critically low. This topic is also the main challenge which could be further improved by implementing topic modelling. Sentiment scores on characteristics extraction revealed a good but not great performance suggesting that further improvements could be made using Relationship Extraction. However the test set was too small to have a clear statistical significance on the results.

Last Modified: 2020-03-02 15:48
Views: 333