Homework 1 - Question 3

Prepared by Ceren Demirkol, Okan Güven, Sevgican Varol

In this problem we analyse user preferences over Netflix rating data for 100 movies.

In [9]:
#HW1 - Question 3
options(warn=-1) #in order to hide warnings
require(readxl) # to read Excel input
require(data.table) # to use data.table functionalities

#reading titles of the movies
data_path='C:/Users/ceren.orhan/Desktop/ETM 58D/ETM58D_Spring20_HW1_q3_movie_titles.csv'
titles=read.csv(data_path, header = FALSE, sep = ",")
new_name=c("Year","Title")
setnames(titles,names(titles),new_name)
head(titles)
Year Title
2000 Miss Congeniality
1996 Independence Day
2000 The Patriot
2004 The Day After Tomorrow
2003 Pirates of the Caribbean: The Curse of the Black Pearl
1990 Pretty Woman
In [2]:
#reading rates of the movies
data_path='C:/Users/ceren.orhan/Desktop/ETM 58D/ETM58D_Spring20_HW1_q3_Netflix_data.txt'
rates_with_zero_rating=read.table(data_path, header = FALSE)
#setnames(rates,names(rates),titles[2])

head(rates_with_zero_rating)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V91 V92 V93 V94 V95 V96 V97 V98 V99 V100
4 4 5 4 5 3 5 5 5 4 ...4 5 3 3 3 0 5 2 5 5
4 4 5 4 5 4 5 5 2 5 ...4 0 3 0 0 3 3 0 1 0
3 4 4 4 5 5 4 4 3 3 ...5 0 2 2 3 4 4 5 0 4
3 4 4 3 5 4 4 4 4 4 ...3 4 3 0 0 4 1 3 0 4
5 5 5 4 5 4 5 5 4 5 ...4 4 4 3 3 5 3 3 3 0
5 4 5 2 4 5 5 4 4 4 ...0 5 2 4 4 0 0 4 3 4

From the preview, there are 0 ratings for some movies, which actually correspond to "no rating". To get more accurate analyses, the 0 values are replaced by the median rating of the corresponding movie.

In [3]:
#Replacing 0 ratings with the median rating for that movie
rates <- rates_with_zero_rating
m <- numeric(ncol(rates_with_zero_rating)) # per-movie medians of non-zero ratings
for (i in 1:ncol(rates_with_zero_rating)) {
    a <- rates_with_zero_rating[, i]
    m[i] <- median(a[a != 0])
}

for (i in 1:ncol(rates_with_zero_rating)) {
    for (ii in 1:nrow(rates_with_zero_rating)) {
        if (rates_with_zero_rating[ii, i] == 0) {
            rates[ii, i] <- m[i]
        }
    }
}
head(rates)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V91 V92 V93 V94 V95 V96 V97 V98 V99 V100
4 4 5 4 5 3 5 5 5 4 ...4 5 3 3 3 4 5 2 5 5
4 4 5 4 5 4 5 5 2 5 ...4 5 3 3 4 3 3 4 1 5
3 4 4 4 5 5 4 4 3 3 ...5 5 2 2 3 4 4 5 3 4
3 4 4 3 5 4 4 4 4 4 ...3 4 3 3 4 4 1 3 3 4
5 5 5 4 5 4 5 5 4 5 ...4 4 4 3 3 5 3 3 3 5
5 4 5 2 4 5 5 4 4 4 ...4 5 2 4 4 4 3 4 3 4
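The double loop above can also be written in a vectorized style. Below is a sketch on a small demo data frame standing in for `rates_with_zero_rating` (the demo values are illustrative, not taken from the real data):

```r
# Vectorized alternative to the double loop above (a sketch):
# replace each zero with the median of that column's non-zero values
replace_zeros <- function(col) {
  col[col == 0] <- median(col[col != 0])
  col
}

# small demo frame standing in for rates_with_zero_rating
demo <- data.frame(V1 = c(4, 0, 2), V2 = c(0, 5, 3))
cleaned <- as.data.frame(lapply(demo, replace_zeros))
cleaned # zeros become 3 (median of 4, 2) and 4 (median of 5, 3)
```

Applied to the full data, `rates <- as.data.frame(lapply(rates_with_zero_rating, replace_zeros))` would reproduce the loop's result.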

Then the distance matrix is created with the Euclidean method and MDS is performed with 2 dimensions.

In [13]:
dist_rates=dist(rates,method = "euclidean")
mat_rates=as.matrix(dist_rates)
mds_coord=cmdscale(mat_rates,2)

The figure below shows how the users are placed. From it we can conclude that their rating behaviour is broadly similar; at least we cannot see separate clusters. And since we are dealing with 10000 users, the number of outliers is negligible.

In [11]:
plot(mds_coord)

In order to see the movie-rating relation, the same analysis is performed on the transpose of the rating matrix.

In [14]:
dist_rates=dist(t(rates),method = "euclidean")
mat_rates=as.matrix(dist_rates)
mds_coord=cmdscale(mat_rates,2)
In [15]:
plot(mds_coord)

There is greater variability in the first component of mds_coord than in the second. If we could combine this data with additional information such as movie year or movie type, a better analysis could be done.
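As one sketch of combining the MDS coordinates with the movie year, the points could be coloured by release decade. This assumes the columns of `rates` (and hence the rows of `mds_coord`) follow the row order of `titles`; the `years` vector below is only the six titles shown in the preview, used as a stand-in:

```r
# stand-in for titles$Year, taken from the preview above
years <- c(2000, 1996, 2000, 2004, 2003, 1990)
decade <- factor(floor(years / 10) * 10) # bucket years into decades
decade

# with the real objects (assumed alignment of titles and mds_coord):
# plot(mds_coord, col = as.integer(factor(floor(titles$Year / 10) * 10)), pch = 19)
```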

When we review the plot, we can separate the movies into 6 clusters.

  • The Black cluster is the biggest and has the largest variance.
  • The Blue and Red clusters are the most dissimilar from each other.
  • The Pink and Green clusters are the most similar to each other.
  • One movie does not fit any cluster. It may be an outlier, or it could be assigned to a cluster with additional analysis.
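One way to reproduce the six clusters described above is k-means on the 2-D MDS coordinates; k = 6 is taken from the discussion, and colours then follow cluster membership instead of being assigned by hand. The sketch below uses random coordinates as a stand-in for `mds_coord`:

```r
set.seed(1)                             # for reproducible cluster labels
coords <- matrix(rnorm(200), ncol = 2)  # stand-in for mds_coord (100 x 2)
fit <- kmeans(coords, centers = 6, nstart = 25)
table(fit$cluster)                      # cluster sizes

# with the real coordinates:
# fit <- kmeans(mds_coord, centers = 6, nstart = 25)
# plot(mds_coord, col = fit$cluster, pch = 19)
```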