Homework 1 - Question 3

Prepared by Ceren Demirkol, Okan GC

In this problem we analyse user preferences using Netflix rating data for 100 movies.

#HW1 - Question 3
options(warn=-1) #in order to hide warnings
library(data.table) # to use data.table functionalities
library(readxl) # to read excel input

#reading titles of the movies
data_path='C:/Users/ceren.orhan/Desktop/ETM 58D/ETM58D_Spring20_HW1_q3_movie_titles.csv'
titles=read.csv(data_path, header = FALSE, sep = ",")
new_name=c("Year","Title")
setnames(titles,names(titles),new_name)
head(titles)

#reading rates of the movies
data_path='C:/Users/ceren.orhan/Desktop/ETM 58D/ETM58D_Spring20_HW1_q3_Netflix_data.txt'
rates_with_zero_rating=read.table(data_path, header = FALSE)
#setnames(rates,names(rates),titles[2])

head(rates_with_zero_rating)

From the preview, some movies have ratings of 0, which actually correspond to "no rating". To make the analysis more accurate, the 0 values are replaced by the median of the nonzero ratings of the corresponding movie.

#Replacing 0 rates with median of rating for that movie
rates<-rates_with_zero_rating
m<-numeric(ncol(rates_with_zero_rating)) #median of the nonzero ratings of each movie
for(i in seq_len(ncol(rates_with_zero_rating))){
    a=rates_with_zero_rating[,i]
    m[i]=median(a[a!=0])
    rates[a==0,i]=m[i] #only the 0 entries are overwritten
}
head(rates)
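
The effect of this replacement can be checked by hand on a small example. The sketch below uses a made-up 4 x 3 rating table (`toy`, `toy_medians` and `toy_filled` are illustrative names, not part of the homework data) and applies the same rule: each 0 is replaced by the median of the nonzero ratings in its column.

```r
# Toy ratings: 4 users (rows) x 3 movies (columns); 0 means "no rating"
toy <- data.frame(V1 = c(4, 0, 5, 3),
                  V2 = c(0, 2, 2, 5),
                  V3 = c(1, 1, 0, 0))

# Median of the nonzero ratings of each movie
toy_medians <- sapply(toy, function(a) median(a[a != 0]))

# Replace the zeros of each column by that column's median
toy_filled <- toy
for (j in seq_along(toy)) {
    toy_filled[toy[[j]] == 0, j] <- toy_medians[j]
}

print(toy_medians)
print(toy_filled)
```

For the toy table the medians are 4, 2 and 1, so the first column becomes 4, 4, 5, 3. Note that a movie with many missing ratings ends up with many identical values, which pulls users closer together in the distance computation that follows.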

Then the distance matrix is computed with the Euclidean metric and MDS is performed in 2 dimensions.

dist_rates=dist(rates,method = "euclidean")
mat_rates=as.matrix(dist_rates)
mds_coord=cmdscale(mat_rates,2)

The figure below shows how the users are distributed. Their rating behaviour appears broadly similar, since no separate clusters are visible. Given that we are dealing with 10000 users, the number of outliers is negligible.

plot(mds_coord)
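
How much of the distance structure a 2-dimensional map retains can be estimated from the eigenvalues that `cmdscale` computes. The following is a minimal sketch on random stand-in data (the matrix `toy_rates` is hypothetical, 50 users by 10 movies), not on the homework data itself.

```r
set.seed(1)
# Hypothetical stand-in for the ratings: 50 users x 10 movies, ratings 1..5
toy_rates <- matrix(sample(1:5, 50 * 10, replace = TRUE), nrow = 50)

toy_dist <- dist(toy_rates, method = "euclidean")

# eig = TRUE additionally returns the eigenvalues and a goodness-of-fit measure
toy_mds <- cmdscale(toy_dist, k = 2, eig = TRUE)

print(dim(toy_mds$points))  # one row per user, two coordinates
print(toy_mds$GOF)          # share of the eigenvalue sum captured by 2 dimensions

plot(toy_mds$points, xlab = "MDS coordinate 1", ylab = "MDS coordinate 2")
```

A `GOF` value close to 1 would mean the 2-D plot represents the original distances well. A low value suggests that an apparent lack of clusters in the plot should be read with caution.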

To see the relation between movies and ratings, the same analysis is performed on the transpose of the rating matrix.

dist_rates=dist(t(rates),method = "euclidean")
mat_rates=as.matrix(dist_rates)
mds_coord=cmdscale(mat_rates,2)

plot(mds_coord)

The first component of mds_coord shows greater variability than the second one. If this data could be combined with other information, such as movie year or movie type, a better analysis could be done.
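
Both points can be illustrated on a small scale. The sketch below computes the variance of each MDS coordinate and labels each movie with its release year. The six titles and years are copied from the head(titles) preview, but the ratings are random stand-ins. It also assumes that column j of the rating matrix corresponds to row j of the title table, which the source does not state explicitly.

```r
set.seed(2)
# Titles and years copied from the head(titles) preview; the ratings are random
toy_titles <- data.frame(
    Year  = c(2000, 1996, 2000, 2004, 2003, 1990),
    Title = c("Miss Congeniality", "Independence Day", "The Patriot",
              "The Day After Tomorrow",
              "Pirates of the Caribbean: The Curse of the Black Pearl",
              "Pretty Woman"))
toy_rates <- matrix(sample(1:5, 200 * 6, replace = TRUE), nrow = 200)

# MDS on the movies (columns), as in the transposed analysis
movie_mds <- cmdscale(dist(t(toy_rates)), k = 2)

# Spread of the movies along each MDS coordinate
coord_var <- apply(movie_mds, 2, var)
print(coord_var)

# Label every movie with its release year (assumes column j <-> title row j)
plot(movie_mds, type = "n", xlab = "MDS coordinate 1", ylab = "MDS coordinate 2")
text(movie_mds, labels = toy_titles$Year, cex = 0.8)
```

Classical MDS orders its coordinates by decreasing eigenvalue, so the first coordinate having the larger variance is expected by construction. The informative quantity is how much larger it is than the second.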

When we review the plot, we can separate the movies into 6 clusters.

The black cluster is the largest and has the largest variance.
The blue and red clusters differ the most from each other.
The pink and green clusters are the most similar.
One movie does not fit any cluster. It may be an outlier, or further analysis may assign it to one of the clusters.
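
The text does not show how the 6 clusters were obtained. One possible way to make such a grouping reproducible is k-means on the 2-D coordinates, sketched below under that assumption. The coordinates `toy_coord` are simulated stand-ins for the movie coordinates, and `centres`, `km` and `d_centre` are illustrative names.

```r
set.seed(3)
# Hypothetical stand-in for the 2-D movie coordinates: 6 loose groups of 15 points
centres <- cbind(c(-4, -2, 0, 2, 4, 0), c(0, 2, -2, 2, 0, 4))
toy_coord <- centres[rep(1:6, each = 15), ] +
    matrix(rnorm(90 * 2, sd = 0.5), ncol = 2)

# k-means with 6 clusters; nstart repeats the random initialisation
km <- kmeans(toy_coord, centers = 6, nstart = 20)

print(table(km$cluster))   # cluster sizes
print(km$withinss)         # within-cluster sum of squares (spread of each cluster)
print(dist(km$centers))    # distances between cluster centres

# Distance of every point to its own cluster centre; large values are outlier candidates
d_centre <- sqrt(rowSums((toy_coord - km$centers[km$cluster, ])^2))
print(which.max(d_centre))

plot(toy_coord, col = km$cluster, pch = 19,
     xlab = "MDS coordinate 1", ylab = "MDS coordinate 2")
points(km$centers, pch = 4, cex = 2)
```

With such output, statements like "largest variance", "most different" and "most similar" can be read off `km$withinss` and the centre distances rather than judged by eye.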

Year	Title
2000	Miss Congeniality
1996	Independence Day
2000	The Patriot
2004	The Day After Tomorrow
2003	Pirates of the Caribbean: The Curse of the Black Pearl
1990	Pretty Woman

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	...	V91	V92	V93	V94	V95	V96	V97	V98	V99	V100
4	4	5	4	5	3	5	5	5	4	...	4	5	3	3	3	0	5	2	5	5
4	4	5	4	5	4	5	5	2	5	...	4	0	3	0	0	3	3	0	1	0
3	4	4	4	5	5	4	4	3	3	...	5	0	2	2	3	4	4	5	0	4
3	4	4	3	5	4	4	4	4	4	...	3	4	3	0	0	4	1	3	0	4
5	5	5	4	5	4	5	5	4	5	...	4	4	4	3	3	5	3	3	3	0
5	4	5	2	4	5	5	4	4	4	...	0	5	2	4	4	0	0	4	3	4

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	...	V91	V92	V93	V94	V95	V96	V97	V98	V99	V100
4	4	5	4	5	3	5	5	5	4	...	4	5	3	3	3	4	5	2	5	5
4	4	5	4	5	4	5	5	2	5	...	4	5	3	3	4	3	3	4	1	5
3	4	4	4	5	5	4	4	3	3	...	5	5	2	2	3	4	4	5	3	4
3	4	4	3	5	4	4	4	4	4	...	3	4	3	3	4	4	1	3	3	4
5	5	5	4	5	4	5	5	4	5	...	4	4	4	3	3	5	3	3	3	5
5	4	5	2	4	5	5	4	4	4	...	4	5	2	4	4	4	3	4	3	4
