Prepared by Ceren Demirkol, Okan GC
In this problem we analyse user preferences in the Netflix rating data for 100 movies.
#HW1 - Question 3
options(warn=-1) #in order to hide warnings
library(readxl) # to read excel input
library(data.table) # to use data.table functionalities (e.g. setnames)
#reading titles of the movies
data_path='C:/Users/ceren.orhan/Desktop/ETM 58D/ETM58D_Spring20_HW1_q3_movie_titles.csv'
titles=read.csv(data_path, header = FALSE, sep = ",")
new_name=c("Year","Title")
setnames(titles,names(titles),new_name)
head(titles)
#reading rates of the movies
data_path='C:/Users/ceren.orhan/Desktop/ETM 58D/ETM58D_Spring20_HW1_q3_Netflix_data.txt'
rates_with_zero_rating=read.table(data_path, header = FALSE)
#setnames(rates,names(rates),titles[2])
head(rates_with_zero_rating)
From the preview, some movies have ratings of 0, which actually correspond to "no rating". To make the analysis more accurate, the 0 values are replaced by the median rating of the corresponding movie.
#Replacing 0 rates with median of rating for that movie
#m[i] holds the median of the non-zero ratings in column (movie) i
m<-sapply(rates_with_zero_rating, function(a) median(a[a!=0]))
rates<-rates_with_zero_rating
for(i in seq_along(rates_with_zero_rating)){
  #rows where movie i has no rating
  zero=rates_with_zero_rating[[i]]==0
  rates[[i]][zero]=m[[i]]
}
head(rates)
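The replacement step is easiest to check on a small example. The following is a minimal sketch on made-up data; the `toy` data frame and the helper name `replace_zero_with_median` are illustrative assumptions and not part of the homework files.

```r
# Hypothetical helper: replace 0 entries in each column with the median
# of that column's non-zero entries.
replace_zero_with_median <- function(df) {
  for (j in seq_along(df)) {
    col <- df[[j]]
    col[col == 0] <- median(col[col != 0])
    df[[j]] <- col
  }
  df
}

# Made-up ratings: 4 users x 2 movies, 0 means "no rating"
toy <- data.frame(V1 = c(5, 0, 3, 4), V2 = c(0, 2, 2, 0))
toy_filled <- replace_zero_with_median(toy)
print(toy_filled)
```

Here the non-zero ratings of `V1` are 5, 3 and 4, so its single 0 becomes 4; every 0 in `V2` becomes 2.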
Then the distance matrix between users is computed with the Euclidean method, and MDS is performed with 2 dimensions.
dist_rates=dist(rates,method = "euclidean")
mat_rates=as.matrix(dist_rates)
mds_coord=cmdscale(mat_rates,2)
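One way to judge how much of the distance structure two dimensions retain is to ask `cmdscale` for its eigenvalues and goodness-of-fit values. The sketch below uses a small random matrix as a stand-in for the rating data (the `toy` matrix and the variable names are assumptions for illustration only).

```r
set.seed(1)
# Made-up ratings: 20 users x 5 movies, values 1..5
toy <- matrix(sample(1:5, 100, replace = TRUE), nrow = 20)

# cmdscale accepts a dist object directly
d <- dist(toy, method = "euclidean")

# eig = TRUE returns a list with coordinates, eigenvalues and goodness of fit
fit <- cmdscale(d, k = 2, eig = TRUE)
coords <- fit$points   # 20 x 2 matrix of MDS coordinates
gof <- fit$GOF         # two goodness-of-fit values between 0 and 1

print(dim(coords))
print(gof)
```

A goodness-of-fit value close to 1 would suggest that the 2-dimensional plot represents the original distances well; a low value would mean the plot should be read with caution.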
The figure below shows how the users are distributed. From it we can conclude that their rating behaviour is broadly similar; at least no separate clusters are visible. Since we are dealing with 10000 users, the number of outliers is negligible.
plot(mds_coord)
In order to see the relation between movies and ratings, the same analysis is performed on the transpose of the rating matrix.
dist_rates=dist(t(rates),method = "euclidean")
mat_rates=as.matrix(dist_rates)
mds_coord=cmdscale(mat_rates,2)
plot(mds_coord)
There is greater variability along the first component of mds_coord than along the second one. If this data could be combined with other information such as movie year or movie type, a better analysis could be done.
Reviewing the plot, we can separate the movies into 6 clusters.
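The 6 clusters above were read off the plot by eye. One possible way to make such a grouping explicit is k-means on the MDS coordinates; this is a swapped-in technique and not something the analysis above performs. The sketch below runs on made-up 2-D points standing in for `mds_coord` (`toy_coord` and `km` are illustrative names).

```r
set.seed(42)
# Made-up 2-D coordinates standing in for the movie MDS coordinates (100 movies)
toy_coord <- matrix(rnorm(200), ncol = 2)

# k-means with 6 centers; nstart = 20 tries several random starts
km <- kmeans(toy_coord, centers = 6, nstart = 20)

# Number of points assigned to each cluster
print(table(km$cluster))
```

With the real data, `plot(mds_coord, col = km$cluster)` would colour each movie by its assigned cluster, which makes it easier to compare the result with the visual impression.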