Prepared by Ceren Demirkol, Okan Güven, Sevgican Varol
In this problem the aim is to illustrate the curse of dimensionality.
In this part, the proportion of points falling inside the hypersphere inscribed in the hypercube is calculated for dimensions 1 to 10, using 1000 random points where the value in each dimension is uniformly distributed between -1 and +1.
# Homework 1 - Question 1 - part a
options(warn=-1) #in order to hide warnings
library(data.table)
# number of samples and dimension(s)
nof_samples=1000
set.seed(1)
#creating required variables
estimated_pi<-numeric(10) # proportion of points inside the hypersphere, per dimension
#calculating the proportion for each dimension
for (D in 1:10){
# data matrix creation
data_points=runif(nof_samples*D,-1,1) # D coordinates per point
data_points=matrix(data_points,ncol=D)
#calculating the Euclidean distance of each point from the origin
sq_sum<-0
for (i in 1:D){
sq_sum=sq_sum+(data_points[,i])^2
}
euclidean_distance=sqrt(sq_sum)
# number of points in the sphere
is_in_sphere=euclidean_distance<=1
nof_sphere=sum(is_in_sphere)
#proportion estimations
estimated_pi[D]=nof_sphere/nof_samples # fraction of points inside the hypersphere
}
If we plot the result, we see that the proportion decreases as the dimension increases. We can comment that, as the number of dimensions grows, the hypersphere occupies a vanishingly small fraction of the hypercube's volume, so almost all points fall outside it, near the corners.
#plotting dimension vs proportions
plot(estimated_pi,main="Proportion vs Dimension",ylab="Proportion",xlab="Dimension")
grid()
For 2 dimensions, the ratio of the hypersphere (circle) area to the hypercube (square) area is used: V_sphere / V_cube = πr² / (2r)² = π / 4.
For 3 dimensions, the ratio of the sphere volume to the cube volume is used: V_sphere / V_cube = (4/3)πr³ / (2r)³ = π / 6.
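As a quick sketch of the 2-D case (variable names here are illustrative, not from the homework code), the proportion of points inside the unit circle times 4 approximates π directly:

```r
# Illustrative sketch (names such as prop_inside/pi_hat are assumptions):
# the fraction of uniform points inside the unit circle, times 4,
# approximates pi because area(circle)/area(square) = pi/4.
set.seed(1)
n <- 100000
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
prop_inside <- mean(x^2 + y^2 <= 1) # proportion inside the unit circle
pi_hat <- prop_inside * 4
```

With 100000 samples the estimate is typically within a few hundredths of π; the standard error of the estimator shrinks like 1/√n.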
# Homework 1 - Question 1 - part b
print(paste("Estimation for 2D is",estimated_pi[2]*4)) # second index used to get values for 2-D
print(paste("Estimation for 3D is",estimated_pi[3]*6)) # third index used to get values for 3-D
print(paste("Actual Pi is",pi))
In this part, 100 new instances are added to the data. Then, for each dimension, the distance from each new instance to the existing 1000 instances is calculated, and the closest point is selected using Euclidean distance.
# Homework 1 - Question 1 - part c
# number of samples and dimension(s)
nof_samples=1000
nof_new_samples=100
set.seed(1)
min_distance<-numeric(10) # minimum distance from the new points to the existing points, per dimension
for (D in 1:10){
data_points=runif(nof_samples*D,-1,1) # D coordinates per point
data_points=matrix(data_points,ncol=D)
#generating the new instances
new_data_points=runif(nof_new_samples*D,-1,1)
new_data_points=matrix(new_data_points,ncol=D)
#minimum Euclidean distance from any new point to the existing points
dist_mat=proxy::dist(new_data_points,data_points)
min_distance[D]=min(dist_mat)
}
When we plot the minimum distances versus dimensions, we can say that there is an increasing trend between the minimum distance and the dimension in general. But there are 2 points where the minimum distance decreased even though the dimension increased.
In order to make more accurate comments, a higher number of samples can be used.
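One way to smooth out this Monte Carlo noise, sketched below with assumed names (`reps`, `avg_min_distance`; base R is used instead of `proxy::dist` so the sketch is self-contained), is to repeat the experiment several times per dimension and average the minimum distances:

```r
# Sketch (names such as reps/avg_min_distance are assumptions): average the
# minimum nearest-neighbour distance over several replications per dimension
# to smooth the Monte Carlo noise in a single-run curve.
set.seed(1)
reps <- 20
avg_min_distance <- numeric(10)
# distance from one query point p to its nearest row of ref
nearest <- function(p, ref) sqrt(min(rowSums(sweep(ref, 2, p)^2)))
for (D in 1:10) {
  mins <- numeric(reps)
  for (r in 1:reps) {
    ref <- matrix(runif(1000 * D, -1, 1), ncol = D) # existing instances
    qry <- matrix(runif(100 * D, -1, 1), ncol = D)  # new instances
    mins[r] <- min(apply(qry, 1, nearest, ref = ref))
  }
  avg_min_distance[D] <- mean(mins)
}
```

Averaging over replications makes the upward trend with dimension much cleaner than any single run.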
plot(min_distance,main="Minimum Distance vs Dimension",ylab="Minimum Distance",xlab="Dimension")
grid()