We worked with the Boston house price dataset, which ships with the MASS library. The aim is to understand whether some variables explain the variability in the data better than others. We first looked at the summary of the data and at the correlation matrix to get a sense of how the variables relate to one another.
library(MASS)
data <- Boston
summary(data)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor(data)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
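The full matrix is hard to scan by eye, so the short sketch below pulls out the parts most relevant here: the correlation of each variable with medv, and the variable pairs that are strongly correlated with each other. The cutoff of 0.7 and the names `medv_cor`, `cutoff` and `high_pairs` are illustrative assumptions, not part of the original analysis.

```r
library(MASS)

cors <- cor(Boston)

# Correlation of every other variable with medv, ordered by absolute strength
medv_cor <- cors[rownames(cors) != "medv", "medv"]
medv_cor <- medv_cor[order(abs(medv_cor), decreasing = TRUE)]
round(medv_cor, 3)

# Variable pairs whose absolute correlation exceeds an assumed cutoff of 0.7.
# upper.tri() keeps each pair once and drops the diagonal.
cutoff <- 0.7
high <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
high_pairs <- high_pairs[order(-abs(high_pairs$r)), ]
high_pairs
```

Groups of strongly correlated variables, such as rad and tax, are what PCA can compress into a smaller number of components.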
We then performed PCA and examined how much of the variability each component explains. We ran the calculation on the correlation matrix (cor = TRUE), so that every variable is standardized and differences in units do not dominate the result. PCA returns 14 components, one for each variable in the dataset. About 47% of the variability is explained by Component 1 alone, and the proportion of variance decreases from there. We chose a threshold of 90% for this dataset, so we keep the first 8 components, which together explain 92% of the variability. How the original attributes map to these components can be read from the loadings; blank entries are loadings with absolute value below 0.1, which the print method suppresses.
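# PCA on the correlation matrix (cor = TRUE standardizes the variables)
pca <- princomp(data, cor = TRUE)
# Print the variance explained by each component together with the loadings
summary(pca, loadings = TRUE)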
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.5585132 1.2843410 1.16142409 0.94156246 0.92244211
## Proportion of Variance 0.4675707 0.1178237 0.09635042 0.06332428 0.06077853
## Cumulative Proportion 0.4675707 0.5853944 0.68174481 0.74506909 0.80584762
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.81241047 0.73171771 0.63488312 0.52655824 0.50225237
## Proportion of Variance 0.04714363 0.03824363 0.02879118 0.01980454 0.01801839
## Cumulative Proportion 0.85299125 0.89123488 0.92002606 0.93983060 0.95784899
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 0.4612919 0.42777038 0.366073349 0.245614857
## Proportion of Variance 0.0151993 0.01307054 0.009572121 0.004309047
## Cumulative Proportion 0.9730483 0.98611883 0.995690953 1.000000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## crim 0.242 0.395 0.100 0.225 0.777 0.157 0.254
## zn -0.245 0.148 0.395 0.343 0.114 0.336 -0.274 -0.380 0.383 -0.246
## indus 0.332 -0.127 -0.340 0.172 0.627 0.255
## chas -0.411 -0.125 0.700 -0.535 -0.163
## nox 0.325 -0.254 0.195 0.149 -0.198 0.212
## rm -0.203 -0.434 0.353 -0.293 -0.131 -0.438 0.526
## age 0.297 -0.260 -0.201 0.150 0.119 -0.588 -0.246
## dis -0.298 0.359 0.157 0.185 -0.106 -0.104 -0.128 -0.176 0.299
## rad 0.303 0.419 -0.230 0.135 -0.137 -0.463 -0.116
## tax 0.324 0.343 -0.163 0.188 -0.314 -0.179
## ptratio 0.208 0.315 -0.342 -0.616 -0.279 -0.283 0.275 -0.160
## black -0.197 -0.361 -0.202 -0.367 0.786 0.146
## lstat 0.311 0.201 -0.161 0.243 0.178 -0.357 -0.172
## medv -0.267 -0.445 0.163 -0.180 0.152 -0.576
## Comp.11 Comp.12 Comp.13 Comp.14
## crim
## zn 0.128 -0.221 -0.132
## indus -0.274 0.348 -0.235
## chas
## nox 0.437 -0.449 0.525
## rm -0.224 -0.126
## age 0.330 0.486
## dis 0.115 0.494 0.552
## rad -0.635
## tax 0.170 -0.243 0.699
## ptratio -0.232 0.188
## black
## lstat -0.683 -0.182 0.249
## medv -0.242 0.470 0.134
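The choice of 8 components can also be made programmatically instead of being read off the table. The following is a minimal sketch under the 90% threshold stated above; the names `eig`, `var_explained`, `cum_explained`, `threshold`, `n_keep` and `scores` are illustrative. It also checks that, with cor = TRUE, the component variances are the eigenvalues of the correlation matrix.

```r
library(MASS)

pca <- princomp(Boston, cor = TRUE)

# With cor = TRUE the component variances (sdev^2) are the eigenvalues
# of the correlation matrix, so they sum to the number of variables (14).
eig <- eigen(cor(Boston), symmetric = TRUE)$values
all.equal(unname(pca$sdev^2), eig)

# Proportion and cumulative proportion of variance per component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches the threshold
threshold <- 0.90
n_keep <- unname(which(cum_explained >= threshold)[1])
n_keep

# Scores of the observations on the retained components
scores <- pca$scores[, seq_len(n_keep)]
dim(scores)
```

Given the cumulative proportions printed above, `n_keep` should come out as 8, and `scores` can then stand in for the original variables in later steps.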
The variance explained by each component can also be seen in the scree plot.
plot(pca)
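By default, plot(pca) draws a bar chart of the first ten component variances only. As a possible complement, the sketch below draws all 14 components as a line chart and adds a second plot of the cumulative proportion with the 90% threshold marked. The plot titles, axis labels and the name `cum_explained` are assumptions made for illustration.

```r
library(MASS)

pca <- princomp(Boston, cor = TRUE)
cum_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Scree plot of all components as a line chart
screeplot(pca, npcs = length(pca$sdev), type = "lines", main = "Scree plot")

# Cumulative proportion of variance, with the 90% threshold as a dashed line
plot(cum_explained, type = "b", ylim = c(0, 1),
     xlab = "Number of components",
     ylab = "Cumulative proportion of variance")
abline(h = 0.90, lty = 2)
```

The point where the cumulative curve first crosses the dashed line gives the number of components to keep.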