Homework 1-Question 2

We worked with the Boston house price dataset, which is available in the MASS library. The aim is to understand whether certain variables explain the variability better than others. We looked at the general structure and summary of the data, and at the correlation matrix to get a sense of the correlations between variables.

library(MASS)
data <- Boston
summary(data)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
cor(data)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000
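
The full correlation matrix is hard to read at a glance. As a minimal sketch, the snippet below ranks the variables by the absolute strength of their correlation with medv and picks out the single most correlated pair of variables. Treating medv as the variable of main interest is an assumption made for illustration, and the names cor_medv_sorted and strongest_pair are introduced here and are not part of the original analysis.

```r
library(MASS)  # provides the Boston data frame

data <- Boston
cm <- cor(data)

# Correlation of every other variable with medv,
# ordered by absolute strength (strongest first).
cor_medv <- cm[, "medv"]
cor_medv <- cor_medv[names(cor_medv) != "medv"]
cor_medv_sorted <- cor_medv[order(abs(cor_medv), decreasing = TRUE)]
print(round(cor_medv_sorted, 3))

# Most strongly correlated pair of distinct variables.
# Blank out the diagonal and upper triangle so each pair is counted once.
lower <- cm
lower[upper.tri(lower, diag = TRUE)] <- NA
idx <- which(abs(lower) == max(abs(lower), na.rm = TRUE), arr.ind = TRUE)
strongest_pair <- c(rownames(lower)[idx[1, "row"]],
                    colnames(lower)[idx[1, "col"]])
print(strongest_pair)
```

Going by the matrix printed above, lstat and rm should come out on top for medv, and rad and tax should be the most correlated pair.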

We performed PCA and examined how much of the variability each component explains. We ran the PCA on the correlation matrix (cor = TRUE), so that all variables are standardized and contribute on a common scale. PCA returns 14 components, one for each variable in the dataset. Component 1 alone explains about 47% of the variability, and the proportion of variance decreases from there. We chose a threshold of 90% for this dataset, so we use the first 8 components, which together explain 92% of the variability. How the original attributes map to these components can be seen in the loadings.

pca <- princomp(data, cor = TRUE)
summary(pca, loadings = TRUE)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.5585132 1.2843410 1.16142409 0.94156246 0.92244211
## Proportion of Variance 0.4675707 0.1178237 0.09635042 0.06332428 0.06077853
## Cumulative Proportion  0.4675707 0.5853944 0.68174481 0.74506909 0.80584762
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.81241047 0.73171771 0.63488312 0.52655824 0.50225237
## Proportion of Variance 0.04714363 0.03824363 0.02879118 0.01980454 0.01801839
## Cumulative Proportion  0.85299125 0.89123488 0.92002606 0.93983060 0.95784899
##                          Comp.11    Comp.12     Comp.13     Comp.14
## Standard deviation     0.4612919 0.42777038 0.366073349 0.245614857
## Proportion of Variance 0.0151993 0.01307054 0.009572121 0.004309047
## Cumulative Proportion  0.9730483 0.98611883 0.995690953 1.000000000
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## crim     0.242         0.395  0.100         0.225  0.777  0.157  0.254        
## zn      -0.245  0.148  0.395  0.343  0.114  0.336 -0.274 -0.380  0.383 -0.246 
## indus    0.332 -0.127                             -0.340  0.172  0.627  0.255 
## chas           -0.411 -0.125  0.700 -0.535 -0.163                             
## nox      0.325 -0.254                0.195  0.149 -0.198                0.212 
## rm      -0.203 -0.434  0.353 -0.293        -0.131        -0.438         0.526 
## age      0.297 -0.260 -0.201         0.150         0.119 -0.588        -0.246 
## dis     -0.298  0.359  0.157  0.185 -0.106        -0.104 -0.128 -0.176  0.299 
## rad      0.303         0.419        -0.230  0.135 -0.137        -0.463 -0.116 
## tax      0.324         0.343        -0.163  0.188 -0.314        -0.179        
## ptratio  0.208  0.315        -0.342 -0.616 -0.279        -0.283  0.275 -0.160 
## black   -0.197        -0.361 -0.202 -0.367  0.786                       0.146 
## lstat    0.311  0.201 -0.161  0.243  0.178               -0.357 -0.172        
## medv    -0.267 -0.445  0.163 -0.180                       0.152        -0.576 
##         Comp.11 Comp.12 Comp.13 Comp.14
## crim                                   
## zn       0.128  -0.221  -0.132         
## indus   -0.274   0.348          -0.235 
## chas                                   
## nox      0.437  -0.449   0.525         
## rm      -0.224  -0.126                 
## age      0.330   0.486                 
## dis      0.115   0.494   0.552         
## rad                             -0.635 
## tax              0.170  -0.243   0.699 
## ptratio         -0.232   0.188         
## black                                  
## lstat   -0.683  -0.182   0.249         
## medv    -0.242           0.470   0.134
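
The number of components needed to reach the 90% threshold can also be computed rather than read off the summary table. The sketch below assumes the same princomp fit as above. The names eigenvalues, cum_var, n_comp, reduced and top_comp1 are introduced here for illustration only. With cor = TRUE the component variances (squared standard deviations) sum to the number of variables, so for Component 1 the proportion is roughly 2.5585^2 / 14, or about 0.468.

```r
library(MASS)  # provides the Boston data frame

data <- Boston
pca <- princomp(data, cor = TRUE)

# Component variances are the squared standard deviations.
# With cor = TRUE they sum to the number of variables (14 here).
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

# Smallest number of components whose cumulative proportion
# reaches the chosen threshold.
threshold <- 0.90
n_comp <- unname(which(cum_var >= threshold)[1])
print(n_comp)

# Scores of the observations on the retained components,
# i.e. the data expressed in the reduced component space.
reduced <- pca$scores[, seq_len(n_comp)]
print(dim(reduced))

# Variables with the largest absolute loadings on Component 1.
load_mat <- unclass(pca$loadings)
top_comp1 <- names(sort(abs(load_mat[, 1]), decreasing = TRUE))[1:3]
print(top_comp1)
```

Based on the cumulative proportions printed above, n_comp should come out as 8. Changing threshold shows how sensitive the choice is: the first 7 components already explain about 89% of the variability.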

The variance of each component can also be seen in the scree plot.

plot(pca)
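
By default, plot() on a princomp object draws a bar chart of at most the first 10 component variances. As a hedged alternative, the sketch below draws all 14 components as a line plot and adds a second plot of the cumulative proportion of variance with the 90% threshold marked. The axis labels, titles and the name cum_var are choices made here, not part of the original analysis.

```r
library(MASS)  # provides the Boston data frame

pca <- princomp(Boston, cor = TRUE)

# Scree plot of all components, drawn as a line instead of bars.
screeplot(pca, npcs = length(pca$sdev), type = "lines",
          main = "Scree plot")

# Cumulative proportion of variance explained.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b", ylim = c(0, 1),
     xlab = "Number of components",
     ylab = "Cumulative proportion of variance",
     main = "Cumulative variance explained")

# Dashed horizontal line at the 90% threshold used in the text.
abline(h = 0.90, lty = 2)
```

The point where the cumulative curve first crosses the dashed line corresponds to the number of components retained.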