1  Introduction to Multivariate Data

Author

Pulong Ma

1.1 Course Outline

  • Introduction to Multivariate Data
  • Numerical Summaries and Visualization of Multivariate Data
  • Multivariate Normal Distribution
  • Comparing Centers of Distributions
    (Hotelling T^2, MANOVA, Repeated Measures)
  • Principal Component Analysis
    (Summarizing data in lower dimensions)
  • Exploratory Factor Analysis
    (What to do when the variables of interest cannot be directly observed.)
  • Multi-Dimensional Scaling
  • Cluster Analysis
  • Discriminant Analysis and Classification
    (Including linear discriminants, logistic regression, classification trees, and random forests)

1.2 Objectives of Multivariate Analysis

  • Understand dependencies among variables: What is the nature of associations among variables?
  • Prediction: If variables are associated, then we might be able to predict the value of some of them given information on the others. (Statistical inference)
  • Hypothesis testing: Are differences in sets of response means for two or more groups large enough to be distinguished from sampling variation? (Statistical inference)
  • Dimensionality reduction: Can we reduce the dimensionality of the problem by considering a small number of (linear) combinations of a large number of measurements without losing important information?
    • Principal Components
    • Factor Analysis
    • Multidimensional Scaling
  • Grouping (Cluster Analysis): Identify groups of “similar” units using a common set of measured traits.
  • Classification: Classify units into previously defined groups using a common set of measured traits.

1.3 Introduction to R and RStudio

1.3.1 Getting Started

To download R, please choose your preferred CRAN mirror. Here I recommend the mirror 0-Cloud available at https://cloud.r-project.org/.

Install R on macOS

For macOS users, below are the steps you need to install R.

  1. Install Xcode (e.g., via the App Store). Skip this step if your Mac already has Xcode installed.
  2. Install XQuartz.
  3. Go to https://cloud.r-project.org/ and click “Download R for macOS.”
  4. Click on the download link for the latest R version for macOS and follow the instruction.

Install R on Windows:

For Windows users, the following instructions guide you to install R.

  1. Go to https://cloud.r-project.org/ and click “Download R for Windows.”
  2. Click “install R for the first time.”
  3. Choose a download mirror (any CRAN mirror will work).
  4. Click on the download link for the latest R version for Windows and follow the instruction

Install RStudio

RStudio has multiple panes in the window, open by default: one for writing and editing code, another for executing it, another for plots produced or help, and another that lists the R data objects available. RStudio can be download for free at https://posit.co/download/rstudio-desktop/.

1.3.2 Introduction to R Programming

Whenever you work with R for data analysis, it is recommended to load all the packages used in the data analysis pipeline before the function sessionInfo(). In addition, if random numbers are generated, it is also recommended to set random seed to ensure reproducibility using set.seed().

R Working Environment
# load packages 
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(lubridate)

sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lubridate_1.9.4 readr_2.1.5     tidyr_1.3.1     dplyr_1.1.4    
[5] ggplot2_3.5.2  

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5        cli_3.6.5          knitr_1.50         rlang_1.1.6       
 [5] xfun_0.52          purrr_1.0.4        generics_0.1.4     jsonlite_2.0.0    
 [9] glue_1.8.0         htmltools_0.5.8.1  hms_1.1.3          scales_1.4.0      
[13] rmarkdown_2.29     grid_4.4.3         evaluate_1.0.3     tibble_3.2.1      
[17] tzdb_0.5.0         fastmap_1.2.0      yaml_2.3.10        lifecycle_1.0.4   
[21] compiler_4.4.3     RColorBrewer_1.1-3 timechange_0.3.0   htmlwidgets_1.6.4 
[25] pkgconfig_2.0.3    rstudioapi_0.17.1  farver_2.1.2       digest_0.6.37     
[29] R6_2.6.1           tidyselect_1.2.1   pillar_1.10.2      magrittr_2.0.3    
[33] withr_3.0.2        tools_4.4.3        gtable_0.3.6      
Code
# Numeric, logical, character types
3.14159             # numeric

[1] 3.14159

Code
T                   # logical, can also be TRUE

[1] TRUE

Code
F                   # logical, can also be FALSE 

[1] FALSE

Code
"Stat 4750/5750"    # character

[1] “Stat 4750/5750”

Code
# Basic arithmetic operators
1 + 2

[1] 3

Code
20 - 3

[1] 17

Code
2 * 6

[1] 12

Code
4 / 3

[1] 1.333333

Code
2 ^ 4

[1] 16

Code
-3.14

[1] -3.14

Code
(1 + 2) * (3 / 4)  # use parenthesis

[1] 2.25

Code
2 == 1

[1] FALSE

Code
# Vector
c()  # empty vector

NULL

Code
c(1, 2, 3, 4)

[1] 1 2 3 4

Code
1:4

[1] 1 2 3 4

Code
# Look up the function help, 3 ways:
# Following 3 ways work for most functions
          # 1. search in help pane
?sum      # 2. use ?
sum(1:4)  # 3. move your cursor to the function, then press F1, the easiest way

[1] 10

Code
# The 3rd way doesn't work for functions having some symbols, like +, ==, %*%
          # 1. search in help pane
?`+`      # 2. use ? but wrap the symbol with ``
Code
# Assignment operator <-

# Numeric class
mynumbers <- 5:12
mynumbers
[1]  5  6  7  8  9 10 11 12
Code
mynumbers + 2
[1]  7  8  9 10 11 12 13 14
Code
mynumbers * 10
[1]  50  60  70  80  90 100 110 120
Code
typeof(mynumbers)      # type of an object
[1] "integer"
Code
# check if is numeric (both integer and double type are numeric)
is.numeric(mynumbers)  
[1] TRUE
Code
is.vector(mynumbers)   # check if is vector
[1] TRUE
Code
length(mynumbers)      # length of an object
[1] 8
Code
# Character class
mytext <- c("hello", "class", "four") 
mytext
[1] "hello" "class" "four" 
Code
typeof(mytext)
[1] "character"
Code
is.numeric(mytext)
[1] FALSE
Code
is.character(mytext)
[1] TRUE
Code
is.character(mynumbers)
[1] FALSE
Code
is.vector(mytext)
[1] TRUE
Code
length(mytext)
[1] 3
Code
# Logical class
mylogic <- c(TRUE, FALSE, TRUE, TRUE)
typeof(mylogic)
[1] "logical"
Code
# Factor class
gender <- c("male", "female", "female", "female", "male") 
gender
[1] "male"   "female" "female" "female" "male"  
Code
is.vector(gender)
[1] TRUE
Code
is.character(gender)
[1] TRUE
Code
is.vector(gender)
[1] TRUE
Code
is.factor(gender)
[1] FALSE
Code
genderf <- factor(gender)
genderf
[1] male   female female female male  
Levels: female male
Code
is.character(genderf)
[1] FALSE
Code
is.factor(genderf)
[1] TRUE
Code
summary(mynumbers)  # numeric: 5-number summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00    6.75    8.50    8.50   10.25   12.00 
Code
summary(mylogic)    # logical: count how many T and F
   Mode   FALSE    TRUE 
logical       1       3 
Code
summary(mytext)     # character:
   Length     Class      Mode 
        3 character character 
Code
summary(genderf)    # factor: count the frequency
female   male 
     3      2 
Code
# List class
mylist <- list(1, "ABC", FALSE)
mylist
[[1]]
[1] 1

[[2]]
[1] "ABC"

[[3]]
[1] FALSE
Code
is.vector(mylist)
[1] TRUE
Code
is.list(mylist)
[1] TRUE
Code
length(mylist)
[1] 3
Code
# All elements should be the same type
mydata <- matrix(c(140, 120, 160, 145, 125, 65, 60, 63, 66, 61), ncol = 2, byrow = FALSE) 
mydata
     [,1] [,2]
[1,]  140   65
[2,]  120   60
[3,]  160   63
[4,]  145   66
[5,]  125   61
Code
dim(mydata)  # dimensions
[1] 5 2
Code
colnames(mydata) <- c("Weight.lbs", "Height.in")
rownames(mydata) <- c("a", "b", "c", "d", "e")
mydata
  Weight.lbs Height.in
a        140        65
b        120        60
c        160        63
d        145        66
e        125        61
Code
summary(mydata)  # summary of each column
   Weight.lbs    Height.in 
 Min.   :120   Min.   :60  
 1st Qu.:125   1st Qu.:61  
 Median :140   Median :63  
 Mean   :138   Mean   :63  
 3rd Qu.:145   3rd Qu.:65  
 Max.   :160   Max.   :66  
Data Frames
# Different columns can have different types
df <- data.frame(Weight.lbs = c(140, 120, 160, 145, 125, 180, 165),
                 Height.in = c(65, 60, 63, 66, 61, 70, 68),
                 Gender = c(rep("Female", 5), rep("Male", 2)), stringsAsFactors = T)
df
  Weight.lbs Height.in Gender
1        140        65 Female
2        120        60 Female
3        160        63 Female
4        145        66 Female
5        125        61 Female
6        180        70   Male
7        165        68   Male
Data Frames
dim(df)
[1] 7 3
Data Frames
head(df)     # the first 6 rows
  Weight.lbs Height.in Gender
1        140        65 Female
2        120        60 Female
3        160        63 Female
4        145        66 Female
5        125        61 Female
6        180        70   Male
Data Frames
summary(df)  # summary of each column
   Weight.lbs      Height.in        Gender 
 Min.   :120.0   Min.   :60.00   Female:5  
 1st Qu.:132.5   1st Qu.:62.00   Male  :2  
 Median :145.0   Median :65.00             
 Mean   :147.9   Mean   :64.71             
 3rd Qu.:162.5   3rd Qu.:67.00             
 Max.   :180.0   Max.   :70.00             
Data Frames
str(df)      # structure of the data frame
'data.frame':   7 obs. of  3 variables:
 $ Weight.lbs: num  140 120 160 145 125 180 165
 $ Height.in : num  65 60 63 66 61 70 68
 $ Gender    : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2

a tibble is a modern implementation of a data frame, designed to be more user-friendly and efficient, especially when working with large datasets.

tibble
# install.packages("tibble")
library(tibble)
# create a tibble from an existing object
as_tibble(df) 
# A tibble: 7 × 3
  Weight.lbs Height.in Gender
       <dbl>     <dbl> <fct> 
1        140        65 Female
2        120        60 Female
3        160        63 Female
4        145        66 Female
5        125        61 Female
6        180        70 Male  
7        165        68 Male  
tibble
# create a new tibble 
tibble(x=1:5, 
       y=1.0,
       z=x^2+y)
# A tibble: 5 × 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26
tibble
# define a row-by-row tibble
tribble(
  ~x, ~y, ~z,
  "a", 2, 1,
  "b", 3, 4
)
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2     1
2 b         3     4
Code
mydata[, 1]                     # extract the first column
  a   b   c   d   e 
140 120 160 145 125 
Code
mydata[1, ]                     # extract the first row
Weight.lbs  Height.in 
       140         65 
Code
mydata[1:2, ]                   # extract the first two rows
  Weight.lbs Height.in
a        140        65
b        120        60
Code
mydata[3, 2]                    # extract the element in the third row and the second column
[1] 63
Code
df[seq(1, 7, 2), ]              # extract every second row
  Weight.lbs Height.in Gender
1        140        65 Female
3        160        63 Female
5        125        61 Female
7        165        68   Male
Code
seq(1, 7, 2)
[1] 1 3 5 7
Code
subset(df, Gender == "Male")    # select based on value
  Weight.lbs Height.in Gender
6        180        70   Male
7        165        68   Male
Code
t(mydata)                       # transpose of matrix
             a   b   c   d   e
Weight.lbs 140 120 160 145 125
Height.in   65  60  63  66  61
Code
mydata[, 1] * mydata[, 2]       # element-wise multiplication
    a     b     c     d     e 
 9100  7200 10080  9570  7625 
Code
mydata %*% t(mydata)            # matrix multiplication
      a     b     c     d     e
a 23825 20700 26495 24590 21465
b 20700 18000 22980 21360 18660
c 26495 22980 29569 27358 23843
d 24590 21360 27358 25381 22151
e 21465 18660 23843 22151 19346
Code
library(ggplot2)
data("USArrests")
dat = USArrests %>% as_tibble()
dat %>% 
  ggplot(aes(x=UrbanPop, y=Murder)) + 
  geom_point()

Code
dat_long = dat %>%
  tidyr::pivot_longer(1:4,names_to = "Variable", values_to = "Value") 
head(dat_long)
# A tibble: 6 × 2
  Variable Value
  <chr>    <dbl>
1 Murder    13.2
2 Assault  236  
3 UrbanPop  58  
4 Rape      21.2
5 Murder    10  
6 Assault  263  
Code
# faceted histogram
ggplot(dat_long, aes(Value)) + 
  geom_histogram(bins=20, fill="grey80", color="white") + 
  facet_wrap(~Variable, scales="free") + 
  labs(title = "Distributions by Variable")

1.4 Organization of Data and Notation

  • n = the number of observations or units
  • p = the number of variables measured on each unit
  • If p = 1, then we are back in the usual univariate setting
  • x_{ik} = the i-th observation of the k-th variable

\text{Observations} \quad \overset{\text{Variables}}{ \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} }

  • Data matrix: X_{n \times p}=\left[ \begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np}\end{array} \right ]

  • This can be written as n rows or as p columns X_{n \times p} = \begin{bmatrix} \mathbf{x}_1^{\prime} \\ \mathbf{x}_2^{\prime} \\ \vdots \\ \mathbf{x}_n^{\prime} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^{\top} \\ \mathbf{x}_2^{\top} \\ \vdots \\ \mathbf{x}_n^{\top} \end{bmatrix} = \left[ \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p \right] where both \prime and \top represent matrix transpose.

R Code: Data Organization
X = as.matrix(USArrests)
dim(X)
[1] 50  4
R Code: Matrix Transpose
t(X[1:5,])
         Alabama Alaska Arizona Arkansas California
Murder      13.2   10.0     8.1      8.8        9.0
Assault    236.0  263.0   294.0    190.0      276.0
UrbanPop    58.0   48.0    80.0     50.0       91.0
Rape        21.2   44.5    31.0     19.5       40.6

1.5 Descriptive Statistics

1.5.1 Sample Mean

  • The sample mean of the kth variable (k = 1,...,p) is \bar{x}_k = \frac{1}{n} \sum_{i = 1}^n x_{ik}

1.5.2 Sample Variance and Sample Standard Deviation

  • The sample variance of the kth variable is s^2_k = \frac{1}{n-1} \sum_{i=1}^n (x_{ik} - \bar{x}_k)^2

  • The sample standard deviation is given by s_k = \sqrt{s^2_k}

  • We often use s_{kk} to denote the sample variance for the k-th variable. Thus, s^2_k = s_{kk}

1.5.3 Sample Covariance and Sample Correlation

  • The sample covariance between variable k and variable j is computed as s_{jk} = \frac{1}{n-1} \sum_{i=1}^n (x_{ij} - \bar{x}_j) (x_{ik} - \bar{x}_k)

  • If variables k and j are independent, the population covariance will be exactly zero, but the sample covariance will vary about zero.

  • The sample correlation between variables k and j is defined as r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}} \sqrt{s_{kk}}}

  • r_{jk} is between -1 and 1.

  • r_{jk} = r_{kj}

  • The sample correlation is equal to the sample covariance if measurements are standardized.

  • The sample correlation (r_{ij}) will vary about the value of the population correlation (\rho_{ij})

# load built-in US crime rates data
data("USArrests") 
dat = USArrests 
head(dat)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
Sample mean
mu = colMeans(dat)
mu 
  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232 
Sample mean
library(dplyr)
dat %>% 
  summarise(across(where(is.numeric), mean))
  Murder Assault UrbanPop   Rape
1  7.788  170.76    65.54 21.232
Sample variance
s2 = apply(dat, 2, var)
s2 
    Murder    Assault   UrbanPop       Rape 
  18.97047 6945.16571  209.51878   87.72916 
Sample variance
# using dplyr
dat %>% 
  summarise(across(where(is.numeric), var))
    Murder  Assault UrbanPop     Rape
1 18.97047 6945.166 209.5188 87.72916
Sample standard deviation
# compute standard deviation from data 
s = apply(dat, 2, sd)  
# or from sample variance 
s = sqrt(s2)
s 
   Murder   Assault  UrbanPop      Rape 
 4.355510 83.337661 14.474763  9.366385 
Sample covariance
n = nrow(dat)
s_jk = 1/(n-1) * sum((dat[,1] - mu[1]) * (dat[,2] - mu[2]))
s_jk 
[1] 291.0624
Sample correlation
r_jk = s_jk / sqrt(s2[1] * s2[2])

1.5.4 Covariance and Correlation Measures

  • Covariance and correlation measure linear association.
    Other non-linear (or curved) relationships may exist among variables even if r_{jk} = 0.

  • A population correlation of zero means no linear association,
    but it does not necessarily imply independence.

Correlation Measures Linear Association

Nonlinear Dependence with (Near) Zero Correlation

1.6 Matrix Organization of Descriptive Statistics

1.6.1 Sample Mean

  • Sample mean: \bar{\mathbf{x}} is the p \times 1 vector of sample means:

\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}

  • \bar{\mathbf{x}} is an estimate of the vector of population means:

\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix}

R Code: Sample Mean
X = as.matrix(USArrests)
xbar = colMeans(X)
xbar 
  Murder  Assault UrbanPop     Rape 
   7.788  170.760   65.540   21.232 
  • Centered Data: X_c = X - \mathbf{1}_n \bar{\mathbf{x}}^\top
R Code: Centered Data
n = nrow(X) 
ones_n = matrix(1, n, ncol=1)
Xc = X - ones_n %*% t(xbar)

1.6.2 Sample Covariance

  • S is the p \times p symmetric matrix of sample variances (on the diagonal) and sample covariances (the off-diagonal elements):

S = \begin{bmatrix} s_{11} & s_{12} & s_{13} & \cdots & s_{1p} \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & s_{p3} & \cdots & s_{pp} \end{bmatrix} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\top} = \frac{1}{n-1} X_c^\top X_c

  • S is an estimate of the population covariance matrix:

\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \sigma_{p3} & \cdots & \sigma_{pp} \end{bmatrix}

R Code: Sample Covariance
S = cov(X) 
# or using matrix algebra
S_mat = t(Xc)%*%Xc/(n-1)
signif(cbind(S, S_mat),4)
          Murder Assault UrbanPop   Rape  Murder Assault UrbanPop   Rape
Murder    18.970   291.1    4.386  22.99  18.970   291.1    4.386  22.99
Assault  291.100  6945.0  312.300 519.30 291.100  6945.0  312.300 519.30
UrbanPop   4.386   312.3  209.500  55.77   4.386   312.3  209.500  55.77
Rape      22.990   519.3   55.770  87.73  22.990   519.3   55.770  87.73

1.6.3 Sample Correlation

  • The p \times p matrix of sample correlations is also symmetric:

R = \begin{bmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1p} \\ r_{21} & 1 & r_{23} & \cdots & r_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & r_{p3} & \cdots & 1 \end{bmatrix} = D^{-1/2} \, S \, D^{-1/2}

  • D^{-1/2} is a diagonal matrix with (j,j) entry 1/\sqrt{s_{jj}} = 1/s_j, i.e.,

D^{-1/2} = \begin{bmatrix} \frac{1}{\sqrt{s_{11}}} & 0 & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{s_{22}}} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & \frac{1}{\sqrt{s_{pp}}} \end{bmatrix}

  • R is an estimate of the population correlation matrix P = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \rho_{23} & \cdots & \rho_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \rho_{p3} & \cdots & 1 \end{bmatrix}
R Code: Sample Correlation
R = cor(X)
R
             Murder   Assault   UrbanPop      Rape
Murder   1.00000000 0.8018733 0.06957262 0.5635788
Assault  0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape     0.56357883 0.6652412 0.41134124 1.0000000
R Code: Pairwise Scatterplots and Correlations
GGally::ggpairs(dat)

1.7 Standardization

1.7.1 Standardized Data (or z-Scores)

  • Suppose x_{ij} is the measurement on the j-th outcome variable for the i-th subject in the data set.

  • The standardized value is
    z_{ij} = \frac{x_{ij} - \bar{x}_{j}}{s_j}

  • The entire set of p standardized responses for the i-th subject can be computed as
    \mathbf{z}_i = \begin{bmatrix} z_{i1} \\ z_{i2} \\ \vdots \\ z_{ip} \end{bmatrix} = \begin{bmatrix} \dfrac{x_{i1} - \bar{x}_{1}}{s_1} \\ \dfrac{x_{i2} - \bar{x}_{2}}{s_2} \\ \vdots \\ \dfrac{x_{ip} - \bar{x}_{p}}{s_p} \end{bmatrix} = D^{-1/2} (\mathbf{x}_i - \bar{\mathbf{x}})

R Code: Standardized Data
s_j = sqrt(diag(S))

Dsqrt = diag(1/s_j)

# using matrix algebra
Z =  (X -  matrix(1, nrow(X), 1) %*% matrix(xbar, 1)) %*% Dsqrt

# using scale function 
X_scaled = as.matrix(scale(X, center=TRUE, scale=TRUE))


# using mutate function
dat_scaled <- dat %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))
R Code: Plotting the Data
p1 = ggplot(dat, aes(Murder, Assault)) + 
  geom_point(alpha=0.8) + 
  labs(title="Unscaled")

p2 = ggplot(dat_scaled, aes(Murder, Assault)) + 
  geom_point(alpha=.8) + 
  labs(title="Standardized (z-scores)")

patchwork::wrap_plots(p1, p2, ncol=1)

1.7.2 Standardized Population Mean

  • The vector of true means for a set of standardized responses is
    a vector of zeros:

E(\mathbf{z}_i) = E\begin{bmatrix} z_{i1} \\ z_{i2} \\ \vdots \\ z_{ip} \end{bmatrix} = E\begin{bmatrix} \dfrac{x_{i1} - \bar{x}_{1}}{s_1} \\ \dfrac{x_{i2} - \bar{x}_{2}}{s_2} \\ \vdots \\ \dfrac{x_{ip} - \bar{x}_{p}}{s_p} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}

  • The vector of estimated means for a set of standardized responses
    is also a vector of zeros.

1.7.3 Standardized Population Covariance

  • The true covariance matrix for a set of standardized responses is the population correlation matrix:

\text{Var}(\mathbf{z}_i) = P = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{bmatrix}

  • An estimate of the true covariance matrix for a set of standardized responses is the sample correlation matrix:

\widehat{\text{Var}(\mathbf{z}_i)} = R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix} = D^{-1/2} S D^{-1/2}

1.8 Exercises

1.8.1 Exercise 1: Data Types

Common types of objects for data analysis are numeric, character, logical, factors, and dates. Character data do not have numerical values, such as names of people or words in a book, Logical data takes values of true or false and can be used to make decisions.

  1. For each of the following commands, either explain why they should be errors, or explain the non-erroneous result.
    x <- c("1","2","3")
    max(x)
    sort(x)
    sum(x)
  1. For the next two commands, either explain their results, or why they should produce errors.
    y <- c("1",3,4)
    y[2] + y[3]
  1. For the next two commands, either explain their results, or why they should produce errors.
    z <- data.frame(z1="1", z2=3, z3=5)
    z[1,2] + z[1,3]
  • x is bound to a vector of characters instead of numerical values. The functions max, sort, sum should take numerical values as input, so the values in x are first converted from characters to numerical values implicitly and then the functions apply to these incorrect numerical values

  • y is a vector of characters, which cannot be used for addition

  • z is a data frame, but the variable z1 is a character while z2 and z3 are numerical. Adding z2 and z3 will produce the correct results.

1.8.2 Exercise 2: Working with Matrix and Data Frames

  1. A matrix is a 2D array, with rows and columns, much like you would have seen in linear algebra. Primarily we think of these as numeric objects. Arrays can have more than two dimensions. In multivariate analysis we are typically thinking of data in the form of a matrix, with samples/cases in the rows and variables represented as columns. Data frames are 2D arrays that could have multiple types of data in different columns. Lists are collections of possibly different length and different types of objects.
  1. Create a matrix and print it out.
View Solution
dat = matrix(c(140, 120, 160, 145, 125, 
               65, 60, 63, 66, 61), 
             ncol = 2, byrow = FALSE) 
View Solution
dat # or print(dat) 
     [,1] [,2]
[1,]  140   65
[2,]  120   60
[3,]  160   63
[4,]  145   66
[5,]  125   61
  1. Rename the column names of a matrix.
View Solution
colnames(dat) = c("weight.lbs", "Height.in")
dat 
     weight.lbs Height.in
[1,]        140        65
[2,]        120        60
[3,]        160        63
[4,]        145        66
[5,]        125        61
  1. Data frames can store both columns of numeric data and columns of character data, columns of integers, and factors.
  1. Create a data frame with Male and Female observations, and print out the data frame.
View Solution
df = data.frame(
  Weight.lbs = c(140, 120, 160, 145, 125,
       180, 165), 
  Height.in = c(65, 60, 63, 66, 61, 70, 68),
  Gender = c(rep("Female", 5), rep("Male", 2))
  ) 

df
  Weight.lbs Height.in Gender
1        140        65 Female
2        120        60 Female
3        160        63 Female
4        145        66 Female
5        125        61 Female
6        180        70   Male
7        165        68   Male
  1. Get the rows and columns of a data frame.
View Solution
dim(df)
[1] 7 3
View Solution
nrow(df)
[1] 7
View Solution
ncol(df)
[1] 3
  1. Get summary statistics of a data frame.
View Solution
summary(df)
   Weight.lbs      Height.in        Gender         
 Min.   :120.0   Min.   :60.00   Length:7          
 1st Qu.:132.5   1st Qu.:62.00   Class :character  
 Median :145.0   Median :65.00   Mode  :character  
 Mean   :147.9   Mean   :64.71                     
 3rd Qu.:162.5   3rd Qu.:67.00                     
 Max.   :180.0   Max.   :70.00                     
  1. Operations with Matrix and Data Frame
  1. Extract the first row and second column of a matrix.
View Solution
dat[1,]
weight.lbs  Height.in 
       140         65 
View Solution
dat[,2]
[1] 65 60 63 66 61
  1. Subset the first two rows of a matrix
View Solution
dat[1:2,]
     weight.lbs Height.in
[1,]        140        65
[2,]        120        60
  1. Subset rows 1,3,5 of a matrix.
View Solution
dat[c(1,3,5), ]
     weight.lbs Height.in
[1,]        140        65
[2,]        160        63
[3,]        125        61
  1. Select all the Male observations from a data frame.
View Solution
subset(df, Gender=="Male")
  Weight.lbs Height.in Gender
6        180        70   Male
7        165        68   Male
  1. Transpose of a matrix
View Solution
A = matrix(c(3,1,2,4), ncol=2)
A
     [,1] [,2]
[1,]    3    2
[2,]    1    4
  1. Matrix multiplication
View Solution
B = matrix(c(1,2,3,4), ncol=2)
A %*% B 
     [,1] [,2]
[1,]    7   17
[2,]    9   19
View Solution
B %*% A 
     [,1] [,2]
[1,]    6   14
[2,]   10   20
  1. Matrix inversion
View Solution
solve(A)
     [,1] [,2]
[1,]  0.4 -0.2
[2,] -0.1  0.3

1.8.3 Exercise 3: Linear Algebra

Consider the linear system A X = b, where A is an n\times n positive definite matrix and b is a n-dimensional vector, the unique solution is X = A^{-1}b. Please answer the following questions:

  1. Write an R function called my_solver() such that given inputs A and b, the function my_solver() returns the solution of the linear system, i.e., X <- my_solver(A, b).
  2. Run the following code to get A and b.
set.seed(123)
A = matrix(c(5,1,1,6), ncol=2)
n = nrow(A)
b = rnorm(n,1)

Then use your function my_solver() to produce the answer and verify your solution. (hint: AX should be equal to b)

View Solution
my_solver <- function(A, b){
  x = solve(A, b)
  return(x)
}

A = matrix(c(5,1,1,6), ncol=2)
n = nrow(A)
b = rnorm(n,1)
x1 = my_solver(A, b)
sum((A%*%x1-b)^2)
[1] 1.232595e-32

1.8.4 Exercise 4: Working with ggplot2 Package

EPA monitors Air Quality data across the entire U.S. The file AQSdata.csv contains daily PM 2.5 concentrations and other information. Please answer the following questions using the ggplot() function for plotting. In addition make sure that all the x-axis and y-axis labels have 14 font size.

  1. Read the data file AQSdata.csv into R.
  2. Generate density plots of PM2.5 concentrations grouped by County in one single panel, where each density should have its own color. What do you find from the figure?
  3. Plot histograms of PM2.5 concentrations across different counties with one panel for one histogram.
  4. Generate boxplots of PM2.5 concentrations by County. What would you say about the distributions?
  5. Reorder the boxplots above by the median value of PM2.5 concentrations.
  6. Converting the Site ID to a factor and plot the histogram grouped by Site ID.
  7. Generate the time series plot for the monitoring Site ID 450190048.
  8. Plot time series of PM2.5 concentrations for all monitoring sites in one panel, where each site has its own color
  9. Plot time series of PM2.5 concentrations across all monitoring sites in multiple panels, where one panel only has one site, and each row only has two panels.
  10. In the time series plot, there seems to be not enough space to hold the x-axis labels. One way to avoid this is to rotate the axis labels. Please rotate all the time labels 45 degree.
Code
library(readr)
library(lubridate)

df = read_csv("AQSdata.csv")
head(df)
# A tibble: 6 × 20
  Date       Source `Site ID`   POC Daily Mean PM2.5 Con…¹ UNITS DAILY_AQI_VALUE
  <chr>      <chr>      <dbl> <dbl>                  <dbl> <chr>           <dbl>
1 11/09/2021 AQS    450190020     1                   15.5 ug/m…              58
2 11/10/2021 AQS    450190020     1                   13.6 ug/m…              54
3 11/11/2021 AQS    450190020     1                    8.1 ug/m…              34
4 11/12/2021 AQS    450190020     1                    7.1 ug/m…              30
5 11/13/2021 AQS    450190020     1                   10.7 ug/m…              45
6 11/14/2021 AQS    450190020     1                    7.5 ug/m…              31
# ℹ abbreviated name: ¹​`Daily Mean PM2.5 Concentration`
# ℹ 13 more variables: `Site Name` <chr>, DAILY_OBS_COUNT <dbl>,
#   PERCENT_COMPLETE <dbl>, AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>,
#   CBSA_CODE <dbl>, CBSA_NAME <chr>, STATE_CODE <dbl>, STATE <chr>,
#   COUNTY_CODE <chr>, COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>
Code
theme_mat = theme(axis.text = element_text(size = 14),
                  axis.title = element_text(size = 14, face = "bold"))
df = rename(df, PM2.5 = `Daily Mean PM2.5 Concentration`)
ggplot(df) +
  geom_freqpoly(aes(
    x = PM2.5,
    y = after_stat(density),
    color = COUNTY
  ), binwidth = 2) +
  xlab("Daily PM2.5 concentration (ug/m3 LC)") +
  theme_mat

Code
ggplot(df) +
  geom_histogram(aes(x = PM2.5), binwidth = .8) +
  facet_wrap(~ COUNTY, ncol = 3) +
  xlab("Daily PM2.5 concentration (ug/m3 LC)") +
  theme_mat

Code
ggplot(df) +
  geom_boxplot(aes(x = COUNTY, y = PM2.5)) +
  xlab("County") +
  ylab("Daily PM2.5 concentration (ug/m3 LC)") +
  theme_mat

Code
ggplot(df) +
  geom_boxplot(aes(x = reorder(COUNTY, PM2.5, FUN = median), 
                   y = PM2.5)) +
  xlab("County") +
  ylab("Daily PM2.5 concentration (ug/m3 LC)") +
  theme_mat

Code
df1 = df
df1$`Site ID` =  as.factor(df$`Site ID`)
ggplot(df1) +
  geom_freqpoly(aes(x = PM2.5, color = `Site ID`), binwidth = 2) +
  xlab("Daily PM2.5 concentration (ug/m3 LC)") +
  theme_mat

Code
df1 %>%
  filter(`Site ID` == 450190048) %>%
  ggplot() +
  geom_line(aes(x = mdy(Date), y = PM2.5)) +
  labs(x = "Time", y = "Daily PM2.5 concentration (ug/m3 LC)") + 
  theme_mat

Code
df1 %>%
  ggplot() +
  geom_line(aes(x = mdy(Date), y = PM2.5, color = `Site ID`)) +
  labs(x = "Time", y = "Daily PM2.5 concentration (ug/m3 LC)") + 
  theme_mat

Code
g <- df1 %>%
  ggplot() +
  geom_line(aes(x = mdy(Date), y = PM2.5)) +
  facet_wrap(~ `Site ID`, ncol = 2) +
  labs(x = "Time", y = "Daily PM2.5 concentration (ug/m3 LC)") + 
  theme_mat
print(g)

Code
g + theme(axis.text.x = element_text(angle = 45))

1.8.5 Exercise 5: Working with dplyr Package

Continuing working with the above PM 2.5 data.

  1. Filter all the observations in the county Greenville. How many observations are there?
  2. Filter all the observations in Greenville in August 2021
  3. Filter all the observations in Greenville in August 2021 and select the variables PM2.5 concentrations, Date, latitude and longitude of sites
  4. Generate scatterplots of PM2.5 against latitude and longitude in two different panels
Code
library(dplyr)

df %>%
  filter(COUNTY == "Greenville")
# A tibble: 937 × 20
   Date       Source `Site ID`   POC PM2.5 UNITS    DAILY_AQI_VALUE `Site Name` 
   <chr>      <chr>      <dbl> <dbl> <dbl> <chr>              <dbl> <chr>       
 1 01/01/2021 AQS    450450015     1   6.3 ug/m3 LC              26 Greenville …
 2 01/02/2021 AQS    450450015     1   6.6 ug/m3 LC              28 Greenville …
 3 01/03/2021 AQS    450450015     1   5.4 ug/m3 LC              23 Greenville …
 4 01/05/2021 AQS    450450015     1   7.4 ug/m3 LC              31 Greenville …
 5 01/06/2021 AQS    450450015     1   7.7 ug/m3 LC              32 Greenville …
 6 01/07/2021 AQS    450450015     1   9.5 ug/m3 LC              40 Greenville …
 7 01/08/2021 AQS    450450015     1   7.5 ug/m3 LC              31 Greenville …
 8 01/09/2021 AQS    450450015     1   5.3 ug/m3 LC              22 Greenville …
 9 01/10/2021 AQS    450450015     1  12.1 ug/m3 LC              51 Greenville …
10 01/11/2021 AQS    450450015     1  16.7 ug/m3 LC              61 Greenville …
# ℹ 927 more rows
# ℹ 12 more variables: DAILY_OBS_COUNT <dbl>, PERCENT_COMPLETE <dbl>,
#   AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>, CBSA_CODE <dbl>,
#   CBSA_NAME <chr>, STATE_CODE <dbl>, STATE <chr>, COUNTY_CODE <chr>,
#   COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>
Code
df %>%
  mutate(Date = mdy(Date), YM = format_ISO8601(Date, precision = "ym")) %>%
  filter(COUNTY == "Greenville", YM == "2021-08")
# A tibble: 82 × 21
   Date       Source `Site ID`   POC PM2.5 UNITS    DAILY_AQI_VALUE `Site Name` 
   <date>     <chr>      <dbl> <dbl> <dbl> <chr>              <dbl> <chr>       
 1 2021-08-01 AQS    450450015     1  13.8 ug/m3 LC              55 Greenville …
 2 2021-08-02 AQS    450450015     1  19   ug/m3 LC              66 Greenville …
 3 2021-08-03 AQS    450450015     1  16.9 ug/m3 LC              61 Greenville …
 4 2021-08-04 AQS    450450015     1  15.6 ug/m3 LC              58 Greenville …
 5 2021-08-05 AQS    450450015     1  11   ug/m3 LC              46 Greenville …
 6 2021-08-06 AQS    450450015     1  10.3 ug/m3 LC              43 Greenville …
 7 2021-08-07 AQS    450450015     1   9.7 ug/m3 LC              40 Greenville …
 8 2021-08-08 AQS    450450015     1  10   ug/m3 LC              42 Greenville …
 9 2021-08-09 AQS    450450015     1  12   ug/m3 LC              50 Greenville …
10 2021-08-10 AQS    450450015     1  12.5 ug/m3 LC              52 Greenville …
# ℹ 72 more rows
# ℹ 13 more variables: DAILY_OBS_COUNT <dbl>, PERCENT_COMPLETE <dbl>,
#   AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>, CBSA_CODE <dbl>,
#   CBSA_NAME <chr>, STATE_CODE <dbl>, STATE <chr>, COUNTY_CODE <chr>,
#   COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>, YM <chr>
Code
df %>%
  mutate(Date = mdy(Date), YM = format_ISO8601(Date, precision = "ym")) %>%
  filter(COUNTY == "Greenville", YM == "2021-08") %>%
  dplyr::select(PM2.5, Date, SITE_LATITUDE, SITE_LONGITUDE)
# A tibble: 82 × 4
   PM2.5 Date       SITE_LATITUDE SITE_LONGITUDE
   <dbl> <date>             <dbl>          <dbl>
 1  13.8 2021-08-01          34.8          -82.4
 2  19   2021-08-02          34.8          -82.4
 3  16.9 2021-08-03          34.8          -82.4
 4  15.6 2021-08-04          34.8          -82.4
 5  11   2021-08-05          34.8          -82.4
 6  10.3 2021-08-06          34.8          -82.4
 7   9.7 2021-08-07          34.8          -82.4
 8  10   2021-08-08          34.8          -82.4
 9  12   2021-08-09          34.8          -82.4
10  12.5 2021-08-10          34.8          -82.4
# ℹ 72 more rows
Code
df %>%
  dplyr::select(PM2.5, SITE_LATITUDE, SITE_LONGITUDE) %>%
  pivot_longer(
    cols = c("SITE_LATITUDE", "SITE_LONGITUDE"),
    names_to = "variable",
    values_to = "value"
  ) %>%
  ggplot(aes(x = value, y = PM2.5)) +
  geom_point() +
  facet_wrap( ~ variable, scale = "free") +
  xlab("") +
  ylab("Daily PM2.5 concentration (ug/m3 LC)") +
  theme_mat