7  Factor Analysis

Author

Pulong Ma

7.1 Why Factor Analysis?

Imagine you’re a psychologist who has just created a new survey to measure “student engagement.” You’ve written 10 different questions, like

  • “How often do you participate in class discussions?”,
  • “How much time do you spend studying?”,
  • and “Do you find your courses interesting?”

After collecting responses, you notice that students who answer “often” to one question tend to answer similarly to a few others. It seems like your questions aren’t measuring 10 completely different things, but rather, they are tapping into a few underlying, unobservable traits.

  • This is the exact scenario where you would use Exploratory Factor Analysis.
  • Factor analysis is a statistical method used to uncover the latent structure (or “factors”) from a set of variables. There are two types of factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis. This course will focus on EFA.

7.1.1 Factor Analysis v.s. PCA: What’s the Difference?

  • PCA is a variance-focused technique with the goal to reduce dimensionality by creating a smaller set of components that capture maximum amount of total variance in the original data.

  • Factor analysis is a covariance-focused technique with the goal to explain the shared variance (covariance) among the original variables by modeling the underlying latent factors.

7.1.2 Spearman’s Exam Marks

Suppose we observe three exam scores for each student: Classics (X_1), French (X_2), and English (X_3). Researchers collect exam scores from many different students across these subjects and would like to investigate if what could affect students’ performances.

Research Question

Is there a single latent factor that drives students’ performance across different subjects?

Experimental Data

Historical reports give approximate correlations as follows:

R <- matrix(c(1.00, 0.83, 0.78,
              0.83, 1.00, 0.67,
              0.78, 0.67, 1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Classics","French","English"),
                            c("Classics","French","English")))
R
         Classics French English
Classics     1.00   0.83    0.78
French       0.83   1.00    0.67
English      0.78   0.67    1.00

Interpretation: The variables are strongly correlated—suggesting a single, latent “general ability” factor F that drives performance across subjects.

An Orthogonal One-Factor Model
  • Let \mathbf{z} = (z_1,\ldots,z_p)^\top be standardized variable of \mathbf{x}.

  • Let R=\text{Var}(z) denote the p\times p correlation matrix. An orthogonal one-factor model has the form \begin{equation*} \mathbf{z} = L f + \boldsymbol \varepsilon,\\ \text{Var}(\varepsilon)=\boldsymbol \Psi=\text{diag}\{\psi_1,\ldots,\psi_p\},\quad R = LL^\top + \boldsymbol \Psi. \end{equation*}

  • f is called the common factor assumed with mean E(f)=0 and variance \text{Var}(f)=1.

  • L=[\ell_1, \ldots, \ell_p]^\top is the p\times 1 matrix of factor loadings.

  • \boldsymbol\varepsilon is a p-dimensional vector of measurement errors (also called specific factor).

  • h_i^2:= \ell_{i}^2 is called the communality of X_i and represents the variance explained by the factor f.

  • \psi_i=1-h_i^2 is called the uniqueness.

Given the correlation matrix only, we can use it to fit the one-factor model via R function factanal.

fit1 = factanal(factors = 1, covmat = R)
str(fit1)
List of 10
 $ converged   : logi TRUE
 $ loadings    : 'loadings' num [1:3, 1] 0.983 0.844 0.793
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "Classics" "French" "English"
  .. ..$ : chr "Factor1"
 $ uniquenesses: Named num [1:3] 0.0337 0.2871 0.3704
  ..- attr(*, "names")= chr [1:3] "Classics" "French" "English"
 $ correlation : num [1:3, 1:3] 1 0.83 0.78 0.83 1 0.67 0.78 0.67 1
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "Classics" "French" "English"
  .. ..$ : chr [1:3] "Classics" "French" "English"
 $ criteria    : Named num [1:3] 4.26e-12 1.20e+01 1.20e+01
  ..- attr(*, "names")= chr [1:3] "objective" "counts.function" "counts.gradient"
 $ factors     : num 1
 $ dof         : num 0
 $ method      : chr "mle"
 $ n.obs       : logi NA
 $ call        : language factanal(factors = 1, covmat = R)
 - attr(*, "class")= chr "factanal"

Output interpretation:

  • loadings: a matrix of loadings with one column for each factor. The factors are in decreasing order of communality.
  • uniquenesses: a vector of estimated measurement error variances.

7.2 The Exploratory Factor Model

Example: bfi Data

Background: A classroom/teaching dataset for personality measurement under the Big Five model: Agreeableness (A), Conscientiousness (C), Extraversion (E), Neuroticism (N), and Openness (O). The bfi data is collected via the SAPA Project and is available from the R package psych.

  • 25 items: 5 per trait, named A1-A5, C1-C5, E1-E5, N1-N5, O1-O5.
  • Responses: 6-point Likert (1=Very Inaccurate, …, 6=Very Accurate).
  • Demographics: gender, education, age
  • Sample size: 2800, but with some missing values.
library(psych)
data(bfi, package="psych")
dim(bfi)
[1] 2800   28
head(bfi)
  • What are the underlying latent structures measured by the set of survey items?
  • Do the latent structures show any differences across demographic variables (e.g., gender, education, and age group)?
The Factor Model
  • Let \mathbf{z}_i = (z_{i1},\ldots,z_{ip})^\top denote the vector of p=25 standardized variables for the i-th observation.

  • Assume that there are m common factors denoted by the m-dimensional vector \mathbf{f}_i for the i-th observation.

  • The m-factor model is \mathbf{z}_{i} = L \mathbf{f}_i + \boldsymbol \varepsilon_i, i=1,\ldots, n

    • L is the p\times m matrix of factor loadings with (i,j)th entry \ell_{ij}.
    • \mathbf{f}_i is the m-dimensional vector of common factors.
    • \boldsymbol \varepsilon_i is the p-dimensional vector of specific factors.
  • In matrix notation, stacking all observations together yields Z = F L^\top + \boldsymbol \varepsilon

    • Z=[\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top is the n\times p matrix of standardized data;
    • F=[\mathbf{f}_1, \ldots, \mathbf{f}_n]^\top is the n\times m matrix of common factors for all observations.
    • \boldsymbol \varepsilon=[\boldsymbol \varepsilon_1, \ldots, \boldsymbol \varepsilon_n]^\top is the n\times p of specific factors for all observations.
  • The communality of X_i is h_i^2: = \ell_{i1}^2 + \ell_{i2}^2 + \cdots + \ell_{im}^2

    • it measures the variation explained by the m common factors;
  • The proportion of the total variance explained by the jth factor is given by \frac{\sum_{i=1}^p \ell_{ij}^2}{\text{trace}(R)}

Assumptions
  • E(F) = \mathbf{0}, \text{Var}(F) = I_{m\times m}.
  • E(\boldsymbol \varepsilon) = \mathbf{0}, \text{Var}(\boldsymbol \varepsilon) = \boldsymbol \Psi = \text{diag}\{\psi_1, \ldots, \psi_p\}
  • F and \boldsymbol \varepsilon are independent.

Step: Data Preparation and Visualization

R Code: Prepare Data
# select the 25 personality items (columns 1-25)
df = bfi %>% drop_na()
dat = df[, 1:25]
head(dat)
R Code: Visualization
corrplot::corrplot(cor(dat))

7.3 Measures of Sampling Adequacy

Step: Check Assumptions

Before performing factor analysis, we need to ensure our data is suitable. Two common methods are introduced below.

Measure of Sampling Adequacy (MSA)

The first method is measure of sampling adequacy (MSA). MSA (Kaiser 1970, Psychometrica) is a statistic that measures the relative sizes of the pairwise correlations to the partial correlations between all pairs of variables: MSA = 1- \frac{\sum_j \sum_{k \neq j} q_{jk}^2}{\sum_j \sum_{k \neq j} r_{jk}^2 }

  • r_{jk} is the marginal sample correlation between variables j and k;
  • q_{jk} is the partial correlation between the two variables after accounting for all other variables in the data;
  • MSA indicates the proportion of relationships in the data that is shared and common among groups of variables
  • A high MSA (close to 1) suggests that there are underlying factors influencing groups of variables and the dataset is suitable for factor analysis.
  • A low MSA (close to 0) suggestions that most of the correlations are just unique, one-on-one relationships that cannot be well explained by common factors.

The MSA can take on values between 0 and 1. Kaiser proposed the following guidelines for interpretation:

MSA Range Interpretation
0.9 to 1.0 Marvelous
0.8 to 0.9 Meritorious
0.7 to 0.8 Middling
0.6 to 0.7 Mediocre
0.5 to 0.6 Miserable
0.0 to 0.5 Unacceptable
R Code: MSA
# compute measures of sampling adequacy (MSA)
psych::KMO(dat) 
Kaiser-Meyer-Olkin factor adequacy
Call: psych::KMO(r = dat)
Overall MSA =  0.85
MSA for each item = 
  A1   A2   A3   A4   A5   C1   C2   C3   C4   C5   E1   E2   E3   E4   E5   N1 
0.74 0.83 0.87 0.87 0.90 0.84 0.79 0.85 0.82 0.86 0.84 0.88 0.89 0.88 0.89 0.78 
  N2   N3   N4   N5   O1   O2   O3   O4   O5 
0.78 0.86 0.89 0.86 0.86 0.78 0.83 0.78 0.76 

Interpretation: The overall MSA is 0.85, which is “meritorious.” This confirms our data is appropriate for factor analysis.

Cronbach’s Alpha

The second method is called Cronbach’s alpha. For n samples of p-dimensional observations \mathbf{X}=(X_1, \ldots, X_p)^\top, Cronbach’s alpha is \alpha = \frac{p \bar{r}}{1+(p-1)\bar{r}}

  • \bar{r} is the average correlation: \bar{r} = \frac{\frac{1}{\frac{p(p-1)}{2}} \sum \sum_{i< j} \text{Cov}(X_i, X_j)}{\frac{1}{p}\sum_{i=1}^p \text{Var}(X_i)}
  • A high alpha value (close to 1) suggests that the items are all measuring the same underlying latent factor.
  • A low alpha value (close to 0) suggests that the items may be measuring different things.
R Code: Cronbach’s Alpha
summary(psych::alpha(dat, check.keys=TRUE))
Warning in psych::alpha(dat, check.keys = TRUE): Some items were negatively correlated with the first principal component and were automatically reversed.
 This is indicated by a negative sign for the variable name.

Reliability analysis   
 raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
      0.82      0.82    0.86      0.15 4.6 0.0054  4.2 0.61     0.13

7.4 How Many Factors Should Be Used?

This is one of the most crucial and perhaps subjective steps. We will introduce two common methods:

  1. The first is to decide the number of factors, m, prior to the analysis using the idea from PCA.
  2. The second is to use a formal likelihood ratio test.

7.4.1 Scree Plot for Choosing the Number of Factors

Prior to the analysis, one often needs to decide how many factors should be used. A good exploratory way is to use scree plot to decide m so that the contribution of each potential factor to the total variation is examined.

  • Scree Plot: We look for the “elbow” or point of inflection in the plot of eigenvalues of correlation matrix.
R Code: Scree Plot
eigenvalues = eigen(cor(dat))$values 
scree_data <- data.frame(
  Factor = 1:length(eigenvalues),
  Eigenvalue = eigenvalues
)

cumsum(eigenvalues)/sum(eigenvalues)
 [1] 0.2027406 0.3132398 0.3993447 0.4750381 0.5357394 0.5788925 0.6121288
 [8] 0.6443088 0.6728723 0.7009338 0.7281675 0.7541265 0.7793767 0.8028980
[15] 0.8255366 0.8473302 0.8681275 0.8878823 0.9071917 0.9248917 0.9420466
[22] 0.9583305 0.9738855 0.9892760 1.0000000
R Code: Scree Plot
ggplot(scree_data, aes(x = Factor, y = Eigenvalue)) +
  geom_point() +
  geom_line() +
  ggtitle("Scree Plot") +
  xlab("Factor") + 
  ylab("Eigenvalue")

Interpretation:

  • As seen from the scree pot, the cumulative eigenvalues do not increase too much and “elbow” point seem to occur at either 5 or 6; so we might want to try the factor model with m=6 factors.
  • The scree plot often gives somewhat subjective answers, so we might also try different number of factors, fit the factor analysis model for each selection and compare the results among all the choices.

7.4.2 Likelihood Ratio Test

A formal way to choose the number of factors is to use a likelihood ratio test.

Likelihood Ratio Test (LRT)
  • We wish to test whether the m factor model appropriately describes the correlations among the p variables.
  • Null hypothesis: m factors are sufficient, i.e, H_0: R_{p \times p} = L_{p \times m}L'_{m \times p} + \Psi_{p \times p}
  • Alternative hypothesis: the correlation matrix can be any positive definite matrix H_a: R_{p \times p} \text{ is a positive definite matrix}
  • The test statistic is -2\ln \Lambda := \{n - 1 - (2p+4m+5)/6\} \log \frac{|\hat{L}\hat{L}' + \hat{\Psi}|}{|\hat{R}|}, where \hat{L}, \hat{\Psi} and \hat{R} denote the their estimates from the data.
  • -2\ln \Lambda has an approximate \chi^2 distribution under the null hypothesis with degrees of freedom \text{df}=[(p - m)^2 - p - m]/2.
  • To have df >0, we must have m < \frac{1}{2}(2p + 1 - \sqrt{8p + 1}).
  • We reject H_0 at level \alpha if
    [n - 1 - (2p+4m+5)/6] \log \left( \frac{|{L}{L}' + {\Psi}|}{|{R}|} \right) > \chi^2_{\text{df}, 1-\alpha}
  • This test has two major drawbacks:
    • It is very sensitive to sample size n. With large sample sizes, the LRT tends to suggest that large number of factors be include.
    • It replies on the assumption of multivariate normality. If the assumption is violated, the LRT also tends to indicate the need for too many factors.
  • In practice, this LRT test should only be used as a guideline for selecting the number of factors, and we should also other things such as the interpretation of the factors, proportion of variance explained, and the desire for simplicity.
R Code: Multivariate Normality Test
mvShapiroTest::mvShapiro.Test(as.matrix(dat))

    Generalized Shapiro-Wilk test for Multivariate Normality by
    Villasenor-Alva and Gonzalez-Estrada

data:  as.matrix(dat)
MVW = 0.95906, p-value < 2.2e-16
R Code: Likelihood Ratio Test
# compute p-values for different number of factors
sapply(1:15, function(f) 
factanal(dat, factors = f, method ="mle")$PVAL)
    objective     objective     objective     objective     objective 
 0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00 1.881649e-177 
    objective     objective     objective     objective     objective 
 1.411671e-85  4.776750e-51  3.708668e-31  4.993573e-19  5.950411e-11 
    objective     objective     objective     objective     objective 
 3.479191e-05  7.032046e-03  1.427903e-01  1.500403e-01  6.859678e-01 

Interpretation: Since the data fails the Shapiro Wilk’s test, the multivariate normality is violated. As expected, the LRT indicates at least 14 factors to be included with p-value > 0.2, which are too many to be included in the factor analysis model. Based on the result from scree plot, we will use 6 factors in the follow-up analysis.

7.5 Estimation for Factor Loadings

In the factor analysis model, we often first estimate the factor loadings L, and then estimate the factor scores F given L. To estimate the factor loadings, three methods are introduced below.

  • The principal component method
  • The iterative principal factor method
  • Maximum likelihood estimation (assumes multivariate normality)

7.5.1 The Principal Component Method

The basic idea of the principal component method is to approximate the correlation matrix using a rank-m approximation via eigenvalues and eigenvectors.

  • Let (\lambda_i, \mathbf{e}_i) denote the eigenpairs of the p\times p correlation matrix R.
  • The rank-m approximation to R is \hat{R}^{(m)} : = \lambda_1 \mathbf{e}_1 \mathbf{e}_1^\top + \ldots + \lambda_m \mathbf{e}_m \mathbf{e}_m^\top.
  • The columns of L is given by the columns \sqrt{\lambda_1} \mathbf{e}_1, \ldots, \sqrt{\lambda_m}\mathbf{e}_m.
  • \Psi is estimated by \hat{\Psi} = \text{diag}(R - LL^\top).
R Code: Estimation via the PC Method
fit.pc = prcomp(dat, center=TRUE, scale=TRUE)

plot(fit.pc$sdev^2, xlab="Component Number",
        ylab="Component Variance (eigenvalue)",  
        main="Scree Diagram", type="b")

R Code: Factor Loadings
m = 6
fit.pc$loadings = fit.pc$rotation %*% diag(fit.pc$sdev)
corrplot::corrplot(fit.pc$loadings[, 1:m])

Visualizing Factor Loadings
Writing an R Function
plt_loadings = function(loadings, K = 3,
  with_ties = FALSE, angle=30,
  palette = c("tomato", "white", "steelblue")) 
{
  # Required packages
  require(dplyr)
  require(tidyr)
  require(tibble)
  require(stringr)
  require(forcats)
  require(ggplot2)

  # Ensure matrix form
  #L = as.matrix(loadings)
  L = suppressWarnings(as.matrix(unclass(loadings)))
  
  # Default names
  if (is.null(rownames(L))) 
    rownames(L) = paste0("var", seq_len(nrow(L)))
  if (is.null(colnames(L))) 
    colnames(L) = paste0("Factor", seq_len(ncol(L)))
  
  var_levels = rownames(L)

  # Build long-format data frame
  load_df <- as.data.frame(L) %>%
    tibble::rownames_to_column("variable") %>%
    pivot_longer(-variable, 
                        names_to = "Factor", 
                        values_to = "loading") %>%
    mutate(abs_loading = abs(loading))

  # Select top-K per Factor
  tops = load_df %>%
    group_by(Factor) %>%
    slice_max(abs_loading, n = K, 
                     with_ties = with_ties) %>%
    ungroup() %>%
    mutate(highlight = TRUE)

  # Mark highlighted entries
  load_df = load_df %>%
    left_join(
      tops %>% dplyr::select(variable, Factor, highlight),
      by = c("variable", "Factor")
    ) %>%
    mutate(
      highlight = ifelse(is.na(highlight), 
                         FALSE, TRUE))

  # Reorder factors numerically 
  # if they are like "Factor 1", "Factor 2", ...
  load_df = load_df %>%
    mutate(
      PC_num = 
          as.integer(stringr::str_extract(Factor, "\\d+")),
      PC = if (all(!is.na(PC_num))) {
        forcats::fct_reorder(Factor, PC_num, .fun = min)
      } else {
        factor(Factor, levels = rev(var_levels))
      }
    )

  # Plot
  p = ggplot(load_df, 
    aes(x = PC, y = fct_rev(variable), fill = loading)) +
    geom_tile(color = "white") +
    scale_fill_gradient2(
      low = palette[1], 
      mid = palette[2], 
      high = palette[3], 
      midpoint = 0,
      name = "Loading"
    ) +
    geom_point(
      data = subset(load_df, highlight),
      shape = 21, size = 4, stroke = 1.1, 
      color = "black", fill = NA
    ) +
    labs(
      x = NULL, y = NULL,
      title = 
        paste0("Top ", 
               K, " loadings per column highlighted")
    ) +
    theme_minimal(base_size = 12) + 
    theme(
      axis.text.x = 
        element_text(
          face = "bold",
          angle = angle, 
          vjust = 0.5, 
          hjust = 1)
      )

  print(p)
}
plt_loadings(fit.pc$loadings[, 1:m], K=4)

7.5.2 The Principal Factor Method

The principal factor method is an iterative modification of the principal components method that allows for greater focus on explaining correlations among observed traits.

  • The general algorithm is as follows:
    1. \Psi is estimated via the PC method above.
    2. L is estimated to approximate R - \Psi.
    3. Given L, \Psi is estimated.
    4. Repeat Step 2-3 until convergence.
  • There are two options during estimation:
    • HEYWOOD: Set any estimated communality larger than one equal to 1 and contiute iterations with the remaining variables.
    • ULTRAHEYWOOD: Continue iterations with all of the variables and hope that iterations eventually give allowable parameter estimates. (Doing nothing)
factanal v.s. fn

In R, both functions factanal and fa (from the psych package) implement factor analysis, but they use different estimation methods. The principal factor method is available in fa but not factanal.

R Code: Principal Factor Method
library(psych)
fit.pa = psych::fa(dat, nfactors=5, fm="pa", rotate="none")
summary(fit.pa)

Factor analysis with Call: psych::fa(r = dat, nfactors = 5, rotate = "none", fm = "pa")

Test of the hypothesis that 5 factors are sufficient.
The degrees of freedom for the model is 185  and the objective function was  0.63 
The number of observations was  2236  with Chi Square =  1400.57  with prob <  1.5e-185 

The root mean square of the residuals (RMSA) is  0.03 
The df corrected root mean square of the residuals is  0.04 

Tucker Lewis Index of factoring reliability =  0.878
RMSEA index =  0.054  and the 90 % confidence intervals are  0.052 0.057
BIC =  -26.23
R Code: Plot Factor Loadings
plt_loadings(fit.pa$loadings, K=4)

7.5.3 Maximum Likelihood Estimation

The maximum likelihood estimation (MLE) method is a probabilistic method by assuming multivariate normal for the data and then maximizing the likelihood function to obtain the estimates.

Maximum Likelihood Estimation
  • The distributional assumptions for the factor model are: \mathbf{x}_j \sim \text{N}_p(\boldsymbol{\mu}, {\Sigma}), \;\; \mathbf{f}_j \sim \text{N}_m(\mathbf{0}, \mathbf{I}_m), \;\; \boldsymbol{\epsilon}_j \sim \text{N}_p(\mathbf{0}, \boldsymbol{\Psi}_{p \times p}),
    • \mathbf{x}_j = \mathbf{L} \mathbf{f}_j + \boldsymbol{\varepsilon}_j,
    • {\Sigma} = \mathbf{L}\mathbf{L}^\top + \boldsymbol{\Psi},
    • and \mathbf{f}_j is independent of \boldsymbol{\varepsilon}_j.
    • Also, \boldsymbol{\Psi} is a diagonal matrix.
  • The log-likelihood function for the data is given by: \begin{align*} \ell(\boldsymbol{\mu}, {\Sigma}) &:= -\frac{n}{2}\log |2\pi {\Sigma}| - \frac{1}{2} \sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^\top {\Sigma}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}) \\ &= -\frac{n}{2}\log |2\pi {\Sigma}| - \frac{n}{2} \text{tr}({\Sigma}^{-1}{S}) - \frac{n}{2}(\bar{\mathbf{x}}-\boldsymbol{\mu})^\top{\Sigma}^{-1} (\bar{\mathbf{x}}-\boldsymbol{\mu}). \end{align*} where S is the sample covariance matrix.
R Code: MLE via factanal
fit.mle = factanal(dat, factors=m, method="mle", 
                   rotation="none")
print(fit.mle$loadings)

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
A1  0.223          -0.134          -0.374   0.344 
A2 -0.392   0.359   0.161   0.152   0.347  -0.227 
A3 -0.456   0.392   0.222   0.123   0.288         
A4 -0.373   0.201           0.289   0.171         
A5 -0.534   0.283   0.247           0.179         
C1 -0.277   0.192  -0.458                   0.147 
C2 -0.260   0.245  -0.504   0.193   0.122   0.222 
C3 -0.286   0.127  -0.383   0.252                 
C4  0.451           0.532  -0.189           0.215 
C5  0.481           0.351  -0.245   0.122         
E1  0.362  -0.317  -0.230           0.256   0.179 
E2  0.587  -0.234  -0.175           0.326         
E3 -0.465   0.439   0.119  -0.204           0.187 
E4 -0.566   0.330   0.287   0.119  -0.163   0.135 
E5 -0.421   0.427  -0.121          -0.177         
N1  0.589   0.577                  -0.169         
N2  0.573   0.566                  -0.127  -0.138 
N3  0.519   0.492                                 
N4  0.590   0.226                   0.292   0.102 
N5  0.420   0.302           0.178   0.193   0.107 
O1 -0.274   0.225  -0.167  -0.405           0.122 
O2  0.196           0.278   0.407           0.154 
O3 -0.326   0.334  -0.102  -0.503           0.105 
O4  0.127   0.161  -0.103  -0.276   0.327         
O5  0.178           0.235   0.457  -0.121   0.233 

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      4.421   2.392   1.571   1.302   0.977   0.544
Proportion Var   0.177   0.096   0.063   0.052   0.039   0.022
Cumulative Var   0.177   0.273   0.335   0.387   0.427   0.448
Visualize Factor Loadings
plt_loadings(fit.mle$loadings, K=5)

Factor Loading Interpretations: A loading represents how strongly a variable is associated with a factor. We could consider the loading |\ell_{ij}| > 0.3 or 0.4 so that each factor could have meaningful interpretations. The signs of factor loadings could be flipped for all factors.

  • Factor 1: Items that load strongly on this factor could include A3-A5, C4,C5, E2-E5, N1-N5. In general, it is difficult to interpret since most variables depend this factor. It may suggest that there is a common factor to explain all the variables. This could be further verified using Quartimax rotation.
  • Factor 2: Items N1-N5 load strongly on this factor. (Neuroticism)
  • Factor 3: Items C1-C5 load strongly here. (Conscientiousness)
  • Factor 4: Items O1-O5 load strongly here. O2 and O5 are reverse-coded. (Openness)
  • Factor 5: Items that strongly load on this include A1-A3, E2, O4. It could be more related to Agreeableness.

Factor Correlations: The table at the bottom shows the correlations between the factors. This could tell us how the latent factors are correlated. For orthogonal rotations, the factors will show essentially zero values; for non-orthogonal rotations, this could be used to check the validity of the rotation method.

R Code: Correlation Among Factor Loadings
corrplot::corrplot(cor(fit.mle$loadings))

In constast to factanal(), one can also use fa() to get maximum likelihood estimates.

R Code: MLE via fa
fit.mle2 = fa(dat, nfactors=m, fm="ml")
fit.mle2$loadings

Loadings:
   ML2    ML1    ML3    ML5    ML4    ML6   
A1  0.108 -0.131        -0.562         0.273
A2                       0.696              
A3        -0.119         0.608         0.127
A4                0.194  0.406 -0.153  0.137
A5 -0.163 -0.203         0.446         0.225
C1                0.546         0.181       
C2         0.150  0.668                0.193
C3                0.561                     
C4               -0.662                0.287
C5  0.131  0.188 -0.555                     
E1 -0.136  0.579  0.108 -0.148              
E2         0.671                            
E3        -0.361         0.139  0.372  0.262
E4        -0.546         0.218         0.274
E5  0.177 -0.409  0.261         0.226       
N1  0.855                                   
N2  0.853                                   
N3  0.639  0.163                       0.116
N4  0.381  0.463 -0.135                0.119
N5  0.400  0.252         0.163 -0.131  0.189
O1                              0.549       
O2  0.119                0.100 -0.448  0.273
O3        -0.108                0.661       
O4         0.350         0.143  0.359       
O5                             -0.517  0.324

                 ML2   ML1   ML3   ML5   ML4   ML6
SS loadings    2.321 1.988 1.974 1.722 1.656 0.716
Proportion Var 0.093 0.080 0.079 0.069 0.066 0.029
Cumulative Var 0.093 0.172 0.251 0.320 0.386 0.415
R Code: MLE via fa
plt_loadings(fit.mle2$loadings, K=5)

Final Conclusion: The exploratory factor analysis (EFA) successfully identified three of the “Big Five” personality traits (Conscientiousness, Neuroticism, and Openness) as the primary latent structures within the 25 survey items using the maximum likelihood estimation method without rotation on the factors.

7.6 Rotation of Factors

In the factor model with m>1, there is no unique set of loadings and thus there is ambiguity associated with the factor model. This can be seen by introducing any m\times m orthogonal matrix T in the factor model \mathbf{z}_i = L\mathbf{f}_i + \boldsymbol \varepsilon = L T T^\top \mathbf{f}_i + \boldsymbol \varepsilon = L^* \mathbf{f}_i^* + \boldsymbol \varepsilon_i

  • TT^\top = I and T^\top T = I
  • L^*:=LT and \mathbf{f}_i^*:=T^\top \mathbf{f}_i
  • This indicates that loadings are not uniquely determined.

In what follows, we introduce three different factor rotations to better interpret the results.

Varimax Rotation

Varimax rotation aims to have each one of the p variables load highly on only one factor and have moderate to negligible loads on all other factors.

  • The transformations in Varimax rotation is an orthogonal transformation.
  • After varimax rotation, each of the p variables should load highly on at most one of the rotated factors, but this may not always be true.
R Code: Varimax Rotation
fit.varimax = factanal(dat, factors=m, method="mle", rotation="varimax")
fit.varimax$loadings

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
A1                         -0.534  -0.113   0.124 
A2  0.260           0.128   0.645                 
A3  0.384           0.127   0.568           0.153 
A4  0.239           0.236   0.387  -0.152   0.102 
A5  0.446  -0.137   0.107   0.435           0.227 
C1                  0.549           0.188         
C2                  0.665                   0.158 
C3                  0.551                         
C4          0.222  -0.633  -0.101  -0.120   0.305 
C5 -0.184   0.273  -0.548                   0.137 
E1 -0.575                  -0.133           0.178 
E2 -0.675   0.233                           0.124 
E3  0.594           0.112   0.141   0.250   0.231 
E4  0.678  -0.140   0.114   0.215  -0.108   0.140 
E5  0.515           0.306           0.197         
N1          0.815          -0.162          -0.124 
N2          0.802          -0.122          -0.186 
N3          0.714                                 
N4 -0.342   0.562  -0.160                   0.198 
N5 -0.149   0.516                  -0.165   0.164 
O1  0.232           0.134           0.487   0.158 
O2          0.168                  -0.500   0.138 
O3  0.343                           0.581   0.178 
O4 -0.163   0.210           0.125   0.348   0.170 
O5                                 -0.580   0.148 

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      2.728   2.718   2.049   1.560   1.537   0.617
Proportion Var   0.109   0.109   0.082   0.062   0.061   0.025
Cumulative Var   0.109   0.218   0.300   0.362   0.424   0.448
Code
plt_loadings(fit.varimax$loadings, K=5)

The goal is to name each factor by identifying the common theme among the variables that have high loadings on it according to a cutoff. This cutoff should be chosen on a case-by-case basis so that one can achieve good interpretability

  • Factor 1: Extraversion️ This factor is defined by high loadings defined by E1-E5. Note the strong negative loadings for the introversion-keyed items.
    • E1: Don’t talk a lot
    • E2: Find it difficult to approach others
    • E3: Know how to captivate people
    • E4: Make friends easily
    • E5: Am the life of the party
  • Factor 2: Neuroticism This factor is defined by high positive loadings from all the Neuroticism items. Individuals scoring high on this factor tend to experience negative emotions.
    • N1: Get stressed out easily
    • N2: Worry about things
    • N3: Am easily disturbed
    • N4: Get upset easily
    • N5: Change my mood a lot
  • Factor 3: Conscientiousness This factor is defined by the Conscientiousness items. The negative loadings for items C4 and C5 indicate that people high on this factor are not likely to do things halfway or waste time.
    • C1: Am always prepared
    • C2: Pay attention to details
    • C3: Get chores done right away
    • C4: Do things in a half-way manner
    • C5: Waste my time
  • Factor 4: Agreeableness This factor clearly represents the Agreeableness trait. A1, which is reverse-keyed, has a moderate negative loading as expected.
    • A2: Am interested in people
    • A3: Sympathize with others’ feelings
    • A4: Have a soft heart
    • A5: Take time out for others
    • A1: Am indifferent to the feelings of others
  • Factor 5: Openness to Experience This factor is defined by the Openness items. The negative loadings for O2 and O5 are consistent with being low on this trait.
    • O1: Have a rich vocabulary
    • O3: Have a vivid imagination
    • O4: Am full of ideas
    • O2: Am not interested in abstract ideas
    • O5: Do not enjoy going to art museums
  • Summary of Variance Explained The first 5 factors explains 42.4% variations, however, the factor 6 has very small loadings (e.g., \ell_{ij}< 0.5), which indicates that factor 6 may not be interpretable.
Quartimax Rotation
  • The varimax rotation destroy an “overall” factor.
  • In contrast, the quartimax rotation tries to
    1. Preserve an overall factor such that each of the p variables has a high loading on the overall factor;
    2. Create other factors such that each of the p variables has a high loading on at most one factor.
R Code: Quartimax Rotation
# install.packages("GPArotation")
library(GPArotation)
quartimax(fit.mle$loadings)
Orthogonal rotation method Quartimax converged.
Loadings:
     Factor1  Factor2 Factor3  Factor4 Factor5 Factor6
A1 -0.081420  0.13326  0.0298 -0.07668 -0.5368  0.0907
A2  0.414291  0.00380  0.1164  0.00996  0.5641  0.0119
A3  0.538823 -0.02658  0.0862  0.00556  0.4371  0.1420
A4  0.360211 -0.08975  0.2026 -0.17057  0.2819  0.1140
A5  0.578553 -0.16346  0.0535  0.01204  0.2745  0.1745
C1  0.123787 -0.00478  0.5341  0.17723 -0.0410  0.1257
C2  0.120050  0.04917  0.6377  0.07489  0.0044  0.2339
C3  0.108611 -0.04848  0.5416 -0.03148  0.0619  0.0670
C4 -0.086235  0.20244 -0.6681 -0.08046 -0.0915  0.2475
C5 -0.211614  0.25231 -0.5494  0.05720  0.0290  0.1233
E1 -0.551133 -0.01293  0.0594 -0.07073 -0.0246  0.2748
E2 -0.660465  0.18776 -0.0682 -0.04615  0.0657  0.2478
E3  0.640897  0.00252  0.0567  0.26261 -0.0161  0.1358
E4  0.734463 -0.12557  0.0541 -0.10235  0.0134  0.0346
E5  0.519389  0.07199  0.2892  0.19436 -0.0317 -0.1067
N1 -0.030707  0.83647 -0.0443 -0.05125 -0.0901 -0.0622
N2 -0.068316  0.82158 -0.0145  0.01652 -0.0318 -0.1130
N3 -0.076434  0.69737 -0.0612  0.00601  0.0503  0.1580
N4 -0.323978  0.51793 -0.1650  0.07876  0.1068  0.2898
N5 -0.110418  0.48374 -0.0586 -0.15774  0.1268  0.2465
O1  0.244793 -0.02933  0.1169  0.49435 -0.0613  0.1088
O2  0.038826  0.15911 -0.1272 -0.48961  0.0145  0.1579
O3  0.361420  0.00777  0.0597  0.59050 -0.0241  0.1084
O4 -0.111197  0.16732 -0.0292  0.34745  0.1744  0.2112
O5 -0.000448  0.06525 -0.0853 -0.56502 -0.1232  0.1566

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      3.320   2.641   1.975   1.514   1.063   0.695
Proportion Var   0.133   0.106   0.079   0.061   0.043   0.028
Cumulative Var   0.133   0.238   0.317   0.378   0.421   0.448

The goal of a Quartimax rotation is to simplify the rows of the factor loading matrix, making it easier to see which factor each variable is associated with. Based on the loadings (where values > 0.4 are considered significant), we can interpret the six factors as follows.

  • Factor 1: Extraversion & Agreeableness (A “Sociability” Factor): This is a very broad factor that combines items from two distinct personality traits. It captures a general tendency towards being sociable, outgoing, and agreeable. Quartimax rotation can sometimes produce a large general factor, which seems to have happened here.

    • E4: Make friends easily
    • E2: Find it difficult to approach others
    • E3: Know how to captivate people
    • A5: Take time out for others
    • A3: Sympathize with others’ feelings
    • E1: Don’t talk a lot
    • A2: Am interested in people
  • Factor 2: Neuroticism: This is a very strong and clear factor, defined by high positive loadings from all the Neuroticism items.

    • N1: Get stressed out easily
    • N2: Worry about things
    • N3: Am easily disturbed
    • N4: Get upset easily
    • N5: Change my mood a lot
  • Factor 3: Conscientiousness: This factor is clearly defined by the Conscientiousness items. The negative loadings for C4 and C5 are expected as they are reverse-keyed items.

    • C4: Do things in a half-way manner
    • C2: Pay attention to details
    • C5: Waste my time
    • C3: Get chores done right away
    • C1: Am always prepared
  • Factor 4: Openness to Experience: This factor is defined by the Openness items. The negative loadings for the reverse-keyed items (O2, O5) fit the pattern perfectly.

    • O3: Have a vivid imagination
    • O5: Do not enjoy going to art museums
    • O1: Have a rich vocabulary
    • O2: Am not interested in abstract ideas
  • Factor 5: Agreeableness: This factor has high loadings for A1-A3, which means describes the agreeablness trait. A1 is reverse coded.

    • A1: Am indifferent to the feelings of others
    • A2: Am interested in people
    • A3: Sympathize with others’ feelings
  • Factor 6: A Weak or Ill-Defined Factor: This factor has no high loadings (all are below 0.4). This suggests that it does not represent a clear, substantial underlying trait. In a real data analysis, you would likely conclude that a 5-factor solution is more appropriate than a 6-factor solution, as this sixth factor is not interpretable and its SS loading is below the typical cutoff of 1.0.

  • Summary of Variance Explained: The five factors together explain 42.1% of the total variance in the personality items. However, since the fifth factor is not interpretable, a 4-factor solution explaining 38.2% of the variance would likely be the more practical and parsimonious choice.

Promax Transformation
  • The varimax and quartimax rotations produce uncorrelcted factors.
  • Promax is a non-orthogonal (oblique) transformation that
    1. is not a rotation,
    2. can produce correlated factors,
    3. and tries to force each of the p variables to load highly on one of the factors.
R Code: Promax Transformation
result = stats::promax(fit.mle$loadings)
result  
$loadings

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
A1 -0.169                   0.109  -0.606   0.171 
A2 -0.124                           0.658         
A3 -0.289                           0.513   0.165 
A4 -0.165          -0.194   0.190   0.332         
A5 -0.394  -0.191                   0.330   0.283 
C1                 -0.592  -0.134                 
C2                 -0.735                         
C3                 -0.610                  -0.183 
C4                  0.704          -0.113   0.582 
C5  0.101   0.102   0.576                   0.327 
E1  0.630  -0.210  -0.183          -0.117   0.114 
E2  0.729                                         
E3 -0.604                  -0.236           0.336 
E4 -0.726                   0.128           0.222 
E5 -0.497   0.225  -0.233  -0.169                 
N1 -0.136   0.926                          -0.123 
N2          0.938                          -0.211 
N3          0.654                           0.101 
N4  0.359   0.347                           0.240 
N5  0.148   0.380           0.172   0.118   0.174 
O1 -0.184                  -0.480           0.199 
O2                          0.504           0.181 
O3 -0.297                  -0.579           0.258 
O4  0.250                  -0.350   0.141   0.191 
O5                          0.586  -0.124   0.183 

               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      2.788   2.611   2.255   1.516   1.396   1.186
Proportion Var   0.112   0.104   0.090   0.061   0.056   0.047
Cumulative Var   0.112   0.216   0.306   0.367   0.423   0.470

$rotmat
            [,1]       [,2]       [,3]       [,4]       [,5]         [,6]
[1,]  0.45115168  0.5423513  0.2258156  0.1249663 -0.2001219  0.002288356
[2,] -0.53744441  0.8627228 -0.1458693 -0.1403394  0.2335533  0.066627707
[3,] -0.52840876 -0.1799881  0.9929386  0.2843336  0.2698631  0.523885615
[4,]  0.09623074  0.1553164 -0.5313589  0.9322012  0.2464121 -0.314175730
[5,]  0.78664249 -0.3674958 -0.0713201 -0.1901115  0.7539700  0.345176079
[6,] -0.19004259 -0.5154202 -0.1272708  0.1613486 -0.6725570  1.069855011
R Code: Promax Transformation
# compute correlations for factor loadings 
corfac = solve(t(result$rotmat) %*% result$rotmat)
corfac
           [,1]        [,2]        [,3]          [,4]       [,5]          [,6]
[1,]  1.0000000  0.34614290  0.34343748  0.1567108779 -0.3308908  0.1470709666
[2,]  0.3461429  1.00000000  0.06268382  0.0414459766 -0.0860179  0.4838349238
[3,]  0.3434375  0.06268382  1.00000000  0.1842433801 -0.2723639 -0.2866271867
[4,]  0.1567109  0.04144598  0.18424338  1.0000000000 -0.0640062  0.0005207452
[5,] -0.3308908 -0.08601790 -0.27236391 -0.0640061974  1.0000000  0.2205450811
[6,]  0.1470710  0.48383492 -0.28662719  0.0005207452  0.2205451  1.0000000000

A Promax rotation is used when we expect the underlying latent factors to be correlated. The interpretation focuses on the Pattern Matrix, which shows the unique contribution of each variable to a factor. We will consider loadings with an absolute value above 0.4 as significant for naming the factors.

Interpretation of Factors

  • Factor 1: Extraversion (vs. Introversion) This factor is clearly defined by the Extraversion items. The signs are flipped compared to the previous analyses, so high scores on this factor indicate Introversion.
    • E2: Find it difficult to approach others
    • E1: Don’t talk a lot
    • E4: Make friends easily
    • E3: Know how to captivate people
    • E5: Am the life of the party
  • Factor 2: Neuroticism This is a very strong and clear factor, capturing emotional stability. It is defined by high positive loadings from the Neuroticism items.
    • N1: Get stressed out easily
    • N2: Worry about things
    • N3: Am easily disturbed
  • Factor 3: Conscientiousness This factor is clearly defined by the Conscientiousness items. The negative loadings for C1-C3 are expected as they are reverse-keyed. This factor captures the opposite side of conscientiousness, such as lack of direction or irresponsibility.
    • C2: Pay attention to details
    • C4: Do things in a half-way manner
    • C3: Get chores done right away
    • C5: Waste my time
    • C1: Am always prepared
  • Factor 4: Openness to Experience (Reversed) This factor is defined by the Openness items, but the signs are flipped. High scores on this factor indicate a lower degree of openness or a preference for the concrete over the abstract.
    • O3: Have a vivid imagination
    • O1: Have a rich vocabulary
    • O5: Do not enjoy going to art museums
    • O2: Am not interested in abstract ideas
  • Factor 5: Agreeableness This factor is clearly defined by the Agreeableness items. The negative loading for the reverse-keyed item A1 is consistent with the trait.
    • A3: Sympathize with others’ feelings
    • A2: Am interested in people
    • A5: Take time out for others
    • A1: Am indifferent to the feelings of others

Interpreting Factor Correlations: A critical step in an oblique rotation is to examine the Factor Correlation Matrix. This matrix shows how the five personality factors are related to each other.

Here are the key relationships shown in the matrix:

  • Introversion and Neuroticism: There is a moderate positive correlation between Introversion (Factor 1) and Neuroticism (Factor 2). This is a classic finding in personality psychology: individuals who are more introverted also tend to be more prone to experiencing negative emotions like anxiety and stress.
  • Introversion and opposite Conscientiousness: There is a moderate negative correlation between Introversion (Factor 1) and Conscientiousness (Factor 3). This suggests that individuals who are more introverted tend to be slightly less conscientious (less organized, less disciplined). Conversely, more extraverted individuals tend to be more conscientious.
  • Neuroticism and Conscientiousness: There is a small negative correlation between Neuroticism (Factor 2) and Conscientiousness (Factor 3). This indicates that individuals who are more emotionally stable (low Neuroticism) tend to be more organized and self-disciplined (high Conscientiousness).
  • Conscientiousness and Openness: The matrix shows a small positive correlation between Factor 3 (lack of Conscientiousness) and Factor 4 (Low Openness). This actually means there is a small positive correlation between Conscientiousness and Openness. People who are more disciplined and organized also tend to be slightly more open to new experiences.
  • Introversion and Agreeableness: There is a small negative correlation between Introversion (Factor 1) and Agreeableness (Factor 5). This suggests that more extraverted individuals tend to be slightly more agreeable and sympathetic towards others.

Overall, these correlations align well with established findings in personality research and confirm that allowing the factors to correlate was a sensible decision for this analysis.

  • Summary of Variance Explained: The five factors together explain 42.3% of the total variance in the personality items. Note that in an oblique rotation, the variance explained by each factor is not simply additive because the factors themselves share some variance.

7.7 Estimation of Factor Scores

For each observation \mathbf{x}_i (or its z-score \mathbf{z}_i), we can estimate the (vector of) factor scores \mathbf{f}_i once L is estimated. In general, there are three methods to estimating factor scores:

  • ordinary least squares (OLS),
  • Weighted least squares (WLS),
  • and regression method.

These methods can be specified via the option scores in the factanal function. In practice, one use often the regression method, which estimate \mathbf{f}_i by the conditional mean given the observations: \hat{\mathbf{f}}_i:=E[\mathbf{f}_i| \mathbf{x}_i] = \hat{L}^\top (\hat{L}\hat{L}^\top + \hat{\Psi})(\mathbf{x}_i - \bar{\mathbf{x}}).

R Code: Factor Scores Estimation via Regression Method
fit = factanal(dat, factors=m, method="mle", 
               rotation="varimax", scores="regression")
head(fit$scores)
        Factor1    Factor2    Factor3    Factor4    Factor5     Factor6
[1,]  1.1256448  0.3347749  1.3311765 -0.3292141  0.2411510 -0.18141115
[2,] -1.4676037  0.4234724 -0.8052180 -1.6220757 -0.2979351 -0.76894404
[3,]  0.3348235 -0.1676162 -0.2626270 -0.2672289 -0.3954644 -0.04196801
[4,] -0.1021384 -0.4126529  0.6523772 -1.6680618 -0.2782254 -0.40778415
[5,]  0.2615627 -0.9838066 -1.6811813  0.7779028  0.4237653 -0.02633796
[6,]  0.4588120  1.0659874 -0.2737617  0.8797621  0.7791719  0.45639287
R Code: Plot Factor Scores
scores = fit$scores
df = cbind(df, scores)

ggplot(df, aes(x = Factor1, y = Factor2, color = factor(gender))) +
  geom_point(alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("blue", "red"), name = "Gender", labels = c("Male", "Female")) +
  labs(
    title = "Factor Scores: Agreeableness vs. Conscientiousness",
    subtitle = "Points colored by gender",
    x = "Factor 1: Agreeableness",
    y = "Factor 2: Conscientiousness"
  ) +
  theme_bw()

R Code: Plot Factor Scores
GGally::ggpairs(
  df,
  columns = 29:33, # The 6 factor score columns 
  aes(color = factor(gender), alpha = 0.6),
  title = "Pairwise Scatter Plot of Six Factor Scores by Gender"
) +
  theme_bw()

7.8 Exercise: Track Record Data

There are 54 countries’ track records for men and women. The dataset for women in records.women.csv includes variables x_1,\dots,x_7 for 100 m, 200 m, 400 m (seconds) and 800 m, 1500 m, 3000 m, marathon (minutes). The dataset for men in records.men.csv includes variables x_1,\dots,x_8 for 100 m, 200 m, 400 m (seconds) and 800 m, 1500 m, 5000 m, 10000 m, marathon (minutes). The first three times are in seconds and the remaining times are in minutes.

Are there any latent structures to explain the the performance difference between men and women?

Step 1: Data Preparation and Visualization

View Solution
Data Preparation
dat.m = read.csv("records.men.csv")
dat.w = read.csv("records.women.csv")
GGally::ggpairs(dat.m[,-1])

Data Preparation
GGally::ggpairs(dat.w[,-1])

Step 2: Check Assumptions

View Solution
R Code: Measures of Sampling Adequacy
dat = dat.m[,-1]
psych::KMO(dat.m[,-1]) 
Kaiser-Meyer-Olkin factor adequacy
Call: psych::KMO(r = dat.m[, -1])
Overall MSA =  0.89
MSA for each item = 
  x1   x2   x3   x4   x5   x6   x7   x8 
0.84 0.84 0.97 0.90 0.94 0.85 0.85 0.95 
R Code: Measures of Sampling Adequacy
psych::KMO(dat.w[,-1])
Kaiser-Meyer-Olkin factor adequacy
Call: psych::KMO(r = dat.w[, -1])
Overall MSA =  0.82
MSA for each item = 
  x1   x2   x3   x4   x5   x6   x7 
0.89 0.78 0.86 0.85 0.74 0.76 0.88 

Interpretation: Both of the overall MSAs are suggesting the data are appropriate for factor analysis.

R Code: Multivariate Normality Test
mvShapiroTest::mvShapiro.Test(as.matrix(dat))

    Generalized Shapiro-Wilk test for Multivariate Normality by
    Villasenor-Alva and Gonzalez-Estrada

data:  as.matrix(dat)
MVW = 0.95037, p-value = 1.277e-06

Step 3: Determine the Number of Factors to Extract

View Solution
Scree Plot and Hypothesis Tests
fit.pc = prcomp(dat, center=TRUE, scale=TRUE)
plot(fit.pc$sdev^2, type="b")

Scree Plot and Hypothesis Tests
sapply(1:3, function(f) 
factanal(dat, factors = f, method ="mle")$PVAL)
   objective    objective    objective 
5.849375e-16 1.734258e-02 2.226606e-01 

Step 4: Extract and Rotate the Factors

View Solution
R Code: Varimax Rotation
fit2 = factanal(dat, factors=2, method="mle", rotation="varimax")
fit3 = factanal(dat, factors=3, method="mle", rotation="varimax")

data.frame(
  m     = c(2,3),
  stat  = c(fit2$STATISTIC, fit3$STATISTIC),
  df    = c(fit2$dof,       fit3$dof),
  pval  = c(fit2$PVAL,      fit3$PVAL)
)
Factor Loadings
fit3$loadings

Loadings:
   Factor1 Factor2 Factor3
x1 0.366   0.866   0.187  
x2 0.374   0.829   0.322  
x3 0.472   0.676   0.302  
x4 0.538   0.441   0.715  
x5 0.671   0.494   0.443  
x6 0.842   0.426   0.307  
x7 0.870   0.400   0.278  
x8 0.837   0.377   0.266  

               Factor1 Factor2 Factor3
SS loadings      3.403   2.816   1.179
Proportion Var   0.425   0.352   0.147
Cumulative Var   0.425   0.777   0.925
View Solution
R Code: Promax Rotation
fit = factanal(dat, factors=3, method="mle", rotation="promax")
fit$loadings

Loadings:
   Factor1 Factor2 Factor3
x1          1.037  -0.130 
x2          0.906   0.119 
x3  0.202   0.633         
x4  0.112           0.887 
x5  0.505   0.179   0.327 
x6  0.918                 
x7  0.999                 
x8  0.966                 

               Factor1 Factor2 Factor3
SS loadings      3.085   2.334   0.935
Proportion Var   0.386   0.292   0.117
Cumulative Var   0.386   0.677   0.794
R Code: Factor Correlation Matrix
M = fit$rotmat
# correlation 
corfac = solve(t(M) %*% M)
corfac
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.8126219 0.7657261
[2,] 0.8126219 1.0000000 0.7710439
[3,] 0.7657261 0.7710439 1.0000000

Step 5: Interpret Factor Loadings

View Solution
Code
plt_loadings(fit$loadings, K=3, angle=0)

The data can be divided into three physiological groups:

  • Sprints: 100, 200, 400 (anaerobic exercise)

  • Middle distance: 800, 1500 (mixed energy)

  • Long distance: 3000 (women), 5000/10000, marathon (aerobic endurance)

  • The varimax rotation aims to explain the data using one overall common theme. We will consider a loading above 0.6 (or below -0.6) as significant.

    • Factor 1: This factor is defined by high positive loadings corresponding to middle distance and long distance. This measures aerobic endurance.
    • Factor 2: This factor is defined by high positive loadings corresponding to sprints. This measures anaerobic speed/power.
    • Factor 3: This factor has very high positive load on the variable of 800m. This indicates that both anaerobic speed/power and aerobic endurance are very important contributors for 800m.
  • What about interpretations using other rotations?