BT Question Set P1-T2-20-19: Regression diagnostics (1st set)

Last updated on Sep 21, 2020 30 min read R

P1.T2.20.19. Regression diagnostics: omitted variables, heteroskedasticity, and multicollinearity

Question 1: Fama-french 2-factor <- omitted variable

20.19.1. Jane manages a market-neutral equity fund for her investment management firm. The fund’s market-neutral style implies (we will assume) that the fund’s beta with respect to the market’s excess return is zero. However, the fund does seek exposure to other factors. The size factor captures the excess return of small-capitalization stocks (SMB = “small minus big”). Jane tests her portfolio’s exposure to the size factor by regressing the portfolio’s excess return against the size factor returns. Her regression takes the form PORTFOLIO(i) = α + β1×SMB(i) + ε(i). The results of this single-variable (aka, simple) regression are displayed below.

… univariate regression output …

In this simple regression, we can observe that SMB’s coefficient is 0.6771 and significant. Jane is concerned that this simple regression might suffer from omitted variable bias. Specifically, she thinks the value factor has been omitted. The value factor captures the excess returns of value stocks (HML = “high book-to-market minus low book-to-market”)’. She confirms that the omitted variable, HML, is associated with her response variable. Further, the omitted variable, HML, is correlated to SMB. The correlation between HML and SMB is 0.30. Jane runs a multivariate regression with both explanatory variables, SMB and HML; in this regression, HML’s beta coefficient is 0.7240 such that the new term is β2×HML(i) = 0.7240×HML(i). Which of the following is nearest to the revised SMB coefficient; i.e., what is the revised β1?

library(tidyverse)
library(broom)
library(gt)

intercept <- .03
intercept_sig <- .01

# x1 is market factor
x1_mu <- .05
x1_sig <- .01
x1_beta <- 0.5

# x2 is size factor
x2_mu <- .04
x2_sig <- .01
x2_beta <- 0.7

#x3 is value factor
x3_mu <- .03
x3_sig <- .01
x3_beta <- -0.5

noise_mu <- 0
noise_sig <- 0 # low value gets low p-value b/c low noise

size <- 120
set.seed(52)

results <- tibble(
  x0 = rnorm(size, intercept, intercept_sig),
  x1 = rnorm(size, x1_mu, x1_sig),
  x2 = rnorm(size, x2_mu, x2_sig),
  x2_r = 0.3 * x1 + sqrt(1 - 0.3^2)* x2,
  x3 = rnorm(size, x3_mu, x3_sig),
  x1_b = rep(x1_beta, size),
  x2_b = rep(x2_beta, size),
  x3_b = rep(x3_beta, size),
  noise = rnorm(size, 0, noise_sig)
)

results1 <- results %>% mutate(
  y1 = x0 + x1_b * x1 + noise, # market factor only
  y2 = x0 + x1_b * x1 + x2_b * x2_r + noise, # market and size
  y3 = x0 + x1_b * x1 + x2_b * x2 + x3_b * x3 + noise # all three
)

model_y1 <- lm(y2 ~ x1, data = results1) # market only
model_y2 <- lm(y2 ~ x1 + x2_r, data = results1) # market and size
model_y3 <- lm(y3 ~ x1 + x2 + x3, data = results1)

cor_x1_x2 <- cor(results1$x1, results1$x2_r)
cov_x1_x2 <- cov(results1$x1, results1$x2_r)
var_x1 <- var(results1$x1)
var_x2 <- var(results1$x2_r)
beta_x1 <- cov_x1_x2/var_x1

model_omit <- lm(x2_r ~ x1, data = results1)

# model_y1 suffers omitted variable bias: x1 = 0.6771 as given in problem
summary(model_y1)

## 
## Call:
## lm(formula = y2 ~ x1, data = results1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.038216 -0.008610 -0.001528  0.008894  0.033695 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.058839   0.005298  11.105  < 2e-16 ***
## x1          0.677063   0.106361   6.366 3.86e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01155 on 118 degrees of freedom
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2493 
## F-statistic: 40.52 on 1 and 118 DF,  p-value: 3.864e-09

# And note that "true" X1 = 0.463 per the solution's 0.460
# Problem provides rounded 0.7240 as HML coefficient
summary(model_y2)

## 
## Call:
## lm(formula = y2 ~ x1 + x2_r, data = results1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0221522 -0.0069359  0.0000787  0.0066889  0.0227758 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.030915   0.005373   5.754 7.12e-08 ***
## x1          0.463067   0.088293   5.245 7.05e-07 ***
## x2_r        0.724203   0.086597   8.363 1.48e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009173 on 117 degrees of freedom
## Multiple R-squared:  0.5341, Adjusted R-squared:  0.5262 
## F-statistic: 67.07 on 2 and 117 DF,  p-value: < 2.2e-16

model_tidy <- tidy(model_y1)
model_tidy[2,1] <- "SMB"

gt_table_model <- gt(model_tidy)

gt_table_model <- 
  gt_table_model %>% 
  tab_options(
    table.font.size = 14
  ) %>% 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body()
  ) %>% 
  tab_header(
    title = "Market-neutral portfolio excess returns regressed against SMB",
    subtitle = md("(but HML is an ***omitted variable***)")
  #) %>% tab_source_note(
  #   source_note = md("the source is ... FRED")
  ) %>% cols_label(
    term = "Coefficient",
    estimate = "Estimate",
    std.error = "Std Error",
    statistic = "t-stat",
    p.value = "p value"
  ) %>% fmt_number(
    columns = vars(estimate, std.error, statistic, p.value),
    decimals = 4
  ) %>% fmt_scientific(
    columns = vars(statistic, p.value),
  ) %>% tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 14
  )

gt_table_model

Market-neutral portfolio excess returns regressed against SMB
(but HML is an omitted variable)
Coefficient	Estimate	Std Error	t-stat	p value
(Intercept)	0.0588	0.0053	1.11 × 10¹	4.74 × 10⁻²⁰
SMB	0.6771	0.1064	6.37	3.86 × 10⁻⁹

Question 2: House Prices - Heteroskedasticity

20.19.2. Josh regressed house prices (as the response or dependent variable) against two explanatory variables: square footage (SQFEET) and the number of rooms in the house (ROOMS). The dependent variable, PRICE, is expressed in thousands of dollars ($000); e.g., the average PRICE is $XXX.000 because the average house price in the sample of 150 houses is $XXX.000. The units of SQFEET are unadjusted units; e.g., the average SQFEET in the sample is X,XXX ft^2. The variable ROOMS is equal to the sum of the number of bedrooms and bathrooms; because much of the sample is 2- and 3-bedroom houses with 2 baths, the average of ROOM is X.XX. Josh’s regression results are displayed below.

Josh is concerned that the data might not be homoscedastic. He decides to conduct a White test for heteroskedasticity. In this test, he regresses the squared residuals against each of the explanatory variables and the cross-product of the explanatory variables (including the product of each variable with itself). The results of this regression are displayed below.

# library(tidyverse)
# library(broom)
# library(gt)

intercept <- 40
intercept_sig <- .01

x1_mu <- 1000
x1_sig <- 400
x1_beta <- 0.3

x2_mu <- 4.5
x2_sig <- 2
x2_beta <- 20.0

noise_mu <- 0
noise_sig <- 500 # low value gets low p-value b/c low noise

size <- 150
set.seed(65)

rho_noise_x1 <- 0.7

ft2_start = 800
ft2_end = 3000

results <- tibble(
  x0_sn = rnorm(size),
  x1_sn = rnorm(size),
  x2_sn = rnorm(size),
  e = rnorm(size),
  # e_corr = rho_noise_x1 * x1_sn + sqrt(1 - rho_noise_x1^2) * noise_sn,
  # 
  x0 = intercept + x0_sn * intercept_sig,
  # x1 = x1_mu + x1_sn * x1_sig, 
  x1 = seq(ft2_start, ft2_end - 1, by = (ft2_end - ft2_start)/size),
  x2 = x2_mu + x2_sn * x2_sig,
  # 
  # here the heteroskadisticity is deliberately introduced:
  # the error term's sigma is increasing with SQFEET, x1
  e_sigma = ((x1 - ft2_start)/100)^2, 
  # noise = noise_sn_corr * noise_sig, 
  #  
  x1_b = rep(x1_beta, size),
  x2_b = rep(x2_beta, size)
)

results1 <- results %>% mutate(
  y = x0 + (x1_b * x1) + (x2_b * x2) + (e * e_sigma)
)

# cor(results1$e * results1$e_sigma, results1$x1)

model_house <- lm(y ~ x1 + x2, data = results1)
summary(model_house)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = results1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -710.33  -79.41    8.68   61.16  731.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.45712   69.59813  -0.136    0.892    
## x1           0.36961    0.02734  13.522   <2e-16 ***
## x2           8.78355    8.61236   1.020    0.309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.1 on 147 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.5488 
## F-statistic: 91.61 on 2 and 147 DF,  p-value: < 2.2e-16

model_tidy_house <- tidy(model_house)
model_tidy_house[2,1] <- "SQFEET"
model_tidy_house[3,1] <- "ROOMS"

gt_table_model_house <- gt(model_tidy_house)

gt_table_model_house <- 
  gt_table_model_house %>% 
  tab_options(
    table.font.size = 12
  ) %>% 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body()
  ) %>% 
  tab_header(
    title = "House Price regressed against ft^2 (SQFEET) + ROOMS(#)",
    subtitle = md("House Price in Thousands **($000)** of dollars")
  ) %>% tab_source_note(
    source_note = md("Residual standard error: 211.1 on 147 degrees of freedom")
  ) %>% tab_source_note(
    source_note = md("Multiple R-squared:  0.5548,  Adjusted R-squared:  0.5488")
  ) %>% tab_source_note(
    source_note = md("F-statistic: 91.61 on 2 and 147 DF,  p-value: < 2.2e-16")  
  ) %>% cols_label(
    term = "Coefficient",
    estimate = "Estimate",
    std.error = "Std Error",
    statistic = "t-stat",
    p.value = "p value"
  ) %>% fmt_number(
    columns = vars(estimate, std.error, statistic, p.value),
    decimals = 3
  ) %>% fmt_scientific(
    columns = vars(p.value),
  ) %>% tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12,
    source_notes.font.size = 12
  )

gt_table_model_house

House Price regressed against ft^2 (SQFEET) + ROOMS(#)
House Price in Thousands ($000) of dollars
Coefficient	Estimate	Std Error	t-stat	p value
(Intercept)	−9.457	69.598	−0.136	8.92 × 10⁻¹
SQFEET	0.370	0.027	13.522	1.40 × 10⁻²⁷
ROOMS	8.784	8.612	1.020	3.09 × 10⁻¹
Residual standard error: 211.1 on 147 degrees of freedom
Multiple R-squared: 0.5548, Adjusted R-squared: 0.5488
F-statistic: 91.61 on 2 and 147 DF, p-value: < 2.2e-16

mean(results1$y) # price

## [1] 728.2827

mean(results1$x0) # intercept

## [1] 39.99987

mean(results1$x1) # sqfeet

## [1] 1892.667

mean(results1$x2) # rooms

## [1] 4.347188

ressq <- resid(model_house)^2
results_v2 <- cbind(results1, ressq)
white_test <- lm(ressq ~ x1 * x2 + I(x1^2) + I(x2^2), data = results_v2)
summary(white_test)

## 
## Call:
## lm(formula = ressq ~ x1 * x2 + I(x1^2) + I(x2^2), data = results_v2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -163838  -26649   -2683    6834  385087 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.594e+05  7.490e+04   2.128 0.035015 *  
## x1          -1.962e+02  6.936e+01  -2.828 0.005346 ** 
## x2          -1.578e+04  1.490e+04  -1.059 0.291521    
## I(x1^2)      6.017e-02  1.742e-02   3.454 0.000727 ***
## I(x2^2)      1.494e+02  1.265e+03   0.118 0.906193    
## x1:x2        1.002e+01  4.936e+00   2.030 0.044183 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76920 on 144 degrees of freedom
## Multiple R-squared:  0.3381, Adjusted R-squared:  0.3152 
## F-statistic: 14.71 on 5 and 144 DF,  p-value: 1.201e-11

white_tidy <- tidy(white_test)
white_tidy[2,1] <- "SQFEET"
white_tidy[3,1] <- "ROOMS"
white_tidy[4,1] <- "SQFEET^2"
white_tidy[5,1] <- "ROOMS^2"
white_tidy[6,1] <- "SQFEET*ROOMS"

gt_table_white <- gt(white_tidy)

gt_table_white <- 
  gt_table_white %>% 
  tab_options(
    table.font.size = 12,
  ) %>% 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body()
  ) %>% 
  tab_header(
    title = "RESIDUAL^2 regressed against SQFEET + ROOMS + SQFEET^2 + ROOMS^2 + ROOMS*SQFEET",
    subtitle = md("White's Test for Heteroskedasticity")
  ) %>% tab_source_note(
    source_note = md("Residual standard error: 76920 on 144 degrees of freedom")
  ) %>% tab_source_note(
    source_note = md("Multiple R-squared: 0.3381,   Adjusted R-squared: 0.3152, F-statistic: 14.71 on 5 and 144 DF, p-value: 1.201e-11")
  ) %>% cols_label(
    term = "Coefficient",
    estimate = "Estimate",
    std.error = "Std Error",
    statistic = "t-stat",
    p.value = "p value"
  ) %>% fmt_number(
    columns = vars(estimate, std.error, statistic, p.value),
    decimals = 3
  ) %>% fmt_scientific(
     columns = vars(estimate, std.error),
  ) %>% tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12,
    source_notes.font.size = 12
  )

gt_table_white

RESIDUAL^2 regressed against SQFEET + ROOMS + SQFEET^2 + ROOMS^2 + ROOMS*SQFEET
White's Test for Heteroskedasticity
Coefficient	Estimate	Std Error	t-stat	p value
(Intercept)	1.59 × 10⁵	7.49 × 10⁴	2.128	0.035
SQFEET	−1.96 × 10²	6.94 × 10¹	−2.828	0.005
ROOMS	−1.58 × 10⁴	1.49 × 10⁴	−1.059	0.292
SQFEET^2	6.02 × 10⁻²	1.74 × 10⁻²	3.454	0.001
ROOMS^2	1.49 × 10²	1.27 × 10³	0.118	0.906
SQFEET*ROOMS	1.00 × 10¹	4.94	2.030	0.044
Residual standard error: 76920 on 144 degrees of freedom
Multiple R-squared: 0.3381, Adjusted R-squared: 0.3152, F-statistic: 14.71 on 5 and 144 DF, p-value: 1.201e-11

# and the heteroskadisticity is visually VERY obvious!
results_v2 %>% ggplot(aes(x = x1, y = y)) +
  geom_point()

Question 3: Multicollinearity

20.19.3. Emily works for an insurance company and she has regressed medical costs (aka, the response or dependent variable) for a sample of patients against three independent variables: AGE, BMI, and CHARITY. The sample’s average age is X.X years. Body mass index (BMI) is mass divided by height squared and the sample’s average BMI is X.Y. CHARITY is the dollar amount of charitable spending in the last year; the sample average is $X.Y donated to charity in the last year. Emily’s regression results are displayed below.

library(tidyverse)
library(broom)
library(gt)

intercept <- 150
intercept_sig <- 40

# age
x1_mu <- 38
x1_sig <- 7
x1_beta <- 50

# bmi
x2_mu <- 22
x2_sig <- 4
x2_beta <- 100

# spend
x3_mu <- 500
x3_sig <- 250
x3_beta <- -0.4

rho_x1_x2 = 0.3
rho_x1_x3 = 0.97

noise_mu <- 0
noise_sig <- 300 # low value gets low p-value b/c low noise

size <- 43
set.seed(12)

results <- tibble(
  x0_sn <- rnorm(size),
  x1_sn <- rnorm(size),
  x2_sn <- rnorm(size),
  x3_sn <- rnorm(size),
  #
  x0 = intercept + intercept_sig * x0_sn,
  #
  x1 = x1_mu + x1_sig * x1_sn,
  x2_sn_corr = rho_x1_x2*x1_sn + sqrt(1 - rho_x1_x2^2)*x2_sn,
  x2 = x2_mu + x2_sig * x2_sn_corr,
  x3_sn_corr = rho_x1_x3*x1_sn + sqrt(1 - rho_x1_x3^2)*x3_sn,
  x3 = x3_mu + x3_sig * x3_sn_corr,
  #
  x1_b = rep(x1_beta, size),
  x2_b = rep(x2_beta, size),
  x3_b = rep(x3_beta, size),
  noise = rnorm(size, 0, noise_sig)
)


results1 <- results %>% mutate(
  y = x0 + (x1_b * x1) + (x2_b * x2) + (x3_b * x3) + noise
)

cor(results1$x1, results1$x3)

## [1] 0.962464

model <- lm(y ~ x1 + x2 + x3, data = results1)

model_x1 <- lm(x1 ~ x2 + x3, data = results1)
summary(model_x1)

## 
## Call:
## lm(formula = x1 ~ x2 + x3, data = results1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2878 -0.9601 -0.2345  1.1477  3.4376 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.022184   1.622772  14.803   <2e-16 ***
## x2           0.029178   0.072426   0.403    0.689    
## x3           0.026997   0.001234  21.883   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.733 on 40 degrees of freedom
## Multiple R-squared:  0.9266, Adjusted R-squared:  0.923 
## F-statistic: 252.6 on 2 and 40 DF,  p-value: < 2.2e-16

rsq_x1 <- summary(model_x1)$r.squared

model_x2 <- lm(x2 ~ x1 + x3, data = results1)
summary(model_x2)

## 
## Call:
## lm(formula = x2 ~ x1 + x3, data = results1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9298 -2.1503 -0.5579  2.5672  7.9527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 16.9897894  8.5885276   1.978   0.0548 .
## x1           0.1384986  0.3437854   0.403   0.6892  
## x3          -0.0001614  0.0096805  -0.017   0.9868  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.776 on 40 degrees of freedom
## Multiple R-squared:  0.04833,    Adjusted R-squared:  0.0007499 
## F-statistic: 1.016 on 2 and 40 DF,  p-value: 0.3713

rsq_x2 <- summary(model_x2)$r.squared

model_x3 <- lm(x3 ~ x1 + x2, data = results1)
summary(model_x3)

## 
## Call:
## lm(formula = x3 ~ x1 + x2, data = results1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.098  -39.213    9.803   48.540   99.936 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -802.99922   74.04616 -10.845 1.77e-13 ***
## x1            34.18590    1.56221  21.883  < 2e-16 ***
## x2            -0.04306    2.58249  -0.017    0.987    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.67 on 40 degrees of freedom
## Multiple R-squared:  0.9263, Adjusted R-squared:  0.9227 
## F-statistic: 251.5 on 2 and 40 DF,  p-value: < 2.2e-16

rsq_x3 <- summary(model_x3)$r.squared

var_if_x1 <- 1/(1 - rsq_x1 )
var_if_x2 <- 1/(1 - rsq_x2)
var_if_x3 <- 1/(1 - rsq_x3)

multi_table <- tibble(
  regression = c(1,2,3),
  response = c("AGE", "BMI", "CHARITY"),
  explanatory = c("BMI + CHARITY", "AGE + CHARITY", "AGE + BMI"),
  Rsquared = c(rsq_x1, rsq_x2, rsq_x3)
)


model_tidy <- tidy(model)
model_tidy[2,1] <- "AGE"
model_tidy[3,1] <- "BMI"
model_tidy[4,1] <- "CHARITY"

gt_table_model <- gt(model_tidy)

gt_table_model <- 
  gt_table_model %>% 
  tab_options(
    table.font.size = 12
  ) %>% 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body()
  ) %>% 
  tab_header(
    title = "Medical COST regressed against AGE + BMI + CHARITY($)",
    subtitle = md("Simulated data")
  ) %>% tab_source_note(
    source_note = md("Residual standard error: 325.1 on 39 degrees of freedom")
  ) %>% tab_source_note(
    source_note = md("Multiple R-squared: 0.6961,   Adjusted R-squared: 0.6727, F-statistic: 29.77 on 3 and 39 DF, p-value: 3.514e-10")
  ) %>% cols_label(
    term = "Coefficient",
    estimate = "Estimate",
    std.error = "Std Error",
    statistic = "t-stat",
    p.value = "p value"
  ) %>% fmt_number(
    columns = vars(estimate, std.error, statistic, p.value),
    decimals = 2
  ) %>% fmt_scientific(
    columns = vars(p.value),
  ) %>% tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12,
     source_notes.font.size = 12
  )

##

gt_multi_table <- gt(multi_table)

gt_multi_table <- 
  gt_multi_table %>% 
  tab_options(
    table.font.size = 12
  ) %>% 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body()
  ) %>% 
  tab_header(
    title = "Each response variable regressed against the others",
    subtitle = md("Testing for multicollinearity")
  ) %>% tab_source_note(
    source_note = md("Note: According to GARP, the standard test of")
  ) %>% tab_source_note(
    source_note = md("multicollinearity is the variance inflation factor (VIF)")
  ) %>% cols_label(
    regression = "Regression",
    response = "Response",
    explanatory = "Explanatory",
    Rsquared = "R-squared",
  ) %>% cols_align(
    align = "center",
    columns = vars(regression)
  ) %>% fmt_number(
    columns = vars(Rsquared),
    decimals = 3
  # ) %>% fmt_scientific(
    # columns = vars(p.value),
  ) %>% tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12,
     source_notes.font.size = 12
  )

summary(model)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = results1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -923.93 -243.27   21.75  192.64  554.97 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -165.2505   774.7267  -0.213   0.8322    
## x1            63.6695    29.6571   2.147   0.0381 *  
## x2           102.4645    13.6123   7.527 4.09e-09 ***
## x3            -0.9461     0.8334  -1.135   0.2632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 325.1 on 39 degrees of freedom
## Multiple R-squared:  0.6961, Adjusted R-squared:  0.6727 
## F-statistic: 29.77 on 3 and 39 DF,  p-value: 3.514e-10

summary(multi_table)

##    regression    response         explanatory           Rsquared      
##  Min.   :1.0   Length:3           Length:3           Min.   :0.04833  
##  1st Qu.:1.5   Class :character   Class :character   1st Qu.:0.48734  
##  Median :2.0   Mode  :character   Mode  :character   Median :0.92634  
##  Mean   :2.0                                         Mean   :0.63377  
##  3rd Qu.:2.5                                         3rd Qu.:0.92649  
##  Max.   :3.0                                         Max.   :0.92663

gt_table_model

Medical COST regressed against AGE + BMI + CHARITY($)
Simulated data
Coefficient	Estimate	Std Error	t-stat	p value
(Intercept)	−165.25	774.73	−0.21	8.32 × 10⁻¹
AGE	63.67	29.66	2.15	3.81 × 10⁻²
BMI	102.46	13.61	7.53	4.09 × 10⁻⁹
CHARITY	−0.95	0.83	−1.14	2.63 × 10⁻¹
Residual standard error: 325.1 on 39 degrees of freedom
Multiple R-squared: 0.6961, Adjusted R-squared: 0.6727, F-statistic: 29.77 on 3 and 39 DF, p-value: 3.514e-10

gt_multi_table

Each response variable regressed against the others
Testing for multicollinearity
Regression	Response	Explanatory	R-squared
1	AGE	BMI + CHARITY	0.927
2	BMI	AGE + CHARITY	0.048
3	CHARITY	AGE + BMI	0.926
Note: According to GARP, the standard test of
multicollinearity is the variance inflation factor (VIF)

mean(results1$y) # cost

## [1] 4079.452

mean(results1$x0) # intercept

## [1] 145.2152

mean(results1$x1) # age

## [1] 38.48411

mean(results1$x2) # bmi

## [1] 22.2372

mean(results1$x3) # spend

## [1] 511.6571

# let's see the VIFs
var_if_x1

## [1] 13.63041

var_if_x2

## [1] 1.050788

var_if_x3

## [1] 13.57542

regression

David Harper

Founder & CEO of Bionic Turtle

I teach financial risk and enjoy learning data science