BT Question P1-T2-20-16-3: Univariate regression: Monthly rental versus footage

20.16.3. Sally works at a real estate firm, and her client asked her to quantify the relationship between rental size (in square feet) and rental price. She explained to her client that the relationship is multivariate but, given that caveat, offered to perform a linear regression with a single explanatory variable. She retrieved a large dataset (n = 360,400 observations covering rentals across the United States) and regressed monthly rental price (the explained variable) against rental size in square feet (the explanatory variable). To illustrate the units, one of the data points in the dataset is (y = $1,200 per month, X = 1,000 square feet). The results are displayed below.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.2
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.1
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## -- Conflicts --------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gt)
## Warning: package 'gt' was built under R version 4.0.2
library(broom)
## Warning: package 'broom' was built under R version 4.0.2
# One-time data preparation (kept for reference; the saved .rds is loaded below)
# rentals_raw <- read_csv("housing.csv")
# rentals_sort <- rentals_raw %>% arrange(price)
# rentals_df1 <- rentals_raw %>% filter(price > 500, price < 10000, 
#                                       sqfeet > 500, sqfeet < 10000)
# boxplot(rentals_raw$price)
# boxplot(rentals_raw$price)$out
# 
# rentals_df1 <- rentals_df1 %>% rename(
#  "Price" = "price",
#  "SquareFeet" = "sqfeet")
# 
# saveRDS(rentals_df1, "rentals-sm.rds")

con <- url("http://frm-bionicturtle.s3.amazonaws.com/david/rentals-sm.rds")
rentals_df1 <- readRDS(con)
close(con)

model1 <- rentals_df1 %>% lm(Price ~ SquareFeet, data =  .)
summary(model1)
## 
## Call:
## lm(formula = Price ~ SquareFeet, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5382.8  -325.1  -122.7   185.9  8262.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 624.42303    2.59775   240.4   <2e-16 ***
## SquareFeet    0.57889    0.00239   242.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 545.4 on 360399 degrees of freedom
## Multiple R-squared:   0.14,  Adjusted R-squared:   0.14 
## F-statistic: 5.866e+04 on 1 and 360399 DF,  p-value: < 2.2e-16
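
The summary statistics above are linked by simple arithmetic, which the following minimal sketch checks by hand. It assumes only the rounded values printed in the output (the variable names `est`, `se`, `t_stat`, `f_stat`, and `r_abs` are illustrative and not part of the original script), so the results agree with the printed figures only up to rounding.

```r
# Values copied (rounded) from the summary(model1) output above
est <- c(intercept = 624.42303, slope = 0.57889)
se  <- c(intercept = 2.59775,   slope = 0.00239)

# Each t value is the estimate divided by its standard error
t_stat <- est / se

# With a single regressor, the F-statistic equals the squared slope t value
f_stat <- t_stat[["slope"]]^2

# With a single regressor, R-squared is the squared correlation,
# so |correlation| = sqrt(R-squared); the sign follows the (positive) slope
r_abs <- sqrt(0.14)

round(t_stat, 1)
f_stat
round(r_abs, 3)
```

The low R-squared (0.14) alongside very large t values is not a contradiction: with roughly 360,000 observations the slope is estimated precisely, even though square footage alone explains only a small share of the variation in price.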
# Sample means of price and size; the fitted OLS line passes through this point
price_avg_act <- mean(rentals_df1$Price)
size_ave_act <- mean(rentals_df1$SquareFeet)

new.df.rentals <- data.frame(SquareFeet = c(1000, 1500, 1800, 2000, 2500))
predict(model1, new.df.rentals)
##        1        2        3        4        5 
## 1203.313 1492.758 1666.425 1782.203 2071.648
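
The predictions above are just the fitted line evaluated at each size. As a rough check that does not require the dataset, the sketch below rebuilds them from the rounded coefficients in the summary output; `b0`, `b1`, `sqft`, `pred`, and `resid_example` are illustrative names, not part of the original script.

```r
# Rounded coefficients copied from the summary(model1) output
b0 <- 624.42303   # intercept, dollars per month
b1 <- 0.57889     # slope, dollars per month per square foot

sqft <- c(1000, 1500, 1800, 2000, 2500)

# Predicted price = intercept + slope * square feet
pred <- b0 + b1 * sqft
round(pred, 3)

# Residual for the example data point (y = $1,200, X = 1,000 square feet):
# actual price minus predicted price
resid_example <- 1200 - pred[1]
round(resid_example, 3)
```

Under these rounded inputs, the example rental at $1,200 per month sits about $3 below the fitted line.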
model1_tidy <- tidy(model1)

gt_table_rentals <- gt(model1_tidy)

gt_table_rentals <- 
  gt_table_rentals %>% 
  tab_options(
    table.font.size = 14
  ) %>% 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_body()
  ) %>% 
  tab_header(
    title = "Monthly Rental PRICE regressed against Square Feet",
    subtitle = md("Entire United States, n = 360,400 observations")
  ) %>% 
  tab_source_note(
    source_note = "Source: USA Housing Listings @ kaggle https://www.kaggle.com/datasets"
  ) %>% cols_label(
    term = "Coefficient",
    estimate = "Estimate",
    std.error = "Std Error",
    statistic = "t-stat",
    p.value = "p value"
  ) %>% fmt_number(
    columns = vars(estimate, std.error, statistic),
    decimals = 3
  ) %>% fmt_scientific(
    columns = vars(p.value)
  ) %>% 
  tab_options(
    heading.title.font.size = 14,
    heading.subtitle.font.size = 12
  )
  
gt_table_rentals
Monthly Rental PRICE regressed against Square Feet
Entire United States, n = 360,400 observations

Coefficient    Estimate   Std Error    t-stat   p value
(Intercept)     624.423       2.598   240.370      0.00
SquareFeet        0.579       0.002   242.206      0.00

Source: USA Housing Listings @ kaggle https://www.kaggle.com/datasets
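
One way to read the slope row of the table is through a confidence interval. The sketch below is an illustration rather than part of the original analysis: it assumes the rounded slope and standard error from the summary output and the usual two-sided 95% level, and the names `slope`, `se_slope`, `df_resid`, `t_crit`, and `ci` are hypothetical.

```r
# Rounded values copied from the regression output
slope    <- 0.57889   # dollars per month per square foot
se_slope <- 0.00239
df_resid <- 360399    # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
# (effectively the normal value of about 1.96 at this sample size)
t_crit <- qt(0.975, df_resid)

# Confidence interval: estimate +/- critical value * standard error
ci <- slope + c(-1, 1) * t_crit * se_slope
round(ci, 4)

# The same interval expressed per additional 100 square feet
round(ci * 100, 2)
```

Under these assumptions, each additional 100 square feet is associated with roughly $57 to $58 of additional monthly rent; this describes an association in the sample, not a causal effect.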
rentals_df1 %>% ggplot(aes(SquareFeet, Price)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

David Harper
Founder & CEO of Bionic Turtle

I teach financial risk and enjoy learning data science
