HW7: Creating Fake Data Sets To Explore Hypotheses
April 15, 2024
Open relevant libraries
library(tidyverse) # To use ggplot2 and dplyr packages
library(kableExtra) # To make fancy tables
library(gridExtra)
Background
My lab studied gene expression from human alveolar type 2 cells grown on different stiffness plates (2kPA, 16kPA and 64kPA) in different media (expansion and differentiation media) for various culture duration (Time). It was hypothesized that stiffness, media and duration (time) will affect gene expression. Assessing the validity of this hypotheses involves several statistical analyses, however for the sake of simplicity only one predictor variable ‘Time’ and one response variable ‘AGER’ gene expression will be analysed.
The ‘Time’ variable is a continuous variable denoting the number of hours post-seeding and the ‘AGER’ variable is the differential gene expression for the AGER gene and is also continuous. Because both the predictor and response variables are continuous, the linear regression analysis will the most appropriate statistical test to assess the relationship between both variables.
For this exercise different normal distribution simulated data sets will be used to visualize how means and sample sizes effect the linear regression output.
User-designed Functions
In order to repeat the linear regression analysis multiples times with different random data, 3 functions were designed to simplify this process. The user-designed functions are:
- sim_norm(): a function that creates a simulated (from a normal
distribution) AGER gene expression dataset for different culture
durations.
- Input Argument: ‘list’ (a list containing the mean, variance and size of each ‘Time’ group)
- Output: ‘sim_df’ (a dataframe containing a unique ID variable, Time and sim_AGER (simulated from normal distribution))
- Reg_Stat(): Performs the linear regression analysis on a dataset
- Input Arguments: ‘df’ (a dataframe containing the variable the linear regression will be ran on)
- Output: ‘a’ (an atomic vector containing the coefficient ‘b’ (Response = a + bPredictor), the p-value and conclusion (significant (p < 0.05) or insignificant))
- Reg_Plot(): Plot the dataset
- Input Arguments: ‘df’ (a dataframe containing the variable to be plotted)
- Output: ‘plot’ (a scatterplot with a trendline showing the correlation between the response and predictor variables))
Below shows the code for the functions:
##############################################################################################
# FUNCTION: sim_rnorm()
# Description: a function that creates a simulated gene expression dataset for different culture durations.
# Input Arguments: 'list' (a list containing the mean, variance and size of each 'Time' categories)
# Output: 'sim_df' (a dataframe containing a unique ID variable, Time and sim_AGER (simulated from normal distribution))
#___________________________________________________________________________________________
<- function(list = input){
sim_rnorm
#Times = c("24h", "+4h", "+24h", "5d")
= rnorm(n = list[[1]][1], mean = list[[1]][2], sd = sqrt(list[[1]][3]))
sim24h = rnorm(n = list[[2]][1], mean = list[[2]][2], sd = sqrt(list[[2]][3]))
sim48h = rnorm(n = list[[3]][1], mean = list[[3]][2], sd = sqrt(list[[3]][3]))
sim72h = rnorm(n = list[[4]][1], mean = list[[4]][2], sd = sqrt(list[[4]][3]))
sim144h
<- c(sim24h, sim48h, sim72h, sim144h)
sim_AGER <- 1:length(sim_AGER)
ID <- c (rep(24, length(sim24h)), rep(48, length(sim48h)), rep(72, length(sim72h)), rep(144, length(sim144h))) # Assigning Time categories to the rnorm() generated data
Time
<- data.frame(ID, Time, sim_AGER) # Create a dataframe from the simulated Time and AGER data.
sim_df
return(sim_df)
}##############################################################################################
##############################################################################################
# FUNCTION: Reg_Stat()
# Description: Performs linear regression on a dataset
# Input Arguments: 'df' (a dataframe containing the variable the linear regression will be ran on)
# Output: ‘a’ (an atomic vector containing the coefficient 'b' (Response = a + bPredictor), the p-value and conclusion (significant or insignificant))
#___________________________________________________________________________________________
<- function(df = DF){
Reg_Stat
<- lm(sim_AGER ~ Time, data = df)
z <-summary(z)$coefficients[2,1] # pulls out the coeffiecient of the relationship btw the response and predictor variables
coeff<- summary(z)$coefficients[2,4] # pulls out the p-value of the relationship btw the response and predictor variables
pVal
<-c(signif(coeff, 4), signif(pVal, 4), ifelse(pVal < 0.05, "Significant", "Non-significant"))
a names(a) <- c("Coefficient", "P-Value", "Conclusion")
return(a)
}##############################################################################################
##############################################################################################
# FUNCTION: Reg_Plot()
# Description: Plot the datset
# Input Arguments: ‘df’ (a dataframe containing the variable to be plotted)
# Output:'plot'
#___________________________________________________________________________________________
<- function(df = DF){
Reg_Plot
### Plot of the data
<- ggplot(df) +
plot aes(x = Time, y = sim_AGER) +
geom_point() +
stat_smooth(method=lm,se=0.99) # default se=0.95
return(plot)
}##############################################################################################
Running Linear Regression on Different Randomly Generated Datasets with the Same Means, Variance and Size
Getting group means, variances and sizes that will yield significant effects on AGER response
Instead of guessing the sample sizes, means and variances for each ‘Time’ group that would yield significant effect in AGER expression, the sample sizes, means and variances of AGER variable for each Time group from my lab’s dataset will be used; as a quick linear regression analysis on AGER variable with Time as the predictor yields a significant p-value (p = 0.0002489788). See code chuck:
<- data.frame(read.csv("RawData.csv")) # read in the original csv dataset into a dataframe
df
= lm(df$AGER ~ df$Time, data = df)
a <-summary(a)$coefficients[2,1]
coeff<- summary(a)$coefficients[2,4]
pVal cat("The coefficient of the linear regression is: ", coeff, "\n",
"The p-value is: ", pVal, "\n")
## The coefficient of the linear regression is: 0.01042223
## The p-value is: 0.0002489788
The hypothesized mean, variance and sizes for each Time group are shown below in Table 1.
<- df %>% group_by(Time) %>%
Subset summarize(Mean = mean(AGER, na.rm = T), Variance = sd(AGER, na.rm = T)^2, Size = n()) # Get the size, mean and variance of each Time group
kable(Subset, caption = "*Table 1 - Mean, Variance and Size for Each Time Group*", row.names = NA, align = "c")
Time | Mean | Variance | Size |
---|---|---|---|
24 | 0.8439886 | 0.0604894 | 84 |
48 | 2.2456403 | 3.0218083 | 86 |
72 | 2.7526489 | 7.8850666 | 84 |
144 | 2.4733090 | 9.2351024 | 82 |
Creating a random dataset using the hypothesized group means, variances and sizes that will yield significant effects on AGER response
Using the hypothesized means, variance and sizes for each group (Table 1) a simulated dataset containing simulated ‘AGER’ values (simulated from a normal distribution) for each time point was created using the ‘sim_rnorm()’ function’. The linear regression analysis was done on the simulated dataset and the coefficient and p-value was returned using the ‘Reg_Stats()’ function. Lastly, the dataset was plotted using the ‘Reg_Plot()’ function. The coefficients and p-values for each linear regression model from the 4 test are displayed on table 2.
## Create input variables needed to create simulated dataset
= 84; mean_24h = 0.8439886; var_24h = 0.0604894;
n_24h = 86; mean_48h = 2.2456403; var_48h = 3.0218083;
n_48h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
n_72h = 82; mean_144h = 2.4733090; var_144h = 9.2351024;
n_144h
# Create a list with all input data
<- list(sim24h = c(n_24h, mean_24h,var_24h),
input sim48h = c(n_48h, mean_48h,var_48h),
sim72h = c(n_72h, mean_72h,var_72h),
sim144h = c(n_144h, mean_144h,var_144h))
<- sim_rnorm(input) # create a randomly generated Time and AGER dataset
df1 <- Reg_Stat(df1) # computes the linear regression analysis of AGER(response) and Time (predictor)
LR1
<- sim_rnorm(input) # create a randomly generated Time and AGER dataset
df2 <- Reg_Stat(df2) # computes the linear regression analysis of AGER(response) and Time (predictor)
LR2
<- sim_rnorm(input) # create a randomly generated Time and AGER dataset
df3 <- Reg_Stat(df3) # computes the linear regression analysis of AGER(response) and Time (predictor)
LR3
<- sim_rnorm(input) # create a randomly generated Time and AGER dataset
df4 <- Reg_Stat(df4) # computes the linear regression analysis of AGER(response) and Time (predictor)
LR4
<- c("1", "2", "3", "4")
Test <- bind_rows(LR1, LR2, LR3, LR4)
RegStats1 <- cbind(Test,RegStats1)
s
kable(s, caption = "**Table 2 - Coefficient, p-Value and Conclusion from Linear Regression Analysis on 4 Different Randomly Generated Dataset**", align = "c")
Test | Coefficient | P-Value | Conclusion |
---|---|---|---|
1 | 0.003713 | 0.2165 | Non-significant |
2 | 0.01369 | 2.178e-06 | Significant |
3 | 0.01003 | 0.0007388 | Significant |
4 | 0.01271 | 3.844e-05 | Significant |
<-list(Reg_Plot(df1), Reg_Plot(df2), Reg_Plot(df3), Reg_Plot(df4)) # compress all the plots into a list
plotList
<- "Fig 1 - Plot of AGER (simulated from normal distribution with the same means, variance and sample size) and Time."
caption
suppressMessages(grid.arrange(grobs = plotList, bottom = caption)) # plot all the plots on a grid
Running Linear-Regression While Adjusting The Group Means And Sample Sizes
In order to observe how the means and sample sizes affect linear regression model, the means and sample sizes for the response variable was adjusted to see how small the difference between the groups need to be for there to still be significant pattern (p <0.05).
To isolate the effects of means and sample size differences, the mean or sample size (one at a time) was adjusted while keeping the other constant.
Adjusting mean differences while keeping variance and sample sizes constant
To clearly observe the p_value pattern, due to varying group mean differences the linear regression model was repeated multiple times using the group sample sizes and variances from my dataset, however the means were set to 2.5 and one the group mean varied from 1.5 to 2.5 to study the effect of mean differences. To accomplish this, a for loop was used to compute the p_values for different datasets. Note: the seed was set to 99 to limit variation in each LR model repetition.
See code chunk below:
<- NULL
Reg_output <- NULL
trial_mean
<- seq(from = 1.5, to = 2.5, by = 0.05) # create a sequence of means for one group
h
for(i in seq_along(h)){
= 84; mean_24h = h[i]; var_24h = 0.0604894;
n_24h = 86; mean_48h = 2.5; var_48h = 3.0218083;
n_48h = 84; mean_72h = 2.5; var_72h = 7.8850666;
n_72h = 82; mean_144h = 2.5; var_144h = 9.2351024;
n_144h
# Create a list with all input data
<- list(sim24h = c(n_24h, mean_24h,var_24h),
input2 sim48h = c(n_48h, mean_48h,var_48h),
sim72h = c(n_72h, mean_72h,var_72h),
sim144h = c(n_144h, mean_144h,var_144h))
<- mean_48h - mean_24h # compute mean difference
mean_diff set.seed(99)
<- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
dfz
<- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
RegOutput <- data.frame(t(c(i, mean_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
tf
<- rbind(trial_mean, tf) # update the 'trial_mean' dataframe
trial_mean <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
Reg_output
}
names(trial_mean) <- c("Trial No", "Mean Difference")
<- cbind(trial_mean,Reg_output)
RegResults <- RegResults %>% select(-Coefficient)
output
kable(output, caption = "**Table 3 - LR Output with Varying Group Means Difference **", align = "c")
Trial No | Mean Difference | P.Value | Conclusion |
---|---|---|---|
1 | 1.00 | 0.02324 | Significant |
2 | 0.95 | 0.02997 | Significant |
3 | 0.90 | 0.03836 | Significant |
4 | 0.85 | 0.04869 | Significant |
5 | 0.80 | 0.06132 | Non-significant |
6 | 0.75 | 0.0766 | Non-significant |
7 | 0.70 | 0.09491 | Non-significant |
8 | 0.65 | 0.1166 | Non-significant |
9 | 0.60 | 0.1422 | Non-significant |
10 | 0.55 | 0.1719 | Non-significant |
11 | 0.50 | 0.2061 | Non-significant |
12 | 0.45 | 0.245 | Non-significant |
13 | 0.40 | 0.289 | Non-significant |
14 | 0.35 | 0.3381 | Non-significant |
15 | 0.30 | 0.3923 | Non-significant |
16 | 0.25 | 0.4515 | Non-significant |
17 | 0.20 | 0.5156 | Non-significant |
18 | 0.15 | 0.5841 | Non-significant |
19 | 0.10 | 0.6567 | Non-significant |
20 | 0.05 | 0.7326 | Non-significant |
21 | 0.00 | 0.8113 | Non-significant |
As shown in table 4, the smallest mean difference that still yield a significant p-value is 0.85.
Adjusting sample size while keeping mean and variance constant
To see the effect of sample size differences in the p-value, the hypothesized means and variances from my dataset was used, however all the group sample sizes was changed to 84 (except on that was varied from 0 to 84). A for loop was used to simulate the dataset and run the LR analysis multiple times (22 times) with the varying sample size. To limit data variation, the seed was set to 99.
See code chunk:
<- NULL
Reg_output <- NULL
trial_size
<- seq(from = 0, to = 84, by = 2) # create a sequence of means for one group
h
for(i in seq_along(h)){
= h[i]; mean_24h = 0.8439886; var_24h = 0.0604894;
n_24h = 84; mean_48h = 2.2456403; var_48h = 3.0218083;
n_48h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
n_72h = 84; mean_144h = 2.4733090; var_144h = 9.2351024;
n_144h
# Create a list with all input data
<- list(sim24h = c(n_24h, mean_24h,var_24h),
input2 sim48h = c(n_48h, mean_48h,var_48h),
sim72h = c(n_72h, mean_72h,var_72h),
sim144h = c(n_144h, mean_144h,var_144h))
<- n_48h - n_24h # compute mean difference
size_diff set.seed(99)
<- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
dfz
<- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
RegOutput <- data.frame(t(c(i, size_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
tf
<- rbind(trial_size, tf) # update the 'trial_mean' dataframe
trial_size <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
Reg_output
}
names(trial_size) <- c("Trial No", "Size Difference")
<- cbind(trial_size,Reg_output)
RegResults <- RegResults %>% select(-Coefficient)
output
kable(output, caption = "**Table 4 - LR Output with Varying Group Size Difference (with seed set to 99)**", align = "c")
Trial No | Size Difference | P.Value | Conclusion |
---|---|---|---|
1 | 84 | 0.8201 | Non-significant |
2 | 82 | 0.8848 | Non-significant |
3 | 80 | 0.9478 | Non-significant |
4 | 78 | 0.9666 | Non-significant |
5 | 76 | 0.8834 | Non-significant |
6 | 74 | 0.8853 | Non-significant |
7 | 72 | 0.8877 | Non-significant |
8 | 70 | 0.8607 | Non-significant |
9 | 68 | 0.8233 | Non-significant |
10 | 66 | 0.631 | Non-significant |
11 | 64 | 0.6505 | Non-significant |
12 | 62 | 0.9932 | Non-significant |
13 | 60 | 0.911 | Non-significant |
14 | 58 | 0.9459 | Non-significant |
15 | 56 | 0.6313 | Non-significant |
16 | 54 | 0.3729 | Non-significant |
17 | 52 | 0.2806 | Non-significant |
18 | 50 | 0.1532 | Non-significant |
19 | 48 | 0.04651 | Significant |
20 | 46 | 0.01779 | Significant |
21 | 44 | 0.00582 | Significant |
22 | 42 | 0.004681 | Significant |
23 | 40 | 0.005257 | Significant |
24 | 38 | 0.006602 | Significant |
25 | 36 | 0.001318 | Significant |
26 | 34 | 0.001231 | Significant |
27 | 32 | 0.0005218 | Significant |
28 | 30 | 0.001147 | Significant |
29 | 28 | 0.0004707 | Significant |
30 | 26 | 0.001102 | Significant |
31 | 24 | 0.001791 | Significant |
32 | 22 | 0.0009856 | Significant |
33 | 20 | 0.0004924 | Significant |
34 | 18 | 0.001818 | Significant |
35 | 16 | 0.0003167 | Significant |
36 | 14 | 0.0001868 | Significant |
37 | 12 | 0.0001722 | Significant |
38 | 10 | 0.0003901 | Significant |
39 | 8 | 0.0002476 | Significant |
40 | 6 | 8.304e-05 | Significant |
41 | 4 | 7.856e-05 | Significant |
42 | 2 | 0.0002776 | Significant |
43 | 0 | 0.0002533 | Significant |
The smallest sample size difference for my hypothesized group means and variance that still yields significance is 48 as shown in table 4. However, this was based on randomly generated data with the seed set to eliminate variation in the random number generation. To visualize the effect sample size with random variation in the data, the code was re-ran without the seed set.
See code chunk:
<- NULL
Reg_output <- NULL
trial_size
<- seq(from = 0, to = 84, by = 2) # create a sequence of means for one group
h
for(i in seq_along(h)){
= h[i]; mean_24h = 0.8439886; var_24h = 0.0604894;
n_24h = 84; mean_48h = 2.2456403; var_48h = 3.0218083;
n_48h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
n_72h = 84; mean_144h = 2.4733090; var_144h = 9.2351024;
n_144h
# Create a list with all input data
<- list(sim24h = c(n_24h, mean_24h,var_24h),
input2 sim48h = c(n_48h, mean_48h,var_48h),
sim72h = c(n_72h, mean_72h,var_72h),
sim144h = c(n_144h, mean_144h,var_144h))
<- n_48h - n_24h # compute mean difference
size_diff #set.seed(99)
<- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
dfz
<- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
RegOutput <- data.frame(t(c(i, size_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
tf
<- rbind(trial_size, tf) # update the 'trial_mean' dataframe
trial_size <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
Reg_output
}
names(trial_size) <- c("Trial No", "Size Difference")
<- cbind(trial_size,Reg_output)
RegResults <- RegResults %>% select(-Coefficient)
output
kable(output, caption = "**Table 5 - LR Output with Varying Group Size Difference (without seed set) **", align = "c")
Trial No | Size Difference | P.Value | Conclusion |
---|---|---|---|
1 | 84 | 0.5468 | Non-significant |
2 | 82 | 0.05544 | Non-significant |
3 | 80 | 0.9721 | Non-significant |
4 | 78 | 0.2186 | Non-significant |
5 | 76 | 0.5577 | Non-significant |
6 | 74 | 0.2129 | Non-significant |
7 | 72 | 0.612 | Non-significant |
8 | 70 | 0.4714 | Non-significant |
9 | 68 | 0.1283 | Non-significant |
10 | 66 | 1.497e-06 | Significant |
11 | 64 | 0.03588 | Significant |
12 | 62 | 0.03044 | Significant |
13 | 60 | 0.1047 | Non-significant |
14 | 58 | 0.9671 | Non-significant |
15 | 56 | 0.9432 | Non-significant |
16 | 54 | 0.6843 | Non-significant |
17 | 52 | 0.0007491 | Significant |
18 | 50 | 0.07374 | Non-significant |
19 | 48 | 0.9746 | Non-significant |
20 | 46 | 0.1672 | Non-significant |
21 | 44 | 0.004454 | Significant |
22 | 42 | 0.0001467 | Significant |
23 | 40 | 0.8272 | Non-significant |
24 | 38 | 0.008273 | Significant |
25 | 36 | 0.0001807 | Significant |
26 | 34 | 0.0004801 | Significant |
27 | 32 | 0.001485 | Significant |
28 | 30 | 2.657e-05 | Significant |
29 | 28 | 2.095e-05 | Significant |
30 | 26 | 0.04369 | Significant |
31 | 24 | 0.06534 | Non-significant |
32 | 22 | 0.001869 | Significant |
33 | 20 | 0.07304 | Non-significant |
34 | 18 | 5.435e-06 | Significant |
35 | 16 | 9.15e-06 | Significant |
36 | 14 | 0.0002092 | Significant |
37 | 12 | 1.326e-09 | Significant |
38 | 10 | 3.67e-05 | Significant |
39 | 8 | 2.319e-05 | Significant |
40 | 6 | 4.337e-05 | Significant |
41 | 4 | 0.02904 | Significant |
42 | 2 | 2.284e-06 | Significant |
43 | 0 | 0.01453 | Significant |
As seen from table 5, there is no clear distinctive point where the significance changes due to the effect of random variation in the data between all the trial runs.