HW7: Creating Fake Data Sets To Explore Hypotheses

April 15, 2024

Chika Ikpechukwu.

Load relevant libraries

library(tidyverse) # To use ggplot2 and dplyr packages
library(kableExtra) # To make fancy tables
library(gridExtra) # To arrange multiple plots in a grid

Background

My lab studied gene expression from human alveolar type 2 cells grown on plates of different stiffness (2 kPa, 16 kPa and 64 kPa) in different media (expansion and differentiation media) for various culture durations (Time). It was hypothesized that stiffness, media and duration (time) would affect gene expression. Assessing the validity of this hypothesis involves several statistical analyses; however, for the sake of simplicity, only one predictor variable, 'Time', and one response variable, 'AGER' gene expression, will be analysed.

The 'Time' variable is a continuous variable denoting the number of hours post-seeding, and the 'AGER' variable, the differential gene expression for the AGER gene, is also continuous. Because both the predictor and response variables are continuous, linear regression is the most appropriate statistical test to assess the relationship between the two variables.

For this exercise, data sets simulated from normal distributions will be used to visualize how means and sample sizes affect the linear regression output.
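
As a quick, self-contained illustration of the approach (using made-up values rather than my lab's numbers), the sketch below draws simulated 'AGER' values for two time points from a normal distribution and fits a linear model. Note that rnorm() takes a standard deviation, not a variance, so group variances are converted with sqrt() here and throughout this report.

set.seed(1)
toy <- data.frame(Time = rep(c(24, 48), each = 10), # hours post-seeding
                  sim_AGER = c(rnorm(n = 10, mean = 1, sd = sqrt(0.5)),  # sd = sqrt(variance)
                               rnorm(n = 10, mean = 2, sd = sqrt(0.5))))
summary(lm(sim_AGER ~ Time, data = toy))$coefficients # coefficient table: estimate, std. error, t-value and p-value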

User-designed Functions

In order to repeat the linear regression analysis multiple times with different random data, 3 functions were designed to simplify this process: sim_rnorm(), Reg_Stat() and Reg_Plot().

The code for the functions is shown below:

##############################################################################################
# FUNCTION: sim_rnorm()
# Description: a function that creates a simulated gene expression dataset for different culture durations.
# Input Arguments: 'list' (a list containing the mean, variance and size of each 'Time' categories)
# Output: 'sim_df' (a dataframe containing a unique ID variable, Time and sim_AGER (simulated from normal distribution))
#___________________________________________________________________________________________

sim_rnorm <- function(list = input){

  #Times = c("24h", "+4h", "+24h", "5d")

  sim24h  <- rnorm(n = list[[1]][1], mean = list[[1]][2], sd = sqrt(list[[1]][3]))
  sim48h  <- rnorm(n = list[[2]][1], mean = list[[2]][2], sd = sqrt(list[[2]][3]))
  sim72h  <- rnorm(n = list[[3]][1], mean = list[[3]][2], sd = sqrt(list[[3]][3]))
  sim144h <- rnorm(n = list[[4]][1], mean = list[[4]][2], sd = sqrt(list[[4]][3]))

  sim_AGER <- c(sim24h, sim48h, sim72h, sim144h)
  ID <- 1:length(sim_AGER)
  Time <- c(rep(24, length(sim24h)), rep(48, length(sim48h)), rep(72, length(sim72h)), rep(144, length(sim144h))) # Assigning Time categories to the rnorm() generated data

  sim_df <- data.frame(ID, Time, sim_AGER) # Create a dataframe from the simulated Time and AGER data

  return(sim_df)
}
##############################################################################################


##############################################################################################
# FUNCTION: Reg_Stat()
# Description: Performs linear regression on a dataset
# Input Arguments: 'df' (a dataframe containing the variable the linear regression will be ran on)
# Output: ‘a’ (an atomic vector containing the coefficient 'b' (Response = a + bPredictor), the p-value and conclusion (significant or insignificant))
#___________________________________________________________________________________________

Reg_Stat <- function(df = DF){
  
   z <- lm(sim_AGER ~ Time, data = df)
   coeff <- summary(z)$coefficients[2,1] # pulls out the coefficient of the relationship between the response and predictor variables
   pVal <- summary(z)$coefficients[2,4] # pulls out the p-value of the relationship between the response and predictor variables
  
   a <-c(signif(coeff, 4), signif(pVal, 4), ifelse(pVal < 0.05, "Significant", "Non-significant"))
   names(a) <- c("Coefficient", "P-Value", "Conclusion")
   return(a)
   }
##############################################################################################


##############################################################################################
# FUNCTION: Reg_Plot()
# Description: Plots the dataset
# Input Arguments: ‘df’ (a dataframe containing the variable to be plotted)
# Output:'plot'
#___________________________________________________________________________________________

Reg_Plot <- function(df = DF){
  
   ### Plot of the data
   plot <- ggplot(df) +
           aes(x = Time, y = sim_AGER) +
           geom_point() +
           stat_smooth(method = lm, se = TRUE, level = 0.99) # 99% confidence band; default level = 0.95
   
  return(plot)
 }
##############################################################################################

Running Linear Regression on Different Randomly Generated Datasets with the Same Means, Variances and Sizes

Getting group means, variances and sizes that will yield significant effects on AGER response

Instead of guessing the sample sizes, means and variances for each 'Time' group that would yield a significant effect on AGER expression, the sample sizes, means and variances of the AGER variable for each Time group from my lab's dataset will be used, since a quick linear regression analysis on the AGER variable with Time as the predictor yields a significant p-value (p = 0.0002489788). See code chunk:

df <- data.frame(read.csv("RawData.csv")) # read in the original csv dataset into a dataframe

a <- lm(AGER ~ Time, data = df)
coeff<-summary(a)$coefficients[2,1]
pVal <- summary(a)$coefficients[2,4]
cat("The coefficient of the linear regression is: ", coeff, "\n",
    "The p-value is: ", pVal, "\n")
## The coefficient of the linear regression is:  0.01042223 
##  The p-value is:  0.0002489788

The hypothesized means, variances and sizes for each Time group are shown below in Table 1.

Subset <- df %>% group_by(Time) %>% 
                 summarize(Mean = mean(AGER, na.rm = T), Variance = sd(AGER, na.rm = T)^2, Size = n()) # Get the size, mean and variance of each Time group

kable(Subset, caption = "*Table 1 - Mean, Variance and Size for Each Time Group*", row.names = NA, align = "c")
Table 1 - Mean, Variance and Size for Each Time Group
Time Mean Variance Size
24 0.8439886 0.0604894 84
48 2.2456403 3.0218083 86
72 2.7526489 7.8850666 84
144 2.4733090 9.2351024 82

Creating a random dataset using the hypothesized group means, variances and sizes that will yield significant effects on AGER response

Using the hypothesized means, variances and sizes for each group (Table 1), a simulated dataset containing simulated 'AGER' values (drawn from a normal distribution) for each time point was created using the 'sim_rnorm()' function. The linear regression analysis was run on the simulated dataset, and the coefficient and p-value were returned using the 'Reg_Stat()' function. Lastly, the dataset was plotted using the 'Reg_Plot()' function. This was repeated for 4 different randomly generated datasets; the coefficients and p-values for each linear regression model from the 4 tests are displayed in Table 2.

## Create input variables needed to create simulated dataset

n_24h = 84; mean_24h = 0.8439886; var_24h = 0.0604894;
n_48h = 86; mean_48h = 2.2456403; var_48h = 3.0218083;
n_72h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
n_144h = 82; mean_144h = 2.4733090; var_144h = 9.2351024;

# Create a list with all input data

input <- list(sim24h = c(n_24h, mean_24h,var_24h), 
              sim48h = c(n_48h, mean_48h,var_48h),
              sim72h = c(n_72h, mean_72h,var_72h),
              sim144h = c(n_144h, mean_144h,var_144h))


df1 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR1 <- Reg_Stat(df1) # computes the linear regression analysis of AGER(response) and Time (predictor)

df2 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR2 <- Reg_Stat(df2) # computes the linear regression analysis of AGER(response) and Time (predictor)

df3 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR3 <- Reg_Stat(df3) # computes the linear regression analysis of AGER(response) and Time (predictor)

df4 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR4 <- Reg_Stat(df4) # computes the linear regression analysis of AGER(response) and Time (predictor)

Test <- c("1", "2", "3", "4")
RegStats1 <- bind_rows(LR1, LR2, LR3, LR4)
s <- cbind(Test,RegStats1)

kable(s, caption = "**Table 2 - Coefficient, p-Value and Conclusion from Linear Regression Analysis on 4 Different Randomly Generated Datasets**", align = "c")
Table 2 - Coefficient, p-Value and Conclusion from Linear Regression Analysis on 4 Different Randomly Generated Datasets
Test Coefficient P-Value Conclusion
1 0.003713 0.2165 Non-significant
2 0.01369 2.178e-06 Significant
3 0.01003 0.0007388 Significant
4 0.01271 3.844e-05 Significant
plotList <-list(Reg_Plot(df1), Reg_Plot(df2), Reg_Plot(df3), Reg_Plot(df4)) # compress all the plots into a list

caption <- "Fig 1 - Plot of AGER (simulated from normal distribution with the same means, variance and sample size) and Time."

suppressMessages(grid.arrange(grobs = plotList, bottom = caption)) # plot all the plots on a grid

Running Linear Regression While Adjusting The Group Means And Sample Sizes

In order to observe how the means and sample sizes affect the linear regression model, the means and sample sizes of the response variable groups were adjusted to see how small the difference between the groups needs to be for there to still be a significant pattern (p < 0.05).

To isolate the effects of means and sample size differences, the mean or sample size (one at a time) was adjusted while keeping the other constant.

Adjusting mean differences while keeping variance and sample sizes constant

To clearly observe how the p-value changes with varying group mean differences, the linear regression model was repeated multiple times using the group sample sizes and variances from my dataset; however, the other group means were set to 2.5 while one group mean was varied from 1.5 to 2.5 to study the effect of mean differences. To accomplish this, a for loop was used to compute the p-values for the different datasets. Note: the seed was set to 99 to limit variation across the LR model repetitions.

See code chunk below:

Reg_output <- NULL
trial_mean <- NULL

h <- seq(from = 1.5, to = 2.5, by = 0.05) # create a sequence of means for one group

for(i in seq_along(h)){
   
    n_24h = 84; mean_24h = h[i]; var_24h = 0.0604894;
    n_48h = 86; mean_48h = 2.5; var_48h = 3.0218083;
    n_72h = 84; mean_72h = 2.5; var_72h = 7.8850666;
    n_144h = 82; mean_144h = 2.5; var_144h = 9.2351024;
    
    # Create a list with all input data
    input2 <- list(sim24h = c(n_24h, mean_24h,var_24h), 
                  sim48h = c(n_48h, mean_48h,var_48h),
                  sim72h = c(n_72h, mean_72h,var_72h),
                  sim144h = c(n_144h, mean_144h,var_144h))
    
    mean_diff <- mean_48h - mean_24h # compute mean difference
    set.seed(99)
    dfz <- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
    
    RegOutput <- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
    tf <- data.frame(t(c(i, mean_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
    
    trial_mean <- rbind(trial_mean, tf)    # update the 'trial_mean' dataframe
    Reg_output <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
    
}

names(trial_mean) <- c("Trial No", "Mean Difference")
RegResults <- cbind(trial_mean,Reg_output)
output <- RegResults %>% select(-Coefficient)

kable(output, caption = "**Table 3 - LR Output with Varying Group Means Difference **", align = "c")
Table 3 - LR Output with Varying Group Means Difference
Trial No Mean Difference P.Value Conclusion
1 1.00 0.02324 Significant
2 0.95 0.02997 Significant
3 0.90 0.03836 Significant
4 0.85 0.04869 Significant
5 0.80 0.06132 Non-significant
6 0.75 0.0766 Non-significant
7 0.70 0.09491 Non-significant
8 0.65 0.1166 Non-significant
9 0.60 0.1422 Non-significant
10 0.55 0.1719 Non-significant
11 0.50 0.2061 Non-significant
12 0.45 0.245 Non-significant
13 0.40 0.289 Non-significant
14 0.35 0.3381 Non-significant
15 0.30 0.3923 Non-significant
16 0.25 0.4515 Non-significant
17 0.20 0.5156 Non-significant
18 0.15 0.5841 Non-significant
19 0.10 0.6567 Non-significant
20 0.05 0.7326 Non-significant
21 0.00 0.8113 Non-significant

As shown in Table 3, the smallest mean difference that still yields a significant p-value is 0.85.

Adjusting sample size while keeping mean and variance constant

To see the effect of sample size differences on the p-value, the hypothesized means and variances from my dataset were used; however, all the group sample sizes were set to 84, except one, which was varied from 0 to 84. A for loop was used to simulate the dataset and run the LR analysis multiple times (43 times) with the varying sample size. To limit data variation, the seed was set to 99.

See code chunk:

Reg_output <- NULL
trial_size <- NULL

h <- seq(from = 0, to = 84, by = 2) # create a sequence of sample sizes for one group

for(i in seq_along(h)){
   
    n_24h = h[i]; mean_24h = 0.8439886; var_24h = 0.0604894;
    n_48h = 84; mean_48h = 2.2456403; var_48h = 3.0218083;
    n_72h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
    n_144h = 84; mean_144h = 2.4733090; var_144h = 9.2351024;
        
    # Create a list with all input data
    input2 <- list(sim24h = c(n_24h, mean_24h,var_24h), 
                  sim48h = c(n_48h, mean_48h,var_48h),
                  sim72h = c(n_72h, mean_72h,var_72h),
                  sim144h = c(n_144h, mean_144h,var_144h))
    
    size_diff <- n_48h - n_24h # compute sample size difference
    set.seed(99)
    dfz <- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
    
    RegOutput <- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
    tf <- data.frame(t(c(i, size_diff))) # create a dataframe with the trial no. and the size diff. for the current trial
    
    trial_size <- rbind(trial_size, tf)    # update the 'trial_size' dataframe
    Reg_output <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
    
}

names(trial_size) <- c("Trial No", "Size Difference")
RegResults <- cbind(trial_size,Reg_output)
output <- RegResults %>% select(-Coefficient)

kable(output, caption = "**Table 4 - LR Output with Varying Group Size Difference (with seed set to 99)**", align = "c")
Table 4 - LR Output with Varying Group Size Difference (with seed set to 99)
Trial No Size Difference P.Value Conclusion
1 84 0.8201 Non-significant
2 82 0.8848 Non-significant
3 80 0.9478 Non-significant
4 78 0.9666 Non-significant
5 76 0.8834 Non-significant
6 74 0.8853 Non-significant
7 72 0.8877 Non-significant
8 70 0.8607 Non-significant
9 68 0.8233 Non-significant
10 66 0.631 Non-significant
11 64 0.6505 Non-significant
12 62 0.9932 Non-significant
13 60 0.911 Non-significant
14 58 0.9459 Non-significant
15 56 0.6313 Non-significant
16 54 0.3729 Non-significant
17 52 0.2806 Non-significant
18 50 0.1532 Non-significant
19 48 0.04651 Significant
20 46 0.01779 Significant
21 44 0.00582 Significant
22 42 0.004681 Significant
23 40 0.005257 Significant
24 38 0.006602 Significant
25 36 0.001318 Significant
26 34 0.001231 Significant
27 32 0.0005218 Significant
28 30 0.001147 Significant
29 28 0.0004707 Significant
30 26 0.001102 Significant
31 24 0.001791 Significant
32 22 0.0009856 Significant
33 20 0.0004924 Significant
34 18 0.001818 Significant
35 16 0.0003167 Significant
36 14 0.0001868 Significant
37 12 0.0001722 Significant
38 10 0.0003901 Significant
39 8 0.0002476 Significant
40 6 8.304e-05 Significant
41 4 7.856e-05 Significant
42 2 0.0002776 Significant
43 0 0.0002533 Significant

The largest sample size difference, for my hypothesized group means and variances, that still yields significance is 48, as shown in Table 4. However, this was based on randomly generated data with the seed set to eliminate variation in the random number generation. To visualize the effect of sample size with random variation in the data, the code was re-run without setting the seed.

See code chunk:

Reg_output <- NULL
trial_size <- NULL

h <- seq(from = 0, to = 84, by = 2) # create a sequence of sample sizes for one group

for(i in seq_along(h)){
   
    n_24h = h[i]; mean_24h = 0.8439886; var_24h = 0.0604894;
    n_48h = 84; mean_48h = 2.2456403; var_48h = 3.0218083;
    n_72h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
    n_144h = 84; mean_144h = 2.4733090; var_144h = 9.2351024;
        
    # Create a list with all input data
    input2 <- list(sim24h = c(n_24h, mean_24h,var_24h), 
                  sim48h = c(n_48h, mean_48h,var_48h),
                  sim72h = c(n_72h, mean_72h,var_72h),
                  sim144h = c(n_144h, mean_144h,var_144h))
    
    size_diff <- n_48h - n_24h # compute sample size difference
    #set.seed(99)
    dfz <- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
    
    RegOutput <- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
    tf <- data.frame(t(c(i, size_diff))) # create a dataframe with the trial no. and the size diff. for the current trial
    
    trial_size <- rbind(trial_size, tf)    # update the 'trial_size' dataframe
    Reg_output <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
    
}

names(trial_size) <- c("Trial No", "Size Difference")
RegResults <- cbind(trial_size,Reg_output)
output <- RegResults %>% select(-Coefficient)

kable(output, caption = "**Table 5 - LR Output with Varying Group Size Difference (without seed set) **", align = "c")
Table 5 - LR Output with Varying Group Size Difference (without seed set)
Trial No Size Difference P.Value Conclusion
1 84 0.5468 Non-significant
2 82 0.05544 Non-significant
3 80 0.9721 Non-significant
4 78 0.2186 Non-significant
5 76 0.5577 Non-significant
6 74 0.2129 Non-significant
7 72 0.612 Non-significant
8 70 0.4714 Non-significant
9 68 0.1283 Non-significant
10 66 1.497e-06 Significant
11 64 0.03588 Significant
12 62 0.03044 Significant
13 60 0.1047 Non-significant
14 58 0.9671 Non-significant
15 56 0.9432 Non-significant
16 54 0.6843 Non-significant
17 52 0.0007491 Significant
18 50 0.07374 Non-significant
19 48 0.9746 Non-significant
20 46 0.1672 Non-significant
21 44 0.004454 Significant
22 42 0.0001467 Significant
23 40 0.8272 Non-significant
24 38 0.008273 Significant
25 36 0.0001807 Significant
26 34 0.0004801 Significant
27 32 0.001485 Significant
28 30 2.657e-05 Significant
29 28 2.095e-05 Significant
30 26 0.04369 Significant
31 24 0.06534 Non-significant
32 22 0.001869 Significant
33 20 0.07304 Non-significant
34 18 5.435e-06 Significant
35 16 9.15e-06 Significant
36 14 0.0002092 Significant
37 12 1.326e-09 Significant
38 10 3.67e-05 Significant
39 8 2.319e-05 Significant
40 6 4.337e-05 Significant
41 4 0.02904 Significant
42 2 2.284e-06 Significant
43 0 0.01453 Significant

As seen in Table 5, there is no clear, distinct point at which the significance changes, because of the effect of random variation in the data across the trial runs.
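
Because only one random dataset was generated at each sample size, a single run cannot pinpoint a stable cutoff. A possible extension (a sketch only, not something run above; the helper name prop_significant is hypothetical) would be to repeat the simulation many times at each sample size and record the fraction of runs that reach significance, reusing the sim_rnorm() and Reg_Stat() functions defined earlier:

# Sketch: estimate the proportion of significant LR runs for a given 24h group size
prop_significant <- function(n_24h, n_reps = 100){
  p_vals <- replicate(n_reps, {
    input2 <- list(sim24h  = c(n_24h, 0.8439886, 0.0604894),
                   sim48h  = c(84,    2.2456403, 3.0218083),
                   sim72h  = c(84,    2.7526489, 7.8850666),
                   sim144h = c(84,    2.4733090, 9.2351024))
    as.numeric(Reg_Stat(sim_rnorm(input2))["P-Value"])   # p-value from one simulated dataset
  })
  mean(p_vals < 0.05) # fraction of replicates with p < 0.05
}
# Example: prop_significant(n_24h = 36)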