HW7: Creating Fake Data Sets To Explore Hypotheses

April 15, 2024

Chika Ikpechukwu.

Open relevant libraries

library(tidyverse) # To use ggplot2 and dplyr packages
library(kableExtra) # To make fancy tables
library(gridExtra)

Background

My lab studied gene expression from human alveolar type 2 cells grown on different stiffness plates (2kPA, 16kPA and 64kPA) in different media (expansion and differentiation media) for various culture duration (Time). It was hypothesized that stiffness, media and duration (time) will affect gene expression. Assessing the validity of this hypotheses involves several statistical analyses, however for the sake of simplicity only one predictor variable ‘Time’ and one response variable ‘AGER’ gene expression will be analysed.

The ‘Time’ variable is a continuous variable denoting the number of hours post-seeding and the ‘AGER’ variable is the differential gene expression for the AGER gene and is also continuous. Because both the predictor and response variables are continuous, the linear regression analysis will the most appropriate statistical test to assess the relationship between both variables.

For this exercise different normal distribution simulated data sets will be used to visualize how means and sample sizes effect the linear regression output.

User-designed Functions

In order to repeat the linear regression analysis multiples times with different random data, 3 functions were designed to simplify this process. The user-designed functions are:

sim_norm(): a function that creates a simulated (from a normal distribution) AGER gene expression dataset for different culture durations.
- Input Argument: ‘list’ (a list containing the mean, variance and size of each ‘Time’ group)
- Output: ‘sim_df’ (a dataframe containing a unique ID variable, Time and sim_AGER (simulated from normal distribution))
Reg_Stat(): Performs the linear regression analysis on a dataset
- Input Arguments: ‘df’ (a dataframe containing the variable the linear regression will be ran on)
- Output: ‘a’ (an atomic vector containing the coefficient ‘b’ (Response = a + bPredictor), the p-value and conclusion (significant (p < 0.05) or insignificant))
Reg_Plot(): Plot the dataset
- Input Arguments: ‘df’ (a dataframe containing the variable to be plotted)
- Output: ‘plot’ (a scatterplot with a trendline showing the correlation between the response and predictor variables))

Below shows the code for the functions:

##############################################################################################
# FUNCTION: sim_rnorm()
# Description: a function that creates a simulated gene expression dataset for different culture durations.
# Input Arguments: 'list' (a list containing the mean, variance and size of each 'Time' categories)
# Output: 'sim_df' (a dataframe containing a unique ID variable, Time and sim_AGER (simulated from normal distribution))
#___________________________________________________________________________________________

sim_rnorm <- function(list = input){

#Times = c("24h", "+4h", "+24h", "5d")

sim24h = rnorm(n = list[[1]][1], mean = list[[1]][2], sd = sqrt(list[[1]][3]))
sim48h = rnorm(n = list[[2]][1], mean = list[[2]][2], sd = sqrt(list[[2]][3]))
sim72h = rnorm(n = list[[3]][1], mean = list[[3]][2], sd = sqrt(list[[3]][3]))
sim144h = rnorm(n = list[[4]][1], mean = list[[4]][2], sd = sqrt(list[[4]][3]))

sim_AGER <- c(sim24h, sim48h, sim72h, sim144h)
ID <- 1:length(sim_AGER)
Time <- c (rep(24, length(sim24h)), rep(48, length(sim48h)), rep(72, length(sim72h)), rep(144, length(sim144h))) # Assigning Time categories to the rnorm() generated data

sim_df <- data.frame(ID, Time, sim_AGER) # Create a dataframe from the simulated Time and AGER data. 
  
return(sim_df)
}
##############################################################################################


##############################################################################################
# FUNCTION: Reg_Stat()
# Description: Performs linear regression on a dataset
# Input Arguments: 'df' (a dataframe containing the variable the linear regression will be ran on)
# Output: ‘a’ (an atomic vector containing the coefficient 'b' (Response = a + bPredictor), the p-value and conclusion (significant or insignificant))
#___________________________________________________________________________________________

Reg_Stat <- function(df = DF){
  
   z <- lm(sim_AGER ~ Time, data = df)
   coeff<-summary(z)$coefficients[2,1] # pulls out the coeffiecient of the relationship btw the response and predictor variables
   pVal <- summary(z)$coefficients[2,4] # pulls out the p-value of the relationship btw the response and predictor variables
  
   a <-c(signif(coeff, 4), signif(pVal, 4), ifelse(pVal < 0.05, "Significant", "Non-significant"))
   names(a) <- c("Coefficient", "P-Value", "Conclusion")
   return(a)
   }
##############################################################################################


##############################################################################################
# FUNCTION: Reg_Plot()
# Description: Plot the datset
# Input Arguments: ‘df’ (a dataframe containing the variable to be plotted)
# Output:'plot'
#___________________________________________________________________________________________

Reg_Plot <- function(df = DF){
  
   ### Plot of the data
   plot <- ggplot(df) +
           aes(x = Time, y = sim_AGER) +
           geom_point() +
           stat_smooth(method=lm,se=0.99) # default se=0.95 
   
  return(plot)
 }
##############################################################################################

Running Linear Regression on Different Randomly Generated Datasets with the Same Means, Variance and Size

Getting group means, variances and sizes that will yield significant effects on AGER response

Instead of guessing the sample sizes, means and variances for each ‘Time’ group that would yield significant effect in AGER expression, the sample sizes, means and variances of AGER variable for each Time group from my lab’s dataset will be used; as a quick linear regression analysis on AGER variable with Time as the predictor yields a significant p-value (p = 0.0002489788). See code chuck:

df <- data.frame(read.csv("RawData.csv")) # read in the original csv dataset into a dataframe

a = lm(df$AGER ~ df$Time, data = df)
coeff<-summary(a)$coefficients[2,1]
pVal <- summary(a)$coefficients[2,4]
cat("The coefficient of the linear regression is: ", coeff, "\n",
    "The p-value is: ", pVal, "\n")

## The coefficient of the linear regression is:  0.01042223 
##  The p-value is:  0.0002489788

The hypothesized mean, variance and sizes for each Time group are shown below in Table 1.

Subset <- df %>% group_by(Time) %>% 
                 summarize(Mean = mean(AGER, na.rm = T), Variance = sd(AGER, na.rm = T)^2, Size = n()) # Get the size, mean and variance of each Time group

kable(Subset, caption = "*Table 1 - Mean, Variance and Size for Each Time Group*", row.names = NA, align = "c")

*Table 1 - Mean, Variance and Size for Each Time Group*
Time	Mean	Variance	Size
24	0.8439886	0.0604894	84
48	2.2456403	3.0218083	86
72	2.7526489	7.8850666	84
144	2.4733090	9.2351024	82

Creating a random dataset using the hypothesized group means, variances and sizes that will yield significant effects on AGER response

Using the hypothesized means, variance and sizes for each group (Table 1) a simulated dataset containing simulated ‘AGER’ values (simulated from a normal distribution) for each time point was created using the ‘sim_rnorm()’ function’. The linear regression analysis was done on the simulated dataset and the coefficient and p-value was returned using the ‘Reg_Stats()’ function. Lastly, the dataset was plotted using the ‘Reg_Plot()’ function. The coefficients and p-values for each linear regression model from the 4 test are displayed on table 2.

## Create input variables needed to create simulated dataset

n_24h = 84; mean_24h = 0.8439886; var_24h = 0.0604894;
n_48h = 86; mean_48h = 2.2456403; var_48h = 3.0218083;
n_72h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
n_144h = 82; mean_144h = 2.4733090; var_144h = 9.2351024;

# Create a list with all input data

input <- list(sim24h = c(n_24h, mean_24h,var_24h), 
              sim48h = c(n_48h, mean_48h,var_48h),
              sim72h = c(n_72h, mean_72h,var_72h),
              sim144h = c(n_144h, mean_144h,var_144h))


df1 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR1 <- Reg_Stat(df1) # computes the linear regression analysis of AGER(response) and Time (predictor)

df2 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR2 <- Reg_Stat(df2) # computes the linear regression analysis of AGER(response) and Time (predictor)

df3 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR3 <- Reg_Stat(df3) # computes the linear regression analysis of AGER(response) and Time (predictor)

df4 <- sim_rnorm(input) # create a randomly generated Time and AGER dataset
LR4 <- Reg_Stat(df4) # computes the linear regression analysis of AGER(response) and Time (predictor)

Test <- c("1", "2", "3", "4")
RegStats1 <- bind_rows(LR1, LR2, LR3, LR4)
s <- cbind(Test,RegStats1)

kable(s, caption = "**Table 2 - Coefficient, p-Value and Conclusion from Linear Regression Analysis on 4 Different Randomly Generated Dataset**", align = "c")

**Table 2 - Coefficient, p-Value and Conclusion from Linear Regression Analysis on 4 Different Randomly Generated Dataset**
Test	Coefficient	P-Value	Conclusion
1	0.003713	0.2165	Non-significant
2	0.01369	2.178e-06	Significant
3	0.01003	0.0007388	Significant
4	0.01271	3.844e-05	Significant

plotList <-list(Reg_Plot(df1), Reg_Plot(df2), Reg_Plot(df3), Reg_Plot(df4)) # compress all the plots into a list

caption <- "Fig 1 - Plot of AGER (simulated from normal distribution with the same means, variance and sample size) and Time."

suppressMessages(grid.arrange(grobs = plotList, bottom = caption)) # plot all the plots on a grid

Running Linear-Regression While Adjusting The Group Means And Sample Sizes

In order to observe how the means and sample sizes affect linear regression model, the means and sample sizes for the response variable was adjusted to see how small the difference between the groups need to be for there to still be significant pattern (p <0.05).

To isolate the effects of means and sample size differences, the mean or sample size (one at a time) was adjusted while keeping the other constant.

Adjusting mean differences while keeping variance and sample sizes constant

To clearly observe the p_value pattern, due to varying group mean differences the linear regression model was repeated multiple times using the group sample sizes and variances from my dataset, however the means were set to 2.5 and one the group mean varied from 1.5 to 2.5 to study the effect of mean differences. To accomplish this, a for loop was used to compute the p_values for different datasets. Note: the seed was set to 99 to limit variation in each LR model repetition.

See code chunk below:

Reg_output <- NULL
trial_mean <- NULL

h <- seq(from = 1.5, to = 2.5, by = 0.05) # create a sequence of means for one group

for(i in seq_along(h)){
   
    n_24h = 84; mean_24h = h[i]; var_24h = 0.0604894;
    n_48h = 86; mean_48h = 2.5; var_48h = 3.0218083;
    n_72h = 84; mean_72h = 2.5; var_72h = 7.8850666;
    n_144h = 82; mean_144h = 2.5; var_144h = 9.2351024;
    
    # Create a list with all input data
    input2 <- list(sim24h = c(n_24h, mean_24h,var_24h), 
                  sim48h = c(n_48h, mean_48h,var_48h),
                  sim72h = c(n_72h, mean_72h,var_72h),
                  sim144h = c(n_144h, mean_144h,var_144h))
    
    mean_diff <- mean_48h - mean_24h # compute mean difference
    set.seed(99)
    dfz <- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
    
    RegOutput <- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
    tf <- data.frame(t(c(i, mean_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
    
    trial_mean <- rbind(trial_mean, tf)    # update the 'trial_mean' dataframe
    Reg_output <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
    
}

names(trial_mean) <- c("Trial No", "Mean Difference")
RegResults <- cbind(trial_mean,Reg_output)
output <- RegResults %>% select(-Coefficient)

kable(output, caption = "**Table 3 - LR Output with Varying Group Means Difference **", align = "c")

**Table 3 - LR Output with Varying Group Means Difference**
Trial No	Mean Difference	P.Value	Conclusion
1	1.00	0.02324	Significant
2	0.95	0.02997	Significant
3	0.90	0.03836	Significant
4	0.85	0.04869	Significant
5	0.80	0.06132	Non-significant
6	0.75	0.0766	Non-significant
7	0.70	0.09491	Non-significant
8	0.65	0.1166	Non-significant
9	0.60	0.1422	Non-significant
10	0.55	0.1719	Non-significant
11	0.50	0.2061	Non-significant
12	0.45	0.245	Non-significant
13	0.40	0.289	Non-significant
14	0.35	0.3381	Non-significant
15	0.30	0.3923	Non-significant
16	0.25	0.4515	Non-significant
17	0.20	0.5156	Non-significant
18	0.15	0.5841	Non-significant
19	0.10	0.6567	Non-significant
20	0.05	0.7326	Non-significant
21	0.00	0.8113	Non-significant

As shown in table 4, the smallest mean difference that still yield a significant p-value is 0.85.

Adjusting sample size while keeping mean and variance constant

To see the effect of sample size differences in the p-value, the hypothesized means and variances from my dataset was used, however all the group sample sizes was changed to 84 (except on that was varied from 0 to 84). A for loop was used to simulate the dataset and run the LR analysis multiple times (22 times) with the varying sample size. To limit data variation, the seed was set to 99.

See code chunk:

Reg_output <- NULL
trial_size <- NULL

h <- seq(from = 0, to = 84, by = 2) # create a sequence of means for one group

for(i in seq_along(h)){
   
    n_24h = h[i]; mean_24h = 0.8439886; var_24h = 0.0604894;
    n_48h = 84; mean_48h = 2.2456403; var_48h = 3.0218083;
    n_72h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
    n_144h = 84; mean_144h = 2.4733090; var_144h = 9.2351024;
        
    # Create a list with all input data
    input2 <- list(sim24h = c(n_24h, mean_24h,var_24h), 
                  sim48h = c(n_48h, mean_48h,var_48h),
                  sim72h = c(n_72h, mean_72h,var_72h),
                  sim144h = c(n_144h, mean_144h,var_144h))
    
    size_diff <- n_48h - n_24h # compute mean difference
    set.seed(99)
    dfz <- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
    
    RegOutput <- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
    tf <- data.frame(t(c(i, size_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
    
    trial_size <- rbind(trial_size, tf)    # update the 'trial_mean' dataframe
    Reg_output <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
    
}

names(trial_size) <- c("Trial No", "Size Difference")
RegResults <- cbind(trial_size,Reg_output)
output <- RegResults %>% select(-Coefficient)

kable(output, caption = "**Table 4 - LR Output with Varying Group Size Difference (with seed set to 99)**", align = "c")

**Table 4 - LR Output with Varying Group Size Difference (with seed set to 99)**
Trial No	Size Difference	P.Value	Conclusion
1	84	0.8201	Non-significant
2	82	0.8848	Non-significant
3	80	0.9478	Non-significant
4	78	0.9666	Non-significant
5	76	0.8834	Non-significant
6	74	0.8853	Non-significant
7	72	0.8877	Non-significant
8	70	0.8607	Non-significant
9	68	0.8233	Non-significant
10	66	0.631	Non-significant
11	64	0.6505	Non-significant
12	62	0.9932	Non-significant
13	60	0.911	Non-significant
14	58	0.9459	Non-significant
15	56	0.6313	Non-significant
16	54	0.3729	Non-significant
17	52	0.2806	Non-significant
18	50	0.1532	Non-significant
19	48	0.04651	Significant
20	46	0.01779	Significant
21	44	0.00582	Significant
22	42	0.004681	Significant
23	40	0.005257	Significant
24	38	0.006602	Significant
25	36	0.001318	Significant
26	34	0.001231	Significant
27	32	0.0005218	Significant
28	30	0.001147	Significant
29	28	0.0004707	Significant
30	26	0.001102	Significant
31	24	0.001791	Significant
32	22	0.0009856	Significant
33	20	0.0004924	Significant
34	18	0.001818	Significant
35	16	0.0003167	Significant
36	14	0.0001868	Significant
37	12	0.0001722	Significant
38	10	0.0003901	Significant
39	8	0.0002476	Significant
40	6	8.304e-05	Significant
41	4	7.856e-05	Significant
42	2	0.0002776	Significant
43	0	0.0002533	Significant

The smallest sample size difference for my hypothesized group means and variance that still yields significance is 48 as shown in table 4. However, this was based on randomly generated data with the seed set to eliminate variation in the random number generation. To visualize the effect sample size with random variation in the data, the code was re-ran without the seed set.

See code chunk:

Reg_output <- NULL
trial_size <- NULL

h <- seq(from = 0, to = 84, by = 2) # create a sequence of means for one group

for(i in seq_along(h)){
   
    n_24h = h[i]; mean_24h = 0.8439886; var_24h = 0.0604894;
    n_48h = 84; mean_48h = 2.2456403; var_48h = 3.0218083;
    n_72h = 84; mean_72h = 2.7526489; var_72h = 7.8850666;
    n_144h = 84; mean_144h = 2.4733090; var_144h = 9.2351024;
        
    # Create a list with all input data
    input2 <- list(sim24h = c(n_24h, mean_24h,var_24h), 
                  sim48h = c(n_48h, mean_48h,var_48h),
                  sim72h = c(n_72h, mean_72h,var_72h),
                  sim144h = c(n_144h, mean_144h,var_144h))
    
    size_diff <- n_48h - n_24h # compute mean difference
    #set.seed(99)
    dfz <- sim_rnorm(input2) # create a randomly generated Time and AGER dataset
    
    RegOutput <- data.frame(t(Reg_Stat(dfz))) # computes the linear regression analysis of AGER(response) and Time
    tf <- data.frame(t(c(i, size_diff))) # create a dataframe with the trial no. and the mean diff. for the current trial
    
    trial_size <- rbind(trial_size, tf)    # update the 'trial_mean' dataframe
    Reg_output <-rbind(Reg_output, RegOutput) # update the 'Reg_output' dataframe
    
}

names(trial_size) <- c("Trial No", "Size Difference")
RegResults <- cbind(trial_size,Reg_output)
output <- RegResults %>% select(-Coefficient)

kable(output, caption = "**Table 5 - LR Output with Varying Group Size Difference (without seed set) **", align = "c")

**Table 5 - LR Output with Varying Group Size Difference (without seed set)**
Trial No	Size Difference	P.Value	Conclusion
1	84	0.5468	Non-significant
2	82	0.05544	Non-significant
3	80	0.9721	Non-significant
4	78	0.2186	Non-significant
5	76	0.5577	Non-significant
6	74	0.2129	Non-significant
7	72	0.612	Non-significant
8	70	0.4714	Non-significant
9	68	0.1283	Non-significant
10	66	1.497e-06	Significant
11	64	0.03588	Significant
12	62	0.03044	Significant
13	60	0.1047	Non-significant
14	58	0.9671	Non-significant
15	56	0.9432	Non-significant
16	54	0.6843	Non-significant
17	52	0.0007491	Significant
18	50	0.07374	Non-significant
19	48	0.9746	Non-significant
20	46	0.1672	Non-significant
21	44	0.004454	Significant
22	42	0.0001467	Significant
23	40	0.8272	Non-significant
24	38	0.008273	Significant
25	36	0.0001807	Significant
26	34	0.0004801	Significant
27	32	0.001485	Significant
28	30	2.657e-05	Significant
29	28	2.095e-05	Significant
30	26	0.04369	Significant
31	24	0.06534	Non-significant
32	22	0.001869	Significant
33	20	0.07304	Non-significant
34	18	5.435e-06	Significant
35	16	9.15e-06	Significant
36	14	0.0002092	Significant
37	12	1.326e-09	Significant
38	10	3.67e-05	Significant
39	8	2.319e-05	Significant
40	6	4.337e-05	Significant
41	4	0.02904	Significant
42	2	2.284e-06	Significant
43	0	0.01453	Significant

As seen from table 5, there is no clear distinctive point where the significance changes due to the effect of random variation in the data between all the trial runs.