--- title: "Baldur Yeast Tutorial" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Baldur Yeast Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # 1. Setup This tutorial is quite fast and on a very simple data set (2 conditions only), for a more complicated tutorial on the setup please see `vignette('baldur_ups_tutorial')`. First we load `baldur` and setup the model dependent variables we need, then normalize the data and add the mean-variance trends. ```r library(baldur) # Setup design matrix yeast_design <- model.matrix(~0+factor(rep(1:2, each = 3))) colnames(yeast_design) <- paste0('ng', c(50, 100)) # Compare the first and second column of the design matrix # with the following contrast matrix yeast_contrast <- matrix(c(-1, 1), nrow = 2) # Set id column id_col <- colnames(yeast)[1] # "identifier" # Define the number of parallel workers to use workers <- floor(parallel::detectCores()/2) # Since baldur itself does not deal with missing data we remove the # rows that have missing values for the purpose of the tutorial. # Else, one would replace the filtering step with imputation but that is outside # the scope of baldur yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data (this might already have been done if imputation was performed) psrn(id_col) %>% # Add mean-variance trends calculate_mean_sd_trends(yeast_design) ``` Importantly, note that the column names of the design matrix are unique subsets of the names of the columns within the conditions: ```r colnames(yeast) #> [1] "identifier" "ng50_1" "ng50_2" "ng50_3" "ng100_1" "ng100_2" "ng100_3" colnames(yeast_design) #> [1] "ng50" "ng100" ``` This is essential for `baldur` to know which columns to use in calculations and to perform transformations. # 2. Mean-Variance trends and Gamma Regression fitting Next is to infer the mixture of the data and to estimate the parameters needed for `baldur`. First we will setup the needed variables for using `baldur` without partitioning the data. Then, partitioning and setting up `baldur` after trend-partitioning ```r # Fit the gamma regression gr_model <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate the uncertainty unc_gr <- estimate_uncertainty(gr_model, yeast_norm, id_col, yeast_design) ``` # 3. Run the sampling procedure Finally we sample the posterior of each row in the data. First we sample assuming a single trend, then using the partitioning. ```r # Single trend gr_results <- gr_model %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) %>% infer_data_and_decision_model( id_col, yeast_design, yeast_contrast, unc_gr, clusters = workers # I highly recommend using parallel workers/clusters ) # this will greatly reduce the time of running baldur # The top hits then looks as follows: gr_results %>% dplyr::arrange(err) #> # A tibble: 1,802 × 22 #> identifier comparison err lfc lfc_025 lfc_50 lfc_975 lfc_eff lfc_rhat sigma sigma_025 sigma_50 sigma_975 sigma_eff sigma_rhat lp #> #> 1 Cre09.g40… ng100 vs … 3.78e-213 6.18 5.79 6.18 6.56 1780. 1.00 0.123 0.0664 0.112 0.244 986. 1.00 14.5 #> 2 Cre12.g55… ng100 vs … 1.05e-178 1.61 1.50 1.61 1.72 3090. 1.00 0.0492 0.0265 0.0446 0.0979 1115. 1.00 28.9 #> 3 sp|P37302… ng100 vs … 4.15e-156 1.51 1.40 1.51 1.63 3107. 1.00 0.0464 0.0249 0.0421 0.0958 1158. 1.00 29.7 #> 4 sp|P38788… ng100 vs … 3.49e-151 1.07 0.994 1.07 1.16 3109. 1.00 0.0359 0.0190 0.0328 0.0715 1345. 1.00 32.4 #> 5 Cre14.g61… ng100 vs … 5.51e-143 -4.54 -4.89 -4.54 -4.17 2439. 1.00 0.142 0.0774 0.130 0.276 1361. 1.00 15.6 #> 6 Cre10.g42… ng100 vs … 9.10e-134 4.16 3.82 4.16 4.51 2959. 1.00 0.147 0.0828 0.135 0.275 1468. 1.00 17.7 #> 7 Cre12.g53… ng100 vs … 1.52e- 90 1.41 1.28 1.41 1.55 2580. 1.00 0.0587 0.0313 0.0533 0.117 1029. 1.00 26.8 #> 8 sp|P07259… ng100 vs … 4.38e- 87 1.14 1.02 1.14 1.26 3264. 1.00 0.0518 0.0281 0.0475 0.102 1299. 1.00 27.7 #> 9 Cre06.g30… ng100 vs … 5.85e- 86 4.20 3.80 4.21 4.62 2471. 1.00 0.150 0.0818 0.136 0.311 964. 1.00 13.9 #> 10 sp|P19882… ng100 vs … 2.18e- 85 0.883 0.794 0.882 0.976 2997. 1.00 0.0412 0.0225 0.0377 0.0794 1530. 1.00 33.1 #> # ℹ 1,792 more rows #> # ℹ 6 more variables: lp_025 , lp_50 , lp_975 , lp_eff , lp_rhat , warnings ``` Here `err` is the probability of error, i.e., the two tail-density supporting the null-hypothesis, `lfc` is the estimated log$_2$-fold change, `sigma` is the common variance, and `lp` is the log-posterior. Columns without suffix shows the mean estimate from the posterior, while the suffixes `_025`, `_50`, and `_975`, are the 2.5, 50.0, and 97.5, percentiles, respectively. The suffixes `_eff` and `_rhat` are the diagnostic variables returned by `rstan` (please see the Stan manual for details). In general, a larger `_eff` indicates a better sampling efficiency, and `_rhat` compares the mixing within chains against between the chains and should be smaller than 1.05. # 4. Running Baldur with Latent Gamma Mixture Regression estimation First we fit the LGMR model: ```r yeast_lgmr <- fit_lgmr(yeast_norm, id_col, lgmr_model, cores = min(5, workers)) ``` We can print the model with `print` and extract parameters of interest with `coef`: ```r print(yeast_lgmr, pars = c("coef", "aux")) #> #> LGMR Model #> μ = exp(-1.847 - 0.325 f(ȳ)) + κ exp(θ(7.52 - 0.474 f(ȳ))) #> #> Coefficients: #> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat #> γ_0L 7.520 0.000783 0.0504 7.424 7.486 7.520 7.554 7.621 4141 1 #> γ_ȳ 0.325 0.000361 0.0246 0.277 0.308 0.325 0.342 0.374 4659 1 #> γ_ȳL 0.474 0.000542 0.0453 0.386 0.444 0.474 0.505 0.563 6999 1 #> γ_0 -1.847 0.000731 0.0295 -1.904 -1.867 -1.847 -1.827 -1.788 1629 1 #> #> #> Auxiliary: #> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat #> α 4.48 0.007693 0.2980 3.918 4.269 4.466 4.673 5.079 1500 1 #> NRMSE 0.52 0.000405 0.0248 0.475 0.503 0.519 0.536 0.572 3761 1 # Extract the regression, alpha, and theta parameters and the NRMSE. yeast_lgmr_coef <- coef(yeast_lgmr, pars = "all") ``` Baldur allows for two ways to plot the LGMR model, `plot_lgmr_regression`, and `plot_regression_field`. The first plots lines of three cases of $\theta$, `0`, `0.5`, and `1`, and colors each peptide according to their infered $\theta$. They can be plotted accordingly: ```r plot_lgmr_regression(yeast_lgmr) plot_regression_field(yeast_lgmr, rng = 25) ``` ![plot of chunk plotting_lgmr_yeast](plotting_lgmr_yeast-1.png)![plot of chunk plotting_lgmr_yeast](plotting_lgmr_yeast-2.png) In generall, a good fit spreads out and captures the overall M-V trend. The main M-V density is captured by the common trend while the sparser part is captured by the latent trend. We can then estimate the uncertainty similar to the GR case: ```r unc_lgmr <- estimate_uncertainty(yeast_lgmr, yeast_norm, id_col, yeast_design) ``` Then running the data and decision model: ```r # Single trend lgmr_results <- yeast_lgmr %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm, id_col) %>% infer_data_and_decision_model( id_col, yeast_design, yeast_contrast, unc_lgmr, clusters = workers ) ``` # 5. Visualization of the results `baldur` have two ways of visualizing the results 1) plotting sigma vs LFC and 2) Volcano plots. To plot sigma against LFC we use `plot_sa`: ```r gr_results %>% plot_sa( alpha = .05, # Level of significance lfc = 1 # Add LFC lines ) lgmr_results %>% plot_sa( alpha = .05, # Level of significance lfc = 1 # Add LFC lines ) ``` ![plot of chunk plotting_sa_yeast](plotting_sa_yeast-1.png)![plot of chunk plotting_sa_yeast](plotting_sa_yeast-2.png) While it is hard to see with this few examples, in general a good decision is indicated by a lack of a trend between $\sigma$ and LFC. To make a volcano plot one uses `plot_volcano` in a similar fashion to `plot_sa`: ```r gr_results %>% plot_volcano( alpha = .05, # Level of significance lfc = 1 # Add LFC lines ) lgmr_results %>% plot_volcano( alpha = .05, # Level of significance lfc = 1 # Add LFC lines ) ``` ![plot of chunk plotting_volc_yeast](plotting_volc_yeast-1.png)![plot of chunk plotting_volc_yeast](plotting_volc_yeast-2.png)