Title: | Bayesian Hierarchical Modeling for Label-Free Proteomics |
---|---|
Description: | Statistical decision in proteomics data using a hierarchical Bayesian model. There are two regression models for describing the mean-variance trend, a gamma regression or a latent gamma mixture regression. The regression model is then used as an Empirical Bayes estimator for the prior on the variance in a peptide. Further, it assumes that each measurement has an uncertainty (increased variance) associated with it that is also inferred. Finally, it tries to estimate the posterior distribution (by Hamiltonian Monte Carlo) for the differences in means for each peptide in the data. Once the posterior is inferred, it integrates the tails to estimate the probability of error from which a statistical decision can be made. See Berg and Popescu for details (<doi:10.1101/2023.05.11.540411>). |
Authors: | Philip Berg [aut, cre] |
Maintainer: | Philip Berg <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.3.9000 |
Built: | 2025-03-13 06:05:50 UTC |
Source: | https://github.com/philipberg/baldur |
baldur
is a Bayesian hierarchical model for statistical decision
in proteomics data. It models the mean-variance trend with the option of
two different regression models, a gamma regression or a latent gamma
mixture regression. It then the regression model as en Empirical Bayes
estimator for the prior on the variance. Further, it assumes that
each measurement has an uncertainty (increased variance) associated with it
that it also infers. Finally, it tries to estimate the posterior
distribution (by Hamiltonian Monte Carlo) for the differences in means for
each peptide in the data. Once the posterior is inferred, it integrates the
tails to estimate the probability of error from which a statistical
decision can be made.
Maintainer: Philip Berg [email protected] (ORCID)
Berg George Popescu (2023) "Baldur: Bayesian Hierarchical Modeling for Label-Free Proteomics with Gamma Regressing Mean-Variance Trends" Molecular & Cellular Proteomics: 2023-12. https://doi.org/10.1016/j.mcpro.2023.100658
Stan Development Team (2022). RStan: the R interface to Stan. R package version 2.21.5. https://mc-stan.org
Useful links:
Calculates the mean and standard deviation of each row (peptide) and adds them as new columns. Assumes that the condition names are the names in the design matrix.
calculate_mean_sd_trends(data, design_matrix, auxiliary_columns = c())
calculate_mean_sd_trends(data, design_matrix, auxiliary_columns = c())
data |
A |
design_matrix |
A design matrix for the data (see example). |
auxiliary_columns |
Names of columns in the design matrix that does not have corresponding data in the data set. For example, this can be co-founding variables such as bashes. |
A tibble
or data.frame
with the mean and sd vectors
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast %>% # Normalize data psrn("identifier") %>% # Get mean-variance trends calculate_mean_sd_trends(design)
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast %>% # Normalize data psrn("identifier") %>% # Get mean-variance trends calculate_mean_sd_trends(design)
Here we assume that the sample mean of each condition is an estimator for the center of the mean prior. In addition, it assumes that the confidence in the prior is proportional to the variance of the peptide.
A stanmodel
that can be used in infer_data_and_decision_model.
The Stan
code for this model is given by:
empirical_bayes S4 class stanmodel 'empirical_bayes' coded as follows: data { int<lower=0> N; // number of data items int<lower=0> K; // number of conditions int C; // number of comparisons to perform matrix[N, K] x; // design matrix vector[N] y; // data matrix[K, C] c; // contrast matrix real alpha; // alpha prior for gamma real beta; // beta prior for gamma vector[N] u; // uncertainty vector[K] mu_not; // prior mu } transformed data{ vector[K] n_k; // per condition reciprocal measurements row_vector[C] n_c; // per comparison measurements matrix[K, C] abs_c; // abs of C for n_c calculation for (i in 1:K) { for (j in 1:C) { abs_c[i, j] = abs(c[i, j]); } } for (i in 1:K) { n_k[i] = 1/sum(x[,i]); } n_c = n_k' * abs_c; n_c = sqrt(n_c); n_k = sqrt(2*n_k); } parameters { vector[K] mu; // mean vector real<lower=0> sigma; // error scale array[C] real y_diff; // difference in coefficients vector[K] eta; // Error in mean vector[K] prior_mu_not; // Estimation error } transformed parameters{ row_vector[C] mu_diff = mu' * c; // differences in means vector[K] sigma_mu_not = sigma * n_k; // variance of ybars vector[C] sigma_lfc = sigma * n_c'; // variance of y_diff } model { sigma ~ gamma(alpha, beta); // variance eta ~ normal(0, 1); // NCP auxilary variable prior_mu_not ~ normal(mu_not, sigma_mu_not); // Prior mean mu ~ normal(prior_mu_not + sigma * eta, sigma); // mean y ~ normal(x * mu, sigma * u); // data model y_diff ~ normal(mu_diff, sigma_lfc); // difference statistic }
Estimates the hyperparameters for the Bayesian data and decision
model. estimate_gamma_hyperparameters
is a wrapper that adds new columns
to the data (one for alpha and one for betas). Note that for lgmr
objects, the estimate_beta
function assumes that the data is ordered as
when the model was fitted. If this is not the case, theta's will be
incorrectly matched with peptides—resulting in wrong estimates of beta
parameters. On the other hand, estimate_gamma_hyperparameters
will
temporarily sort the data as when fitted and the sort it back as it
was input to the function.
estimate_gamma_hyperparameters(reg, data, ...) ## S3 method for class 'glm' estimate_gamma_hyperparameters(reg, data, ...) ## S3 method for class 'lgmr' estimate_gamma_hyperparameters(reg, data, id_col, ...) estimate_beta(reg, mean, ...) ## S3 method for class 'glm' estimate_beta(reg, mean, alpha, ...) ## S3 method for class 'lgmr' estimate_beta(reg, mean, m, s, ...)
estimate_gamma_hyperparameters(reg, data, ...) ## S3 method for class 'glm' estimate_gamma_hyperparameters(reg, data, ...) ## S3 method for class 'lgmr' estimate_gamma_hyperparameters(reg, data, id_col, ...) estimate_beta(reg, mean, ...) ## S3 method for class 'glm' estimate_beta(reg, mean, alpha, ...) ## S3 method for class 'lgmr' estimate_beta(reg, mean, m, s, ...)
reg |
A |
data |
A |
... |
Currently not in use |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
mean |
The mean value of the peptide |
alpha |
The alpha parameter of the peptide |
m |
The mean of the means |
s |
The sd of the means |
estimate_gamma_hyperparameters
returns a tibble
or data.frame
with the alpha,beta hyperparameters estimates as new columns.
estimate_beta
returns estimates of the beta parameter(s)
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize data yeast_norm <- yeast %>% psrn("identifier") %>% # Get mean-variance trends calculate_mean_sd_trends(design) # Fit gamma regression (could also have been a lgmr model) gam_reg <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate priors gam_reg %>% estimate_gamma_hyperparameters(yeast_norm) # Can also explicitly estimate the beta parameters # Note this is order sensitive. estimate_beta(gam_reg, yeast_norm$mean, 1/summary(gam_reg)$dispersion)
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize data yeast_norm <- yeast %>% psrn("identifier") %>% # Get mean-variance trends calculate_mean_sd_trends(design) # Fit gamma regression (could also have been a lgmr model) gam_reg <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate priors gam_reg %>% estimate_gamma_hyperparameters(yeast_norm) # Can also explicitly estimate the beta parameters # Note this is order sensitive. estimate_beta(gam_reg, yeast_norm$mean, 1/summary(gam_reg)$dispersion)
Estimates the measurement uncertainty for each data point using a Gamma regression. Calculated as the expected standard deviation for each measurement:
where are the regression parameters and
is a function
describing the mean relationship between
and
.
estimate_uncertainty(reg, data, id_col, design_matrix) ## S3 method for class 'glm' estimate_uncertainty(reg, data, id_col, design_matrix) ## S3 method for class 'lgmr' estimate_uncertainty(reg, data, id_col, design_matrix)
estimate_uncertainty(reg, data, id_col, design_matrix) ## S3 method for class 'glm' estimate_uncertainty(reg, data, id_col, design_matrix) ## S3 method for class 'lgmr' estimate_uncertainty(reg, data, id_col, design_matrix)
reg |
A |
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
design_matrix |
Cell mean design matrix for the data |
A matrix with the uncertainty
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn("identifier") %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty estimate_uncertainty(gam, yeast_norm, 'identifier', design)
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn("identifier") %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty estimate_uncertainty(gam, yeast_norm, 'identifier', design)
fit_gamma_regression
returns a glm
object containing the
gamma regression for the mean-variance trend.
fit_gamma_regression(data, formula = sd ~ mean, ...)
fit_gamma_regression(data, formula = sd ~ mean, ...)
data |
a |
formula |
a formula describing the model |
... |
only for compatibility with other functions |
fit_gamma_regression
returns a glm object
# Generate a design matrix for the data design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) # Set correct colnames, this is important for calculate_mean_sd_trends colnames(design) <- paste0("ng", c(50, 100)) # Normalize and log-transform the data yeast_norm <- psrn(yeast, "identifier") %>% # Add row means and variances calculate_mean_sd_trends(design) # Fit gamma regression model for the mean-variance trends gamma_model <- fit_gamma_regression(yeast_norm, sd ~ mean)
# Generate a design matrix for the data design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) # Set correct colnames, this is important for calculate_mean_sd_trends colnames(design) <- paste0("ng", c(50, 100)) # Normalize and log-transform the data yeast_norm <- psrn(yeast, "identifier") %>% # Add row means and variances calculate_mean_sd_trends(design) # Fit gamma regression model for the mean-variance trends gamma_model <- fit_gamma_regression(yeast_norm, sd ~ mean)
See lgmr_model for model details.
fit_lgmr( data, id_col, model = lgmr_model, iter = 6000, warmup = 1500, chains = 5, cores = 1, return_stanfit = FALSE, simplify = FALSE, ... ) ## S3 method for class 'lgmr' print( x, simplify = x$simplify, pars = c("auxiliary", "coefficients"), digits = 3, ... ) ## S3 method for class 'lgmr' coef(object, simplify = FALSE, pars = c("coefficients", "auxiliary"), ...)
fit_lgmr( data, id_col, model = lgmr_model, iter = 6000, warmup = 1500, chains = 5, cores = 1, return_stanfit = FALSE, simplify = FALSE, ... ) ## S3 method for class 'lgmr' print( x, simplify = x$simplify, pars = c("auxiliary", "coefficients"), digits = 3, ... ) ## S3 method for class 'lgmr' coef(object, simplify = FALSE, pars = c("coefficients", "auxiliary"), ...)
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.). Has to be a unique identifier for each feature. |
model |
Defaults to lgmr_model (see it for details on the model), can
also be an user supplied |
iter |
Total number of samples to draw |
warmup |
Number of warm-up samples to draw |
chains |
Number of chains to run |
cores |
Number of cores to use per chain |
return_stanfit |
Should the |
simplify |
Should only the mean estimates of the posterior be returned? |
... |
Additional arguments to |
x , object
|
An |
pars |
If you want to print/extract the regression coefficients, theta, auxiliary (alpha and NRMSE), or all |
digits |
Number of digits to print |
A fitted lgmr
model.
# Define design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize data, calculate M-V trend, and fit LGMR model yeast_lgmr <- yeast %>% # Remove missing values tidyr::drop_na() %>% # Normalize psrn("identifier") %>% # Add the mean-variance trends calculate_mean_sd_trends(design) %>% # Fit the model fit_lgmr("identifier") # Print everything except thetas print(yeast_lgmr, pars = c("coefficients", "auxiliary")) # Extract the mean of the model parameters posterior yeast_lgmr_pars <- coef(yeast_lgmr, pars = 'all', simplify = TRUE)
# Define design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize data, calculate M-V trend, and fit LGMR model yeast_lgmr <- yeast %>% # Remove missing values tidyr::drop_na() %>% # Normalize psrn("identifier") %>% # Add the mean-variance trends calculate_mean_sd_trends(design) %>% # Fit the model fit_lgmr("identifier") # Print everything except thetas print(yeast_lgmr, pars = c("coefficients", "auxiliary")) # Extract the mean of the model parameters posterior yeast_lgmr_pars <- coef(yeast_lgmr, pars = 'all', simplify = TRUE)
Function to sample the posterior of the Bayesian data and
decision model. It first produces the needed inputs for Stan's sampling()
for each peptide (or protein, PTM, etc.). It then runs the sampling for the
data and decision model. From the posterior, it then collects estimates and
sampling statistics from the posterior of data model and integrates the
decision distribution D. It then returns a tibble()
with all the
information for each peptide's posterior (see Value below). There are
major time gains to be made by running this procedure in parallel.
infer_data_and_decision_model
has an efficient wrapper around
multidplyr
. This will let you to evenly distribute all peptides evenly to
each worker. E.g., two workers will each run half of the peptides in
parallel.
infer_data_and_decision_model( data, id_col, design_matrix, contrast_matrix, uncertainty_matrix, stan_model = empirical_bayes, clusters = 1, h_not = 0, auxiliary_columns = c(), ... )
infer_data_and_decision_model( data, id_col, design_matrix, contrast_matrix, uncertainty_matrix, stan_model = empirical_bayes, clusters = 1, h_not = 0, auxiliary_columns = c(), ... )
data |
A |
id_col |
A character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
design_matrix |
A design matrix for the data. For the empirical_bayes prior only mean models are allowed (see example). For the weakly_informative prior more complicated design can be used. |
contrast_matrix |
A contrast matrix of the decisions to test. Columns
should sum to |
uncertainty_matrix |
A matrix of observation specific uncertainties |
stan_model |
Which Bayesian model to use. Defaults to empirical_bayes but also allows weakly_informative, or an user supplied function. |
clusters |
The number of parallel threads/workers to run on. |
h_not |
The value of the null hypothesis for the difference in means |
auxiliary_columns |
Names of columns in the design matrix that does not have corresponding data in the data set. For example, this can be co-founding variables such as bashes. |
... |
Additional arguments passed to
|
The data model of Baldur assumes that the observations of a peptide,
, is a normally distributed with a standard deviation,
, common to all measurements. In addition, it assumes that each
measurement has a unique uncertainty
. It then models all
measurements in the same condition with a common mean
. It then
assumes that the common variation of the peptide is caused by the variation
in the
As such, it models
with the common variance
and a non-centered parametrization for condition level
effects.
It then assumes to be gamma distributed with hyperparameters
infered from either a gamma regression fit_gamma_regression or a latent
gamma mixture regression fit_lgmr.
For details on the two priors for see empirical_bayes or
weakly_informative. Baldur then builds a posterior distribution of the
difference(s) in means for contrasts of interest. In addition, Baldur
increases the precision of the decision as the number of measurements
increase. This is done by weighting the sample size with the contrast
matrix. To this end, Baldur limits the possible contrast of interest such
that each column must sum to zero, and the absolute value of each column
must sum to two. That is, only mean comparisons are allowed.
where is a contrast matrix of interest and
is the contrast of the c:th condition in the m:th contrast of
interest, and
is the number of measurements in the c:th
condition. Baldur then integrates the tails of
to
determine the probability of error.
where is the null hypothesis for the difference in means,
is the CDF of the standard normal,
is the posterior mean of
,
is the
posterior precision of
, and
is the
Hadamard product.
A tibble()
or data.frame()
annotated with statistics of the
posterior and p(error). err
is the probability of error, i.e., the two
tail-density supporting the null-hypothesis, lfc
is the estimated
-fold change,
sigma
is the common variance, and lp
is the
log-posterior. Columns without suffix shows the mean estimate from the
posterior, while the suffixes _025
, _50
, and _975
, are the 2.5, 50.0,
and 97.5, percentiles, respectively. The suffixes _eff
and _rhat
are
the diagnostic variables returned by Stan
. In general, a larger _eff
indicates effective sample size and _rhat
indicates the mixing within
chains and between the chains and should be smaller than 1.05 (please see
the Stan manual for more details). In addition it returns a column
warnings
with potential warnings given by the sampler.
# (Please see the vignette for a tutorial) # Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn('identifier') %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design) yeast_norm <- gam %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) # Setup contrast matrix contrast <- matrix(c(-1, 1), 2) yeast_norm %>% head() %>% # Just running a few for the example infer_data_and_decision_model( 'identifier', design, contrast, unc, clusters = 1 # I highly recommend increasing the number of parallel workers/clusters # this will greatly reduce the time of running baldur )
# (Please see the vignette for a tutorial) # Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn('identifier') %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design) yeast_norm <- gam %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) # Setup contrast matrix contrast <- matrix(c(-1, 1), 2) yeast_norm %>% head() %>% # Just running a few for the example infer_data_and_decision_model( 'identifier', design, contrast, unc, clusters = 1 # I highly recommend increasing the number of parallel workers/clusters # this will greatly reduce the time of running baldur )
Baldur has the option to use the Latent Gamma Regression Model
(LGMR). This model attempts to estimate local Mean-Variance (M-V) trend for
each peptide (or protein, PTM, etc.). To this end is describes the M-V
trend as a regression with a common link function and a latent one. While
there may not be a latent trend in real data, it allows for a mathematical
description that can estimate the local trends of each peptide. The model
describes the standard deviation (S) as a gamma random variable with mean
dependency described by the sample mean ():
Here, is the common trend of the
regression. Then, the mixing variable,
, describes
the proportion of the mixture that each peptide has, and
is
just some small constant such that when
is zero the latent
term is small.
Next we will describe the hierarchical prior setup for the
regression variables.
For
baldur
uses a half-Cauchy as a weakly informative prior:
For the latent contribution to the i:th peptide, , has an uniform distribution:
Further, it can be seen that Baldur assumes
that S always decreases (on average) with the sample mean. To this end, both
slopes () are defined on the real
positive line. Hence, we used Half-Normal (HN)
distributions to describe the slope parameters:
For the intercepts, we assume a
standard normal prior for the common intercept. Then, we use a skew-normal to
model the latent intercept. The reason for this is two-fold, one,
will decrease the value of the latent term and, two, to push the
latent trend upwards. The latter reason is so that the latent intercept is
larger than the common and so that the priors prioritize a shift in intercept
over a increase in slope.
For the intercepts, Baldur uses a standard normal prior for the common intercept.
While for the latent trend, it uses a skew-normal (SN) to push up the second
trend and to counteract the shrinkage of .
A stanmodel
that can be used in fit_lgmr.
The Stan
code for this model is given by:
lgmr_model S4 class stanmodel 'lgmr_model' coded as follows: functions { // mu vector reg_function( vector x, vector theta, real I, real I_L, real S, real S_L, int N ) { vector[N] exp_beta = .001*exp(theta .* (I_L - S_L*x)); exp_beta += exp(I - S*x); return exp_beta; } } data { int<lower=0> N; // Number of observations vector<lower=0>[N] y; // sd vector[N] x; // mean } transformed data { real v_y = variance(y); // for NRMSE vector[N] x_star = (x - mean(x))/sd(x); // f(y_bar) vector[3] lb = [0, 0, negative_infinity()]'; // Boundaries normal coefs. } parameters { real<lower = 0> alpha; // Shape vector<lower = lb>[3] eta; // For S, S_L, I vector<lower = 0, upper = 1>[N] theta; // Mixture parameter real I_L; // Intercept Latent } transformed parameters { real<lower = 0> S = eta[1]; // Slope common real<lower = 0> S_L = eta[2]; // Slope Latent real I = eta[3]; // Intercep common } model { //Priors alpha ~ cauchy(0, 25); eta ~ std_normal(); I_L ~ skew_normal(2, 15, 35); theta ~ beta(1, 1); { vector[N] exp_beta = reg_function(x_star, theta, I, I_L, S, S_L, N); // Likelihood y ~ gamma(alpha, alpha ./ exp_beta); } } generated quantities { // NRMSE calculations real nrmse; { vector[N] se = reg_function(x_star, theta, I, I_L, S, S_L, N); se -= y; se = square(se); nrmse = mean(se) / v_y; } nrmse = sqrt(nrmse); }
Options to plot the LGMR model.
plot_lgmr_regression
will plot the data
colored by the amount of latent trend they have as well as the two extreme
regression cases when is zero or one.
plot_regression_field
will plot the local regression trend for each data point as a vector field and
color the vector based on the derivative at the mean of the peptide.
plot_lgmr_regression(model) plot_regression_field(model, n = 10, rng = 10)
plot_lgmr_regression(model) plot_regression_field(model, n = 10, rng = 10)
model |
An LGMR model object |
n |
Number of interpolation points for each peptides regression |
rng |
The proportional range of each peptides regression. E.g., a value of 10 will make each line span 1 % of the x-axis. |
A ggplot object
#' # Define design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize data, calculate M-V trend, and fit LGMR model yeast_lgmr <- yeast %>% # Remove missing values tidyr::drop_na() %>% # Normalize psrn("identifier") %>% # Add the mean-variance trends calculate_mean_sd_trends(design) %>% # Fit the model fit_lgmr("identifier") # Print everything except thetas print(yeast_lgmr, pars = c("coefficients", "auxiliary")) # Extract the mean of the model parameters posterior yeast_lgmr_pars <- coef(yeast_lgmr, pars = 'all', simplify = TRUE) plot_lgmr_regression(yeast_lgmr) plot_regression_field(yeast_lgmr)
#' # Define design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize data, calculate M-V trend, and fit LGMR model yeast_lgmr <- yeast %>% # Remove missing values tidyr::drop_na() %>% # Normalize psrn("identifier") %>% # Add the mean-variance trends calculate_mean_sd_trends(design) %>% # Fit the model fit_lgmr("identifier") # Print everything except thetas print(yeast_lgmr, pars = c("coefficients", "auxiliary")) # Extract the mean of the model parameters posterior yeast_lgmr_pars <- coef(yeast_lgmr, pars = 'all', simplify = TRUE) plot_lgmr_regression(yeast_lgmr) plot_regression_field(yeast_lgmr)
Generates a scatter plot with the gamma regressions of the mean-variance trends without partitioning
plot_gamma(data)
plot_gamma(data)
data |
The data to use for producing the plots. |
a plot with the estimated mean-variance trend
# Produce a design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize and log transform the data yeast_norm <- psrn(yeast, "identifier") # Generate the plot yeast_norm %>% calculate_mean_sd_trends(design) %>% plot_gamma()
# Produce a design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize and log transform the data yeast_norm <- psrn(yeast, "identifier") # Generate the plot yeast_norm %>% calculate_mean_sd_trends(design) %>% plot_gamma()
Generates a scatter plot with the gamma regressions of the mean-variance
trend. Can either be ran directly with plot_gamma_regression
and inputing
a design matrix, or with plot_gamma
if the M-V trend has been added to the
data with calculate_mean_sd_trends()
.
plot_gamma_regression(data, design)
plot_gamma_regression(data, design)
data |
The data to use for producing the plots. |
design |
A design matrix as produced by
|
a plot with the mean-variance trend before partitioning on the left side, and the right side after.
# Produce a design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize and log transform the data yeast %>% # Remove missing data # Note that this could be replaced with imputation tidyr::drop_na() %>% # Normalize psrn("identifier") %>% plot_gamma_regression(design)
# Produce a design matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) # Normalize and log transform the data yeast %>% # Remove missing data # Note that this could be replaced with imputation tidyr::drop_na() %>% # Normalize psrn("identifier") %>% plot_gamma_regression(design)
plot_sa
returns a ggplot
with a graphical representation between the log
fold-change and sigma.
plot_sa(results, alpha = 0.05, lfc = NULL)
plot_sa(results, alpha = 0.05, lfc = NULL)
results |
Output generated by
|
alpha |
Significance cut-off; used to draw a line indicating where significance starts |
lfc |
LFC cut-off; used to draw lines for |
plot_sa
returns a ggplot
object
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn('identifier') %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty unc <- estimate_uncertainty(gam, yeast_norm, "identifier", design) yeast_norm <- gam %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) # Setup contrast matrix contrast <- matrix(c(-1, 1), 2) results <- yeast_norm %>% head() %>% # Just run a few for the example infer_data_and_decision_model( 'identifier', design, contrast, unc, clusters = 1 # I highly recommend increasing the number of parallel workers/clusters # this will greatly reduce the speed of running baldur ) # Plot with alpha = 0.05 plot_sa(results, alpha = 0.05) # Plot with alpha = 0.01 and show LFC = 1 plot_sa(results, alpha = 0.01, 1)
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn('identifier') %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty unc <- estimate_uncertainty(gam, yeast_norm, "identifier", design) yeast_norm <- gam %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) # Setup contrast matrix contrast <- matrix(c(-1, 1), 2) results <- yeast_norm %>% head() %>% # Just run a few for the example infer_data_and_decision_model( 'identifier', design, contrast, unc, clusters = 1 # I highly recommend increasing the number of parallel workers/clusters # this will greatly reduce the speed of running baldur ) # Plot with alpha = 0.05 plot_sa(results, alpha = 0.05) # Plot with alpha = 0.01 and show LFC = 1 plot_sa(results, alpha = 0.01, 1)
plot_volcano
returns a ggplot
with a graphical representation between the
-log10(err) and LFC.
plot_volcano(results, alpha = 0.05, lfc = NULL)
plot_volcano(results, alpha = 0.05, lfc = NULL)
results |
Output generated by
|
alpha |
Significance cut-off; used to draw a line indicating where significance starts |
lfc |
LFC cut-off; used to draw lines for |
plot_volcano
returns a ggplot
object
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn('identifier') %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design) yeast_norm <- gam %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) # Setup contrast matrix contrast <- matrix(c(-1, 1), 2) results <- yeast_norm %>% head() %>% # Just run a few for the example infer_data_and_decision_model( 'identifier', design, contrast, unc, clusters = 1 # I highly recommend increasing the number of parallel workers/clusters # this will greatly reduce the speed of running baldur ) # Plot indicating where alpha = 0.05 plot_volcano(results, alpha = 0.05) # Plot indicating where alpha = 0.01 and where LFC = 1 plot_volcano(results, alpha = 0.01, 1)
# Setup model matrix design <- model.matrix(~ 0 + factor(rep(1:2, each = 3))) colnames(design) <- paste0("ng", c(50, 100)) yeast_norm <- yeast %>% # Remove missing data tidyr::drop_na() %>% # Normalize data psrn('identifier') %>% # Add mean-variance trends calculate_mean_sd_trends(design) # Fit the gamma regression gam <- fit_gamma_regression(yeast_norm, sd ~ mean) # Estimate each data point's uncertainty unc <- estimate_uncertainty(gam, yeast_norm, 'identifier', design) yeast_norm <- gam %>% # Add hyper-priors for sigma estimate_gamma_hyperparameters(yeast_norm) # Setup contrast matrix contrast <- matrix(c(-1, 1), 2) results <- yeast_norm %>% head() %>% # Just run a few for the example infer_data_and_decision_model( 'identifier', design, contrast, unc, clusters = 1 # I highly recommend increasing the number of parallel workers/clusters # this will greatly reduce the speed of running baldur ) # Plot indicating where alpha = 0.05 plot_volcano(results, alpha = 0.05) # Plot indicating where alpha = 0.01 and where LFC = 1 plot_volcano(results, alpha = 0.01, 1)
This function generates a pseudo-reference by taking the geometric mean of each peptide across all samples. Each peptide in each sample is then divided by the pseudo-reference. Then, the median ratio of all ratios is used as an estimate to use for normalizing differences in loading concentration. All features in each sample is then divided by their corresponding estimate. All estimates are based on features without missing values. For details see Anders and Huber (2010).
psrn(data, id_col, log = TRUE, load_info = FALSE, target = NULL)
psrn(data, id_col, log = TRUE, load_info = FALSE, target = NULL)
data |
data.frame |
id_col |
a character for the name of the column containing the name of the features in data (e.g., peptides, proteins, etc.) |
log |
boolean variable indicating if the data should be log transformed after normalization |
load_info |
logical; should the load information be output? |
target |
target columns to normalize, supports
|
data frame with normalized values if load_info=FALSE
, if it is TRUE
then it returns a list with two tibbles. One tibble containing the
normalized data and one containing the loading info as well as the
estimated normalization factors.
https://www.nature.com/articles/npre.2010.4282.1
Simon Anders, Wolfgang Huber (2010). “Differential expression analysis for sequence count data.” Nature Precedings, 1–1.
yeast_psrn <- psrn(yeast, "identifier") yeast_psrn_with_load <- psrn(yeast, "identifier", load_info = TRUE) yeast_ng50_only <- psrn(yeast, "identifier", target = matches('ng50'))
yeast_psrn <- psrn(yeast, "identifier") yeast_psrn_with_load <- psrn(yeast, "identifier", load_info = TRUE) yeast_ng50_only <- psrn(yeast, "identifier", target = matches('ng50'))
A dataset containing quantification of peptides using Progenesis. True
positives peptides spiked-in from the Universal Proteomics Standard Set 1
(UPS1) at three different concentrations and true negatives from
Chlamydomonas reinhardtii with the same concentration in all samples.
You can find true positives with stringr::str_detect(ups$identifier, 'UPS')
. For details see Berg et al. (2019) and
if you use this dataset please cite the same paper.
ups
ups
A data frame with 10599 rows and 13 variables:
id column for features, true positives contains UPS and true negatives contains Cre
Technical replicates with true positives spiked-in from 25 fmol UPS1 peptides
Technical replicate with true positives spiked-in from 50 fmol UPS1 peptides
Technical replicate with true positives spiked-in from 100 fmol UPS1 peptides
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2619-6
Philip Berg, Evan W McConnell, Leslie M Hicks, Sorina C Popescu, George V Popescu (2019). “Evaluation of linear models and missing value imputation for the analysis of peptide-centric proteomics.” BMC bioinformatics, 20(2), 7–16.
Here we will model the mean of the prior with a weakly
informative (WI) prior. We will assume that, in essence, nothing is know
about the mean. As such, for the WI prior, we use a normal prior on
centered at zero and with a very large variance.
A stanmodel
that can be used in infer_data_and_decision_model.
The Stan
code for this model is given by:
weakly_informative S4 class stanmodel 'weakly_informative' coded as follows: data { int<lower=0> N; // number of data items int<lower=0> K; // number of conditions int C; // number of comparisons to perform matrix[N, K] x; // design matrix vector[N] y; // data matrix[K, C] c; // contrast matrix real alpha; // alpha prior for gamma real beta; // beta prior for gamma vector[N] u; // uncertainty } transformed data{ vector[K] n_k; // per condition measurements row_vector[C] n_c; // per comparison measurements matrix[K, C] abs_c; // abs of C for n_c calculation for (i in 1:K) { for (j in 1:C) { abs_c[i, j] = abs(c[i, j]); } } for (i in 1:K) { n_k[i] = 1/sum(x[,i]); } n_c = n_k' * abs_c; n_c = sqrt(n_c); } parameters { vector[K] mu; // coefficients for predictors real<lower=0> sigma; // error scale array[C] real y_diff; // difference in coefficients vector[K] eta; // Error in mean vector[K] prior_mu_not; // Estimation error } transformed parameters{ row_vector[C] mu_diff = mu' * c; // differences in means vector[C] sigma_lfc = sigma * n_c'; // variance of y_diff } model { sigma ~ gamma(alpha, beta); // variance eta ~ normal(0, 1); // NCP auxilary variable prior_mu_not ~ normal(0, 10); // prior mean mu ~ normal(prior_mu_not + sigma*eta, sigma); // mean y ~ normal(x * mu, sigma*u); // data model y_diff ~ normal(mu_diff, sigma_lfc); // difference statistic }
A dataset containing quantification of reversibly oxidized cysteines using
Progenesis. True positives cysteines spiked-in from yeast at two different
concentrations and true negatives from Chlamydomonas reinhardtii with
the same concentration in all samples. To identify true positives one can use
stringr::str_detect(yeast$identifier, 'YEAST')
. For details see
Berg et al. (2019) and if you use this dataset
please cite the same paper.
yeast
yeast
A data frame with 2235 rows and 7 variables:
id column for features, true positives contains YEAST and true negatives contains Cre
Biological replicates with true positives spiked-in from 50 ng yeast cells
Biological replicates with true positives spiked-in from 100 ng yeast cells
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2619-6
Philip Berg, Evan W McConnell, Leslie M Hicks, Sorina C Popescu, George V Popescu (2019). “Evaluation of linear models and missing value imputation for the analysis of peptide-centric proteomics.” BMC bioinformatics, 20(2), 7–16.