Reference
remove-background
Command Line Options
Remove background RNA from count matrix.
usage: cellbender remove-background [-h] --input INPUT_FILE --output
OUTPUT_FILE [--cuda]
[--checkpoint INPUT_CHECKPOINT_TARBALL]
[--force-use-checkpoint]
[--expected-cells EXPECTED_CELL_COUNT]
[--total-droplets-included TOTAL_DROPLETS]
[--force-cell-umi-prior FORCE_CELL_UMI_PRIOR]
[--force-empty-umi-prior FORCE_EMPTY_UMI_PRIOR]
[--model {naive,simple,ambient,swapping,full}]
[--epochs EPOCHS]
[--low-count-threshold LOW_COUNT_THRESHOLD]
[--z-dim Z_DIM]
[--z-layers Z_HIDDEN_DIMS [Z_HIDDEN_DIMS ...]]
[--training-fraction TRAINING_FRACTION]
[--empty-drop-training-fraction FRACTION_EMPTIES]
[--ignore-features BLACKLISTED_GENES [BLACKLISTED_GENES ...]]
[--fpr FPR [FPR ...]]
[--exclude-feature-types EXCLUDE_FEATURES [EXCLUDE_FEATURES ...]]
[--projected-ambient-count-threshold AMBIENT_COUNTS_IN_CELLS_LOW_LIMIT]
[--learning-rate LEARNING_RATE]
[--checkpoint-mins CHECKPOINT_MIN]
[--final-elbo-fail-fraction FINAL_ELBO_FAIL_FRACTION]
[--epoch-elbo-fail-fraction EPOCH_ELBO_FAIL_FRACTION]
[--num-training-tries NUM_TRAINING_TRIES]
[--learning-rate-retry-mult LEARNING_RATE_RETRY_MULT]
[--posterior-batch-size POSTERIOR_BATCH_SIZE]
[--posterior-regularization {PRq,PRmu,PRmu_gene}]
[--alpha PRQ_ALPHA] [--q CDF_THRESHOLD_Q]
[--estimator {map,mean,cdf,sample,mckp}]
[--estimator-multiple-cpu]
[--constant-learning-rate]
[--cpu-threads N_THREADS] [--debug]
[--truth TRUTH_FILE]
Named Arguments
- --input
Data file on which to run tool. Data must be un-filtered: it should include empty droplets. The following input formats are supported: CellRanger v2 and v3 (.h5 or the directory that contains the .mtx file), Dropseq DGE (.txt or .txt.gz), BD Rhapsody (.csv or .csv.gz), AnnData (.h5ad), and Loom (.loom).
- --output
Output file location (the path must exist, and the file name must have .h5 extension).
- --cuda
Including the flag –cuda will run the inference on a GPU.
Default: False
- --checkpoint
Checkpoint tarball produced by v0.3.0+ of CellBender remove-background. If present, and the workflow hashes match, training will restart from this checkpoint.
Default: “ckpt.tar.gz”
- --force-use-checkpoint
Normally, checkpoints can only be used if the CellBender code and certain input args match exactly. This flag allows you to bypass this requirement. An example use would be to create a new output using a checkpoint from a run of v0.3.1, a redacted version with faulty output count matrices. If you use this flag, ensure that the input file and the checkpoint match, because CellBender will not check.
Default: False
- --expected-cells
Number of cells expected in the dataset (a rough estimate within a factor of 2 is sufficient).
- --total-droplets-included
The number of droplets from the rank-ordered UMI plot that will have their cell probabilities inferred as an output. Include the droplets which might contain cells. Droplets beyond TOTAL_DROPLETS_INCLUDED should be ‘surely empty’ droplets.
- --force-cell-umi-prior
Ignore CellBender’s heuristic prior estimation, and use this prior for UMI counts in cells.
- --force-empty-umi-prior
Ignore CellBender’s heuristic prior estimation, and use this prior for UMI counts in empty droplets.
- --model
Possible choices: naive, simple, ambient, swapping, full
Which model is being used for count data. ‘naive’ subtracts the estimated ambient profile. ‘simple’ does not model either ambient RNA or random barcode swapping (for debugging purposes – not recommended). ‘ambient’ assumes background RNA is incorporated into droplets. ‘swapping’ assumes background RNA comes from random barcode swapping (via PCR chimeras). ‘full’ uses a combined ambient and swapping model.
Default: “full”
- --epochs
Number of epochs to train.
Default: 150
- --low-count-threshold
Droplets with UMI counts below this number are completely excluded from the analysis. This can help identify the correct prior for empty droplet counts in the rare case where empty counts are extremely high (over 200).
Default: 5
- --z-dim
Dimension of latent variable z.
Default: 64
- --z-layers
Dimension of hidden layers in the encoder for z.
Default: [512]
- --training-fraction
Training detail: the fraction of the data used for training. The rest is never seen by the inference algorithm. Speeds up learning.
Default: 0.9
- --empty-drop-training-fraction
Training detail: the fraction of the training data each epoch that is drawn (randomly sampled) from surely empty droplets.
Default: 0.2
- --ignore-features
Integer indices of features to ignore entirely. In the output count matrix, the counts for these features will be unchanged.
Default: []
- --fpr
Target ‘delta’ false positive rate in [0, 1). Use 0 for a cohort of samples which will be jointly analyzed for differential expression. A false positive is a true signal count that is erroneously removed. More background removal is accompanied by more signal removal at high values of FPR. You can specify multiple values, which will create multiple output files.
Default: [0.01]
- --exclude-feature-types
Feature types to ignore during the analysis. These features will be left unchanged in the output file.
Default: []
- --projected-ambient-count-threshold
Controls how many features are included in the analysis, which can lead to a large speedup. If a feature is expected to have less than PROJECTED_AMBIENT_COUNT_THRESHOLD counts total in all cells (summed), then that gene is excluded, and it will be unchanged in the output count matrix. For example, PROJECTED_AMBIENT_COUNT_THRESHOLD = 0 will include all features which have even a single count in any empty droplet.
Default: 0.1
- --learning-rate
Training detail: lower learning rate for inference. A OneCycle learning rate schedule is used, where the upper learning rate is ten times this value. (For this value, probably do not exceed 1e-3).
Default: 0.0001
- --checkpoint-mins
Checkpoint file will be saved periodically, with this many minutes between each checkpoint.
Default: 7.0
- --final-elbo-fail-fraction
Training is considered to have failed if (best_test_ELBO - final_test_ELBO)/(best_test_ELBO - initial_test_ELBO) > FINAL_ELBO_FAIL_FRACTION. Training will automatically re-run if –num-training-tries > 1. By default, will not fail training based on final_training_ELBO.
- --epoch-elbo-fail-fraction
Training is considered to have failed if (previous_epoch_test_ELBO - current_epoch_test_ELBO)/(previous_epoch_test_ELBO - initial_train_ELBO) > EPOCH_ELBO_FAIL_FRACTION. Training will automatically re-run if –num-training-tries > 1. By default, will not fail training based on epoch_training_ELBO.
- --num-training-tries
Number of times to attempt to train the model. At each subsequent attempt, the learning rate is multiplied by LEARNING_RATE_RETRY_MULT.
Default: 1
- --learning-rate-retry-mult
Learning rate is multiplied by this amount each time a new training attempt is made. (This parameter is only used if training fails based on EPOCH_ELBO_FAIL_FRACTION or FINAL_ELBO_FAIL_FRACTION and NUM_TRAINING_TRIES is > 1.)
Default: 0.2
- --posterior-batch-size
Training detail: size of batches when creating the posterior. Reduce this to avoid running out of GPU memory creating the posterior (will be slower).
Default: 128
- --posterior-regularization
Possible choices: PRq, PRmu, PRmu_gene
Posterior regularization method. (For experts: not required for normal usage, see documentation). PRq is approximate quantile-targeting. PRmu is approximate mean-targeting aggregated over genes (behavior of v0.2.0). PRmu_gene is approximate mean-targeting per gene.
- --alpha
Tunable parameter alpha for the PRq posterior regularization method (not normally used: see documentation).
- --q
Tunable parameter q for the CDF threshold estimation method (not normally used: see documentation).
- --estimator
Possible choices: map, mean, cdf, sample, mckp
Output denoised count estimation method. (For experts: not required for normal usage, see documentation).
Default: “mckp”
- --estimator-multiple-cpu
Including the flag –estimator-multiple-cpu will use more than one CPU to compute the MCKP output count estimator in parallel (does nothing for other estimators).
Default: False
- --constant-learning-rate
Including the flag –constant-learning-rate will use the ClippedAdam optimizer instead of the OneCycleLR learning rate schedule, which is the default. Learning is faster with the OneCycleLR schedule. However, training can easily be continued from a checkpoint for more epochs than the initial command specified when using ClippedAdam. On the other hand, if using the OneCycleLR schedule with 150 epochs specified, it is not possible to pick up from that final checkpoint and continue training until 250 epochs.
Default: False
- --cpu-threads
Number of threads to use when pytorch is run on CPU. Defaults to the number of logical cores.
- --debug
Including the flag –debug will log extra messages useful for debugging.
Default: False
- --truth
This is only used by developers for report generation. Truth h5 file (for simulated data only).
WDL Workflow Options
A WDL script is available as cellbender_remove_background.wdl,
and it can be used to run cellbender remove-background
from
Terra or from your own
Cromwell instance. The WDL is designed to
make use of an Nvidia Tesla T4 GPU on Google Cloud architecture.
As of v0.3.0, the WDL uses preemptible instances that are a fraction of the cost, and uses automatic restarting from checkpoints so that work is not lost.
In addition to the above command line options, the workflow has its own set of input options, which are described in detail here.
Output h5 file format
An h5 output file (this one is from the tutorial) can be examined in detail using PyTables in python:
# import PyTables
import tables
# open the file and take a look at its contents
with tables.open_file('tiny_output.h5', 'r') as f:
print(f)
/home/jupyter/tiny_output.h5 (File) 'CellBender remove-background output'
Last modif.: 'Wed May 4 19:46:16 2022'
Object Tree:
/ (RootGroup) 'CellBender remove-background output'
/droplet_latents (Group) 'Latent variables per droplet'
/droplet_latents/background_fraction (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/barcode_indices_for_latents (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/cell_probability (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/cell_size (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/droplet_efficiency (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/gene_expression_encoding (CArray(1000, 100), shuffle, zlib(1)) ''
/global_latents (Group) 'Global latent variables'
/global_latents/ambient_expression (Array(100,)) ''
/global_latents/cell_size_lognormal_std (Array()) ''
/global_latents/empty_droplet_size_lognormal_loc (Array()) ''
/global_latents/empty_droplet_size_lognormal_scale (Array()) ''
/global_latents/posterior_regularization_lambda (Array()) ''
/global_latents/swapping_fraction_dist_params (Array(2,)) ''
/global_latents/target_false_positive_rate (Array()) ''
/matrix (Group) 'Counts after background correction'
/matrix/barcodes (CArray(50500,), zlib(1)) ''
/matrix/data (CArray(44477,), shuffle, zlib(1)) ''
/matrix/indices (CArray(44477,), shuffle, zlib(1)) ''
/matrix/indptr (CArray(50501,), shuffle, zlib(1)) ''
/matrix/shape (CArray(2,), shuffle, zlib(1)) ''
/metadata (Group) 'Metadata'
/metadata/barcodes_analyzed (Array(1000,)) ''
/metadata/barcodes_analyzed_inds (Array(1000,)) ''
/metadata/features_analyzed_inds (Array(100,)) ''
/metadata/fraction_data_used_for_testing (Array()) ''
/metadata/test_elbo (Array(30,)) ''
/metadata/test_epoch (Array(30,)) ''
/metadata/train_elbo (Array(150,)) ''
/metadata/train_epoch (Array(150,)) ''
/matrix/features (Group) 'Genes and other features measured'
/matrix/features/feature_type (CArray(100,), shuffle, zlib(1)) ''
/matrix/features/genome (CArray(100,), shuffle, zlib(1)) ''
/matrix/features/id (CArray(100,), shuffle, zlib(1)) ''
/matrix/features/name (CArray(100,), shuffle, zlib(1)) ''
Any of these bits of data can be accessed directly with PyTables, or by using
python functions from cellbender.remove_background.downstream
such as
anndata_from_h5()
that load the metadata as well as the count matrix into
AnnData format
(compatible with analysis in scanpy).
The names of variables in the h5 file are meant to explain their contents.
All data in the /matrix
group is formatted as CellRanger v3 h5 data.
The other root groups in the h5 file [droplet_latents, global_latents, metadata]
contain detailed information inferred by CellBender as part of its latent
variable model of the data generative process (details in our paper).
The /droplet_latents
are variables that have a value inferred for each
droplet, such as cell_probability
and cell_size
. The /global_latents
are
variables that have a fixed value for the whole experiment, such as
ambient_expression
, which is the inferred profile of ambient RNA in the
sample (normalized so it sums to one). Finally, /metadata
includes other things
such as the learning curve (train_elbo
and train_epoch
) and which
features were analyzed during the run (features_analyzed_inds
, as
integer indices that index /matrix/features
).
Loading and using outputs
As mentioned in the tutorial, output h5 files can
be opened natively by scanpy
(scanpy.read_10x_h5())
and (with some effort) by Seurat
(see here), though
there is extra cellbender
-specific information in the output files that
scanpy
and Seurat
will not load automatically.
Additional loading functionality is distrubuted as part of the cellbender
package and is described below.
Suggested use cases:
Load just the CellBender output:
anndata_from_h5()
Load the raw input together with the CellBender output (this is great for downstream analysis, because it allows you to plot the effect of
cellbender
, and you can always look back at the raw data):load_anndata_from_input_and_output()
Power users who want to run
cellbender
at multiple FPR values and make comparisons:load_anndata_from_input_and_outputs()
Example:
from cellbender.remove_background.downstream import load_anndata_from_input_and_output
adata = load_anndata_from_input_and_output(
input_file='my_raw_10x_file.h5',
output_file='my_cellbender_output_file.h5',
input_layer_key='raw', # this will be the raw data layer
)
adata
AnnData object with n_obs × n_vars = 6983 × 37468
obs: 'background_fraction', 'cell_probability', 'cell_size', 'droplet_efficiency',
'n_raw', 'n_cellbender'
var: 'ambient_expression', 'feature_type', 'genome', 'gene_id', 'cellbender_analyzed',
'n_raw', 'n_cellbender'
uns: 'cell_size_lognormal_std', 'empty_droplet_size_lognormal_loc',
'empty_droplet_size_lognormal_scale', 'swapping_fraction_dist_params',
'target_false_positive_rate', 'features_analyzed_inds',
'fraction_data_used_for_testing', 'test_elbo', 'test_epoch',
'train_elbo', 'train_epoch'
obsm: 'cellbender_embedding'
layers: 'raw', 'cellbender'
Functions for downstream work with outputs of remove-background.
- cellbender.remove_background.downstream.anndata_from_h5(file, analyzed_barcodes_only=True)[source]
Load an output h5 file into an AnnData object for downstream work.
- Parameters:
file (
str
) – The h5 fileanalyzed_barcodes_only (
bool
) – False to load all barcodes, so that the size of the AnnData object will match the size of the input raw count matrix. True to load a limited set of barcodes: only those analyzed by the algorithm. This allows relevant latent variables to be loaded properly into adata.obs and adata.obsm, rather than adata.uns.
- Returns:
- The anndata object, populated with inferred latent variables
and metadata.
- Return type:
anndata.AnnData
- cellbender.remove_background.downstream.dict_from_h5(file)[source]
Read in everything from an h5 file and put into a dictionary.
- Parameters:
file (
str
) – The h5 file- Return type:
Dict
[str
,ndarray
]- Returns:
Dictionary containing all the information from the h5 file
- cellbender.remove_background.downstream.load_anndata_from_input(input_file)[source]
Load an input file into an AnnData object (used in report generation). Equivalent to something like scanpy.read(), but uses cellbender’s io.
- Parameters:
input_file (
str
) – The raw data file- Returns:
The anndata object
- Return type:
adata.AnnData
- cellbender.remove_background.downstream.load_anndata_from_input_and_output(input_file, output_file, analyzed_barcodes_only=True, input_layer_key='cellranger', retain_input_metadata=False, gene_expression_encoding_key='cellbender_embedding', truth_file=None)[source]
Load remove-background output count matrix into an anndata object, together with remove-background metadata and the raw input counts.
- Parameters:
input_file (
str
) – Raw h5 file (or other compatible remove-background input) used as input for remove-background.output_file (
str
) – Output h5 file created by remove-background (can be filtered or not).analyzed_barcodes_only (
bool
) – Argument passed to anndata_from_h5(). False to load all barcodes, so that the size of the AnnData object will match the size of the input raw count matrix. True to load a limited set of barcodes: only those analyzed by the algorithm. This allows relevant latent variables to be loaded properly into adata.obs and adata.obsm, rather than adata.uns.input_layer_key (
str
) – Key of the anndata.layer that is created for the raw input count matrix.retain_input_metadata (
bool
) – In addition to loading the CellBender metadata, which happens automatically, set this to True to retain all the metadata from the raw input file as well.gene_expression_encoding_key (
str
) – The CellBender gene expression embedding will be loaded into adata.obsm[gene_expression_encoding_key]truth_file (
Optional
[str
]) – File containing truth data if this is a simulation
- Returns:
- AnnData object with counts before and after remove-background,
as well as inferred latent variables from remove-background.
- Return type:
anndata.AnnData
- cellbender.remove_background.downstream.load_anndata_from_input_and_outputs(input_file, output_files, analyzed_barcodes_only=True, input_layer_key='cellranger', gene_expression_encoding_key='cellbender_embedding', truth_file=None)[source]
Load remove-background output count matrices into an anndata object, together with remove-background metadata and the raw input counts.
The use case would typically be cellbender runs with multiple output files at different FPRs, which we want to compare.
- Parameters:
input_file (
str
) – Raw h5 file (or other compatible remove-background input) used as input for remove-background.output_files (
Dict
[str
,str
]) – Output h5 files created by remove-background (can be filtered or not) or some other method. Dict whose keys are layer keys and whose values are file names.analyzed_barcodes_only (
bool
) – Argument passed to anndata_from_h5(). False to load all barcodes, so that the size of the AnnData object will match the size of the input raw count matrix. True to load a limited set of barcodes: only those analyzed by the algorithm. This allows relevant latent variables to be loaded properly into adata.obs and adata.obsm, rather than adata.uns.input_layer_key (
str
) – Key of the anndata.layer that is created for the raw input count matrix.gene_expression_encoding_key (
str
) – The CellBender gene expression embedding will be loaded into adata.obsm[gene_expression_encoding_key]truth_file (
Optional
[str
]) – File containing truth data if this is a simulation
- Returns:
- AnnData object with counts before and after remove-background,
as well as inferred latent variables from remove-background.
- Return type:
anndata.AnnData
The posterior
There is an additional output file called posterior.h5
which contains the
full posterior. The structure of this data is a sparse matrix whose shape is
[count matrix entries, number of potential noise counts]. Typically the number
of potential noise counts in any entry of the count matrix is truncated at 50.
The values are log probabilities of the specified number of noise counts in the
given count matrix entry.
The information contained in the posterior can be used to quantitatively answer questions such as “What is the probability that the number of viral gene counts in this cell is nonzero?” For help with these kinds of computations, please open a github issue.