Reference

remove-background

Command Line Options

Remove background RNA from count matrix.

usage: cellbender remove-background [-h] --input INPUT_FILE --output
                                    OUTPUT_FILE [--cuda]
                                    [--checkpoint INPUT_CHECKPOINT_TARBALL]
                                    [--force-use-checkpoint]
                                    [--expected-cells EXPECTED_CELL_COUNT]
                                    [--total-droplets-included TOTAL_DROPLETS]
                                    [--force-cell-umi-prior FORCE_CELL_UMI_PRIOR]
                                    [--force-empty-umi-prior FORCE_EMPTY_UMI_PRIOR]
                                    [--model {naive,simple,ambient,swapping,full}]
                                    [--epochs EPOCHS]
                                    [--low-count-threshold LOW_COUNT_THRESHOLD]
                                    [--z-dim Z_DIM]
                                    [--z-layers Z_HIDDEN_DIMS [Z_HIDDEN_DIMS ...]]
                                    [--training-fraction TRAINING_FRACTION]
                                    [--empty-drop-training-fraction FRACTION_EMPTIES]
                                    [--ignore-features BLACKLISTED_GENES [BLACKLISTED_GENES ...]]
                                    [--fpr FPR [FPR ...]]
                                    [--exclude-feature-types EXCLUDE_FEATURES [EXCLUDE_FEATURES ...]]
                                    [--projected-ambient-count-threshold AMBIENT_COUNTS_IN_CELLS_LOW_LIMIT]
                                    [--learning-rate LEARNING_RATE]
                                    [--checkpoint-mins CHECKPOINT_MIN]
                                    [--final-elbo-fail-fraction FINAL_ELBO_FAIL_FRACTION]
                                    [--epoch-elbo-fail-fraction EPOCH_ELBO_FAIL_FRACTION]
                                    [--num-training-tries NUM_TRAINING_TRIES]
                                    [--learning-rate-retry-mult LEARNING_RATE_RETRY_MULT]
                                    [--posterior-batch-size POSTERIOR_BATCH_SIZE]
                                    [--posterior-regularization {PRq,PRmu,PRmu_gene}]
                                    [--alpha PRQ_ALPHA] [--q CDF_THRESHOLD_Q]
                                    [--estimator {map,mean,cdf,sample,mckp}]
                                    [--estimator-multiple-cpu]
                                    [--constant-learning-rate]
                                    [--cpu-threads N_THREADS] [--debug]
                                    [--truth TRUTH_FILE]

Named Arguments

--input

Data file on which to run tool. Data must be un-filtered: it should include empty droplets. The following input formats are supported: CellRanger v2 and v3 (.h5 or the directory that contains the .mtx file), Dropseq DGE (.txt or .txt.gz), BD Rhapsody (.csv or .csv.gz), AnnData (.h5ad), and Loom (.loom).

--output

Output file location (the path must exist, and the file name must have .h5 extension).

--cuda

Including the flag –cuda will run the inference on a GPU.

Default: False

--checkpoint

Checkpoint tarball produced by v0.3.0+ of CellBender remove-background. If present, and the workflow hashes match, training will restart from this checkpoint.

Default: “ckpt.tar.gz”

--force-use-checkpoint

Normally, checkpoints can only be used if the CellBender code and certain input args match exactly. This flag allows you to bypass this requirement. An example use would be to create a new output using a checkpoint from a run of v0.3.1, a redacted version with faulty output count matrices. If you use this flag, ensure that the input file and the checkpoint match, because CellBender will not check.

Default: False

--expected-cells

Number of cells expected in the dataset (a rough estimate within a factor of 2 is sufficient).

--total-droplets-included

The number of droplets from the rank-ordered UMI plot that will have their cell probabilities inferred as an output. Include the droplets which might contain cells. Droplets beyond TOTAL_DROPLETS_INCLUDED should be ‘surely empty’ droplets.

--force-cell-umi-prior

Ignore CellBender’s heuristic prior estimation, and use this prior for UMI counts in cells.

--force-empty-umi-prior

Ignore CellBender’s heuristic prior estimation, and use this prior for UMI counts in empty droplets.

--model

Possible choices: naive, simple, ambient, swapping, full

Which model is being used for count data. ‘naive’ subtracts the estimated ambient profile. ‘simple’ does not model either ambient RNA or random barcode swapping (for debugging purposes – not recommended). ‘ambient’ assumes background RNA is incorporated into droplets. ‘swapping’ assumes background RNA comes from random barcode swapping (via PCR chimeras). ‘full’ uses a combined ambient and swapping model.

Default: “full”

--epochs

Number of epochs to train.

Default: 150

--low-count-threshold

Droplets with UMI counts below this number are completely excluded from the analysis. This can help identify the correct prior for empty droplet counts in the rare case where empty counts are extremely high (over 200).

Default: 5

--z-dim

Dimension of latent variable z.

Default: 64

--z-layers

Dimension of hidden layers in the encoder for z.

Default: [512]

--training-fraction

Training detail: the fraction of the data used for training. The rest is never seen by the inference algorithm. Speeds up learning.

Default: 0.9

--empty-drop-training-fraction

Training detail: the fraction of the training data each epoch that is drawn (randomly sampled) from surely empty droplets.

Default: 0.2

--ignore-features

Integer indices of features to ignore entirely. In the output count matrix, the counts for these features will be unchanged.

Default: []

--fpr

Target ‘delta’ false positive rate in [0, 1). Use 0 for a cohort of samples which will be jointly analyzed for differential expression. A false positive is a true signal count that is erroneously removed. More background removal is accompanied by more signal removal at high values of FPR. You can specify multiple values, which will create multiple output files.

Default: [0.01]

--exclude-feature-types

Feature types to ignore during the analysis. These features will be left unchanged in the output file.

Default: []

--projected-ambient-count-threshold

Controls how many features are included in the analysis, which can lead to a large speedup. If a feature is expected to have less than PROJECTED_AMBIENT_COUNT_THRESHOLD counts total in all cells (summed), then that gene is excluded, and it will be unchanged in the output count matrix. For example, PROJECTED_AMBIENT_COUNT_THRESHOLD = 0 will include all features which have even a single count in any empty droplet.

Default: 0.1

--learning-rate

Training detail: lower learning rate for inference. A OneCycle learning rate schedule is used, where the upper learning rate is ten times this value. (For this value, probably do not exceed 1e-3).

Default: 0.0001

--checkpoint-mins

Checkpoint file will be saved periodically, with this many minutes between each checkpoint.

Default: 7.0

--final-elbo-fail-fraction

Training is considered to have failed if (best_test_ELBO - final_test_ELBO)/(best_test_ELBO - initial_test_ELBO) > FINAL_ELBO_FAIL_FRACTION. Training will automatically re-run if –num-training-tries > 1. By default, will not fail training based on final_training_ELBO.

--epoch-elbo-fail-fraction

Training is considered to have failed if (previous_epoch_test_ELBO - current_epoch_test_ELBO)/(previous_epoch_test_ELBO - initial_train_ELBO) > EPOCH_ELBO_FAIL_FRACTION. Training will automatically re-run if –num-training-tries > 1. By default, will not fail training based on epoch_training_ELBO.

--num-training-tries

Number of times to attempt to train the model. At each subsequent attempt, the learning rate is multiplied by LEARNING_RATE_RETRY_MULT.

Default: 1

--learning-rate-retry-mult

Learning rate is multiplied by this amount each time a new training attempt is made. (This parameter is only used if training fails based on EPOCH_ELBO_FAIL_FRACTION or FINAL_ELBO_FAIL_FRACTION and NUM_TRAINING_TRIES is > 1.)

Default: 0.2

--posterior-batch-size

Training detail: size of batches when creating the posterior. Reduce this to avoid running out of GPU memory creating the posterior (will be slower).

Default: 128

--posterior-regularization

Possible choices: PRq, PRmu, PRmu_gene

Posterior regularization method. (For experts: not required for normal usage, see documentation). PRq is approximate quantile-targeting. PRmu is approximate mean-targeting aggregated over genes (behavior of v0.2.0). PRmu_gene is approximate mean-targeting per gene.

--alpha

Tunable parameter alpha for the PRq posterior regularization method (not normally used: see documentation).

--q

Tunable parameter q for the CDF threshold estimation method (not normally used: see documentation).

--estimator

Possible choices: map, mean, cdf, sample, mckp

Output denoised count estimation method. (For experts: not required for normal usage, see documentation).

Default: “mckp”

--estimator-multiple-cpu

Including the flag –estimator-multiple-cpu will use more than one CPU to compute the MCKP output count estimator in parallel (does nothing for other estimators).

Default: False

--constant-learning-rate

Including the flag –constant-learning-rate will use the ClippedAdam optimizer instead of the OneCycleLR learning rate schedule, which is the default. Learning is faster with the OneCycleLR schedule. However, training can easily be continued from a checkpoint for more epochs than the initial command specified when using ClippedAdam. On the other hand, if using the OneCycleLR schedule with 150 epochs specified, it is not possible to pick up from that final checkpoint and continue training until 250 epochs.

Default: False

--cpu-threads

Number of threads to use when pytorch is run on CPU. Defaults to the number of logical cores.

--debug

Including the flag –debug will log extra messages useful for debugging.

Default: False

--truth

This is only used by developers for report generation. Truth h5 file (for simulated data only).

WDL Workflow Options

A WDL script is available as cellbender_remove_background.wdl, and it can be used to run cellbender remove-background from Terra or from your own Cromwell instance. The WDL is designed to make use of an Nvidia Tesla T4 GPU on Google Cloud architecture.

As of v0.3.0, the WDL uses preemptible instances that are a fraction of the cost, and uses automatic restarting from checkpoints so that work is not lost.

In addition to the above command line options, the workflow has its own set of input options, which are described in detail here.

Output h5 file format

An h5 output file (this one is from the tutorial) can be examined in detail using PyTables in python:

# import PyTables
import tables

# open the file and take a look at its contents
with tables.open_file('tiny_output.h5', 'r') as f:
    print(f)
/home/jupyter/tiny_output.h5 (File) 'CellBender remove-background output'
Last modif.: 'Wed May  4 19:46:16 2022'
Object Tree:
/ (RootGroup) 'CellBender remove-background output'
/droplet_latents (Group) 'Latent variables per droplet'
/droplet_latents/background_fraction (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/barcode_indices_for_latents (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/cell_probability (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/cell_size (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/droplet_efficiency (CArray(1000,), shuffle, zlib(1)) ''
/droplet_latents/gene_expression_encoding (CArray(1000, 100), shuffle, zlib(1)) ''
/global_latents (Group) 'Global latent variables'
/global_latents/ambient_expression (Array(100,)) ''
/global_latents/cell_size_lognormal_std (Array()) ''
/global_latents/empty_droplet_size_lognormal_loc (Array()) ''
/global_latents/empty_droplet_size_lognormal_scale (Array()) ''
/global_latents/posterior_regularization_lambda (Array()) ''
/global_latents/swapping_fraction_dist_params (Array(2,)) ''
/global_latents/target_false_positive_rate (Array()) ''
/matrix (Group) 'Counts after background correction'
/matrix/barcodes (CArray(50500,), zlib(1)) ''
/matrix/data (CArray(44477,), shuffle, zlib(1)) ''
/matrix/indices (CArray(44477,), shuffle, zlib(1)) ''
/matrix/indptr (CArray(50501,), shuffle, zlib(1)) ''
/matrix/shape (CArray(2,), shuffle, zlib(1)) ''
/metadata (Group) 'Metadata'
/metadata/barcodes_analyzed (Array(1000,)) ''
/metadata/barcodes_analyzed_inds (Array(1000,)) ''
/metadata/features_analyzed_inds (Array(100,)) ''
/metadata/fraction_data_used_for_testing (Array()) ''
/metadata/test_elbo (Array(30,)) ''
/metadata/test_epoch (Array(30,)) ''
/metadata/train_elbo (Array(150,)) ''
/metadata/train_epoch (Array(150,)) ''
/matrix/features (Group) 'Genes and other features measured'
/matrix/features/feature_type (CArray(100,), shuffle, zlib(1)) ''
/matrix/features/genome (CArray(100,), shuffle, zlib(1)) ''
/matrix/features/id (CArray(100,), shuffle, zlib(1)) ''
/matrix/features/name (CArray(100,), shuffle, zlib(1)) ''

Any of these bits of data can be accessed directly with PyTables, or by using python functions from cellbender.remove_background.downstream such as anndata_from_h5() that load the metadata as well as the count matrix into AnnData format (compatible with analysis in scanpy).

The names of variables in the h5 file are meant to explain their contents. All data in the /matrix group is formatted as CellRanger v3 h5 data.

The other root groups in the h5 file [droplet_latents, global_latents, metadata] contain detailed information inferred by CellBender as part of its latent variable model of the data generative process (details in our paper). The /droplet_latents are variables that have a value inferred for each droplet, such as cell_probability and cell_size. The /global_latents are variables that have a fixed value for the whole experiment, such as ambient_expression, which is the inferred profile of ambient RNA in the sample (normalized so it sums to one). Finally, /metadata includes other things such as the learning curve (train_elbo and train_epoch) and which features were analyzed during the run (features_analyzed_inds, as integer indices that index /matrix/features).

Loading and using outputs

As mentioned in the tutorial, output h5 files can be opened natively by scanpy (scanpy.read_10x_h5()) and (with some effort) by Seurat (see here), though there is extra cellbender-specific information in the output files that scanpy and Seurat will not load automatically.

Additional loading functionality is distrubuted as part of the cellbender package and is described below.

Suggested use cases:

  • Load just the CellBender output: anndata_from_h5()

  • Load the raw input together with the CellBender output (this is great for downstream analysis, because it allows you to plot the effect of cellbender, and you can always look back at the raw data): load_anndata_from_input_and_output()

  • Power users who want to run cellbender at multiple FPR values and make comparisons: load_anndata_from_input_and_outputs()

Example:

from cellbender.remove_background.downstream import load_anndata_from_input_and_output

adata = load_anndata_from_input_and_output(
    input_file='my_raw_10x_file.h5',
    output_file='my_cellbender_output_file.h5',
    input_layer_key='raw',  # this will be the raw data layer
)
adata
AnnData object with n_obs × n_vars = 6983 × 37468
obs: 'background_fraction', 'cell_probability', 'cell_size', 'droplet_efficiency',
    'n_raw', 'n_cellbender'
var: 'ambient_expression', 'feature_type', 'genome', 'gene_id', 'cellbender_analyzed',
    'n_raw', 'n_cellbender'
uns: 'cell_size_lognormal_std', 'empty_droplet_size_lognormal_loc',
    'empty_droplet_size_lognormal_scale', 'swapping_fraction_dist_params',
    'target_false_positive_rate', 'features_analyzed_inds',
    'fraction_data_used_for_testing', 'test_elbo', 'test_epoch',
    'train_elbo', 'train_epoch'
obsm: 'cellbender_embedding'
layers: 'raw', 'cellbender'

Functions for downstream work with outputs of remove-background.

cellbender.remove_background.downstream.anndata_from_h5(file, analyzed_barcodes_only=True)[source]

Load an output h5 file into an AnnData object for downstream work.

Parameters:
  • file (str) – The h5 file

  • analyzed_barcodes_only (bool) – False to load all barcodes, so that the size of the AnnData object will match the size of the input raw count matrix. True to load a limited set of barcodes: only those analyzed by the algorithm. This allows relevant latent variables to be loaded properly into adata.obs and adata.obsm, rather than adata.uns.

Returns:

The anndata object, populated with inferred latent variables

and metadata.

Return type:

anndata.AnnData

cellbender.remove_background.downstream.dict_from_h5(file)[source]

Read in everything from an h5 file and put into a dictionary.

Parameters:

file (str) – The h5 file

Return type:

Dict[str, ndarray]

Returns:

Dictionary containing all the information from the h5 file

cellbender.remove_background.downstream.load_anndata_from_input(input_file)[source]

Load an input file into an AnnData object (used in report generation). Equivalent to something like scanpy.read(), but uses cellbender’s io.

Parameters:

input_file (str) – The raw data file

Returns:

The anndata object

Return type:

adata.AnnData

cellbender.remove_background.downstream.load_anndata_from_input_and_output(input_file, output_file, analyzed_barcodes_only=True, input_layer_key='cellranger', retain_input_metadata=False, gene_expression_encoding_key='cellbender_embedding', truth_file=None)[source]

Load remove-background output count matrix into an anndata object, together with remove-background metadata and the raw input counts.

Parameters:
  • input_file (str) – Raw h5 file (or other compatible remove-background input) used as input for remove-background.

  • output_file (str) – Output h5 file created by remove-background (can be filtered or not).

  • analyzed_barcodes_only (bool) – Argument passed to anndata_from_h5(). False to load all barcodes, so that the size of the AnnData object will match the size of the input raw count matrix. True to load a limited set of barcodes: only those analyzed by the algorithm. This allows relevant latent variables to be loaded properly into adata.obs and adata.obsm, rather than adata.uns.

  • input_layer_key (str) – Key of the anndata.layer that is created for the raw input count matrix.

  • retain_input_metadata (bool) – In addition to loading the CellBender metadata, which happens automatically, set this to True to retain all the metadata from the raw input file as well.

  • gene_expression_encoding_key (str) – The CellBender gene expression embedding will be loaded into adata.obsm[gene_expression_encoding_key]

  • truth_file (Optional[str]) – File containing truth data if this is a simulation

Returns:

AnnData object with counts before and after remove-background,

as well as inferred latent variables from remove-background.

Return type:

anndata.AnnData

cellbender.remove_background.downstream.load_anndata_from_input_and_outputs(input_file, output_files, analyzed_barcodes_only=True, input_layer_key='cellranger', gene_expression_encoding_key='cellbender_embedding', truth_file=None)[source]

Load remove-background output count matrices into an anndata object, together with remove-background metadata and the raw input counts.

The use case would typically be cellbender runs with multiple output files at different FPRs, which we want to compare.

Parameters:
  • input_file (str) – Raw h5 file (or other compatible remove-background input) used as input for remove-background.

  • output_files (Dict[str, str]) – Output h5 files created by remove-background (can be filtered or not) or some other method. Dict whose keys are layer keys and whose values are file names.

  • analyzed_barcodes_only (bool) – Argument passed to anndata_from_h5(). False to load all barcodes, so that the size of the AnnData object will match the size of the input raw count matrix. True to load a limited set of barcodes: only those analyzed by the algorithm. This allows relevant latent variables to be loaded properly into adata.obs and adata.obsm, rather than adata.uns.

  • input_layer_key (str) – Key of the anndata.layer that is created for the raw input count matrix.

  • gene_expression_encoding_key (str) – The CellBender gene expression embedding will be loaded into adata.obsm[gene_expression_encoding_key]

  • truth_file (Optional[str]) – File containing truth data if this is a simulation

Returns:

AnnData object with counts before and after remove-background,

as well as inferred latent variables from remove-background.

Return type:

anndata.AnnData

The posterior

There is an additional output file called posterior.h5 which contains the full posterior. The structure of this data is a sparse matrix whose shape is [count matrix entries, number of potential noise counts]. Typically the number of potential noise counts in any entry of the count matrix is truncated at 50. The values are log probabilities of the specified number of noise counts in the given count matrix entry.

The information contained in the posterior can be used to quantitatively answer questions such as “What is the probability that the number of viral gene counts in this cell is nonzero?” For help with these kinds of computations, please open a github issue.