# Common Workflows and Best Practices
This document covers common workflows, best practices, and advanced usage patterns for scvi-tools.
## Standard Analysis Workflow
### 1. Data Loading and Preparation
```python
import scvi
import scanpy as sc
import numpy as np
# Load data (AnnData format required)
adata = sc.read_h5ad("data.h5ad")
# Or load from other formats
# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")
# adata = sc.read_csv("counts.csv")
# Basic QC metrics (flag mitochondrial genes before computing metrics)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
```
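The `MT-` prefix flag generalizes to other gene families (e.g. ribosomal genes); a small pandas-only sketch, with illustrative gene names standing in for `adata.var_names`:

```python
import pandas as pd

# Illustrative gene names; real data uses adata.var_names
var_names = pd.Index(["MT-CO1", "RPS4X", "RPL13", "ACTB", "MT-ND1"])
flags = pd.DataFrame({
    "mt": var_names.str.startswith("MT-"),
    "ribo": var_names.str.startswith("RPS") | var_names.str.startswith("RPL"),
})
print({k: int(v) for k, v in flags.sum().items()})  # {'mt': 2, 'ribo': 2}
```

The resulting boolean columns can be passed to `qc_vars` exactly like `'mt'` above.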
### 2. Quality Control
```python
# Filter cells
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=5000)
# Filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Filter by mitochondrial content
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()  # e.g. keep cells under 20% mito
```
## Hyperparameter Tuning
### Manual Grid Search
```python
best_score = -np.inf
best_params = {}
for n_latent in [10, 20, 30]:
    for n_layers in [1, 2]:
        model = scvi.model.SCVI(adata, n_latent=n_latent, n_layers=n_layers)
        model.train(max_epochs=100, early_stopping=True)
        # elbo_validation is a loss (lower is better), so negate it
        val_elbo = -model.history["elbo_validation"].iloc[-1].item()
        if val_elbo > best_score:
            best_score = val_elbo
            best_params = {"n_latent": n_latent, "n_layers": n_layers}
print(f"Best params: {best_params}")
```
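A generic grid search over named parameter lists can be written with stdlib `itertools.product`; a sketch with a stand-in scoring function in place of a real training run:

```python
from itertools import product

param_grid = {"n_latent": [10, 20, 30], "n_layers": [1, 2]}

def score(n_latent, n_layers):
    # Stand-in for a real training run returning a validation score
    return -(n_latent - 20) ** 2 - n_layers

# Enumerate every combination and keep the best-scoring one
best = max(
    (dict(zip(param_grid, values)) for values in product(*param_grid.values())),
    key=lambda p: score(**p),
)
print(best)  # {'n_latent': 20, 'n_layers': 1}
```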
### Using Optuna for Hyperparameter Optimization
```python
import optuna
def objective(trial):
    n_latent = trial.suggest_int("n_latent", 10, 50)
    n_layers = trial.suggest_int("n_layers", 1, 3)
    n_hidden = trial.suggest_categorical("n_hidden", [64, 128, 256])
    model = scvi.model.SCVI(
        adata,
        n_latent=n_latent,
        n_layers=n_layers,
        n_hidden=n_hidden
    )
    model.train(max_epochs=200, early_stopping=True)
    # elbo_validation is stored as a one-column DataFrame of the validation loss
    return model.history["elbo_validation"].iloc[-1].item()

# The validation ELBO history is a loss, so lower is better
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(f"Best parameters: {study.best_params}")
```
## GPU Acceleration
### Enable GPU Training
```python
# Automatic GPU detection
model = scvi.model.SCVI(adata)
model.train(accelerator="auto") # Uses GPU if available
# Force GPU
model.train(accelerator="gpu")
# Multi-GPU
model.train(accelerator="gpu", devices=2)
# Check if GPU is being used
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
```
### GPU Memory Management
```python
# Reduce batch size if OOM
model.train(batch_size=64) # Instead of default 128
# Mixed precision training (saves memory)
model.train(precision=16)
# Clear cache between runs
import torch
torch.cuda.empty_cache()
```
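When OOM errors are intermittent, a retry loop that halves the batch size is a common pattern; a sketch with a stand-in training function (`fake_train` is illustrative, not a scvi-tools API):

```python
def train_with_backoff(train_fn, batch_size=128, min_batch_size=8):
    """Retry train_fn with a halved batch size on out-of-memory errors."""
    while batch_size >= min_batch_size:
        try:
            return train_fn(batch_size=batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # Unrelated error: re-raise unchanged
            batch_size //= 2
            print(f"OOM; retrying with batch_size={batch_size}")
    raise RuntimeError("Could not fit training in memory")

# Stand-in training function that only 'fits' at batch_size <= 32
def fake_train(batch_size):
    if batch_size > 32:
        raise RuntimeError("CUDA out of memory")
    return batch_size

print(train_with_backoff(fake_train))  # → 32
```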
## Batch Integration Strategies
### Strategy 1: Simple Batch Key
```python
# For standard batch correction
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata)
```
### Strategy 2: Multiple Covariates
```python
# Correct for multiple technical factors
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="sequencing_batch",
    categorical_covariate_keys=["donor", "tissue"],
    continuous_covariate_keys=["percent_mito"]
)
```
### Strategy 3: Hierarchical Batches
```python
# When batches have hierarchical structure
# E.g., samples within studies
adata.obs["batch_hierarchy"] = (
    adata.obs["study"].astype(str) + "_" +
    adata.obs["sample"].astype(str)
)
scvi.model.SCVI.setup_anndata(adata, batch_key="batch_hierarchy")
```
## Reference Mapping (scArches)
### Training Reference Model
```python
# Train on reference dataset
scvi.model.SCVI.setup_anndata(ref_adata, batch_key="batch")
ref_model = scvi.model.SCVI(ref_adata)
ref_model.train()
# Save reference
ref_model.save("reference_model")
```
### Mapping Query to Reference
```python
# Load reference
ref_model = scvi.model.SCVI.load("reference_model", adata=ref_adata)
# Setup query with same parameters
scvi.model.SCVI.setup_anndata(query_adata, batch_key="batch")
# Transfer learning
query_model = scvi.model.SCVI.load_query_data(
    query_adata,
    "reference_model"
)
# Fine-tune on query (optional)
query_model.train(max_epochs=200)
# Get query embeddings
query_latent = query_model.get_latent_representation()
# Transfer labels using KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(ref_model.get_latent_representation(), ref_adata.obs["cell_type"])
query_adata.obs["predicted_cell_type"] = knn.predict(query_latent)
```
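To avoid propagating uncertain labels, the KNN vote fraction can be thresholded; a self-contained sketch on toy latent spaces (the 0.7 cutoff and all array names are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for reference/query latent spaces and reference labels
ref_latent = rng.normal(size=(200, 10))
ref_labels = np.where(ref_latent[:, 0] > 0, "B cell", "T cell")
query_latent = rng.normal(size=(50, 10))

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(ref_latent, ref_labels)
proba = knn.predict_proba(query_latent)
pred = knn.classes_[proba.argmax(axis=1)]
# Mark low-confidence calls (vote fraction below 0.7) as "Unknown"
pred = np.where(proba.max(axis=1) >= 0.7, pred, "Unknown")
```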
## Model Minification
Reduce model size for faster inference:
```python
# Train full model
model = scvi.model.SCVI(adata)
model.train()
# Minify the model's AnnData in place (keeps latent posterior parameters only)
model.minify_adata()
# Save the model together with its now-small AnnData
model.save("minified_model", save_anndata=True)
# Load and use (much faster)
mini_model = scvi.model.SCVI.load("minified_model")
```
## Memory-Efficient Data Loading
### Using AnnDataLoader
```python
from scvi.dataloaders import AnnDataLoader
# For very large datasets
dataloader = AnnDataLoader(
    adata,
    batch_size=128,
    shuffle=True,
    drop_last=False
)
# Custom training loop (advanced)
for batch in dataloader:
    # Process batch (a dict of tensors keyed by registered fields)
    pass
```
### Using Backed AnnData
```python
# For data too large for memory
adata = sc.read_h5ad("huge_dataset.h5ad", backed='r')
# scvi-tools works with backed mode
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata)
model.train()
```
## Model Interpretation
### Feature Importance with SHAP
```python
import shap
# Get SHAP values for interpretability
# Note: scvi modules take dict inputs, so a thin wrapper around
# model.module that accepts a plain tensor may be required
explainer = shap.DeepExplainer(model.module, background_data)
shap_values = explainer.shap_values(test_data)
# Visualize
shap.summary_plot(shap_values, feature_names=adata.var_names)
```
### Gene Correlation Analysis
```python
# Get gene-gene correlation matrix
correlation = model.get_feature_correlation_matrix(
    adata,
    transform_batch="batch1"
)
# Visualize top correlated genes (result is a genes-by-genes DataFrame)
import seaborn as sns
sns.heatmap(correlation.iloc[:50, :50], cmap="coolwarm")
```
## Troubleshooting Common Issues
### Issue: NaN Loss During Training
**Causes**:
- Learning rate too high
- Unnormalized input (must use raw counts)
- Data quality issues
**Solutions**:
```python
# Reduce learning rate (passed through the training plan)
model.train(plan_kwargs={"lr": 1e-4})
# Check data (handle sparse or dense X)
import scipy.sparse as sp
X = adata.X.data if sp.issparse(adata.X) else adata.X
assert X.min() >= 0  # No negative values
assert not np.isnan(X).any()  # No NaNs
# Use a more stable likelihood
model = scvi.model.SCVI(adata, gene_likelihood="nb")
```
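A quick sparse-aware check that the input really is raw counts (non-negative whole numbers), as a sketch using only numpy/scipy (the function name is illustrative):

```python
import numpy as np
from scipy import sparse

def looks_like_raw_counts(X):
    """Heuristic check: non-negative values that are all whole numbers."""
    data = X.data if sparse.issparse(X) else np.asarray(X).ravel()
    return bool(data.min() >= 0 and np.allclose(data, np.round(data)))

# Synthetic count matrix: Poisson-distributed positive integers
counts = sparse.random(100, 50, density=0.1,
                       data_rvs=lambda n: np.random.poisson(5.0, n) + 1.0).tocsr()
print(looks_like_raw_counts(counts))                      # True
print(looks_like_raw_counts(np.log1p(counts.toarray())))  # False
```

Running this on `adata.X` before `setup_anndata` catches accidentally log-normalized input early.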
### Issue: Poor Batch Correction
**Solutions**:
```python
# Increase batch effect modeling
model = scvi.model.SCVI(
    adata,
    encode_covariates=True,  # Encode batch in encoder
    deeply_inject_covariates=False
)
# Or try opposite
model = scvi.model.SCVI(adata, deeply_inject_covariates=True)
# Use more latent dimensions
model = scvi.model.SCVI(adata, n_latent=50)
```
### Issue: Model Not Training (ELBO Not Decreasing)
**Solutions**:
```python
# Increase learning rate (passed through the training plan)
model.train(plan_kwargs={"lr": 5e-3})
# Increase network capacity
model = scvi.model.SCVI(adata, n_hidden=256, n_layers=2)
# Train longer
model.train(max_epochs=500)
```
### Issue: Out of Memory (OOM)
**Solutions**:
```python
# Reduce batch size
model.train(batch_size=64)
# Use mixed precision
model.train(precision=16)
# Reduce model size
model = scvi.model.SCVI(adata, n_latent=10, n_hidden=64)
# Use backed AnnData
adata = sc.read_h5ad("data.h5ad", backed='r')
```
## Performance Benchmarking
```python
import time
# Time training
start = time.time()
model.train(max_epochs=400)
training_time = time.time() - start
print(f"Training time: {training_time:.2f}s")
# Time inference
start = time.time()
latent = model.get_latent_representation()
inference_time = time.time() - start
print(f"Inference time: {inference_time:.2f}s")
# Memory usage
import psutil
import os
process = psutil.Process(os.getpid())
memory_gb = process.memory_info().rss / 1024**3
print(f"Memory usage: {memory_gb:.2f} GB")
```
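The ad-hoc timing above can be wrapped in a small stdlib helper:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Usage with any workload, e.g. model.train() or a plain computation
with timed("Workload"):
    sum(i * i for i in range(100_000))
```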
## Best Practices Summary
1. **Always use raw counts**: Never log-normalize before scvi-tools
2. **Feature selection**: Use highly variable genes for efficiency
3. **Batch correction**: Register all known technical covariates
4. **Early stopping**: Use validation set to prevent overfitting
5. **Model saving**: Always save trained models
6. **GPU usage**: Use GPU for large datasets (>10k cells)
7. **Hyperparameter tuning**: Start with defaults, tune if needed
8. **Validation**: Check batch correction visually (UMAP colored by batch)
9. **Documentation**: Keep track of preprocessing steps
10. **Reproducibility**: Set random seeds (`scvi.settings.seed = 0`)
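Point 8 can also be checked quantitatively; a minimal entropy-of-batch-mixing sketch (numpy/scikit-learn only; the function name and `k` are illustrative choices, and the toy arrays stand in for a latent space and batch labels):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_mixing_entropy(latent, batches, k=30):
    """Mean per-cell entropy of batch labels among k nearest neighbors.

    Values near log(n_batches) indicate well-mixed batches.
    """
    _, idx = NearestNeighbors(n_neighbors=k).fit(latent).kneighbors(latent)
    batches = np.asarray(batches)
    uniq = np.unique(batches)
    entropies = []
    for neighbors in idx:
        p = np.array([(batches[neighbors] == b).mean() for b in uniq])
        p = p[p > 0]
        entropies.append(-(p * np.log(p)).sum())
    return float(np.mean(entropies))

# Toy example: randomly assigned batches are well mixed,
# so the entropy lands close to the log(2) upper bound
rng = np.random.default_rng(0)
mixing = batch_mixing_entropy(rng.normal(size=(300, 10)),
                              rng.integers(0, 2, size=300))
print(f"{mixing:.2f}")
```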
Source: claude-code-templates (MIT).