


# CZ CELLxGENE Census

## Overview

The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.

The Census includes:

- **61+ million cells** from human and mouse
- **Standardized metadata** (cell types, tissues, diseases, donors)
- **Raw gene expression** matrices
- **Pre-calculated embeddings** and statistics
- **Integration with PyTorch, scanpy, and other analysis tools**

## When to Use This Skill

This skill should be used when:

- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions

## Installation and Setup

Install the Census API:

```bash
uv pip install cellxgene-census
```

For machine learning workflows, install additional dependencies:

```bash
uv pip install "cellxgene-census[experimental]"
```

## Core Workflow Patterns

### 1. Opening the Census

Always use the context manager to ensure proper resource cleanup:

```python
import cellxgene_census

# Open latest stable version
with cellxgene_census.open_soma() as census:
    ...  # Work with census data

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Work with census data
```

**Key points:**

- Use a context manager (`with` statement) for automatic cleanup
- Specify `census_version` for reproducible analyses
- The default opens the latest "stable" release

### 2. Exploring Census Information

Before querying expression data, explore available datasets and metadata.

**Access summary information:**

```python
# Get summary statistics (one row per statistic, in "label"/"value" columns)
summary = census["census_info"]["summary"].read().concat().to_pandas()
summary_values = summary.set_index("label")["value"]
print(f"Total cells: {summary_values['total_cell_count']}")

# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# Filter datasets by criteria (this table holds titles and collection names,
# not per-cell fields such as disease)
covid_datasets = datasets[datasets["collection_name"].str.contains("COVID", na=False)]
```

**Query cell metadata to understand available data:**

```python
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"],
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")

# Count cells by tissue
tissue_counts = cell_metadata.groupby("tissue_general").size()
```

**Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells unless specifically analyzing duplicates.

### 3. Querying Expression Data (Small to Medium Scale)

For queries returning [...] 100k), use out-of-core processing

### Use tissue_general for Broader Groupings

The `tissue_general` field provides coarser categories than `tissue`, which is useful for cross-tissue analyses:

```python
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"

# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```

### Select Only Needed Columns

Minimize data transfer by specifying only the required metadata columns:

```python
obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns
```

### Check Dataset Presence for Gene-Specific Queries

When analyzing specific genes, verify which datasets measured them:

```python
# Sparse datasets x genes matrix; columns are indexed by the var soma_joinid
presence = cellxgene_census.get_presence_matrix(census, "homo_sapiens", "RNA")

# Look up the genes of interest and keep only their columns
genes = cellxgene_census.get_var(
    census,
    "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
)
gene_presence = presence[:, genes["soma_joinid"].to_numpy()]
```

### Two-Step Workflow: Explore Then Query

First explore metadata to understand the available data, then query expression:

```python
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"],
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```

## Available Metadata Fields

### Cell Metadata (obs)

Key fields for filtering:

- `cell_type`, `cell_type_ontology_term_id`
- `tissue`, `tissue_general`, `tissue_ontology_term_id`
- `disease`, `disease_ontology_term_id`
- `assay`, `assay_ontology_term_id`
- `donor_id`, `sex`, `self_reported_ethnicity`
- `development_stage`, `development_stage_ontology_term_id`
- `dataset_id`
- `is_primary_data` (Boolean: True = unique cell)

### Gene Metadata (var)

- `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
- `feature_name` (Gene symbol, e.g., "FOXP2")
- `feature_length` (Gene length in base pairs)

## Reference Documentation

This skill includes detailed reference documentation:

### references/census_schema.md

Comprehensive documentation of:

- Census data structure and organization
- All available metadata fields
- Value filter syntax and operators
- SOMA object types
- Data inclusion criteria

**When to read:** When you need detailed schema information, the full list of metadata fields, or complex filter syntax.

### references/common_patterns.md

Examples and patterns for:

- Exploratory queries (metadata only)
- Small-to-medium queries (AnnData)
- Large queries (out-of-core processing)
- PyTorch integration
- Scanpy integration workflows
- Multi-dataset integration
- Best practices and common pitfalls

**When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues.

## Common Use Cases

### Use Case 1: Explore Cell Types in a Tissue

```python
with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"],
    )
    print(cells["cell_type"].value_counts())
```

### Use Case 2: Query Marker Gene Expression

```python
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )
```

### Use Case 3: Train Cell Type Classifier

```python
from cellxgene_census.experimental.ml import experiment_dataloader

epochs = 10  # Number of passes over the data

with cellxgene_census.open_soma() as census:
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Train model
    for epoch in range(epochs):
        for batch in dataloader:
            # Training logic
            pass
```

### Use Case 4: Cross-Tissue Analysis

```python
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )

# The Census stores raw counts, so normalize and log-transform first
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Analyze macrophage differences across tissues
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
```

## Troubleshooting

### Query Returns Too Many Cells

- Add more specific filters to reduce scope
- Use `tissue` instead of `tissue_general` for finer granularity
- Filter by specific `dataset_id` if known
- Switch to out-of-core processing for large queries

### Memory Errors

- Reduce query scope with more restrictive filters
- Select fewer genes with `var_value_filter`
- Use out-of-core processing with `axis_query()`
- Process data in batches

### Duplicate Cells in Results

- Always include `is_primary_data == True` in filters
- Check whether you are intentionally querying across multiple datasets

### Gene Not Found

- Verify gene name spelling (case-sensitive)
- Try the Ensembl ID with `feature_id` instead of `feature_name`
- Check the dataset presence matrix to see if the gene was measured
- Some genes may have been filtered during Census construction

### Version Inconsistencies

- Always specify `census_version` explicitly
- Use the same version across all analyses
- Check release notes for version-specific changes
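
### Keeping Filters Consistent

Several of the fixes above come down to writing the filter string carefully: narrowing the scope and never dropping `is_primary_data == True`. One way to make that harder to forget is a small helper that assembles the string. The sketch below is a hypothetical helper (`build_obs_filter` is not part of `cellxgene_census`) and assumes only the filter syntax already used in this document: `==` for a single value, `in [...]` for several values, and clauses joined with `and`.

```python
def build_obs_filter(primary_only=True, **criteria):
    """Assemble a value-filter string from column=value pairs.

    Hypothetical helper, not part of cellxgene_census.
    A single value becomes `col == 'v'`; a list or tuple
    becomes `col in ['a', 'b']`.
    """
    clauses = []
    for column, value in criteria.items():
        if isinstance(value, (list, tuple)):
            # repr() of a list of strings yields ['a', 'b']
            clauses.append(f"{column} in {[str(v) for v in value]!r}")
        else:
            clauses.append(f"{column} == {str(value)!r}")
    if primary_only:
        # Appended by default so duplicate cells are excluded
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


obs_filter = build_obs_filter(
    cell_type="macrophage",
    tissue_general=["lung", "liver", "brain"],
)
print(obs_filter)
```

The resulting string can be passed as `obs_value_filter` to `get_anndata()` or as `value_filter` to `get_obs()`. Pass `primary_only=False` only when duplicates are being analyzed on purpose.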

Source: claude-code-templates (MIT). See About Us for full credits.
