[ PROMPT_NODE_26557 ]

Exploratory Data Analysis

[ SKILL_DOCUMENTATION ]

# Exploratory Data Analysis ## Overview Perform comprehensive exploratory data analysis (EDA) on scientific data files across multiple domains. This skill provides automated file type detection, format-specific analysis, data quality assessment, and generates detailed markdown reports suitable for documentation and downstream analysis planning. **Key Capabilities:** - Automatic detection and analysis of 200+ scientific file formats - Comprehensive format-specific metadata extraction - Data quality and integrity assessment - Statistical summaries and distributions - Visualization recommendations - Downstream analysis suggestions - Markdown report generation ## When to Use This Skill Use this skill when: - User provides a path to a scientific data file for analysis - User asks to "explore", "analyze", or "summarize" a data file - User wants to understand the structure and content of scientific data - User needs a comprehensive report of a dataset before analysis - User wants to assess data quality or completeness - User asks what type of analysis is appropriate for a file ## Supported File Categories The skill has comprehensive coverage of scientific file formats organized into six major categories: ### 1. Chemistry and Molecular Formats (60+ extensions) Structure files, computational chemistry outputs, molecular dynamics trajectories, and chemical databases. **File types include:** `.pdb`, `.cif`, `.mol`, `.mol2`, `.sdf`, `.xyz`, `.smi`, `.gro`, `.log`, `.fchk`, `.cube`, `.dcd`, `.xtc`, `.trr`, `.prmtop`, `.psf`, and more. **Reference file:** `references/chemistry_molecular_formats.md` ### 2. Bioinformatics and Genomics Formats (50+ extensions) Sequence data, alignments, annotations, variants, and expression data. **File types include:** `.fasta`, `.fastq`, `.sam`, `.bam`, `.vcf`, `.bed`, `.gff`, `.gtf`, `.bigwig`, `.h5ad`, `.loom`, `.counts`, `.mtx`, and more. **Reference file:** `references/bioinformatics_genomics_formats.md` ### 3. Microscopy and Imaging Formats (45+ extensions) Microscopy images, medical imaging, whole slide imaging, and electron microscopy. **File types include:** `.tif`, `.nd2`, `.lif`, `.czi`, `.ims`, `.dcm`, `.nii`, `.mrc`, `.dm3`, `.vsi`, `.svs`, `.ome.tiff`, and more. **Reference file:** `references/microscopy_imaging_formats.md` ### 4. Spectroscopy and Analytical Chemistry Formats (35+ extensions) NMR, mass spectrometry, IR/Raman, UV-Vis, X-ray, chromatography, and other analytical techniques. **File types include:** `.fid`, `.mzML`, `.mzXML`, `.raw`, `.mgf`, `.spc`, `.jdx`, `.xy`, `.cif` (crystallography), `.wdf`, and more. **Reference file:** `references/spectroscopy_analytical_formats.md` ### 5. Proteomics and Metabolomics Formats (30+ extensions) Mass spec proteomics, metabolomics, lipidomics, and multi-omics data. **File types include:** `.mzML`, `.pepXML`, `.protXML`, `.mzid`, `.mzTab`, `.sky`, `.mgf`, `.msp`, `.h5ad`, and more. **Reference file:** `references/proteomics_metabolomics_formats.md` ### 6. General Scientific Data Formats (30+ extensions) Arrays, tables, hierarchical data, compressed archives, and common scientific formats. **File types include:** `.npy`, `.npz`, `.csv`, `.xlsx`, `.json`, `.hdf5`, `.zarr`, `.parquet`, `.mat`, `.fits`, `.nc`, `.xml`, and more. **Reference file:** `references/general_scientific_formats.md` ## Workflow ### Step 1: File Type Detection When a user provides a file path, first identify the file type: 1. Extract the file extension 2. Look up the extension in the appropriate reference file 3. Identify the file category and format description 4. Load format-specific information **Example:** ``` User: "Analyze data.fastq" → Extension: .fastq → Category: bioinformatics_genomics → Format: FASTQ Format (sequence data with quality scores) → Reference: references/bioinformatics_genomics_formats.md ``` ### Step 2: Load Format-Specific Information Based on the file type, read the corresponding reference file to understand: - **Typical Data:** What kind of data this format contains - **Use Cases:** Common applications for this format - **Python Libraries:** How to read the file in Python - **EDA Approach:** What analyses are appropriate for this data type Search the reference file for the specific extension (e.g., search for "### .fastq" in `bioinformatics_genomics_formats.md`). ### Step 3: Perform Data Analysis Use the `scripts/eda_analyzer.py` script OR implement custom analysis: **Option A: Use the analyzer script** ```python # The script automatically: # 1. Detects file type # 2. Loads reference information # 3. Performs format-specific analysis # 4. Generates markdown report python scripts/eda_analyzer.py [output.md] ``` **Option B: Custom analysis in the conversation** Based on the format information from the reference file, perform appropriate analysis: For tabular data (CSV, TSV, Excel): - Load with pandas - Check dimensions, data types - Analyze missing values - Calculate summary statistics - Identify outliers - Check for duplicates For sequence data (FASTA, FASTQ): - Count sequences - Analyze length distributions - Calculate GC content - Assess quality scores (FASTQ) For images (TIFF, ND2, CZI): - Check dimensions (X, Y, Z, C, T) - Analyze bit depth and value range - Extract metadata (channels, timestamps, spatial calibration) - Calculate intensity statistics For arrays (NPY, HDF5): - Check shape and dimensions - Analyze data type - Calculate statistical summaries - Check for missing/invalid values ### Step 4: Generate Comprehensive Report Create a markdown report with the following sections: #### Required Sections: 1. **Title and Metadata** - Filename and timestamp - File size and location 2. **Basic Information** - File properties - Format identification 3. **File Type Details** - Format description from reference - Typical data content - Common use cases - Python libraries for reading 4. **Data Analysis** - Structure and dimensions - Statistical summaries - Quality assessment - Data characteristics 5. **Key Findings** - Notable patterns - Potential issues - Quality metrics 6. **Recommendations** - Preprocessing steps - Appropriate analyses - Tools and methods - Visualization approaches #### Template Location Use `assets/report_template.md` as a guide for report structure. ### Step 5: Save Report Save the markdown report with a descriptive filename: - Pattern: `{original_filename}_eda_report.md` - Example: `experiment_data.fastq` → `experiment_data_eda_report.md` ## Detailed Format References Each reference file contains comprehensive information for dozens of file types. To find information about a specific format: 1. Identify the category from the extension 2. Read the appropriate reference file 3. Search for the section heading matching the extension (e.g., "### .pdb") 4. Extract the format information ### Reference File Structure Each format entry includes: - **Description:** What the format is - **Typical Data:** What it contains - **Use Cases:** Common applications - **Python Libraries:** How to read it (with code examples) - **EDA Approach:** Specific analyses to perform **Example lookup:** ```markdown ### .pdb - Protein Data Bank **Description:** Standard format for 3D structures of biological macromolecules **Typical Data:** Atomic coordinates, residue information, secondary structure **Use Cases:** Protein structure analysis, molecular visualization, docking **Python Libraries:** - `Biopython`: `Bio.PDB` - `MDAnalysis`: `MDAnalysis.Universe('file.pdb')` **EDA Approach:** - Structure validation (bond lengths, angles) - B-factor distribution - Missing residues detection - Ramachandran plots ``` ## Best Practices ### Reading Reference Files Reference files are large (10,000+ words each). To efficiently use them: 1. **Search by extension:** Use grep to find the specific format ```python import re with open('references/chemistry_molecular_formats.md', 'r') as f: content = f.read() pattern = r'### .pdb[^#]*?(?=###|Z)' match = re.search(pattern, content, re.IGNORECASE | re.DOTALL) ``` 2. **Extract relevant sections:** Don't load entire reference files into context unnecessarily 3. **Cache format info:** If analyzing multiple files of the same type, reuse the format information ### Data Analysis 1. **Sample large files:** For files with millions of records, analyze a representative sample 2. **Handle errors gracefully:** Many scientific formats require specific libraries; provide clear installation instructions 3. **Validate metadata:** Cross-check metadata consistency (e.g., stated dimensions vs actual data) 4. **Consider data provenance:** Note instrument, software versions, processing steps ### Report Generation 1. **Be comprehensive:** Include all relevant information for downstream analysis 2. **Be specific:** Provide concrete recommendations based on the file type 3. **Be actionable:** Suggest specific next steps and tools 4. **Include code examples:** Show how to load and work with the data ## Examples ### Example 1: Analyzing a FASTQ file ```python # User provides: "Analyze reads.fastq" # 1. Detect file type extension = '.fastq' category = 'bioinformatics_genomics' # 2. Read reference info # Search references/bioinformatics_genomics_formats.md for "### .fastq" # 3. Perform analysis from Bio import SeqIO sequences = list(SeqIO.parse('reads.fastq', 'fastq')) # Calculate: read count, length distribution, quality scores, GC content # 4. Generate report # Include: format description, analysis results, QC recommendations # 5. Save as: reads_eda_report.md ``` ### Example 2: Analyzing a CSV dataset ```python # User provides: "Explore experiment_results.csv" # 1. Detect: .csv → general_scientific # 2. Load reference for CSV format # 3. Analyze import pandas as pd df = pd.read_csv('experiment_results.csv') # Dimensions, dtypes, missing values, statistics, correlations # 4. Generate report with: # - Data structure # - Missing value patterns # - Statistical summaries # - Correlation matrix # - Outlier detection results # 5. Save report ``` ### Example 3: Analyzing microscopy data ```python # User provides: "Analyze cells.nd2" # 1. Detect: .nd2 → microscopy_imaging (Nikon format) # 2. Read reference for ND2 format # Learn: multi-dimensional (XYZCT), requires nd2reader # 3. Analyze from nd2reader import ND2Reader with ND2Reader('cells.nd2') as images: # Extract: dimensions, channels, timepoints, metadata # Calculate: intensity statistics, frame info # 4. Generate report with: # - Image dimensions (XY, Z-stacks, time, channels) # - Channel wavelengths # - Pixel size and calibration # - Recommendations for image analysis # 5. Save report ``` ## Troubleshooting ### Missing Libraries Many scientific formats require specialized libraries: **Problem:** Import error when trying to read a file **Solution:** Provide clear installation instructions ```python try: from Bio import SeqIO except ImportError: print("Install Biopython: uv pip install biopython") ``` Common requirements by category: - **Bioinformatics:** `biopython`, `pysam`, `pyBigWig` - **Chemistry:** `rdkit`, `mdanalysis`, `cclib` - **Microscopy:** `tifffile`, `nd2reader`, `aicsimageio`, `pydicom` - **Spectroscopy:** `nmrglue`, `pymzml`, `pyteomics` - **General:** `pandas`, `numpy`, `h5py`, `scipy` ### Unknown File Types If a file extension is not in the references: 1. Ask the user about the file format 2. Check if it's a vendor-specific variant 3. Attempt generic analysis based on file structure (text vs binary) 4. Provide general recommendations ### Large Files For very large files: 1. Use sampling strategies (first N records) 2. Use memory-mapped access (for HDF5, NPY) 3. Process in chunks (for CSV, FASTQ) 4. Provide estimates based on samples ## Script Usage The `scripts/eda_analyzer.py` can be used directly: ```bash # Basic usage python scripts/eda_analyzer.py data.csv # Specify output file python scripts/eda_analyzer.py data.csv output_report.md # The script will: # 1. Auto-detect file type # 2. Load format references # 3. Perform appropriate analysis # 4. Generate markdown report ``` The script supports automatic analysis for many common formats, but custom analysis in the conversation provides more flexibility and domain-specific insights. ## Advanced Usage ### Multi-File Analysis When analyzing multiple related files: 1. Perform individual EDA on each file 2. Create a summary comparison report 3. Identify relationships and dependencies 4. Suggest integration strategies ### Quality Control For data quality assessment: 1. Check format compliance 2. Validate metadata consistency 3. Assess completeness 4. Identify outliers and anomalies 5. Compare to expected ranges/distributions ### Preprocessing Recommendations Based on data characteristics, recommend: 1. Normalization strategies 2. Missing value imputation 3. Outlier handling 4. Batch correction 5. Format conversions ## Resources ### scripts/ - `eda_analyzer.py`: Comprehensive analysis script that can be run directly or imported ### references/ - `chemistry_molecular_formats.md`: 60+ chemistry/molecular file formats - `bioinformatics_genomics_formats.md`: 50+ bioinformatics formats - `microscopy_imaging_formats.md`: 45+ imaging formats - `spectroscopy_analytical_formats.md`: 35+ spectroscopy formats - `proteomics_metabolomics_formats.md`: 30+ omics formats - `general_scientific_formats.md`: 30+ general formats ### assets/ - `report_template.md`: Comprehensive markdown template for EDA reports

Source: claude-code-templates (MIT). See About Us for full credits.

BAGUA AI