
Protein Optimization


# Protein Sequence Optimization

## Overview

Before submitting protein sequences for experimental testing, use computational tools to optimize them for improved expression, solubility, and stability. This pre-screening reduces experimental costs and increases success rates.

## Common Protein Expression Problems

### 1. Unpaired Cysteines

**Problem:**
- Unpaired cysteines can form unwanted disulfide bonds
- Leads to aggregation and misfolding
- Reduces expression yield and stability

**Solution:**
- Remove unpaired cysteines unless functionally necessary
- Pair cysteines appropriately for structural disulfides
- Replace with serine or alanine in non-critical positions

**Example:**
```python
# Check for cysteine pairs
def check_cysteines(sequence):
    """Count cysteines and warn when the count is odd (at least one is unpaired)."""
    cys_count = sequence.count('C')
    if cys_count % 2 != 0:
        print(f"Warning: Odd number of cysteines ({cys_count})")
    # An even count does not prove every cysteine is paired;
    # confirming pairing requires structural information.
    return cys_count
```

### 2. Excessive Hydrophobicity

**Problem:**
- Long hydrophobic patches promote aggregation
- Exposed hydrophobic residues drive protein clumping
- Poor solubility in aqueous buffers

**Solution:**
- Maintain balanced hydropathy profiles
- Use short, flexible linkers between domains
- Reduce surface-exposed hydrophobic residues

**Metrics:**
- Kyte-Doolittle hydropathy plots
- GRAVY score (Grand Average of Hydropathy)
- pSAE (percent Solvent-Accessible hydrophobic residues)

### 3. Low Solubility

**Problem:**
- Proteins precipitate during expression or purification
- Inclusion body formation
- Difficult downstream processing

**Solution:**
- Use solubility prediction tools for pre-screening
- Apply sequence optimization algorithms
- Add solubilizing tags if needed

## Computational Tools for Optimization

### NetSolP - Initial Solubility Screening

**Purpose:** Fast solubility prediction for filtering sequences.

**Method:** Machine learning model trained on E. coli expression data.
**Usage:**
```python
# Install: uv pip install requests
import requests

def predict_solubility_netsolp(sequence):
    """Predict protein solubility using NetSolP web service"""
    url = "https://services.healthtech.dtu.dk/services/NetSolP-1.0/api/predict"
    data = {
        "sequence": sequence,
        "format": "fasta"
    }
    response = requests.post(url, data=data, timeout=60)
    response.raise_for_status()  # Fail on HTTP errors instead of parsing an error page
    return response.json()

# Example
sequence = "MKVLWAALLGLLGAAA..."
result = predict_solubility_netsolp(sequence)
print(f"Solubility score: {result['score']}")
```

**Interpretation:**
- Score > 0.5: Likely soluble
- Score < 0.5: [...]

[...]

```python
# [...] The earlier lines of screen_variants_soluprot() are missing from the source.
#         [...] 0.6
#     })
#     return results

# Example
sequences = {
    'variant_1': 'MKVLW...',
    'variant_2': 'MATGV...'
}

results = screen_variants_soluprot(sequences)
soluble_variants = [r for r in results if r['predicted_soluble']]
```

**Interpretation:**
- Score > 0.6: High solubility confidence
- Score 0.4-0.6: Uncertain, may need optimization
- Score < 0.4: [...]

[...]

```python
# [...] The earlier lines of this example, which define `stability`, are missing from the source.
if stability['ipTM'] > 0.7:
    print("High confidence interface")
elif stability['ipTM'] > 0.5:
    print("Moderate confidence interface")
else:
    print("Low confidence interface - may need redesign")
```

**Interpretation:**
- ipTM > 0.7: Strong predicted interface
- ipTM 0.5-0.7: Moderate interface confidence
- ipTM < 0.5: Weak interface, consider redesign

**When to use:**
- Antibody-antigen design
- Protein-protein interaction engineering
- Validating binding interfaces
- Comparing interface variants

### pSAE - Solvent-Accessible Hydrophobic Residues

**Purpose:** Quantify exposed hydrophobic residues that promote aggregation.

**Method:** Calculates the percentage of solvent-accessible surface area (SASA) occupied by hydrophobic residues.
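The definition in the Method line can be written as a ratio. This is a sketch of that definition only: the symbols $H$ (the set of hydrophobic residues), $N$ (the number of residues), and $\mathrm{SASA}_i$ (the accessible area of residue $i$) are notation introduced here for illustration, and the numbers in the worked line are hypothetical.

```latex
\[
\mathrm{pSAE} = 100 \times \frac{\sum_{i \in H} \mathrm{SASA}_i}{\sum_{i=1}^{N} \mathrm{SASA}_i}
\]

% Hypothetical worked example: 3000 \AA^2 of hydrophobic area out of 12000 \AA^2 total
\[
\mathrm{pSAE} = 100 \times \frac{3000}{12000} = 25\%
\]
```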
**Usage:**
```python
# Requires structure (PDB file or AlphaFold prediction)
# Install: uv pip install biopython (the external DSSP executable must also be installed)
from Bio.PDB import PDBParser, DSSP

def calculate_psae(pdb_file):
    """
    Calculate percent Solvent-Accessible hydrophobic residues (pSAE)
    Lower pSAE = better solubility
    """
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure('protein', pdb_file)

    # Run DSSP to get solvent accessibility
    model = structure[0]
    dssp = DSSP(model, pdb_file, acc_array='Wilke')

    # DSSP reports one-letter residue codes, so match against those
    hydrophobic = set('AVILMFWP')

    total_sasa = 0.0
    hydrophobic_sasa = 0.0

    for key in dssp.keys():
        residue = dssp[key]
        res_name = residue[1]
        # Relative accessibility per residue, summed here as a proxy for SASA
        rel_accessibility = residue[3]
        if rel_accessibility == 'NA':  # DSSP could not compute a value
            continue
        total_sasa += rel_accessibility
        if res_name in hydrophobic:
            hydrophobic_sasa += rel_accessibility

    if total_sasa == 0:
        raise ValueError("No solvent-accessible residues found")

    psae = (hydrophobic_sasa / total_sasa) * 100
    return psae

# Example
pdb_file = "protein_structure.pdb"
psae_score = calculate_psae(pdb_file)
print(f"pSAE: {psae_score:.2f}%")

# Interpretation
if psae_score < 25:
    print("Good solubility expected")
elif psae_score < 35:
    print("Moderate solubility")
else:
    print("High aggregation risk")
```

**Interpretation:**
- pSAE < 25%: Good solubility expected
- pSAE 25-35%: Moderate solubility
- pSAE > 35%: High aggregation risk

**When to use:**
- Analyzing designed structures
- Post-AlphaFold validation
- Identifying aggregation hotspots
- Guiding surface mutations

## Recommended Optimization Workflow

### Step 1: Initial Screening (Fast)

```python
def initial_screening(sequences):
    """
    Quick first-pass filtering using NetSolP
    Filters out obviously problematic sequences
    """
    passed = []
    for name, seq in sequences.items():
        # predict_solubility_netsolp returns the parsed JSON response
        netsolp_score = predict_solubility_netsolp(seq)['score']
        if netsolp_score > 0.5:
            passed.append((name, seq))

    return passed
```

### Step 2: Detailed Assessment (Moderate)

```python
def detailed_assessment(filtered_sequences):
    """
    More thorough analysis with SoluProt and ESM
    Ranks sequences by multiple criteria
    """
    results = []
    for name, seq in filtered_sequences:
        soluprot_score = predict_solubility(seq)
        esm_score = score_sequence_esm(seq)

        # Weighted combination: 70% SoluProt, 30% ESM
        combined_score = soluprot_score * 0.7 + esm_score * 0.3

        results.append({
            'name': name,
            'sequence': seq,
            'soluprot': soluprot_score,
            'esm': esm_score,
            'combined': combined_score
        })

    # Best combined score first
    results.sort(key=lambda x: x['combined'], reverse=True)
    return results
```

### Step 3: Sequence Optimization (If needed)

```python
def optimize_problematic_sequences(sequences_needing_optimization):
    """
    Use SolubleMPNN to redesign problematic sequences
    Returns improved variants
    """
    optimized = []
    for name, seq in sequences_needing_optimization:
        # Generate multiple variants
        variants = optimize_sequence(
            sequence=seq,
            num_variants=10,
            temperature=0.2
        )

        # Score variants with ESM
        for variant in variants:
            variant['esm_score'] = score_sequence_esm(variant['sequence'])

        # Keep best variants
        variants.sort(
            key=lambda x: x['solubility_score'] * x['esm_score'],
            reverse=True
        )

        optimized.extend(variants[:3])  # Top 3 variants per sequence

    return optimized
```

### Step 4: Structure-Based Validation (For critical sequences)

```python
def structure_validation(top_candidates):
    """
    Predict structures and calculate pSAE for top candidates
    Final validation before experimental testing
    """
    validated = []
    for candidate in top_candidates:
        # Predict structure with AlphaFold
        structure_pdb = predict_structure_alphafold(candidate['sequence'])

        # Calculate pSAE
        psae = calculate_psae(structure_pdb)

        candidate['psae'] = psae
        # candidate['pass_structure_check'] = psae < [...]
        # [...] The rest of this function is missing from the source.
```

[...]

```python
# [...] Surviving fragments of a later example; the lines around them are missing.
# [...] > 0.6]
# needs_optimization = [s for s in assessed if s['soluprot'] < [...]
```

[...]

```python
# [...] The code that builds fasta_content is missing from the source; one line survives:
# [...] f">{seq_data['name']}\n{seq_data['sequence']}\n"

# Submit to Adaptyv
import requests

response = requests.post(
    "https://kq5jp7qj7wdqklhsxmovkzn4l40obksv.lambda-url.eu-central-1.on.aws/experiments",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "sequences": fasta_content,
        "experiment_type": "expression",
        "metadata": {
            "optimization_method": "SolubleMPNN_ESM_pipeline",
            "computational_scores": [s['combined'] for s in optimized_sequences[:50]]
        }
    }
)
```

## Troubleshooting

**Issue: All sequences
score poorly on solubility predictions**
- Check if sequences contain unusual amino acids
- Verify FASTA format is correct
- Consider if protein family is naturally low-solubility
- May need experimental validation despite predictions

**Issue: SolubleMPNN changes functionally important residues**
- Provide structure file to preserve spatial constraints
- Mask critical residues from mutation
- Lower temperature parameter for conservative changes
- Manually revert problematic mutations

**Issue: ESM scores are low after optimization**
- Optimization may be too aggressive
- Try lower temperature in SolubleMPNN
- Balance between solubility and naturalness
- Consider that some optimization may require non-natural mutations

**Issue: Predictions don't match experimental results**
- Predictions are probabilistic, not deterministic
- Host system and conditions affect expression
- Some proteins may need experimental validation
- Use predictions as enrichment, not absolute filters
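For the first issue above, the two input checks (unusual amino acids and FASTA format) can be run locally before any predictor is called. The following is a minimal sketch using only the standard library; the names `parse_fasta`, `find_unusual_residues`, and `STANDARD_AA` are hypothetical helpers introduced here, not part of any tool named in this document, and "unusual" is assumed to mean anything outside the 20 standard one-letter codes.

```python
# Hypothetical helpers for sanity-checking input before solubility prediction.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; raise ValueError on malformed input."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("FASTA header has no name")
            name = header.split()[0]  # first token of the header is the record name
            if name in records:
                raise ValueError(f"Duplicate FASTA name: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("Sequence data found before any '>' header")
        else:
            records[name] += line.upper()  # join wrapped sequence lines
    return records


def find_unusual_residues(sequence):
    """Return (1-based position, residue) pairs outside the 20 standard amino acids."""
    return [
        (pos, aa)
        for pos, aa in enumerate(sequence.upper(), start=1)
        if aa not in STANDARD_AA
    ]


if __name__ == "__main__":
    fasta_text = ">variant_1 test construct\nMKVLW\nAAX\n>variant_2\nMATGV\n"
    for rec_name, seq in parse_fasta(fasta_text).items():
        print(rec_name, find_unusual_residues(seq))
```

Reporting positions rather than simply rejecting a sequence makes it easier to decide whether a flagged character is a typo, an ambiguity code such as `X`, or a deliberate non-standard residue.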

Source: claude-code-templates (MIT). See About Us for full credits.
