
Backup Recovery


# Backup and Disaster Recovery

Comprehensive guide to backup strategies, disaster recovery planning, business continuity, and data protection for IT operations.

## Table of Contents

- [Backup Strategy](#backup-strategy)
- [Backup Types](#backup-types)
- [Backup Tools](#backup-tools)
- [Disaster Recovery Planning](#disaster-recovery-planning)
- [Business Continuity](#business-continuity)
- [Recovery Testing](#recovery-testing)
- [Cloud Backup Solutions](#cloud-backup-solutions)
- [Database Backups](#database-backups)
- [Backup Monitoring](#backup-monitoring)

## Backup Strategy

### 3-2-1 Backup Rule

```yaml
The Gold Standard: 3-2-1 Rule

3 Copies of Data:
  - 1 Production copy
  - 2 Backup copies

2 Different Media Types:
  - Local storage (NAS, SAN)
  - Cloud storage (S3, Azure Blob)
  - Or: Disk + Tape

1 Offsite Copy:
  - Geographic separation
  - Protection against site disasters
  - Cloud storage or remote datacenter

Example Implementation:
  Production: Database server (primary)
  Backup 1: Local NAS (hourly snapshots)
  Backup 2: Cloud storage S3 (daily backups)

  Result: 3 copies, 2 media (disk + cloud), 1 offsite (cloud)
```

### Backup Policy Framework

```yaml
RPO (Recovery Point Objective):
  Definition: Maximum acceptable data loss (time)
  Question: "How much data can we afford to lose?"

  Examples:
    Critical databases: RPO = 15 minutes (need transaction log backups)
    File servers: RPO = 24 hours (daily backups acceptable)
    Development servers: RPO = 7 days (weekly backups)

RTO (Recovery Time Objective):
  Definition: Maximum acceptable downtime (time)
  Question: "How quickly must we recover?"
  Examples:
    E-commerce site: RTO = 1 hour (hot standby, fast recovery)
    Internal tools: RTO = 8 hours (restore from backup)
    Archive data: RTO = 72 hours (restore from tape/Glacier)

Retention Policy:
  Daily backups: Keep 7 days
  Weekly backups: Keep 4 weeks
  Monthly backups: Keep 12 months
  Yearly backups: Keep 7 years (compliance)

Grandfather-Father-Son (GFS) Rotation:
  Son: Daily backups (7 days)
  Father: Weekly backups (4 weeks)
  Grandfather: Monthly backups (12 months)
```

### Backup Matrix

| System | Criticality | RPO | RTO | Backup Frequency | Retention | Method |
|--------|-------------|-----|-----|------------------|-----------|--------|
| Production Database | Critical | 15 min | 1 hour | Continuous (transaction logs) + Daily full | 30 days | Replication + Snapshots |
| Application Servers | High | 1 hour | 4 hours | Hourly incremental | 7 days | Agent-based |
| File Servers | Medium | 24 hours | 8 hours | Daily | 30 days | Filesystem snapshots |
| Development | Low | 7 days | 24 hours | Weekly | 14 days | Full backup |
| Workstations | Low | N/A | N/A | User responsibility | N/A | Cloud sync |

## Backup Types

### Full Backup

```yaml
Description:
  - Complete copy of all data
  - Self-contained (no dependencies)

Pros:
  - Simplest to restore (single backup set)
  - Fastest restore time
  - No dependency on other backups

Cons:
  - Slowest backup time
  - Largest storage requirement
  - Most network bandwidth

Use Case:
  - Weekly or monthly baseline
  - Small datasets (< 1 TB)
  - High-priority systems

Time Required:
  - 1 TB database: 2-4 hours (to disk)
  - 10 TB file server: 20-40 hours
```

### Incremental Backup

```yaml
Description:
  - Only backs up changes since last backup (full or incremental)
  - Creates chain of dependencies

Pros:
  - Fastest backup time
  - Smallest storage requirement
  - Least network bandwidth

Cons:
  - Slowest restore (need full + all incrementals)
  - Higher restore complexity
  - Chain dependency (missing link = data loss)

Use Case:
  - Daily/hourly backups
  - Large datasets with small changes
  - Bandwidth-constrained environments

Time Required:
  - Daily changes (10 GB): 5-15 minutes

Restore Process:
  1. Restore full backup (baseline)
  2. Apply incremental 1
  3. Apply incremental 2
  4. ... apply all incrementals in order
```

### Differential Backup

```yaml
Description:
  - Backs up changes since last FULL backup
  - Each differential is cumulative

Pros:
  - Faster restore than incremental (only need full + latest differential)
  - Simpler dependency chain
  - Easier to manage than incremental

Cons:
  - Slower backups than incremental (backup size grows daily)
  - More storage than incremental

Use Case:
  - Compromise between full and incremental
  - Weekly full + daily differentials

Time Required:
  - Day 1 differential: 10 GB (15 min)
  - Day 2 differential: 20 GB (30 min)
  - Day 6 differential: 60 GB (90 min)

Restore Process:
  1. Restore full backup
  2. Apply latest differential only
```

### Snapshot Backup

```yaml
Description:
  - Point-in-time copy using storage features
  - Copy-on-write or redirect-on-write
  - Nearly instantaneous

Pros:
  - Very fast to create (seconds)
  - Minimal performance impact
  - Multiple snapshots (hourly, daily)
  - Fast rollback

Cons:
  - Depends on source storage (not offsite)
  - Storage overhead grows over time
  - Limited retention (storage capacity)

Use Case:
  - Frequent recovery points (hourly)
  - VM backups
  - Database consistency points
  - Pre-change snapshots

Examples:
  - LVM snapshots (Linux)
  - ZFS snapshots
  - VMware snapshots
  - AWS EBS snapshots
```

### Continuous Data Protection (CDP)

```yaml
Description:
  - Real-time or near real-time replication
  - Every change is captured
  - Can recover to any point in time

Pros:
  - RPO near zero ([...]
```

[...]

- [...] 4 hours)
- Catastrophic system failure (ransomware, data corruption)
- Prolonged network outage (> 2 hours)
- Legal/safety order to evacuate

Decision Maker: CTO or VP Engineering

## 5. Recovery Procedures

### Phase 1: Assessment (0-30 minutes)

1. Assess extent of disaster
2. Activate DR team (conference call)
3. Declare disaster (DR Coordinator)
4. Notify stakeholders
5. Update status page

### Phase 2: Failover to DR Site (30 minutes - 2 hours)

1. Verify DR site accessibility
2. Restore latest backups to DR site
   - Database: Restore from S3 (1 hour)
   - Application: Deploy from Git (30 minutes)
3. Update DNS to point to DR site (5 minutes + TTL propagation)
4. Validate connectivity and functionality

### Phase 3: Service Validation (2-3 hours)

1. Run smoke tests
2. Verify database integrity
3. Test critical user workflows
4. Monitor error rates and performance

### Phase 4: Operations at DR Site (3-4 hours)

1. Begin normal operations from DR site
2. Continuous monitoring
3. Communicate to users: "Services restored"

### Phase 5: Return to Primary (Days/Weeks)

1. Repair/rebuild primary site
2. Replicate data back to primary
3. Scheduled failback (low-traffic window)
4. Validate primary site
5. Return to normal operations

## 6. Step-by-Step Recovery

### Database Recovery

```bash
# 1. Restore database from S3 backup
aws s3 cp s3://backups/db-latest.sql.gz /tmp/
gunzip /tmp/db-latest.sql.gz

# 2. Create new database
createdb production

# 3. Restore data
psql production [...]
```

[...]

```bash
[...] ${BACKUP_DIR}/full_${DATE}.dump.gz

# Upload to S3
aws s3 cp "${BACKUP_DIR}/full_${DATE}.dump.gz" "${S3_BUCKET}/full/"

# Continuous archiving (transaction logs)
# In postgresql.conf:
#   wal_level = replica
#   archive_mode = on
#   archive_command = 'aws s3 cp %p s3://my-db-backups/wal/%f'

# Restore process:
# 1. Restore from full backup
#    (decompress first; pg_restore does not read externally gzipped files)
#    gunzip -c /backup/full_latest.dump.gz | pg_restore -U postgres -d production
#
# 2. Create recovery.conf for PITR
#    (PostgreSQL 11 and earlier; from version 12 the recovery settings live in
#    postgresql.conf alongside a recovery.signal file)
#    cat > /var/lib/postgresql/data/recovery.conf < [...]
```

[...]

```bash
[...] ${BACKUP_DIR}/full_${DATE}.sql.gz

# Binary log backups (continuous)
# In my.cnf:
#   log_bin = /var/log/mysql/mysql-bin
#   expire_logs_days = 7   # deprecated in MySQL 8.0; use binlog_expire_logs_seconds
#   sync_binlog = 1

# Backup binary logs (the [0-9]* pattern skips the mysql-bin.index file)
mysqlbinlog /var/log/mysql/mysql-bin.[0-9]* | gzip > "${BACKUP_DIR}/binlog_${DATE}.sql.gz"

# Point-in-Time Recovery:
# 1. Restore full backup
#    gunzip [...]
```

[...]

```python
    [...] max_age_hours:
        return [f"Latest backup is {age_hours:.1f} hours old (threshold: {max_age_hours})"]
    return []


# Requires: import smtplib; from email.mime.text import MIMEText
def send_alert(subject, body):
    """Send an email alert through the local SMTP relay."""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'backup-monitor@example.com'  # placeholder address
    msg['To'] = 'ops-team@example.com'          # placeholder address
    s = smtplib.SMTP('localhost')
    s.send_message(msg)
    s.quit()


def main():
    issues = []

    # Check AWS Backup jobs
    failed_jobs = check_aws_backup_jobs()
    if failed_jobs:
        issues.append(f"Failed backup jobs: {len(failed_jobs)}")
        for job in failed_jobs:
            issues.append(f"  - {job['ResourceArn']}: {job['StatusMessage']}")

    # Check backup age
    age_issues = check_backup_age('my-backups', 'database/')
    issues.extend(age_issues)

    # Alert if issues found
    if issues:
        send_alert(
            'Backup Health Check FAILED',
            'Backup issues detected:\n\n' + '\n'.join(issues)
        )
        print('ALERT:', '\n'.join(issues))
    else:
        print('All backup checks passed')


if __name__ == '__main__':
    main()
```

This guide covers backup strategy, backup types, disaster recovery procedures, database backups, and backup monitoring for protecting data and maintaining business continuity.
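The restore processes listed under Incremental Backup and Differential Backup can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tooling: the `Backup` record and the `restore_chain` helper are hypothetical names, and the sketch assumes a schedule uses either incrementals or differentials between full backups, never both.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Backup:
    """Hypothetical catalog entry for one backup set."""
    day: date
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: List[Backup], target: date) -> List[Backup]:
    """Return the backup sets to restore, in order, to reach `target`."""
    # Only backups taken on or before the target date are usable.
    history = sorted((b for b in backups if b.day <= target), key=lambda b: b.day)

    # The chain always starts at the most recent full backup.
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup on or before the target date")
    start = full_positions[-1]
    full, later = history[start], history[start + 1:]

    # Differential scheme: full + latest differential only.
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        return [full, differentials[-1]]

    # Incremental scheme: full + every incremental, in order.
    return [full] + [b for b in later if b.kind == "incremental"]


if __name__ == "__main__":
    sunday = date(2024, 1, 7)
    incremental_week = [Backup(sunday, "full")] + [
        Backup(date(2024, 1, 7 + n), "incremental") for n in range(1, 7)
    ]
    for step in restore_chain(incremental_week, date(2024, 1, 10)):
        print(step.day, step.kind)
```

With a weekly full and daily incrementals, restoring to Wednesday needs four backup sets; the same week with differentials needs two, which is the trade-off the Pros and Cons lists describe.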

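The retention policy and GFS rotation from the Backup Policy Framework can be expressed as a pruning rule. The sketch below is one possible reading under stated assumptions: the function name `gfs_keep` is hypothetical, the weekly copy is taken to be Sunday's backup, the monthly copy the backup from the 1st of the month, the yearly copy the one from January 1, and months and years are approximated as 365-day windows.

```python
from datetime import date, timedelta
from typing import Iterable, Set


def gfs_keep(backup_days: Iterable[date], today: date) -> Set[date]:
    """Return the backup dates to retain under the 7d / 4w / 12m / 7y policy."""
    keep = set()
    for d in backup_days:
        age = (today - d).days
        if age < 0:
            continue  # ignore backups dated in the future
        if age < 7:
            keep.add(d)  # Son: daily backups, kept 7 days
        elif d.weekday() == 6 and age < 28:
            keep.add(d)  # Father: Sunday backups, kept 4 weeks
        elif d.day == 1 and age < 365:
            keep.add(d)  # Grandfather: 1st-of-month backups, kept 12 months
        elif d.month == 1 and d.day == 1 and age < 7 * 365:
            keep.add(d)  # Yearly: January 1 backups, kept 7 years
    return keep


if __name__ == "__main__":
    today = date(2024, 3, 15)
    # One backup per day for the last 400 days.
    all_days = [today - timedelta(days=n) for n in range(400)]
    retained = gfs_keep(all_days, today)
    print(f"{len(retained)} of {len(all_days)} backups retained")
```

Anything not in the returned set is a candidate for deletion; a real pruning job would typically also refuse to delete the only remaining full backup.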
Source: claude-code-templates (MIT). See About Us for full credits.
