# Backup and Disaster Recovery
Comprehensive guide to backup strategies, disaster recovery planning, business continuity, and data protection for IT operations.
## Table of Contents
- [Backup Strategy](#backup-strategy)
- [Backup Types](#backup-types)
- [Backup Tools](#backup-tools)
- [Disaster Recovery Planning](#disaster-recovery-planning)
- [Business Continuity](#business-continuity)
- [Recovery Testing](#recovery-testing)
- [Cloud Backup Solutions](#cloud-backup-solutions)
- [Database Backups](#database-backups)
- [Backup Monitoring](#backup-monitoring)
## Backup Strategy
### 3-2-1 Backup Rule
```yaml
The Gold Standard: 3-2-1 Rule
3 Copies of Data:
- 1 Production copy
- 2 Backup copies
2 Different Media Types:
- Local storage (NAS, SAN)
- Cloud storage (S3, Azure Blob)
- Or: Disk + Tape
1 Offsite Copy:
- Geographic separation
- Protection against site disasters
- Cloud storage or remote datacenter
Example Implementation:
  Production: Database server (primary)
  Backup 1: Local NAS (hourly snapshots)
  Backup 2: Cloud storage S3 (daily backups)
  Result: 3 copies, 2 media (disk + cloud), 1 offsite (cloud)
```
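The rule can be checked mechanically against an inventory of copies. The following is a minimal sketch, assuming each copy is described by a hypothetical `DataCopy` record with a media type and an offsite flag; the names are illustrative and do not come from any backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str     # e.g. "disk", "cloud", "tape"
    offsite: bool  # True if stored away from the primary site


def check_3_2_1(copies):
    """Return a list of 3-2-1 violations; an empty list means compliant."""
    problems = []
    if len(copies) < 3:
        problems.append(f"need 3 copies, found {len(copies)}")
    media_types = {c.media for c in copies}
    if len(media_types) < 2:
        problems.append(f"need 2 media types, found {len(media_types)}")
    if not any(c.offsite for c in copies):
        problems.append("need at least 1 offsite copy")
    return problems


# The example implementation from the rule: primary database, local NAS, S3.
example = [
    DataCopy("Production database", media="disk", offsite=False),
    DataCopy("Local NAS snapshots", media="disk", offsite=False),
    DataCopy("S3 daily backups", media="cloud", offsite=True),
]
print(check_3_2_1(example))  # → []
```

Returning a list of problems rather than a boolean makes the result usable directly in an alert message.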
### Backup Policy Framework
```yaml
RPO (Recovery Point Objective):
  Definition: Maximum acceptable data loss (time)
  Question: "How much data can we afford to lose?"
  Examples:
    Critical databases: RPO = 15 minutes (need transaction log backups)
    File servers: RPO = 24 hours (daily backups acceptable)
    Development servers: RPO = 7 days (weekly backups)
RTO (Recovery Time Objective):
  Definition: Maximum acceptable downtime (time)
  Question: "How quickly must we recover?"
  Examples:
    E-commerce site: RTO = 1 hour (hot standby, fast recovery)
    Internal tools: RTO = 8 hours (restore from backup)
    Archive data: RTO = 72 hours (restore from tape/glacier)
Retention Policy:
  Daily backups: Keep 7 days
  Weekly backups: Keep 4 weeks
  Monthly backups: Keep 12 months
  Yearly backups: Keep 7 years (compliance)
Grandfather-Father-Son (GFS) Rotation:
  Son: Daily backups (7 days)
  Father: Weekly backups (4 weeks)
  Grandfather: Monthly backups (12 months)
```
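A GFS rotation can be expressed as a function that decides which backup dates to keep. This is a minimal sketch under stated assumptions: one backup per day, Sunday backups serve as weeklies, and first-of-month backups serve as monthlies. The function name `gfs_keep` is hypothetical, and the yearly tier is left out for brevity.

```python
from datetime import date, timedelta


def gfs_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates to retain under a GFS policy."""
    keep = set()
    for d in backup_dates:
        age_days = (today - d).days
        months_ago = (today.year - d.year) * 12 + (today.month - d.month)
        if 0 <= age_days < daily:                              # Son
            keep.add(d)
        elif d.weekday() == 6 and 0 <= age_days < weekly * 7:  # Father
            keep.add(d)
        elif d.day == 1 and 0 <= months_ago < monthly:         # Grandfather
            keep.add(d)
    return keep


today = date(2024, 3, 15)
backups = [today - timedelta(days=n) for n in range(400)]
kept = gfs_keep(backups, today)
print(len(kept), "of", len(backups), "backups retained")
```

Everything not in the returned set is a candidate for pruning, so a retention job would delete `set(backups) - kept`.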
### Backup Matrix
| System | Criticality | RPO | RTO | Backup Frequency | Retention | Method |
|--------|-------------|-----|-----|------------------|-----------|--------|
| Production Database | Critical | 15 min | 1 hour | Continuous (transaction logs) + Daily full | 30 days | Replication + Snapshots |
| Application Servers | High | 1 hour | 4 hours | Hourly incremental | 7 days | Agent-based |
| File Servers | Medium | 24 hours | 8 hours | Daily | 30 days | Filesystem snapshots |
| Development | Low | 7 days | 24 hours | Weekly | 14 days | Full backup |
| Workstations | Low | N/A | N/A | User responsibility | N/A | Cloud sync |
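A matrix like this is only consistent if each backup frequency can actually deliver the stated RPO. The sketch below assumes that worst-case data loss is roughly one backup interval; `rpo_violations` and the last system in the example are hypothetical.

```python
from datetime import timedelta


def rpo_violations(systems):
    """Return names of systems whose backup interval exceeds their RPO.

    `systems` maps a name to a (backup_interval, rpo) pair of timedeltas.
    """
    return [name for name, (interval, rpo) in systems.items() if interval > rpo]


# Three rows from the matrix, plus one hypothetical misconfigured system.
systems = {
    "Application Servers": (timedelta(hours=1), timedelta(hours=1)),
    "File Servers": (timedelta(days=1), timedelta(hours=24)),
    "Development": (timedelta(weeks=1), timedelta(days=7)),
    "Billing API (hypothetical)": (timedelta(days=1), timedelta(hours=1)),
}
print(rpo_violations(systems))  # → ['Billing API (hypothetical)']
```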
## Backup Types
### Full Backup
```yaml
Description:
- Complete copy of all data
- Self-contained (no dependencies)
Pros:
- Simplest to restore (single backup set)
- Fastest restore time
- No dependency on other backups
Cons:
- Slowest backup time
- Largest storage requirement
- Most network bandwidth
Use Case:
- Weekly or monthly baseline
- Small datasets (< 1 TB)
- High-priority systems
Time Required:
- 1 TB database: 2-4 hours (to disk)
- 10 TB file server: 20-40 hours
```
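The durations listed under Time Required correspond to a sustained throughput of roughly 70 to 140 MB/s. A small helper makes the arithmetic explicit; it is a sketch that assumes decimal units (1 TB = 1,000,000 MB) and a constant transfer rate, and `backup_hours` is a hypothetical name.

```python
def backup_hours(size_tb, throughput_mb_s):
    """Hours needed to copy size_tb terabytes at throughput_mb_s MB/s."""
    seconds = size_tb * 1_000_000 / throughput_mb_s
    return seconds / 3600


print(round(backup_hours(1, 100), 2))   # → 2.78
print(round(backup_hours(10, 100), 2))  # → 27.78
```

If the estimate for a full backup exceeds the available backup window, that is a signal to move to incremental or differential backups between fulls.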
### Incremental Backup
```yaml
Description:
- Only backs up changes since last backup (full or incremental)
- Creates chain of dependencies
Pros:
- Fastest backup time
- Smallest storage requirement
- Least network bandwidth
Cons:
- Slowest restore (need full + all incrementals)
- Higher restore complexity
- Chain dependency (missing link = data loss)
Use Case:
- Daily/hourly backups
- Large datasets with small changes
- Bandwidth-constrained environments
Time Required:
- Daily changes (10 GB): 5-15 minutes
Restore Process:
  1. Restore full backup (baseline)
  2. Apply incremental 1
  3. Apply incremental 2
  4. Apply all remaining incrementals in order
```
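The chain dependency can be illustrated with a small catalog that records each backup's parent. This is a sketch only: the catalog layout and the `restore_chain` function are assumptions for illustration, and the catalog is assumed to contain no cycles.

```python
def restore_chain(catalog, target):
    """Return the backups to apply, oldest first, to restore `target`.

    `catalog` maps a backup name to its parent's name, or to None for a
    full backup. Raises LookupError if any link in the chain is missing.
    """
    chain = []
    current = target
    while current is not None:
        if current not in catalog:
            raise LookupError(f"missing backup in chain: {current}")
        chain.append(current)
        current = catalog[current]
    return list(reversed(chain))


catalog = {
    "sun-full": None,
    "mon-incr": "sun-full",
    "tue-incr": "mon-incr",
    "wed-incr": "tue-incr",
}
print(restore_chain(catalog, "wed-incr"))
# → ['sun-full', 'mon-incr', 'tue-incr', 'wed-incr']
```

Removing `tue-incr` from the catalog makes the Wednesday restore raise `LookupError`, which is the "missing link = data loss" case listed under Cons.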
### Differential Backup
```yaml
Description:
- Backs up changes since last FULL backup
- Each differential is cumulative
Pros:
- Faster restore than incremental (only need full + latest differential)
- Simpler dependency chain
- Easier to manage than incremental
Cons:
- Slower than incremental (growing backup size)
- More storage than incremental
Use Case:
- Compromise between full and incremental
- Weekly full + daily differentials
Time Required:
- Day 1 differential: 10 GB (15 min)
- Day 2 differential: 20 GB (30 min)
- Day 6 differential: 60 GB (90 min)
Restore Process:
  1. Restore full backup
  2. Apply latest differential only
```
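The growth pattern under Time Required can be compared with an incremental schedule for the same week. The sketch below assumes a constant 10 GB of new changes per day with no overlap between days; `weekly_storage_gb` is a hypothetical helper.

```python
def weekly_storage_gb(daily_change_gb, days=6):
    """Return (incremental_total, differential_total) in GB for the
    backups taken on the `days` days following a weekly full backup."""
    incremental = [daily_change_gb] * days
    differential = [daily_change_gb * day for day in range(1, days + 1)]
    return sum(incremental), sum(differential)


inc_total, diff_total = weekly_storage_gb(10)
print(inc_total, diff_total)  # → 60 210
```

The differential schedule stores 3.5 times as much in this example, in exchange for a restore that needs only two backup sets.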
### Snapshot Backup
```yaml
Description:
- Point-in-time copy using storage features
- Copy-on-write or redirect-on-write
- Nearly instantaneous
Pros:
- Very fast to create (seconds)
- Minimal performance impact
- Multiple snapshots (hourly, daily)
- Fast rollback
Cons:
- Depends on source storage (not offsite)
- Storage overhead grows over time
- Limited retention (storage capacity)
Use Case:
- Frequent recovery points (hourly)
- VM backups
- Database consistency points
- Pre-change snapshots
Examples:
- LVM snapshots (Linux)
- ZFS snapshots
- VMware snapshots
- AWS EBS snapshots
```
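Copy-on-write explains both why snapshots are nearly instantaneous and why their storage overhead grows over time. The toy model below is a sketch, not how LVM, ZFS, or EBS are implemented: `CowVolume` is a hypothetical class, and unlike real systems it stores a separate preserved copy per snapshot.

```python
class CowVolume:
    """Toy block device with copy-on-write snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = {}  # snapshot name -> {block index: original data}

    def snapshot(self, name):
        # Nothing is copied at creation time, so this is near-instant.
        self.snapshots[name] = {}

    def write(self, index, data):
        # Preserve the old block for each snapshot that has not yet
        # saved its own copy, then overwrite the live block.
        for saved in self.snapshots.values():
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, name):
        saved = self.snapshots[name]
        return [saved.get(i, block) for i, block in enumerate(self.blocks)]

    def overhead(self):
        """Number of blocks held only to serve snapshots."""
        return sum(len(saved) for saved in self.snapshots.values())


vol = CowVolume(["a", "b", "c"])
vol.snapshot("hourly-1")
vol.write(1, "B")
print(vol.read_snapshot("hourly-1"))  # → ['a', 'b', 'c']
print(vol.blocks)                     # → ['a', 'B', 'c']
print(vol.overhead())                 # → 1
```

Because the preserved blocks live on the same volume as the data, losing that volume loses the snapshots too, which is why snapshots alone do not satisfy the offsite requirement.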
### Continuous Data Protection (CDP)
```yaml
Description:
- Real-time or near real-time replication
- Every change is captured
- Can recover to any point in time
Pros:
- RPO near zero
```

Disaster declaration criteria:

- [...] (> 4 hours)
- Catastrophic system failure (ransomware, data corruption)
- Prolonged network outage (> 2 hours)
- Legal/safety order to evacuate

Decision Maker: CTO or VP Engineering
## 5. Recovery Procedures
### Phase 1: Assessment (0-30 minutes)
1. Assess extent of disaster
2. Activate DR team (conference call)
3. Declare disaster (DR Coordinator)
4. Notify stakeholders
5. Update status page
### Phase 2: Failover to DR Site (30 minutes - 2 hours)
1. Verify DR site accessibility
2. Restore latest backups to DR site
   - Database: Restore from S3 (1 hour)
   - Application: Deploy from Git (30 minutes)
3. Update DNS to point to DR site (5 minutes + TTL propagation)
4. Validate connectivity and functionality
### Phase 3: Service Validation (2-3 hours)
1. Run smoke tests
2. Verify database integrity
3. Test critical user workflows
4. Monitor error rates and performance
### Phase 4: Operations at DR Site (3-4 hours)
1. Begin normal operations from DR site
2. Continuous monitoring
3. Communicate to users: "Services restored"
### Phase 5: Return to Primary (Days/Weeks)
1. Repair/rebuild primary site
2. Replicate data back to primary
3. Scheduled failback (low-traffic window)
4. Validate primary site
5. Return to normal operations
## 6. Step-by-Step Recovery
### Database Recovery
```bash
# 1. Restore database from S3 backup
aws s3 cp s3://backups/db-latest.sql.gz /tmp/
gunzip /tmp/db-latest.sql.gz
# 2. Create new database
createdb production
# 3. Restore data
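psql production < /tmp/db-latest.sql
```

## Database Backups

### PostgreSQL

```bash
# [...] > ${BACKUP_DIR}/full_${DATE}.dump.gz
# Upload to S3
aws s3 cp "${BACKUP_DIR}/full_${DATE}.dump.gz" "${S3_BUCKET}/full/"
# Continuous archiving (transaction logs)
# In postgresql.conf:
# wal_level = replica
# archive_mode = on
# archive_command = 'aws s3 cp %p s3://my-db-backups/wal/%f'
# Restore process:
# 1. Restore from full backup (pg_restore cannot read a gzipped archive
#    directly, so decompress to stdin first)
# gunzip -c /backup/full_latest.dump.gz | pg_restore -U postgres -d production
#
# 2. Create recovery.conf for PITR (PostgreSQL 11 and earlier; version 12+
#    uses a recovery.signal file plus restore_command in postgresql.conf)
# cat > /var/lib/postgresql/data/recovery.conf << [...]
```

### MySQL

```bash
# [...] > ${BACKUP_DIR}/full_${DATE}.sql.gz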
# Binary log backups (continuous)
# In my.cnf:
# log_bin = /var/log/mysql/mysql-bin
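# expire_logs_days = 7   (MySQL 8.0+: binlog_expire_logs_seconds = 604800)
# sync_binlog = 1
# Backup binary logs (the [0-9]* glob skips the mysql-bin.index file)
mysqlbinlog /var/log/mysql/mysql-bin.[0-9]* | gzip > "${BACKUP_DIR}/binlog_${DATE}.sql.gz"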
# Point-in-Time Recovery:
# 1. Restore full backup
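# gunzip < [...]
```

## Backup Monitoring

```python
import smtplib
from datetime import datetime, timedelta, timezone
from email.mime.text import MIMEText

import boto3


def check_aws_backup_jobs(hours=24):
    """Return AWS Backup jobs that failed in the last `hours` hours."""
    # Assumed implementation: main() only relies on each job's
    # ResourceArn and StatusMessage fields.
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    backup = boto3.client('backup')
    response = backup.list_backup_jobs(ByState='FAILED', ByCreatedAfter=since)
    return response.get('BackupJobs', [])


def check_backup_age(bucket, prefix, max_age_hours=24):
    """Return an issue if the newest object under the prefix is too old."""
    # Assumption: age is taken from the newest S3 object under the prefix;
    # the 24-hour default is illustrative.
    s3 = boto3.client('s3')
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', [])
    if not objects:
        return [f"No backups found under s3://{bucket}/{prefix}"]
    latest = max(obj['LastModified'] for obj in objects)
    age_hours = (datetime.now(timezone.utc) - latest).total_seconds() / 3600
    if age_hours > max_age_hours:
        return [f"Latest backup is {age_hours:.1f} hours old (threshold: {max_age_hours})"]
    return []


def send_alert(subject, body):
    """Send email alert"""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'backup-monitor@example.com'  # placeholder address
    msg['To'] = 'ops-team@example.com'          # placeholder address
    with smtplib.SMTP('localhost') as s:
        s.send_message(msg)


def main():
    issues = []
    # Check AWS Backup jobs
    failed_jobs = check_aws_backup_jobs()
    if failed_jobs:
        issues.append(f"Failed backup jobs: {len(failed_jobs)}")
        for job in failed_jobs:
            status = job.get('StatusMessage', 'no status message')
            issues.append(f"  - {job['ResourceArn']}: {status}")
    # Check backup age
    age_issues = check_backup_age('my-backups', 'database/')
    issues.extend(age_issues)
    # Alert if issues found
    if issues:
        send_alert(
            'Backup Health Check FAILED',
            'Backup issues detected:\n\n' + '\n'.join(issues)
        )
        print('ALERT:', '\n'.join(issues))
    else:
        print('All backup checks passed')


if __name__ == '__main__':
    main()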
```
This guide covers backup strategy and backup types, disaster recovery procedures, database backups, and backup monitoring, with the aim of protecting data and maintaining business continuity.
Source: claude-code-templates (MIT).