What is a disaster recovery plan (DRP)?
A disaster recovery plan (DRP) is a crucial part of any organization's business continuity plan (BCP). It is a documented, structured approach that outlines how an organization can recover from an unplanned incident and resume operations quickly. The plan typically includes regular data backups, redundant systems, and offsite data storage, and is designed to minimize the impact of a disaster and restore normal operations as soon as possible. A DRP is especially important for organizations that rely heavily on their IT infrastructure, since it focuses on recovering data and system functionality after a disaster.
Recovery plan considerations
When an unexpected disaster occurs, it is essential to have a well-thought-out recovery plan in place. The recovery plan should first consider which applications are most crucial to the organization's functioning at a business level. This information can be used to determine the Recovery Time Objective (RTO), which outlines how long the organization can afford for critical applications to be down without causing significant harm. Typically, the RTO is measured in hours, minutes, or seconds. The recovery plan should also consider the Recovery Point Objective (RPO), which refers to the age of the files that must be recovered from data backup storage to enable normal operations to resume. In other words, the RPO specifies when data must be restored to ensure that the organization can pick up where it left off before the disaster struck.
TimescaleDB disaster recovery (DR) on AWS EKS
The Postgres Operator simplifies setting up and managing highly available PostgreSQL clusters on Kubernetes. The clusters are powered by Patroni, a tool that keeps the database available in case of hardware or network failures. The Postgres Operator is designed to be easy to configure and to integrate into automated CI/CD pipelines, which streamline the process of deploying and managing applications. It does this through Postgres manifests, which are custom resources that can be modified and updated without directly accessing the Kubernetes API. This approach promotes infrastructure as code, favoring automated processes over manual operations for managing infrastructure.
How to set up a time series database (TSDB) cluster on Kubernetes:
Step 1. Add the Helm repository for the Postgres Operator:
# add repo for postgres-operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
Step 2. Clone the repo:
git clone https://github.com/zalando/postgres-operator.git
Step 3. Configure the logical backup in the value file (postgres-operator/charts/postgres-operator/values.yaml):
configLogicalBackup:
  logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup:v1.8.0"
  logical_backup_job_prefix: "logical-backup-"
  # storage provider - either "s3" or "gcs"
  logical_backup_provider: "s3"
  # S3 Access Key ID
  logical_backup_s3_access_key_id: ""
  # S3 bucket to store backup results
  logical_backup_s3_bucket: "BUCKET_NAME"
  # S3 region of bucket
  logical_backup_s3_region: "REGION"
  # S3 endpoint url when not using AWS
  logical_backup_s3_endpoint: ""
  # S3 Secret Access Key
  logical_backup_s3_secret_access_key: ""
  # S3 server side encryption
  logical_backup_s3_sse: "AES256"
  # S3 retention time for stored backups, for example "2 week" or "7 days"
  logical_backup_s3_retention_time: ""
  # backup schedule in the cron format
  logical_backup_schedule: "00 2 * * *"
Step 4. Install the postgres-operator:
helm install postgres-operator postgres-operator-charts/postgres-operator
Step 5. Check whether the Postgres Operator is running. Starting the operator may take a few seconds; make sure the operator pod is running before applying a Postgres cluster manifest.
kubectl get pod -l app.kubernetes.io/name=postgres-operator
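If you prefer to block until the operator is ready rather than polling manually, `kubectl wait` can be used with the same label selector as above; the two-minute timeout here is an arbitrary choice:

```shell
# Block until the operator pod reports Ready, or fail after two minutes
kubectl wait pod \
  -l app.kubernetes.io/name=postgres-operator \
  --for=condition=Ready \
  --timeout=120s
```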
Step 6. Deploy the Postgres cluster. Once the operator pod is running, apply the cluster manifest:
kubectl create -f manifests/complete-postgres-manifest.yaml
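To confirm the cluster came up, you can list the `postgresql` custom resource and its pods; the `application=spilo` label is set by the operator on all database pods it creates:

```shell
# Show the postgresql custom resource and its status column
kubectl get postgresql

# List the database pods created by the operator
kubectl get pods -l application=spilo
```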
This completes the deployment of the primary cluster.
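If `enableLogicalBackup: true` is set in the cluster manifest, the operator also creates a CronJob from the schedule configured in Step 3. A quick way to check it, and to trigger an ad-hoc backup, is sketched below; the CronJob name combines the configured `logical-backup-` prefix with your cluster name, so adjust the example name to match yours:

```shell
# List the backup CronJobs created by the operator
kubectl get cronjobs

# Trigger a one-off logical backup outside the schedule
# (replace the CronJob name with the one from the listing above)
kubectl create job --from=cronjob/logical-backup-acid-minimal-cluster manual-backup-1
```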
Setting up the DR cluster on a different AWS account
Follow these steps to deploy the standby cluster on the remote account (Account B):
Step 1. On the base account (Account A), create a programmatic IAM user with access to the S3 bucket that was created while deploying the operator.
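As a sketch of this step, the programmatic user can be created with the AWS CLI. The user and policy names below are hypothetical, and `BUCKET_NAME` stands for the bucket configured in Step 3 of the operator setup. The standby only needs to read WAL segments, so read/list permissions are sufficient:

```shell
# Create the programmatic user on Account A (names are examples)
aws iam create-user --user-name postgres-standby-reader

# Grant read-only access to the WAL bucket
aws iam put-user-policy \
  --user-name postgres-standby-reader \
  --policy-name wal-bucket-read \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::BUCKET_NAME",
        "arn:aws:s3:::BUCKET_NAME/*"
      ]
    }]
  }'

# Generate the access key pair used in the Kubernetes secret in Step 2
aws iam create-access-key --user-name postgres-standby-reader
```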
Step 2. Create a Kubernetes secret on the remote account (Account B):
apiVersion: v1
kind: Secret
metadata:
  name: postgres-s3-wal-secrets
  namespace: core-db
stringData:
  STANDBY_AWS_ACCESS_KEY_ID: KEY
  STANDBY_AWS_REGION: REGION
  STANDBY_AWS_SECRET_ACCESS_KEY: SECRET
  USE_WALG_RESTORE: "false"
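Note that Kubernetes only accepts base64-encoded values under a Secret's `data` field; plain values go under `stringData`. If you want to fill in `data` by hand, the helper below shows how to encode the placeholder values before pasting them into the manifest:

```shell
# Values under a Secret's `data:` key must be base64-encoded.
# Encode each placeholder value before pasting it into the manifest:
encode() { printf '%s' "$1" | base64; }

echo "STANDBY_AWS_ACCESS_KEY_ID: $(encode KEY)"
echo "STANDBY_AWS_REGION: $(encode REGION)"
echo "STANDBY_AWS_SECRET_ACCESS_KEY: $(encode SECRET)"
echo "USE_WALG_RESTORE: $(encode false)"
```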
Step 3. In the operator values file, reference the secret name in the "configKubernetes" section:
configKubernetes:
  pod_environment_secret: "postgres-s3-wal-secrets"
Step 4. Prepare a standby cluster YAML file and specify the location of the S3 WAL files from the primary account (Account A):
kind: "postgresql"
apiVersion: "acid.zalan.do/v1"
metadata:
  name: acid-timescale-hh-sg-demo
  namespace: "core-db"
  labels:
    team: acid
spec:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: NodeGroupType
              operator: In
              values:
                - database
  teamId: "acid"
  postgresql:
    version: "12"
    parameters:
      checkpoint_completion_target: '0.9'
      cron.database_name: apiendpoint
      default_statistics_target: '100'
      effective_cache_size: 6GB
      effective_io_concurrency: '200'
      log_connections: 'on'
      log_disconnections: 'on'
      log_duration: 'on'
      log_error_verbosity: VERBOSE
      log_line_prefix: '%m user=%u db=%d pid=%p '
      log_lock_waits: 'ON'
      log_min_error_statement: WARNING
      log_min_messages: WARNING
      log_statement: all
      logging_collector: 'on'
      maintenance_work_mem: 1GB
      max_connections: '1500'
      max_locks_per_transaction: '10000'
      max_parallel_maintenance_workers: '4'
      max_parallel_workers: '8'
      max_parallel_workers_per_gather: '4'
      max_worker_processes: '8'
      min_wal_size: 4GB
      max_wal_size: 16GB
      pg_stat_statements.max: '10000'
      random_page_cost: '1.1'
      shared_buffers: 2GB
      shared_preload_libraries: timescaledb, pg_stat_statements, pg_cron, wal2json
      tcp_keepalives_count: '9'
      tcp_keepalives_idle: '240'
      tcp_keepalives_interval: '30'
      track_activity_query_size: '2048'
      wal_buffers: 16MB
      wal_level: logical
      work_mem: 174kB
  numberOfInstances: 1
  volume:
    size: "600Gi"
    storageClass: "encrypted-gp3"
  patroni:
    pg_hba:
      - local all all trust
      - hostssl all all 0.0.0.0/0 md5
      - host all all 0.0.0.0/0 md5
      - host replication standby 10.0.0.0/8 trust
      - host replication all 127.0.0.1/32 trust
      - host replication all ::1/128 trust
  serviceAnnotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
  allowedSourceRanges:
    # load balancers' source ranges for both master and replica services
    - 127.0.0.1/32
    - 126.96.36.199/16  # IP ranges to access your cluster go here
  resources:
    requests:
      cpu: 6000m
      memory: 6Gi
    limits:
      cpu: 8000m
      memory: 8Gi
  standby:
    s3_wal_path: "S3_Bucket_wal_file_path"
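The standby manifest can then be applied on Account B the same way as the primary cluster; the filename used here is an assumption:

```shell
# Apply the standby cluster manifest on Account B
kubectl create -f standby-postgres-manifest.yaml
```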
Step 5. Once deployed, the standby cluster runs in read-only mode; writes are only possible after the database is promoted.
This is how to promote the DB:
Log in to the Postgres standby pod using kubectl exec:
kubectl exec -it POD_NAME -- bash

# Run the following command inside the pod:
patronictl edit-config

# Then delete the standby_cluster block from the configuration:
standby_cluster:
  create_replica_methods:
    - bootstrap_standby_with_wale
    - basebackup_fast_xlog
  restore_command: envdir "/run/etc/wal-e.d/env-standby" /scripts/restore_command.sh "%f" "%p"
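After saving the edited configuration, Patroni promotes the cluster. A quick way to verify, still inside the pod, is to check the member roles and confirm Postgres has left recovery mode (`pg_is_in_recovery()` returns `f` on a writable primary):

```shell
# The member list should now show a Leader rather than a Standby Leader
patronictl list

# Confirm the database accepts writes
psql -U postgres -c "SELECT pg_is_in_recovery();"
```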
This is how you can set up a time series database (TSDB) cluster on Kubernetes.
Try it out yourself
The Zalando Postgres Operator offers a powerful toolset for achieving disaster recovery and high availability in a Kubernetes environment. By implementing a multi-cluster architecture, organizations can ensure that their critical databases remain operational in the event of a disaster or other unexpected event. With these tools, IT teams can confidently deploy mission-critical applications on AWS EKS, knowing they have the flexibility and resilience to withstand unexpected disruptions. By leveraging the latest advances in database technology and cloud infrastructure, organizations can stay one step ahead of potential disasters and maintain high levels of performance and reliability for their applications.