What is a disaster recovery plan (DRP)?
A disaster recovery plan (DRP) is a crucial part of any organization's business continuity plan (BCP). It is a documented, structured approach that outlines how an organization can recover from an unplanned incident and resume operations quickly. The plan typically includes regular data backups, redundant systems, and offsite data storage, and is designed to minimize the impact of a disaster and restore normal operations as soon as possible. A DRP is especially important for organizations that rely heavily on their IT infrastructure, since it focuses on recovering data and system functionality after a disaster.
Recovery plan considerations
When an unexpected disaster occurs, it is essential to have a well-thought-out recovery plan in place. The recovery plan should first consider which applications are most crucial to the organization's functioning at a business level. This information can be used to determine the Recovery Time Objective (RTO), which outlines how long the organization can afford for critical applications to be down without causing significant harm. Typically, the RTO is measured in hours, minutes, or seconds. The recovery plan should also consider the Recovery Point Objective (RPO), which refers to the age of the files that must be recovered from data backup storage to enable normal operations to resume. In other words, the RPO specifies when data must be restored to ensure that the organization can pick up where it left off before the disaster struck.
TimescaleDB disaster recovery (DR) on AWS EKS
The Postgres Operator simplifies setting up and managing highly available PostgreSQL clusters on Kubernetes. The clusters are powered by Patroni, a tool that keeps the database available in case of hardware or network failures. The Postgres Operator is designed to be easy to configure and to integrate into automated CI/CD pipelines, which streamline the process of deploying and managing applications. It does this through Postgres manifests, which are custom resources that can be modified and updated without directly accessing the Kubernetes API. This approach promotes infrastructure as code, favoring automated processes over manual operations for managing infrastructure.
How to set up a time series database (TSDB) cluster on Kubernetes:
Step 1. Add the Helm repository for the Postgres Operator:
# add repo for postgres-operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
Step 2. Clone the repo:
git clone https://github.com/zalando/postgres-operator.git
Step 3. Configure the logical backup in the value file (postgres-operator/charts/postgres-operator/values.yaml):
configLogicalBackup:
  logical_backup_docker_image: "registry.opensource.zalan.do/acid/logical-backup:v1.8.0"
  logical_backup_job_prefix: "logical-backup-"
  # storage provider - either "s3" or "gcs"
  logical_backup_provider: "s3"
  # S3 Access Key ID
  logical_backup_s3_access_key_id: ""
  # S3 bucket to store backup results
  logical_backup_s3_bucket: "BUCKET_NAME"
  # S3 region of bucket
  logical_backup_s3_region: "REGION"
  # S3 endpoint url when not using AWS
  logical_backup_s3_endpoint: ""
  # S3 Secret Access Key
  logical_backup_s3_secret_access_key: ""
  # S3 server side encryption
  logical_backup_s3_sse: "AES256"
  # S3 retention time for stored backups, for example "2 week" or "7 days"
  logical_backup_s3_retention_time: ""
  # backup schedule in the cron format
  logical_backup_schedule: "00 2 * * *"
Step 4. Install the postgres-operator:
helm install postgres-operator postgres-operator-charts/postgres-operator
Step 5. Check whether the Postgres Operator is running. Starting the operator may take a few seconds; make sure the operator pod is running before applying a Postgres cluster manifest.
kubectl get pod -l app.kubernetes.io/name=postgres-operator
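If you prefer to block until the operator is ready rather than polling manually, `kubectl wait` can be used with the same label selector as above; the two-minute timeout here is an arbitrary choice:

```shell
# Block until the operator pod reports Ready, or fail after two minutes
kubectl wait pod \
  -l app.kubernetes.io/name=postgres-operator \
  --for=condition=Ready \
  --timeout=120s
```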
Step 6. Deploy the Postgres cluster. Once the operator pod is running, apply the cluster manifest:
kubectl create -f manifests/complete-postgres-manifest.yaml
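To confirm the cluster came up, you can list the `postgresql` custom resource and its pods; the `application=spilo` label is set by the operator on all database pods it creates:

```shell
# Show the postgresql custom resource and its status column
kubectl get postgresql

# List the database pods created by the operator
kubectl get pods -l application=spilo
```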
This completes the deployment of the primary cluster.
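If `enableLogicalBackup: true` is set in the cluster manifest, the operator also creates a CronJob from the schedule configured in Step 3. A quick way to check it, and to trigger an ad-hoc backup, is sketched below; the CronJob name combines the configured `logical-backup-` prefix with your cluster name, so adjust the example name to match yours:

```shell
# List the backup CronJobs created by the operator
kubectl get cronjobs

# Trigger a one-off logical backup outside the schedule
# (replace the CronJob name with the one from the listing above)
kubectl create job --from=cronjob/logical-backup-acid-minimal-cluster manual-backup-1
```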
Setting up the DR cluster on a different AWS account
Follow these steps to deploy the standby cluster on the remote account (Account B):
Step 1. On the base account (Account A), create a programmatic IAM user with access to the S3 bucket that was created while deploying the operator.
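As a sketch of this step, the programmatic user can be created with the AWS CLI. The user and policy names below are hypothetical, and `BUCKET_NAME` stands for the bucket configured in Step 3 of the operator setup. The standby only needs to read WAL segments, so read/list permissions are sufficient:

```shell
# Create the programmatic user on Account A (names are examples)
aws iam create-user --user-name postgres-standby-reader

# Grant read-only access to the WAL bucket
aws iam put-user-policy \
  --user-name postgres-standby-reader \
  --policy-name wal-bucket-read \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::BUCKET_NAME",
        "arn:aws:s3:::BUCKET_NAME/*"
      ]
    }]
  }'

# Generate the access key pair used in the Kubernetes secret in Step 2
aws iam create-access-key --user-name postgres-standby-reader
```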
Step 2. Create a Kubernetes secret on the remote account (Account B):
apiVersion: v1
kind: Secret
metadata:
  name: postgres-s3-wal-secrets
  namespace: core-db
stringData:
  STANDBY_AWS_ACCESS_KEY_ID: KEY
  STANDBY_AWS_REGION: REGION
  STANDBY_AWS_SECRET_ACCESS_KEY: SECRET
  USE_WALG_RESTORE: "false"
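Note that Kubernetes only accepts base64-encoded values under a Secret's `data` field; plain values go under `stringData`. If you want to fill in `data` by hand, the helper below shows how to encode the placeholder values before pasting them into the manifest:

```shell
# Values under a Secret's `data:` key must be base64-encoded.
# Encode each placeholder value before pasting it into the manifest:
encode() { printf '%s' "$1" | base64; }

echo "STANDBY_AWS_ACCESS_KEY_ID: $(encode KEY)"
echo "STANDBY_AWS_REGION: $(encode REGION)"
echo "STANDBY_AWS_SECRET_ACCESS_KEY: $(encode SECRET)"
echo "USE_WALG_RESTORE: $(encode false)"
```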
Step 3. In the operator values file, reference the secret name in the "configKubernetes" section:
configKubernetes:
  pod_environment_secret: "postgres-s3-wal-secrets"
Step 4. Prepare a standby cluster YAML file and specify the location of the S3 WAL files from the primary account (Account A):
kind: "postgresql"
apiVersion: "acid.zalan.do/v1"
metadata:
  name: acid-timescale-hh-sg-demo
  namespace: "core-db"
  labels:
    team: acid
spec:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: NodeGroupType
              operator: In
              values:
                - database
  teamId: "acid"
  postgresql:
    version: "12"
    parameters:
      checkpoint_completion_target: '0.9'
      cron.database_name: apiendpoint
      default_statistics_target: '100'
      effective_cache_size: 6GB
      effective_io_concurrency: '200'
      log_connections: 'on'
      log_disconnections: 'on'
      log_duration: 'on'
      log_error_verbosity: VERBOSE
      log_line_prefix: '%m user=%u db=%d pid=%p '
      log_lock_waits: 'ON'
      log_min_error_statement: WARNING
      log_min_messages: WARNING
      log_statement: all
      logging_collector: 'on'
      maintenance_work_mem: 1GB
      max_connections: '1500'
      max_locks_per_transaction: '10000'
      max_parallel_maintenance_workers: '4'
      max_parallel_workers: '8'
      max_parallel_workers_per_gather: '4'
      max_worker_processes: '8'
      min_wal_size: 4GB
      max_wal_size: 16GB
      pg_stat_statements.max: '10000'
      random_page_cost: '1.1'
      shared_buffers: 2GB
      shared_preload_libraries: timescaledb, pg_stat_statements, pg_cron, wal2json
      tcp_keepalives_count: '9'
      tcp_keepalives_idle: '240'
      tcp_keepalives_interval: '30'
      track_activity_query_size: '2048'
      wal_buffers: 16MB
      wal_level: logical
      work_mem: 174kB
  numberOfInstances: 1
  volume:
    size: "600Gi"
    storageClass: "encrypted-gp3"
  patroni:
    pg_hba:
      - local all all trust
      - hostssl all all 0.0.0.0/0 md5
      - host all all 0.0.0.0/0 md5
      - host replication standby 10.0.0.0/8 trust
      - host replication all 127.0.0.1/32 trust
      - host replication all ::1/128 trust
  serviceAnnotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 'true'
  allowedSourceRanges:
    # load balancers' source ranges for both master and replica services
    - 127.0.0.1/32
    - 126.96.36.199/16  # IP ranges to access your cluster go here
  resources:
    requests:
      cpu: 6000m
      memory: 6Gi
    limits:
      cpu: 8000m
      memory: 8Gi
  standby:
    s3_wal_path: "S3_Bucket_wal_file_path"
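The standby manifest can then be applied on Account B the same way as the primary cluster; the filename used here is an assumption:

```shell
# Apply the standby cluster manifest on Account B
kubectl create -f standby-postgres-manifest.yaml
```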
Step 5. Once deployed, the standby cluster runs in read-only mode; writes are only possible after the database is promoted.
This is how to promote the DB:
Log in to the Postgres standby pod using kubectl exec:
kubectl exec -it POD_NAME -- bash

# Run the following command inside the pod:
patronictl edit-config

# Then delete the standby_cluster block from the configuration:
standby_cluster:
  create_replica_methods:
    - bootstrap_standby_with_wale
    - basebackup_fast_xlog
  restore_command: envdir "/run/etc/wal-e.d/env-standby" /scripts/restore_command.sh "%f" "%p"
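After saving the edited configuration, Patroni promotes the cluster. A quick way to verify, still inside the pod, is to check the member roles and confirm Postgres has left recovery mode (`pg_is_in_recovery()` returns `f` on a writable primary):

```shell
# The member list should now show a Leader rather than a Standby Leader
patronictl list

# Confirm the database accepts writes
psql -U postgres -c "SELECT pg_is_in_recovery();"
```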
This is how you can set up a time series database (TSDB) cluster on Kubernetes.
Try it out yourself
The Zalando Postgres Operator offers a powerful toolset for achieving disaster recovery and high availability in a Kubernetes environment. By implementing a multi-cluster architecture, organizations can ensure that their critical databases remain operational in the event of a disaster or other unexpected event. With these tools, IT teams can confidently deploy mission-critical applications on AWS EKS, knowing they have the flexibility and resilience to withstand unexpected disruptions. By leveraging the latest advances in database technology and cloud infrastructure, organizations can stay one step ahead of potential disasters and maintain high levels of performance and reliability for their applications.