Managing Terraform/OpenTofu State: Practical Tips

Background

These are notes from an investigation I did at one of our clients, covering issues with Terraform/OpenTofu state on AWS. Three main problems: a stuck state lock, state out of sync with what is actually in AWS, and rolling upgrades without downtime.

1. State Lock Won't Release

The error that shows up:

Error acquiring the state lock
Error message: ConditionalCheckFailedException

State is stored in S3, and DynamoDB handles locking. If a previous process crashed without releasing the lock, we have to force unlock.

Steps:

Check whether a tofu process is still running: ps aux | grep tofu
If there is none, force unlock: tofu force-unlock <LOCK_ID>
Make absolutely sure no other process is running before you force unlock, including runs on CI or on a colleague's machine. Otherwise the state can get corrupted
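
The steps above can be sketched as a small shell function. This is only a sketch under assumptions: safe_force_unlock is a hypothetical helper name, and it assumes pgrep is available. It can only see processes on the local machine, so remote runs still have to be ruled out by hand.

```shell
# Hypothetical helper: refuse to force-unlock while a local tofu/terraform
# process is still running. Runs on CI or other machines are NOT detected.
safe_force_unlock() {
  lock_id=$1
  if [ -z "$lock_id" ]; then
    echo "usage: safe_force_unlock <LOCK_ID>" >&2
    return 2
  fi
  for name in tofu terraform; do
    if pgrep -x "$name" >/dev/null 2>&1; then
      echo "refusing: a local $name process is still running" >&2
      return 1
    fi
  done
  # tofu asks for interactive confirmation before it releases the lock.
  tofu force-unlock "$lock_id"
}
```

Usage: safe_force_unlock <LOCK_ID>, with the lock ID copied from the error output.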

2. State Out of Sync With AWS

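The config in code says c6i.xlarge, the state file has c6i.large, and the real instance in AWS is already c6i.xlarge. Even so, terraform plan shows no changes: plan refreshes in memory by default, sees xlarge in AWS, and that matches the config. It does not write the refreshed value back, so the stored state still says c6i.large.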

Common causes:

The instance was changed manually in the AWS Console
A previous apply succeeded in AWS but the state update failed (network drop or crash)
The state file was restored from an old backup

Fix: run tofu refresh with the CLIENTSUP profile. You can target just a specific module if you want. Then verify with:

tofu state show module.stg_jira.aws_instance.app_ec2_cluster[0]
aws ec2 describe-instances --filters "Name=tag:Name,Values=client-stg-jira*"

Compare the instance type in the two outputs. Once they match, run plan again.
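
The comparison can be scripted roughly as follows. This is a sketch, not a definitive check: the function names are hypothetical, and it assumes the tofu state show output contains a line of the form instance_type = "..." and that the Name tag filter matches only the instances in question.

```shell
# Read `tofu state show` text on stdin and print the first instance_type value.
# Assumption: the output contains a line like:  instance_type = "c6i.large"
state_instance_type() {
  awk -F'"' '/instance_type *=/ { print $2; exit }'
}

# Print the instance types AWS reports for instances whose Name tag matches $1.
aws_instance_types() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=$1" \
    --query 'Reservations[].Instances[].InstanceType' \
    --output text
}

# Compare the stored state against AWS; returns non-zero on any mismatch.
check_drift() {
  addr=$1
  name_filter=$2
  state_type=$(tofu state show "$addr" | state_instance_type)
  aws_types=$(aws_instance_types "$name_filter")
  if [ -z "$state_type" ] || [ -z "$aws_types" ]; then
    echo "could not read instance type from state or AWS" >&2
    return 2
  fi
  for t in $aws_types; do
    if [ "$t" != "$state_type" ]; then
      echo "DRIFT: state=$state_type aws=$t"
      return 1
    fi
  done
  echo "OK: state and AWS agree on $state_type"
}
```

Usage: check_drift 'module.stg_jira.aws_instance.app_ec2_cluster[0]' 'client-stg-jira*'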

3. Rolling Upgrade Without Downtime

The goal is to upgrade the instance type for prod_jira, prod_conf, stg_jira, etc. without downtime. The ALB and the node count are the key.

Three approaches:

A) Target one module at a time

Upgrade module.prod_jira first, wait until all its instances are healthy in the ALB, and only then proceed to module.prod_conf.

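B) Scale up first, then scale down

Raise the prod_jira node count from 3 to 6, apply, wait for the new nodes to become healthy, then bring it back down to 3. Check the plan before scaling down: if the module uses count (as the [0] index in app_ec2_cluster[0] suggests), lowering the count destroys the highest indexes, which are the nodes just added rather than the old ones.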

C) Taint one at a time

Run terraform taint on one instance, apply for that instance only, wait until it is healthy, then repeat for the next instance.

The principle is the same in all three: keep a minimum number of healthy instances behind the ALB before destroying the old ones. Monitor the ALB health checks and CloudWatch throughout the process.
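
Approach C could be scripted along these lines. This is a rough sketch, not the procedure used at the client. It swaps taint plus apply for tofu apply -replace, which does the same thing in one step. rolling_replace and the target group ARN argument are hypothetical, and the resource name aws_instance.app_ec2_cluster is assumed to be the same in every module. If target group attachments are separate resources, -target on the instance alone will not update them, so adjust the targets accordingly.

```shell
# Hypothetical helper: replace instances one by one and wait for the ALB
# target group to report healthy targets between replacements.
#   $1 = module address, e.g. module.stg_jira
#   $2 = number of instances in the module
#   $3 = target group ARN (assumption: one target group per module)
rolling_replace() {
  module=$1
  count=$2
  tg_arn=$3
  i=0
  while [ "$i" -lt "$count" ]; do
    addr="${module}.aws_instance.app_ec2_cluster[$i]"
    # Replace exactly this instance; tofu prompts for approval each time.
    tofu apply -replace="$addr" -target="$addr" || return 1
    # Block until the registered targets are in service. Note: this passes
    # early if the new instance has not been registered yet.
    aws elbv2 wait target-in-service --target-group-arn "$tg_arn" || return 1
    i=$((i + 1))
  done
}
```

Stopping at the first failure is deliberate: a failed apply or health wait leaves the remaining old instances untouched and still serving traffic.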

Daily Operations

Refresh state: tofu refresh
Plan with a detailed exit code: tofu plan -detailed-exitcode
Apply to a target module: tofu apply -target=module.stg_jira
Back up state before big changes: tofu state pull > backup.tfstate
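
With -detailed-exitcode, plan exits 0 when there are no changes, 1 on error, and 2 when changes are pending. A minimal sketch of using that in a script follows; plan_status is a hypothetical helper name.

```shell
# Hypothetical helper: run a plan and print a one-word status.
# Extra arguments are passed through, e.g. -target=module.stg_jira
plan_status() {
  rc=0
  # Plan output is discarded here; only the exit code is inspected.
  tofu plan -detailed-exitcode "$@" >/dev/null || rc=$?
  case $rc in
    0) echo "no-changes" ;;
    2) echo "changes-pending" ;;
    *) echo "error"; return 1 ;;
  esac
}
```

Usage: plan_status -target=module.stg_jira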

State bucket: mycompany-terraform-state, path: managed-services/customers/clientname/client_infra-123456789012.tfstate. DynamoDB is used for locking. Profile: CLIENTSUP via aws-vault.
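
One way to avoid forgetting the profile is a small wrapper around aws-vault exec. This is a sketch only; tofu_as_client is a hypothetical name, and it assumes the CLIENTSUP profile is already configured in aws-vault.

```shell
# Hypothetical wrapper: run any tofu subcommand with CLIENTSUP credentials.
tofu_as_client() {
  aws-vault exec CLIENTSUP -- tofu "$@"
}
```

Usage: tofu_as_client plan -detailed-exitcode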

Module Structure

Split by environment: prod.tf, stag.tf, dr.tf. Each one calls modules such as prod_jira, prod_conf, prod_sync.

Instance size is controlled in terraform.auto.tfvars:

prod_app_node_instance_size = "c6i.2xlarge"
stg_app_node_instance_size = "c6i.xlarge"
stg_db_node_instance_size = "c6i.large"

Change the values here, then apply using one of the rolling strategies above.