Cloud Infrastructure & Platform Engineering

Two Platforms, One Infrastructure Codebase

The infrastructure work spans two distinct platforms, both fully managed with Terraform and deployed on AWS. The first is a Kubernetes-native microservice platform — the Microservice stack — running on EKS with a full GitOps deployment model. The second is a Saleor-based e-commerce checkout platform running on ECS Fargate, with CloudFront CDN, multi-service Redis, and a RDS PostgreSQL backend shared across three application tiers.

Every AWS resource across both platforms — networking, compute, databases, caching, CDN, secrets, IAM, DNS — is declared in Terraform using a reusable module library. Environments (dev, stage, prod) are separate root configurations that share modules but define their own variable bindings. State lives in S3 with environment-isolated prefixes. There is no console-clicking in production.

AWS VPC architecture with public/private subnets, EKS cluster, RDS, ElastiCache, and VPC endpoints

Full platform architecture — VPC layout across two AZs, EKS node groups, RDS PostgreSQL, ElastiCache Redis, and VPC endpoints for S3 and ECR

2

Production Platforms

EKS-based microservice platform and ECS Fargate-based checkout platform — each with independent staging environments.

20+

Terraform Modules

Reusable modules for VPC, EKS, ECS, RDS, Redis, ALB, CloudFront, IAM, SOPS, bastion, and Route53.

3

AZs per Environment

All workloads span us-east-1a/b/c with private/public subnet separation, NAT gateways, and VPC endpoints for S3 and ECR.

9+

ECR Repositories

Container images for payflow, brandkit, prospector, listing-sync, saleor, fundgate, mailer, discuss, and more.

Terraform Module Library

Rather than writing environment-specific code, all infrastructure is composed from typed, parameterized modules. The module library covers every layer of the stack:

vpc/

CIDR allocation, DNS hostnames, DNS resolution, resource tagging

subnet/public + private

AZ-aware subnet provisioning with Kubernetes cluster discovery tags (kubernetes.io/role/elb)

nat/

Elastic IP allocation, NAT gateway, private route table association

igw/

Internet gateway, public route table with 0.0.0.0/0 and ::/0 egress

eks/

EKS cluster v1.32, OIDC provider, EBS CSI driver addon, cluster role binding

eks/ng

Node group with instance type, min/max/desired capacity, and Kubernetes labels

rds/postgres

PostgreSQL RDS with parameter groups, subnet groups, SG, backup windows, performance insights

redis/

ElastiCache cluster with engine version, node type, subnet group, and SG rules

rabbitmq/

AWS MQ broker (mq.m7g.medium), RabbitMQ 3.13, single-AZ, private subnet

ecs_app_lb

ECS service + ALB target group + listener rules + auto-scaling + log group

ecs_task

Task definition factory: CPU, memory, container spec, secrets injection, log driver config

ext_alb

External ALB with dual-stack (IPv4/IPv6), HTTP→HTTPS redirect, ACM cert, SG rules

kms/

Customer-managed KMS key with key policy, alias, and rotation config

secrets/

Batch Secrets Manager secret creation with per-secret IAM policy generation

github-oidc

GitHub Actions OIDC provider + federated IAM role for keyless CI/CD credential vending

slack_alarms

CloudWatch alarm → SNS → Lambda → Slack webhook for operational alerts

waf/

WAF Web ACL with IP-based rule sets (currently provisioned as standby)

bastion/

EC2 bastion in private subnet with SG allowing SSH from VPC CIDR only

VPC Design

Both platforms follow an identical network architecture pattern: a single VPC per environment, three private subnets and three public subnets across us-east-1a/b/c, one NAT gateway per public subnet, and an Internet Gateway for public egress. All workloads — EKS nodes, ECS tasks, RDS instances, ElastiCache clusters — live in private subnets. The ALBs and bastion host are the only resources in public subnets.

EKS PLATFORM (stage)

VPC CIDR 10.20.0.0/16

Private A 10.20.1.0/24 us-east-1a

Private B 10.20.2.0/24 us-east-1b

Private C 10.20.3.0/24 us-east-1c

Public subnets across all 3 AZs

VPC endpoints: S3 (gateway), ECR.dkr (interface)

ECS CHECKOUT (stage)

VPC CIDR 10.50.0.0/16

Private A 10.50.0.0/18 us-east-1b

Private B 10.50.64.0/18 us-east-1c

Private C 10.50.128.0/18 us-east-1a

Public 10.50.232–248.0/21 (AZs a/b/c)

Prod mirrors on 10.51.0.0/16

VPC Endpoints for Cost and Security

EKS nodes and ECS tasks pull container images from ECR and write to S3 in the critical path of every deployment and file upload. Without VPC endpoints, this traffic would exit via NAT gateways, incurring both latency and NAT data-processing costs at scale. Both platforms have S3 gateway endpoints (free, route-table-based) and ECR.dkr interface endpoints (private DNS enabled, eliminating NAT for image pulls entirely).

VPC networking — public/private subnets across two AZs with NAT gateways and VPC endpoints

VPC layout — public subnets (ALB + Bastion EC2), private subnets (EKS pods, RDS, ElastiCache), NAT gateways per AZ, VPC endpoints for S3 and ECR

DNS and TLS

Route53 zones are managed at an account level and referenced via Terraform remote state in environment configs. Each environment creates ALIAS records pointing to its ALB (evaluate-target-health enabled for failover), with ACM certificates provisioned per environment. The checkout platform runs dual-stack (IPv4 + IPv6) ALBs with HTTP-to-HTTPS redirect on port 80.

Resource	Stage	Prod
VPC CIDR (EKS)	10.20.0.0/16	10.21.0.0/16
VPC CIDR (ECS)	10.50.0.0/16	10.51.0.0/16
RDS Instance (EKS)	db.t3.medium	Multiple instances (platform, CS, sandbox)
RDS Instance (ECS)	db.t3.small	db.t3.medium
ElastiCache	cache.t3.micro	cache.t3.small
EKS Nodes	t2.large (min 2, max 10)	t2.large (dedicated node groups)
NAT Gateways	1 per AZ	1 per AZ
WAF	Provisioned (standby)	Active

EKS Platform — Kubernetes Orchestration

The Microservice microservice platform runs on EKS v1.32. Two clusters are provisioned: infra-stage (handles both dev and staging workloads) and infra-prod. Each cluster has managed node groups on t2.large instances with autoscaling (min 2, max 10, desired 2 per group). The EBS CSI driver addon is installed via Terraform for persistent volume support.

Service routing is handled by Traefik IngressRoute CRDs. Kubernetes manifests for each service live in the deploy/ directory, namespaced by service and environment. Services deployed to the platform include: payflow, brandkit, prospector, listing-sync-service, billing-sub, microfund, and standalone.

EKS cluster with Traefik ingress controller routing to payflow, brandkit, and prospector namespaces

EKS cluster — Traefik IngressRoute dispatching to service namespaces, SOPS-encrypted secrets per namespace, OIDC-authenticated GitOps pipeline with ECR image pulls

ECS Fargate — Checkout Platform

The Saleor e-commerce checkout stack runs on ECS Fargate — no EC2 nodes to manage, no AMI patching, and precise per-task CPU/memory billing. The platform runs four primary service groups:

Saleor API 2048 / 4096 MB 8000 Primary e-commerce GraphQL API. Health check on /health/. Execute command enabled for live container debugging. Image tagged with semantic version (3.21.1-stg).

FundGate API Custom 8000 Custom payment pledge API. Integrates Redis (DBs 2–3) as Celery broker and cache. Separate RDS database (app_checkout). Secrets injected via Secrets Manager ARN references.

Storefront Dashboard Custom 8000 Secondary Saleor instance with dedicated database DSN (storefront_dashboard), separate ALB, Route53 record, and CloudFront distribution. Runs on Redis DBs 4–5.

Celery Worker + Beat Custom — Async task execution and scheduled task dispatch. Uses same container image as Saleor API, different entrypoint command. Broker: Redis DB 1; Results: DB 0. S3 media write access via task IAM role.

Mailer Minimal — Email delivery service. Credentials from Secrets Manager. Uses SES SMTP (v4 signature) via IAM access key with ses:SendRawEmail.

CloudFront CDN — Cache Behavior Architecture

Two CloudFront distributions per environment (Saleor + Storefront) sit in front of the ALBs and S3 origins. The cache behavior rules are the most critical part of the CDN config — different paths have radically different caching requirements and the wrong default would either serve stale API responses or miss static asset caching entirely.

Path Pattern	Origin	Cache Policy	Reason
`/graphql/*`	ALB (API)	CachingDisabled	GraphQL mutations and user-specific queries must never be cached
`/static/*`	S3 static bucket	CachingOptimized	Immutable build assets — content-hashed filenames, aggressive cache TTL
`/thumbnail/*`	S3 media bucket	CachingOptimized	Generated thumbnails are deterministic given the same source + dimensions
`/media/*`	S3 media bucket	CachingOptimized	User-uploaded media. OAC enforces S3 access via CloudFront principal only
`/.well-known/jwks.json`	ALB (API)	CachingDisabled	JWT public keys — must be fresh for token verification
`/*` (default)	ALB (API)	CachingOptimized	Remaining paths optimized; HTTP/2+3, Gzip+Brotli compression enabled globally

CloudFront distribution cache behavior routing — path patterns, DISABLED/OPTIMIZED badges, S3 and ALB origins

CloudFront cache behavior routing — per-path policies, S3 static/media origins with OAC, ALB API origin with caching disabled for GraphQL and JWKS

S3 origin access is enforced with Origin Access Control (OAC) using SigV4 signing, replacing the legacy OAI pattern. Bucket policies reject all direct S3 requests and only allow the CloudFront service principal. TLS minimum: TLSv1.2_2021. HTTP versions: HTTP/2 and HTTP/3.

Data Layer

PostgreSQL 16 runs on RDS with parameter groups that enable query logging and set max_connections = 1000. ElastiCache Redis serves multiple logical databases per cluster: DBs 0–5 are partitioned across Saleor cache, Celery broker, FundGate, and Storefront to avoid key namespace collisions while sharing a single node in staging. The EKS platform uses Redis 6.x; the ECS checkout platform uses Redis 7.1 with dual-stack networking. RabbitMQ (AWS MQ, mq.m7g.medium) handles async messaging for event-driven services.

Bastion Host Architecture

Private cluster resources — RDS, Redis, EKS API, RabbitMQ — are not reachable from the public internet. The access pattern uses a bastion host in each environment's public subnet as the sole ingress point. There are two usage modes depending on whether you need full network routing or just proxy access:

MODE 1

sshuttle VPN tunnel (full network routing)

Establishes a transparent L3 VPN over SSH: sshuttle -r deploy@<bastion-ip> -e "ssh -i <key>" 10.0.0.0/8. Routes the entire 10.0.0.0/8 supernet through the bastion — meaning psql, redis-cli, and kubectl all work transparently from your local machine as if you were inside the VPC. This is the standard operational mode for database migrations and cluster maintenance.

MODE 2

SOCKS5 proxy (kubectl traffic)

Creates a SOCKS5 proxy: ssh -D 1080 -N deploy@<bastion-ip> then export HTTPS_PROXY=socks5://127.0.0.1:1080. Lighter-weight option for kubectl-only access without routing the full supernet. Used when only EKS API access is needed.

ACCESS CONTROL

PR-based authorized_keys management

SSH public keys are managed via pull requests to the bastion-access repository. Adding or revoking a developer's access requires a PR review — no direct console access, no shared credentials. Separate bastion instances serve stage and production, with the prod bastion exposed via a DNS alias rather than a raw IP.

Bastion host access diagram — sshuttle VPN tunnel and SOCKS5 proxy modes

Bastion access — sshuttle routing 10.0.0.0/8 through the EC2 bastion for transparent VPC access (top), and SOCKS5 proxy mode for kubectl-only traffic (bottom)

SOPS + KMS Encrypted Secrets

All Kubernetes secrets — database DSNs, API keys, payment credentials, JWT secrets — are stored as SOPS-encrypted ConfigMaps in Git. A customer-managed KMS key provisioned by Terraform handles encryption, with grants managed via IAM. The GitHub Actions deploy workflow decrypts manifests at deploy time using the KMS key vended via OIDC — no plaintext secrets ever appear in the repo or CI logs. The secrets Terraform module provisions Secrets Manager secrets for ECS workloads using the same IAM-controlled access pattern.

GitHub Actions OIDC — Keyless CI/CD

There are no long-lived AWS access keys in GitHub Actions. The github-oidc Terraform module provisions an AWS IAM OIDC identity provider bound to token.actions.githubusercontent.com and creates a federated IAM role that GitHub Actions can assume for specific repositories (acme-platform/*). The role has least-privilege policies: ECR describe/push, EKS describe, KMS decrypt for SOPS. Credentials are valid only for the duration of the workflow run and are scoped to the specific repo and branch.

IAM Principal	Permissions	Scope
GitHub Actions OIDC role	ECR push/describe, EKS describe, KMS decrypt	acme-platform/* repos only
EKS Node role	AmazonEKSWorkerNodePolicy, ECR pull, VPC CNI	EKS worker nodes only
ECS task role (Saleor)	S3 GetObject/PutObject/Delete (media), Secrets Manager GetSecretValue, DynamoDB CRUD (avatax)	Scoped to per-service S3 prefixes and secret ARNs
ECS task role (Celery)	S3 read/write (media bucket), CloudWatch PutLogEvents	Media bucket only
SES SMTP user	ses:SendRawEmail	Mailer service only, via access key in Secrets Manager

EKS RBAC — aws-auth ConfigMap

Kubernetes RBAC is managed via the aws-auth ConfigMap, also version-controlled in the repo. Ten IAM users are mapped to the system:masters group, granting cluster-admin access. Node roles are mapped to system:bootstrappers and system:nodes for kubelet registration. Any RBAC change goes through code review — there is no ad-hoc kubectl edit configmap aws-auth pattern in production.

GitOps Deployment Pipeline

The deployment model is declarative and Git-driven. Kubernetes manifests live in deploy/<service>/<env>/ with a standard structure: namespace.yaml, deployment.yaml, service.yaml, env-config.sops.yaml. On merge to main, GitHub Actions runs kubectl apply against the target cluster — decrypting SOPS secrets at runtime using the OIDC-vended KMS key. There is no manual kubectl apply in the deployment lifecycle.

01

Build and push container image

Docker buildx builds for linux/amd64. Image pushed to ECR with the Git commit SHA as the tag: aws ecr get-login-password | docker login --username AWS --password-stdin <registry>. ECR repos have AES256 encryption at rest.

02

Configure AWS credentials via OIDC

The aws-actions/configure-aws-credentials@v4 action exchanges the GitHub OIDC token for temporary AWS credentials. No secrets stored in GitHub. Credentials scoped to the deploying repo and branch.

03

Decrypt manifests and apply to cluster

SOPS decrypts env-config.sops.yaml using the KMS key. kubectl apply -f deploy/<service>/<env>/ runs against the cluster kubeconfig. Kubernetes rolling update handles zero-downtime deployment — old pods stay up until new pods pass readiness probes.

04

ECS task update (checkout platform)

ECS deployments update the task definition with the new image tag using aws ecs register-task-definition + aws ecs update-service --force-new-deployment. ECS replaces tasks using the minimum-healthy-percent/maximum-percent rolling strategy defined in the service config.

GitOps CI/CD pipeline — GitHub Actions, Docker Buildx, ECR push, SOPS decrypt, EKS kubectl apply, ECS update-service, zero-downtime rolling update

GitOps pipeline — git push triggers Docker Buildx → ECR push in parallel with SOPS KMS decryption, both feeding into EKS kubectl apply and ECS update-service with zero-downtime rolling update

Terraform State and Environment Promotion

Terraform state is stored in S3 (terraform-state-platform bucket, us-east-1) with environment-isolated key prefixes. Each environment (infra-stage/, infra-prod/, dev, stage, prod checkout) has its own state file, preventing cross-environment blast radius from a failed apply. Remote state outputs are consumed across configurations using terraform_remote_state data sources — for example, the checkout environment reads the Route53 zone ID and ACM certificate ARN from the account-level state without duplicating the resource.

What I Would Do Differently

The Terraform state backend has no explicit DynamoDB locking table, which means concurrent terraform apply runs are technically possible. In practice, the team was small enough that this was managed by convention, but at any reasonable team size a DynamoDB lock table should be standard.

Redis logical database partitioning (DB 0–5 on a single node) works in staging but is operationally fragile — a cache flush on one DB index can inadvertently affect another app's broker queue. The correct production pattern is separate ElastiCache clusters per application, or at minimum Redis 7 keyspace isolation with ACLs.

The EKS node group uses t2.large instances. Under sustained CPU credit exhaustion (the t2 burstable model), nodes can degrade unpredictably. m5.large or m6i.large instances with consistent CPU delivery would be more appropriate for production Kubernetes worker nodes handling latency-sensitive workloads.