Durable AI Research: Temporal on Azure Kubernetes Service

๐Ÿ“ Solution design โ€” how it works

This application generates AI research reports as a durable workflow. Even if a pod crashes, a node reboots, or the LLM API times out mid-run, the work resumes exactly where it left off: nothing is lost and nothing is duplicated. This page explains the architecture, the components, and the request lifecycle end to end.

โ˜๏ธ Azure Kubernetes Service โฑ๏ธ Temporal durable execution ๐Ÿ Python ยท Flask ๐Ÿค– OpenAI GPT-4o ๐Ÿ˜ PostgreSQL ๐Ÿ“Š Prometheus ยท Grafana

A single research report requires several steps that each take time and can fail independently: call an LLM (slow, rate-limited, occasionally errors), turn the result into a PDF, persist metadata to a database, and notify the user. On a normal stateless web server, if the process restarts or crashes anywhere in the middle, the whole job is lost: the user gets nothing, and a retry re-runs steps that already succeeded (wasting tokens and money).

Temporal solves this by turning the multi-step job into a workflow whose every step is recorded in an append-only event history. If anything crashes, Temporal replays that history on another worker and continues from the exact point of failure, so steps that already completed are not re-executed. The whole thing runs on a managed Azure Kubernetes Service cluster with redundant, self-healing pods.
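
The replay idea can be illustrated with a toy model in plain Python. This is only a sketch of the principle, not Temporal's implementation: the `history` dict stands in for the event history, and `run_workflow`, `crash_pending` and the step bodies are hypothetical names invented for the illustration.

```python
from typing import Callable

# Toy stand-in for Temporal's event history: step name -> recorded result.
History = dict[str, str]
Step = tuple[str, Callable[[], str]]


def run_workflow(steps: list[Step], history: History) -> list[str]:
    """Run steps in order; skip any step whose result is already recorded."""
    executed: list[str] = []
    for name, step in steps:
        if name in history:
            continue  # replay: reuse the recorded result, do not re-run
        history[name] = step()  # recorded only if the step returns normally
        executed.append(name)
    return executed


calls = {"llm_call": 0, "create_pdf": 0}
crash_pending = [True]  # makes create_pdf fail exactly once


def llm_call() -> str:
    calls["llm_call"] += 1
    return "report text"


def create_pdf() -> str:
    calls["create_pdf"] += 1
    if crash_pending[0]:
        crash_pending[0] = False
        raise RuntimeError("simulated worker crash")
    return "report.pdf"


steps: list[Step] = [
    ("llm_call", llm_call),
    ("create_pdf", create_pdf),
    ("store_report_metadata", lambda: "row stored"),
    ("send_notification", lambda: "sent"),
]

history: History = {}
try:
    run_workflow(steps, history)  # the first worker crashes inside create_pdf
except RuntimeError:
    pass
resumed = run_workflow(steps, history)  # a second worker replays the history
```

In this toy run the LLM step executes once, while the step that was interrupted mid-run (`create_pdf`) executes again on resume. Only steps whose result was already recorded are skipped.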

In one sentence: the app is a Flask front end that hands work to a Temporal workflow, which orchestrates durable activities (LLM → PDF → DB → notify) so the job always finishes exactly once, even across crashes.
flowchart TB
    User(["User / Browser"])
    subgraph Azure["Azure"]
      DNS["Public DNS + TLS<br/>temporal-ai-app...cloudapp.azure.com"]
      subgraph AKS["Azure Kubernetes Service, namespace: temporal-ai"]
        ING["ingress-nginx<br/>+ cert-manager (Let's Encrypt)"]
        API["API pods ×2<br/>Flask · api_server.py<br/>serves UI + REST"]
        WK["Worker pods ×2<br/>worker.py<br/>runs workflows + activities"]
        TS["Temporal server<br/>auto-setup"]
        PG[("PostgreSQL<br/>StatefulSet + PVC")]
        PROM["Prometheus"]
        GRAF["Grafana"]
      end
      ACR["Azure Container Registry"]
    end
    OAI["OpenAI GPT-4o API"]
    User --> DNS --> ING
    ING -->|"/"| API
    ING -->|"/grafana"| GRAF
    API -->|"start / signal / query"| TS
    TS -->|"dispatch tasks"| WK
    WK -->|"LLM activity"| OAI
    WK -->|"store metadata"| PG
    TS --- PG
    WK -.->|"metrics"| PROM
    TS -.->|"metrics"| PROM
    PROM --> GRAF
    ACR -.->|"pull image"| API
    ACR -.->|"pull image"| WK
Solid arrows = request/data flow · dotted arrows = image pulls & metrics scraping
API server (api_server.py · 2 replicas)

Flask app that serves the web UI and the REST API. It does not run business logic itself: it starts Temporal workflows, sends them signals, and reads their state via queries. Stateless, so it scales horizontally.

Worker (worker.py · 2 replicas)

Polls the research task queue and actually executes the workflow code and activities. This is where the LLM call, PDF creation, DB write and notification run. Any worker can pick up any task.

โฑ๏ธ
Temporal servertemporalio/auto-setup

The orchestration engine. Stores each workflow's event history, dispatches tasks to workers, manages durable timers, retries and signals. The brain that makes "exactly-once" possible.

๐Ÿ˜
PostgreSQLStatefulSet + PVC

Persistent store. Backs Temporal's event history and holds the app's own reports metadata table. Data survives pod restarts via an Azure managed-disk volume.

Workflow & activities (workflow.py · activities.py)

The actual job definition. GenerateReportWorkflow orchestrates four activities: llm_call → create_pdf → store_report_metadata → send_notification, each with its own retry policy and timeout.

Prometheus + Grafana (monitoring.yaml)

Prometheus scrapes Temporal & worker metrics; Grafana visualizes throughput, latency and message processing. Embedded directly into the app under /grafana.

  1. Browser → API: The UI POSTs the prompt to /api/workflows/generate-report on a Flask API pod.
  2. API → Temporal, start workflow: The API calls client.start_workflow(GenerateReportWorkflow, ...) on the research task queue and immediately returns a workflow_id. It does not wait for the job to finish.
  3. Temporal → Worker, dispatch: Temporal records "workflow started" in history and hands the first task to a free worker pod.
  4. Activity 1, LLM call (retry ×3): The worker calls OpenAI GPT-4o. If the call times out or errors, Temporal retries automatically with backoff. The successful result is written to history.
  5. Activity 2, create PDF (retry ×2): The model output is rendered to a PDF file.
  6. Activity 3, store metadata (PostgreSQL): Report id, filename, user and prompt are persisted to the reports table.
  7. Activity 4, notify: A completion notification is sent. The workflow records "completed" and returns the final result string.
  8. Browser polls status: Meanwhile, the UI polls /api/workflows/<id>/status and /result until the workflow reports COMPLETED, then shows the report.
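
The polling step at the end of the lifecycle might look roughly like the sketch below. It assumes the status endpoint yields a plain status string. The names `wait_for_workflow`, `fetch_status` and the `"RUNNING"` / `"FAILED"` values are hypothetical, and the HTTP call to /api/workflows/<id>/status is left to the caller so the sketch stays self-contained.

```python
import time
from typing import Callable

# "COMPLETED" comes from the text above; "FAILED" is an assumed status name.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_workflow(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the workflow reaches a terminal status or polls run out."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"workflow still running after {max_polls} polls")


# Demo with a fake status source instead of a real HTTP request.
fake_responses = iter(["RUNNING", "RUNNING", "COMPLETED"])
polls: list[str] = []


def fake_fetch() -> str:
    status = next(fake_responses)
    polls.append(status)
    return status


final_status = wait_for_workflow(fake_fetch, sleep=lambda _s: None)
```

Because the API returns a workflow_id immediately, the client can stop and restart polling at any time without affecting the workflow itself.
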
See it live: the Features playground runs a self-contained version of this, with deliberate failures, so you can watch retries, durable timers, signals and the full event history in real time.
| Capability | What it gives this app |
| --- | --- |
| Event history | Every step is appended to a durable log in PostgreSQL. A crashed workflow is replayed from this log on another worker and continues from the exact failure point; completed steps never re-run. |
| Automatic retries | Each activity has a RetryPolicy. Transient LLM/DB/network failures are retried with exponential backoff without any custom retry code. |
| Durable timers | Sleeps and timeouts survive process restarts: a workflow can wait minutes, hours or days and resume correctly even if every pod was replaced in the meantime. |
| Signals | External events (e.g. a human approval) are delivered durably into a running workflow, which can pause until the signal arrives. |
| Queries | The current state of a running workflow can be read at any time without affecting it. This powers the live progress UI. |
| Exactly-once | The combination above guarantees the end-to-end job completes once and only once, even under crashes, redeploys or node failures. |
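
To make "retried with exponential backoff" concrete, the sketch below computes the wait before each retry. The function name `backoff_delays` is hypothetical, and the default values (1 s initial interval, coefficient 2.0, 100 s cap) are chosen to mirror what Temporal documents as its default retry policy. The values this app actually configures are not shown on this page.

```python
def backoff_delays(
    max_attempts: int,
    initial_s: float = 1.0,
    coefficient: float = 2.0,
    max_interval_s: float = 100.0,
) -> list[float]:
    """Return the wait in seconds before each retry (attempts 2..max_attempts)."""
    return [
        min(initial_s * coefficient**i, max_interval_s)
        for i in range(max_attempts - 1)
    ]


llm_delays = backoff_delays(3)   # LLM call, retry ×3 -> two waits
pdf_delays = backoff_delays(2)   # PDF step, retry ×2 -> one wait
long_delays = backoff_delays(10)  # shows the cap taking effect
```

Under these assumed defaults, three attempts wait 1 s and then 2 s between tries, and the delay stops growing once it reaches the 100 s cap.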

Redundancy & self-healing

  • 2 replicas each for API and worker: one pod can die with no outage.
  • Liveness & readiness probes restart unhealthy pods and keep traffic off pods that aren't ready.
  • PodDisruptionBudgets (minAvailable: 1) keep at least one pod serving during node drains and cluster upgrades.
  • Kubernetes reschedules any failed pod automatically.

Zero-downtime deploys

  • Rolling updates with maxUnavailable: 0, maxSurge: 1, so a new pod becomes healthy before an old one is removed.
  • Workers are stateless against the task queue, so a redeploy never loses in-flight workflows. Temporal simply redispatches their next task to a surviving worker.
  • Right-sized CPU/memory requests so replicas always schedule.
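
The arithmetic behind these settings can be checked with a short sketch. The helper names `rollout_bounds` and `allowed_disruptions` are hypothetical. The formulas follow the usual Kubernetes semantics for rolling updates and PodDisruptionBudgets when the values are given as absolute numbers.

```python
def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    return replicas - max_unavailable, replicas + max_surge


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Return how many pods a PodDisruptionBudget lets a node drain evict at once."""
    return max(healthy_pods - min_available, 0)


# Values from this page: 2 replicas, maxUnavailable: 0, maxSurge: 1, minAvailable: 1.
min_ready, max_total = rollout_bounds(replicas=2, max_unavailable=0, max_surge=1)
evictable_now = allowed_disruptions(healthy_pods=2, min_available=1)
evictable_degraded = allowed_disruptions(healthy_pods=1, min_available=1)
```

With these values a deploy never drops below 2 available pods and briefly runs 3, while a node drain may evict one pod at a time and must wait whenever only one healthy pod remains.
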
Known single points of failure: the cluster currently runs on a single AKS node, and postgres / temporal are single-replica. Scaling the node pool to 2+ nodes is the main step to also survive a node failure. See the Costs page for the trade-offs.

Temporal and the workers expose Prometheus metrics; Prometheus scrapes them and Grafana visualizes throughput, task latency and message-processing rates. The dashboard is embedded straight into the app so you never leave the experience.

  • Live metrics: the home page embeds the Grafana dashboard under /grafana.
  • Workflow history: the playground renders the raw append-only event log of any workflow.
  • Health endpoint: /health backs the Kubernetes probes.
  • TLS everywhere: public traffic is HTTPS, terminated at ingress-nginx with a Let's Encrypt certificate auto-renewed by cert-manager.
  • Secrets, not code: the OpenAI key and DB password live in a Kubernetes Secret (injected as env vars), never baked into the image or source.
  • Config separation: non-secret settings (hosts, ports, model name) live in a ConfigMap.
  • Image supply chain: the app image is built and stored in a private Azure Container Registry and pulled by the pods.
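
On the application side, the Secret and ConfigMap split could be consumed roughly as follows. This is a sketch under assumptions: the variable names (`OPENAI_API_KEY`, `DB_PASSWORD`, `TEMPORAL_HOST`, `MODEL_NAME`), the default values, and the `Settings` / `load_settings` names are all hypothetical, since the page does not show the app's actual configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; secrets are required, config has defaults.
REQUIRED_SECRETS = ("OPENAI_API_KEY", "DB_PASSWORD")


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    db_password: str
    temporal_host: str
    model_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast on missing secrets."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required secrets: {', '.join(missing)}")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        db_password=env["DB_PASSWORD"],
        temporal_host=env.get("TEMPORAL_HOST", "temporal:7233"),  # assumed default
        model_name=env.get("MODEL_NAME", "gpt-4o"),  # assumed default
    )


# Demo with an explicit mapping instead of the real process environment.
demo = load_settings({"OPENAI_API_KEY": "sk-demo", "DB_PASSWORD": "demo-pw"})
```

Failing at startup when a secret is missing makes a misconfigured pod fail early, before it starts taking traffic or tasks.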