diff --git a/docs/encyclopedia/workers/serverless-workers/autoscaling.mdx b/docs/encyclopedia/workers/serverless-workers/autoscaling.mdx
new file mode 100644
index 0000000000..0ebddfb103
--- /dev/null
+++ b/docs/encyclopedia/workers/serverless-workers/autoscaling.mdx
@@ -0,0 +1,125 @@
+---
+id: autoscaling
+title: Serverless Worker autoscaling
+sidebar_label: Autoscaling
+description:
+  How Temporal autoscales Serverless Workers on each compute provider, including the scaling signals, algorithm behavior,
+  and tuning parameters.
+slug: /encyclopedia/workers/serverless-workers/autoscaling
+toc_max_heading_level: 4
+keywords:
+  - serverless
+  - workers
+  - autoscaling
+  - lambda
+  - cloud run
+  - worker controller instance
+tags:
+  - Workers
+  - Concepts
+  - Serverless
+---
+
+:::tip SUPPORT, STABILITY, and DEPENDENCY INFO
+
+Serverless Workers are in [Pre-release](/evaluate/development-production-features/release-stages#pre-release) and available to select Temporal Cloud customers.
+To request access during Pre-release, create a [support ticket](/cloud/support#support-ticket) or contact your account team.
+APIs are experimental and may be subject to backwards-incompatible changes.
+[Sign up for updates](https://temporal.io/pages/serverless-workers-updates) to be notified when Serverless Workers reach Public Preview.
+
+:::
+
+The [Worker Controller Instance (WCI)](/serverless-workers#worker-controller-instance) autoscales Serverless Workers
+using two signals: sync match failure and Task Queue backlog. The autoscaling algorithm differs by compute provider
+because of differences in cold start latency, invocation duration limits, and provider APIs.
+
+## Scaling signals
+
+Both compute providers use the same two signals to drive scaling decisions.
+
+### Sync match failure {#sync-match-failure}
+
+When a Task is submitted, the [Matching Service](/temporal-service/temporal-server#matching-service) attempts to route
+it directly to an available Worker. If no Worker is available, the sync match fails, and the Matching Service pushes a
+signal to the WCI. Because the Matching Service pushes match failures as they happen rather than the WCI polling on a
+timer, scaling is responsive.
+
+### Task Queue backlog {#task-queue-backlog}
+
+The WCI monitors Task Queue metadata to determine whether pending Tasks exist without enough Workers to process them. If
+there are Tasks on the queue and not enough Workers, the WCI scales up.
+
+## AWS Lambda {#aws-lambda}
+
+The Lambda algorithm is event-driven and reactive. Sync match failure is the primary control signal, and backlog aids
+sizing.
+
+When the WCI needs more capacity, it calls the Lambda `InvokeFunction` API to start new Workers. Each call is a discrete
+action ("invoke N more functions"), not a target state. The WCI does not manage a fleet of instances.
+
+### Scale-out
+
+On sync match failure, the WCI invokes new Lambda functions. Because Lambda cold start is sub-second to low
+single-digit seconds, reactive-only control does not create meaningful backlog overshoot. The WCI can scale from zero
+with low latency.
+
+### Scale-in
+
+Scale-in is automatic. Each Lambda invocation runs until the Worker has finished processing available Tasks or
+approaches the 15-minute execution time limit, then shuts down. There is no drain logic or stabilization window. The WCI
+does not need to actively remove capacity.
+
+### Instance model
+
+Each invocation is independent. The Worker starts, creates a fresh client connection, processes multiple Tasks until near
+the execution time limit, and then shuts down gracefully. There is no shared state across invocations.
+
+## GCP Cloud Run {#cloud-run}
+
+The Cloud Run algorithm is a hybrid rate-plus-backlog controller. It extends the base algorithm with a latency-first
+fast-path that reacts to sync match failures.
+
+Unlike Lambda, the WCI outputs a target state ("there should be _c_ instances") rather than discrete invocations. The
+WCI adjusts Cloud Run's instance count through the Cloud Run admin API.
+
+### Scale-out
+
+The algorithm uses four layers to determine the desired instance count:
+
+1. **Feedforward base capacity.** The WCI estimates the required fleet size from the Task arrival rate, divided by per-instance throughput at the target utilization. Feedforward sizing is critical because Cloud Run cold start is approximately 10-30 seconds. Waiting for backlog to signal under-provisioning means new capacity is 10-30 seconds away.
+2. **Backlog-drain correction.** If a backlog exists, the WCI adds instances to drain it within the target queue wait time.
+3. **Warm-reserve headroom.** The WCI maintains extra capacity above the feedforward estimate to absorb sync match failures without triggering cold starts.
+4. **Sync match fast-path.** On any sync match failure, the WCI immediately re-evaluates and scales out if the current fleet is undersized. This event-triggered path bypasses the regular control interval.
+
+The final desired count is the maximum of the reactive and event-driven calculations, clamped to the configured minimum
+and maximum instance counts, and quantized to the scaling granularity.
+
+### Scale-in
+
+Scale-in is conservative to avoid oscillation:
+
+- **Scale-down stabilization window.** After a scale-down decision, the WCI waits (default 300 seconds) before removing instances. If load increases during this window, the scale-down is canceled.
+- **Hold after scale-out.** After scaling out, the WCI holds the new capacity for a minimum period before considering scale-in.
+- **Drain logic.** When removing instances, the WCI drains them over a configurable horizon, allowing in-flight Tasks to complete before the instance is terminated.
+
+### Minimum instances
+
+Setting `c_min >= 1` keeps at least one instance warm at all times. With constant traffic, this behaves like an
+always-on Worker with elastic scale-up and scale-down. Setting `c_min = 0` enables full scale-to-zero but means the
+first Task after an idle period incurs a cold start.
+
+### Tuning parameters
+
+The following parameters control Cloud Run autoscaling behavior. These are starting points for latency-first operation.
+
+| Parameter | Starting value | Description |
+| -------------------------- | --------------------------- | ------------------------------------------------------------------------------------------- |
+| Control interval | 15s | How often the WCI re-evaluates the desired instance count. |
+| Utilization target | 0.70-0.80 | Target per-instance utilization for feedforward sizing. |
+| Queue wait target | 3-5s | Target time a Task should wait in the queue before being picked up. |
+| Drain horizon | 30-60s | How long the WCI allows for in-flight Tasks to complete when removing an instance. |
+| Event cooldown | max(5s, 0.25 x scale-up latency) | Minimum time between event-triggered scale-out evaluations. |
+| Scale-down stabilization | 300s | How long the WCI waits after a scale-down decision before removing instances. |
+| Hold after scale-out | max(60s, 2 x scale-up latency) | Minimum time to hold new capacity before considering scale-in. |
+| Min instances | >= 1 for latency-first | Minimum instance count. Set to 0 for full scale-to-zero. |
+| Scaling granularity | 1 | Minimum step size for scaling changes. |
diff --git a/docs/encyclopedia/workers/serverless-workers.mdx b/docs/encyclopedia/workers/serverless-workers/index.mdx
similarity index 86%
rename from docs/encyclopedia/workers/serverless-workers.mdx
rename to docs/encyclopedia/workers/serverless-workers/index.mdx
index 16dc4a0fec..7df26ffd45 100644
--- a/docs/encyclopedia/workers/serverless-workers.mdx
+++ b/docs/encyclopedia/workers/serverless-workers/index.mdx
@@ -10,6 +10,7 @@ keywords:
   - serverless
   - workers
   - lambda
+  - cloud run
   - compute provider
 tags:
   - Workers
@@ -68,10 +69,9 @@ The Worker Controller Instance (WCI) is a system Workflow that scales Serverless
 One WCI Workflow runs per Worker Deployment Version that has a compute provider configured. The WCI runs in the same
 Namespace as your Worker Deployment.
 
-The WCI responds to two triggers: [sync match failures](#sync-match-failure) and
-[Task Queue backlog](#task-queue-backlog). When either trigger fires, the WCI produces a scaling action, such as
-invoking the configured compute provider (for example, calling AWS Lambda's `InvokeFunction` API) to start new Workers.
-For details on how scaling works, see [Autoscaling](#autoscaling).
+The WCI responds to two triggers: sync match failures and Task Queue backlog. When either trigger fires, the WCI
+produces a scaling action, such as invoking the configured compute provider to start new Workers. For details on how
+scaling works, see [Autoscaling](#autoscaling).
 
 You can list WCI Workflows in your Namespace:
 
@@ -115,28 +115,15 @@ reuse or shared state across invocations.
 
 ## Autoscaling {#autoscaling}
 
-The [WCI](#worker-controller-instance) automatically scales Serverless Workers based on Task Queue signals. When Tasks
-arrive and no Worker is available, the WCI invokes new Workers. When the Tasks are done, Workers exit and scale to zero.
-
-The WCI uses two signals to decide when to invoke new Workers:
-
-### Sync match failure {#sync-match-failure}
-
-When a Task is submitted, the [Matching Service](/temporal-service/temporal-server#matching-service) attempts to route
-it directly to an available Worker. If no Worker is available, the sync match fails, and the Matching Service pushes a
-signal to the WCI. The WCI then invokes a new Worker. This is the primary scaling path. Because the Matching Service
-pushes match failures to the WCI as they happen rather than the WCI polling on a timer, latency stays low and scaling is
-responsive.
-
-### Task Queue backlog {#task-queue-backlog}
-
-The WCI monitors Task Queue metadata to determine whether pending Tasks exist without enough Workers to process them. If
-there are Tasks on the queue and not enough Workers, the WCI invokes additional Workers.
+The [WCI](#worker-controller-instance) automatically scales Serverless Workers based on Task Queue signals. The
+autoscaling algorithm differs by compute provider because of differences in cold start latency, invocation duration
+limits, and provider APIs. For details on how autoscaling works on each platform, see
+[Serverless Worker autoscaling](/encyclopedia/workers/serverless-workers/autoscaling).
 
 ## Scaling with long-lived Workers {#scaling-with-long-lived-workers}
 
 Serverless Workers can share a Task Queue with long-lived Workers. Because Serverless Workers are only invoked on
-[sync match failure](#sync-match-failure), Serverless Workers only pick up Tasks that no long-lived Worker was available
+sync match failure, Serverless Workers only pick up Tasks that no long-lived Worker was available
 to handle. In practice, the Serverless Workers act as spillover capacity for the long-lived fleet.
 
 :::caution
@@ -259,6 +246,7 @@ provider because the Worker process manages its own lifecycle.
 
 ### Supported providers
 
-| Provider | Description |
-| ---------- | ----------------------------------------------------------------------------- |
-| AWS Lambda | Temporal assumes an IAM role in your AWS account to invoke a Lambda function. |
+| Provider | Description |
+| ------------- | ----------------------------------------------------------------------------- |
+| AWS Lambda | Temporal assumes an IAM role in your AWS account to invoke a Lambda function. |
+| GCP Cloud Run | Temporal manages Cloud Run instance scaling through the Cloud Run admin API. |
diff --git a/sidebars.js b/sidebars.js
index 79d3fbd669..72161cc705 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -1534,7 +1534,15 @@ module.exports = {
         'encyclopedia/workers/sticky-execution',
         'encyclopedia/workers/worker-shutdown',
         'encyclopedia/workers/worker-versioning',
-        'encyclopedia/workers/serverless-workers',
+        {
+          type: 'category',
+          label: 'Serverless Workers',
+          collapsed: true,
+          link: { type: 'doc', id: 'encyclopedia/workers/serverless-workers/index' },
+          items: [
+            'encyclopedia/workers/serverless-workers/autoscaling',
+          ],
+        },
       ],
     },
     {