AtMostOncePerRetry vs AtLeastOncePerRetry Semantics in Lambda Durable Function Steps

In a durable Lambda function, an error thrown inside a step is checkpointed and then handled on the next replay according to the step's retry strategy.

However, there can be scenarios where a durable step starts execution but fails to complete, while also leaving no ERROR checkpoint behind. A common example is a Lambda execution timeout.

In such cases, step semantics become important because they define how the workflow should behave when the system cannot determine whether the step completed successfully or not.

There are two types of semantics available:

AtLeastOncePerRetry (Default)

For a step, the SDK initiates a START checkpoint (by calling the internal checkpoint API) and proceeds to code execution without waiting for the checkpoint response.

Now, if Lambda is interrupted mid-execution (timeout, OOM, sandbox crash), the step has no completion checkpoint. On the next replay, the SDK cannot determine that the step completed successfully and runs it again (more on it in the source code exploration section).

This is fine for idempotent operations, where running the same operation twice has no additional effect, such as a database read/upsert or a call that carries an idempotency key.
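To make "idempotency key" concrete, here is a minimal in-memory sketch. The chargeOnce function and the processed map are hypothetical stand-ins for a real payment service and its deduplication store; they are not part of the Durable Execution SDK.

```javascript
// Hypothetical in-memory store of already-processed requests.
const processed = new Map();

// Charges at most once per idempotency key; repeated calls with the same key
// return the original receipt instead of charging again.
function chargeOnce(idempotencyKey, amount) {
  if (processed.has(idempotencyKey)) {
    return processed.get(idempotencyKey);
  }
  const receipt = { id: processed.size + 1, amount };
  processed.set(idempotencyKey, receipt);
  return receipt;
}

// A re-executed step calls the operation a second time with the same key.
const first = chargeOnce('order-42', 100);
const second = chargeOnce('order-42', 100);
console.log(first === second, processed.size);
```

Because the second call is a no-op, re-running such a step under AtLeastOncePerRetry does no harm.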

AtMostOncePerRetry

Here, the SDK waits for the START checkpoint to be persisted before executing the code. If Lambda is interrupted mid-execution, the START checkpoint exists, but the completion checkpoint does not.

On the next replay, the SDK detects the START checkpoint without a matching completion (the step started but never finished) and skips the step. Instead of running it again, the SDK throws StepInterruptedError, which bubbles up and lets the retry strategy decide what happens next.
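The difference between the two semantics on replay can be summarised as a small decision function. This is a toy model, not the SDK's code: the checkpoint shape ({ started, completed, result }) and the name decideOnReplay are assumptions made purely for illustration.

```javascript
// Toy model of what happens when a step is reached during replay.
// checkpoint: { started: boolean, completed: boolean, result?: any }
function decideOnReplay(semantics, checkpoint) {
  if (checkpoint.completed) {
    // A finished step is never re-run; its stored result is reused.
    return { action: 'return-cached-result', result: checkpoint.result };
  }
  if (semantics === 'AtMostOncePerRetry' && checkpoint.started) {
    // Started but never finished: do not run the body again in this attempt.
    return { action: 'throw-StepInterruptedError' };
  }
  // AtLeastOncePerRetry, or a step that never started: run the body.
  return { action: 'run-step' };
}

// State left behind by a Lambda timeout in the middle of the step.
const interrupted = { started: true, completed: false };
console.log(decideOnReplay('AtLeastOncePerRetry', interrupted).action);
console.log(decideOnReplay('AtMostOncePerRetry', interrupted).action);
```

For the interrupted state, the first call reports run-step and the second reports throw-StepInterruptedError.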

Neither guarantees the step runs exactly once across the entire workflow!

At first glance, AtMostOncePerRetry looks perfect for a non-idempotent operation. However, the documentation clearly states that both semantics apply per retry attempt, not across the entire workflow.

This means AtMostOncePerRetry guarantees that the step will run only once within a single retry attempt. But if the retry strategy is configured and StepInterruptedError is retryable, a new attempt begins, and the step runs again in that new attempt.

How to make sure that the step runs exactly once in the entire workflow?

To guarantee the step runs exactly once end-to-end, you must combine AtMostOncePerRetry with a no-retry strategy:

await context.step(
  'charge-payment',
  async () => paymentService.charge(amount, cardToken),
  {
    semantics: StepSemantics.AtMostOncePerRetry,
    retryStrategy: () => ({ shouldRetry: false })
  }
);

Now it is perfectly safe for operations that are not safe to repeat 👏🏼

Experiment Time🔬

Let us understand how these semantics behave with the help of CloudWatch logs.

Start by creating a durable Lambda function with a durable execution timeout of 1 minute. To simulate a timeout, set the Lambda function timeout to 3 seconds. Then, inside a durable step, run an operation that keeps executing for 10 seconds.

This setup will intentionally interrupt the Lambda execution before the step completes, helping us observe how AtLeastOncePerRetry and AtMostOncePerRetry behave during replay.

Code:

import { withDurableExecution, StepSemantics } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(async (event, context) => {

  console.log("Execution started.");
  

  await context.step('Step #1', (stepCtx) => {
    stepCtx.logger.info('Hello from step #1');
  });

  const message = await context.step('Step #2', async () => {
    context.logger.info('inside step 2');
    // Sleep longer than the 3-second Lambda timeout to force an interruption.
    await new Promise(resolve => setTimeout(resolve, 10000)); // 10 seconds
    return 'Hello from Durable Lambda!';
  }, {
    // Uncomment one or both options to reproduce scenarios 2 to 4 below.
    // semantics: StepSemantics.AtMostOncePerRetry,
    // retryStrategy: () => ({ shouldRetry: false })
  });

  const response = {
    statusCode: 200,
    body: JSON.stringify(message),
  };
  return response;
});

Scenario 1: AtLeastOncePerRetry (Default) and shouldRetry as true (default)

If we pass neither semantics nor a retry strategy, the defaults apply. The result is an endless loop (until the durable execution timeout), i.e., the function keeps getting re-invoked for the whole 1 minute.

Reason?
On the first invocation, Lambda times out at the promise line without writing any error checkpoint. On the next invocation, since the semantics is AtLeastOncePerRetry, the step runs again until Lambda times out once more. It keeps repeating on and on and on and on…

Log analysis:

The first column is the Lambda invocation ID, the second is the attempt count for the step, and the third is the log line showing that the step code executed.

[Screenshot: CloudWatch logs for AtLeastOncePerRetry (default) with shouldRetry as true (default)]

Clearly, we can see that across all invocations, the code in step 2 got executed.

Since the Lambda times out before throwing any error, the retry strategy is never invoked, so the attempt counter never increments beyond 1.

A replay is not the same as a retry attempt!
Lambda may invoke the function multiple times during replay while still staying within the same retry attempt. Retry attempts increment only when the retry strategy schedules a new attempt after a checkpointed error.

Scenario 2: AtLeastOncePerRetry (Default) and shouldRetry as false

As in the previous case, no error is ever thrown, so the retry strategy is never invoked. With AtLeastOncePerRetry semantics and no completion checkpoint, the step executes again on every replay.

Clearly, if the semantics is AtLeastOncePerRetry and Lambda is interrupted mid-execution (a timeout in our case) without a proper step end/error checkpoint, the retry strategy is of no use.

shouldRetry: false only works when an error is actually thrown. A Lambda timeout halts the process silently, and no error is thrown, so the retry strategy is never called.

Scenario 3: AtMostOncePerRetry and shouldRetry as true (default)

Things become interesting here! The code of step 2 executes only on every other invocation.

What is happening here?

On the first invocation, the step 2 code executes, but Lambda times out, so there is no FINISH/ERROR checkpoint.
On the second invocation, since the AtMostOncePerRetry semantic is used, step 2 execution is skipped and StepInterruptedError is thrown. This time there is no timeout; instead, a proper error bubbles out.
Now things are under the control of the retry strategy. Since the default retry strategy allows retries, attempt 2 is scheduled. Notice “ATTEMPT” here, that is very important.
On the third invocation, the step 2 code executes again, because AtMostOncePerRetry guarantees at most one execution per ATTEMPT, not across the workflow. Lambda times out, and the process repeats….

Thanks to Kiro for explaining this confusing behaviour to me 😊

Notice that this time the loop is not unbounded. Things are under the control of the retry strategy, because AtMostOncePerRetry surfaced the interruption as a StepInterruptedError.

Scenario 4: AtMostOncePerRetry and shouldRetry as false

Finally, the scenario we’ve all been waiting for!

On the first invocation, Lambda timed out inside step 2.
On the second invocation, StepInterruptedError is thrown, and step 2 is not executed. Since the retry strategy returns shouldRetry: false, no new attempt is scheduled. And that’s how we achieved “exactly once execution of step”.
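The four scenarios can be reproduced without deploying anything by using a small simulation. It is a simplified model built on assumptions: the step body always outlives the Lambda timeout, the retry strategy is reduced to a single boolean, and the durable execution timeout is replaced by a cap on invocations. The simulate function and its trace format are hypothetical, not part of the SDK.

```javascript
// Toy simulation of the experiment: the step can never write a completion
// or error checkpoint because Lambda always times out first.
function simulate(semantics, shouldRetry, maxInvocations) {
  const trace = [];
  let attempt = 1;
  let startCheckpoint = false; // START checkpoint for the current attempt

  for (let invocation = 1; invocation <= maxInvocations; invocation++) {
    if (semantics === 'AtMostOncePerRetry' && startCheckpoint) {
      // Replay sees START without completion: skip the body and raise
      // StepInterruptedError, handing control to the retry strategy.
      trace.push({ invocation, attempt, ranStep: false });
      if (!shouldRetry) break; // the error bubbles out; no new attempt
      attempt += 1;            // a new attempt is scheduled
      startCheckpoint = false; // the new attempt has not started yet
      continue;
    }
    // Run the step body; Lambda times out before it can finish.
    startCheckpoint = true;
    trace.push({ invocation, attempt, ranStep: true });
  }
  return trace;
}

const scenarios = [
  ['AtLeastOncePerRetry', true],
  ['AtLeastOncePerRetry', false],
  ['AtMostOncePerRetry', true],
  ['AtMostOncePerRetry', false],
];
for (const [semantics, shouldRetry] of scenarios) {
  const runs = simulate(semantics, shouldRetry, 4)
    .map((t) => `${t.ranStep ? 'run' : 'skip'}@attempt${t.attempt}`);
  console.log(semantics, `shouldRetry=${shouldRetry}`, runs.join(' '));
}
```

With a cap of four invocations, both AtLeastOncePerRetry rows run the step four times in attempt 1, the third row alternates between running and skipping while the attempt counter climbs, and the last row runs the step once and then stops.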

Let Us Inspect Source Code 🕵🏼

In our observations, we noticed that the retry strategy has no effect in the case of a lambda timeout with AtLeastOncePerRetry. Also, AtLeastOncePerRetry does not wait for the START checkpoint to finish.

But then a question came to my mind… Why does AtLeastOncePerRetry not wait for the START checkpoint?

I got my answer after looking at the code of the Lambda Durable SDK (step handler). There, we can clearly see that the durable SDK checks whether a START checkpoint already exists only for AtMostOncePerRetry. If it does, the retry strategy is consulted. In the case of AtLeastOncePerRetry, no such check is done, and executeStepLogic() is called directly.

So the answer is: AtLeastOncePerRetry does not wait for the START checkpoint because it does not need to!

AtLeastOncePerRetry does not care about the START checkpoint. The only thing it cares about is whether the step is COMPLETED. If not, it runs the step. No retry is involved: the same attempt continues across multiple invocations/replays until the step succeeds at least once. That’s how it got its name, “AtLeastOncePerRetry” 💪🏼

On the other hand, AtMostOncePerRetry gives the vibe of a sincere student. It waits for the START checkpoint to finish because it needs it (like a good student needing notes during exams 😄).

When a Lambda timeout happens before the step completes, AtMostOncePerRetry first checks on the next replay whether the START checkpoint exists. If it does, it then consults the retry strategy:
* If the retry strategy says “don’t retry,” it immediately stops and bubbles out StepInterruptedError.
* If retryDecision.shouldRetry is true, it checkpoints the StepInterruptedError and schedules a retry.
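The two branches above can be sketched as follows. This is an illustration, not the SDK's source: StepInterruptedError is redefined locally so the snippet runs on its own, handleInterruptedStep is a hypothetical helper, and passing (error, attempt) to the retry strategy is an assumption about its arguments.

```javascript
// Local stand-in for the SDK's error type, so the sketch is self-contained.
class StepInterruptedError extends Error {
  constructor(stepName) {
    super(`Step "${stepName}" was interrupted before it completed`);
    this.name = 'StepInterruptedError';
  }
}

// Hypothetical helper: called on replay when a START checkpoint exists but
// no completion checkpoint does (AtMostOncePerRetry only).
function handleInterruptedStep(stepName, retryStrategy, attempt) {
  const error = new StepInterruptedError(stepName);
  const retryDecision = retryStrategy(error, attempt);
  if (!retryDecision.shouldRetry) {
    throw error; // bubbles out of the step; no new attempt is scheduled
  }
  // Record the error and schedule the next attempt instead of re-running now.
  return { checkpointed: error.name, nextAttempt: attempt + 1 };
}

const retried = handleInterruptedStep('charge-payment', () => ({ shouldRetry: true }), 1);
console.log(retried.checkpointed, retried.nextAttempt);

try {
  handleInterruptedStep('charge-payment', () => ({ shouldRetry: false }), 1);
} catch (err) {
  console.log('bubbled out:', err.name);
}
```

In neither branch does the step body run during the replay that detects the interruption, which is where the per-attempt guarantee comes from.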

This is what ensures at-most-once execution per retry attempt.

Conclusion

So in this post, we understood how semantics and retry strategy work in a durable lambda function.

The most important thing to remember is that a replay is not the same as a retry attempt. A Lambda function may replay multiple times within the same attempt if execution gets interrupted before proper checkpointing.

AtLeastOncePerRetry prioritizes progress and eventual completion, making it suitable for idempotent operations.

AtMostOncePerRetry prioritizes avoiding duplicate side effects within a retry attempt, making it useful for non-idempotent operations like payments or notifications. However, by itself, it still does not guarantee exactly-once execution across the entire workflow.

To achieve true end-to-end exactly-once execution, AtMostOncePerRetry must be combined with a no-retry strategy.

Hopefully, this post helped clarify one of the most confusing parts of the Durable Execution SDK 😄