Resiliency policies

Configure resiliency policies for timeouts, retries, and circuit breakers

Define timeouts, retries, and circuit breaker policies under policies. Each policy is given a name so you can refer to it from the targets section in the resiliency spec.

Note: Dapr offers default retries for specific APIs. See Overriding default retries to learn how you can override the default retry logic with user-defined retry policies.

Timeouts

Timeouts are optional policies that can be used to terminate long-running operations early. If an operation exceeds its timeout duration:

  • The operation in progress is terminated (if possible).
  • An error is returned.

Valid values are of the form accepted by Go’s time.ParseDuration, for example: 15s, 2m, 1h30m. Timeouts have no set maximum value.

Example:

```yaml
spec:
  policies:
    # Timeouts are simple named durations.
    timeouts:
      general: 5s
      important: 60s
      largeResponse: 10s
```

If you don’t specify a timeout value, the policy does not enforce a time limit, and the operation falls back to whatever timeout is configured on the request client.

Retries

With retries, you can define a retry strategy for failed operations, including requests failed due to triggering a defined timeout or circuit breaker policy.

Pub/sub component retries vs inbound resiliency

Each pub/sub component has its own built-in retry behaviors. Explicitly applying a Dapr resiliency policy doesn’t override these implicit retry policies. Rather, the resiliency policy augments the built-in retry, which can cause repetitive clustering of messages.

The following retry options are configurable:

| Retry option | Description |
|---|---|
| policy | Determines the back-off and retry interval strategy. Valid values are constant and exponential. Defaults to constant. |
| duration | Determines the time interval between retries. Only applies to the constant policy. Valid values are of the form 200ms, 15s, 2m, etc. Defaults to 5s. |
| maxInterval | Determines the maximum interval between retries to which the exponential back-off policy can grow. Once this cap is reached, additional retries always occur after a duration of maxInterval. Defaults to 60s. Valid values are of the form 5s, 1m, 1m30s, etc. |
| maxRetries | The maximum number of retries to attempt. -1 denotes an unlimited number of retries, while 0 means the request will not be retried (essentially behaving as if the retry policy were not set). Defaults to -1. |

The exponential back-off window uses the following formula:

```
BackOffDuration = PreviousBackOffDuration * (Random value from 0.5 to 1.5) * 1.5
if BackOffDuration > maxInterval {
  BackOffDuration = maxInterval
}
```

Example:

```yaml
spec:
  policies:
    # Retries are named templates for retry configurations and are instantiated for life of the operation.
    retries:
      pubsubRetry:
        policy: constant
        duration: 5s
        maxRetries: 10

      retryForever:
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely
```

Circuit Breakers

Circuit Breaker (CB) policies are used when other applications/services/components are experiencing elevated failure rates. CBs monitor requests and shut off all traffic to the impacted service when certain criteria are met (“open” state). By doing this, CBs give the service time to recover from its outage instead of flooding it with events. The CB can also allow partial traffic through to see if the system has healed (“half-open” state). Once requests succeed again, the CB returns to the “closed” state and allows traffic to resume completely.

| Circuit breaker option | Description |
|---|---|
| maxRequests | The maximum number of requests allowed to pass through when the CB is half-open (recovering from failure). Defaults to 1. |
| interval | The cyclical period of time used by the CB to clear its internal counts. If set to 0 seconds, this never clears. Defaults to 0s. |
| timeout | The period of the open state (directly after failure) until the CB switches to half-open. Defaults to 60s. |
| trip | A Common Expression Language (CEL) statement that is evaluated by the CB. When the statement evaluates to true, the CB trips and becomes open. Defaults to consecutiveFailures > 5. |

Example:

```yaml
spec:
  policies:
    circuitBreakers:
      pubsubCB:
        maxRequests: 1
        interval: 8s
        timeout: 45s
        trip: consecutiveFailures > 8
```

Overriding default retries

Dapr provides default retries for any unsuccessful request, such as failures and transient errors. Within a resiliency spec, you have the option to override Dapr’s default retry logic by defining policies with reserved, named keywords. For example, defining a policy with the name DaprBuiltInServiceRetries overrides the default retries for failures between sidecars via service-to-service requests. Policy overrides are not applied to specific targets.

Note: Although you can override default values with more robust retries, you cannot override them with values lower than the provided defaults, or remove default retries completely. This prevents unexpected downtime.

Below is a table that describes Dapr’s default retries and the policy keywords to override them:

| Capability | Override keyword | Default retry behavior | Description |
|---|---|---|---|
| Service Invocation | DaprBuiltInServiceRetries | Per call retries are performed with a backoff interval of 1 second, up to a threshold of 3 times. | Sidecar-to-sidecar requests (a service invocation method call) that fail and result in a gRPC code Unavailable or Unauthenticated |
| Actors | DaprBuiltInActorRetries | Per call retries are performed with a backoff interval of 1 second, up to a threshold of 3 times. | Sidecar-to-sidecar requests (an actor method call) that fail and result in a gRPC code Unavailable or Unauthenticated |
| Actor Reminders | DaprBuiltInActorReminderRetries | Per call retries are performed with an exponential backoff with an initial interval of 500ms, up to a maximum of 60s for a duration of 15 minutes. | Requests that fail to persist an actor reminder to a state store |
| Initialization Retries | DaprBuiltInInitializationRetries | Per call retries are performed 3 times with an exponential backoff, an initial interval of 500ms and for a duration of 10s. | Failures when making a request to an application to retrieve a given spec. For example, failure to retrieve a subscription, component or resiliency specification |

The resiliency spec example below shows overriding the default retries for all service invocation requests by using the reserved, named keyword ‘DaprBuiltInServiceRetries’.

Also defined is a retry policy called ‘retryForever’ that is only applied to the appB target. appB uses the ‘retryForever’ retry policy, while all other application service invocation retry failures use the overridden ‘DaprBuiltInServiceRetries’ default policy.

```yaml
spec:
  policies:
    retries:
      DaprBuiltInServiceRetries: # Overrides default retry behavior for service-to-service calls
        policy: constant
        duration: 5s
        maxRetries: 10

      retryForever: # A user defined retry policy replaces default retries. Targets rely solely on the applied policy.
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely

  targets:
    apps:
      appB: # app-id of the target service
        retry: retryForever
```

Setting default policies

In a resiliency spec, you can set default policies, which have a broad scope. This is done through reserved keywords that let Dapr know when to apply the policy. There are 3 default policy types:

  • DefaultRetryPolicy
  • DefaultTimeoutPolicy
  • DefaultCircuitBreakerPolicy

If these policies are defined, they are used for every operation to a service, application, or component. They can also be made more specific by appending additional keywords. The specific policies follow the patterns Default%sRetryPolicy, Default%sTimeoutPolicy, and Default%sCircuitBreakerPolicy, where %s is replaced by the target of the policy.
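The naming pattern is a plain string substitution. The hypothetical defaultPolicyName helper below fills the %s slot with a keyword from the table that follows. An empty keyword gives the broad default names listed above.

```go
package main

import "fmt"

// defaultPolicyName builds a default policy name from a keyword such as
// "App" or "StatestoreComponentOutbound" and a policy kind ("Retry",
// "Timeout" or "CircuitBreaker"). Hypothetical helper for illustration.
func defaultPolicyName(keyword, kind string) string {
	return fmt.Sprintf("Default%s%sPolicy", keyword, kind)
}

func main() {
	fmt.Println(defaultPolicyName("", "Retry"))                         // DefaultRetryPolicy
	fmt.Println(defaultPolicyName("App", "Retry"))                      // DefaultAppRetryPolicy
	fmt.Println(defaultPolicyName("PubsubComponentInbound", "Timeout")) // DefaultPubsubComponentInboundTimeoutPolicy
}
```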

Below is a table of all possible default policy keywords and how they translate into a policy name.

| Keyword | Target operation | Example policy name |
|---|---|---|
| App | Service invocation. | DefaultAppRetryPolicy |
| Actor | Actor invocation. | DefaultActorTimeoutPolicy |
| Component | All component operations. | DefaultComponentCircuitBreakerPolicy |
| ComponentInbound | All inbound component operations. | DefaultComponentInboundRetryPolicy |
| ComponentOutbound | All outbound component operations. | DefaultComponentOutboundTimeoutPolicy |
| StatestoreComponentOutbound | All statestore component operations. | DefaultStatestoreComponentOutboundCircuitBreakerPolicy |
| PubsubComponentOutbound | All outbound pubsub (publish) component operations. | DefaultPubsubComponentOutboundRetryPolicy |
| PubsubComponentInbound | All inbound pubsub (subscribe) component operations. | DefaultPubsubComponentInboundTimeoutPolicy |
| BindingComponentOutbound | All outbound binding (invoke) component operations. | DefaultBindingComponentOutboundCircuitBreakerPolicy |
| BindingComponentInbound | All inbound binding (read) component operations. | DefaultBindingComponentInboundRetryPolicy |
| SecretstoreComponentOutbound | All secretstore component operations. | DefaultSecretstoreComponentOutboundTimeoutPolicy |
| ConfigurationComponentOutbound | All configuration component operations. | DefaultConfigurationComponentOutboundCircuitBreakerPolicy |
| LockComponentOutbound | All lock component operations. | DefaultLockComponentOutboundRetryPolicy |

Policy hierarchy resolution

Default policies are applied if the operation being executed matches the policy type and if there is no more specific policy targeting it. For each target type (app, actor, and component), the policy with the highest priority is a Named Policy, one that targets that construct specifically.

If none exists, the policies are applied from most specific to most broad.

How default policies and built-in retries work together

In the case of the built-in retries, default policies do not stop the built-in retry policies from running. Both are used together but only under specific circumstances.

For service and actor invocation, the built-in retries deal specifically with issues connecting to the remote sidecar (when needed). As these are important to the stability of the Dapr runtime, they are not disabled unless a named policy is specifically referenced for an operation. In some instances, there may be additional retries from both the built-in retry and the default retry policy, but this prevents an overly weak default policy from reducing the sidecar’s availability/success rate.

Policy resolution hierarchy for applications, from most specific to most broad:

  1. Named Policies in App Targets
  2. Default App Policies / Built-In Service Retries
  3. Default Policies / Built-In Service Retries

Policy resolution hierarchy for actors, from most specific to most broad:

  1. Named Policies in Actor Targets
  2. Default Actor Policies / Built-In Actor Retries
  3. Default Policies / Built-In Actor Retries

Policy resolution hierarchy for components, from most specific to most broad:

  1. Named Policies in Component Targets
  2. Default Component Type + Component Direction Policies / Built-In Actor Reminder Retries (if applicable)
  3. Default Component Direction Policies / Built-In Actor Reminder Retries (if applicable)
  4. Default Component Policies / Built-In Actor Reminder Retries (if applicable)
  5. Default Policies / Built-In Actor Reminder Retries (if applicable)
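The three hierarchies above amount to a first-match lookup: a named policy on the target wins; otherwise the first defined default policy in the most-specific-to-most-broad chain applies. The sketch below is a hypothetical illustration of that lookup, not Dapr's resolver. It deliberately leaves out the built-in retries, which, as described above, keep running alongside default policies. Its usage reuses the policies from the example later in this section.

```go
package main

import "fmt"

// Candidate chains, most specific first. kind is "Retry", "Timeout" or "CircuitBreaker".
func appCandidates(kind string) []string {
	return []string{"DefaultApp" + kind + "Policy", "Default" + kind + "Policy"}
}

func actorCandidates(kind string) []string {
	return []string{"DefaultActor" + kind + "Policy", "Default" + kind + "Policy"}
}

// componentCandidates: compType such as "Statestore" or "Pubsub"; direction "Inbound" or "Outbound".
func componentCandidates(compType, direction, kind string) []string {
	return []string{
		"Default" + compType + "Component" + direction + kind + "Policy",
		"DefaultComponent" + direction + kind + "Policy",
		"DefaultComponent" + kind + "Policy",
		"Default" + kind + "Policy",
	}
}

// resolve returns the named policy for the target if one exists, otherwise
// the first defined default policy in the chain, or "" if none applies.
func resolve(named map[string]string, target string, defined map[string]bool, chain []string) string {
	if p, ok := named[target]; ok {
		return p
	}
	for _, c := range chain {
		if defined[c] {
			return c
		}
	}
	return ""
}

func main() {
	defined := map[string]bool{
		"DefaultRetryPolicy":                            true,
		"DefaultAppRetryPolicy":                         true,
		"DefaultActorRetryPolicy":                       true,
		"DefaultComponentInboundRetryPolicy":            true,
		"DefaultStatestoreComponentOutboundRetryPolicy": true,
	}
	apps := map[string]string{"appA": "fastRetries", "appB": "retryForever"}
	components := map[string]string{"actorstore": "fastRetries"}

	fmt.Println(resolve(apps, "appC", defined, appCandidates("Retry")))                                      // DefaultAppRetryPolicy
	fmt.Println(resolve(components, "pubsub", defined, componentCandidates("Pubsub", "Outbound", "Retry"))) // DefaultRetryPolicy
	fmt.Println(resolve(components, "pubsub", defined, componentCandidates("Pubsub", "Inbound", "Retry")))  // DefaultComponentInboundRetryPolicy
}
```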

As an example, take the following solution consisting of three applications, three components and two actor types:

Applications:

  • AppA
  • AppB
  • AppC

Components:

  • Redis Pubsub: pubsub
  • Redis statestore: statestore
  • CosmosDB Statestore: actorstore

Actors:

  • EventActor
  • SummaryActor

Below is a policy that uses both default and named policies and applies them to the targets.

```yaml
spec:
  policies:
    retries:
      # Global Retry Policy
      DefaultRetryPolicy:
        policy: constant
        duration: 1s
        maxRetries: 3

      # Global Retry Policy for Apps
      DefaultAppRetryPolicy:
        policy: constant
        duration: 100ms
        maxRetries: 5

      # Global Retry Policy for Actors
      DefaultActorRetryPolicy:
        policy: exponential
        maxInterval: 15s
        maxRetries: 10

      # Global Retry Policy for Inbound Component operations
      DefaultComponentInboundRetryPolicy:
        policy: constant
        duration: 5s
        maxRetries: 5

      # Global Retry Policy for Statestores
      DefaultStatestoreComponentOutboundRetryPolicy:
        policy: exponential
        maxInterval: 60s
        maxRetries: -1

      # Named policy
      fastRetries:
        policy: constant
        duration: 10ms
        maxRetries: 3

      # Named policy
      retryForever:
        policy: exponential
        maxInterval: 10s
        maxRetries: -1

  targets:
    apps:
      appA:
        retry: fastRetries

      appB:
        retry: retryForever

    actors:
      EventActor:
        retry: retryForever

    components:
      actorstore:
        retry: fastRetries
```

The table below is a breakdown of which policies are applied when attempting to call the various targets in this solution.

| Target | Policy used |
|---|---|
| AppA | fastRetries |
| AppB | retryForever |
| AppC | DefaultAppRetryPolicy / DaprBuiltInServiceRetries |
| pubsub - Publish | DefaultRetryPolicy |
| pubsub - Subscribe | DefaultComponentInboundRetryPolicy |
| statestore | DefaultStatestoreComponentOutboundRetryPolicy |
| actorstore | fastRetries |
| EventActor | retryForever |
| SummaryActor | DefaultActorRetryPolicy |

Next steps

Try out one of the Resiliency quickstarts:

Last modified October 11, 2024: Fixed typo (#4389) (fe17926)