Retry policy

The retry policy is deliberately simple and bounded. Posthorn retries at most once per request, classifies failures into three buckets, and enforces a hard 10-second overall timeout. There is no exponential backoff and no jitter — both would complicate the contract without meaningful benefit at the scale and use case Posthorn targets.

| Transport response | Class | Action |
| --- | --- | --- |
| 200 OK / 202 Accepted | success | Return 200 to client |
| 5xx server error | ErrTransient | Wait 1s, retry once |
| Network timeout / connection refused | ErrTransient | Wait 1s, retry once |
| 429 Too Many Requests | ErrRateLimited | Wait 5s, retry once |
| 4xx other than 429 (401, 403, 422, etc.) | ErrTerminal | No retry → 502 to client |
| Either retry exhausted (still failing after retry) | (terminal) | 502 to client |
| 10 seconds elapsed in total | (timeout) | Cancel in-flight, return 502 |
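The classification table above can be sketched as a pair of small functions. This is a minimal illustration, not Posthorn's actual code: the function names `classify` and `retryDelay` are hypothetical, and treating every status outside the listed rows (for example 3xx) as terminal is an assumption the table does not state.

```go
package main

import (
	"errors"
	"net/http"
	"time"
)

// Sentinel errors for the three failure classes named in the table.
var (
	ErrTransient   = errors.New("transient")
	ErrRateLimited = errors.New("rate limited")
	ErrTerminal    = errors.New("terminal")
)

// classify maps a transport outcome to a failure class (hypothetical helper).
// netErr is non-nil when no response arrived (timeout, connection refused).
// A nil result means success.
func classify(status int, netErr error) error {
	switch {
	case netErr != nil:
		return ErrTransient
	case status == http.StatusOK || status == http.StatusAccepted:
		return nil
	case status == http.StatusTooManyRequests:
		return ErrRateLimited
	case status >= 500:
		return ErrTransient
	default:
		// Assumption: anything else, including 4xx other than 429, is terminal.
		return ErrTerminal
	}
}

// retryDelay returns the wait before the single retry and whether a retry
// is allowed at all for the given class.
func retryDelay(class error) (time.Duration, bool) {
	switch {
	case errors.Is(class, ErrTransient):
		return 1 * time.Second, true
	case errors.Is(class, ErrRateLimited):
		return 5 * time.Second, true
	default:
		return 0, false
	}
}
```

Under these assumptions, `classify(503, nil)` yields `ErrTransient` and `retryDelay(ErrTransient)` yields a 1s delay with a retry allowed, while `classify(422, nil)` yields `ErrTerminal` and no retry.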

Why one retry? Three reasons:

  1. Synchronous request. The client is waiting on the HTTP response. Each retry attempt adds latency the client experiences directly. Multi-attempt exponential backoff with a 60-second tail would be a poor UX — better to fail fast and let the operator handle recovery from logs.

  2. A failure that survives one retry is usually persistent. A transport 429 means upstream is rate-limiting; one retry after a generous 5s pause is enough to distinguish a “blip” from a “policy.” A transport 5xx is usually a brief network or upstream issue, so 1s is enough to clear it.

  3. Failure mode is logged, not lost. Posthorn logs the full submission payload on terminal failure (when log_failed_submissions = true, the default). The operator can recover the data and re-send manually if the transient issue was longer than the retry window.

Every request runs under a context.WithTimeout(ctx, 10*time.Second). When the deadline fires:

  • Any in-flight retry is cancelled mid-request via the request context.
  • The handler returns 502 immediately.
  • A submission_failed log line is written with the underlying error. There is no separate timeout_exceeded event — the timeout surfaces as the cancelled retry’s error.

This means a worst case looks like:

| t (sec) | Event |
| --- | --- |
| 0.0 | Request received, validated, send started |
| 5.0 | Transport returns 5xx (took 5s; slow upstream) |
| 5.0 | Retry waits 1s |
| 6.0 | Retry send started |
| 10.0 | 10s deadline fires; retry cancelled |
| 10.0 | 502 returned |

In practice, transient failures resolve in under a second and you never come close to the cap. The cap exists to bound the absolute worst case so client connections don’t pile up.

The 5-second retry delay on transport 429 is chosen to be:

  • Longer than ErrTransient’s 1s — 429 is upstream policy, not a network blip
  • Shorter than the 10-second cap, leaving room for the retry to complete

If a 429 repeats on retry, Posthorn returns 502 to the client. The operator’s mitigation is one of:

  • Lower Posthorn’s own rate limit so it doesn’t blow past Postmark’s
  • Upgrade the Postmark plan
  • Switch to a transport with a higher quota

A 4xx response from the upstream provider other than 429 is almost always a configuration error:

| Status | Meaning |
| --- | --- |
| 401 Unauthorized | Invalid api_key: wrong token, account deleted, or token revoked |
| 403 Forbidden | Account or sender domain suspended |
| 422 Unprocessable Entity | Provider rejected the message (unverified from, invalid recipient, blocked content) |
| 404 Not Found | URL misconfigured (Posthorn wouldn’t normally hit this; it indicates an internal bug) |

In every case, retrying without fixing the config will fail again. Posthorn logs the response status and returns 502 to the client. The operator’s recovery is to read the log, fix the config, and re-send manually if needed.

When the first attempt fails with a retryable class, Posthorn emits a send_retry_scheduled event:

{
  "time": "2026-05-16T20:01:23Z",
  "level": "INFO",
  "msg": "send_retry_scheduled",
  "submission_id": "7f2c84d6-9b1e-4c2f-a3b8-1a2b3c4d5e6f",
  "endpoint": "/api/contact",
  "transport": "postmark",
  "class": "transient",
  "status": 503,
  "delay": 1000000000
}

delay is in nanoseconds (the way slog’s JSON handler encodes a time.Duration): 1000000000 (1s) for transient, 5000000000 (5s) for rate_limited.

After the retry returns, Posthorn emits one of:

  • send_retry_succeeded (no fields beyond standard) — followed by the normal submission_sent (the terminal event carrying latency_ms and transport_message_id).
  • send_retry_failed (with error) — followed by submission_failed and a 502 to the client.

The retry constants are not currently configurable. The values are:

| Constant | Value | Why |
| --- | --- | --- |
| Transient retry delay | 1s | Long enough for brief network blips to clear; short enough not to consume the timeout budget |
| Rate-limit retry delay | 5s | Long enough for a typical upstream limit window to refill partially |
| Max retries | 1 | See “Why one retry” above |
| Overall timeout | 10s | Bounded to prevent connection pileup |

These are constants in code (per architecture doc), declared as package variables only so tests can override them. If you find yourself wanting to tune retry behavior in production, the right answer is usually to address the upstream cause, not the retry knob.