Retry policy
The retry policy is deliberately simple and bounded. Posthorn retries at most once per request, classifies failures into three buckets, and enforces a hard 10-second overall timeout. There is no exponential backoff and no jitter — both would complicate the contract without meaningful benefit at the scale and use case Posthorn targets.
Decision table
| Transport response | Class | Action |
|---|---|---|
| `200 OK` / `202 Accepted` | success | Return 200 to client |
| `5xx` server error | `ErrTransient` | Wait 1s, retry once |
| Network timeout / connection refused | `ErrTransient` | Wait 1s, retry once |
| `429 Too Many Requests` | `ErrRateLimited` | Wait 5s, retry once |
| `4xx` other than 429 (401, 403, 422, etc.) | `ErrTerminal` | No retry → 502 to client |
| Either retry exhausted (still failing after retry) | (terminal) | 502 to client |
| 10 seconds elapsed in total | (timeout) | Cancel in-flight, return 502 |
Why one retry, not exponential backoff?
Three reasons:
- Synchronous request. The client is waiting on the HTTP response. Each retry attempt adds latency the client experiences directly. Multi-attempt exponential backoff with a 60-second tail would be a poor UX; it is better to fail fast and let the operator handle recovery from logs.
- The cause is usually persistent. Transport `429` means upstream is rate-limiting; one retry after a generous 5s pause is enough to disambiguate “blip” from “policy.” Transport `5xx` is usually a brief network or upstream issue; 1s is enough.
- Failure mode is logged, not lost. Posthorn logs the full submission payload on terminal failure (when `log_failed_submissions = true`, the default). The operator can recover the data and re-send manually if the transient issue was longer than the retry window.
The 10-second hard cap
Every request runs under `context.WithTimeout(ctx, 10*time.Second)`. When the deadline fires:
- Any in-flight retry is cancelled mid-request via the request context.
- The handler returns 502 immediately.
- A `submission_failed` log line is written with the underlying error. There is no separate `timeout_exceeded` event; the timeout surfaces as the cancelled retry’s error.
This means a worst case looks like:
| t (sec) | Event |
|---|---|
| 0.0 | Request received, validated, send started |
| 5.0 | Transport returns 5xx (took 5s — slow upstream) |
| 5.0 | Retry waits 1s |
| 6.0 | Retry send started |
| 10.0 | 10s deadline fires; retry cancelled |
| 10.0 | 502 returned |
In practice, transient failures resolve in under a second and you never come close to the cap. The cap exists to bound the absolute worst case so client connections don’t pile up.
429 backoff length
The 5-second backoff on transport `429` is chosen to be:
- Longer than `ErrTransient`’s 1s, because `429` is upstream policy, not a network blip
- Shorter than the 10-second cap, leaving room for the retry to complete
If a 429 repeats on retry, Posthorn returns 502 to the client. The operator’s mitigation is one of:
- Lower Posthorn’s own rate limit so it doesn’t blow past Postmark’s
- Upgrade the Postmark plan
- Switch to a transport with a higher quota
What ErrTerminal looks like
A 4xx response from the upstream provider other than 429 is almost always a configuration error:
| Status | Meaning |
|---|---|
| `401 Unauthorized` | Invalid `api_key`: wrong token, account deleted, or token revoked |
| `403 Forbidden` | Account or sender domain suspended |
| `422 Unprocessable Entity` | Provider rejected the message (unverified `from`, invalid recipient, blocked content) |
| `404 Not Found` | URL misconfigured (Posthorn wouldn’t normally hit this; it would indicate an internal bug) |
In every case, retrying without fixing the config will fail again. Posthorn logs the response status and returns 502 to the client. The operator’s recovery is to read the log, fix the config, and re-send manually if needed.
Observing retries
When the first attempt fails with a retryable class, Posthorn emits a `send_retry_scheduled` event:

```json
{
  "time": "2026-05-16T20:01:23Z",
  "level": "INFO",
  "msg": "send_retry_scheduled",
  "submission_id": "7f2c84d6-9b1e-4c2f-a3b8-1a2b3c4d5e6f",
  "endpoint": "/api/contact",
  "transport": "postmark",
  "class": "transient",
  "status": 503,
  "delay": 1000000000
}
```

`delay` is in nanoseconds (slog’s default `Duration` encoding): 1s for transient, 5s for rate_limited.
After the retry returns, Posthorn emits one of:
- `send_retry_succeeded` (no fields beyond standard), followed by the normal `submission_sent` (the terminal event carrying `latency_ms` and `transport_message_id`).
- `send_retry_failed` (with `error`), followed by `submission_failed` and a 502 to the client.
Tuning
The retry constants are not currently configurable. The values are:
| Constant | Value | Why |
|---|---|---|
| Transient retry delay | 1s | Long enough for brief network blips to clear; short enough not to consume the timeout budget |
| Rate-limit retry delay | 5s | Long enough for a typical upstream limit window to refill partially |
| Max retries | 1 | See “Why one retry” above |
| Overall timeout | 10s | Bounded to prevent connection pileup |
These are constants in code (per architecture doc), declared as package variables only so tests can override them. If you find yourself wanting to tune retry behavior in production, the right answer is usually to address the upstream cause, not the retry knob.