Reliability patterns
Automation clients must assume failures happen:
- transient network disconnects
- process restarts during an in-flight prepare/execute step
- timeouts where the client cannot tell if a request succeeded
- missed stream updates (bounded queues / coalescing)
This guide shows the recommended patterns for building restart-safe and idempotent MesoLive automation using the Python SDK.
1) Persist idempotency keys for “intents”
For idempotent Control Hub operations, generate an `IdempotencyKey` per intent and persist it in your durable state (DB, file, etc.). Examples:
- start/execute entry: `prep_entry_*`, `send_order_*`, `paper_entry_*`
- exit: `prep_exit_*`, `send_order_*`, `paper_exit_*`
- cancel order: `cancel_order_*`
If the process crashes after submission, you can safely retry with the same key or recover via `GetIdempotencyRecord`.
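One way to make the key-per-intent rule concrete is to derive a stable intent id from your own workflow state and commit the generated key to durable storage before submitting anything. A minimal sketch using SQLite; the table name and the `get_or_create_intent_key` helper are illustrative, not part of the SDK:

```python
import sqlite3
import uuid


def get_or_create_intent_key(db: sqlite3.Connection, intent_id: str, prefix: str) -> str:
    """Return the persisted idempotency key for an intent, minting one on first use."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS intent_keys ("
        "intent_id TEXT PRIMARY KEY, key TEXT NOT NULL)"
    )
    row = db.execute(
        "SELECT key FROM intent_keys WHERE intent_id = ?", (intent_id,)
    ).fetchone()
    if row:
        # Crash-safe: retrying the same intent always reuses the same key.
        return row[0]
    key = f"{prefix}{uuid.uuid4().hex}"
    db.execute("INSERT INTO intent_keys (intent_id, key) VALUES (?, ?)", (intent_id, key))
    db.commit()
    return key
```

Because the key is committed before the request goes out, a restarted process can re-read it and either retry the call with the same key or look up the outcome with `GetIdempotencyRecord`.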
2) Treat non-success StatusCodes as expected control flow
The SDK raises `MesoLiveApiException` when the hub returns `Status != Success`, including:
- `DuplicateRequest`: the server already completed an idempotent operation for that key
- `Conflict`: the idempotent operation is still processing, or the key was reused incorrectly
Do not treat these as “unexpected”; handle them explicitly in automation wrappers.
3) Recover “unknown outcome” via GetIdempotencyRecord
If `SendOrder` / `ApplyPaperFills` / `CancelOrder` times out or disconnects, recovery should be:
- Call `GetIdempotencyRecord(IdempotencyKey=...)`
- If the record is still `Processing`, wait/poll and retry
- If `Completed`, continue using the recorded `LiveOrderIds` / captured response metadata
- If `Failed`, surface and stop (or retry with a new key if appropriate)
```python
import asyncio

from mesolive_sdk import models
from mesolive_sdk.exceptions import MesoLiveApiException


async def wait_idempotency_terminal(
    control, key: str, *, timeout_s: float = 30.0
) -> models.IdempotencyRecord:
    """Poll the idempotency record until it leaves Processing (or the deadline passes)."""
    deadline = asyncio.get_running_loop().time() + timeout_s
    while True:
        rec = await control.get_idempotency_record(
            models.GetIdempotencyRecordArgs(IdempotencyKey=key)
        )
        if rec.Status != models.IdempotencyRecordStatus.Processing:
            return rec  # terminal: Completed or Failed
        if asyncio.get_running_loop().time() >= deadline:
            raise TimeoutError(
                f"Idempotent operation still Processing after {timeout_s:.1f}s (key={key})"
            )
        await asyncio.sleep(0.25)


async def safe_send_order(control, args: models.SendOrderArgs) -> models.IdempotencyRecord:
    """Submit an order and resolve its outcome via the idempotency record."""
    try:
        await control.send_order(args)
    except MesoLiveApiException as exc:
        # DuplicateRequest / Conflict mean the server already owns this key;
        # anything else is a genuine error and should propagate.
        if exc.status not in {models.StatusCode.DuplicateRequest, models.StatusCode.Conflict}:
            raise
    # Whether the call succeeded, duplicated, or is still processing,
    # the idempotency record is the source of truth for the outcome.
    return await wait_idempotency_terminal(control, args.IdempotencyKey)
```
4) Use Event Hub replay to fill gaps after restarts
The Event Hub provides per-user event ordering via `EventSeqId` and a history API:
- `GetLatestEventSeqId` returns the latest cursor
- `GetEventsSince` returns paged historical events after a cursor
Recommended durable pattern:
- Persist `last_event_seq_id` in your durable store.
- On startup, connect to Event Hub and replay `GetEventsSince(SinceEventSeqId=last_event_seq_id)` until caught up.
- Subscribe (often `Strategies=None`) and then process live callbacks.
- Update `last_event_seq_id` as you process each event (history envelopes and callbacks carry `EventSeqId`).
On connect, the server may send “active state” snapshots (signals/errors) to the caller. These may overlap with replay results. Build handlers to be idempotent (use `EventId` as a dedupe key if needed).
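The replay half of this pattern can be sketched without the SDK. Here `get_events_since` is a stand-in for a paged `GetEventsSince` call, assumed to return `(events_after_cursor, has_more)` with events represented by their sequence ids:

```python
from typing import Callable, List, Tuple


def replay_gap(
    get_events_since: Callable[[int], Tuple[List[int], bool]],
    last_seq: int,
    save_cursor: Callable[[int], None],
) -> List[int]:
    """Drain historical pages after last_seq, persisting the cursor per event."""
    processed: List[int] = []
    more = True
    while more:
        events, more = get_events_since(last_seq)
        for seq in events:
            processed.append(seq)  # your handle_event(...) would run here
            last_seq = seq
            save_cursor(seq)  # persist after each event so a crash resumes exactly here
    return processed
```

Persisting the cursor after each event (rather than per page) keeps the worst-case replay after a crash to a single already-processed event, which is safe because handlers are idempotent.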
5) Make handlers fast; offload slow work
Event callbacks arrive on the SignalR connection. If your handler does heavy computation or slow I/O, enqueue work to your own worker pool/queue so you don’t fall behind and miss time-sensitive updates.
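One common shape for this, sketched with a plain `asyncio.Queue` (the SDK is not involved; `on_event` stands in for your registered callback):

```python
import asyncio


async def demo() -> list:
    """Callback enqueues in O(1); a worker task does the slow part off the callback path."""
    work_q: asyncio.Queue = asyncio.Queue(maxsize=1000)
    results = []

    def on_event(evt) -> None:
        # Runs on the connection's callback path: just enqueue and return.
        work_q.put_nowait(evt)

    async def worker() -> None:
        while True:
            evt = await work_q.get()
            await asyncio.sleep(0)  # placeholder for slow I/O or computation
            results.append(evt)
            work_q.task_done()

    task = asyncio.create_task(worker())
    for i in range(5):
        on_event(i)  # simulate five incoming callbacks
    await work_q.join()  # wait until the worker has drained the queue
    task.cancel()
    return results
```

A bounded queue also gives you an explicit backpressure signal: if `put_nowait` raises `QueueFull`, you know the consumer is falling behind and can shed or coalesce work deliberately.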
6) Streaming data is not a ledger
Data Hub streams are best-effort and can drop items under backpressure. For reliable monitoring:
- treat streams as “keep me updated”
- periodically resync with snapshots (`Get*Snapshot`)
- if your consumer lags, restart the stream and resync
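A small illustration of the stream-plus-snapshot discipline, with a hypothetical `fetch_snapshot` callable standing in for the relevant `Get*Snapshot` call:

```python
import time


class SnapshotRefreshingCache:
    """Apply best-effort stream updates, but treat a periodic snapshot as ground truth."""

    def __init__(self, fetch_snapshot, resync_interval_s: float = 30.0):
        self._fetch_snapshot = fetch_snapshot
        self._interval = resync_interval_s
        self._state = dict(fetch_snapshot())
        self._last_sync = time.monotonic()

    def on_stream_update(self, key, value) -> None:
        # Streamed deltas are applied optimistically; they may have gaps.
        self._state[key] = value

    def get(self, key):
        if time.monotonic() - self._last_sync >= self._interval:
            # Replace, don't merge: anything a dropped update left stale is discarded.
            self._state = dict(self._fetch_snapshot())
            self._last_sync = time.monotonic()
        return self._state.get(key)
```

Replacing the whole state on resync (rather than merging) is the point: it bounds how long a dropped stream item can mislead you to one resync interval.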
7) Use multiple data feeds and require quote convergence
If you have access to multiple quote providers/connectors, don’t treat any single feed as the source of truth. A robust pattern is to treat quotes as a sensor network:
- subscribe to the same instrument/position data from at least 2 providers
- reject stale quotes using `LivePrices.BidAgeMs`/`AskAgeMs` (and/or `SnapshotTime`)
- ignore “boot frame” quotes where both sides are 0/0
- sample multiple updates per provider and use the median bid/ask/mid to filter flicker
- require convergence (provider median mids within a tolerance, e.g. N ticks) before trading
- if feeds diverge beyond tolerance, pause automation, alert, and resync (restart streams, retry later)
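The convergence gate reduces to a pure function over recent mids per provider. A sketch; `quotes_converged` and its parameters are illustrative, not SDK API:

```python
from statistics import median


def quotes_converged(
    provider_mids: dict[str, list[float]], tick: float, max_ticks: float
) -> bool:
    """True when every provider's median mid sits within max_ticks of the overall median."""
    if len(provider_mids) < 2:
        return False  # need at least two independent feeds to cross-check
    med_per_provider = [median(mids) for mids in provider_mids.values() if mids]
    if len(med_per_provider) < 2:
        return False
    anchor = median(med_per_provider)  # robust to one flickering feed
    return all(abs(m - anchor) <= max_ticks * tick for m in med_per_provider)
```

Gating on medians rather than last prices means a single bad print from one provider cannot flip the decision; only a sustained divergence pauses trading.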
8) Track the “far side” of the book to build a stable mid price
For limit-order repricing (especially single-leg), the top-of-book on the side you are quoting can be influenced by your own working order. A reliable approach is:
- treat the near side as potentially untrusted if it matches your working limit (self-quote guard)
- track a far-side snapshot as the primary anchor:
- buying → anchor on ask
- selling → anchor on bid
- compute mid using both sides when trusted; otherwise fall back to far-side + a base/synthetic mid
- if the far side becomes stale (no updates for your configured timeout), slow down repricing or pause
- enforce safety rails around crossing:
- don’t cross the spread until a delay elapses
- cap how far you are willing to cross (in ticks)
- add a grace period after partial fills before repricing again
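A simplified version of the anchoring rule above; the 50/50 blend used for the untrusted case is one reasonable choice under these assumptions, not the only one:

```python
def anchored_mid(
    side: str, bid: float, ask: float, near_trusted: bool, base_mid: float
) -> float:
    """Build a repricing mid anchored on the far side of the book.

    side: "buy" anchors on the ask (far side for a buyer); "sell" anchors on the bid.
    near_trusted: False when top-of-book on our side matches our own working order.
    base_mid: fallback base/synthetic mid used when the near side can't be trusted.
    """
    far = ask if side == "buy" else bid
    if near_trusted:
        return (bid + ask) / 2.0  # both sides trusted: ordinary mid
    # Near side may be our own quote: blend the far-side anchor with the base mid.
    return (far + base_mid) / 2.0
```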
9) After reconnect: resubscribe, replay, and reconcile in-flight jobs
SignalR reconnects do not preserve subscription state. On reconnect/restart, treat everything as suspect:
- resubscribe to Event Hub strategies (often `Strategies=None`)
- replay history from your persisted cursor (`GetEventsSince`) to fill gaps
- validate your cursor baseline against the server (via `GetLatestEventSeqId`); if your local cursor is ahead, reset it
- restart Data Hub streams and resync with snapshots
- for any in-flight async preparation jobs (`JobId`), call:
  - `GetPreparePositionEntryStatus(JobId)`
  - `GetPreparePositionExitStatus(JobId)`
  - `GetPreparePositionAdjustmentStatus(JobId)`
This job-status reconciliation is critical to avoid “stuck” workflows when completion callbacks were missed.
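The reconciliation step itself is simple bookkeeping. In this sketch `get_status` stands in for the appropriate `GetPrepare*Status(JobId)` call and is assumed to return one of `"Processing"`, `"Completed"`, or `"Failed"`:

```python
from typing import Callable, Dict, List


def reconcile_jobs(
    pending: Dict[str, str],           # job_id -> workflow kind (entry/exit/adjustment)
    get_status: Callable[[str], str],  # stand-in for GetPrepare*Status(JobId)
) -> Dict[str, List[str]]:
    """Bucket in-flight preparation jobs after a reconnect so no workflow stays stuck."""
    buckets: Dict[str, List[str]] = {"resume": [], "still_waiting": [], "failed": []}
    for job_id in pending:
        status = get_status(job_id)
        if status == "Completed":
            buckets["resume"].append(job_id)  # completion callback was missed: resume now
        elif status == "Processing":
            buckets["still_waiting"].append(job_id)  # keep polling or re-arm the callback
        else:
            buckets["failed"].append(job_id)  # surface and clean up
    return buckets
```

The `pending` map is whatever set of `JobId`s your durable store says were in flight when the connection dropped; the buckets drive what your workflow does next.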
10) Add guardrails: dedupe, throttles, and daily limits
Real-time event streams can deliver duplicates (replay + live, reconnect snapshots, retries). Add guardrails:
- Dedupe: use `EventId` as a stable key; make handlers idempotent
- Stale signal filter: ignore signals that are unreasonably old for your automation session
- Throttling: limit parallel work; exit/adjustment work should typically preempt entry work
- Per-strategy locking: serialize workflows per strategy (or per position) to avoid race conditions
- Daily caps: enforce max positions per day per (strategy, account) as a fail-safe against signal storms
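Dedupe and the daily cap fit naturally into one small guard object. All names here are illustrative; in production you would also bound or expire the seen-id set:

```python
import datetime as dt


class Guardrails:
    """Event dedupe by EventId plus a per-(strategy, account) daily position cap."""

    def __init__(self, max_positions_per_day: int):
        self._seen_event_ids: set[str] = set()  # unbounded here; expire old ids in production
        self._daily_counts: dict[tuple[str, str, dt.date], int] = {}
        self._max = max_positions_per_day

    def should_process(self, event_id: str) -> bool:
        if event_id in self._seen_event_ids:
            return False  # duplicate from replay/reconnect snapshot/retry: skip
        self._seen_event_ids.add(event_id)
        return True

    def try_open_position(self, strategy: str, account: str, today: dt.date) -> bool:
        key = (strategy, account, today)
        if self._daily_counts.get(key, 0) >= self._max:
            return False  # daily cap hit: fail safe against signal storms
        self._daily_counts[key] = self._daily_counts.get(key, 0) + 1
        return True
```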