Reliability patterns
Automation clients must assume failures happen:
- transient network disconnects
- process restarts during an in-flight prepare/execute step
- timeouts where the client cannot tell if a request succeeded
- missed stream updates (bounded queues / coalescing)
This guide shows the recommended patterns for building restart-safe and idempotent MesoLive automation using the Python SDK.
1) Persist idempotency keys for “intents”
For idempotent Control Hub operations, generate an `IdempotencyKey` per intent and persist it in your durable state (DB, file, etc.). Examples:
- start/execute entry: `prep_entry_*`, `send_order_*`, `paper_entry_*`
- exit: `prep_exit_*`, `send_order_*`, `paper_exit_*`
- cancel order: `cancel_order_*`
If the process crashes after submission, you can safely retry with the same key or recover via `GetIdempotencyRecord`.
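One way to make the key-per-intent rule concrete is to derive a stable intent id from your own workflow state and commit the generated key to durable storage before submitting anything. A minimal sketch using SQLite; the table name and the `get_or_create_intent_key` helper are illustrative, not part of the SDK:

```python
import sqlite3
import uuid


def get_or_create_intent_key(db: sqlite3.Connection, intent_id: str, prefix: str) -> str:
    """Return the persisted idempotency key for an intent, minting one on first use."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS intent_keys ("
        "intent_id TEXT PRIMARY KEY, key TEXT NOT NULL)"
    )
    row = db.execute(
        "SELECT key FROM intent_keys WHERE intent_id = ?", (intent_id,)
    ).fetchone()
    if row:
        # Crash-safe: retrying the same intent always reuses the same key.
        return row[0]
    key = f"{prefix}{uuid.uuid4().hex}"
    db.execute("INSERT INTO intent_keys (intent_id, key) VALUES (?, ?)", (intent_id, key))
    db.commit()
    return key
```

Because the key is committed before the request goes out, a restarted process can re-read it and either retry the call with the same key or look up the outcome with `GetIdempotencyRecord`.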
2) Treat non-success StatusCodes as expected control flow
The SDK raises `MesoLiveApiException` when the hub returns `Status != Success`, including:
- `DuplicateRequest`: the server already completed an idempotent operation for that key
- `Conflict`: the idempotent operation is still processing, or the key was reused incorrectly
Do not treat these as “unexpected”; handle them explicitly in automation wrappers.
3) Recover “unknown outcome” via GetIdempotencyRecord
If `SendOrder` / `ApplyPaperFills` / `CancelOrder` times out or disconnects, recovery should be:
- Call `GetIdempotencyRecord(IdempotencyKey=...)`
- If the record is still `Processing`, wait/poll and retry
- If `Completed`, continue using the recorded `LiveOrderIds` / captured response metadata
- If `Failed`, surface and stop (or retry with a new key if appropriate)
```python
import asyncio

from mesolive_sdk import models
from mesolive_sdk.exceptions import MesoLiveApiException


async def wait_idempotency_terminal(
    control, key: str, *, timeout_s: float = 30.0
) -> models.IdempotencyRecord:
    """Poll the idempotency record until it leaves Processing (or the deadline passes)."""
    deadline = asyncio.get_running_loop().time() + timeout_s
    while True:
        rec = await control.get_idempotency_record(
            models.GetIdempotencyRecordArgs(IdempotencyKey=key)
        )
        if rec.Status != models.IdempotencyRecordStatus.Processing:
            return rec  # terminal: Completed or Failed
        if asyncio.get_running_loop().time() >= deadline:
            raise TimeoutError(
                f"Idempotent operation still Processing after {timeout_s:.1f}s (key={key})"
            )
        await asyncio.sleep(0.25)


async def safe_send_order(control, args: models.SendOrderArgs) -> models.IdempotencyRecord:
    """Submit an order and resolve its outcome via the idempotency record."""
    try:
        await control.send_order(args)
    except MesoLiveApiException as exc:
        # DuplicateRequest / Conflict mean the server already owns this key;
        # anything else is a genuine error and should propagate.
        if exc.status not in {models.StatusCode.DuplicateRequest, models.StatusCode.Conflict}:
            raise
    # Whether the call succeeded, duplicated, or is still processing,
    # the idempotency record is the source of truth for the outcome.
    return await wait_idempotency_terminal(control, args.IdempotencyKey)
```
4) Use Event Hub replay to fill gaps after restarts
The Event Hub provides per-user event ordering via `EventSeqId` and a history API:
- `GetLatestEventSeqId` returns the latest cursor
- `GetEventsSince` returns paged historical events after a cursor
Recommended durable pattern:
- Persist `last_event_seq_id` in your durable store.
- On startup, connect to Event Hub and replay `GetEventsSince(SinceEventSeqId=last_event_seq_id)` until caught up.
- Subscribe (often `Strategies=None`) and then process live callbacks.
- Update `last_event_seq_id` as you process each event (history envelopes and callbacks carry `EventSeqId`).
On connect, the server may send “active state” snapshots (signals/errors) to the caller. These may overlap with replay results. Build handlers to be idempotent (use `EventId` as a dedupe key if needed).
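The replay half of this pattern can be sketched without the SDK. Here `get_events_since` is a stand-in for a paged `GetEventsSince` call, assumed to return `(events_after_cursor, has_more)` with events represented by their sequence ids:

```python
from typing import Callable, List, Tuple


def replay_gap(
    get_events_since: Callable[[int], Tuple[List[int], bool]],
    last_seq: int,
    save_cursor: Callable[[int], None],
) -> List[int]:
    """Drain historical pages after last_seq, persisting the cursor per event."""
    processed: List[int] = []
    more = True
    while more:
        events, more = get_events_since(last_seq)
        for seq in events:
            processed.append(seq)  # your handle_event(...) would run here
            last_seq = seq
            save_cursor(seq)  # persist after each event so a crash resumes exactly here
    return processed
```

Persisting the cursor after each event (rather than per page) keeps the worst-case replay after a crash to a single already-processed event, which is safe because handlers are idempotent.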
5) Make handlers fast; offload slow work
Event callbacks arrive on the SignalR connection. If your handler does heavy computation or slow I/O, enqueue work to your own worker pool/queue so you don’t fall behind and miss time-sensitive updates.
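One common shape for this, sketched with a plain `asyncio.Queue` (the SDK is not involved; `on_event` stands in for your registered callback):

```python
import asyncio


async def demo() -> list:
    """Callback enqueues in O(1); a worker task does the slow part off the callback path."""
    work_q: asyncio.Queue = asyncio.Queue(maxsize=1000)
    results = []

    def on_event(evt) -> None:
        # Runs on the connection's callback path: just enqueue and return.
        work_q.put_nowait(evt)

    async def worker() -> None:
        while True:
            evt = await work_q.get()
            await asyncio.sleep(0)  # placeholder for slow I/O or computation
            results.append(evt)
            work_q.task_done()

    task = asyncio.create_task(worker())
    for i in range(5):
        on_event(i)  # simulate five incoming callbacks
    await work_q.join()  # wait until the worker has drained the queue
    task.cancel()
    return results
```

A bounded queue also gives you an explicit backpressure signal: if `put_nowait` raises `QueueFull`, you know the consumer is falling behind and can shed or coalesce work deliberately.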
6) Streaming data is not a ledger
Data Hub streams are best-effort and can drop items under backpressure. For reliable monitoring:
- treat streams as “keep me updated”
- periodically resync with snapshots (`Get*Snapshot`)
- if your consumer lags, restart the stream and resync
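A small illustration of the stream-plus-snapshot discipline, with a hypothetical `fetch_snapshot` callable standing in for the relevant `Get*Snapshot` call:

```python
import time


class SnapshotRefreshingCache:
    """Apply best-effort stream updates, but treat a periodic snapshot as ground truth."""

    def __init__(self, fetch_snapshot, resync_interval_s: float = 30.0):
        self._fetch_snapshot = fetch_snapshot
        self._interval = resync_interval_s
        self._state = dict(fetch_snapshot())
        self._last_sync = time.monotonic()

    def on_stream_update(self, key, value) -> None:
        # Streamed deltas are applied optimistically; they may have gaps.
        self._state[key] = value

    def get(self, key):
        if time.monotonic() - self._last_sync >= self._interval:
            # Replace, don't merge: anything a dropped update left stale is discarded.
            self._state = dict(self._fetch_snapshot())
            self._last_sync = time.monotonic()
        return self._state.get(key)
```

Replacing the whole state on resync (rather than merging) is the point: it bounds how long a dropped stream item can mislead you to one resync interval.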
7) Use multiple data feeds and require quote convergence
If you have access to multiple quote providers/connectors, don’t treat any single feed as the source of truth. A robust pattern is to treat quotes as a sensor network:
- subscribe to the same instrument/position data from at least 2 providers
- reject stale quotes using `LivePrices.BidAgeMs`/`AskAgeMs` (and/or `SnapshotTime`)
- ignore “boot frame” quotes where both sides are 0/0
- sample multiple updates per provider and use the median bid/ask/mid to filter flicker
- require convergence (provider median mids within a tolerance, e.g. N ticks) before trading
- if feeds diverge beyond tolerance, pause automation, alert, and resync (restart streams, retry later)
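The convergence gate reduces to a pure function over recent mids per provider. A sketch; `quotes_converged` and its parameters are illustrative, not SDK API:

```python
from statistics import median


def quotes_converged(
    provider_mids: dict[str, list[float]], tick: float, max_ticks: float
) -> bool:
    """True when every provider's median mid sits within max_ticks of the overall median."""
    if len(provider_mids) < 2:
        return False  # need at least two independent feeds to cross-check
    med_per_provider = [median(mids) for mids in provider_mids.values() if mids]
    if len(med_per_provider) < 2:
        return False
    anchor = median(med_per_provider)  # robust to one flickering feed
    return all(abs(m - anchor) <= max_ticks * tick for m in med_per_provider)
```

Gating on medians rather than last prices means a single bad print from one provider cannot flip the decision; only a sustained divergence pauses trading.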
8) Track the “far side” of the book to build a stable mid price
For limit-order repricing (especially single-leg), the top-of-book on the side you are quoting can be influenced by your own working order. A reliable approach is:
- treat the near side as potentially untrusted if it matches your working limit (self-quote guard)
- track a far-side snapshot as the primary anchor:
- buying → anchor on ask
- selling → anchor on bid
- compute mid using both sides when trusted; otherwise fall back to far-side + a base/synthetic mid
- if the far side becomes stale (no updates for your configured timeout), slow down repricing or pause
- enforce safety rails around crossing:
- don’t cross the spread until a delay elapses
- cap how far you are willing to cross (in ticks)
- add a grace period after partial fills before repricing again
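A simplified version of the anchoring rule above; the 50/50 blend used for the untrusted case is one reasonable choice under these assumptions, not the only one:

```python
def anchored_mid(
    side: str, bid: float, ask: float, near_trusted: bool, base_mid: float
) -> float:
    """Build a repricing mid anchored on the far side of the book.

    side: "buy" anchors on the ask (far side for a buyer); "sell" anchors on the bid.
    near_trusted: False when top-of-book on our side matches our own working order.
    base_mid: fallback base/synthetic mid used when the near side can't be trusted.
    """
    far = ask if side == "buy" else bid
    if near_trusted:
        return (bid + ask) / 2.0  # both sides trusted: ordinary mid
    # Near side may be our own quote: blend the far-side anchor with the base mid.
    return (far + base_mid) / 2.0
```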
9) After reconnect: resubscribe, replay, and reconcile in-flight jobs
SignalR reconnects do not preserve subscription state. On reconnect/restart, treat everything as suspect:
- resubscribe to Event Hub strategies (often `Strategies=None`)
- replay history from your persisted cursor (`GetEventsSince`) to fill gaps
- validate your cursor baseline against the server (via `GetLatestEventSeqId`); if your local cursor is ahead, reset it
- restart Data Hub streams and resync with snapshots
- for any in-flight async preparation jobs (`JobId`), call:
  - `GetPreparePositionEntryStatus(JobId)`
  - `GetPreparePositionExitStatus(JobId)`
  - `GetPreparePositionAdjustmentStatus(JobId)`
This job-status reconciliation is critical to avoid “stuck” workflows when completion callbacks were missed.
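The reconciliation step itself is simple bookkeeping. In this sketch `get_status` stands in for the appropriate `GetPrepare*Status(JobId)` call and is assumed to return one of `"Processing"`, `"Completed"`, or `"Failed"`:

```python
from typing import Callable, Dict, List


def reconcile_jobs(
    pending: Dict[str, str],           # job_id -> workflow kind (entry/exit/adjustment)
    get_status: Callable[[str], str],  # stand-in for GetPrepare*Status(JobId)
) -> Dict[str, List[str]]:
    """Bucket in-flight preparation jobs after a reconnect so no workflow stays stuck."""
    buckets: Dict[str, List[str]] = {"resume": [], "still_waiting": [], "failed": []}
    for job_id in pending:
        status = get_status(job_id)
        if status == "Completed":
            buckets["resume"].append(job_id)  # completion callback was missed: resume now
        elif status == "Processing":
            buckets["still_waiting"].append(job_id)  # keep polling or re-arm the callback
        else:
            buckets["failed"].append(job_id)  # surface and clean up
    return buckets
```

The `pending` map is whatever set of `JobId`s your durable store says were in flight when the connection dropped; the buckets drive what your workflow does next.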
10) Add guardrails: dedupe, throttles, and daily limits
Real-time event streams can deliver duplicates (replay + live, reconnect snapshots, retries). Add guardrails:
- Dedupe: use `EventId` as a stable key; make handlers idempotent
- Stale signal filter: ignore signals that are unreasonably old for your automation session
- Throttling: limit parallel work; exit/adjustment work should typically preempt entry work
- Per-strategy locking: serialize workflows per strategy (or per position) to avoid race conditions
- Daily caps: enforce max positions per day per (strategy, account) as a fail-safe against signal storms
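Dedupe and the daily cap fit naturally into one small guard object. All names here are illustrative; in production you would also bound or expire the seen-id set:

```python
import datetime as dt


class Guardrails:
    """Event dedupe by EventId plus a per-(strategy, account) daily position cap."""

    def __init__(self, max_positions_per_day: int):
        self._seen_event_ids: set[str] = set()  # unbounded here; expire old ids in production
        self._daily_counts: dict[tuple[str, str, dt.date], int] = {}
        self._max = max_positions_per_day

    def should_process(self, event_id: str) -> bool:
        if event_id in self._seen_event_ids:
            return False  # duplicate from replay/reconnect snapshot/retry: skip
        self._seen_event_ids.add(event_id)
        return True

    def try_open_position(self, strategy: str, account: str, today: dt.date) -> bool:
        key = (strategy, account, today)
        if self._daily_counts.get(key, 0) >= self._max:
            return False  # daily cap hit: fail safe against signal storms
        self._daily_counts[key] = self._daily_counts.get(key, 0) + 1
        return True
```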