Commerce integration error patterns playbook

A field guide for classifying recurring commerce integration errors, assigning ownership, and turning incidents into better contracts, monitoring, and recovery paths.

Adrian Shaw
May 22, 2026 · 5 min read
Architecture Decision

Why error patterns matter

Commerce integration incidents do not arrive as clean technical categories. A failed order export can read as an ERP outage, a mapping defect, a missing customer field, a duplicate retry, or a payment-state mismatch depending on which log you open first. Without a shared error taxonomy, teams burn hours debating symptoms while orders, refunds, catalog updates, and marketplace offers sit stuck.

An error-pattern playbook does not replace logs or dashboards. It turns those signals into ownership, business impact, and a recovery path — the difference between observing an exception and operating a commerce flow. Use each pattern below as a first-pass filter during triage.

Pattern 1: ownership errors

The tell: the data is technically correct, but it was written by the wrong system.

Typical cases: price changes land in commerce when ERP owns price; product attributes are edited in the storefront when PIM owns enrichment; order status is corrected in the OMS without propagating to customer-service tooling. In each case the defect is a source-of-truth violation, not a value error.

Recovery is more than a data fix. Update the ownership model and block the wrong update path, or the same incident returns with a different payload. Track these in the decision log, not only the defect backlog: the remedy is an architecture decision, not a bug fix.
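Blocking the wrong update path can be as simple as a guard in front of the write. The following is a minimal sketch, not a prescribed implementation: the ownership map, the system names ("erp", "pim", "oms", "storefront"), and the `check_ownership` helper are all hypothetical and would come from your own ownership model.

```python
from dataclasses import dataclass

# Hypothetical ownership map: field -> the one system allowed to write it.
FIELD_OWNERS = {
    "price": "erp",
    "product_attributes": "pim",
    "order_status": "oms",
}


@dataclass(frozen=True)
class WriteAttempt:
    field: str
    source_system: str
    value: object


class OwnershipViolation(Exception):
    """Raised when a system writes a field it does not own."""


def check_ownership(attempt: WriteAttempt) -> None:
    """Reject a write unless it comes from the registered owner of the field."""
    owner = FIELD_OWNERS.get(attempt.field)
    if owner is None:
        # Unknown fields are rejected so a new field must get an owner first.
        raise OwnershipViolation(f"no owner registered for field '{attempt.field}'")
    if attempt.source_system != owner:
        raise OwnershipViolation(
            f"'{attempt.source_system}' wrote '{attempt.field}', "
            f"which is owned by '{owner}'"
        )


# A write from the owning system passes silently.
check_ownership(WriteAttempt("price", "erp", 19.99))
```

A storefront write to `price` would raise `OwnershipViolation` here even though the value itself is valid, which is the point of the pattern: the check is on the writer, not the data.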

Pattern 2: contract errors

The tell: failures cluster right after a release that touched one side of the integration without a consumer review.

Sender and receiver disagree about payload shape, required fields, version, enum values, identifier format, or status transitions. Capture the failing payload, name the contract rule that was broken, and decide whether sender, receiver, or mapper owns the correction. Add a contract test for that specific failure. If the rule needs a temporary exception, give it an expiry date — undocumented permanent exceptions become hidden architecture.
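One way to make "name the contract rule that was broken" and "give it an expiry date" concrete is a small validator that returns named violations and honors only unexpired exceptions. This is a sketch under assumptions: the required fields, status values, rule names, and the expiry date are illustrative, not taken from any real contract.

```python
from datetime import date

# Illustrative contract rules; a real contract would define its own.
REQUIRED_FIELDS = {"order_id", "currency", "tax_code"}
ALLOWED_STATUSES = {"created", "paid", "shipped", "cancelled"}

# Temporary exceptions: rule name -> last day the exception is honored.
CONTRACT_EXCEPTIONS = {"missing:tax_code": date(2026, 6, 30)}


def contract_violations(payload: dict, today: date) -> list[str]:
    """Return the named contract rules this payload breaks as of `today`."""
    violations = [f"missing:{name}" for name in sorted(REQUIRED_FIELDS - set(payload))]
    status = payload.get("status")
    if status not in ALLOWED_STATUSES:
        violations.append(f"enum:status={status!r}")
    # An exception suppresses a violation only until its expiry date.
    return [
        rule
        for rule in violations
        if not (rule in CONTRACT_EXCEPTIONS and today <= CONTRACT_EXCEPTIONS[rule])
    ]
```

The same function can back a contract test: replay the captured failing payload and assert on the exact rule name, so the regression is pinned to the rule rather than to a generic failure.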

Pattern 3: identity and matching errors

The tell: the record is valid but unmatchable.

Customer IDs, product codes, variant identifiers, company accounts, marketplace seller IDs, payment references, and order numbers all drift. A single missing prefix turns a legitimate event into a duplicate or an orphan. Recovery is more than retrying the message: determine whether the identifier was absent, transformed incorrectly, created in the wrong system, or reused in a way the receiver cannot support. Make matching rules visible in the integration contract and test them against realistic legacy data, not clean fixtures.
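A matching rule becomes visible once it is written down as code that separates exact matches, matches that only succeed after normalization, and true orphans. The sketch below assumes a hypothetical `SKU-` prefix convention and helper names; your identifiers and canonical form will differ.

```python
def normalize_sku(raw: str, prefix: str = "SKU-") -> str:
    """Canonical form (assumed): trimmed, upper-case, prefix present."""
    value = raw.strip().upper()
    if not value.startswith(prefix):
        value = prefix + value
    return value


def match_report(incoming: list[str], known: set[str]) -> dict[str, list[str]]:
    """Split incoming identifiers into exact, normalized-only, and orphan."""
    report: dict[str, list[str]] = {"exact": [], "normalized": [], "orphan": []}
    for raw in incoming:
        if raw in known:
            report["exact"].append(raw)
        elif normalize_sku(raw) in known:
            # Matched only after repair: a sign the sender's format has drifted.
            report["normalized"].append(raw)
        else:
            report["orphan"].append(raw)
    return report
```

Running this against a sample of real legacy identifiers, rather than clean fixtures, shows how much of the match rate depends on silent repair. A growing "normalized" bucket is worth raising with the sender before it turns into orphans.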

Pattern 4: timing and consistency errors

The tell: the right data arrives in the wrong order or outside its consistency window.

Typical cases: inventory updates arrive after checkout, payment capture arrives after order export, enrichment arrives after publish, or batch jobs overlap real-time events. Classify each flow by latency requirement (checkout-blocking, customer-visible, operational same-day, finance-close, or analytical), then design retry, replay, and reconciliation to match. Not every timing error needs real-time architecture; some need better sequencing, lock rules, or a reconciliation dashboard.
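The latency classes above can be captured as a lookup from class to recovery policy, so that "outside its consistency window" becomes a testable condition. This is an illustrative sketch: the window durations, retry counts, and the `RecoveryPolicy` shape are assumptions chosen for the example, not recommended values.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class LatencyClass(Enum):
    CHECKOUT_BLOCKING = "checkout-blocking"
    CUSTOMER_VISIBLE = "customer-visible"
    OPERATIONAL_SAME_DAY = "operational same-day"
    FINANCE_CLOSE = "finance-close"
    ANALYTICAL = "analytical"


@dataclass(frozen=True)
class RecoveryPolicy:
    consistency_window: timedelta
    max_retries: int
    reconcile: bool  # True when a reconciliation job is the main safety net


# Illustrative values only; each team sets its own windows.
POLICIES = {
    LatencyClass.CHECKOUT_BLOCKING: RecoveryPolicy(timedelta(seconds=2), 1, False),
    LatencyClass.CUSTOMER_VISIBLE: RecoveryPolicy(timedelta(minutes=5), 3, False),
    LatencyClass.OPERATIONAL_SAME_DAY: RecoveryPolicy(timedelta(hours=4), 5, True),
    LatencyClass.FINANCE_CLOSE: RecoveryPolicy(timedelta(days=1), 5, True),
    LatencyClass.ANALYTICAL: RecoveryPolicy(timedelta(days=2), 0, True),
}


def is_outside_window(latency_class: LatencyClass, lag: timedelta) -> bool:
    """True when the observed lag exceeds the flow's consistency window."""
    return lag > POLICIES[latency_class].consistency_window
```

With a table like this, a five-second lag is an incident for a checkout-blocking flow and a non-event for an analytical one, which keeps the timing discussion about the flow's class instead of about the raw number.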

Pattern 5: retry and duplicate errors

The tell: the retries cause the damage, not the original failure.

The symptoms are duplicate orders, duplicate invoices, repeated emails, doubled marketplace updates, and confused support tickets. Retries are necessary; uncontrolled retries are a bug. Pair every retry policy with idempotency keys, duplicate detection, and a hard maximum retry window. When this pattern appears, inspect whether the sender generated a stable key, whether the receiver honored it, and whether support can tell a pending retry from a permanent failure. Every dead-letter entry must carry owner, cause, impact, recovery step, and replay rules.
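The three checks above (stable key on the sender, key honored by the receiver, bounded retry window) can be sketched in a few lines. Everything here is hypothetical: the key recipe, the in-memory `OrderReceiver`, the limits of five attempts and 900 seconds, and the `DeadLetter` fields are assumptions for illustration, and a real receiver would persist seen keys durably.

```python
import hashlib
from dataclasses import dataclass


def idempotency_key(source_system: str, business_object: str, object_id: str, action: str) -> str:
    """Derive a stable key: the same logical operation always yields the same key."""
    raw = "|".join([source_system, business_object, object_id, action])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class OrderReceiver:
    """Toy receiver that honors idempotency keys (in memory, for illustration)."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def handle(self, key: str, order_id: str) -> str:
        if key in self._seen:
            return "duplicate-ignored"
        self._seen[key] = order_id
        return "created"


MAX_ATTEMPTS = 5  # assumed limit


def should_retry(attempt: int, elapsed_seconds: float, max_window_seconds: float = 900.0) -> bool:
    """Retry only while both the attempt count and the time window allow it."""
    return attempt < MAX_ATTEMPTS and elapsed_seconds < max_window_seconds


@dataclass(frozen=True)
class DeadLetter:
    """What a dead-lettered message carries once retries are exhausted."""
    owner: str
    cause: str
    impact: str
    recovery_step: str
    replay_rule: str
```

Because the key is derived from business identity rather than from a per-attempt value such as a timestamp or a random ID, a retried message produces the same key and the receiver can safely ignore it.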

Pattern 6: operational visibility errors

The tell: the monitor is green, but the business flow is broken.

The API responds but orders are delayed. The job completes but rejects half the records. The queue drains but the downstream system stores unusable statuses. Technical uptime is not flow health. Dashboards should show business outcomes — orders exported, records rejected, average lag, stale queue depth, settlement mismatches, manual repairs, failed replays. Alerts should name the flow and the likely owner. "Job failed" is noise; "order export to ERP rejected 18 orders for missing tax code" is a call to action.
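An alert that names the flow, the count, the dominant cause, and the likely owner can be generated from the rejection records themselves. This is a minimal sketch with hypothetical names (`Rejection`, `flow_alert`) and a message format invented for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rejection:
    order_id: str
    reason: str


def flow_alert(flow: str, owner: str, exported: int, rejections: list[Rejection]) -> Optional[str]:
    """Build a business-level alert, or return None when nothing was rejected."""
    if not rejections:
        return None
    # Report the most frequent rejection reason as the likely cause.
    reason, count = Counter(r.reason for r in rejections).most_common(1)[0]
    total = exported + len(rejections)
    return (
        f"{flow} rejected {len(rejections)} of {total} orders; "
        f"top cause: {reason} ({count}); likely owner: {owner}"
    )
```

The function returns nothing for a clean run, so a completed job with zero rejections stays silent while a completed job that rejected records raises an alert someone can act on.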

How to use the playbook during an incident

Start with classification, not blame. Pick the most likely pattern and ask what evidence would disprove it. Preserve the failing payload, correlation IDs, timestamps, source system, destination system, and business object. Then assign one owner for immediate recovery and one for the contract or operating-model change.
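The evidence and ownership listed above fit naturally into a single record that triage can check for gaps. The structure below is a sketch: the field names and the `missing_evidence` helper are hypothetical, and the six enum members simply mirror the patterns in this playbook.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from enum import Enum


class Pattern(Enum):
    OWNERSHIP = 1
    CONTRACT = 2
    IDENTITY = 3
    TIMING = 4
    RETRY = 5
    VISIBILITY = 6


@dataclass
class IncidentRecord:
    pattern: Pattern              # most likely pattern, chosen first
    disproving_evidence: str      # what would show the classification is wrong
    failing_payload: str
    correlation_ids: list[str]
    observed_at: datetime
    source_system: str
    destination_system: str
    business_object: str
    recovery_owner: str           # owns immediate recovery
    change_owner: str             # owns the contract or operating-model change

    def missing_evidence(self) -> list[str]:
        """Names of fields that are still empty, in declaration order."""
        empty = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None or value == "" or value == []:
                empty.append(f.name)
        return empty
```

Keeping `recovery_owner` and `change_owner` as separate fields makes it obvious when an incident has been patched but nobody has been assigned the longer-term change.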

After recovery, update the runbook. A good action is specific: add an idempotency key, reject the bad payload earlier, add the missing alert, clarify the source of truth, update a mapper test, change release sequencing. "Improve monitoring" usually means the team has not understood the pattern yet.

Pair this playbook with API contract design, commerce runbooks, and data ownership modeling. Error handling is where architecture decisions become visible under pressure.

What to keep out of the playbook

Do not turn the playbook into a dump of every exception string. Keep transient stack traces in logs and observability tools. The playbook captures durable patterns, decision rules, and recovery ownership. If an error happened once with no repeatable cause, it belongs in the incident record. If it changes how teams design, test, monitor, or support a flow, promote it into the playbook.

Review the playbook after every major launch, platform upgrade, marketplace onboarding wave, or ERP release. The pattern list should evolve as the operating model matures.
