The visible failure is rarely the first failure
By the time anyone declares an integration "broken," the failure is often already weeks old. What gets noticed is the symptom: an order stuck in a queue, an inventory count that does not match the warehouse, a refund that cleared the gateway but never posted to finance, a partner escalation that lands on a Friday afternoon.
The original failure was a decision nobody made. No one named which system owns the truth for price, order state, catalog enrichment, or payment reconciliation. The connector did exactly what it was built to do; it just moved data across a boundary that was never defined. Code does not cause these failures. It exposes them.
Platform capability is not the constraint
SAP Commerce Cloud, Salesforce Commerce Cloud, Shopify, and commercetools each carry different assumptions about who owns data and how the platform should be extended. Treat those assumptions as interchangeable and the integration layer stops being a system and becomes a pile of exceptions, each one a workaround for a boundary that was never agreed.
This is why a connector can pass every unit test and still be wrong. It moves the payload correctly and breaks the operating model anyway, because correctness at the API is not correctness for the business.
Rescue starts with contracts, not a rebuild
The instinct in a failing program is to rewrite the connector. The faster fix is usually a sharper map of responsibility for each flow:
- which system creates the event
- which system is allowed to enrich or change it
- which system can reject it, and on what rule
- which team owns the failure path when it breaks
Make those four answers explicit and the rest gets easier. Implementation shrinks, monitoring gets meaningful, and handoffs stop being a performance of accountability that no one actually holds. We cover the recurring breakages in detail in the commerce integration error patterns playbook.
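As a minimal sketch, the four answers can be written down as one record per flow and checked for gaps before any connector work starts. Every name here (FlowContract, the field names, the example systems) is a hypothetical chosen for illustration, not part of any platform's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FlowContract:
    """One record per integration flow. All names are illustrative."""
    flow: str                   # e.g. "order.created"
    creator: str                # the system that creates the event
    enrichers: Tuple[str, ...]  # systems allowed to enrich or change it
    rejectors: Dict[str, str]   # system -> the rule on which it may reject
    failure_owner: str          # team that owns the failure path

    def gaps(self) -> List[str]:
        """Return the ownership questions this contract leaves unanswered."""
        problems = []
        if not self.creator.strip():
            problems.append("no creating system named")
        if not self.failure_owner.strip():
            problems.append("no team owns the failure path")
        for system, rule in self.rejectors.items():
            if not rule.strip():
                problems.append(f"{system} can reject without a stated rule")
        return problems


# Hypothetical contract with two answers still missing.
order_created = FlowContract(
    flow="order.created",
    creator="storefront",
    enrichers=("oms",),
    rejectors={"erp": "credit limit exceeded", "fraud-service": ""},
    failure_owner="",
)
for problem in order_created.gaps():
    print(f"{order_created.flow}: {problem}")
```

An empty result from gaps() does not prove the contract is right, only that each question has an answer someone can be held to.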
Decisions to settle before implementation
Most expensive incidents trace back to one of these decisions being left implicit:
- Name the source system and the consuming systems for every critical object, and separate who can correct bad data from who can only request a correction.
- Choose the integration pattern per flow, not globally: direct API, event, queue, file, iPaaS, middleware, and scheduled job each fit different latency and failure needs.
- Document the identifiers that join records across systems, including environment prefixes, legacy IDs, marketplace IDs, and financial references. A single missing prefix turns a valid event into a duplicate or an orphan.
- State latency in business terms: checkout-blocking, same-day operational, next-cycle enrichment, or finance-close reconciliation. Not every flow needs real-time.
- Design idempotency, duplicate detection, retry windows, dead-letter handling, and manual replay before the first cutover rehearsal, not after the first incident.
- Assign an owner to each failure mode, not just each system. That owner needs the authority to retry, repair, suppress, or escalate.
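The identifier and idempotency items above can be sketched together. This is an in-memory illustration under stated assumptions: join_key, IdempotentConsumer, and the key format are hypothetical, the retry loop has no backoff, and a real deployment would keep the seen keys and dead letters in durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def join_key(environment: str, object_type: str, source_id: str) -> str:
    """Build one cross-system key; refuse to emit a key with a missing part."""
    parts = (environment, object_type, source_id)
    if any(not part.strip() for part in parts):
        raise ValueError(f"incomplete identifier: {parts!r}")
    return ":".join(part.strip() for part in parts)


@dataclass
class IdempotentConsumer:
    handler: Callable[[dict], None]
    max_attempts: int = 3
    seen: set = field(default_factory=set)           # keys already processed
    dead_letters: list = field(default_factory=list)  # events awaiting an owner

    def consume(self, event: dict) -> str:
        key = event["idempotency_key"]
        if key in self.seen:
            return "duplicate"  # duplicate detection: never apply twice
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):  # bounded retry window
            try:
                self.handler(event)
            except Exception as exc:
                last_error = exc
                continue
            self.seen.add(key)
            return "processed"
        # Retries exhausted: park the event instead of dropping it.
        self.dead_letters.append({"event": event, "error": repr(last_error)})
        return "dead-lettered"

    def replay(self) -> List[str]:
        """Manual replay: re-consume every parked event once."""
        pending, self.dead_letters = self.dead_letters, []
        return [self.consume(item["event"]) for item in pending]


def post_order(event: dict) -> None:
    """Stand-in for the real downstream call."""


consumer = IdempotentConsumer(handler=post_order)
event = {"idempotency_key": join_key("prod", "order", "1001"), "total": 42}
print(consumer.consume(event))
print(consumer.consume(event))
```

The second call returns "duplicate" because the key was recorded after the first success. Who is allowed to call replay() is the ownership decision from the last item in the list, not something the code can settle.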
For platform-specific decisions on SAP Commerce, see SAP Commerce Cloud integration patterns.
Prove the operating model in thin slices
Build the first flow as a thin vertical slice that exercises the real ownership model end to end: one representative product or order, one transformation, one negative test, one retry, one alert, one support action. This proves the path before anyone scales mappings, and it reveals whether the named owners can actually make a decision when something fails.
The second slice adds the exceptions that production guarantees: missing attributes, stale inventory, duplicate identifiers, delayed payments, rejected orders, incompatible statuses, partial reversals. These are not edge cases. They are ordinary commerce, and treating them as first-class scenarios is what keeps support cost from arriving as a surprise after launch.
The third slice proves release and rollback. The team should know which jobs can pause, which queues can drain, which events can replay, which data must be reconciled, and which storefront behavior survives degraded service. A launch plan without recovery evidence is just a deployment schedule.
What changes after go-live
Run a weekly operational review until the flow is stable. Track failure count, time to recovery, stale records, manual interventions, mapping changes, and any ownership question still open. The first month is not only hypercare; it is when the permanent operating model gets proven against real data.
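A rough sketch of how some of those review numbers might be computed from incident records. The Incident shape and field names are assumptions made for illustration; a real review would pull them from whatever ticketing or monitoring tool the team already uses.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass(frozen=True)
class Incident:
    flow: str
    detected_at: datetime
    recovered_at: Optional[datetime]  # None while the incident is still open
    manual_intervention: bool


def weekly_summary(incidents: List[Incident]) -> dict:
    """Summarise one week of incidents for the operational review."""
    recovered = [i for i in incidents if i.recovered_at is not None]
    minutes = [
        (i.recovered_at - i.detected_at).total_seconds() / 60 for i in recovered
    ]
    return {
        "failure_count": len(incidents),
        "still_open": len(incidents) - len(recovered),
        "median_minutes_to_recovery": median(minutes) if minutes else None,
        "manual_interventions": sum(i.manual_intervention for i in incidents),
    }


week = [
    Incident("order.created", datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 9, 30), False),
    Incident("refund.posted", datetime(2024, 3, 5, 14, 0), datetime(2024, 3, 5, 15, 30), True),
    Incident("inventory.sync", datetime(2024, 3, 6, 8, 0), None, False),
]
print(weekly_summary(week))
```

The median is used here rather than the mean so that one long outage does not hide whether typical recovery is improving. That is a design choice for the sketch, not a requirement of the review.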
Two failure signals are worth watching for. If the dashboard shows failures but no one can state the business effect, the monitoring is too technical. If the team can describe the business effect but cannot trace it to a payload, job, or event, the observability is too shallow. The goal is a flow that operations and engineering can both read the same way.
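One way to make both signals checkable is to require every recorded failure to carry a business effect and a technical trace reference. The sketch below assumes a hypothetical FailureRecord shape; the field names and example values are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureRecord:
    flow: str
    business_effect: str  # e.g. "refund not posted to finance"
    trace_ref: str        # payload, job, or event identifier


def readability_gaps(record: FailureRecord) -> List[str]:
    """Name which audience cannot read this failure record."""
    gaps = []
    if not record.business_effect.strip():
        gaps.append("monitoring too technical: no business effect stated")
    if not record.trace_ref.strip():
        gaps.append("observability too shallow: no payload, job, or event reference")
    return gaps


records = [
    FailureRecord("refund.posted", "refund not posted to finance", "event:prod:refund:881"),
    FailureRecord("inventory.sync", "", "job:nightly-stock-export"),
    FailureRecord("order.created", "orders stuck before fulfilment", ""),
]
for record in records:
    for gap in readability_gaps(record):
        print(f"{record.flow}: {gap}")
```

A record with no gaps is one that operations and engineering can both act on without translating for each other.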
This article identifies the decisions that should be settled before delivery starts. Turning them into a route map, sequence, and risk register is what an architecture review does, and getting in touch is the fastest way to start one.
Related field guides
Architecture Decision
Commerce integration error patterns playbook
A field guide for classifying recurring commerce integration errors, assigning ownership, and turning incidents into better contracts, monitoring, and recovery paths.
Architecture Decision
How to Build a Commerce Architecture Decision Record Practice
Practical guidance for architecture teams to reduce SAP Commerce delivery risk and move toward measurable outcomes.