Time to ship

A quick update on what happened, where things stand, and the path to reopening.

By Ethan Marcus

It’s been a few days since our last public announcement, and this time we wanted to break the silence with something more than just a tweet.

To be candid, the past three weeks have been excruciatingly hard. We are at the edge of something novel with massive product-market fit, but right now we are unable to meet the demand in front of us. It is exhausting to feel like we are letting people down, even as we work tirelessly to catch up.

Over the past 21 days, we’ve worked through some of the most difficult technical challenges we’ve ever faced.

Sustaining this kind of intensity is strenuous, but we had no choice but to do what’s right to the best of our abilities. I could not be more proud of my team, or more thankful to the Spark team, for the sleepless nights poured into making this system scalable and reliable.

We believe in radical transparency, so here’s the play-by-play of what happened, where we are, and where we’re going.

What even happened?

On October 28th at 3pm PT, Flashnet went live.

Within minutes, we saw more concurrent traffic than Spark has ever processed in its history. In a vacuum, this is the kind of demand builders dream of. In reality, it acted as an immediate stress test that our infrastructure and our underlying settlement layer were not ready for.

The result was a cascade of failures that led to a deadlock state, forcing us to take the platform down.

Some Perspective

To understand the failure, you have to understand the velocity at which we are operating.

Spark is a custom Bitcoin L2 built from the ground up, not a fork, and has been on mainnet for only six months. Flashnet was built on top of this novel architecture in a similar timeframe.

This means we were operating without the safety net of years of historical stress-testing. The launch traffic exposed the friction points between these two new systems in a way that internal testing could not.

We are attempting something that has never been done on Bitcoin before, and when you scale complex distributed systems this quickly, all at once, you inevitably hit bumps. On launch day, we hit one hard. I take full responsibility for not planning for these failure modes more thoroughly before taking Flashnet live.

The Technical TLDR

The failure was not a single bug, but a chain reaction triggered by immense load on the system.

When we went live, the SSP (Spark Service Provider) that Flashnet relies on came under enormous pressure. The SSP is core to how wallets operate against Spark: when a user's set of leaves doesn't break down into the exact amount needed for a payment, the wallet can use the SSP to efficiently swap that distribution of leaves for another of equal total value, leaving it with a perfect set for the payment. This is all done via an atomic swap primitive. If a swap fails, the funds stay locked until the expiry time.
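
To make the mechanics concrete, here is a minimal sketch of the decision a wallet faces. The `Leaf` type and the `requestSspSwap` helper are hypothetical stand-ins for illustration, not Spark's actual SDK; the point is only that a payment needs leaves that sum exactly to the amount, and the SSP swap is the fallback when no exact subset exists.

```typescript
// Hypothetical types and helpers for illustration only -- not Spark's real SDK.
interface Leaf {
  id: string;
  sats: number; // value carried by this leaf
}

// Try to find a subset of leaves that sums exactly to `amount`.
// Classic subset-sum search, kept deliberately simple for illustration.
function findExactSubset(leaves: Leaf[], amount: number): Leaf[] | null {
  const pick = (i: number, remaining: number, chosen: Leaf[]): Leaf[] | null => {
    if (remaining === 0) return chosen;
    if (remaining < 0 || i === leaves.length) return null;
    return (
      pick(i + 1, remaining - leaves[i].sats, [...chosen, leaves[i]]) ??
      pick(i + 1, remaining, chosen)
    );
  };
  return pick(0, amount, []);
}

// If no exact subset exists, ask the SSP (via the hypothetical `requestSspSwap`)
// to atomically swap the wallet's leaves for an equal-value distribution that
// does contain an exact subset for the payment.
async function leavesForPayment(
  wallet: Leaf[],
  amount: number,
  requestSspSwap: (leaves: Leaf[], target: number) => Promise<Leaf[]>,
): Promise<Leaf[]> {
  const exact = findExactSubset(wallet, amount);
  if (exact) return exact;
  const reshaped = await requestSspSwap(wallet, amount); // same total value, new denominations
  return findExactSubset(reshaped, amount) ?? reshaped;
}
```

The swap never changes the total value the wallet holds, only how that value is broken into leaves, which is what makes the atomic swap primitive a natural fit.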

Under this pressure, thousands of atomic swaps began to fail simultaneously. In a low-volume environment, a failed swap is annoying but manageable: the funds simply wait out the timeout and unlock. At this scale, those timeouts caused cascading issues.

Our AMM draws on a pool of leaves to fulfill outbound requests. As swaps failed, the leaves tied to them became locked in that timeout limbo. The system kept trying to process new requests, but every failure locked up more leaves. Rapidly, the pools were starved of liquid leaves despite having plenty of funds on paper. We had entered a deadlock.

Because Spark isn't a traditional blockchain, operators must reach strict consensus on each action individually. The load was so high that some leaves fell out of sync, forcing the network's gossip mechanism to play catch-up. This created a consistency lag: a discrepancy between what Flashnet thought was happening and what Spark considered final. That made breaking the deadlock in real time impossible.
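
To see why the pool can starve even though no funds are lost, here is a toy model of that dynamic. The `LeafPool` class and its fields are illustrative stand-ins, not Flashnet's AMM code; the point is simply that leaves locked by failed swaps still count toward the total, but cannot back new requests until their expiry passes.

```typescript
// A toy model of the failure mode, not Flashnet's actual pool code: leaves
// locked by failed swaps are still owned by the pool, but cannot back new
// requests until their timeout expires.
interface PoolLeaf {
  sats: number;
  lockedUntil: number | null; // unix ms; null = liquid
}

class LeafPool {
  constructor(private leaves: PoolLeaf[]) {}

  totalSats(): number {
    return this.leaves.reduce((sum, l) => sum + l.sats, 0);
  }

  // Only leaves that are not sitting in timeout limbo can serve new requests.
  availableSats(now: number): number {
    return this.leaves
      .filter((l) => l.lockedUntil === null || l.lockedUntil <= now)
      .reduce((sum, l) => sum + l.sats, 0);
  }

  // A failed swap loses nothing, but it parks the leaf until expiry.
  recordFailedSwap(leaf: PoolLeaf, now: number, expiryMs: number): void {
    leaf.lockedUntil = now + expiryMs;
  }
}

// Ten leaves of 10k sats each; every swap attempt fails and locks its leaf.
const leaves: PoolLeaf[] = Array.from({ length: 10 }, () => ({ sats: 10_000, lockedUntil: null }));
const pool = new LeafPool(leaves);
const now = Date.now();
for (const leaf of leaves) pool.recordFailedSwap(leaf, now, 60 * 60 * 1000);
console.log(pool.totalSats());        // 100000 -- solvent on paper
console.log(pool.availableSats(now)); // 0 -- nothing liquid; the pool is deadlocked
```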

Taking Ownership

It is easy to point fingers when systems overlap, but the reality is rarely black and white. This was a mix of circumstances, assumptions, and the inherent risks of innovation.

On the Flashnet side: we made many mistakes. We operated on too many assumptions about Spark's and the SSP's upper limits. Our error handling for SSP swaps was insufficient for a high-load failure cascade. We built for the "happy path" and didn't adequately stress-test the "doomsday path."
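
As one illustration of the class of safeguard we mean, a load-shedding circuit breaker stops issuing new swaps once failures spike, instead of letting every fresh attempt go out, fail, and lock more leaves. This is a generic sketch of the pattern, not the specific mechanism we shipped:

```typescript
// A generic load-shedding circuit breaker -- illustrative of the pattern only.
// Once failures cross a threshold, new swap attempts are rejected outright for
// a cooldown period instead of being sent off to fail and lock more leaves.
class SwapCircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxFailures: number, // consecutive failures before tripping
    private readonly cooldownMs: number,  // how long to shed load once tripped
  ) {}

  private isOpen(now: number): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // cooldown over: allow traffic again
      this.failures = 0;
      return false;
    }
    return true;
  }

  async run<T>(swap: () => Promise<T>, now = Date.now()): Promise<T> {
    if (this.isOpen(now)) {
      throw new Error("circuit open: shedding load instead of locking more leaves");
    }
    try {
      const result = await swap();
      this.failures = 0; // any success resets the counter
      return result;
    } catch (err) {
      if (++this.failures >= this.maxFailures) this.openedAt = now;
      throw err;
    }
  }
}
```

The idea is that rejecting a request outright is far cheaper than letting a swap fail, because a rejection locks nothing.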

On the Spark side: the network had not been battle-tested at this scale. We pushed the L2 to a breaking point it had never seen before, revealing latency and synchronization issues that were invisible during lower-traffic periods.

What We’ve Fixed

Since the moment the issues surfaced, we have been heads-down with the Spark team, fixing and hardening the systems involved:

Spark has:

  • Lowered Spark operators’ scaling pressure under load by >90%
  • Achieved zero slow-synchronization states under high-load and contention testing
  • Achieved sustained 200+ TPS

Flashnet has:

  • Added fallbacks across the Flashnet stack to ensure reliability under pressure
  • Achieved sustained 100+ TPS on Flashnet under adversarial conditions

These are foundational to ensuring that when we go live, we stay live.

To Our Users

I want to address the elephant in the room: the market is currently paused.

We are acutely aware that thousands of you hold positions that you are waiting to add to or unwind. While you retain full ownership of your assets, we understand that without the active marketplace, your ability to manage those positions is restricted. In this industry, access is everything, and this downtime isn’t the standard we hold ourselves to. We do not take this lightly. The decision to keep the market paused is not one we make to protect ourselves, but to make sure that when we do open the doors, they stay open. We refuse to reopen a system until we are confident it can handle the demand without risking stability.

We are targeting a return to full operations no later than December 1st.

However, please know that our team is working around the clock, literally 24/7, to make it happen sooner than that. We are pushing the limits to get you back in as quickly and safely as possible.

Our focus right now is moving forward; we are simply working until the job is done. We empathize deeply with every single user stuck in this limbo, and our number one priority is delivering the UX you were promised.

Lastly, I know we’ve been slow on comms. We are a small team, and we’ve been heads-down, focused on the product and nothing else. We will do better. My DMs are always open @ethannmarcus.

We are going to do everything in our power to make this up to you.

Back to building.