Upgrade Proposal #10: Granite Network Upgrade

On behalf of the Developer Advisory Board, we approve this upgrade to move to a vote.

3 Likes

Although the fallback was activated, strictly speaking, an emergency upgrade has not been performed. The contracts deployed on op-mainnet have not been changed from those approved in Protocol Upgrade #7: Fault Proofs. The ability for the Guardian or Deputy Guardian to switch to the permissioned fallback was built into that proposal as part of a staged, responsible rollout of fault proofs.

That said, other applications that rely on OptimismPortal.respectedGameType will be affected by the change and will need to wait for new dispute games to resolve. The exact impact depends on the details of each application and how it uses dispute game results.
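For a concrete (purely illustrative) picture of what such an integration can look like, the sketch below reads the portal's respected game type and only treats a game result as trustworthy if the game is of that type and resolved in favour of the root claim. The addresses are placeholders, the GameStatus ordering is an assumption, and the ABI fragments follow the public OptimismPortal2 / FaultDisputeGame interfaces, so verify everything against the deployed contracts before relying on it.

```typescript
import { ethers } from "ethers";

// Illustrative sketch only: addresses are placeholders and the minimal ABI
// fragments assume the public OptimismPortal2 / FaultDisputeGame interfaces.
const L1_RPC = "https://example-l1-rpc"; // placeholder L1 RPC endpoint
const PORTAL = "0x0000000000000000000000000000000000000000"; // OptimismPortal proxy (placeholder)

// Assumed GameStatus enum ordering from the OP Stack dispute game types.
const DEFENDER_WINS = 2n;

const provider = new ethers.JsonRpcProvider(L1_RPC);
const portal = new ethers.Contract(
  PORTAL,
  ["function respectedGameType() view returns (uint32)"],
  provider,
);

// Only trust a game's result if it is of the currently respected type and
// resolved in favour of the root claim (DEFENDER_WINS).
async function isTrustedResult(gameAddr: string): Promise<boolean> {
  const game = new ethers.Contract(
    gameAddr,
    [
      "function gameType() view returns (uint32)",
      "function status() view returns (uint8)",
    ],
    provider,
  );
  const [respected, gameType, status] = await Promise.all([
    portal.respectedGameType(),
    game.gameType(),
    game.status(),
  ]);
  return gameType === respected && status === DEFENDER_WINS;
}
```

A check along these lines starts returning false for previously respected games as soon as the respected game type changes, which is exactly the waiting period described above.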

Apologies, the description of this change is inaccurate; I'll get the post updated to fix that. There is a new setAnchorState method on the AnchorStateRegistry which can only be called by the Guardian or Deputy Guardian, and it accepts as input a reference to an existing dispute game that resolved as DEFENDER_WINS. This provides the ability to reset the anchor state back to a valid game in the event that a game with an invalid proposal incorrectly resolves as DEFENDER_WINS and updates the AnchorStateRegistry with that invalid proposal. Previously, this would have required upgrading the FaultDisputeGame contract used for a game type to one that uses a new AnchorStateRegistry. Referencing an existing game ensures the Guardian or Deputy Guardian can't set an arbitrary anchor state; it has to be one that a fault dispute game found to be valid.
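As a rough sketch of how that reset flow could look from the Guardian's side (hedged: the exact setAnchorState signature below is an assumption based on the description above, and both the DEFENDER_WINS requirement and the guardian-only access control are enforced on-chain regardless of any off-chain pre-flight check):

```typescript
import { ethers } from "ethers";

// Assumed GameStatus value for a game resolved in favour of the root claim.
const DEFENDER_WINS = 2n;

// Pre-flight sketch for resetting the anchor state to a known-good game.
// Assumes a signature along the lines of setAnchorState(address), per the
// description above; the contract itself also enforces that the caller is the
// Guardian or Deputy Guardian and that the game resolved DEFENDER_WINS.
async function resetAnchorState(
  registryAddr: string,
  gameAddr: string,
  guardian: ethers.Signer,
): Promise<void> {
  const game = new ethers.Contract(
    gameAddr,
    ["function status() view returns (uint8)"],
    guardian,
  );
  if ((await game.status()) !== DEFENDER_WINS) {
    throw new Error("anchor state can only be reset to a game that resolved DEFENDER_WINS");
  }

  const registry = new ethers.Contract(
    registryAddr,
    ["function setAnchorState(address _game)"], // assumed signature, see lead-in
    guardian,
  );
  const tx = await registry.setAnchorState(gameAddr);
  await tx.wait();
}
```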

The situation of the anchor state going stale while we have switched back to the permissioned game can be solved without needing a permissioned role by creating a permissionless game periodically. While the portal won't respect that game in the fallback state, it will still update the anchor state if it resolves as DEFENDER_WINS, and that updated anchor will subsequently be respected by the fault proof game.
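For illustration, keeping the anchor fresh in that way could look roughly like the sketch below, which creates a permissionless game via the DisputeGameFactory. The game type constant, the bond lookup via initBonds, and the extraData encoding (the L2 block number) are assumptions about the standard FaultDisputeGame setup, and the honest output root has to be computed elsewhere (e.g. from an op-node).

```typescript
import { ethers } from "ethers";

// Assumed: game type 0 is the permissionless (CANNON) FaultDisputeGame.
const CANNON_GAME_TYPE = 0;

// Sketch of proposing a permissionless game so the anchor state keeps advancing
// while the portal only respects the permissioned fallback game type.
async function proposePermissionlessGame(
  factoryAddr: string,
  proposer: ethers.Signer,
  outputRoot: string, // bytes32 honest output root for l2BlockNumber (computed elsewhere)
  l2BlockNumber: bigint,
): Promise<void> {
  const factory = new ethers.Contract(
    factoryAddr,
    [
      "function initBonds(uint32) view returns (uint256)",
      "function create(uint32 _gameType, bytes32 _rootClaim, bytes _extraData) payable returns (address)",
    ],
    proposer,
  );

  // The required bond and the extraData encoding follow the standard
  // FaultDisputeGame conventions; verify against the deployed contracts.
  const bond = await factory.initBonds(CANNON_GAME_TYPE);
  const extraData = ethers.AbiCoder.defaultAbiCoder().encode(["uint256"], [l2BlockNumber]);

  const tx = await factory.create(CANNON_GAME_TYPE, outputRoot, extraData, { value: bond });
  await tx.wait();
}
```

Running something like this on a schedule keeps the permissionless anchor state recent, so no permissioned intervention is needed when the respected game type is switched back.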

1 Like

The safeguards, including the ability to change the respected game type, are documented in the original fault proofs governance proposal, the Optimism docs for fault proofs, and the OP Stack specification.

If you need help understanding how to work with these safeguards, the OP Stack developer forum is a good place to ask questions. We welcome feedback to help us improve, especially if there are areas where we can increase clarity!

1 Like

The staged rollout of fault proofs is designed to ensure that the chain and user assets remain secure even if there are issues within the fault proof system, and we believe that passing this upgrade is appropriate and consistent with our previously published audit framework. This allows us to continue improving the system safely and to build confidence in it over time. The safeguards were audited prior to the initial launch of fault proofs, and none of the findings from these three audits allowed those safeguards to be circumvented. Notably, this proposal fixes three issues which were not identified by these audits, despite the very high quality of the auditors. If you have any specific concerns with the changes outlined in the proposal, we're always happy to discuss further! The audit framework gives some more detail on how OP Labs thinks about audits.

Could you also elaborate on Cantina 3.3.5, i.e. the incorrect implementation of the srav opcode? Why is it considered low severity?

The srav opcode deviating from the spec is only considered low severity because the actual MIPS instructions generated by the Go compiler when compiling op-program are not affected by the difference in behavior. So it doesn't currently have an impact on the fault proof system, but it may in the future due to changes in the Go compiler or op-program. Additional information about this class of bugs can be found in this section of the Cannon FPVM spec.
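For context on what the instruction is supposed to do, here is a minimal TypeScript sketch of srav as the MIPS specification defines it: an arithmetic (sign-extending) right shift of rt by the low five bits of rs. The specific way Cannon's implementation deviated is described in the audit report and the linked spec section rather than restated here.

```typescript
// Spec-compliant SRAV: shift rt right arithmetically (sign-extending) by the
// amount held in the low 5 bits of rs. All values are 32-bit two's complement.
function srav(rt: number, rs: number): number {
  const shift = rs & 0x1f;  // only the low 5 bits of rs select the shift amount
  return (rt | 0) >> shift; // '>>' in JS/TS is a 32-bit arithmetic shift
}

// A logical (zero-filling) shift diverges for negative values:
console.log((srav(-8, 1) >>> 0).toString(16)); // "fffffffc" -> sign bit preserved
console.log(((-8 >>> 1) >>> 0).toString(16));  // "7ffffffc" -> sign bit lost
```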

Secondly, it's a bit concerning to me that the fault proof system is outside the scope of external audits. I understand that in Stage 1, given the fallback, the blacklisting mechanism, the pause button and the Security Council, a bug is unlikely to affect the safety of the system. Citing Vitalik, the amazing property of Stage 1 rollups is being safe AND live while assuming only a small (just over 25%) honest minority in the SC, together with a small probability of bugs. If this probability is not small, and the system is often in the fallback state, the system mostly relies on >75% honesty, defeating the purpose of Stage 1.

The audit framework gives some more detail on the reasoning behind the choice to leave the fault proof system outside of the initial scope. We feel that one of the best ways to gain confidence in the fault proof system is to operate it in the real world. This proposal is a good example of that: three audits have been completed, and while they did find a number of issues, there were also issues they did not find that were instead identified either through the bug bounty program or through experience actually operating the fault proof system.

You are absolutely right that reducing the probability of bugs in the fault proof system (as this upgrade itself seeks to do) is important, and we felt that the path we took would ultimately get us to a bug-minimized system the fastest. At the end of the day, the community should be empowered to hold us accountable for how upgrades happen. While we (and even those who voted no on the original upgrade) continue to believe that the risks posed by this approach are not about the fundamental security of the system, there should continue to be space for the community to push back if they feel that other (i.e. reputational) risks are too high going forward.

Hi @inphi,

Thanks for putting up the detailed proposal for an upgrade to fix the vulnerabilities found in the audits conducted after the deployment of the Fault Proofs upgrade. We appreciate the team's effort in fixing the issues and improving the system further while activating the permissioned fallback mechanism with proper coordination and caution.

Let us clarify two points:

Looking at Cantina's audit report, Cantina 3.1.1 was classified as "Critical", the most severe bug type, which must be fixed ASAP, while you indicated this bug's severity as "High". That's possibly because the team considered a potential exploit infeasible given the Go runtime's memory protection, but we believe it's misleading, since this is important information for evaluating how the Fault Proofs system should be reviewed and audited going forward. You mentioned that other issues which weren't found by the audits were identified through running the system in production, but that is not necessarily an argument for deploying the system before the audits are complete.

In the last upgrade proposal, we (alongside @zachobront) expressed concern that the system would be deployed without proper audits of the upgrade code, even though we understood the clarification you made on how OP Labs approached the upgrade and its auditing. We also suggested that, in coordination with the Security Council, OP Labs could reconsider the deployment timing. The deployment nevertheless occurred as planned, and now a critical bug has caused a fallback operation. Was there any discussion of the concerns we raised? What responsibility does the Security Council bear for the situation?

3 Likes

The following reflects the views of L2BEAT’s governance team, composed of @krst and @Sinkas, and it’s based on the combined research, fact-checking, and ideation of the two.

We'll be voting FOR this proposal as we find it important to fix the already known bugs in the production environment. However, we would like to raise our concern as to whether the current approach of releasing early and relying on a fallback mechanism to prevent anything bad from happening is the right one.

As @Zachobront mentioned in a comment under the Protocol Upgrade #7 proposal, the Foundation's approach to the fault dispute mechanism poses a reputational risk. As it turned out, Zach's concerns were on point, and we have now had to revert to the permissioned fallback mechanism while the bugs found in the fault dispute mechanism are patched.

While it might not seem like a big deal, given that users' funds were not at risk thanks to the fallbacks, it actually is, since there's a very thin line between the current situation and a case where the Security Council is needed to secure the chain.

Luca Donnoh, a researcher at L2BEAT, has written an article that explains the risks associated with potential lack of trust in the fault proof mechanism:

… Even if the protocol requires a lot of funds to be pooled to protect it, one can argue that finding liquidity is not a difficult task since it eventually guarantees very high profits, assuming that the proof system works correctly. We argue that this assumption shouldn't be taken lightly. Let's say that an attacker actually spends billions of dollars to attack a protocol, and then signals on social media or with an onchain message that they found a bug in the challenge protocol where defenders are guaranteed to lose their funds. No one knows if the bug actually exists or if it's just a bluff, but it can be used as an effective deterrent to prevent reaching the target amount of funds needed to save the chain. …

In simple terms, while the approach of deploying early and "testing in production" is safe in the sense that there is no (or very limited) risk to users' funds thanks to the fallback mechanisms in place, we feel that if it leads to repeated instances where we actually have to use those fallbacks, it can damage confidence in the design of the system in the long run, and therefore make it much harder to get it working in a Stage 2 environment where no such fallbacks will be available.

6 Likes

Apparently I was confused about the deadline for this proposal; I was certain it ended at 7pm UTC, but it ended a few minutes before I posted the rationale. However, the vote still passed and we were supportive of it, so no harm done, but I am sorry about it and we will make sure to vote earlier in the future to avoid such cases.

1 Like

We vote FOR this proposal.

In order to explain our rationale behind this decision, we want to provide some context and considerations from our perspective.

Background:
The Fault Proof proposal was introduced three months ago. The upgrade included one of the most anticipated implementations: real fault proofs. However, the proposal included two key aspects that raised many questions and doubts about the actual impact and risk of the upgrade: the lack of a complete audit of the system and, as a consequence, the introduction of the Guardian roles.

Amid various concerns and questions, Zach raised a key comment regarding the risks introduced by this upgrade; we understood most of them as falling into the "reputational risk" category, since the Guardian role was specifically set up to minimize any existential risk. Our position was to abstain due to this risk, aligning with the opinion of the DAB lead. We highly value the minimization of reputational risk. Nevertheless, the Collective sent a strong signal in favor of the upgrade.

Granite:
As detailed, bugs were found, which was certainly an expected outcome given the circumstances. Most of the fixes are related to the findings identified in the audit results. In order to move forward, that is, to return to the mode where fault proofs are fully operational, these fixes need to be implemented.

Implementing this upgrade should objectively move us to a safer state than before the permissioned mode was triggered. However, as has been noted, Fault Proofs are complex, so more bugs could still be present. As the safeguards are assumed to be well audited and are managed by the Security Council and Foundation, it should be acceptable to continue running the system as is, even though there are multiple concerns about its design, implementation, and maturity.

Going ahead:
All the discussions across various instances about this upgrade have left us with several points to consider for the Collective:

  1. Highlight the importance of the Developer Advisory Board in keeping delegates well informed, making recommendations, and outlining expected outcomes for each possible choice. Also, all delegates should ensure that every aspect of protocol upgrade proposals is sufficiently understood before offering support.
  2. Some members of the Collective have expressed a preference for more conservative measures, and this should be taken into account. When weighing audits against shipping, the balance might lean more toward the former.
  3. Related to point (2), the Collective should revisit the Audit Framework, as reputational risk may deserve more weight than the current version gives it.
  4. Set expectations for what the Fault Proof roadmap should look like, including communication of the current constraints and challenges around it and how the system should evolve, regardless of the approach to a multi-proof system.
  5. Appropriately disclose how the running and monitoring of the system actually work, and which nice-to-have features it would be appropriate to encourage, for governance's awareness. This includes any action that could add redundancy to the system's monitoring.

6 Likes

We are aware of the current security gaps in the fault proof system and how crucial the proposed fixes are for the future of the chain. By addressing these bugs, even the weakest links in the Optimism chain will be strengthened, resulting in a more robust system. Therefore, as the ITU Blockchain Delegation Committee, we support this proposal.

2 Likes