Upgrade Proposal #10: Granite Network Upgrade

Executive Summary

Hi, I’m Mofi, a protocol engineer at OP Labs. OP Labs is a software development company focused on the Optimism ecosystem and a core developer of the OP Stack. We provide some services to, but do not represent or speak on behalf of, the Optimism Foundation.

This upgrade is proposed in response to security vulnerabilities identified during a series of third-party security audits by Spearbit, Cantina, and Code4rena. None of the vulnerabilities have been exploited, and user assets are not and were never at risk. However, out of an abundance of caution, the permissioned fallback mechanism has been activated in order to avoid any potential instability while the vulnerabilities are patched. For more information on the permissioned fallback mechanism and the OP Stack’s defense-in-depth approach to Fault Proof security, see the documentation.

The upgrade includes both a set of smart contract upgrades to fix the vulnerabilities identified in the audit as well as an L2 hardfork to improve the stability and performance of the fault proof system. In addition, we propose extending the capabilities of the Guardian and DeputyGuardian to set the anchor state for the fault proof system in order to prevent referencing invalid anchor states. Aside from implementing these fixes, the primary impact of this upgrade would be to reset user withdrawals at the planned time, similar to the initial Fault Proof upgrade.

Motivation

As described in the original upgrade proposal, our rollout strategy for Fault Proofs focuses on securing the fundamental security mechanisms first, then building confidence in the correctness of the rest of the system over time. Executing this strategy successfully requires that:

  1. The fallback mechanisms are activated whenever there is a risk of a security vulnerability.
  2. Any vulnerabilities that may arise are swiftly patched.

Therefore, the Foundation (via the Deputy Guardian role) activated the permissioned fallback, which restricts output proposals to a trusted proposer, and we at OP Labs created this proposal to resolve the vulnerabilities identified by the security audits. Note that the Guardian role is generally authorized to enforce “Safety over Liveness” in the system: it can pause and unpause withdrawals and, as part of the Fault Proofs upgrade, was authorized to intervene in the event that a bug would allow an invalid L2 output to be finalized.

Specifications

The specification for the proposed changes can be found in the specs repo. Details of the individual audit issues are enumerated in the Audit Issues section below.

Technical Details

In the section below, we will summarize each of the audit issues we are fixing as well as the smart contracts/off-chain components affected by the upgrade. The full reports for each audit can be found here:

Note that this upgrade only contains the most important issues (in our opinion) from each audit. We plan on addressing other, lower-severity issues in future upgrades.

Audit Issues

The table below lists all the audit issues that are fixed in this upgrade. Issue severities have been updated to match the Optimism ImmuneFi bounty. While the auditors did discover some high severity issues, no user assets were ever at risk. All of the audit issues listed below can be detected by our monitoring tooling. Had an exploit been detected, the Deputy Guardian role - which is held by the Optimism Foundation and revocable by the Security Council - would have been expected to blacklist any exploitable dispute games or activate the permissioned fallback.

Note that we have updated the spec to clarify some assumptions around the Cannon VM and program. Specifically, we trust that the Go compiler will emit proper MIPS32 programs. As a result, the Cantina issues that reference problems related to invalid MIPS32 programs are considered out-of-scope and will not be fixed.

Unless otherwise noted, vulnerabilities in the dispute game all occur at MAX_GAME_DEPTH and are therefore classified as medium.

| Issue ID | Severity | Summary |
| --- | --- | --- |
| Cantina 3.1.1: Allocation overflow could allow for arbitrary code execution | High | Mmap calls did not perform memory bounds checking, which allowed memory pointers to wrap around to zero and access the entire memory of the fault proof program including the data and text sections of active MIPS programs. This could lead to arbitrary code execution within the VM, effectively breaking the VM’s correctness guarantees. No PoC was given, and it is our belief that such an exploit is infeasible due to memory protections employed by the Go runtime. Nevertheless, given the potential impact of the issue we worked with the auditors to classify this as a high severity issue and deploy a fix. |
| C4 H-01: Invalid DISPUTED_L2_BLOCK_NUMBER is passed to VM | High | An attacker can counter a valid output claim by providing a trace containing one block after the original claim. For example, if an output root is proposed for block 13, the attacker could counter using a trace that includes valid blocks up to block 14. This issue is classified as high severity since it occurs above MAX_GAME_DEPTH. |
| Spearbit 5.1.1: PreimageOracle.loadPrecompilePreimagePart an outOfGas error in the precompile will overwrite correct preimageParts | Medium | See the “Notes on Spearbit 5.1.1” section below. |
| C4 H-02: The LPP challenge period can cause malicious and freeloader claims to be uncounterable and can also cause freeloader claims to be abused to entrap honest challengers | Medium | The clock extension mechanism is designed to give an honest actor time to counter freeloader claims, even though in that case they will inherit the opposing chess clock which may have very limited time remaining. Since the LPP challenge period was longer than the clock extension period, the clock extension granted in that case would not be sufficient to allow the honest actor to complete a call to step. While the overall game still resolves correctly, the challenger would lose bonds posted in attempting to counter the freeloader claim. |
| C4 H-05: An attacker can bypass the challenge period during LPP finalization | Medium | An attacker can bypass the large preimage challenge period by calling addLeavesLPP with _finalize set to false. Since the challenge period timestamp is never set, attackers can then call squeezeLPP, thereby bypassing the challenge period and inserting invalid data into the preimage oracle. |
| Cantina 3.3.5: Wrong implementation of srav | Low | The srav instruction does not mask the 5 lower bits of rs, which is nonconformant with the MIPS specification. This could lead to undefined behavior. |
| Cantina 3.4.2: Location of registers array in memory should be verified | Low | The on-chain MIPS VM does not verify that the registers array is allocated right after the state struct. Validating this would make the code more defensive. |
| Spearbit 5.2.5: Preimage proposals can be initialized multiple times | Low | initLPP() does not check if a proposal already exists. This could lead to loss of funds if a user provides the same LPP UUID multiple times. |
| Spearbit 5.2.3: Extension period not applied correctly for next root when SPLIT_DEPTH is set to 1 or less | Low | When SPLIT_DEPTH is set to 1, the extension period for the next root is calculated as zero, which results in no extension period being applied. If SPLIT_DEPTH is set to zero, subsequent game moves will revert due to an integer underflow. |
| Spearbit 5.1.2: Invalid Ancestor Lookup Leading to Out-of-Bounds Array Access | Low | An out of bounds array access can occur when MAX_GAME_DEPTH = SPLIT_DEPTH + 1. While this is unlikely to happen in practice, we have updated the FDG constructor to require that SPLIT_DEPTH + 1 < MAX_GAME_DEPTH. |
| Spearbit 5.2.4: Inconsistent _partOffset check and memory boundaries in loadLocalData function | Low | The _partOffset parameter is handled inconsistently in some places within the loadLocalData function. |
| Spearbit 5.2.6: _clockExtension and _maxClockDuration are not validated correctly in DisputeGame constructor | Low | If a dispute game is initialized with a clock extension set to more than half of the max duration, move transactions during the execution trace bisection will revert since the difference between 2 * the clock extension and the game’s max duration will underflow. |
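
The constructor-level findings in the table (Spearbit 5.1.2, 5.2.3 and 5.2.6) all come down to parameter validation. The Python sketch below illustrates checks of that kind. The function name, error messages, and exact bounds are assumptions derived from the issue descriptions above, not the deployed Solidity, and the sample values in the usage note are purely illustrative.

```python
def validate_game_params(max_game_depth: int, split_depth: int,
                         clock_extension: int, max_clock_duration: int) -> None:
    """Raise ValueError if the parameters hit one of the audited edge cases.

    Hypothetical helper: it mirrors the issue descriptions, not the contract code.
    """
    # Spearbit 5.2.3: a split depth of 0 or 1 breaks the extension-period logic.
    if split_depth < 2:
        raise ValueError("split depth must be at least 2")
    # Spearbit 5.1.2: MAX_GAME_DEPTH == SPLIT_DEPTH + 1 permits an out-of-bounds lookup.
    if split_depth + 1 >= max_game_depth:
        raise ValueError("split depth + 1 must be below the max game depth")
    # Spearbit 5.2.6: max_clock_duration - 2 * clock_extension must not underflow.
    if 2 * clock_extension > max_clock_duration:
        raise ValueError("clock extension must be at most half the max clock duration")
```

For example, `validate_game_params(73, 30, 3 * 3600, 302400)` passes, while a split depth of 72 with a max depth of 73 is rejected.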

Notes on Spearbit 5.1.1

By calling loadPrecompilePreimagePart with less gas than necessary, an attacker could produce an outOfGas error in the precompile. If there is enough gas left in the loadPrecompilePart function, a valid preimage could be overwritten with the outOfGas error itself. This would result in an incorrect game outcome.

The function of the loadPrecompilePreimagePart method is to allow certain expensive precompiles - namely ecrecover, ecpairing, and kzg_point_evaluation - to be accelerated. Accelerated precompiles offload their execution to an L1 oracle. Since precompiles are implemented natively rather than via EVM opcodes, this improves Cannon’s performance and allows challengers to quickly generate traces for blocks filled with these computationally expensive calls. To address the outOfGas issue, we’re adding a minimum gas requirement to the PreimageOracle to ensure there’s enough gas to accelerate precompiles on L1.
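
A minimum gas requirement of this kind can be sketched in Python. The only protocol fact used is the EIP-150 rule that a call can forward at most all but one 64th of the remaining gas. The function names and the `reserved_gas` parameter (gas kept back for the caller's own bookkeeping) are hypothetical and do not reflect the actual PreimageOracle code.

```python
def max_forwardable_gas(gas_available: int) -> int:
    """EIP-150: a call can forward at most all but 1/64 of the remaining gas."""
    return gas_available - gas_available // 64


def check_min_gas(gas_available: int, required_gas: int, reserved_gas: int) -> None:
    """Fail early instead of letting the precompile run out of gas.

    required_gas: gas the precompile needs to run to completion.
    reserved_gas: hypothetical amount the caller keeps for its own work.
    """
    usable = gas_available - reserved_gas
    if usable < 0 or max_forwardable_gas(usable) < required_gas:
        raise RuntimeError("not enough gas to run the precompile")
```

Rejecting the call up front means an out-of-gas result from the precompile can never be recorded in place of a valid preimage.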

Another related issue with loadPrecompilePreimagePart is that the gas available to accelerate precompiles on L1 may be insufficient given the cost of executing them on L2. This is because the gas provided to the precompile accelerator contract on L1 can never exceed 63/64ths of the gas limit on L2. This is a problem for precompiles that have a dynamic gas cost of execution. Of the accelerated precompiles, ecPairing is the only one that contains this vulnerability, as its gas cost scales with its input size.

To fix this problem, we are proposing an L2 hardfork to limit the maximum input size provided to the ecPairing precompile to 112687 bytes. This number is high enough to enable all known use cases of the ecPairing precompile, but low enough to enable the challenger to generate traces for larger blocks in a timely manner. While this is technically a divergence from the EVM, our on-chain data has found no calls to the ecPairing precompile with an input size over 1152 bytes. The provided limit is therefore roughly two orders of magnitude larger than any known use case, which we believe is sufficiently safe.
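
The relationship between input size and gas can be sketched in Python. The gas schedule (45,000 base plus 34,000 per 192-byte pair) is the one from EIP-1108. The function names are illustrative, and the 1152-byte figure is the largest observed input cited under Security Considerations.

```python
PAIR_SIZE = 192            # bytes per (G1, G2) point pair
BASE_GAS = 45_000          # EIP-1108 base cost
PER_PAIR_GAS = 34_000      # EIP-1108 cost per pair
MAX_INPUT_SIZE = 112_687   # proposed Granite limit, in bytes


def ec_pairing_gas(input_len: int) -> int:
    """Gas charged for an ecPairing call with input_len bytes of input."""
    return BASE_GAS + PER_PAIR_GAS * (input_len // PAIR_SIZE)


def within_granite_limit(input_len: int) -> bool:
    """True if the input would be accepted under the proposed size limit."""
    return input_len <= MAX_INPUT_SIZE


print(ec_pairing_gas(MAX_INPUT_SIZE))        # 19969000
print(ec_pairing_gas(1152))                  # 249000
print(round(MAX_INPUT_SIZE / 1152, 1))       # 97.8
```

Under this schedule a maximal 112687-byte input covers 586 complete pairs and costs just under 20 million gas, while the largest observed input costs 249,000 gas. Reading the limit as a cap of roughly 20 million gas is an inference from this arithmetic, not a statement from the proposal.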

Additional Fixes

In addition to the audit fixes above, we are also proposing the following additional fixes:

  1. We propose reducing the ChannelTimeout value from 300 blocks to 50. Cannon has a limited amount of memory - approximately 1.1GB - and is not currently garbage collected. The longer channel timeout caused Cannon to run out of memory on OP Sepolia, and was close to the limit on mainnet. Reducing the ChannelTimeout significantly increases the amount of memory available to Cannon and reduces the risk of an OOM occurring.
  2. An ImmuneFi bounty hunter noticed that DelayedWETH’s recover function is not robust against transfers which need more than 2300 gas. We propose modifying DelayedWETH such that the owner can always recover funds regardless of how much gas is required.
  3. We have updated the Guardian and DeputyGuardian roles to have the permission to set the anchor state back to a valid game. This allows the DeputyGuardian to fix the anchor state registry in the event that a game with an invalid proposal incorrectly resolves as DEFENDER_WINS. Referencing an existing game ensures that the Guardian or DeputyGuardian can’t set an arbitrary anchor state - it has to be one that the fault dispute game found to be valid.
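
The constraint in item 3 can be illustrated with a small Python model. Everything here (class names, enum values, string addresses) is a hypothetical stand-in for the on-chain contracts. The sketch only shows the two properties described above: the caller must hold a privileged role, and the new anchor must come from a game that resolved as DEFENDER_WINS.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set, Tuple


class GameStatus(Enum):
    # Illustrative values; not taken from the contracts.
    IN_PROGRESS = 0
    CHALLENGER_WINS = 1
    DEFENDER_WINS = 2


@dataclass(frozen=True)
class DisputeGame:
    root_claim: str
    l2_block_number: int
    status: GameStatus


class AnchorStateRegistry:
    def __init__(self, authorized: Set[str]) -> None:
        # Stand-ins for the Guardian and DeputyGuardian addresses.
        self.authorized = authorized
        self.anchor: Optional[Tuple[str, int]] = None

    def set_anchor_state(self, caller: str, game: DisputeGame) -> None:
        if caller not in self.authorized:
            raise PermissionError("caller may not set the anchor state")
        # The anchor cannot be arbitrary: it must come from a game found valid.
        if game.status is not GameStatus.DEFENDER_WINS:
            raise ValueError("game did not resolve as DEFENDER_WINS")
        self.anchor = (game.root_claim, game.l2_block_number)
```

Because the method takes a game rather than a raw output root, the privileged roles can only choose among states the dispute game has already accepted.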

All proposed contract changes can be found in the op-contracts/v1.6.0 release.

Impacted Components

This upgrade involves both L1 smart contracts as well as the node and execution client software.

The following contracts are modified as part of this upgrade:

OP Node has been updated to process the reduced ChannelTimeout. OP Geth has been updated to limit the maximum input size to ecPairing.

Security Considerations

These changes are all in response to vulnerabilities discovered during external security audits. No vulnerabilities were found in the fallback mechanisms, which were themselves audited prior to deploying Fault Proofs in June.

As per the Audit Framework, the dispute game and MIPS contracts fall into the liveness/reputational risk category, which does not require audits. The fallback mechanisms make any bugs simple to recover from and pose no risk to user funds. Therefore, we have opted not to pursue a fix review for the changes made in this proposal. We propose addressing any additional issues discovered in a similar manner to the way they are being addressed here, specifically:

  1. Depending on the issues at hand, Labs recommends that the Deputy Guardian trigger the fallback or blacklist specific dispute games.
  2. Labs (or others in the core developer community) would create a governance proposal to resolve the issues.

There’s quite a bit of nuance to when the fallbacks should be activated. In light of this audit, we propose adopting the following rubric to decide if/when the fallback should be activated. If an issue is costly to exploit - e.g., it requires playing the game to MAX_GAME_DEPTH - then we propose disclosing it immediately and using the dispute game blacklist to mitigate any attempts to exploit it. The dispute game blacklist will seize any bonds paid by an attacker, and makes attempting to exploit dispute resolution deeply unprofitable. On the other hand, if issues are not costly to exploit then we propose activating the fallback prior to disclosure. In both cases, fixes for the vulnerabilities would be proposed as a regular protocol upgrade in the nearest voting cycle.
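
The rubric can be written down as a small decision function. This is a sketch of the rule as described in the paragraph above; the function name and the wording of the returned steps are illustrative.

```python
from typing import List


def mitigation_steps(costly_to_exploit: bool) -> List[str]:
    """Return the ordered response steps for a newly found issue."""
    if costly_to_exploit:
        # e.g. the exploit requires playing the game to MAX_GAME_DEPTH
        steps = [
            "disclose the issue immediately",
            "blacklist any dispute game that attempts the exploit",
        ]
    else:
        steps = [
            "activate the permissioned fallback",
            "disclose the issue",
        ]
    # Common to both branches of the rubric.
    steps.append("propose the fix as a regular upgrade in the nearest voting cycle")
    return steps
```

The only input is the cost of the exploit: cheap exploits trigger the fallback before disclosure, expensive ones rely on the blacklist.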

Consistent with the OP Labs Audit Framework, we have not had the contents of the hardfork audited. However, OP Labs did perform a security review of these changes. Risk analysis of each L2 change is below.

  • Limiting the size of ecPairing’s input is considered a low-risk change. Implementation bugs would not put user assets at risk. Even though this is technically a divergence from the EVM, our data suggests that there have been no usages of the ecPairing precompile with an input size > 1152 bytes, which is far below the limit we will be imposing.
  • Reducing the ChannelTimeout is considered a low-risk change. Implementation bugs would not put user assets at risk.

Impact Summary

  • OP Labs does not anticipate any downtime due to this upgrade.
  • If this proposal is approved, node operators must upgrade their node software prior to September 11th in order to avoid a chain split.
  • As a result of triggering the fallback, all pending withdrawals will be invalidated. Users with pending withdrawals will need to re-prove them against an output proposal submitted by the permissioned proposer. This means that withdrawals initiated less than one week before the upgrade is executed will only be finalized one week after the upgrade is complete. For example, a withdrawal initiated 6 days before the upgrade would take a total of 13 days to finalize. In addition, proposals made within a week of the permissionless game being reactivated will also be invalidated.
  • Users will be unable to provide more than 112687 bytes of input to the ecPairing precompile.
  • Proposers (other than the trusted proposer operated by OP Labs) will be unable to propose their own outputs until the fallback is deactivated following the L1 upgrades.
  • All client-side tooling is unaffected.
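
The withdrawal timing in the third bullet is simple arithmetic, shown here as a Python sketch. It assumes the withdrawal is re-proven immediately when the upgrade executes and that the finalization window is exactly seven days; the function name is illustrative.

```python
from datetime import timedelta

FINALIZATION_WINDOW = timedelta(days=7)


def total_time_to_finalize(age_at_upgrade: timedelta) -> timedelta:
    """Total wait for a pending withdrawal that must be re-proven at the upgrade.

    age_at_upgrade: how long before the upgrade the withdrawal was initiated.
    """
    # The time already waited is lost, and a full new window starts at the upgrade.
    return age_at_upgrade + FINALIZATION_WINDOW


print(total_time_to_finalize(timedelta(days=6)).days)  # 13
```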

Action Plan

If this vote passes, the Granite upgrade will be scheduled for execution on September 11th at 16:00:01 UTC. The upgrade will occur automatically for nodes on a release which contains the baked-in activation time. Granite is code complete in the optimism monorepo at commit a81de910dc2fd9b2f67ee946466f2de70d62611a and op-geth at commit 0f5b9dcfd2ac66f6fd8faae526b1549721f5f392. The smart contracts release is op-contracts/v1.6.0-rc.1. The op-node and op-geth releases will be finalized if this proposal passes.
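
For operators who want to double-check the baked-in activation time, the date above converts to a Unix timestamp as follows. The year 2024 is taken from the changelog dates in this post, and the comparison against the block timestamp is an assumption about how time-based activation is evaluated; the function name is illustrative.

```python
from datetime import datetime, timezone

# September 11th at 16:00:01 UTC; the year is inferred from the changelog dates.
GRANITE_TIME = int(datetime(2024, 9, 11, 16, 0, 1, tzinfo=timezone.utc).timestamp())


def is_granite_active(block_timestamp: int) -> bool:
    """Assumed rule: the fork is active once the block timestamp reaches GRANITE_TIME."""
    return block_timestamp >= GRANITE_TIME


print(GRANITE_TIME)  # 1726070401
```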

This upgrade has already been activated on internal devnets and the Sepolia Superchain in coordination with Base and Conduit.

The overall upgrade plan is as follows:

  1. Update the Absolute Pre-State: Prior to the hardfork activation, we will update the absolute pre-state as done in the Fjord upgrade. This ensures that the new op-program can be used with the upgraded protocol, and must be performed prior to hardfork activation. See the Fjord upgrade proposal for more details. This upgrade is transparent to users, and no action is required.
  2. Activate the Hardfork on L2: The hardfork will activate on the L2 network at the scheduled time. Node operators must upgrade to the versions described above to avoid a chain split. Once upgraded, no further action is required.
  3. Update the L1 Smart Contracts: Finally, we will update the L1 smart contracts to new versions that contain fixes for the audit issues. This upgrade is transparent to users, and no action is required. This update will also deactivate the fallback mechanism, and revert back to permissionless proposing.

The Security Council and Optimism Foundation must sign the transactions for steps 1 and 3 prior to the hardfork activation. This sequence is crucial to prevent breaking the fault proof system.

Emergency Cancellation

The releases above will contain a Granite activation at the above-mentioned time. If a critical security issue is found between approval and rollout, the Optimism Foundation and Security Council should coordinate an emergency cancellation. Node operators can quickly react by using the --override.granite flag on both op-node and op-geth.

Conclusion

This proposal outlines the Granite network upgrade, which responds to security vulnerabilities identified by third-party auditors. This upgrade brings better security and performance to the fault proof system.

Proposal Edit Changelog

  • 8/21/2024 - In Additional Fixes - clarified the reason the Guardian and DeputyGuardian roles have extended capabilities.
  • 8/21/2024 - Added etherscan references to the newly deployed contract implementations.
  • 8/29/2024 - Fixed etherscan links to the deployed contract implementations

I am an Optimism delegate with sufficient voting power and I believe this proposal is ready to move to a vote.

This seems like a highly technical upgrade, so although the post makes sense, I can’t personally vouch that it’s all fine.

I will trust the auditors and developers here and just give the okay as a delegate for this to go to a vote.

I am an optimism delegate with sufficient voting power and I believe the proposal is ready to move to a vote.

Seems a reasonable upgrade to address the vulnerabilities of the security audits and prioritize user safety/reinforce the fault proof system.

I am an Optimism delegate with sufficient voting power and I believe this proposal is ready to move to a vote.

Can someone explain the respectedGameType (0 → 1) change?

eg. why hasn’t a PermissionedDisputeGame been posted yet? The current AnchorStateRegistry is 4m blocks old.

An emergency upgrade may affect protocols that rely on the dispute game, such as the ENS gateway. We need to test whether such cases are handled.

There have been a number of proposals already made with the permissioned game type (about one an hour as was done with the permissionless games). The AnchorStateRegistry is only updated once the dispute period for games has elapsed and the game resolves as Defender Wins. It’s then used as the starting point for new games after that. Having an old anchor state just means there are more blocks that could be disputed in the top half of the dispute game which narrows down to find the first disputed block.

Thanks for the response. Let me rephrase: once the gameType switch was made, why wasn’t the new respectedGameType’s anchor root set to the last finalized game?

Similar to the new setAnchorState(), there could be copyAnchorState(GameType, GameType).

There’s no need to adjust the anchor state. It doesn’t affect withdrawals at all and will just naturally be updated when the next game resolves.

Isn’t it the latest on-chain finalized state?

No, the anchor state is just the starting point for new dispute games of that game type.

can you better elaborate when this scenario can happen? Aren’t there solutions that don’t involve permissioned roles?

This is the case.

Is there documentation for this?

—-

Whilst I appreciate that this is an ‘emergency’ upgrade and a good opportunity to prepare for the future, tooling utilising the implementation as is(was) is prone to breaking - it would be great to have additional documentation and clarity on what can change and under what circumstances.

I am an Optimism delegate with sufficient voting power and I believe this proposal is ready to move to a vote.

From my understanding, this hardfork upgrade has not completed an audit so far.

Such an upgrade shouldn’t pass without an audit, as it is already a fix of existing bugs.

If there are further bugs introduced in this upgrade, it will result in irreparable harm to the chain which will require further fixes.

could you also elaborate on Cantina 3.3.5, i.e. the incorrect implementation of the srav opcode? why is it considered low severity?

secondly, to me it’s a bit concerning that the fault proof system is outside the scope of external audits. I understand that in Stage 1, given the fallback, the blacklisting mechanism, the pause button and the Security Council, a bug is unlikely to affect the safety of the system. Citing Vitalik, the amazing property of Stage 1 rollups is being able to be safe AND live assuming <25% of honest members in the SC, also assuming small probability of bugs. If this probability is not small, and the system is often in the fallback state, the system mostly relies on a >75% honesty, defeating the purpose of Stage 1.

Where can I query the current finalized root on mainnet?

Game 1633 (first gameType 1) is starting from the current gameType 1 anchor state 0x2694ac14dcf54b7a77363e3f60e6462dc78da0d43d1e2f058dbb6a1488814977 @ block 120059863 (95 days ago)


proveWithdrawalTransaction() apparently doesn’t require finalization… shouldn’t that be != DEFENDER_WINS ? This was clarified on Discord (there is another finalization step. No withdrawals have been processed post gameType change.)

The SEED Latam delegation, as we have communicated here, with @Joxes being an Optimism delegate with sufficient voting power, believes this proposal is ready to move to a vote.

I am an Optimism Delegate with sufficient voting power and I believe this proposal is ready to move to a vote.

Here is a non-technical summary of Granite upgrade proposal on behalf of the Developer Advisory Board:

Major changes:

In response to security audits on the Fault Proof system, this upgrade aims to make three major changes:

  1. Fix the individual vulnerabilities defined in the audits.
  • There were 3 audits performed. Here are the reports: Spearbit, Cantina and Code4rena.

    This upgrade fixes important issues identified in these audits. Lower-severity issues will be planned in future upgrades.

  • DAB and the security audit firms haven’t reviewed fixes but they have been reviewed by OP Labs team and remain behind all the audited safeguards.

    A non-technical summary for the audit issues is added to the appendix.

  2. Make other changes to the smart contract system to improve robustness
  • In response to these audits, the system was put into “fallback” mode, where only OP Labs trusted proposer can propose state. After this upgrade is complete, the system will be put back into permissionless mode.
  • Make DelayedWETH robust to ETH transfers: The DelayedWETH contract holds the bonded ETH for each fault dispute game. ETH transfers usually take 2300 gas. However, the receiver can execute some code on receiving ETH, increasing the gas consumption. DelayedWETH is not robust against such transfers. The proposed fix is to remove the requirement on gas.
  • Grant privileged actors the power to set anchor state: As a result of switching back to the permissionless fault dispute game, the anchor state (proved state) from the time before the fallback mechanism was activated can be referenced again in the fault dispute game. To prevent this, the Guardian and DeputyGuardian roles are being given the permission to set this anchor state themselves.
  3. There will be a hardfork to the L2 node software that makes two changes to improve the stability and performance of the fault proof system
  • Reduce memory load to run off-chain node software: The off-chain software uses “channels” to pass data between processes. Any data received from channels after a certain amount of time (called ChannelTimeout) is considered invalid. This upgrade reduces that time to reduce the load on memory.

  • Limit the maximum input size to a precompile: Whenever a smart contract A calls another smart contract B, the gas it can forward to the call is limited (63/64th of A’s gas budget).

    This can cause issues in the Fault Dispute Game, because there are certain precompiles where we need to call them on L1 to prove the result from L2. If they used enough gas on L2, it may not be possible for the 63/64ths to be sufficient on L1, so it will always fail.

    To solve this, a change has been made to op-geth to limit the maximum input size, and therefore the maximum gas, that can be used by the affected precompile (ecPairing) on L2.

    Onchain data shows that the new limit is roughly two orders of magnitude higher than the largest call that has ever occurred, so there should be no user impact.

User Impact

  • This upgrade resets user withdrawals. Any proofs submitted within the 7 days before the upgrade will be invalid, and users will have to resubmit them.

Appendix

  • Cantina 3.1.1: During fault proof execution, the program relies on certain assumptions. This bug, if realized, can invalidate those assumptions, leading to incorrect verification of a fault proof. Although no proof of concept was given to show an exploit, this bug fix is prioritized as a cautionary measure.

  • C4 H-01: A claim proposed for block n can be countered using information from block n+1. This is incorrect behavior, as only information up to block n should be relevant in this scenario.

  • Spearbit 5.1.1: Whenever a smart contract A calls another smart contract B, the gas it can forward to the call is limited (63/64th of A’s gas budget). If the call to B reverts, A can continue executing the code that comes after the call until it runs out of gas, finishes executing, or reverts itself. The audit discovered that when PreimageOracle.loadPrecompilePreimagePart (smart contract A) calls precompiles (B), the call can revert if not enough gas is passed to it. On revert, the error message is used by the function instead of the value that would have been returned had the call succeeded.

    Only one precompile is vulnerable to this attack. The gas needed for precompile execution depends on the size of the input passed to it. The proposed fix is to put an upper limit on this input size. Onchain data shows that this limit is roughly two orders of magnitude larger than the largest input that has ever been passed to it, hence there is no user impact.

  • C4 H-02: In a particular section of the fault dispute game (requesting the preimage of hash values), there is a mismatch between the amount of time an honest actor and a malicious actor get. This fix adjusts these values so that an honest actor gets more time to post the response.

  • C4 H-05: Invalid data can be used in the fault dispute game by calling a function with its finalize argument set to false. This leads to the challenge period never kicking in.

  • The rest are low-severity issues.