Executive Summary
Hi, I’m Mofi, a protocol engineer at OP Labs. OP Labs is a software development company focused on the Optimism ecosystem and a core developer of the OP Stack. We provide some services to, but do not represent or speak on behalf of, the Optimism Foundation.
This upgrade is proposed in response to security vulnerabilities identified during a series of third-party security audits by Spearbit, Cantina, and Code4rena. None of the vulnerabilities have been exploited, and user assets are not and were never at risk. However, out of an abundance of caution, the permissioned fallback mechanism has been activated in order to avoid any potential instability while the vulnerabilities are patched. For more information on the permissioned fallback mechanism and the OP Stack’s defense-in-depth approach to Fault Proof security, see the documentation.
The upgrade includes both a set of smart contract upgrades to fix the vulnerabilities identified in the audit as well as an L2 hardfork to improve the stability and performance of the fault proof system. In addition, we propose extending the capabilities of the Guardian and DeputyGuardian to set the anchor state for the fault proof system in order to prevent referencing invalid anchor states. Aside from implementing these fixes, the primary impact of this upgrade would be to reset user withdrawals at the planned time, similar to the initial Fault Proof upgrade.
Motivation
As described in the original upgrade proposal, our rollout strategy for Fault Proofs focuses on securing the fundamental security mechanisms first, then building confidence in the correctness of the rest of the system over time. Successfully performing this strategy requires that:
- The fallback mechanisms are activated whenever there is a risk of a security vulnerability.
- Any vulnerabilities that may arise are swiftly patched.
Therefore, the Foundation (via the Deputy Guardian role) activated the permissioned fallback, which restricts output proposals to a trusted proposer, and we at OP Labs created this proposal to resolve the vulnerabilities identified by the security audits. Note that the Guardian role is generally authorized to enforce “Safety over Liveness” in the system, meaning that it can pause and unpause withdrawals and, as part of the Fault Proofs upgrade, was authorized to intervene in the event that a bug would allow an invalid L2 output to be finalized.
Specifications
The specification for the proposed changes can be found in the specs repo. Details of the individual audit issues are enumerated in the Audit Issues section below.
Technical Details
In the section below, we will summarize each of the audit issues we are fixing as well as the smart contracts/off-chain components affected by the upgrade. The full reports for each audit can be found here:
Note that this upgrade only contains the most important issues (in our opinion) from each audit. We plan on addressing other, lower-severity issues in future upgrades.
Audit Issues
The table below lists all the audit issues that are fixed in this upgrade. Issue severities have been updated to match the Optimism ImmuneFi bounty. While the auditors did discover some high severity issues, no user assets were ever at risk. All of the audit issues listed below can be detected by our monitoring tooling. Had an exploit been detected, the Deputy Guardian role - which is held by the Optimism Foundation and revocable by the Security Council - would have been expected to blacklist any exploitable dispute games or activate the permissioned fallback.
Note that we have updated the spec to clarify some assumptions around the Cannon VM and program. Specifically, we trust that the Go compiler will emit proper MIPS32 programs. As a result, the Cantina issues that reference problems related to invalid MIPS32 programs are considered out-of-scope and will not be fixed.
Unless otherwise noted, vulnerabilities in the dispute game all occur at MAX_GAME_DEPTH and are therefore classified as medium.
Issue ID | Severity | Summary |
---|---|---|
Cantina 3.1.1: Allocation overflow could allow for arbitrary code execution | High | Mmap calls did not perform memory bounds checking, which allowed memory pointers to wrap around to zero and access the entire memory of the fault proof program including the data and text sections of active MIPS programs. This could lead to arbitrary code execution within the VM, effectively breaking the VM’s correctness guarantees. No PoC was given, and it is our belief that such an exploit is infeasible due to memory protections employed by the Go runtime. Nevertheless, given the potential impact of the issue we worked with the auditors to classify this as a high severity issue and deploy a fix. |
C4 H-01: Invalid DISPUTED_L2_BLOCK_NUMBER is passed to VM | High | An attacker can counter a valid output claim by providing a trace containing one block after the original claim. For example, if an output root is proposed for block 13, the attacker could counter using a trace that includes valid blocks up to block 14. This issue is classified as high severity since it occurs above MAX_GAME_DEPTH. |
Spearbit 5.1.1: PreimageOracle.loadPrecompilePreimagePart an outOfGas error in the precompile will overwrite correct preimageParts | Medium | See section below. |
C4 H-02: The LPP challenge period can cause malicious and freeloader claims to be uncounterable and can also cause freeloader claims to be abused to entrap honest challengers | Medium | The clock extension mechanism is designed to give an honest actor time to counter freeloader claims, even though in that case they will inherit the opposing chess clock which may have very limited time remaining. Since the LPP challenge period was longer than the clock extension period, the clock extension granted in that case would not be sufficient to allow the honest actor to complete a call to step. While the overall game still resolves correctly, the challenger would lose bonds posted in attempting to counter the freeloader claim. |
C4 H-05: An attacker can bypass the challenge period during LPP finalization | Medium | An attacker can bypass the large preimage challenge period by calling addLeavesLPP with _finalize set to false. Since the challenge period timestamp is never set, attackers can then call squeezeLPP, thereby bypassing the challenge period and inserting invalid data into the preimage oracle. |
Cantina 3.3.5: Wrong implementation of srav | Low | The srav instruction does not mask the 5 lower bits of rs, which is nonconformant with the MIPS specification. This could lead to undefined behavior. |
Cantina 3.4.2: Location of registers array in memory should be verified | Low | The on-chain MIPS VM does not verify that the registers array is allocated right after the state struct. Validating this would make the code more defensive. |
Spearbit 5.2.5: Preimage proposals can be initialized multiple times | Low | initLPP() does not check if a proposal already exists. This could lead to loss of funds if a user provides the same LPP UUID multiple times. |
Spearbit 5.2.3: Extension period not applied correctly for next root when SPLIT_DEPTH is set to 1 or less | Low | When SPLIT_DEPTH is set to 1, the extension period for the next root is calculated as zero, which results in no extension period being applied. If SPLIT_DEPTH is set to zero, subsequent game moves will revert due to an integer underflow. |
Spearbit 5.1.2: Invalid Ancestor Lookup Leading to Out-of-Bounds Array Access | Low | An out-of-bounds array access can occur when MAX_GAME_DEPTH = SPLIT_DEPTH + 1. While this is unlikely to happen in practice, we have updated the FDG constructor to require that SPLIT_DEPTH + 1 < MAX_GAME_DEPTH. |
Spearbit 5.2.4: Inconsistent _partOffset check and memory boundaries in loadLocalData function | Low | The _partOffset parameter is handled inconsistently in some places within the loadLocalData function. |
Spearbit 5.2.6: _clockExtension and _maxClockDuration are not validated correctly in DisputeGame constructor | Low | If a dispute game is initialized with a clock extension set to more than half of the max duration, move transactions during the execution trace bisection will revert since the difference between 2 * the clock extension and the game’s max duration will underflow. |
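
To make the srav issue (Cantina 3.3.5) in the table above concrete, here is a minimal Python sketch of the behavior the MIPS specification calls for: the shift amount is taken from the low 5 bits of rs only, and the shift is arithmetic (sign-preserving) on a 32-bit value. The function name and the use of plain integers for registers are illustrative assumptions; this is not the on-chain MIPS.sol code.

```python
def srav(rt: int, rs: int) -> int:
    """Illustrative model of MIPS32 SRAV: arithmetic right shift of rt by rs[4:0]."""
    shamt = rs & 0x1F  # only the low 5 bits of rs select the shift amount
    rt &= 0xFFFFFFFF   # treat rt as a 32-bit register value
    # Reinterpret the 32-bit pattern as a signed integer so that Python's >>
    # replicates the sign bit, then wrap back to an unsigned 32-bit pattern.
    signed = rt - (1 << 32) if rt & 0x80000000 else rt
    return (signed >> shamt) & 0xFFFFFFFF
```

With the mask in place, a shift amount of 33 behaves the same as a shift amount of 1, since only the low 5 bits are used.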
Notes on Spearbit 5.1.1
By calling loadPrecompilePreimagePart with less gas than necessary, an attacker could produce an outOfGas error in the precompile. If there is enough gas left in the loadPrecompilePreimagePart function, a valid preimage could be overwritten with the outOfGas error itself. This would result in an incorrect game outcome.
The function of the loadPrecompilePreimagePart method is to allow certain expensive precompiles - namely ecrecover, ecpairing, and kzg_point_evaluation - to be accelerated. Accelerated precompiles offload their execution to an L1 oracle. Since precompiles are implemented natively rather than via EVM opcodes, this improves Cannon’s performance and allows challengers to quickly generate traces for blocks filled with these computationally expensive calls. To address the outOfGas issue, we’re adding a minimum gas requirement to the PreimageOracle to ensure there’s enough gas to accelerate precompiles on L1.
Another related issue with loadPrecompilePreimagePart is that the gas available to accelerate precompiles on L1 may be insufficient given the cost of executing them on L2. This is because the gas provided to the precompile accelerator contract on L1 can never exceed 63/64ths of the gas limit on L2. This is a problem for precompiles that have a dynamic gas cost of execution. Of the accelerated precompiles, ecPairing is the only one that contains this vulnerability, as its gas cost scales with its input size.
To fix this problem, we are proposing an L2 hardfork to limit the maximum input size provided to the ecPairing precompile to 112687 bytes. This number is high enough to enable all known use cases of the ecPairing precompile, but low enough to enable the challenger to generate traces for larger blocks in a timely manner. While this is technically a divergence from the EVM, our on-chain data has found no calls to the ecPairing precompile with an input size over 1187 bytes. The provided limit is therefore 2 orders of magnitude larger than any known use case, which we believe is sufficiently safe.
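
The gas reasoning in this section can be sketched in a few lines of Python. The pairing constants come from the public EIPs (192 bytes per pair from EIP-197; 45,000 base gas plus 34,000 gas per pair from EIP-1108), and the forwarding rule is the EIP-150 "all but one 64th" rule. The function names, and the shape of the minimum-gas check, are hypothetical and are not taken from the PreimageOracle or op-geth code.

```python
PAIR_SIZE_BYTES = 192              # one (G1, G2) pair, per EIP-197
PAIRING_BASE_GAS = 45_000          # EIP-1108 pricing
PAIRING_PER_PAIR_GAS = 34_000      # EIP-1108 pricing
MAX_PAIRING_INPUT_BYTES = 112_687  # limit proposed in this upgrade


def pairing_gas(input_len: int) -> int:
    """Gas charged by ecPairing for input_len bytes of input."""
    return PAIRING_BASE_GAS + PAIRING_PER_PAIR_GAS * (input_len // PAIR_SIZE_BYTES)


def max_forwardable_gas(gas_left: int) -> int:
    """EIP-150: a call can forward at most all but one 64th of the remaining gas."""
    return gas_left - gas_left // 64


def can_accelerate(required_gas: int, gas_left: int) -> bool:
    """Hypothetical minimum-gas check: only call the precompile if the
    forwardable gas covers what the precompile needs."""
    return max_forwardable_gas(gas_left) >= required_gas


def pairing_input_allowed(input_len: int) -> bool:
    """Proposed L2 rule: inputs larger than the limit are rejected."""
    return input_len <= MAX_PAIRING_INPUT_BYTES
```

Under these assumptions, an input at the proposed limit holds 586 full pairs and costs 19,969,000 gas, which gives a fixed upper bound on the gas the L1 accelerator must be able to forward.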
Additional Fixes
In addition to the audit fixes above, we are also proposing the following additional fixes:
- We propose reducing the ChannelTimeout value from 200 blocks to 50. Cannon has a limited amount of memory - approximately 1.1GB - and is not currently garbage collected. The longer channel timeout caused Cannon to run out of memory on OP Sepolia, and was close to the limit on mainnet. Reducing the ChannelTimeout significantly increases the amount of memory available to Cannon and reduces the risk of an OOM occurring.
- An ImmuneFi bounty hunter noticed that DelayedWETH’s recover function is not robust against transfers which need more than 2300 gas. We propose modifying DelayedWETH such that the owner can always recover funds regardless of how much gas is required.
- We have updated the Guardian and DeputyGuardian roles to have the permission to set the anchor state back to a valid game. This allows the DeputyGuardian to fix the anchor state registry in the event that a game with an invalid proposal incorrectly resolves as DEFENDER_WINS. Referencing an existing game ensures the Guardian or Deputy Guardian can’t set an arbitrary anchor state - it has to be one that the fault dispute game found to be valid.
All proposed contract changes can be found in the op-contracts/v1.6.0 release.
Impacted Components
This upgrade involves both L1 smart contracts as well as the node and execution client software.
The following contracts are modified as part of this upgrade:
- On-Chain MIPS VM
- Dispute Game
- DelayedWETH.sol
- DeputyGuardianModule.sol
- AnchorStateRegistry.sol
OP Node has been updated to process the reduced ChannelTimeout. OP Geth has been updated to limit the maximum input size to ecPairing.
Security Considerations
These changes are all in response to vulnerabilities discovered during external security audits. No vulnerabilities were found in the fallback mechanisms, which were themselves audited prior to deploying Fault Proofs in June.
As per the Audit Framework, the dispute game and MIPS contracts fall into the liveness/reputational risk category, which does not require audits. The fallback mechanisms make any bugs simple to recover from and pose no risk to user funds. Therefore, we have opted not to pursue a fix review for the changes made in this proposal. We propose addressing any additional issues discovered in a similar manner to the way they are being addressed here, specifically:
- Depending on the issues at hand, Labs recommends that the Deputy Guardian trigger the fallback or blacklist specific dispute games.
- Labs (or others in the core developer community) would create a governance proposal to resolve the issues.
There’s quite a bit of nuance to when the fallbacks should be activated. In light of this audit, we propose adopting the following rubric to decide if/when the fallback should be activated. If an issue is costly to exploit - e.g., it requires playing the game to MAX_GAME_DEPTH - then we propose disclosing it immediately and using the dispute game blacklist to mitigate any attempts to exploit it. The dispute game blacklist will seize any bonds paid by an attacker, and makes attempting to exploit dispute resolution deeply unprofitable. On the other hand, if issues are not costly to exploit, then we propose activating the fallback prior to disclosure. In both cases, fixes for the vulnerabilities would be proposed as a regular protocol upgrade in the nearest voting cycle.
Consistent with the OP Labs Audit Framework, we have not had the contents of the hardfork audited. However, OP Labs did perform a security review of these changes. Risk analysis of each L2 change is below.
- Limiting the size of ecPairing’s input is considered a low-risk change. Implementation bugs would not put user assets at risk. Even though this is technically a divergence from the EVM, our data suggests that there have been no usages of the ecPairing precompile with an input size > 1152 bytes, which is far below the limit we will be imposing.
- Reducing the ChannelTimeout is considered a low-risk change. Implementation bugs would not put user assets at risk.
Impact Summary
- OP Labs does not anticipate any downtime due to this upgrade.
- If this proposal is approved, node operators must upgrade their node software prior to September 11th in order to avoid a chain split.
- As a result of triggering the fallback, all pending withdrawals will be invalidated. Users with pending withdrawals will need to re-prove them against an output proposal submitted by the permissioned proposer. This means that withdrawals initiated less than one week before the upgrade is executed will only be finalized one week after the upgrade is complete. For example, a withdrawal initiated 6 days before the upgrade would take a total of 13 days to finalize. In addition, proposals made within a week of the permissionless game being reactivated will also be invalidated.
- Users will be unable to provide more than 112687 bytes of input to the ecPairing precompile.
- Proposers (other than the trusted proposer operated by OP Labs) will be unable to propose their own outputs until the fallback is deactivated following the L1 upgrades.
- All client-side tooling is unaffected.
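
The withdrawal timing in the list above is simple arithmetic and can be sketched as follows. This assumes a seven-day finalization period ("one week" in the text) and that the user re-proves the withdrawal as soon as the upgrade is executed; the function name is hypothetical.

```python
FINALIZATION_PERIOD_DAYS = 7  # "one week", as stated in the impact summary


def total_days_to_finalize(days_before_upgrade: int) -> int:
    """Days from initiation to finalization for a withdrawal that is still
    pending at the upgrade and is re-proven immediately afterwards."""
    if not 0 <= days_before_upgrade < FINALIZATION_PERIOD_DAYS:
        raise ValueError("model only covers withdrawals initiated less than one week before the upgrade")
    # Time already waited is lost; a full finalization period starts at the upgrade.
    return days_before_upgrade + FINALIZATION_PERIOD_DAYS
```

For the example in the text, a withdrawal initiated 6 days before the upgrade gives 6 + 7 = 13 days in total.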
Action Plan
If this vote passes, the Granite upgrade will be scheduled for execution on September 11th at 16:00:01 UTC. The upgrade will occur automatically for nodes on a release which contains the baked-in activation time. Granite is code complete in the optimism monorepo at commit a81de910dc2fd9b2f67ee946466f2de70d62611a and op-geth at commit 0f5b9dcfd2ac66f6fd8faae526b1549721f5f392. The smart contracts release is op-contracts/v1.6.0-rc.1. The op-node and op-geth releases will be finalized if this proposal passes.
This upgrade has already been activated on internal devnets and the Sepolia Superchain in coordination with Base and Conduit.
The overall upgrade plan is as follows:
- Update the Absolute Pre-State: Prior to the hardfork activation, we will update the absolute pre-state as done in the Fjord upgrade. This ensures that the new op-program can be used with the upgraded protocol, and must be performed prior to hardfork activation. See the Fjord upgrade proposal for more details. This upgrade is transparent to users, and no action is required.
- Activate the Hardfork on L2: The hardfork will activate on the L2 network at the scheduled time. Node operators must upgrade to the versions described above to avoid a chain split. Once upgraded, no further action is required.
- Update the L1 Smart Contracts: Finally, we will update the L1 smart contracts to new versions that contain fixes for the audit issues. This upgrade is transparent to users, and no action is required. This update will also deactivate the fallback mechanism and revert to permissionless proposing.
The Security Council and Optimism Foundation must sign the transactions for steps 1 and 3 prior to the hardfork activation. This sequence is crucial to prevent breaking the fault proof system.
Emergency Cancellation
The releases above will contain a Granite activation at the above-mentioned time. If a critical security issue is found between approval and rollout, the Optimism Foundation and Security Council should coordinate an emergency cancellation. Node operators can quickly react by using the --override.granite flag on both op-node and op-geth.
Conclusion
This proposal outlines the Granite network upgrade, which responds to security vulnerabilities identified by third-party auditors. This upgrade brings better security and performance to the fault proof system.
Proposal Edit Changelog
- 8/21/2024 - In Additional Fixes - clarified the reason the Guardian and DeputyGuardian roles have extended capabilities.
- 8/21/2024 - Added etherscan references to the newly deployed contract implementations.
- 8/29/2024 - Fixed etherscan links to the deployed contract implementations