Crypto.com Chain Crossfire Mainnet Dry-Run was an incentivized testnet aimed at stress testing the network in a practical, real-world setting before public release. It is an important milestone and the final step in preparation for Mainnet launch.
The dry-run lasted for 4 weeks, from 18 January to 15 February 2021. It was a huge success for the Crypto.com Chain team: it achieved its goals and gave us valuable experience and the opportunity to test different things on a network under stress. The dry-run brought to light some expected issues as well as newly identified areas for improvement. These lessons will support us in our preparation for the Mainnet launch.
What Were the Tasks in Crossfire?
Crossfire was divided into three phases:
- Phase One (week 1): participants set up their nodes and got themselves prepared;
- Phase Two (weeks 2 & 3): validators had to maintain a 50% signing rate;
- Phase Three (week 4): the attack phase, where participants were encouraged to launch attacks on the network and other validators while keeping their own uptime as high as possible.
Throughout the competition, participants were also scored based on the completion of different tasks, the total number of transactions sent, and their validator’s signing rate. The 10 participants with the highest scores earned extra rewards.
The team received 1,000 registrations within the first 3 days after the announcement of the dry-run. In total, we received 3,342 registrations, including those from long-time supporters of Crypto.com and blockchain enthusiasts with solid experience in setting up validator nodes on other blockchains.
To simulate a more realistic number of concurrent validators on a network, we decided to only host 220 participants for this event. We would like to take this chance to express our gratitude for the overwhelming support from the community.
Of the 206 participants who replied, 197 joined the network and completed at least one of the tasks. 174 participants completed at least 2, 158 participants completed at least 3, and 124 participants completed all of them.
By the end of the Crossfire event, 180K blocks had been proposed and 275M transactions broadcast to the network in just 4 weeks, creating a more highly stressed environment than we had ever seen before. At peak time, more than 180 validators from all over the world were reaching consensus together on the network.
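As a rough back-of-the-envelope check, the averages implied by these figures can be worked out as follows. This is only a sketch: it assumes the rounded totals quoted above (180K blocks, 275M transactions) and a run of exactly 28 days, and the variable names are ours. The results are averages, not measured values, and since broadcast transactions include ones that never made it into a block, the per-block figure is best read as an upper bound.

```python
# Rough averages implied by the rounded Crossfire totals quoted above.
# Assumption: the network ran for exactly 4 weeks (28 days).
SECONDS_IN_RUN = 28 * 24 * 60 * 60   # 2,419,200 seconds
TOTAL_BLOCKS = 180_000               # "180K blocks" (rounded)
TOTAL_TXS = 275_000_000              # "275M transactions" (rounded)

avg_block_interval_s = SECONDS_IN_RUN / TOTAL_BLOCKS
avg_txs_per_block = TOTAL_TXS / TOTAL_BLOCKS
avg_txs_per_second = TOTAL_TXS / SECONDS_IN_RUN

print(f"average block interval: {avg_block_interval_s:.2f} s")
print(f"average transactions broadcast per block: {avg_txs_per_block:.0f}")
print(f"average transactions broadcast per second: {avg_txs_per_second:.0f}")
```

Under these assumptions, the network averaged roughly 13 seconds per block and on the order of a hundred broadcast transactions per second over the whole event.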
One thing to note is that although Crossfire was positioned as a Mainnet Dry-Run, the tasks we set up and the actual traffic we saw created a more extreme environment than what we expect to see on the Mainnet. One of the big goals of any blockchain network is fault tolerance under adversarial conditions, and we are happy to say that Crossfire achieved this goal. We did see network degradation, but we worked with the validator participants to keep the blockchain running, and the network eventually survived the extreme conditions.
With this extremely active traffic, we were able to collect data that had been hard to obtain before. We also observed and monitored the network behavior closely. Some issues required our team to act promptly, which we will cover later.
The Mainnet Dry-Run not only helped us prepare for Mainnet; in the process, we also discovered issues whose fixes could help make Tendermint and the Cosmos SDK (on which the Chain was built) better and more secure.
How Crypto.com Helped to Secure Stargate
Stargate is the largest update to the Tendermint Core and Cosmos SDK ecosystem so far. It brought performance improvements and new features (such as State Sync, which allows new nodes to get the latest blockchain state in seconds instead of syncing for hours or days), and it unlocked a whole new world of opportunities through the interoperability of self-sovereign blockchains via the Inter-Blockchain Communication (IBC) protocol. You can read more about this huge collaborative development effort in this article.
Crypto.com Chain has used the Cosmos SDK since the early release candidates of Stargate and has kept up to date with its development. As with any large software overhaul, Stargate has undergone initial growing pains, and Crypto.com helped in different ways to strengthen the Tendermint Core and Cosmos SDK software stacks.
First, during the internal testing rounds, the Crypto.com security team discovered Denial of Service vulnerabilities in Tendermint Core. Following the responsible disclosure process, we communicated this issue to the Tendermint Core developers and assisted them in identifying the root cause and testing their proposed fixes. These vulnerabilities have been fixed in subsequent v0.34.2 and v0.34.3 releases of Tendermint Core. You can read more details in this retrospective write-up.
Second, we identified several issues in the Cosmos SDK, for which we either notified the core maintainers or contributed our own patches. In addition, we co-funded a security audit of several critical Cosmos SDK modules. Depending on the audit findings, the respective fixes will be incorporated into upcoming Cosmos SDK releases.
When the competition began, we were still evaluating the vulnerabilities, which led to an urgent security patch to the Crossfire network during the competition. For more details, please refer to the section “Applied Security Patch in Action” below.
Finally, several Crossfire participants reported increased RAM usage during some of the stress-testing periods. We helped the Tendermint Core team investigate the root cause of this issue. It has now been fixed (https://github.com/tendermint/tendermint/pull/6068), and the fix was incorporated in the v0.34.4 release.
For more details of the memory usage discovery, please refer to the section “Revised Hardware and Network Requirements after Stress Testing” below.
In terms of the next major release (v0.35), the Tendermint Core team is focusing on some of the long-standing P2P and mempool-related issues.
What We Have Learned
Improved Block Explorer Performance with Database Optimization
The heavy load during the competition helped us identify inefficiencies in the block explorer and fix them by optimizing the database.
The first stress-testing challenge we observed was on our block explorer. We use our own indexing service to power the block explorer data and the Crossfire dashboard. When the testnet began and participants started running their transaction spamming tools, we observed that the home page was loading more slowly than normal. After investigation, we realized that we were using some unoptimized SQL queries on our front page.
Once we identified the issue, we re-visited the SQL queries used in indexing and created database indices for the more frequently used ones.
With the database indices in place, we lowered the latency from a few hundred milliseconds to single-digit milliseconds, yet this was still insufficient under the high traffic. We therefore rolled out more improvements:
- Modified the indexing codebase to perform batch database insertion
- Upgraded the hardware for our indexing service and Postgres database
- Split the database into a write master and a read replica, with a dedicated machine serving the API from the read replica
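Two of the ideas above, indexing a frequently queried column and inserting rows in batches, can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module in place of Postgres, and the table, column, and index names (transactions, block_height, idx_transactions_block_height) are hypothetical, not our actual schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table holding indexed transactions.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "tx_hash TEXT PRIMARY KEY, block_height INTEGER NOT NULL)"
    )
    # Index the column used by a frequent query (transactions by block).
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_transactions_block_height "
        "ON transactions (block_height)"
    )


def insert_batch(conn: sqlite3.Connection, rows) -> None:
    # One database transaction and one executemany call for the whole
    # batch, instead of one statement and one commit per row.
    with conn:
        conn.executemany(
            "INSERT INTO transactions (tx_hash, block_height) VALUES (?, ?)",
            rows,
        )


def count_in_block(conn: sqlite3.Connection, height: int) -> int:
    cur = conn.execute(
        "SELECT COUNT(*) FROM transactions WHERE block_height = ?", (height,)
    )
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
create_schema(conn)
# Nine made-up transactions spread over three blocks.
insert_batch(conn, [(f"tx{i}", i % 3) for i in range(9)])
print(count_in_block(conn, 0))
```

Batching reduces the per-row overhead of round trips and commits, which matters most when the indexer is trying to keep up with a chain that is being spammed with transactions.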
As a result, we now have a healthier block explorer. These improvements will be carried over to the Mainnet launch.
Remedied Block Explorer and Tendermint Timeout Issues due to Large Blocks
The large block “attack” during week 4 revealed a potential problem: the default HTTP timeout was too short. As a result, we revised the timeout settings of the indexing service and Tendermint, while still protecting the server from slowloris attacks.
Another challenge we observed was that large blocks caused long HTTP response times. Some participants were sending very large transactions, and the resulting large blocks took much longer to be returned by the HTTP API.
For security reasons, it is good practice to restrict HTTP server and client connection timeouts in order to avoid slowloris attacks. Our indexing service and Tendermint both adopted this measure and used a 10s timeout. However, 10s was not enough for these large blocks.
We traced the cause of this issue starting from synchronization problems reported by our indexing monitoring system. When we requested the block results API, the response was cut off midway. When we made the same request with curl, we saw an error saying that the stream was not closed cleanly. These were all signs of the WriteTimeout kicking in.
To remedy this issue, we had to extend the timeout while protecting the server from the slowloris attack. The following settings were applied:
- For the indexing service, we extended only the timeout of the HTTP client that connects to Tendermint
- For the Tendermint server, there was no dedicated setting for the HTTP timeout. After studying the changelog, we realized that “rpc.timeout_broadcast_tx_commit” controls the timeout of the transaction broadcasting API. Because the write timeout of Go's HTTP server applies to the whole server rather than to individual API routes, the overall HTTP server write timeout is given by the formula “MAX(10s, rpc.timeout_broadcast_tx_commit)”.
- We also set up a dedicated, private full node to serve the indexing service. This ensures that the node is accessed only by trusted clients, and has the added benefit of improving indexing performance.
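The write-timeout rule described above can be expressed as a small calculation. This is a sketch of the formula only, not Tendermint's actual code: the function name and the example values are ours, and the 10-second default is the figure quoted in this article.

```python
DEFAULT_WRITE_TIMEOUT_S = 10.0  # default HTTP timeout quoted in this article


def effective_write_timeout(timeout_broadcast_tx_commit_s: float) -> float:
    """Return MAX(10s, rpc.timeout_broadcast_tx_commit), in seconds."""
    return max(DEFAULT_WRITE_TIMEOUT_S, timeout_broadcast_tx_commit_s)


# A value below the default has no effect on the write timeout...
print(effective_write_timeout(5.0))   # 10.0
# ...while a larger value (30s is an arbitrary example) extends it.
print(effective_write_timeout(30.0))  # 30.0
```

In other words, only a value of rpc.timeout_broadcast_tx_commit above 10s lengthens the window available for large responses; values at or below 10s leave it unchanged.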
Applied Security Patch in Action
The Tendermint evidence vulnerability gave us the chance to coordinate an urgent security patch with the 200 participants on a live network, which was a valuable experience for us.
When we launched the Mainnet Dry-Run, the patch for the Tendermint evidence vulnerability was still being evaluated, so we decided to use the unpatched version first. During the first two weeks of Crossfire, we started seeing some evidence propagated in the network, which suggested that the issue might have been exploited. We were faced with two choices: we could either apply the patch immediately, or wait until the scheduled network upgrade at the end of week 3.
We quickly decided to apply the patch immediately: the vulnerability was classified as severe, and we wanted to see the patch in action and make the network more efficient. The patch was therefore released on 25 January, with instructions announced on 26 January.
Here, we would like to stress the importance of providing a security contact when you set up a validator. This time we used a mix of registration emails and security contacts to notify validator operators of the updates. When the Mainnet launches, a validator's security contact will be an important way for us to reach its operator about security patches.
Understood the Relationship Between Network Parameters
Another lesson we have learned is the importance of balancing different parameters. In the competition, we had the opportunity to observe how parameters affect each other in practice. This can help us to decide the parameters on Mainnet.
Crypto.com Chain is built on top of Tendermint, which uses a Byzantine Fault Tolerant (BFT) consensus algorithm in the tradition of Practical Byzantine Fault Tolerance (PBFT). The blockchain is safe as long as less than ⅓ of the validators, weighted by voting power, are faulty. This is a strict requirement for keeping the chain safe.
However, with a total of 200 validators and an expected block time of 5 seconds, we have to choose the parameters carefully to make sure validators do not become faulty. When we say a validator is faulty, it does not necessarily mean that the node is trying to cheat other validators. It can also mean that the node is unable to submit and propagate its votes during the consensus rounds because of issues such as network speed.
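To make the ⅓ threshold concrete, the following sketch computes how many faulty validators a network of a given size can tolerate. It assumes, for simplicity, that every validator has equal voting power (in Tendermint the threshold applies to voting power rather than to a head count); the function name is ours.

```python
def max_faulty_validators(total_validators: int) -> int:
    """Largest f such that f is strictly less than one third of the total."""
    if total_validators < 1:
        raise ValueError("total_validators must be positive")
    # f < n / 3 is equivalent to 3f < n, so for integers f = (n - 1) // 3.
    return (total_validators - 1) // 3


for n in (4, 100, 200):
    print(f"{n} validators: tolerates up to {max_faulty_validators(n)} faulty")
```

With 200 equally weighted validators, the network tolerates up to 66 faulty validators. Counting slow or unreachable nodes as faulty, as described above, shows why network conditions matter as much as honest behavior.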
In the Mainnet Dry-Run, we observed and sometimes deliberately tweaked the balance between different parameters. Transaction spamming by the participants was inevitable given their tasks, so the network was flooded with both valid and invalid transactions at all times. This traffic created pressure, and we saw the consensus algorithm taking multiple rounds of vote exchange before a block could be committed.
When we increased the transaction memory pool size from 5K to 20K, we saw the consensus process taking even more rounds, because the bigger blocks and the higher number of transactions needed more time to propagate and congested the network, demanding more network bandwidth and computation power from the validators.
In the end, we came to understand how the different parameters interact, and this experience will help us decide the parameters for Mainnet.
Set Up Nodes in Diverse Geographical Locations for Better Network Coverage
Before the dry-run, we set up our nodes mainly in a single region. During the competition, however, some participants told us that they were having difficulty connecting and synchronizing. We suspected that this was related to the location of our seed nodes.
Our seed nodes and validators were initially set up in Singapore. To find out if the node location was the root cause, we conducted a survey on Discord and observed that validators located in Europe experienced high latency.
After confirming this theory, we set up two more public seed nodes, in the Europe and US regions, and shared them with the community. Since then, we have seen improvements in latency and in synchronization success.
Revised Hardware and Network Requirements after Stress Testing
Thanks to the heavy network load, we now have a better understanding of the hardware needed to run nodes reliably under different circumstances.
In our previous experience running the Croeseid testnet, we ran our validators with 2 vCPU and 16GB RAM and didn’t encounter any problem.
To be more cautious, we set up our validators in Crossfire with slightly stronger hardware than that used in Croeseid. However, when more than a hundred validators joined and participants spammed the network, we saw our validators experience performance degradation, with CPU, memory, and network bandwidth usage higher than we had ever seen before.
After investigation, it appeared that:
- The existing recommended hardware requirements (2 vCPU, 16GB RAM) were not enough to process a large volume of transactions on the network reliably
- There were also signs of a memory leak on our monitoring dashboard
- Some of the participants also reported similar memory issues on Discord. Later, we helped the Tendermint Core team to resolve this issue
We have now revised the recommended hardware requirements to make sure nodes can work as expected under different circumstances.
Suboptimal Node Setup is Not Recommended in the Long Run
In the competition, we found that some of the participants were running with a suboptimal node setup. While we appreciate the diversity of node setups, in the interest of network security and to reduce the risk of losing staked funds, we recommend running a node with better hardware, using a reliable infrastructure provider such as Bison Trails, or delegating your funds to robust validators.
The Crossfire Mainnet Dry-Run was not only an opportunity for the Crypto.com Chain team to learn; it was also a chance for the public to try out different deployment options and run a validator in action. We appreciate the diversity of node setups we saw in the competition. Nodes were set up across different regions on different hardware, and they were running on cloud vendors, in data centers, in offices, and at home.
On the other hand, from the survey we conducted earlier, we observed that some of the participants were using suboptimal validator node setups. While we appreciate that diverse setups reflect decentralization, one of the goals of the Crossfire Mainnet Dry-Run was to recommend a secure and reliable node setup and to educate everyone about it. This is important because when you run a validator on the Mainnet, you will be staking real money, so you will want to minimize the risk of your funds being slashed because of node failure.
If you are interested in running a validator on the Mainnet, we generally recommend a machine with the following setup:
- at least 4 CPU cores;
- 32GB RAM;
- key management service to secure your private key (e.g. Tendermint KMS);
- reliable network connection, with a minimum of 200Mb/s dedicated connection (i.e. not shared home broadband);
- reliable power supply - if you are running at home, consider having an uninterruptible power supply (UPS);
- ensure high availability;
- follow the security checklist prepared by the Crypto.com Security team
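The numeric parts of this checklist can be turned into a simple self-check. The sketch below is illustrative only: the NodeSpec type and the unmet_requirements function are hypothetical names, the 32GB RAM figure is treated as a minimum (our assumption), and the check covers only the CPU, RAM, and bandwidth items, not key management, power, availability, or the security checklist.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSpec:
    cpu_cores: int
    ram_gb: int
    bandwidth_mbps: int  # dedicated connection, in Mb/s


# Figures taken from the recommended setup listed above.
RECOMMENDED = NodeSpec(cpu_cores=4, ram_gb=32, bandwidth_mbps=200)


def unmet_requirements(spec: NodeSpec) -> List[str]:
    """Return a description of each numeric recommendation the spec misses."""
    problems = []
    if spec.cpu_cores < RECOMMENDED.cpu_cores:
        problems.append(f"CPU: {spec.cpu_cores} cores < {RECOMMENDED.cpu_cores}")
    if spec.ram_gb < RECOMMENDED.ram_gb:
        problems.append(f"RAM: {spec.ram_gb}GB < {RECOMMENDED.ram_gb}GB")
    if spec.bandwidth_mbps < RECOMMENDED.bandwidth_mbps:
        problems.append(
            f"Network: {spec.bandwidth_mbps}Mb/s < {RECOMMENDED.bandwidth_mbps}Mb/s"
        )
    return problems


# Example: the earlier 2 vCPU / 16GB recommendation on a 200Mb/s link.
for problem in unmet_requirements(NodeSpec(cpu_cores=2, ram_gb=16, bandwidth_mbps=200)):
    print(problem)
```

A machine that passes this check still needs the non-numeric items on the list, which cannot be verified from hardware figures alone.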
If you are interested in participating as a validator on Mainnet, but you cannot meet the infrastructure and operational requirements, Crypto.com has enlisted Bison Trails to provide reliable validator node infrastructure for the Crypto.com Chain Testnet and Mainnet.
Alternatively, you can consider delegating your CRO to reliable validators on the Mainnet to earn rewards.
Going forward, we will apply the lessons we learned from Crossfire to the Mainnet setup and complete our final preparations for Mainnet. We are summarizing our learnings and enriching our chain documentation so that it can serve as a portal for all educational resources for future developers and validators.
If you are interested in becoming a validator on Mainnet, we recommend that you participate in our Croeseid Testnet so that you can get early access to the chain features and be prepared for Mainnet. Our chain documentation can help you get started. We also recommend that you join our Discord to get the latest updates.