Hathora's Bare Metal Journey

How Hathora decided to adopt, evaluate, and successfully incorporate bare metal servers into our product.

As the CTO of a SaaS startup, my responsibility is to identify technologies that can have an outsized impact on our customers and can help our product stand out in the industry. Last summer, we took a massive bet on something that many technology companies would consider to be a relic of the past: bare metal servers. Six months later, I can say that the project was the most ambitious one we’ve ever undertaken, but also the most impactful for our product.

This post will cover our journey to bare metal, including:

  • Why we chose to adopt bare metal in the first place
  • What bare metal really means and the challenges associated with it
  • How we successfully incorporated bare metal into our product

Background

The first version of our server orchestration platform was purely cloud-based. We were optimizing for the operator experience – our core belief being that every game studio shouldn’t have to build an in-house game server hosting platform from scratch for their new title. Hathora abstracted away foundational components like autoscaling, multi-region compute, build management, security, and observability, all into a single holistic interface.

After we launched the platform at GDC in 2023, our core hypothesis was quickly validated – within 6 months, we had over 100 game developers adopt the platform, including some well-known studios like Frost Giant. However, we began to see more reluctance as we started speaking with much larger studios (who had 7- or 8-figure budgets for game server hosting alone).

As an example, we had been working with a customer building a next-gen tactical shooter. They had adopted Hathora for their in-development game, and it had been powering all their internal playtests for months. With the engineering team sold on the product, we moved to pricing discussions with their exec team as they considered whether it could power their official launch. When we ran the numbers, their estimated egress bandwidth cost came out to over $1M per month, which was over 4x their estimated monthly compute cost.

Cloud data transfer costs kept coming up as a concern in our conversations with these large enterprise customers, and we knew that bandwidth was an order of magnitude cheaper on bare metal. So we decided it was time to invest in incorporating bare metal into our product.

What is Bare Metal?

Before AWS first stormed onto the scene circa 2006 and popularized “cloud computing” in the market, bare metal was the de facto standard for powering centralized compute workloads. Bare metal typically refers to servers in their most basic form – machines installed on server racks inside a data center.

These machines come with the basic physical hardware you need:

  • a CPU (e.g. a 32 core Intel/AMD/ARM chip)
  • physically attached storage (e.g. a 1TB NVMe SSD)
  • an IP address and networking equipment (e.g. a 50Gbit NIC) so they can communicate over the internet
  • a server chassis (e.g. HP/Dell server rack) hooked up to power supply and cooling equipment

Server racks in an Equinix data center

In terms of software, you can install a base operating system through custom images or PXE booting, and you typically administer the servers via something like an iDRAC interface.

In contrast, the cloud is a much more feature-rich, one-size-fits-all platform containing:

  • Hypervisors that allow you to rent a slice of a machine without needing to care about the underlying physical core count
  • Separate storage from compute, which allows you to flexibly resize volumes and move them around to other instances
  • Ability to autoscale server instances to meet variable demand, which can reduce the need for capacity planning
  • A plethora of managed infrastructure such as SQL and NoSQL databases, key-value stores, queues, etc
  • Managed service orchestration (Kubernetes and otherwise)

The power and flexibility that the cloud offers make it the practical choice for the majority of modern applications. In fact, we plan to stick with the cloud for the foreseeable future for our control plane services (the "brains" behind our platform), as the operational simplicity we gain outweighs any cost savings we might get from moving them to bare metal.

However, there are some use cases that don’t benefit as much from cloud technology. Session-based game server hosting happens to be one of the use cases where bare metal really shines, since it calls for large volumes of compute and bandwidth to run ephemeral processes without the need for persistent storage or internal service-to-service communication. By stripping away all the add-ons that the cloud provides, the cost savings bare metal is able to achieve are staggering – it’s easily 3-4x cheaper than cloud on compute, and 10-15x cheaper on bandwidth.
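To make those ratios concrete, here is a rough back-of-the-envelope sketch that applies them to the tactical shooter example from the Background section. It is an illustration, not a quote: the $250k compute figure is an assumed upper bound derived from "egress was over 4x compute", and applying the general ratios to that one customer's estimate is also an assumption.

```python
# Illustrative only: the egress figure is the estimate from the customer
# example earlier in this post; the compute figure is an assumed upper bound.
CLOUD_EGRESS_PER_MONTH = 1_000_000  # estimated cloud egress cost, USD
CLOUD_COMPUTE_PER_MONTH = 250_000   # assumption: egress was "over 4x" compute

# Savings ratios quoted above (cloud cost divided by bare metal cost).
COMPUTE_RATIO = (3, 4)
BANDWIDTH_RATIO = (10, 15)


def bare_metal_range(cloud_cost, ratio):
    """Return the (low, high) bare metal cost implied by a cloud cost."""
    smallest_saving, largest_saving = ratio
    return cloud_cost / largest_saving, cloud_cost / smallest_saving


compute_low, compute_high = bare_metal_range(CLOUD_COMPUTE_PER_MONTH, COMPUTE_RATIO)
egress_low, egress_high = bare_metal_range(CLOUD_EGRESS_PER_MONTH, BANDWIDTH_RATIO)

cloud_total = CLOUD_COMPUTE_PER_MONTH + CLOUD_EGRESS_PER_MONTH
metal_low = compute_low + egress_low
metal_high = compute_high + egress_high

print(f"cloud:      ${cloud_total:,.0f}/month")
print(f"bare metal: ${metal_low:,.0f} to ${metal_high:,.0f}/month")
```

Under these assumptions, a bill of about $1.25M per month on cloud lands somewhere between roughly $129k and $183k on bare metal, with most of the difference coming from bandwidth rather than compute.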

In practice, we’ve found it’s often practical to go with a hybrid approach: bare metal for your expected baseline workload of games, and cloud to handle the bursty or unexpected traffic you get. This allows you to take advantage of the cost savings that bare metal offers, while also utilizing the on-demand scalability of the cloud. We already had the cloud orchestration bit in Hathora, and so it became a matter of augmenting it with bare metal.

Achieve optimal results by combining bare metal + cloud to meet variable player demand
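The trade-off behind choosing a baseline can be sketched in a few lines. The example below is a simplified model, not our production capacity planner: the demand curve and both hourly rates are made-up numbers, and the function names are hypothetical. The one structural assumption is that bare metal is paid for around the clock while cloud is paid for only when used.

```python
def split_capacity(hourly_demand, baseline):
    """Split each hour's demand into a bare metal part and a cloud burst part."""
    metal = [min(d, baseline) for d in hourly_demand]
    cloud = [max(d - baseline, 0) for d in hourly_demand]
    return metal, cloud


def hybrid_cost(hourly_demand, baseline, metal_rate, cloud_rate):
    """Bare metal is billed every hour, used or not; cloud is billed per use."""
    _, cloud = split_capacity(hourly_demand, baseline)
    metal_cost = baseline * metal_rate * len(hourly_demand)
    cloud_cost = sum(cloud) * cloud_rate
    return metal_cost + cloud_cost


# Hypothetical servers needed per hour over one day: quiet night, evening peak.
demand = [20] * 8 + [40] * 8 + [100] * 4 + [60] * 4
METAL_RATE = 1.0  # hypothetical cost per server-hour on bare metal
CLOUD_RATE = 3.5  # hypothetical cost per server-hour on cloud

for baseline in (0, 20, 40, 60, 100):
    cost = hybrid_cost(demand, baseline, METAL_RATE, CLOUD_RATE)
    print(f"baseline={baseline:>3} servers -> daily cost {cost:,.0f}")
```

With these made-up numbers, a 60-server baseline is the cheapest of the candidates: it beats both the all-cloud case (baseline 0) and provisioning bare metal for the full peak (baseline 100), because peak-sized bare metal sits idle most of the day.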

Adopting Bare Metal

When it came to planning how we were going to incorporate bare metal, we had to answer three major questions:

  1. Do we lease the hardware or buy it?
  2. Which vendor do we go with?
  3. How do we adopt bare metal machines into our orchestration software?

Rent vs Buy

The first decision we had to make was whether to rent or buy the hardware. Buying meant working with vendors like AMD or Intel to purchase hardware and host it in data centers offering co-location services. Renting meant working with a vendor who managed all aspects of the servers and leased them to us for a period of time. Given we’re an early-stage startup, we quickly ruled out the buy option – it didn’t make sense in terms of the upfront capital required, nor in terms of the ongoing operations burden it would place on our team. It was much more practical to rent from a bare metal vendor who managed the hardware procurement and maintenance.

Vendor Evaluation

Having decided this, we then entered the vendor evaluation process. We wanted to ensure our customers had at least the same baseline performance when running on bare metal as they did when running on cloud. After narrowing down to a few reputable bare metal vendors, we began an extensive evaluation of compute and network performance across them.

On the compute side, we evaluated Intel and AMD processors, and found that high-density AMD chips (especially the recently released 4th generation ones) were more than capable of powering modern Unreal and Unity based game servers, while also being highly affordable. We already had experience benchmarking network performance from our Cloud Latency Shootout, so we benchmarked the vendors' edge networks against the winning premium cloud network (AWS Global Accelerator). The best performing vendor outperformed Global Accelerator at distances under 4,000 km (roughly the distance between New York and San Francisco). Ultimately, our top providers were chosen based on their regional data center availability (they had to operate in at least the 11 regions that we already offered on our platform), hardware availability for our desired server specs, and edge network performance.

The top bare metal vendor outperforms the premium Global Accelerator network at short and medium distances

Hybrid cloud+metal container orchestration

The final problem to solve was container orchestration. The Hathora platform is container-native, and we were previously utilizing a fully managed cloud orchestration solution, but that didn't meet our new requirement of being able to run hybrid compute clusters with a mix of bare metal machines for base capacity, and auto-scaling cloud machines for burst capacity.

Thankfully, we found the open source project Talos Linux, which was built for exactly this purpose. Talos is a streamlined, immutable Linux distribution designed to run on cloud, bare metal, and virtualized environments. It's simple, entirely API-driven, and requires no SSH. We utilized Talos' WireGuard mesh, KubeSpan, to create a seamless cluster across bare metal and cloud for our platform. We worked with their team to build out an ideal orchestration solution that would meet our needs on a global scale.
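For readers unfamiliar with Talos, turning on KubeSpan comes down to a small piece of machine configuration. The sketch below is illustrative and is not the configuration we run in production: it only builds the two settings that the Talos documentation describes for KubeSpan (the mesh itself and the cluster discovery service it depends on) and prints them as JSON. The variable name is hypothetical, and a real cluster needs far more configuration than this.

```python
import json

# Minimal machine config fragment that enables KubeSpan and the cluster
# discovery service it relies on. The field names follow the Talos machine
# configuration reference; everything else a real cluster needs is omitted.
kubespan_patch = {
    "machine": {"network": {"kubespan": {"enabled": True}}},
    "cluster": {"discovery": {"enabled": True}},
}

print(json.dumps(kubespan_patch, indent=2))
```

Because the same fragment applies to every node regardless of where it runs, bare metal and cloud machines can join one cluster without per-provider networking setup, which is the property that made a hybrid cluster practical for us.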


Conclusion

With the incorporation of bare metal into our platform, we’ve been able to unlock tremendous cost savings for our customers. The value of hybrid cloud+metal orchestration deeply resonates with larger game studios who want to serve a massive player base without breaking the bank.

As part of this shift, we recently formalized an enterprise tier offering which gives our customers access to dedicated hardware, bare metal pricing, bring your own cloud, and more – check it all out on our pricing page.

Bare metal is just the start of the major product innovations we’re beginning to roll out. Stay tuned to see what’s next!