Next-Generation Cloud Clusters

By Quan Nguyen, Software Engineer, and Michael McQuade, Director of Engineering · 5 min read

We're excited to announce major improvements to our cloud clusters with the release of our next-generation provider.

For the past two years, we've provisioned clusters using Infrastructure as Code (IaC) templates, which allowed us to share responsibility for developing cloud infrastructure provisioning among software engineers, system administrators, and our support staff, who have more of an HPC focus. As we've scaled out our offering and given users the ability to provision more varied types of resources across more clouds, we've learned a lot of lessons.

We want to provide a very transparent "window" into the provisioning process, because there are a lot of places where things can go wrong. Surfacing those failure points and making them easier to troubleshoot was a challenge with IaC, because it was ultimately meant to be used interactively. The best you can get is a set of logs, which for most users is not very helpful.

Another major concern we wanted to address was resiliency, which ultimately required more control over the entire provisioning process. In our newest provider, all of the code for provisioning clusters has been centralized into our core platform, using the CSPs' Software Development Kits (SDKs). This will allow us to start building more interesting and useful features around provisioning, as well as rapidly adapt to changes in the CSPs' offerings.

Coming soon, we'll have a new module on the sessions page that shows each component being provisioned and its current status; if something fails, it will be easy to determine where things went wrong. Since we were already undertaking a major overhaul of the provisioning process, we also took the chance to tackle our "wish list" of cluster enhancements.

What's Changed

Faster start-up times

Previously, cloud clusters relied on the user workspace to start up. Now, a job is submitted to a worker in our core infrastructure and the entire process runs in parallel. This makes cluster start-up much faster: only 2-3 minutes until you can log in.

Credentials on clusters

Previously, we provisioned compute nodes from the controller node, which meant some CSP credentials needed to be available on the controller node. Now, compute nodes are instead provisioned by a centralized service, so credentials are no longer present on controller nodes. This change increases security as well as reliability.

CLI available on clusters

Our new CLI comes pre-installed, and is in fact what establishes the connection back to ACTIVATE and handles provisioning requests for compute nodes. For more information, check out our CLI documentation.

New scheduler logs

Since the entire provisioning process has been centralized, including the provisioning of partition compute nodes, we now have a central place to manage logging. In the Logs module on the cluster Definition page, select the Scheduler tab to view insights into the CSP provisioning process.

New access tab

Previously, clusters had a Sharing tab and a multi-user button, which often caused confusion. In our previous generation of clusters, you could actually only share a cluster with a single group: the group it was being billed to. Now, the multi-user button has been removed, and you can use the Access tab to share a cluster with as many groups as you'd like; you can also update who can access your cluster while it's already running, with changes taking effect within 30 seconds of saving.

The first time a user logs in to a cluster, their home directory is created. All home directories are stored locally on the cluster, on an NFS share exported from the controller.

For more information about the Access tab, please see this page.
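To illustrate the home directory setup described above, here's a conceptual sketch; the subnet, export options, and hostname are placeholders, not our exact configuration:

```bash
# Conceptual sketch only; subnet, options, and hostname are placeholders.
# On the controller, an /etc/exports entry along these lines serves
# /home to the compute nodes:
#   /home  10.0.0.0/16(rw,sync,no_root_squash)
# From a compute node, the controller's exports can be listed with:
showmount -e controller.internal
```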

User scripts

We've added toggle buttons for Bootstrap Controller and Bootstrap Compute Nodes so users can decide where their bootstrap script runs. Previously, this was done by including the line ALLNODES at the top of your user script; please do not include that line on the new provider.
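For reference, here's a minimal sketch of an old-style user script (the bootstrap step itself is just a placeholder):

```bash
ALLNODES
# Old-style user script: the ALLNODES directive above told the previous
# provider to run this script on every node. On the new provider, remove
# it and use the Bootstrap Controller / Bootstrap Compute Nodes toggles.
echo "running bootstrap steps"   # placeholder bootstrap step
```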

Inline disks

You can now add inline disks to your controller. This is useful if you want the disk's lifecycle to be tied to the controller. Image Disk Name, Image Disks, and Image Disk Size have been removed, as they have been completely replaced by inline disks. For more information, see our blog post on flexible disks.

Smarter node requests

Compute partition provisioning requests are now made as batch API requests. This means that if you request many nodes and the CSP can't fulfill the entire request, no nodes will be provisioned. Additionally, if any of the requested compute nodes fails to finish the configuration process, all nodes will be deleted. Together, these changes improve node request times and significantly help avoid unnecessary costs.
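As a rough illustration of this all-or-nothing behavior (not necessarily the exact call we make), EC2's RunInstances API works the same way when MinCount equals MaxCount; the AMI and instance type below are placeholders:

```bash
# Illustrative only (placeholder AMI and instance type). When
# min-count equals max-count, EC2 either launches all eight
# instances or launches none: the same principle the new provider
# applies through the CSP SDKs.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type c5n.18xlarge \
  --min-count 8 \
  --max-count 8
```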

Smarter default settings

For GCP clusters, gVNIC is now always enabled, and Migrate on Maintenance is enabled when the instance type supports it.

For AWS clusters, EFA is now automatically enabled when the instance type supports it.
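For context, these defaults correspond roughly to the following CSP-native settings; the instance name, AMI, subnet, and security group below are placeholders:

```bash
# Illustrative CSP-native equivalents of the new defaults
# (instance name, AMI, subnet, and security group are placeholders).

# GCP: request a gVNIC network interface and live migration on maintenance.
gcloud compute instances create demo-node \
  --network-interface=nic-type=GVNIC \
  --maintenance-policy=MIGRATE

# AWS: attach an EFA-enabled network interface at launch.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type c5n.18xlarge \
  --network-interfaces "DeviceIndex=0,InterfaceType=efa,SubnetId=subnet-0abc123,Groups=sg-0abc123"
```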

Cluster cost

In addition to the Cost dashboard, we now have a Cost module on the cluster Sessions page. This module shows a detailed breakdown of costs on your cluster in real time. For more information about cost types, please see this page.