Cloud HPC Cluster - MLOps @ FAANG company

Case Study

The Challenge

The AI/ML explosion changed the economics of research infrastructure almost overnight. A FAANG company needed to scale GPU capacity fast (more researchers, more compute, more experiments running in parallel), but its on-premises HPC clusters couldn't keep up. On-premises infrastructure has real advantages: customization, performance, security, and control. It also carries serious disadvantages: massive upfront investment, long ROI cycles, years-long build timelines, and hardware that becomes obsolete before it is fully utilized. When AI research demand accelerates, waiting years to expand capacity is not an option.

Cloud HPC offered a different trade-off: more flexibility, faster provisioning, and cost efficiency for overflow capacity and cutting-edge hardware testing. The challenge was that cloud HPC offerings have limited native features and are notoriously difficult to adopt at enterprise scale. Integration with internal services was a hard requirement, not a nice-to-have.

Solution Delivered

Renaiss designed and deployed a production-ready HPC Slurm cluster on AWS using AWS ParallelCluster as the base layer, heavily customized to meet enterprise requirements.
The architecture went well beyond a standard ParallelCluster deployment. Key capabilities built on top of the base layer included secure access for internal users, Unix user management, two-factor authentication, S3 data pipelines, FSx for Lustre support across multiple configurations, Slurm partitions and limits, Slurm accounting, hardware observability, hardware testing frameworks, login nodes, and multi-tenant support across different AWS accounts. Persistent $HOME directories, Lustre eviction policies, and capacity planning tools were also implemented to support research workflows at scale.

Custom safeguards were built specifically for AWS services to prevent runaway costs and enforce governance. Over time, an Azure cluster was added to the stack using Azure CycleCloud, expanding the solution to a true multi-cloud environment.

Full tech stack: Terraform, Packer, AWS (EC2, EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch), PyTorch, NCCL, Duo.

Project Results

  • The platform scaled to support over 500 researchers across more than 20 clusters, spanning 5+ accounts and tenants. At peak, more than 5,000 GPUs were under active management, along with multiple petabytes of data on S3 and FSx.
  • The engagement had an impact beyond the client: AWS ParallelCluster incorporated several ideas developed during this project into its roadmap, a recognition of the technical depth and novelty of the work Renaiss contributed.
  • The result was a research infrastructure that could scale with AI demand — provisioning new clusters in hours instead of years, supporting hundreds of researchers simultaneously, and integrating seamlessly with internal enterprise systems.

+500 researchers

A research team at full speed, without the friction that on-prem environments create.

+20 clusters

More than twenty environments running in parallel, each tuned to its team and workload.

+5 accounts/tenants

Isolated environments per team, with centralized governance across every account.

+5000 GPUs under management

Five thousand GPUs orchestrated, scheduled, and billed with production-grade precision.

multiple PB on S3/FSx

AWS ParallelCluster took many ideas from this engagement

The work shaped AWS ParallelCluster's roadmap. A signal the architecture was ahead of the product.

From On-Prem Constraints to 5,000 GPUs in the Cloud

Assessment & Architecture Design

We mapped the client's research workflows, data volumes, and GPU demand to define a cloud HPC architecture that could scale without sacrificing security or control.

Infrastructure Deployment

We deployed a production-ready Slurm cluster on AWS with AWS ParallelCluster as the base layer, heavily customized to meet enterprise requirements.

Security & Access Configuration

We implemented Unix user management, 2FA via Duo, and secure multi-tenant access across isolated AWS accounts — built for an enterprise environment with hundreds of active researchers.

Storage & Data Pipeline Integration

We connected multiple FSx for Lustre file systems and S3 pipelines to support petabyte-scale datasets, with automated eviction policies to keep costs under control.
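As a rough illustration of what an eviction policy of this kind might look like, the sketch below picks the least recently accessed files until file-system usage falls back to a target ratio. It is not the policy used in the engagement: the `FileInfo` fields, the `select_for_eviction` helper, and the 80% target are all hypothetical, and the sketch assumes that only files already exported to S3 are safe to release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    size_bytes: int
    last_access: float  # Unix timestamp of the last read
    exported: bool      # True if an up-to-date copy already exists in S3 (assumption)


def select_for_eviction(files, used_bytes, capacity_bytes, target_ratio=0.8):
    """Pick least-recently-accessed exported files until usage drops to the target."""
    target = capacity_bytes * target_ratio
    selected = []
    # Oldest access first; files without an S3 copy are never candidates.
    for f in sorted((f for f in files if f.exported), key=lambda f: f.last_access):
        if used_bytes <= target:
            break
        selected.append(f.path)
        used_bytes -= f.size_bytes
    return selected


# Hypothetical inventory; sizes are in arbitrary units for readability.
files = [
    FileInfo("/fsx/a.ckpt", 40, last_access=100.0, exported=True),
    FileInfo("/fsx/b.ckpt", 30, last_access=300.0, exported=True),
    FileInfo("/fsx/c.tmp", 50, last_access=50.0, exported=False),
    FileInfo("/fsx/d.ckpt", 20, last_access=200.0, exported=True),
]
print(select_for_eviction(files, used_bytes=140, capacity_bytes=100))
# -> ['/fsx/a.ckpt', '/fsx/d.ckpt']
```

In a real deployment the selected paths would then be handed to whatever release mechanism the file system provides; the selection logic is kept separate here so it can be tested without touching storage.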

Observability & Ongoing Operations

We set up Slurm accounting, CloudWatch monitoring, and capacity planning tools so the client's team could run autonomously — with full visibility into usage, cost, and hardware performance.
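To give a sense of how Slurm accounting data can feed usage reports, here is a minimal sketch that sums GPU-hours per user. It assumes pipe-delimited records with the fields `User`, `ElapsedRaw`, and `AllocTRES`, such as those `sacct` can emit with `--parsable2 --noheader --format=User,ElapsedRaw,AllocTRES`. The sample records and both helper functions are hypothetical and are not taken from the client's tooling.

```python
from collections import defaultdict


def gpus_in_tres(alloc_tres):
    """Return the GPU count from a TRES string such as 'cpu=8,gres/gpu=4,node=1'."""
    for item in alloc_tres.split(","):
        key, _, value = item.partition("=")
        if key == "gres/gpu":
            return int(value)
    return 0  # Jobs with no GPU allocation contribute zero GPU-hours.


def gpu_hours_by_user(sacct_output):
    """Sum GPU-hours per user from 'User|ElapsedRaw|AllocTRES' lines."""
    totals = defaultdict(float)
    for line in sacct_output.strip().splitlines():
        user, elapsed_raw, alloc_tres = line.split("|")
        # ElapsedRaw is the job's elapsed time in seconds.
        totals[user] += gpus_in_tres(alloc_tres) * int(elapsed_raw) / 3600
    return dict(totals)


# Hypothetical accounting records, for illustration only.
sample = """\
alice|7200|billing=32,cpu=32,gres/gpu=8,mem=256G,node=1
bob|3600|billing=8,cpu=8,gres/gpu=2,mem=64G,node=1
alice|1800|billing=16,cpu=16,gres/gpu=4,mem=128G,node=1
carol|600|billing=4,cpu=4,mem=16G,node=1
"""
print(gpu_hours_by_user(sample))
# -> {'alice': 18.0, 'bob': 2.0, 'carol': 0.0}
```

Totals like these could back per-team usage dashboards or capacity forecasts; which metrics matter would depend on how the partitions and limits are set up.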
