Case Study
The AI/ML explosion changed the economics of research infrastructure overnight. A FAANG company needed to scale GPU capacity fast — more researchers, more compute, more experiments running in parallel — but on-premises HPC clusters couldn't keep up. On-premises infrastructure has real advantages: customization, performance, security, and control. But it also carries serious disadvantages: massive upfront investment, long ROI cycles, years-long build timelines, and hardware that becomes obsolete before it's fully utilized. When AI research demands accelerate, waiting years to expand capacity is not an option. Cloud HPC offered a different trade-off: more flexibility, faster provisioning, and cost efficiency for overflow capacity and cutting-edge hardware testing. The challenge was that cloud HPC offerings have limited native features and are notoriously difficult to adopt at enterprise scale. Integration with internal services was a hard requirement — not a nice-to-have.
Renaiss designed and deployed a production-ready HPC Slurm cluster on AWS using AWS ParallelCluster as the base layer, heavily customized to meet enterprise requirements.
The architecture went well beyond a standard ParallelCluster deployment. Key capabilities built on top of the base layer included secure access for internal users, Unix user management, two-factor authentication, S3 data pipelines, FSx for Lustre support across multiple configurations, Slurm partitions and limits, Slurm accounting, hardware observability, hardware testing frameworks, login nodes, and multi-tenant support across different AWS accounts. Persistent $HOME directories, Lustre eviction policies, and capacity planning tools were also implemented to support research workflows at scale.
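To make the idea of a Lustre eviction policy concrete, here is a minimal sketch of one common approach: free space by releasing the least recently accessed files first, and only those that already have a copy in the backing store. It is an illustration under stated assumptions, not the policy deployed for the client. The `FileRecord` type, the `select_for_eviction` function, and the 80% target are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata record for one file on the Lustre file system."""
    path: str
    size_bytes: int
    last_access: float  # Unix timestamp of the last read
    archived: bool      # True if a copy already exists in the backing store (e.g. S3)


def select_for_eviction(files, used_bytes, capacity_bytes, target_ratio=0.8):
    """Pick files to evict, least recently accessed first.

    Selection stops once projected usage is at or below
    target_ratio * capacity_bytes. Files without an archived copy are
    never selected, so no data is lost by evicting them.
    """
    target = target_ratio * capacity_bytes
    selected = []
    for record in sorted(files, key=lambda r: r.last_access):
        if used_bytes <= target:
            break
        if not record.archived:
            continue  # unsafe to evict: this is the only copy
        selected.append(record.path)
        used_bytes -= record.size_bytes
    return selected
```

The function only decides what to evict. Actually releasing the data is left to whatever tooling the file system provides, which keeps the policy easy to test in isolation.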
Custom safeguards were built specifically for AWS services to prevent runaway costs and enforce governance. Over time, an Azure cluster was added to the stack using Azure CycleCloud, expanding the solution to a true multi-cloud environment.
Full tech stack: Terraform, Packer, AWS (EC2, EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch), PyTorch, NCCL, Duo.
A research team at full speed, without the friction that on-prem environments create.
Twenty-plus environments running in parallel, each tuned to its team and workload.
Isolated environments per team, with centralized governance across every account.
Five thousand GPUs orchestrated, scheduled, and billed with production-grade precision.
The work shaped AWS ParallelCluster's roadmap, a signal that the architecture was ahead of the product.





Assessment & Architecture Design
01 / 05
Infrastructure Deployment
02 / 05
Security & Access Configuration
03 / 05
Storage & Data Pipeline Integration
04 / 05
Observability & Ongoing Operations
05 / 05
We mapped the client's research workflows, data volumes, and GPU demand to define a cloud HPC architecture that could scale without sacrificing security or control.
We implemented Unix user management, 2FA via Duo, and secure multi-tenant access across isolated AWS accounts — built for an enterprise environment with hundreds of active researchers.
We connected multiple FSx for Lustre file systems and S3 pipelines to support petabyte-scale datasets, with automated eviction policies to keep costs under control.
We set up Slurm accounting, CloudWatch monitoring, and capacity planning tools so the client's team could run autonomously — with full visibility into usage, cost, and hardware performance.
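As an illustration of how Slurm accounting data can feed usage and capacity reports, the sketch below totals GPU-hours per account from pipe-delimited accounting records. It assumes input shaped like the output of `sacct --allocations --parsable2 --noheader --format=Account,ElapsedRaw,AllocTRES`, where `ElapsedRaw` is the job's elapsed time in seconds. The function names and sample values are hypothetical and are not taken from the client's tooling.

```python
from collections import defaultdict


def gpus_from_tres(alloc_tres):
    """Extract the GPU count from an AllocTRES string such as
    'billing=8,cpu=96,gres/gpu=8,mem=1000G,node=1'. Returns 0 if no GPUs."""
    for item in alloc_tres.split(","):
        key, _, value = item.partition("=")
        if key == "gres/gpu":
            return int(value)
    return 0


def gpu_hours_by_account(sacct_output):
    """Sum GPU-hours per account from 'Account|ElapsedRaw|AllocTRES' lines."""
    totals = defaultdict(float)
    for line in sacct_output.strip().splitlines():
        account, elapsed_raw, alloc_tres = line.split("|")
        # GPU-hours = number of GPUs * elapsed seconds / 3600
        totals[account] += gpus_from_tres(alloc_tres) * int(elapsed_raw) / 3600
    return dict(totals)


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = (
        "vision|7200|billing=8,cpu=96,gres/gpu=8,mem=1000G,node=1\n"
        "nlp|3600|cpu=4,mem=16G,node=1\n"
        "vision|1800|cpu=48,gres/gpu=4,node=1\n"
    )
    for account, hours in gpu_hours_by_account(sample).items():
        print(f"{account}: {hours:.1f} GPU-hours")
```

Per-account totals like these can be compared against each team's allocation to spot under-used or over-subscribed partitions before adding capacity.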