About The Role
RadixArk is looking for a Cluster & Infrastructure Engineer to build and operate large-scale AI clusters that power frontier-level training and inference workloads. You'll design reliable infrastructure for multi-node, multi-rack GPU and TPU systems, optimize cluster utilization and scheduling efficiency, and ensure fault tolerance at scale for SGLang and our production systems.
Requirements
4+ years of experience building and operating large-scale distributed systems or AI clusters
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or equivalent industry experience
Strong experience with cluster management systems: Kubernetes, Slurm, or custom schedulers
Hands-on experience running GPU or TPU clusters at scale
Solid understanding of networking, storage, and distributed systems fundamentals
Proficiency in Python, Go, or Bash with production-quality infrastructure-as-code practices
Production experience operating large clusters (1000+ GPUs/TPUs) is a big plus
Responsibilities
Build and operate large-scale AI clusters: Kubernetes, Slurm, schedulers, and resource management; GPU / TPU clusters and multi-node, multi-rack systems
Design reliable infrastructure for large-scale training and inference workloads
Improve cluster utilization, scheduling efficiency, and fault tolerance
Partner with systems and ML engineers to support frontier-scale workloads
Monitor, debug, and resolve infrastructure issues affecting training and serving reliability
Automate deployment, scaling, and maintenance of cluster infrastructure
Implement observability and alerting systems for cluster health and performance
Document infrastructure architecture, runbooks, and operational best practices
About RadixArk
RadixArk is an infrastructure-first company built by engineers who've shipped production AI systems at xAI, created SGLang (20K+ GitHub stars, the fastest open LLM serving engine), and developed Miles (our large-scale RL framework). We're on a mission to democratize frontier-level AI infrastructure by building world-class open systems for inference and training. Our team has optimized kernels serving billions of tokens daily, designed distributed training systems coordinating 10,000+ GPUs, and contributed to infrastructure that powers leading AI companies and research labs. We're backed by well-known investors in the infrastructure field and partner with Google, AWS, and frontier AI labs. Join us in building infrastructure that gives real leverage back to the AI community.
Compensation
We offer competitive compensation with significant founding team equity, comprehensive health benefits, and flexible work arrangements. The US base salary range for this full-time position is $180,000 to $250,000, plus equity and benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and demonstrated expertise in cluster management and AI infrastructure.
Equal Opportunity
RadixArk is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
