Cluster & Infrastructure Engineer

$180,000 - $250,000

4+ years experience

Apply Now

About The Role

RadixArk is looking for a Cluster & Infrastructure Engineer to build and operate large-scale AI clusters that power frontier-level training and inference workloads. You'll design reliable infrastructure for multi-node, multi-rack GPU and TPU systems, optimize cluster utilization and scheduling efficiency, and ensure fault tolerance at scale for SGLang and our production systems.

Requirements

  • 4+ years experience building and operating large-scale distributed systems or AI clusters

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or equivalent industry experience

  • Strong experience with cluster management systems: Kubernetes, Slurm, or custom schedulers

  • Hands-on experience running GPU or TPU clusters at scale

  • Solid understanding of networking, storage, and distributed systems fundamentals

  • Proficiency in Python, Go, or Bash with production-quality infrastructure-as-code practices

  • Production experience operating large clusters (1000+ GPUs/TPUs) is a big plus

Responsibilities

  • Build and operate large-scale AI clusters:

    • Kubernetes, Slurm, schedulers, and resource management

    • GPU / TPU clusters, multi-node, multi-rack systems

  • Design reliable infrastructure for large-scale training and inference workloads

  • Improve cluster utilization, scheduling efficiency, and fault tolerance

  • Partner with systems and ML engineers to support frontier-scale workloads

  • Monitor, debug, and resolve infrastructure issues affecting training and serving reliability

  • Automate deployment, scaling, and maintenance of cluster infrastructure

  • Implement observability and alerting systems for cluster health and performance

  • Document infrastructure architecture, runbooks, and operational best practices

About RadixArk

RadixArk is an infrastructure-first company built by engineers who've shipped production AI systems at xAI, created SGLang (20K+ GitHub stars, the fastest open LLM serving engine), and developed Miles (our large-scale RL framework). We're on a mission to democratize frontier-level AI infrastructure by building world-class open systems for inference and training. Our team has optimized kernels serving billions of tokens daily, designed distributed training systems coordinating 10,000+ GPUs, and contributed to infrastructure that powers leading AI companies and research labs. We're backed by well-known investors in the infrastructure field and partner with Google, AWS, and frontier AI labs. Join us in building infrastructure that gives real leverage back to the AI community.

Compensation

We offer competitive compensation with significant founding-team equity, comprehensive health benefits, and flexible work arrangements. The US base salary range for this full-time position is $180,000 - $250,000, plus equity and benefits. Salary ranges are determined by location, level, and role; individual compensation is determined by experience, skills, and demonstrated expertise in cluster management and AI infrastructure.

Equal Opportunity

RadixArk is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Copyright © 2025 RadixArk

contact@radixark.ai
