About The Role
RadixArk is looking for a Diffusion Inference Engineer to optimize high-performance serving systems for image and video generation models. You'll push the limits of inference efficiency for models like Flux-1, Flux-2, Wan 2.1/2.2, and next-generation architectures, integrating them into SGLang's serving infrastructure. This role focuses on making diffusion inference faster, cheaper, and more scalable in production.
Requirements
3+ years of experience building ML inference systems for generative models, computer vision, or large-scale serving
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or equivalent industry experience
Strong understanding of diffusion models: sampling algorithms, noise schedules, latent diffusion, guidance techniques
Experience optimizing transformer-based (DiT/Flux) or U-Net architectures for inference
Proficiency in Python and PyTorch with production-quality code standards
Familiarity with model optimization techniques: quantization, flash attention, kernel fusion
Experience with CUDA, Triton, or GPU performance profiling is a plus
Understanding of VAEs, attention mechanisms, and multi-modal architectures
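To ground the diffusion fundamentals listed above (sampling, noise schedules), here is a minimal plain-Python sketch of the standard DDPM linear-beta schedule and its cumulative signal-retention products. The function names are illustrative, not from any RadixArk codebase:

```python
def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linearly spaced per-step noise variances (betas), the common DDPM default."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta): the fraction of signal surviving after t steps."""
    out, prod = [], 1.0
    for b in betas:
        prod *= (1.0 - b)
        out.append(prod)
    return out

betas = linear_beta_schedule(1000)
abar = alpha_bars(betas)
# abar decreases monotonically from near 1.0 toward ~0 at the final timestep,
# which is what samplers like DDIM and DPM-Solver consume to step from noise to image.
```

Real serving code would precompute these as GPU tensors, but the scheduling math a candidate is expected to know is exactly this.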
Responsibilities
Build and optimize high-performance serving systems for image and video generation models (Flux-1, Flux-2, Wan 2.1/2.2, Qwen Image Edit, Zed Image Turbo)
Implement efficient sampling algorithms: DDPM, DDIM, DPM-Solver, Euler, and custom schedulers
Optimize inference latency and throughput for text-to-image, image-to-image, and video generation workloads
Design memory-efficient serving architectures for high-resolution generation and long video sequences
Integrate diffusion models into SGLang with batching, caching, and scheduling optimizations
Profile and optimize model components: VAE encoding/decoding, DiT/U-Net forward passes, attention layers
Implement quantization and mixed-precision strategies (FP16, BF16, INT8) for production serving
Collaborate with kernel engineers to optimize attention, convolution, and sampling operations
Build benchmarks comparing inference performance across hardware (H100, B200, TPU) and configurations
Support multi-model serving pipelines: LoRA adapters, ControlNet, IP-Adapter integrations
Contribute optimizations to open-source diffusion serving frameworks
Write technical documentation on diffusion inference optimization and deployment best practices
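As a flavor of the quantization work in the responsibilities above, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is illustrative only: the function names are hypothetical, and production serving would use calibrated, fused GPU kernels rather than Python lists:

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8: scale so the max |value| maps to 127."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from INT8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.27]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
# Per-element round-trip error is bounded by scale / 2.
```

The same scale-and-round idea, applied per-channel or per-block with calibration data, is what underlies INT8 serving of DiT/U-Net weights and activations.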
About RadixArk
RadixArk is an infrastructure-first company built by engineers who've shipped production AI systems at xAI, created SGLang (20K+ GitHub stars, the fastest open LLM serving engine), and developed Miles (our large-scale RL framework). We're on a mission to democratize frontier-level AI infrastructure by building world-class open systems for inference and training. Our team has optimized kernels serving billions of tokens daily, designed distributed training systems coordinating 10,000+ GPUs, and contributed to infrastructure that powers leading AI companies and research labs. We're backed by well-known investors in the infrastructure field and partner with Google, AWS, and frontier AI labs. Join us in building infrastructure that gives real leverage back to the AI community.
Compensation
We offer competitive compensation with significant founding team equity, comprehensive health benefits, and flexible work arrangements. The US base salary range for this full-time position is $180,000 - $250,000, plus equity and benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and demonstrated expertise in diffusion inference and ML infrastructure.
Equal Opportunity
RadixArk is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, or any other protected characteristic.
