Senior / Principal Site Reliability Engineer
DataCrunch.com
Hybrid
Remote (US)
Full Time
Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models. In addition, our GPUs run on 100% renewable energy.
We’re ambitious, curious, and gutsy doers. We practice a low hierarchy across the company and high morale in our teams. We’ve already achieved a lot, yet we’re only getting started. Now it’s your chance to join the ride. We offer more than just the job - we offer a career-defining opportunity to be part of building something big!
As a cherry on top, we’ve recently raised $64M in Series A funding and are ready to reach new heights.
We’re seeking a Senior or Principal Site Reliability Engineer (SRE) to become our first U.S. hire, based in the Bay Area. This is a pivotal role as we expand our operations across the West Coast. You’ll work closely with our European engineering teams to scale our high-performance compute (HPC) and cloud infrastructure globally. As our initial U.S.-based engineer, you’ll set the standard for reliability, automation, and operational excellence.
- Generous cash + equity compensation, along with various fringe benefits (e.g., healthcare, lunch, wellbeing).
- Profitable operations, in addition to fast growth.
- A role with plenty of room both to make a business-critical impact and to grow, whether as a team lead or as an engineer.
- Small yet mighty team of 65, challenging the status quo to positively impact the lives of many people.
- 27 nationalities in total, with 6 different ones in the management team.
- Work mode: Remote (with plans to open our first U.S. office next year)
- Seniority level: Senior or Principal
- Employment type: Full-time, permanent
- Ensure the reliability, scalability, and performance of HPC and cloud systems.
- Build and maintain automation, observability, and monitoring frameworks for compute clusters.
- Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
- Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
- Participate in architecture design and long-term infrastructure strategy discussions.
- Help establish local infrastructure and contribute to the setup of our future San Francisco office.
- Play a key role in recruiting and mentoring as our U.S. team grows.
- 7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems.
- Linux expertise (Ubuntu or Debian preferred).
- Strong experience with scripting and automation (Python, Go, Bash).
- Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
- Deep understanding of networking (DNS, TCP) and infrastructure-as-code tools (Terraform, Ansible).
- Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.
- Familiarity with ML model training environments.
- Understanding of Kubernetes (nice to have).
- Intro chat with our Talent Acquisition Partner - an initial online conversation to learn more about you and share details about the role.
- Technical assignment - a short task (around 15 minutes) to understand your approach and problem-solving style.
- Online technical interview with the Hiring Manager - a deeper discussion about your technical experience and ways of working.
- In-person interview with one of our team members - a chance to get to know the team and our culture.
- Final interview with our CTO & CEO - to align on vision and expectations.
