About the Company
We are building distributed observability and multi-cloud intelligence software for modern AI and data-intensive systems. Our platform enables engineering teams to monitor, debug, and optimize workloads operating across regions, cloud providers, and heterogeneous compute environments.
Reliability, performance, and operational clarity are foundational to our product. Infrastructure is not a support layer — it is core to what we deliver.
We are a small, senior engineering team operating in a high-ownership environment where architectural decisions have direct product impact.
The Role
We are hiring a Senior DevOps / Infrastructure Engineer to design, scale, and operate the systems that power our observability platform.
You will own multi-region cloud deployments, high-ingest telemetry pipelines, system reliability, and production observability. This role suits engineers who are comfortable working deep in distributed systems and who want to shape long-term infrastructure strategy.
What You’ll Own
- Multi-region, multi-cloud infrastructure supporting real-time observability
- High-throughput ingestion pipelines and event-driven architectures
- Autoscaling, failover, and fault-tolerant system design
- CI/CD pipelines and deployment automation
- Production SLOs, incident response, and reliability engineering
- Metrics, logs, tracing, and alerting systems
- Secure, scalable API and service infrastructure
- Infrastructure-as-code and environment lifecycle management
Key Responsibilities
- Architect and operate multi-region deployments across AWS, GCP, or Azure
- Build and maintain high-throughput telemetry ingestion pipelines
- Design autoscaling and failover strategies for mission-critical services
- Own observability systems including Prometheus, Grafana, and distributed tracing
- Reduce mean time to recovery (MTTR) and strengthen operational readiness processes
- Manage CI/CD pipelines, GitOps workflows, and automated deployments
- Collaborate with backend teams on API performance and infrastructure reliability
- Harden infrastructure for security, compliance, and tenant isolation
- Drive the long-term infrastructure roadmap and architectural direction
Requirements
Required Qualifications
- Deep experience with Kubernetes, Docker, and container orchestration
- Strong background in distributed systems and multi-region architectures
- Experience with high-ingest, streaming, or event-driven systems
- Hands-on experience with Prometheus, Grafana, and tracing/alerting frameworks
- Proficiency with Terraform or similar infrastructure-as-code tools
- Experience building and maintaining CI/CD pipelines
- Strong understanding of AWS, GCP, or Azure
- Proficiency in Python or Go for automation and tooling
- Experience operating high-availability, production-critical systems
Preferred Experience
- Cloudflare (DNS, CDN, WAF, SSL)
- Helm, Kustomize, or similar Kubernetes tooling
- Experience with time-series databases, vector databases, or high-throughput storage systems
- Background in SRE, platform engineering, or observability tooling
- Experience supporting AI/ML workloads or GPU-based systems
- Familiarity with OpenTelemetry, Jaeger, or similar distributed tracing frameworks
Benefits
What We Offer
- Significant ownership over core infrastructure decisions
- A senior engineering team with low overhead and direct collaboration
- Fast-paced environment with measurable impact
- Competitive compensation and meaningful equity
- Opportunity to architect infrastructure for a category-defining observability platform