Sr. Site Reliability Engineer, Observability

Tesla Motors, Inc.
168,000 - 252,000 USD
paid holidays, flex time, 401(k)
United States, California, Fremont
Mar 14, 2026
What to Expect

You will be responsible for designing and building an enterprise-grade observability platform with a strong focus on metrics, providing end-to-end visibility and diagnostics across Tesla's infrastructure and applications. You will be part of the Observability team, which ensures visibility across Tesla's internal and global applications, including digital, manufacturing, fleet, and Autopilot platforms. This role requires deep expertise in systems engineering, Kubernetes deployments, metrics platforms (including Grafana Mimir or equivalent), and logging platforms (Splunk). You will be responsible for ensuring the availability, performance, and scalability of a large, distributed metrics infrastructure that processes over a billion active time series.

What You'll Do

- Build, deploy, scale, and maintain high-performance, multi-tenant, Prometheus-compatible monitoring systems that support billions of active time series
- Develop custom observability solutions to address unique Tesla requirements
- Monitor cluster health using observability dashboards, optimize query performance, tune ingestion pipelines, and scale storage infrastructure to support long-term metrics retention
- Design and implement next-generation observability platforms (metrics and logs) with a focus on scalability, reliability, and high performance
- Manage large-scale distributed Splunk clustered environments handling over 500 TB of data daily
- Collaborate with cross-functional teams, including SREs, architects, and other stakeholders, to understand complex application architectures and enable top-down monitoring strategies for comprehensive service visibility
- Troubleshoot performance and access issues while managing metrics platforms (Grafana Mimir or equivalent), including installation and upgrades across clustered environments
- Respond to and resolve support requests promptly while effectively balancing project timelines and competing priorities
- Configure and manage CI/CD pipelines using tools such as Ansible and GitHub Actions to streamline operations
- Participate in an on-call rotation to support critical systems outside regular business hours

What You'll Bring

- Strong hands-on experience with observability stacks including Grafana Mimir, Prometheus, Cortex, Thanos, or equivalent enterprise-grade metrics platforms
- Deep expertise in Linux system internals, large-scale performance tuning, and systems administration
- Solid hands-on experience with Kubernetes configuration, networking, deployment, and multi-cluster HA architectures
- Advanced proficiency in PromQL and SQL, with a strong understanding of high-cardinality metrics, label design, and the impact of series explosion on storage and query performance
- Strong knowledge of monitoring and observability practices including OpenTelemetry (OTLP), Protobuf, and Prometheus-based metrics collection
- Experience with distributed systems architecture, multi-region deployments, and high-availability cluster design
- Good to have: familiarity with S3-compatible object storage and exposure to distributed streaming systems such as Apache Kafka or Redpanda
- Good to have: knowledge of configuring and managing authentication mechanisms (OAuth, reverse proxies, API gateways, mTLS)
- Proven troubleshooting expertise and performance optimization experience in large-scale distributed metrics and logs platforms (Splunk; Cribl administration is a plus)
- Strong scripting and automation skills (Python, Ansible, GitHub Actions), excellent documentation practices, and participation in on-call and incident management processes

Compensation and Benefits

Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:

- Medical plans, including plan options with a $0 payroll deduction
- Family-building, fertility, adoption, and surrogacy benefits
- Dental (including orthodontic coverage) and vision plans, both of which have options with a $0 paycheck contribution
- Company-paid Health Savings Account (HSA) contribution when enrolled in the High-Deductible medical plan with HSA
- Healthcare and Dependent Care Flexible Spending Accounts (FSA)
- 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
- Company-paid Basic Life, AD&D
- Short-term and long-term disability insurance (90-day waiting period)
- Employee Assistance Program
- Sick and Vacation time (Flex time for salary positions, accrued hours for hourly positions), and Paid Holidays
- Back-up childcare and parenting support resources
- Voluntary benefits, including critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
- Weight Loss and Tobacco Cessation Programs
- Tesla Babies program
- Commuter benefits
- Employee discounts and perks program

Expected Compensation

$168,000 - $252,000/annual salary + cash and stock awards + benefits

Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.