Sr. Site Reliability Engineer, Observability

Tesla Motors, Inc.
168,000 - 252,000 USD
paid holidays, flex time, 401(k)
United States, California, Fremont
Feb 21, 2026
What to Expect
You will be responsible for designing and building the enterprise-grade observability platform with a strong focus on metrics, providing end-to-end visibility and diagnostics across Tesla's infrastructure and applications. You will be part of the Observability team, which manages Tesla's system observability and ensures visibility across global and internal applications, including digital, manufacturing, fleet, and Autopilot platforms. This role requires deep expertise in system engineering, Kubernetes deployments, metrics platforms (including Grafana Mimir or equivalent), and logging platforms (Splunk). You will be responsible for ensuring the availability, performance, and scalability of a large, distributed metrics infrastructure that processes over a billion active time series.

What You'll Do
  • Build, deploy, scale, and maintain high-performance, multi-tenant, Prometheus-compatible monitoring systems that support billions of active time series
  • Develop custom observability solutions tailored to Tesla's unique requirements
  • Monitor cluster health using observability dashboards, optimize query performance, tune ingestion pipelines, and scale storage infrastructure to support long-term metrics retention
  • Design and implement next-generation observability platforms (metrics and logs) with a focus on scalability, reliability, and high performance
  • Manage large-scale distributed Splunk cluster environments handling over 500 TB of data daily
  • Collaborate with cross-functional teams, including SREs, architects, and other stakeholders, to understand complex application architectures and enable top-down monitoring strategies for comprehensive service visibility
  • Troubleshoot performance and access issues while managing metrics platforms (Grafana Mimir or equivalent), including installation and upgrades across clustered environments
  • Respond to and resolve support requests promptly while effectively balancing project timelines and competing priorities
  • Configure and manage CI/CD pipelines using tools such as Ansible and GitHub Actions to streamline operations
  • Participate in an on-call rotation to support critical systems outside regular business hours

What You'll Bring
  • Strong hands-on experience with observability stacks including Grafana Mimir, Prometheus, Cortex, Thanos, or equivalent enterprise-grade metrics platforms
  • Deep expertise in Linux system internals, large-scale performance tuning, and system administration
  • Solid hands-on experience with Kubernetes configuration, networking, deployment, and multi-cluster HA architectures
  • Advanced proficiency in PromQL and SQL, with a strong understanding of high-cardinality metrics, label design, and the impact of series explosion on storage and query performance
  • Experience with distributed systems architecture, multi-region deployments, and high-availability cluster design
  • Hands-on experience with S3-compatible object storage and with distributed streaming systems such as Apache Kafka or Redpanda
  • Strong knowledge of monitoring and observability practices including OpenTelemetry (OTLP), Protobuf, and Prometheus-based metrics collection
  • Experience configuring and tuning caching layers and managing authentication mechanisms (OAuth, reverse proxies, API gateways, mTLS)
  • Proven troubleshooting expertise and performance optimization experience in large-scale distributed metrics platforms; Splunk administration is a plus
  • Strong scripting and automation skills (Python, Ansible, GitHub Actions), excellent documentation practices, and participation in on-call and incident management processes

Compensation and Benefits
Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:

  • Medical plans, including plan options with a $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans, both with options for a $0 paycheck contribution
  • Company-paid Health Savings Account (HSA) contribution when enrolled in the High-Deductible medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
  • Company-paid Basic Life and AD&D insurance
  • Short-term and long-term disability insurance (90-day waiting period)
  • Employee Assistance Program
  • Sick and vacation time (flex time for salaried positions, accrued hours for hourly positions), and paid holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits, including critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
  • Weight Loss and Tobacco Cessation Programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program

Expected Compensation
$168,000 - $252,000/annual salary + cash and stock awards + benefits

Pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. The total compensation package for this position may also include other elements dependent on the position offered. Details of participation in these benefit plans will be provided if an employee receives an offer of employment.