Manager of Engineering, SRE - San Diego, California
Company: Platform Science
Location: San Diego, California
Posted On: 01/30/2025

Who We Are
At Platform Science, we're working to connect everything that moves.
Founded in 2015, we are an open IoT platform that partners with innovative fleets, application developers, vehicle manufacturers, and equipment providers in the transportation industry to deliver revolutionary solutions to supply chain professionals across the globe.
Our employees are an engaging, diverse group of people who believe in the power of great ideas. We hire people with different experiences and perspectives to build a company culture that fuels growth through innovation.
We value thoughtful actions and empathy for others. We approach challenges with resiliency and creativity, while encouraging transparency because, no matter our backgrounds or responsibilities, we are one team.

About the Role
The Site Reliability Engineering (SRE) Manager will lead a high-performing team that ensures system reliability, scalability, and efficiency while championing SRE principles across the organization. This role involves coaching the team, promoting best practices, and enabling development teams to deliver observable, maintainable, and production-ready applications. The SRE Manager oversees multiple projects, requests, and initiatives while maintaining clear communication and keeping the team aligned and productive.

Essential Responsibilities
- Recruit, train, and mentor a team of Site Reliability Engineers to deliver operational excellence.
- Foster a culture of innovation, collaboration, and adherence to SRE principles like SLOs, error budgets, and production readiness.
- Standardize and train development teams on observability tools such as Prometheus, Grafana, and Datadog.
- Enhance developer and release workflows using CI/CD best practices, GitOps methodologies, and tools like Jenkins, ArgoCD, and Docker.
- Drive application and system resilience through chaos engineering, load testing, and automation.
- Collaborate with teams to define SLIs and SLOs and to manage error budgets.
- Manage on-call rotation schedules, optimize alerting processes, and ensure 24/7 production application support.
- Serve as the escalation point for incident resolution, providing guidance and technical expertise.
- Build tools, dashboards, and processes to improve incident response, production health, and system reliability.
- Conduct quarterly "State of the Service" reviews to assess performance, sustainability, and risks.
- Track and prioritize multiple initiatives while ensuring the team stays focused and aligned with organizational goals.
- Maintain detailed documentation on team projects, requests, policies, and best practices.
- Communicate effectively across teams, departments, and stakeholders to ensure alignment and a clear understanding of SRE initiatives.
- Evangelize SRE practices across the organization and ensure consistent adoption of reliability-focused processes.

Education and Experience
- 5+ years of experience in software engineering or SRE roles.
- 2+ years in a leadership or management position.
- Proven expertise with Kubernetes, ArgoCD, AWS, Prometheus, Grafana, Datadog, FluentD, Jenkins, and Docker.
- Strong knowledge of CI/CD and GitOps practices.
- Excellent verbal and written communication skills.
- Demonstrated ability to track and prioritize multiple projects, requests, and initiatives effectively.
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.

Platform Science Benefits Highlights
The company offers various benefits to regular, full-time employees including: