
Sr. Manager, Site Reliability Engineering


*This position can be remote within the United States/Canada*

What we’re looking for…

We are looking for an experienced technical manager to lead the fast-growing Systems Architecture group within the SRE team. This distributed group currently includes 7 engineers and is tasked with solving complex operational challenges such as observability, capacity planning, and setting up SLIs/SLOs/SLAs, and with working closely with Product to transform the SaaS product.

Who we are…

ScienceLogic is going through a product transformation, and the Site Reliability team is at the forefront of it. We are responsible for the design, deployment, and maintenance of the Cloud Infrastructure that runs the company’s revenue-generating, go-forward SaaS product line.

Overall, we’re passionate about automation and solving complex business and technology challenges. Our team combines SRE, DevOps, Software Development, and Information Security knowledge to help make Site operations agile and elastic within the boundaries of the security and governance framework.

What you’ll be doing…

  • Report to the Head of the Site Reliability team. Participate in the management team and contribute to team strategy and roadmap planning
  • Define and execute on a cohesive observability strategy for SaaS workloads with a focus on incident detection and performance measurement
  • Participate in SRE retrospectives to conduct postmortems on incidents and outages – take ownership and deliver on action items to avoid recurrence. Occasionally write customer-facing RCA (Root Cause Analysis) documents
  • Iterate on and improve the current Capacity Management program so that SaaS workloads hosted in AWS Public and GovCloud run in a performant yet cost-effective manner
  • Identify process gaps and improvement opportunities to deliver the SaaS product with a committed uptime SLA of 99.9%
  • Participate in the PRR (Production Readiness Review) program with Product to certify new features for Beta/CA releases from the SRE team’s perspective
  • Provide valuable feedback for product improvement by collaborating with the Product and Software Engineering organization
  • Help deliver on corporate priorities including, but not limited to, retiring private data centers by the end of 2023
  • Participate in shared on-call manager rotation for escalations during incidents and outages
  • Coach and mentor team members by providing appropriate guidance

What you’ll bring…

  • 3+ years of experience managing a fast-paced team of 5 or more individual contributors
  • SRE principles. Adheres to basic principles like Continuous Improvement and Automate Everything to continually improve processes and reduce operational toil
  • Data driven. Collects and uses data to support decisions
  • Clarity in communication. Provides and receives feedback in a timely manner, and presents dissenting opinions in a kind and inclusive manner
  • Experience with AWS public cloud and exposure to VMware private cloud
  • Hands-on technical experience in a large-scale production environment

About ScienceLogic

ScienceLogic is a leader in IT Operations Management, providing modern IT operations with actionable insights to resolve and predict problems faster in a digital, ephemeral world. Its solution sees everything across cloud and distributed architectures, contextualizes data through relationship mapping, and acts on this insight through integration and automation.

www.sciencelogic.com
