Site Reliability Engineer
- Design, create, and provision infrastructure
- Administer overall site availability, security, latency and system health
- Responsible for effective provisioning, installation/configuration, operation, and maintenance of services and system software and related infrastructure
- Administer the state of all components in our cloud and bare metal environments
- Deploy, manage, and operate the cloud environments
- Design, build, manage and operate the infrastructure and configuration of SaaS applications with a focus on automation and infrastructure as code
- Design, manage and operate the infrastructure as a service layer (hosted and cloud-based platforms) that supports the different platform services
- Develop comprehensive monitoring solutions that provide full visibility into the different platform components, using tools and services such as Kubernetes, Prometheus, Grafana, ELK, Datadog, and New Relic
- Create the environments and tooling that enable the development team to release code quickly and reliably
- Identify and troubleshoot availability and performance issues at multiple layers of deployment, from hardware to operating environment, network, and application
- Evaluate performance trends and expected changes in demand and capacity, and establish the appropriate scalability plans
- Troubleshoot and solve customer RPC issues
- Ensure that SLAs are met in executing operational tasks
- Work with development teams to ensure best practices for scalability, reliability, and security are designed and implemented from the start
- Conduct periodic on-call duties
- Great collaborator with 5+ years of experience in a DevOps or SRE role
- Deep understanding of infrastructure as code (Terraform, etc.) and of deploying large-scale systems reliably
- Strong experience with Infrastructure as Code and Configuration Management tools
- Experience with Prometheus/Grafana for metrics aggregation/visualization
- Experience configuring CI/CD pipelines
- Experience using Kubernetes
- Experience with automation tools/platforms
- Experience with alerting and monitoring tools
- Strong knowledge of monitoring and performance analytics tools (Datadog, New Relic, etc.)
- Commitment to implementing reliability and security best practices
- Capacity planning experience, including resource optimization and load testing
- Experience working in a highly distributed company is a plus
- Align a portion of your day with business hours in the Central Time Zone (UTC-6)
- Working knowledge of information security issues
- Experience building and managing virtualized systems (KVM, OVM, containers/Docker) and the ability to read and understand source code
- Systematic problem-solving approach, combined with a strong sense of ownership and drive
- Firm grasp of at least one modern programming language, beyond advanced scripting (Shell or Python)
- Working knowledge of web and network protocols and standards (HTTP, TLS, DNS, etc.)
- Experience writing automation tools & eagerness to "automate all the things"
- In three months, you have become our infrastructure administrator with respect to overall site availability, security, latency, system health, customer accounts, and billing. You have taken on independent code review responsibilities and are collaborating on the design of new features
- In six months, you have earned the trust of the team and are delivering tasks through the entire SDLC, from design through development with minimal guidance, and are helping to effectively mentor new engineers joining the team
- In twelve months, you have established a cadence of predictable, on-time delivery without cutting corners