
Remote Senior Site Reliability Engineer


We’re looking for someone to help improve our reliability and performance through deep analysis and remediation of our AWS infrastructure, monitors, alerts, and code.



 Key Responsibilities 

  • Refactor our existing monitors and alerts to be actionable and reliable, recommending and implementing diagnostic techniques and monitoring tools.
  • Perform deep-dive analysis of RDS (Aurora PostgreSQL) performance, using that data to inform scaling policies and automation
  • Help discover correlations between customer experience and performance indicators to determine what customers actually notice, then suggest and implement improvements based on the findings
  • Help us develop SLIs, SLOs, and SLAs that meaningfully reflect our customers’ experience
  • Help triage outages and issues across multiple teams, services, and codebases as they arise, leading root cause analysis and creating stories to prevent and/or detect those issues in the future
  • Serve as technical lead for deep dives to identify solutions to prevent future incidents
  • Introduce chaos engineering, promoting experimentation in production to discover and remediate systemic weaknesses and improve performance and reliability



 Skills, Knowledge, and Expertise 

  • Expertise in AWS
  • Expertise with RDS, preferably Aurora PostgreSQL engine
  • Expertise with containerization
  • Experience with open source monitoring and visualization systems and tools, e.g. Prometheus (monitoring), Grafana/Kibana (dashboards), Graylog (logging)
  • Experience implementing, maintaining, and troubleshooting continuous integration/continuous delivery (CI/CD) tooling
  • Experience with implementing improvements in areas such as maintainability, scalability, availability, extensibility and security
  • Ability to work with many teams across disciplines (cloud, platform, development, QA, and security) to resolve issues as they arise and implement improvements
  • Experience with distributed tracing, diagnostic tooling, application performance monitoring, and the golden signals



 Our Stack 



 

Our stack is evolving over the next year and we’d love you to be a part of that! 

Currently we’re using:

  • Back-end: JavaScript/TypeScript, Node.js, ES6, Go
  • Data: Aurora PostgreSQL, Redis, Elasticsearch
  • DevOps & Deployment: All things AWS, Terraform (and Terraform Cloud), Jenkins, GitHub, Grafana, Graylog
  • Testing: Playwright, Mocha, Jest
  • Front-end: Vue.js, Webpack, SCSS




