
Head of Site Reliability Engineering


Clarifai is a leading, full-lifecycle deep learning AI platform for computer vision and natural language processing. We help organizations transform unstructured images, video, and text data into structured data, significantly faster and more accurately than humans would be able to do on their own. Founded in 2013 by Matt Zeiler, Ph.D., Clarifai has been a market leader in AI since winning the top five places in image classification at the 2013 ImageNet Challenge. Clarifai continues to grow with more than 100 employees and offices in New York, San Francisco, and Tallinn, Estonia.

 

Your Impact

We are looking for a Head of Site Reliability Engineering who will partner with our AI experts to scale a bleeding-edge platform across multi-cloud, bare metal, and edge. You will ensure we adhere to the highest security posture for commercial and public sector clients. You will help Clarifai researchers and engineers iterate quickly and ship high-quality products. You will build world-class training and inference clusters relied upon not only by our own researchers, but also by the world’s biggest organizations and developers everywhere. You will lead a program that enables our software development with the ease and speed of the world’s most successful startups. You will support the entire software development lifecycle, including core research & development tools, environments, build tools, and CI/CD pipelines.

Your Role

  • Lead: Drive cross-team and cross-org strategic direction, alignment, and oversight of reliability initiatives.
  • Mentor: Mentor a highly skilled team of infrastructure, security and IT engineers. Hire, supervise, develop, evaluate, mentor and coach a geographically distributed team. Cultivate an environment of continual learning and growth for group members.
  • Champion: Establish standard practices and processes for planning and prioritizing reliability work and champion a culture of reliability.
  • Partner: Work with senior engineering, research and product leadership to plan and ensure key initiatives are implemented for both on-premise and cloud services (GCP/AWS/Azure).
  • Budget: Develop, track, and control the information technology annual operating and capital budgets.
  • Optimize: Continuously identify opportunities for improvements, expansion, and/or reduction of services and/or costs.
  • Storytell: Use effective communication strategies such as dashboards and visual analytics and create a data-driven culture.
  • Innovate: Research current and new industry trends, technologies, and software development practices.
  • Secure: Ensure mitigation of security vulnerabilities and risks across all our managed systems, applications, and services. Contribute, directly or indirectly, to the development of policies, standards and guidelines to ensure our product meets all security requirements.
  • Operate: Develop, maintain and monitor effective operational processes to prevent failures of infrastructure, systems, applications and services.
  • Commit: Lead incident investigations and drive efforts to identify and fix root causes within our SLAs. Promote and use a data-driven approach within your group to ensure SLAs are met and to drive understanding and improvement within your areas of responsibility.
  • Recover: Lead the planning, implementation, and documentation of disaster recovery and business continuity efforts.

 

Qualifications

    • A minimum of a Bachelor's degree required. Master’s degree preferred.
    • At least 10 years of experience across core technical requirements.
    • At least 4 years of experience in management of professional technical staff positions and formulating a team's technical strategy and roadmap.
    • Proficiency in Python, Golang, C++, Java and/or shell scripting.
    • Working understanding of modern security vulnerabilities and best practices.
    • Experience with 24/7/365 distributed-site monitoring and first-response support for
    • Experience deploying container orchestration (e.g. Kubernetes, GKE, EKS).
    • Experience debugging and operating common cloud datastores (RDS, Cloud SQL, Redshift) or their open source alternatives.
    • Experience with configuration management systems such as Ansible, Puppet or Terraform.
    • Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols.
    • Expertise with CI/CD pipelines.
    • Experience with distributed computing and storage (e.g. Hadoop, Spark, HDFS, Ceph).
    • Experience with large server environments (1000+ servers) that are geographically dispersed.
    • Deep understanding of advanced development engineering practices around automation, code testing, and SRE principles.
    • Demonstrated experience with data centers, network design, application services and technologies, security and compliance.





