As the Site Reliability Engineer on our team, you are responsible for the operational support of our products and for front-line customer management of system-related incident reports. You will ensure maximum uptime and performance for our deployments and perform any required system updates in the form of patches, hotfixes, or upgrades. You are also part of the escalation and technical support processes, helping to prioritize technical issues and coordinate resolutions with our clients.

Who you are

• You have a bias toward automation, especially for deployment and configuration

• You are detail- and stability-oriented

• You enjoy designing and implementing automation code

• You follow production deployment procedures and design the best technical approach

• You document your work and contribute to the knowledge base

• You are able to identify the signal in the noise

• You are able to ruthlessly prioritize

• You consider yourself a highly skilled generalist

• You enjoy collaborating in a multicultural and diverse environment but also function well autonomously

• You are able to communicate effectively, clearly, and in a timely manner

• You like the adventure of work travel

• You have a passion for solving hard problems

What you’ll need

• Minimum of five (5) years of product administration experience in Linux environments

• Proficiency in enterprise configuration management solutions (e.g., Puppet, Chef, Ansible)

• Proficiency in automation using languages such as Python or Go

• Minimum of five (5) years of related experience as a Hadoop administrator, with expert-level knowledge of Cloudera Hadoop components such as HDFS, Sentry, HBase, Impala, Hue, Spark, Hive, Kafka, YARN, and ZooKeeper

• Prior Hadoop cluster deployment experience, including adding and removing nodes, troubleshooting failed jobs, configuring and tuning clusters, and monitoring critical parts of the cluster

• Experience with containers and container orchestration

• Experience managing and configuring virtualization

• Experience with enterprise monitoring and alerting solutions (e.g., Nagios, Prometheus…)

• Experience deploying and maintaining hardware

• Minimum of two (2) years of experience managing Elasticsearch clusters

• Knowledge of best practices for data warehousing, including business intelligence and business continuity planning

Bonus if you have

• Experience working in an Agile environment (CSD, CSM, SA, or ASE certification)

• General knowledge of network technologies and configuration

• Experience working in Cloud infrastructure

Site Reliability Engineer (Mid-Level)