Job summary
The Red Hat OpenShift Site Reliability Engineering (SRE) team is looking for a Senior Site Reliability Engineer to join our team in Beijing, China. In this role, you will work with Red Hat OpenShift, which is a leading enterprise Kubernetes container platform, as part of the first team to host and manage the code in the public cloud. You’ll play a key part within the team, as you’ll be responsible for keeping the Red Hat OpenShift Container Platform environment available and secure. Along with the rest of your team, you will interact with other service reliability engineers and product engineering associates around the world to deliver large, containerized cluster environments. You'll be responsible for provisioning, upgrades, problem detection and automated recovery scenarios, incident management, and understanding complicated, interconnected data points to resolve faults when issues arise. As a Senior Site Reliability Engineer, you’ll need to be able to work in a complicated and fast-paced environment while quickly learning new skills. In addition, you’ll create ways to consistently meet service-level agreements (SLAs) and keep the globally distributed, cloud-based, and containerized enterprise Kubernetes running smoothly for our customers.
Primary job responsibilities
- Interact with automated monitoring and healing infrastructure to ensure healthy environments
- Design and develop highly-available Red Hat OpenShift infrastructure components to meet the needs of our growing and evolving offering
- Join a development team on a rotation to help them reduce toil and increase availability
- Develop automation to autocorrect or completely prevent issues in our online solutions
- Participate in release cycles of our offerings, deploying code to integration, staging, and production environments, integrating with continuous integration (CI) and continuous delivery (CD) tools, monitoring, and providing change management
- Perform software updates, peer code reviews, testing, and common vulnerabilities and exposures (CVE) analysis; respond to security threats
- Identify single points of failure and other high-risk architecture issues; propose and implement more resilient solutions
- Resolve customer issues in cooperation with Red Hat's global customer support team
- Create and maintain standard operating procedures (SOPs) for performing maintenance tasks, applying configuration changes, and remediating problems in our environment
- Participate in a regular shift and on-call rotation; this will include a weekend working schedule
Required skills
- 2+ years of experience with programming languages like Go, C#, Java, PHP, Python, or Ruby
- 5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider like Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
- 3+ years of experience with enterprise system monitoring; knowledge of Zabbix or Nagios is a plus
- 3+ years of experience with enterprise configuration management software like Red Hat Ansible Automation, Puppet, or Chef
- Experience delivering a hosted service
- Demonstrated ability to quickly and accurately troubleshoot system issues
- Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
- Solid communication skills; experience working directly with and presenting to customers
- Experience with Kubernetes is a plus
- Experience with Docker-based containers is a plus