Key Role in Cloud Infrastructure Operations
About the Job
As a critical member of our team, you will play a pivotal role in keeping our cloud storage infrastructure running smoothly. This includes maintaining site or service uptime, collaborating with internal partners to meet security, SLA, and performance requirements, and automating work to improve efficiency.
Responsibilities:
* Manage your assigned site or service to ensure uptime and quick recovery from failure
* Work closely with internal partners to ensure infrastructure meets security, SLA, and performance requirements
* Automate repetitive tasks including infrastructure needs, testing, failover solutions, failure mitigation, and more
* Debug complex problems across the entire stack and create effective solutions
* Continuously test application and infrastructure resiliency under a variety of error conditions
* Partner with security engineers to develop plans and automation to respond to new risks and vulnerabilities
* Develop, communicate, and monitor standard processes that promote the long-term health of operational development tasks
* Stand up and maintain pre-production and developer environments to support the development organization and improve team velocity
Requirements:
* Knowledge of compute: Linux (RHEL), hardware, virtualization, and OS installation
* Experience with Ceph software-defined storage (SDS)
* Experience troubleshooting Ceph and hardware in SDS environments
* Solid understanding of cloud infrastructure and operations
* UNIX/Linux shell proficiency, scripting (shell), and understanding of Linux internals
* Expertise in Ansible, Bash, and core Python development; strong familiarity with C, C++, Golang, or Java
* Experience with DevOps engineering or SRE; containers (Docker, Kubernetes); and standard monitoring/observability tools
What We Offer
We offer a challenging role in a dynamic environment, opportunities for growth and development, collaboration with experienced professionals, and the chance to make a significant impact on our cloud infrastructure operations.