Posted: 4 June
The role
Requirements
Are you a Site Reliability Engineer who loves the challenge of automating, operating and improving innovative cloud native service platforms?
Do you love digging into a production problem and seeing it through to resolution and follow-up?
You have a passion for identifying and solving problems in distributed environments, spanning configuration, the Linux operating system and networking
You have hands-on experience handling distributed environments (Kubernetes experience is a big plus).
You are interested in improving operational efficiency, and believe that automation is the key to operating large-scale systems. You are driven to ensure customer success
Software Development Engineer - Basic Qualifications:
BS in Computer Science or related field or equivalent years of experience
3 years of experience operating and troubleshooting distributed systems in a public cloud
3+ years of SRE experience in a distributed systems environment
Experience with AWS, GCP, or Azure
Strong experience with Kubernetes
Experience with Linux
Proficiency with a programming language such as Go (GoLang), Python, or Ruby, preferably Go
Experience with standard software development practices such as code management, CI/CD, and testing
Senior Associate Software Development Engineer - Basic Qualifications:
1 year of experience operating and troubleshooting distributed systems in a public cloud
1+ years of SRE experience in a distributed systems environment
Passionate about automation, with a track record of referenceable examples
Can work independently, with the mindset that everything can be automated
Skills to operate, maintain, support and sustain the platform
Energised by working in a fast-paced environment. Experience collaborating with multi-functional global and remote teams with a diverse set of backgrounds
Excellent documentation skills, with experience developing detailed runbooks and processes
What the job involves
The primary function of the SRE team is to ensure the reliability and availability of the platform to meet the desired SLAs, to reduce operational load, and to scale sustainably in alignment with business growth
Be a key member of a team of versatile SREs responsible for software engineering and operations, with an emphasis on reducing operational toil
Automation and improvement work is planned following Scrum practices, with two-week sprints
The Scrum team is autonomous, and the on-call function is follow-the-sun
The tech stack is Cloud Native (Kubernetes, Istio, OPA, GoLang, Prometheus, Grafana, etc.)
Responsible for the safe change and reliability of customer environments, with SLO-gated multi-stage deployment automation. The mission is to improve platform reliability, observability and overall customer happiness
Develop and launch effective SLIs to ensure that SLOs are achieved, by building an extensible observability architecture, automating runbooks, and establishing new processes
Partner with platform service teams to craft and implement a range of SRE standards for their respective services to meet. Define benchmarks and automation to qualify services to move to production environments
We’re the team that deploys, operates and supports our cloud native technology platform that was designed from scratch for the cloud
We own the reliability of the complete stack and tools that deliver and support Workday products across public clouds (e.g. AWS, GCP, Azure)
The platform is built using Cloud Native technologies (CNCF), on a foundation of Kubernetes in Public Cloud environments
This provides a secure platform on which Workday service teams and platform development teams can build and test their pre-release code, through to deployment to production, on a continuous basis
Engineers from this team have shared their experiences at Cloud Native conferences, including KubeCon