Key Responsibilities
* Ensure production readiness by implementing operational criteria such as availability, capacity, performance, monitoring, self-healing, and deployment automation.
* Partner with development and product teams to design and support secure, reliable, and scalable services.
* Lead and participate in incident triage, root cause analysis, and long-term remediation to minimize business impact.
* Drive automation for deployments, operations, and monitoring to reduce manual intervention.
* Implement and manage observability practices (monitoring, logging, alerting, tracing) to maintain high service availability.
* Proactively manage production and change activities to deliver the best possible customer experience.
* Collaborate with security, risk, and compliance teams to ensure secure operations across environments.
* Provide continuous feedback to development teams to improve system design and customer experience.
* Advocate for and contribute to the DevOps and SRE culture across the organization.
Skills & Qualifications
* Proven experience as a Site Reliability Engineer, Biz Ops Engineer, or DevOps Engineer.
* Strong knowledge of SRE principles and Standard Engineering Practices.
* Experience with CI/CD tools (Jenkins, GitLab CI, GitHub Actions, ArgoCD, etc.).
* Hands-on expertise with cloud platforms (AWS, Azure, GCP).
* Proficiency in containerization & orchestration (Docker, Kubernetes).
* Strong skills in monitoring, observability, and logging tools (Prometheus, Grafana, Splunk, ELK, Datadog, AppDynamics, etc.).
* Scripting and automation skills (Python, Bash, Go, or similar).
* Knowledge of risk management, compliance, and security best practices (ISO 27001, SOC 2, PCI DSS, NIST, GDPR, etc.).
* Strong problem-solving, incident management, and communication skills.
* Experience working in Agile/Scrum environments.