Job Description
Who are we
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation.
These services apply across a variety of industries, including banking and financial services, insurance, retail, higher education, food, healthcare, and manufacturing.
The Role
Plan, manage, and oversee all aspects of a production environment.
Define strategies for application performance monitoring and optimization in the production environment.
Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
Support deployment of code into multiple lower environments.
Support current processes with an emphasis on automating everything as soon as possible.
Design, develop, and standardize monitoring and alerting mechanisms for the supported applications.
Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recovery (MTTR).
Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.
Work with a global team spread across tech hubs in multiple geographies and time zones.
Ability to share knowledge and explain processes and procedures to others.
Share knowledge and mentor junior resources.
Able to perform on-call duties on a rotational basis.
Occasional off-hours work is required.
Requirements
L8 Positions NGFT
Key skills
Must have: Jenkins, Chef, Bash, Splunk, Dynatrace, Linux, Bitbucket, Problem Management, ITIL, Remedy
Good to have: Python, AWS
* Migrating to AWS
Key Responsibilities
What You'll Do:
•Demonstrate and innovate SRE practices by collaborating with stakeholders to implement important SRE principles and objectives and create new practices where applicable.
•Partner with product and platform teams to define and track service level objectives (SLOs) and indicators (SLIs).
•Monitor and manage system reliability performance, ensuring systems meet SLOs.
•Communicate reliability concerns and their potential impact with key stakeholders.
•Promote the prioritization of reliability throughout the software development life cycle.
•Design, code, test, and deliver solutions to automate manual operations.
•Participate in on-call rotations, provide support for SRE systems, and lead or participate in post-mortem incident analysis.
•Engage in system design, capacity planning, and architecture discussions to ensure operational requirements are met.
•Share lessons learned and best practices regarding reliability and performance with stakeholders and team members.
•Assist in training and mentoring fellow junior SREs to ensure best practices are followed and scaled within the organization.
•Pursue continuous improvement opportunities to stay up to date on SRE methods and trends and participate in organizational learning initiatives.
•Support governance and ensure compliance with policies by collaborating with security, compliance, and other teams.
•Respond promptly to requests for assistance from technical customers, providing engineering support and best-practice guidance.
•Adhere to and suggest improvements to standard operating procedures, and advocate for automation and workflow optimization.
Team Specific Skills
It is not expected that any single candidate would have expertise across all these areas, but a Biz Ops engineer will spend time throughout their career with various aspects of the role:
Operational Resiliency Architect:
•Support application health, performance, and capacity.
•Assist in system design consulting, capacity planning, and launch reviews.
•Collaborate with development and product teams to establish monitoring and alerting strategies.
DevOps/Automation:
•Engage in development, automation, and business process improvement.
•Support CI/CD pipelines and promote software into higher environments.
•Increase automation and tooling to reduce manual intervention.
ITSM Practices:
•Analyze ITSM activities and provide feedback to development teams on operational gaps or resiliency concerns.
•Perform root cause analysis of incidents and work with development teams to resolve issues.
Preferred Skills and Experience:
•Coding experience in one or more programming languages such as Java, Python, or Go.
•Familiarity with cloud platforms like AWS, Azure, or GCP.
•Experience with message queue (MQ) technologies such as RabbitMQ, Kafka, or similar.
•Experience with observability tools like Splunk, Dynatrace, Prometheus, or Datadog.
•Knowledge of industry-standard CI/CD tools like Git/Bitbucket, Jenkins, Maven, and Artifactory.
•Understanding of client-server relationships, network concepts, and operating system navigation.
•Familiarity with Kubernetes and configuration management tools.
General Skills and Competencies:
•Ability to work with development, operations, and product teams.
•Strong verbal and written communication skills, including the ability to explain technical issues to non-technical audiences.
•Critical thinking skills and a proactive approach to problem-solving.
•A mindset geared towards continuous improvement and learning.
•Ability to work effectively in a team and share best practices.
Requirements
SRE skills