Job Description:
As a Cloud Site Reliability Engineer, you will be instrumental in ensuring the reliability, scalability, and performance of our hybrid cloud infrastructure across Azure and AWS.
You will collaborate with engineering and cloud platform teams to build resilient, observable, and automated systems that support rapid delivery and high availability of services.
Key Responsibilities:
- Lead SRE initiatives to improve availability, reliability, and performance of cloud-native and hybrid applications.
- Design and implement observability frameworks across Azure and AWS using tools like CloudWatch, Azure Monitor, Prometheus, and Grafana.
- Drive automation and infrastructure-as-code practices to reduce operational toil and streamline deployments.
- Collaborate with application teams to define and implement SLIs, SLOs, and Error Budgets for cloud-hosted services.
- Champion chaos engineering and resilience testing across Azure and AWS environments.
- Work with enterprise teams to deploy and scale SRE enablers such as service mesh, auto-scaling, and CI/CD pipelines.
- Establish and enforce cloud infrastructure deployment standards, including blue-green and canary deployments.
- Support cloud migration strategies, cutover planning, and testing for applications transitioning between Azure and AWS.
Requirements:
- Minimum 10 years of experience in SRE or Cloud Engineering, preferably within the banking or financial services sector.
- Deep expertise in Azure and AWS cloud platforms, including compute, networking, storage, and security services.
- Strong understanding of ITIL and SRE frameworks, with the ability to integrate traditional operations with modern cloud practices.
- Proven leadership in coordinating with application teams and vendors for cloud deployment and migration planning.
- Hands-on experience with infrastructure-as-code tools (e.g., Terraform, Bicep, CloudFormation) and scripting (Bash, Python).
- Certifications in AWS (e.g., Solutions Architect, DevOps Engineer) and Azure (e.g., Azure Administrator, Azure Solutions Architect) are highly desirable.
- Experience with monitoring and alerting tools across both cloud platforms.
- Solid grasp of SRE principles: toil reduction, SLIs/SLOs, Error Budgets, MTTD/MTTR.
- Strong interpersonal and communication skills to foster collaboration across teams and stakeholders.
- Agile mindset with experience in DevOps, CI/CD, and cloud-native development practices.