Job Summary:
We are seeking a Senior Site Reliability Engineer (SRE) with 10–15 years of proven experience in building, managing, and maintaining highly available, scalable, and secure infrastructure across multi-cloud and hybrid cloud environments, including on-premises data centers.
The ideal candidate will have deep knowledge of SRE principles, strong hands-on experience in automation, observability, incident response, and infrastructure resilience, and the ability to architect solutions that span cloud and traditional data center environments.
Key Responsibilities:
- Design, implement, and manage reliable and scalable systems across public clouds (AWS, Azure, GCP) and on-premises data centers.
- Apply SRE best practices (SLIs, SLOs, error budgets, incident management, and postmortems) across cloud and non-cloud environments.
- Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or CloudFormation.
- Drive automation for deployment, scaling, monitoring, and infrastructure management.
- Implement and enhance observability practices (monitoring, logging, tracing) using tools such as Prometheus, Grafana, ELK, Datadog, and New Relic.
- Work with application teams to ensure high availability, performance, and cost optimization across hybrid environments.
- Lead and participate in on-call rotations and improve overall incident response processes.
- Collaborate with security and compliance teams to enforce best practices in data protection, access control, and system hardening in hybrid setups.
- Evaluate and recommend emerging tools and technologies for resilience engineering, disaster recovery, and infrastructure modernization.
Required Qualifications:
- 10–15 years of experience in SRE, DevOps, or infrastructure engineering roles.
- Proven experience managing infrastructure in multi-cloud (AWS, Azure, GCP) and hybrid cloud/on-prem environments.
- Solid understanding of networking, load balancing, storage, virtualization, and container orchestration (Kubernetes, Docker).
- Strong scripting and programming skills (e.g., Python, Go, Bash).
- Experience with CI/CD pipelines and tools such as Jenkins, GitLab CI, and ArgoCD.
- In-depth knowledge of SRE methodologies and real-world application of SLAs, SLOs, and error budgets.
- Hands-on experience with monitoring and observability stacks.
- Strong analytical and troubleshooting skills for production incidents across complex, distributed systems.