Job Summary:
We are seeking a Senior Site Reliability Engineer (SRE) with 10–15 years of proven experience in building, managing, and maintaining highly available, scalable, and secure infrastructure across multi-cloud and hybrid cloud environments, including on-premises data centers.
The ideal candidate will have deep knowledge of SRE principles, strong hands-on experience in automation, observability, incident response, and infrastructure resilience, and the ability to architect solutions that span cloud and traditional data center environments.
Key Responsibilities:
Design, implement, and manage reliable and scalable systems across public clouds (AWS, Azure, GCP) and on-premises data centers.
Apply SRE best practices, including SLIs, SLOs, error budgets, incident management, and postmortems, across cloud and non-cloud environments.
Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or CloudFormation.
Drive automation for deployment, scaling, monitoring, and infrastructure management.
Implement and enhance observability practices (monitoring, logging, tracing) using tools such as Prometheus, Grafana, ELK, Datadog, and New Relic.
Work with application teams to ensure high availability, performance, and cost optimization across hybrid environments.
Lead and participate in on-call rotations and improve overall incident response processes.
Collaborate with security and compliance teams to enforce best practices in data protection, access control, and system hardening in hybrid setups.
Evaluate and recommend emerging tools and technologies for resilience engineering, disaster recovery, and infrastructure modernization.
Required Qualifications:
10–15 years of experience in SRE, DevOps, or infrastructure engineering roles.
Proven experience managing infrastructure in multi-cloud (AWS, Azure, GCP) and hybrid cloud/on-prem environments.
Solid understanding of networking, load balancing, storage, virtualization, and container orchestration (Kubernetes, Docker).
Strong scripting and programming skills (e.g., Python, Go, Bash).
Experience with CI/CD pipelines and tools such as Jenkins, GitLab CI, and ArgoCD.
In-depth knowledge of SRE methodologies and real-world application of SLAs, SLOs, and error budgets.
Hands-on experience with monitoring and observability stacks.
Strong analytical and troubleshooting skills for production incidents across complex, distributed systems.