Responsibilities:
• Maintain platforms or products after go-live by measuring and monitoring their availability, performance and overall system health.
• Recover platforms or products during production incidents to meet targeted service-level agreements.
• Set up, enhance and maintain observability tools.
• Assist in incident response, perform root cause analysis, and write postmortem documentation.
• Develop tools/applications/scripts to improve operational efficiency.
• Maintain and enhance CI/CD pipelines.
• Collaborate with software engineers to design scalable and resilient systems.
• Participate in on-call and on-site rotations and contribute to reducing alert fatigue.
• Document processes, configurations, and best practices.
• Support other software efficiency improvement initiatives.
Requirements:
• 1-3 years' experience in software development, DevOps or SRE.
• Curious, a strong communicator, ready to work in a fast-paced environment, and willing to pick up new skills and technologies as necessary.
• Degree in Electrical / Electronics / Computer Engineering / Computer Science or a relevant discipline.
• Basic understanding of Linux/Unix systems and shell scripting.
• Familiarity with cloud platforms (e.g., AWS, Azure, GCP).
• Exposure to containerization tools (e.g., Docker, Kubernetes).
• Experience with monitoring tools (e.g., Prometheus, Grafana, ELK).
• Knowledge of CI/CD tools (e.g., Jenkins, GitLab, Bitbucket, Jira).
• Programming/scripting skills in Python, Java, or Bash.
• Understanding of networking fundamentals and system security.
• Self-motivated, independent and a good team player.
• Able to work under pressure in a fast-paced environment.
• Innovative, proactive mindset with a focus on continuous improvement.
• Strong analytical and problem-solving skills.