Overview
This position is for an SRE Problem and Knowledge Management Team Lead within the Site Reliability Engineering and Governance (SRE & Governance) department, an enabling group.
The role strategically leads incident retrospective and problem management operations, along with other SRE activities related to maintenance management, including availability, performance, change management, monitoring, capacity planning, and solutions derived from emergency response.
The Team Lead ensures retrospective activities are orchestrated effectively while promoting a blameless culture in line with SRE principles.
Responsibilities
Mentor the team in the seamless facilitation & conduct of root cause analysis (RCA) activities from end to end
Lead facilitation for high-severity incidents, liaising with senior management and keeping them updated
Serve as the primary focal point for presenting at the RCA Forum, Tech Risk Forum and other senior management meetings, reporting updates on retrospective findings & action plans
Absorb new technology rapidly & apply effectively
Communicate well with technical & non-technical colleagues
Work to a high standard within agreed timescales
Undertake any other tasks or duties as reasonably requested by supervisor or senior management
Manage resources to ensure problem management activities are carried out effectively and efficiently
Provide platforms and channels to ensure stakeholders are kept updated on results of retrospectives and RCA activities
Demonstrate authority in problem management calls
Serve as point of contact for assigned higher-severity incidents (from incident retrospective calls to Management Report documentation and publishing)
Take accountability for initiatives on enhancement activities related to SRE as a result of retrospectives
Collaborate with Engineering Teams within SRE and with LOBs on enabling activities as part of preventive measures
Requirements
Minimum 15 years of process improvement / RCA exposure & involvement leading discussions as a problem manager or incident commander, preferably in the Technology & Operations space
Experience with JIRA, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3, Cloud Computing
Good exposure to logging & monitoring tools such as Dynatrace, Prometheus, Grafana, and the ELK Stack
In-depth understanding of Incident & Problem Management functions & activities (hardware- & software-related)
Work with stakeholders & command centre in troubleshooting, escalating & solutioning critical site incidents
Identify recurring system/application issues & collaborate with cloud, infra teams, product development, vendors & other stakeholders
Maintain accurate documentation of incidents including impact details, timelines, and mitigation/resolution steps
Strong verbal & written communication skills, especially effective documentation
Minimum 10 years of software development, technical support, or operations experience
Basic knowledge of Linux, AIX, Solaris and Windows
Exposure to Enterprise databases (e.g., Oracle, SQL Server, MariaDB, MongoDB & Sybase)
Knowledge of systems & multi-tier application & network troubleshooting
Essential knowledge & awareness of Public/Private/Hybrid cloud solutions
Job Information
Primary Location: Singapore-DBS Asia Hub
Job: Technology
Schedule: Regular
Job Type: Full-time
Job Posting: Oct 7, 2025, 10:17:56 PM
Seniority level: Not Applicable
Employment type: Full-time
Job function: Information Technology
Industries: Banking, Financial Services, and Investment Banking