The Role:
This position is for an SRE Problem and Knowledge Management Team Lead within the enabling group, the Site Reliability Engineering and Governance (SRE & Governance) department. The role is expected to strategically lead incident retrospective/problem management operations and to support other SRE activities pertaining to maintenance management, including availability, performance, change management, monitoring, capacity planning & the solutions derived from emergency response. The Team Lead makes sure that retrospective activities are orchestrated & carried out effectively while promoting a blameless culture in accordance with SRE principles.

Responsibilities:
* Mentor the team in the seamless facilitation & conduct of root cause analysis (RCA) activities from end to end
* Lead the facilitation for high-severity incidents, liaising with top/senior management and keeping them updated
* Act as the prime focal point for presenting in the RCA Forum, Tech Risk Forum and other senior management meetings to report updates on retrospective findings & action plans
* Absorb new technology rapidly & apply it effectively
* Communicate well with technical & non-technical colleagues
* Work to a high standard within agreed timescales
* Undertake any other reasonable tasks or duties requested by the supervisor or a member of the senior management team
* Manage resources to ensure problem management activities are carried out effectively and efficiently
* Provide platforms and channels that keep stakeholders updated on the results of retrospectives and RCA activities
* Demonstrate authority in problem management calls
* Serve as point of contact for assigned incidents of higher severity (from incident retrospective calls all the way up to Management Report (MR) documentation and publishing)
* Take accountability for initiatives on SRE enhancement activities that result from retrospectives
* Collaborate with Engineering Teams within SRE and with LOBs on enabling activities as part of the preventive measures

Requirements:
* Minimum 15 years of process improvement/root cause analysis (RCA) exposure & involvement leading discussions as a problem manager or incident commander, preferably in the Technology & Operations space
* Experience with JIRA, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3 and cloud computing
* Good exposure to logging & monitoring tools like Dynatrace, Prometheus, Grafana, ELG/ELK
* In-depth understanding of Incident & Problem Management functions & activities (i.e. hardware- & software-related incident & problem management)
* Work with stakeholders & the command centre in troubleshooting, escalating & solutioning critical site incidents
* Identify recurring system/application issues & work with the cloud team, infra teams, product development, vendors & other stakeholders in investigating & resolving the cause
* Maintain accurate documentation of incidents, including impact details, timelines and steps taken for mitigation/resolution
* Strong verbal & written communication skills, particularly effective documentation skills
* Minimum 10 years of software development, technical support or operations experience
* Basic knowledge of Linux, AIX, Solaris and Windows
* Exposure to enterprise databases, e.g. Oracle, SQL Server, MariaDB, MongoDB & Sybase
* Knowledge of systems, multi-tier application & network troubleshooting
* Essential knowledge & awareness of public/private/hybrid cloud solutions