Responsibilities
• Build and maintain runbooks for telemetry onboarding, parsers, and dashboards; contribute improvements via code reviews and documentation.
• Run short enablement sessions so product squads can self-serve standardized dashboards and apply tagging/SLO standards.
• Implement and operate log/metric/trace pipelines (agents, processors, parsing, routing, archive) targeting p95 ingest latency ≤ 60s and drop rate ≤ 0.1%.
• Execute phased Splunk → Datadog migrations with query/dashboard/monitor parity and validation checks.
• Apply and enforce tag standards (service, env, tier, team, owner_email, cost_center) via IaC/CI.
• Improve multi-cloud/on-prem discovery to >98% asset coverage; reconcile CIs/relationships; track and reduce CMDB data deltas.
• Align telemetry tags with the service portfolio/catalog; maintain service maps linking infrastructure to business services.
• Define and monitor CI data-quality KPIs (staleness, duplicates, orphaned CIs) and drive remediation with owning squads.
• Partner with SRE to define SLIs/SLOs, burn-rate alerts, and golden dashboards (≤15-minute freshness) for critical services.
• Provide post-incident analytics and feed learnings into instrumentation and configuration hygiene.
• Deliver infrastructure-as-code (Terraform/Ansible) for agents, pipelines, monitors, and dashboards.
• Build API/ETL integrations from observability/CMDB into BI platforms (e.g., Power BI/Fabric) for executive reporting.
• Evaluate lightweight streaming/collector options (e.g., OpenTelemetry/Fluent/Tool X) to control cost and enable fan-out where justified.
Requirements
• Bachelor's/Master's in Computer Science/IT, or equivalent practical experience.
• 5-8 years across Observability/Platform/CMDB engineering, with production ownership at scale.
• Hands-on with Datadog (Logs, APM/RUM, monitors, facets/measures, APIs).
• Strong in multi-cloud (AWS/Azure/GCP) discovery/inventory and CI reconciliation patterns (tool-agnostic).
• Proficient in scripting (Python/PowerShell), parsing (JSON/grok/regex), API integration, and IaC (Terraform/Ansible).
• Familiar with SRE practices (SLIs/SLOs, error budgets), containers/Kubernetes, and secure RBAC for high-privilege systems.
• Demonstrated ability to build and guide high-performing, cross-functional teams through clear direction and structured planning.
• Strong interpersonal skills to collaborate with a diverse set of stakeholders and drive consensus on complex technical decisions.
• Organized, detail-oriented approach focused on delivering consistent, measurable results.
Licence no: 12C6060