Job description
Team introduction: Build Reliability at Global Scale
Every time a short video is posted or viewed on TikTok, our team is working behind the scenes to make sure it happens instantly and reliably.
The Short Video Reliability team blends deep systems expertise with large-scale architecture design to keep TikTok running smoothly for billions of users.
We design for the unexpected.
Whether it’s a viral trend flooding the platform, a major global event, a cross-region migration, or disaster recovery, our systems are built to adapt and thrive.
We’re now looking for experienced engineers and architects to join our Singapore team.
In this role, you’ll design, build, and scale the core reliability infrastructure that underpins TikTok’s short video ecosystem.
Your work will directly shape the performance, resilience, and evolution of one of the most-used platforms in the world.
Responsibilities:
- Architect and build self-healing systems that adapt to infrastructure changes, migrations, and global-scale challenges
- Design smart traffic and load management to keep performance steady during viral spikes, large events, and global campaigns
- Develop monitoring, alerting, and automation that spots and fixes issues before they affect users
- Lead the creation of reliability frameworks for topology mapping, capacity planning, automated recovery, and disaster readiness
- Continuously refine system architecture for better performance, fault tolerance, and maintainability
- Apply chaos engineering, fault injection, and failure simulations to stress-test our systems
- Use A/B testing to measure the real-world impact of your improvements
- Mentor engineers and help set the team’s technical direction
Minimum Qualifications:
- 5+ years in backend, infrastructure, or reliability engineering
- Strong coding skills in Python, Go, Java, C++, or similar
- Solid grasp of distributed systems, networking, and fault-tolerant design
- Experience with Linux/Unix and large-scale infrastructure (cloud or on-prem)
- Proven track record delivering high-availability systems in production
- Strong debugging, analysis, and problem-solving skills
- Strong communication and writing skills
Preferred Qualifications:
- Experience with video platforms, streaming, or CDN optimization
- Background in highly reliable production systems
- Knowledge of service mesh, edge routing, or traffic shaping at scale
- Hands-on experience with chaos engineering and incident response
- Strong system design and technical leadership skills
- Excellent communication and ability to work across global teams