Position Overview:
As a Reliability Software Engineer in the Risk team, you will play a critical role in ensuring the performance, stability and availability of the Risk software systems, as well as their day-to-day operations.
Squarepoint's Risk platform is responsible for position management, profit/loss computation, inventory/locate management and internal order routing.
These critical systems need to be performant, resilient, and capable of timely processing of high volumes of trading data.
As such, the team requires strong software development capability, along with strong analytical skills.
You will primarily be building firm-wide platforms focused on extending Squarepoint's observability, preventing functional and performance regressions, and automating operational flows.
You will also make use of these platforms by implementing domain-specific logic on top of them, tailored to the requirements of the relevant sub-teams of Risk.
Here are some examples of our projects:
Observability: Our health check platform is designed to make the implementation of health checks as easy as possible, for any team at Squarepoint.
It supports generic health checks that can be set up through configuration only, as well as a plug-and-play architecture allowing fully custom health checks to be integrated and run by the platform.
Preventing functional/performance regressions: We are building a platform that will facilitate and automate benchmarking by abstracting away the scheduling of jobs, the hardware resourcing, the metric collection, the reporting of results, and the integration with GitLab.
Automation: We are building a self-serve automation platform that will allow users to request changes to our system configuration through a Jira portal.
Once the necessary approvals have been gathered, the platform automatically schedules a job to apply the requested changes.
Operations are important to ensure business continuity. As such, our responsibilities also include:
Level-2 support: In order to ensure business uptime, every member of the team contributes to a daily support rota.
During business hours, people on duty will prioritise responding to incidents over their project work.
On average, people are on duty one day per week.
Incident management: Root cause analyses are performed to understand the source of incidents and to raise appropriate remedial actions.
Day-to-day operations: Until they are automated, the team is responsible for tweaking our system configurations to address user requests and for correcting historical data in our databases.
Required Qualifications:
Education: Bachelor’s degree in Computer Science or a related subject
Experience: 4+ years of proven experience in Software Engineering, Software Reliability, or a similar role, with hands-on experience in software development and providing L2 support
Experience of developing in Python, and familiarity with version control systems such as Git
Experience working in a Linux environment
Problem-Solving Skills: Strong analytical and problem-solving skills with a keen eye for detail and a proactive approach to resolving issues
Communication: Excellent communication and collaboration skills to work effectively with cross-functional teams
Adaptability: Ability to work in a fast-paced and dynamic environment, adapting to changing priorities and requirements
Automation and Tooling: Experience developing automation tools and implementing configuration management
Nice to have:
Experience with Kafka or AMPS
Experience with Kubernetes or Slurm
Experience developing with PostgreSQL, ClickHouse or KDB/q