Salla is seeking a highly skilled and motivated Senior Site Reliability Engineer (SRE) to join our dynamic team. As a Senior SRE, you will be at the forefront of ensuring the reliability, performance, and scalability of our platform. You will lead initiatives, manage incidents, and guide engineering teams in building resilient systems. If you are passionate about platform reliability and thrive in a fast-paced environment, we encourage you to apply.
Improving Platform Reliability at Salla
In this role, you will play a crucial part in maintaining and improving platform reliability. This includes:
- Leading high-severity incident response and driving post-incident reviews.
- Troubleshooting complex issues across applications, infrastructure, and networks.
- Reducing mean time to recovery (MTTR) through better monitoring, alerting, and diagnostic tooling.
- Participating in the on-call rotation supporting production systems.
Performance & Scalability
You will identify and resolve performance bottlenecks and scaling challenges. This includes conducting load testing and capacity planning for high-traffic scenarios.
Infrastructure & Operations
Enhance cloud-native infrastructure, deployment processes, and automation. Improve resilience, fault tolerance, and recovery mechanisms across systems. Security is part of this work, so you are expected to stay current with resources from OWASP and similar sources.
Observability
Build and refine dashboards, alerts, metrics, logs, and traces. Define SLIs/SLOs and improve visibility into system behavior.
Tooling & Automation
Develop tools that reduce operational toil and increase reliability. Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows.
Collaboration
Work closely with engineering teams to ensure services are robust and production-ready. Mentor engineers on reliability, debugging, and operational best practices. Internal documentation and standardization support this work, and resources such as those published by Atlassian can provide insight.
Bonus Skills:
- Background in large-scale, high-traffic systems.
- Experience with fault-tolerant design, DR, and HA patterns.
- Familiarity with SLOs, SLIs, and error budgets.
Candidates located in time zones GMT+0 to GMT+6 are preferred. We also use tools such as Grafana to improve the development process.

