Site Reliability Engineering (SRQ)
Service description:
We are looking for a Site Reliability Engineering service for our Engineering chapter team. The goal is to ensure the reliability, scalability, monitoring, and performance of our on-premises services in the ERA product organization. Responsibilities include designing and managing our infrastructure and implementing best practices. The role involves working within cross-functional teams to improve systems and processes and to ensure uptime and efficiency.
- Design and maintain monitoring infrastructure
- Create custom dashboards, alerts, and visualization solutions
- Implement distributed tracing and log aggregation systems
- Establish monitoring best practices and SLI/SLO frameworks
- Maintain security compliance for on-premises monitoring tools
- Automate deployment and configuration management
- Collaborate with development teams on application instrumentation
- Participate in on-duty rotations
Requirements:
- Core technologies: advanced Grafana, Prometheus (PromQL), OpenTelemetry, Elasticsearch
- Programming: Python, Bash, or Go for automation
- Experience: 3+ years in monitoring/observability, 2+ years of Grafana/Prometheus in production, strong Linux system administration experience, and a proven track record with on-premises infrastructure solutions
- Security: enterprise security practices and compliance requirements
- Ability to balance technical trade-offs with business needs and prioritize effectively
- Participation in on-duty rotations (24/7 incident support)
- Reduced MTTD/MTTR through effective monitoring
- Comprehensive observability across all systems
- Automated monitoring, deployment, and management
- Security-compliant monitoring practices
Additional information:
Location: Brussels (Empereur)
Onsite presence: By default, a physical presence on site is required for 2 days per week.
Work regime: full-time
Question: Would the consultant be willing to participate in on-duty rotations (24/7 incident support)?