
Site Reliability Engineering Specialist
Job Description
Why this job matters
The Site Reliability Engineering Specialist plays a critical role in safeguarding BT’s ability to deliver exceptional service performance, reliability, and availability across its digital platforms. In today’s fast-paced, cloud-driven AI environment, customers expect seamless experiences, and this position ensures those expectations are met by driving scalable, robust, resilient, and cost-effective solutions. By enabling cross-team collaboration and implementing automation, monitoring, and resilience strategies, the specialist not only minimizes downtime and operational risk but also accelerates innovation and system evolution. This role is pivotal in maintaining BT’s reputation for reliability while empowering the business to adapt quickly to emerging technologies and deliver consistent value to customers worldwide.
What you’ll be doing
- Implement and operate CI/CD and SDLC automation using cloud services, infrastructure as code (IaC), GitOps patterns and containers, following established engineering and security practices.
- Contribute to test planning and execution with delivery and QA to meet quality and time goals; help define practical coverage and schedules.
- Automate to reduce toil and MTTR (mean time to resolve): scripts, runbooks and guardrailed tasks that remove repetitive work and improve recovery speed.
- Participate in Tier 2/3 incident response: diagnose, mitigate and recover; capture learnings and drive preventive follow-ups.
- Implement and tune observability (metrics, logs, traces, dashboards, alerts) to improve signal quality and reduce noise.
- Apply SRE fundamentals with teams: define and maintain SLIs and SLOs with error budgets; propose data-driven reliability improvements.
- Harden release reliability: keep pipelines stable, safe and reliable; identify configuration drift and remediate quickly.
- Support on-call readiness: runbook stewardship, change/rollback safety, participation in DR/failover exercises and game days.
- Identify reliability risks across services and environments; raise issues early and support mitigations and control adoption.
- Collaborate with developers, platform, operations and partners; document clearly and support peer learning.
Skills required for the job
Core SRE & Engineering Skills
- Strong expertise in end-to-end observability and monitoring platforms (e.g., Dynatrace) to understand system health, performance trends, and reliability of business-critical applications.
- Proficiency in one or more programming languages (e.g., Java, Python) with the ability to write production-quality automation and tooling.
- Hands-on experience with cloud platforms (AWS, Azure, or GCP) and operating distributed systems in cloud and hybrid environments.
- Firm grasp of software architecture, design patterns, and microservices-based systems.
- Practical experience with CI/CD pipelines, DevOps practices, and continuous testing to support fast, reliable delivery.
- Strong capability in infrastructure as code and pipeline management, enabling consistent, scalable, and secure deployments.
Reliability, Operations & Continuous Improvement
- Proven ability to apply Site Reliability Engineering principles, including automation, toil reduction, incident learning, and reliability-driven system improvements.
- Experience analysing complex, distributed systems to identify performance, resilience, and stability issues.
- Ability to support 24x7 operational environments, working effectively with stakeholders, backend teams and managed service partners during priority incidents.
- Strong analytical, reporting, and presentation skills, enabling clear communication of operational insights, risks, and improvement opportunities.
- Demonstrated mindset for business process improvement, using data and automation to drive efficiency and reliability gains.
- Adaptability to evolving industry trends and emerging technologies, with a continuous learning and growth mindset.
AI-Driven Observability & AIOps
- Understanding of AIOps fundamentals, including cross-domain telemetry ingestion, event correlation, topology and context modelling, and remediation augmentation.
- Experience with AI-assisted and agentic observability, using intelligent techniques to detect anomalies, correlate signals, and accelerate incident resolution.
- Capability in AI-driven alerting and noise reduction, designing contextual, business-impact-aware alerts and leveraging machine learning to prioritise alerts and reduce alert fatigue.
Nice to have
- AI‑assisted incident workflows: LLM‑generated summaries/timelines or suggestion prompts in collaboration tools; context‑aware runbooks under human‑in‑the‑loop controls.
- AIOps capabilities: event correlation, dynamic topology/context modelling, impact‑aware alerting and alert noise reduction features in modern observability platforms.
- Chaos engineering: exposure to controlled fault injection with tools like Gremlin/Litmus/Chaos Mesh; translating findings into tangible reliability improvements.
- MLOps: model drift/freshness concepts and high‑level SLIs/SLOs for ML services; basic approaches to monitoring model health signals.
Our leadership standards
Looking in:
Leading inclusively and safely
I inspire and build trust through self-awareness, honesty and integrity.
Owning outcomes
I take the right decisions that benefit the broader organisation.
Looking out:
Delivering for the customer
I execute brilliantly on clear priorities that add value to our customers and the wider business.
Commercially savvy
I demonstrate strong commercial focus, bringing an external perspective to decision-making.
Looking to the future:
Growth mindset
I experiment and identify opportunities for growth for both myself and the organisation.
Building for the future
I build diverse future-ready teams where all individuals can be at their best.
About us
BT Group was the world’s first telco and our heritage in the sector is unrivalled. As home to several of the UK’s most recognised and cherished brands – BT, EE, Openreach and Plusnet – we have always played a critical role in creating the future, and we have reached an inflection point in the transformation of our business.
Over the next two years, we will complete the UK’s largest and most successful digital infrastructure project – connecting more than 25 million premises to full fibre broadband. Together with our heavy investment in 5G, we play a central role in revolutionising how people connect with each other.
While we are through the most capital-intensive phase of our fibre investment, meaning we can reward our shareholders for their commitment and patience, we are absolutely focused on how we organise ourselves in the best way to serve our customers in the years to come. This includes radical simplification of systems, structures, and processes on a huge scale. Together with our application of AI and technology, we are on a path to creating the UK’s best telco, reimagining the customer experience and relationship with one of this country’s biggest infrastructure companies.
Change on the scale we will all experience in the coming years is unprecedented. BT Group is committed to being the driving force behind improving connectivity for millions and there has never been a more exciting time to join a company and leadership team with the skills, experience, creativity, and passion to take this company into a new era.
A FEW POINTS TO NOTE:
Although these roles are listed as full-time, if you’re in a job share partnership, work reduced hours, or work flexibly in any other way, please still get in touch.
We will also offer reasonable adjustments for the selection process if required, so please do not hesitate to inform us.
DON'T MEET EVERY SINGLE REQUIREMENT?
Studies have shown that women and people who are disabled, LGBTQ+, neurodiverse or from ethnic minority backgrounds are less likely to apply for jobs unless they meet every single qualification and criterion. We're committed to building a diverse, inclusive, and authentic workplace where everyone can be their best, so if you're excited about this role but your past experience doesn't align perfectly with every requirement in the job description, please apply anyway - you may just be the right candidate for this or other roles in our wider team.
Hiring in countries
Brazil
Canada
Hong Kong
Hungary
India
Poland
Singapore
South Korea
Spain
United Kingdom
