BT Group • Building No 14 Sector 24 & 25A, Gurugram, India

Site Reliability Engineering Specialist

Job Description

Why this job matters

The Site Reliability Engineering Specialist plays a critical role in safeguarding BT’s ability to deliver exceptional service performance, reliability, and availability across its digital platforms. In today’s fast-paced, cloud-driven AI environment, customers expect seamless experiences, and this position ensures those expectations are met by driving scalable, fault-tolerant, and cost-effective solutions. By enabling cross-team collaboration and implementing automation, monitoring, and resilience strategies, the specialist not only minimizes downtime and operational risk but also accelerates innovation and system evolution. This role is pivotal in maintaining BT’s reputation for reliability while empowering the business to adapt quickly to emerging technologies and deliver consistent value to customers worldwide.

What you’ll be doing

1. Implements new software development life cycle automation tools, frameworks, and code pipelines (continuous integration/continuous delivery pipelines) whilst applying best practices with a focus on the re-use of application code; demonstrates consistent software delivery practices and produces continuous integration/continuous delivery platform solutions using Amazon Web Services cloud, infrastructure as code (IaC), GitOps, and container technologies
2. Coordinates a diverse team and creates the initial test schedule to deliver all aspects of testing to time, budget and quality targets, producing outlines of solutions and defining the depth of testing required
3. Implements automation technologies to ensure repeatability, eliminate toil, and reduce mean time to detection, resolution and repair of services
4. Proactively identifies and manages risk through regular assessment and diligent execution of controls and mitigations, raising any concerns as they arise
5. Leads scale testing to measure, tune and optimise system performance
6. Executes metric/monitoring analysis that creates stability, security, and performance improvements
7. Designs, analyses, develops and troubleshoots highly distributed large-scale production systems spanning on-prem and cloud-based hosting
8. Executes approaches that scale systems sustainably through mechanisms like automation and evolves systems by pushing for changes that improve reliability and velocity
9. Writes and delivers infrastructure as code software to improve the availability, scalability, latency, and efficiency of services
10. Implements robust monitoring and alerting systems and performs root cause analysis and post-mortems with an eye towards future prevention
11. Inspects queue and support processing to ensure early warning of support issues
12. Executes retrospective and preventive actions after each high severity production incident
13. Analyses complex systems from a reliability and resilience perspective and identifies sources of instability in distributed systems
14. Champions, continuously develops and shares with the team knowledge of emerging trends and changes in site reliability engineering best practices and industry standards
15. Mentors other site reliability engineers, helping to improve the team’s abilities by acting as a technical resource
16. Uses the network of site reliability engineers, removing BT’s organisational boundaries to deliver improvements that are in synergy with initiatives being driven by other SREs

Skills and Experience

Skills required for the job:

• A degree in IT, Maths or Science
• A deep understanding of full stack monitoring solutions such as Dynatrace to track the current end-to-end performance and trends of owned CDO applications
• Strong proficiency in one or more programming languages (e.g. Java, Python).
• Experience with cloud platforms (AWS, Azure, or GCP).
• Solid understanding of software architecture, design patterns, and microservices.
• Familiarity with CI/CD tools and DevOps practices.
• High-quality presentation and reporting skills to collate output from Managed Service Partners.
• Resilience to ensure support teams are engaged 24x7x365 to support priority incident resolution.
• Ability to adapt to latest industry trends
• CI/CD/CT Pipeline management
• Micro-Service functionality
• Business Process Improvement
• Growth mindset

AI-driven Observability & AIOps
• AIOps fundamentals (cross-domain telemetry ingestion, event correlation, topology/context building, and remediation augmentation).
• Agentic/autonomous observability skills (using intelligent agents to detect anomalies, correlate signals, and trigger guarded remediations to cut MTTR).
• AI-assisted alerting & noise reduction (designing contextual, business-impact-aware alerts; prioritization via ML).

Experience you would be expected to have:

Incident Response with AI
• LLM-assisted incident workflows (AI summaries, timeline drafting, suggested fixes, and post-mortems integrated with Slack/Teams).
• Runbook automation with AI (building AI-assisted, context-aware runbooks and approval gates for high-risk actions).
• Generative AI for coordination & RCA (using LLMs to accelerate investigation and communications; understanding current accuracy limits and human-in-the-loop needs).

ML Ops for Reliability
• SRE principles applied to ML systems (SLOs/SLIs/error budgets for ML services; capacity planning and model freshness).
• Production ML observability (data/concept/label drift detection, automated retraining triggers, explainability traces).
• Telemetry & visualization for model health (instrumentation with Prometheus/Grafana for drift and degradation).

AI-enhanced Automation & CI/CD
• AI-augmented IaC and pipelines (LLM-generated Terraform/Helm/Ansible, policy enforcement, drift detection in infra).
• AIOps in delivery (change impact hints, automated triage, and GitOps-based auto-remediation).
• AI pair programming ergonomics (using Copilot responsibly; measuring impact on quality/velocity and guardrails).

AI + Chaos Engineering (Resilience)
• Designing AI-guided chaos experiments (intelligent fault selection, anomaly detection during experiments, learning from outcomes).
• Reinforcement-learning-driven fault injection (automated scenario generation to expose latent weaknesses and improve recovery times).
• Operationalizing lessons from chaos + ML (predictive failure analysis and proactive controls).

Platform & Tool Literacy (AI-ready)
• Hands-on with AIOps/observability platforms (event correlation and unified incident views at scale).
• Familiarity with AI-enabled incident tooling (e.g., incident.io/Rootly/PagerDuty/Datadog for AI triage and summaries).

Governance, Safety & Measurement
• Human-in-the-loop guardrails (approval policies, rollback safety, and compliance in autonomous actions).
• Trustworthy AI practices (explainability, data/model/process trust; aligning metrics with business outcomes).
• Outcome measurement for AI adoption (MTTR, alert noise, developer experience/velocity with AI tools).

Our leadership standards

Looking in:
Leading inclusively and safely
I inspire and build trust through self-awareness, honesty and integrity.
Owning outcomes
I take the right decisions that benefit the broader organisation.

Looking out:
Delivering for the customer
I execute brilliantly on clear priorities that add value to our customers and the wider business.
Commercially savvy
I demonstrate strong commercial focus, bringing an external perspective to decision-making.

Looking to the future:
Growth mindset
I experiment and identify opportunities for growth for both myself and the organisation.
Building for the future
I build diverse future-ready teams where all individuals can be at their best.

About us

BT Group was the world’s first telco and our heritage in the sector is unrivalled. As home to several of the UK’s most recognised and cherished brands – BT, EE, Openreach and Plusnet – we have always played a critical role in creating the future, and we have reached an inflection point in the transformation of our business.

Over the next two years, we will complete the UK’s largest and most successful digital infrastructure project – connecting more than 25 million premises to full fibre broadband. Together with our heavy investment in 5G, we play a central role in revolutionising how people connect with each other.

While we are through the most capital-intensive phase of our fibre investment, meaning we can reward our shareholders for their commitment and patience, we are absolutely focused on how we organise ourselves in the best way to serve our customers in the years to come. This includes radical simplification of systems, structures, and processes on a huge scale. Together with our application of AI and technology, we are on a path to creating the UK’s best telco, reimagining the customer experience and relationship with one of this country’s biggest infrastructure companies.

Change on the scale we will all experience in the coming years is unprecedented. BT Group is committed to being the driving force behind improving connectivity for millions and there has never been a more exciting time to join a company and leadership team with the skills, experience, creativity, and passion to take this company into a new era.

A FEW POINTS TO NOTE:

Although these roles are listed as full-time, if you’re in a job share partnership, work reduced hours, or work flexibly in any other way, please still get in touch.

We will also offer reasonable adjustments for the selection process if required, so please do not hesitate to inform us.

DON'T MEET EVERY SINGLE REQUIREMENT?

Studies have shown that women and people who are disabled, LGBTQ+, neurodiverse or from ethnic minority backgrounds are less likely to apply for jobs unless they meet every single qualification and criterion. We're committed to building a diverse, inclusive, and authentic workplace where everyone can be their best, so if you're excited about this role but your past experience doesn't align perfectly with every requirement in the Job Description, please apply anyway - you may just be the right candidate for this or other roles in our wider team.

Company benefits

25 (UK, increasing with service) / 21 (India) days annual leave + bank holidays

Adoption leave – 18 weeks full pay, 8 weeks half pay, 6 months statutory

Bank holiday swaps

Buy or sell annual leave – buy up to 5 days/year pro rata

Carer’s leave – Two weeks paid leave

Cinema discounts

Coaching

Compassionate leave

Complimentary Medical Services

Cycle to work scheme

Employee assistance programme

Employee discounts

Enhanced maternity leave – 18 weeks full pay, 8 weeks half pay, 6 months statutory

Enhanced paternity leave – 18 weeks full pay, 8 weeks half pay, 6 months statutory

Enhanced pension match/contribution

Enhanced sick pay – 3 months

Faith rooms

In house training

L&D budget – sponsored accreditation available for certain professions

Learning platform – internal and external learning content via Degreed

LinkedIn learning license – unlimited access

Lunch and learns

Mental health platform access – Silvercloud

Mentoring

Neo-natal leave

Open to job sharing

Open to part time work for some roles

Optional unpaid leave

Private GP service – 24/7 virtual GP access for UK colleagues

Referral bonus

Returnship

Salary sacrifice

Share options

Shared parental leave

Travel loan

Volunteer days – 3 volunteer days per year

Reservist leave

Fertility treatment leave

Pregnancy loss leave

Pregnancy support

On-site catering

On-site barista

On-site shower

Modern office

Collaboration spaces

Private booths

On-site wellness room

Working at BT Group

Company employees:

100,000 across BT Group (24,000 at BT Business)

Gender diversity (m:f):

74.3:25.7 (BT Group)

Hiring in countries

Brazil

Canada

Hong Kong

Hungary

India

Poland

Singapore

South Korea

Spain

United Kingdom

Awards & Accreditations

Family Friendly

Flexa awards 2025

Career Progression

Flexa awards 2025
