Principal Engineer-Cloud
Job Description
When you join Verizon
Verizon is one of the world’s leading providers of technology and communications services, transforming the way we connect around the world. We’re a human network that reaches across the globe and works behind the scenes. We anticipate, lead, and believe that listening is where learning begins. In crisis and in celebration, we come together—lifting up our communities and striving to make an impact to move the world forward. If you’re fueled by purpose, and powered by persistence, explore a career with us. Here, you’ll discover the rigor it takes to make a difference and the fulfillment that comes with living the #NetworkLife.
What you’ll be doing...
We are seeking a highly motivated and experienced Principal SRE to join our growing team. As a Principal SRE, you will play a critical role in designing, building, and maintaining highly scalable and resilient systems. You will be a technical leader, driving best practices and mentoring other engineers to adopt SRE practices in their respective platform areas, such as GCP, Hadoop, Teradata, BI tools, ML, and on-prem platforms.
Engineering & Operations:
Designing, building, and maintaining highly scalable and reliable infrastructure for our applications and platforms.
Automating infrastructure provisioning, configuration management, and application deployments by working with respective platform members.
Implementing and improving monitoring, alerting, and observability solutions to ensure system health and performance of all platforms.
Participating in the on-call rotation and troubleshooting production issues, identifying root causes and implementing solutions.
Proactively identifying potential performance bottlenecks and developing solutions to mitigate them.
Leveraging GenAI solutions to solve classification problems, and developing a Slackbot integrated with an LLM to optimize Level 1 support.
Building an end-to-end monitoring and observability solution that gives admins first-hand information, reducing mean time to detect and mean time to resolve.
Identifying and adopting industry best practices to improve availability and reliability.
Performing chaos engineering tests to improve process awareness and reduce mean time to repair for P1 tickets.
Developing AIOps solutions to predict issues and mitigate them using an auto-recovery system.
Developing a knowledge base (KB) from tickets using an LLM, to be fed to the Slackbot.
Building an AI workmate for root cause identification, effective summarization of logs, and next-best-action recommendations based on the KB.
Leadership & Collaboration:
Working closely with development & platform teams to ensure that applications are designed and implemented with operational excellence in mind.
Advocating for and championing SRE best practices within the organization.
Mentoring and guiding junior SRE engineers, fostering their growth and development.
Contributing to the development and maintenance of SRE documentation, runbooks, and internal tooling.
Innovation & Continuous Improvement:
Researching and evaluating new technologies and tools to improve the efficiency, scalability, and resilience of our systems.
Leading and participating in post-incident reviews, identifying areas for improvement and driving remediation efforts.
Contributing to the continuous improvement of our SRE processes and practices.
What we’re looking for...
You’ll need to have:
Bachelor’s degree or four or more years of work experience.
Six or more years of relevant work experience.
Experience in a DevOps or SRE role, with deep understanding of Linux/Unix systems administration and networking fundamentals.
Strong experience with automation and configuration management tools (e.g., Ansible, Terraform, Jenkins, GitLab).
Proficiency in at least one programming language (e.g., Python, Go, Java).
Experience with containerization technologies (e.g., Docker, Kubernetes).
Experience with cloud computing platforms (e.g., AWS, GCP, Azure).
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack).
Experience with chaos engineering principles and practices.
Even better if you have one or more of the following:
Master’s degree.
Excellent problem-solving and troubleshooting skills.
Strong communication and collaboration skills, with the ability to work effectively in a team environment.
Where you’ll be working
In this hybrid role, you'll have a defined work location that includes work from home and assigned office days set by your manager.
Scheduled Weekly Hours
40
Diversity and Inclusion
We’re proud to be an equal opportunity employer. At Verizon, we know that diversity makes us stronger. We are committed to a collaborative, inclusive environment that encourages authenticity and fosters a sense of belonging. We strive for everyone to feel valued, connected, and empowered to reach their potential and contribute their best. Check out our diversity and inclusion page to learn more.
Working at Verizon
Currently Hiring Countries
Belgium
Denmark
Germany
Hong Kong
India
Ireland
Italy
Japan
Mexico
Philippines
Singapore
Sweden
United Kingdom
United States