
Site Reliability Production Engineer (AI/ML Ops)
Full Time
Fidelity TalentSource is your destination for discovering your next temporary role at Fidelity Investments. We are currently sourcing for a Site Reliability Production Engineer (AI/ML Ops) to work in Fidelity’s Enterprise Infrastructure Group in Smithfield, RI; Westlake, TX; or Merrimack, NH.
We are looking for a strong Site Reliability Production Engineer to join our Production Support Engineering team. In this role, you will help enhance service reliability, reduce operational toil, and drive continuous improvement through automation, observability, and emerging AI/LLM-driven capabilities.
Placement in the range will vary based on job responsibilities and scope, geographic location, candidate’s relevant experience, and other factors.
Team
The Business Unit aligned functions include Infrastructure Support, Cloud Enablement, Platform Engineering, Environment Management, Incident Support, and Deployments.
The Expertise & Skills You Bring
- 5+ years in SRE, DevOps, or production engineering, supporting distributed systems in fast-paced environments.
- Strong scripting experience with PowerShell and Python; practical knowledge of SQL, XML, and data integration.
- Hands-on experience with observability tooling (e.g., Dynatrace, Nexthink), monitoring, logging, and metrics systems.
- Knowledge of ITSM and ITIL frameworks and experience with ServiceNow or similar platforms.
- Strong understanding of DevOps/SRE principles, including SLIs, SLOs, error budgets, automation, and resiliency patterns.
- Proven experience with CI/CD pipelines, cloud platforms (AWS/Azure), and modern SaaS solutions.
- Technical depth in Windows and Linux systems and enterprise end-user computing environments.
- Ability to translate analytical insights into actions, automations, and operational improvements.
- Familiarity with AI/LLM technologies and how to use them to improve workflows, automation, observability, or troubleshooting (preferred).
- Strong problem-solving, communication, and cross-team collaboration skills.
- Bachelor’s degree in Computer Science or a related field, or equivalent experience (preferred).
- Proven experience supporting financial systems or working in financial services is a plus.
What You’ll Do
- Analyze system and application metrics to improve performance, reliability, and fault detection.
- Partner closely with engineering teams to design, build, deploy, and support resilient services.
- Contribute to system design reviews, platform management, and capacity planning.
- Build sustainable automation to reduce manual effort and operational overhead.
- Develop and refine SLI/SLO/SLA frameworks to balance speed, reliability, and customer experience.
- Improve observability across environments using modern tools and practices.
- Identify, prototype, and implement automation using scripting, infrastructure tooling, and AI/LLM-based solutions.
- Diagnose and tackle complex issues across distributed systems and end-user computing environments.
- Evaluate new technologies, patterns, and tools to drive continuous improvement.
- Create and deliver high-quality technical content, remote actions, and workflows to enable self-service and operational efficiency.
What We’re Looking For in You
- A proactive approach focused on reliability, performance, and continuous improvement.
- Curiosity and the ability to quickly learn complex systems and processes.
- Passion for automation, reducing toil, and improving operational excellence.
- Excitement for working in a fast-paced, collaborative, globally distributed environment.
