Site Reliability Engineer

Location: Westlake, TX, USA
Remote: Hybrid
Published: 1/20/2026
Full Time
Fidelity TalentSource is your destination for discovering your next temporary role at Fidelity Investments. We are currently sourcing for a Site Reliability Engineer to work in Fidelity’s Enterprise Infrastructure Group in Westlake, TX or Merrimack, NH.

The Role

You will have the opportunity to lead all aspects of production support (readiness, availability, and resiliency) for critical applications, batches, and infrastructure representing various business units, while being centrally aligned to the Production Services organization. The role offers many opportunities to broaden your knowledge across multiple dimensions of technology while retaining a key focus on cloud computing (AWS & Azure) and enterprise tools and solutions such as Jenkins, uDeploy, Docker, Kubernetes, Splunk, and Datadog.
This role helps provide a truly predictable customer experience. In times of market volatility and high volumes, there is an increased expectation of a consistent service level. At Fidelity, we strive to meet this expectation by building reliability into our ecosystem. This is achieved through defining and implementing practices in Resiliency Engineering, Automation, Observability & Chaos Testing, while also ingraining a proactive culture of reliability-first design. You will also solve stack-wide engineering issues related to hardware, software, network, applications, and cloud service providers.

Team

Production Services (EI&O-PS) is a centralized support services organization in Fidelity’s Enterprise Infrastructure & Operations group. EI&O-PS supports 3000+ applications across Fidelity business units, providing a roster-based on-call rotation (follow-the-sun model). Services provided by EI&O-PS include Platform, Application, and Batch Support via Incident management, Change management, Environment management, Cloud, UI, Middle Tier & Database Services, Mainframe operations, Release Services & Performance Engineering services.

The Expertise You Have

  • Bachelor’s degree or higher, or equivalent experience, in a technology-related field (e.g., Engineering, Computer Science) required; Master’s degree a plus
  • 5-8+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale.
  • Hands-on experience with Public Cloud environments, preferably AWS and Azure. Certifications a plus
  • Exposure to basic OS-level scripting languages such as Korn shell, Bash, or JScript
  • Experience with container orchestration, preferably with Kubernetes
  • Experience operating and implementing distributed & highly concurrent service-based

The Skills You Bring

  • Ability to solve application issues on Unix/Linux with J2EE, WebSphere, Tomcat and SQL
  • Familiarity with ITIL processes like Incident management, Change/Problem management
  • Balancing delivery with ad hoc workloads and re-evaluating priorities
  • Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
  • Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
  • Use Datadog, Catchpoint, Splunk & Grafana for Application Observability and monitoring of app & infrastructure
  • Experience with instrumentation, with systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale
  • Proven experience in maintaining the scalability and resiliency of complex environments
  • Proven experience in implementing advanced observability practices and techniques at scale.
  • Provide enterprise Cloud and Platform Engineering support for production environments and ability to participate in on-call rotation to provide solutions.
  • Experience in Cloud development (AWS and Azure) and migration skills; Experience with building and operating highly resilient platforms in public cloud environments
  • Ability to triage, complete root cause analysis, and be decisive under pressure
  • Experience managing and interpreting large datasets using query languages and visualization tools
  • Proficient communication skills with an ability to reach both technical and non-technical audiences
  • Proven experience performing chaos testing to build confidence in the system's capability to withstand turbulent conditions in production
  • Strong understanding of API testing tools (SoapUI, Postman)
  • Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
  • Handle a huge fleet of on-prem servers (including security & patching oversight)
  • Handle hundreds of SSL certificates for all applications in scope
  • Use Ansible & Python to automate day-to-day activities; web development with Django and JavaScript

Dynamic Working

Fidelity’s hybrid working model blends the best of both onsite and offsite work experiences. Working onsite is important for our business strategy and our culture. We also value the benefits that working offsite offers associates. Most hybrid roles require associates to work onsite all business days of every other week in a Fidelity office.