LightEdge Solutions

Site Reliability Engineer

Reposted 13 Days Ago

Remote or Hybrid

15 Locations

Senior level

As a Site Reliability Engineer (SRE), you will be an integral part of the team at LightEdge Solutions. This position reports to the DevOps Manager and is responsible for the reliable operation of the organization’s systems and services. You will play a key role in defining our monitoring strategy and vision across multiple products, and you will work with a variety of teams to improve the accuracy of our monitoring systems.

Responsibilities

Monitoring and Observability: Design and implement monitoring solutions to track the performance, availability, and health of various systems and services. Establish robust monitoring frameworks, set up alerts, and analyze system metrics to identify and resolve issues proactively.
Establish and align metrics, including SLAs, SLOs, and SLIs, to tie system performance closely to business objectives, ensuring that site reliability engineering efforts support overall business goals and customer satisfaction.
Use AIOps techniques to automate incident management and response. Develop and maintain automated incident response systems that can detect and mitigate issues automatically. This includes automated incident triaging, remediation, and escalation workflows to minimize manual intervention and improve response times.
Leverage the IT service management platform’s capabilities to integrate monitoring into incident management, change management, and other operational processes, enhancing the efficiency and effectiveness of site reliability engineering practices.
Work closely with IT functional owners and SMEs.
Perform complex systems design, proof-of-concept, implementation, and integration functions.
Develop detailed designs, and execute and troubleshoot strategic solutions in support of effective monitoring, alerting, escalation, automation, reporting, and event correlation.

Education and Experience

5 years of hands-on experience with enterprise monitoring solutions.
Knowledge of network switches, server hardware, storage, and virtualization technologies.
Understanding of VMware infrastructure.
Experience working with a variety of monitoring systems such as Zabbix, vRealize Operations Manager, Nagios, and ScienceLogic.
Experience and proficiency in integrating with ServiceNow or similar IT service management platforms.
Experience with managing automations within a monitoring environment.
Ability to provide guidance on the design, maintenance, and improvement of enterprise-level monitoring solutions.
Excellent verbal and written communication skills, with the ability to present complex ideas and designs to both technical and non-technical stakeholders.
Experience with the design, implementation, and support of monitoring tools in a complex, multi-platform environment.
Strong understanding of monitoring requirements for storage, network, and compute servers.

Top Skills

AIOps

Nagios

ScienceLogic

ServiceNow

VMware

vRealize Operations Manager

Zabbix
