
fal

Sr Linux System Administrator

Posted 9 Days Ago
Remote
Hiring Remotely in USA
Senior level
Responsible for the lifecycle management of GPU servers, including provisioning, automation, security hardening, and performance tuning for AI workloads.
The summary above was generated by AI

You are an expert Linux systems operator who keeps fleets of servers healthy, secure, and performant at scale. At fal, you will be responsible for the bare-metal and OS-level foundation that our entire GPU cloud runs on. From provisioning and imaging thousands of GPU nodes to kernel tuning, storage management, and security hardening, you will ensure every machine in our fleet is production-ready and running at peak efficiency. You are deeply comfortable in a terminal, you think in terms of uptime and automation, and you take pride in infrastructure that just works.

Key Responsibilities
  • Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
  • Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes).
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage.
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation.
  • Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana.
  • Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts.
Requirements
  • 8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
  • Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning.
  • Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit).
  • Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling.
  • Excellent communication and a self-starter mindset: you take ownership and constantly seek improvement.
Nice to Have
  • Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing).
  • Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation.
  • Familiarity with fleet-scale observability: Prometheus federation, Thanos, or VictoriaMetrics for multi-cluster monitoring.
  • Contributions to open-source infrastructure tooling or Linux distributions.
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001).
What we offer at fal
  • Interesting and challenging work
  • Competitive salary and equity
  • A lot of learning and growth opportunities
  • We offer visa sponsorship and will help you relocate to San Francisco.
  • Health, dental, and vision insurance (US)
  • Regular team events and offsites
Location
  • Remote

Top Skills

Ansible
AppArmor
Bash
CUDA
GPU
Grafana
Kubernetes
Linux
NFS
NVMe
Prometheus
Python
RAID
SELinux
Terraform


