Site Reliability Engineer

Micro1
Job Type: Contractor
Location: Remote (Global)
Salary Range: $40 - $70/hr
Key Skills: Python, Docker, Kubernetes

Job Description

Join our customer's team as a Site Reliability Engineer for a specialized, high-intensity project centered on training and optimizing AI models within cutting-edge containerized infrastructures. This terminal-intensive engagement demands a systems-first approach, real-time troubleshooting, and dynamic process recovery. For standout performers, it offers significant potential for extension or transition into advanced phases.

Key Responsibilities:

• Lead the deployment, monitoring, and recovery of complex, containerized AI training environments using advanced terminal techniques.

• Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes.

• Orchestrate resilient system builds and infrastructure management, ensuring stability and optimal resource utilization.

• Collaborate closely with engineering teams to refine CI/CD pipelines and automate routine operational tasks.

• Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes.

• Conduct rapid mid-execution replanning during error states and unforeseen runtime issues.

• Document best practices and emergent solutions, and contribute to knowledge transfer across the team.

Required Skills and Qualifications:

• Demonstrated expert proficiency with terminal-based problem solving and complex system administration.

• Mastery of dynamic infrastructure recovery and long-running operational process management.

• Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration.

• Strong Python skills, with the ability to script, automate, and debug real-world production systems.

• Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, and C/C++.

• Experience with build systems, package managers, databases, version control, and cryptography tools.

• Adept at troubleshooting, documenting, and replanning in high-velocity technical environments.

Preferred Qualifications:

• Background in machine learning operations or AI infrastructure.

• Familiarity with ML frameworks and distributed computing.

• Experience supporting multi-phase, high-intensity engineering projects.
