Senior AI Infrastructure & Platform Engineer
- Industry Other
- Category Production/Maintenance/Quality
- Location Kathmandu, Nepal
- Expiry date May 17, 2026
Job Description
Department: Engineering / Infrastructure / AI Systems
Location: Kathmandu, Nepal (Hybrid/On-site)
Experience Required: 4+ Years (DevOps / Platform Engineering / Cloud Infrastructure)
Job Summary
We are looking for a highly skilled Senior AI Infrastructure & Platform Engineer to design, deploy, scale, and maintain production-grade AI systems and cloud infrastructure.
This role is ideal for a DevOps or Platform Engineer who has strong experience managing microservices at scale and is passionate about deploying AI-powered applications in real-world production environments.
The ideal candidate should have hands-on experience with:
- Kubernetes
- AWS cloud infrastructure
- CI/CD automation
- GPU workload management
- observability and monitoring
- production troubleshooting
- scalable AI model deployment
You will work closely with AI/ML Engineers to deploy and optimize AI inference systems, LLM services, and distributed microservice architectures.
Key Responsibilities
AI Infrastructure & Deployment
- Deploy and maintain AI-powered microservices in production environments
- Manage scalable GPU-based inference systems for live AI/LLM applications
- Optimize model-serving infrastructure for low latency and high availability
- Deploy AI workloads using Docker and Kubernetes
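As a point of reference for the kind of work described above, deploying a GPU-backed inference service on Kubernetes typically looks like the following sketch. The service name, image, and replica count are illustrative placeholders, not part of this posting:

```yaml
# Hypothetical example: a Deployment serving an AI model, requesting one NVIDIA GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference          # placeholder service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: registry.example.com/llm-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # schedules the pod onto a GPU node
          ports:
            - containerPort: 8000
```

The `nvidia.com/gpu` resource limit assumes the cluster runs the NVIDIA device plugin (e.g. via the NVIDIA GPU Operator mentioned below).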
Cloud & Infrastructure Management
- Design and manage AWS cloud infrastructure (EKS, ECS, EC2, VPC, IAM, ALB, Auto Scaling, S3)
- Manage on-premises/in-house servers and hybrid infrastructure environments
- Ensure infrastructure security, scalability, and reliability
Kubernetes & Container Orchestration
- Deploy and manage Kubernetes clusters for distributed AI workloads
- Configure auto-scaling for GPU- and CPU-intensive services
- Manage Helm charts, ingress controllers, service networking, and workload scheduling
- Optimize container performance and resource utilization
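The auto-scaling responsibility above might be configured along these lines; this is a minimal sketch, and the Deployment name and thresholds are assumptions:

```yaml
# Hypothetical HorizontalPodAutoscaler scaling an inference Deployment on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference        # placeholder target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```

Scaling on GPU utilization rather than CPU generally requires exposing GPU metrics (e.g. via a DCGM exporter) as custom metrics, which is beyond this sketch.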
CI/CD & Automation
- Build and maintain CI/CD pipelines for microservices and AI applications
- Automate deployments using GitHub Actions, GitLab CI/CD, Jenkins, Terraform, or Ansible
- Implement Infrastructure as Code (IaC) best practices
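A pipeline of the kind described above might be sketched as a GitHub Actions workflow like this; the registry, paths, and credential setup are placeholders:

```yaml
# Hypothetical GitHub Actions workflow: build, push, and deploy a microservice.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t registry.example.com/llm-server:${{ github.sha }} .
          docker push registry.example.com/llm-server:${{ github.sha }}
      - name: Deploy to Kubernetes
        # assumes cluster credentials have been configured in an earlier step
        run: kubectl apply -f k8s/
```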
Monitoring & Reliability Engineering
- Implement monitoring, logging, and alerting systems using Prometheus, Grafana, Loki, ELK, or similar tools
- Monitor microservice health, latency, GPU utilization, and production metrics
- Troubleshoot and resolve production incidents, outages, and infrastructure bottlenecks
- Ensure high uptime and operational reliability
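The alerting work described above often takes the form of Prometheus rules like the following sketch; the metric name and thresholds are assumed conventions, not specifics from this posting:

```yaml
# Hypothetical Prometheus alerting rule: fire when p99 request latency stays high.
groups:
  - name: inference-alerts
    rules:
      - alert: HighInferenceLatency
        # assumes the service exposes a standard request-duration histogram
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="llm-inference"}[5m])) by (le)) > 2
        for: 10m                # only fire if sustained for 10 minutes
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 2s for llm-inference"
```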
Performance Optimization
- Scale GPU instances dynamically for live inference workloads
- Optimize AI inference performance, container startup time, and infrastructure costs
- Improve deployment efficiency and system throughput
Required Skills & Qualifications
Must Have
- 4+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE
- Strong hands-on experience with Kubernetes in production environments
- Experience deploying and managing microservices at scale
- Strong AWS experience (EKS, ECS, EC2, IAM, VPC, ALB, CloudWatch, Auto Scaling)
- Strong Linux administration and troubleshooting skills
- Experience with Docker and container orchestration
- Experience building CI/CD pipelines
- Experience handling production incidents and debugging distributed systems
- Strong scripting/programming skills in Python, Bash, or Go
AI Infrastructure Experience (Preferred)
- Experience deploying AI/ML/LLM workloads in production
- GPU infrastructure management experience
- Familiarity with:
- vLLM
- Triton Inference Server
- KServe
- Ray Serve
- CUDA containers
- NVIDIA GPU Operator
Monitoring & Observability
- Prometheus
- Grafana
- Loki
- ELK Stack
- OpenTelemetry
- Distributed tracing
Infrastructure & Automation Tools
- Terraform
- Ansible
- ArgoCD
- Helm
- GitOps workflows
Nice to Have
- Experience with vector databases
- Experience with Kafka, RabbitMQ, or Redis
- Understanding of AI inference optimization
- Experience with hybrid cloud/on-premises infrastructure
- Exposure to security best practices and DevSecOps
Key Competencies
- Strong problem-solving and debugging skills
- Production-first mindset
- Ownership mentality
- Ability to work under pressure during incidents
- Strong communication and collaboration skills
- Continuous learning attitude
What We Offer
- Opportunity to work on cutting-edge AI infrastructure systems
- Exposure to large-scale AI deployment architectures
- Competitive salary and growth opportunities
- Collaborative engineering culture
- High-impact technical ownership
KPIs / Success Metrics
- Infrastructure uptime and reliability
- Deployment success rate
- Production incident resolution time
- GPU utilization efficiency
- CI/CD deployment speed
- System scalability and performance optimization
- Monitoring and alerting effectiveness