Senior AI Infrastructure & Platform Engineer
- Industry Other
- Category Production/Maintenance/Quality
- Location Kathmandu, Nepal
- Expiry date May 17, 2026
Job Description
Department: Engineering / Infrastructure / AI Systems
Location: Kathmandu, Nepal (Hybrid/On-site)
Experience Required: 4+ Years (DevOps / Platform Engineering / Cloud Infrastructure)
Job Summary
We are looking for a highly skilled Senior AI Infrastructure & Platform Engineer to design, deploy, scale, and maintain production-grade AI systems and cloud infrastructure.
This role is ideal for a DevOps or Platform Engineer who has strong experience managing microservices at scale and is passionate about deploying AI-powered applications in real-world production environments.
The ideal candidate should have hands-on experience with:
- Kubernetes
- AWS cloud infrastructure
- CI/CD automation
- GPU workload management
- observability and monitoring
- production troubleshooting
- scalable AI model deployment
You will work closely with AI/ML Engineers to deploy and optimize AI inference systems, LLM services, and distributed microservice architectures.
Key Responsibilities
AI Infrastructure & Deployment
- Deploy and maintain AI-powered microservices in production environments
- Manage scalable GPU-based inference systems for live AI/LLM applications
- Optimize model-serving infrastructure for low latency and high availability
- Deploy AI workloads using Docker and Kubernetes
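As a point of reference for the kind of work described above, deploying a GPU-backed inference service on Kubernetes typically looks like the following sketch. The service name, image, and replica count are illustrative placeholders, not part of this posting:

```yaml
# Hypothetical example: a Deployment serving an AI model, requesting one NVIDIA GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference          # placeholder service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: server
          image: registry.example.com/llm-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # schedules the pod onto a GPU node
          ports:
            - containerPort: 8000
```

The `nvidia.com/gpu` resource limit assumes the cluster runs the NVIDIA device plugin (e.g. via the NVIDIA GPU Operator mentioned below).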
Cloud & Infrastructure Management
- Design and manage AWS cloud infrastructure (EKS, ECS, EC2, VPC, IAM, ALB, Auto Scaling, S3)
- Manage on-premises/in-house servers and hybrid infrastructure environments
- Ensure infrastructure security, scalability, and reliability
Kubernetes & Container Orchestration
- Deploy and manage Kubernetes clusters for distributed AI workloads
- Configure auto-scaling for GPU- and CPU-intensive services
- Manage Helm charts, ingress controllers, service networking, and workload scheduling
- Optimize container performance and resource utilization
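The auto-scaling responsibility above might be configured along these lines; this is a minimal sketch, and the Deployment name and thresholds are assumptions:

```yaml
# Hypothetical HorizontalPodAutoscaler scaling an inference Deployment on CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference        # placeholder target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```

Scaling on GPU utilization rather than CPU generally requires exposing GPU metrics (e.g. via a DCGM exporter) as custom metrics, which is beyond this sketch.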
CI/CD & Automation
- Build and maintain CI/CD pipelines for microservices and AI applications
- Automate deployments using GitHub Actions, GitLab CI/CD, Jenkins, Terraform, or Ansible
- Implement Infrastructure as Code (IaC) best practices
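A pipeline of the kind described above might be sketched as a GitHub Actions workflow like this; the registry, paths, and credential setup are placeholders:

```yaml
# Hypothetical GitHub Actions workflow: build, push, and deploy a microservice.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t registry.example.com/llm-server:${{ github.sha }} .
          docker push registry.example.com/llm-server:${{ github.sha }}
      - name: Deploy to Kubernetes
        # assumes cluster credentials have been configured in an earlier step
        run: kubectl apply -f k8s/
```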
Monitoring & Reliability Engineering
- Implement monitoring, logging, and alerting systems using Prometheus, Grafana, Loki, ELK, or similar tools
- Monitor microservice health, latency, GPU utilization, and production metrics
- Troubleshoot and resolve production incidents, outages, and infrastructure bottlenecks
- Ensure high uptime and operational reliability
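The alerting work described above often takes the form of Prometheus rules like the following sketch; the metric name and thresholds are assumed conventions, not specifics from this posting:

```yaml
# Hypothetical Prometheus alerting rule: fire when p99 request latency stays high.
groups:
  - name: inference-alerts
    rules:
      - alert: HighInferenceLatency
        # assumes the service exposes a standard request-duration histogram
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="llm-inference"}[5m])) by (le)) > 2
        for: 10m                # only fire if sustained for 10 minutes
        labels:
          severity: page
        annotations:
          summary: "p99 latency above 2s for llm-inference"
```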
Performance Optimization
- Scale GPU instances dynamically for live inference workloads
- Optimize AI inference performance, container startup time, and infrastructure costs
- Improve deployment efficiency and system throughput
Required Skills & Qualifications
Must Have
- 4+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE
- Strong hands-on experience with Kubernetes in production environments
- Experience deploying and managing microservices at scale
- Strong AWS experience (EKS, ECS, EC2, IAM, VPC, ALB, CloudWatch, Auto Scaling)
- Strong Linux administration and troubleshooting skills
- Experience with Docker and container orchestration
- Experience building CI/CD pipelines
- Experience handling production incidents and debugging distributed systems
- Strong scripting/programming skills in Python, Bash, or Go
AI Infrastructure Experience (Preferred)
- Experience deploying AI/ML/LLM workloads in production
- GPU infrastructure management experience
- Familiarity with:
- vLLM
- Triton Inference Server
- KServe
- Ray Serve
- CUDA containers
- NVIDIA GPU Operator
Monitoring & Observability
- Prometheus
- Grafana
- Loki
- ELK Stack
- OpenTelemetry
- Distributed tracing
Infrastructure & Automation Tools
- Terraform
- Ansible
- ArgoCD
- Helm
- GitOps workflows
Nice to Have
- Experience with vector databases
- Experience with Kafka, RabbitMQ, or Redis
- Understanding of AI inference optimization
- Experience with hybrid cloud/on-premises infrastructure
- Exposure to security best practices and DevSecOps
Key Competencies
- Strong problem-solving and debugging skills
- Production-first mindset
- Ownership mentality
- Ability to work under pressure during incidents
- Strong communication and collaboration skills
- Continuous learning attitude
What We Offer
- Opportunity to work on cutting-edge AI infrastructure systems
- Exposure to large-scale AI deployment architectures
- Competitive salary and growth opportunities
- Collaborative engineering culture
- High-impact technical ownership
KPIs / Success Metrics
- Infrastructure uptime and reliability
- Deployment success rate
- Production incident resolution time
- GPU utilization efficiency
- CI/CD deployment speed
- System scalability and performance optimization
- Monitoring and alerting effectiveness