Senior AI Infrastructure & Platform Engineer | The Pace Infosys

Senior AI Infrastructure & Platform Engineer

  • Industry: Other
  • Category: Production/Maintenance/Quality
  • Location: Kathmandu, Nepal
  • Expiry date: May 17, 2026
Job Description

Department:

Engineering / Infrastructure / AI Systems

Location:

Kathmandu, Nepal (Hybrid/On-site)

Experience Required:

4+ Years (DevOps / Platform Engineering / Cloud Infrastructure)


Job Summary

We are looking for a highly skilled Senior AI Infrastructure & Platform Engineer to design, deploy, scale, and maintain production-grade AI systems and cloud infrastructure.

This role is ideal for a DevOps or Platform Engineer who has strong experience managing microservices at scale and is passionate about deploying AI-powered applications in real-world production environments.

The ideal candidate should have hands-on experience with:

  • Kubernetes
  • AWS cloud infrastructure
  • CI/CD automation
  • GPU workload management
  • Observability and monitoring
  • Production troubleshooting
  • Scalable AI model deployment

You will work closely with AI/ML Engineers to deploy and optimize AI inference systems, LLM services, and distributed microservice architectures.


Key Responsibilities

AI Infrastructure & Deployment

  • Deploy and maintain AI-powered microservices in production environments
  • Manage scalable GPU-based inference systems for live AI/LLM applications
  • Optimize model-serving infrastructure for low latency and high availability
  • Deploy AI workloads using Docker and Kubernetes


Cloud & Infrastructure Management

  • Design and manage AWS cloud infrastructure (EKS, ECS, EC2, VPC, IAM, ALB, Auto Scaling, S3)
  • Manage on-premises/in-house servers and hybrid infrastructure environments
  • Ensure infrastructure security, scalability, and reliability


Kubernetes & Container Orchestration

  • Deploy and manage Kubernetes clusters for distributed AI workloads
  • Configure auto-scaling for GPU- and CPU-intensive services
  • Manage Helm charts, ingress controllers, service networking, and workload scheduling
  • Optimize container performance and resource utilization


CI/CD & Automation

  • Build and maintain CI/CD pipelines for microservices and AI applications
  • Automate deployments using GitHub Actions, GitLab CI/CD, Jenkins, Terraform, or Ansible
  • Implement Infrastructure as Code (IaC) best practices


Monitoring & Reliability Engineering

  • Implement monitoring, logging, and alerting systems using Prometheus, Grafana, Loki, ELK, or similar tools
  • Monitor microservice health, latency, GPU utilization, and production metrics
  • Troubleshoot and resolve production incidents, outages, and infrastructure bottlenecks
  • Ensure high uptime and operational reliability


Performance Optimization

  • Scale GPU instances dynamically for live inference workloads
  • Optimize AI inference performance, container startup time, and infrastructure costs
  • Improve deployment efficiency and system throughput


Required Skills & Qualifications

Must Have

  • 4+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE
  • Strong hands-on experience with Kubernetes in production environments
  • Experience deploying and managing microservices at scale
  • Strong AWS experience (EKS, ECS, EC2, IAM, VPC, ALB, CloudWatch, Auto Scaling)
  • Strong Linux administration and troubleshooting skills
  • Experience with Docker and container orchestration
  • Experience building CI/CD pipelines
  • Experience handling production incidents and debugging distributed systems
  • Strong scripting/programming skills in Python, Bash, or Go


AI Infrastructure Experience (Preferred)

  • Experience deploying AI/ML/LLM workloads in production
  • GPU infrastructure management experience
  • Familiarity with:
      • vLLM
      • Triton Inference Server
      • KServe
      • Ray Serve
      • CUDA containers
      • NVIDIA GPU Operator


Monitoring & Observability

  • Prometheus
  • Grafana
  • Loki
  • ELK Stack
  • OpenTelemetry
  • Distributed tracing


Infrastructure & Automation Tools

  • Terraform
  • Ansible
  • ArgoCD
  • Helm
  • GitOps workflows


Nice to Have

  • Experience with vector databases
  • Experience with Kafka, RabbitMQ, or Redis
  • Understanding of AI inference optimization
  • Experience with hybrid cloud/on-premises infrastructure
  • Exposure to security best practices and DevSecOps


Key Competencies

  • Strong problem-solving and debugging skills
  • Production-first mindset
  • Ownership mentality
  • Ability to work under pressure during incidents
  • Strong communication and collaboration skills
  • Continuous learning attitude


What We Offer

  • Opportunity to work on cutting-edge AI infrastructure systems
  • Exposure to large-scale AI deployment architectures
  • Competitive salary and growth opportunities
  • Collaborative engineering culture
  • High-impact technical ownership


KPIs / Success Metrics

  • Infrastructure uptime and reliability
  • Deployment success rate
  • Production incident resolution time
  • GPU utilization efficiency
  • CI/CD deployment speed
  • System scalability and performance optimization
  • Monitoring and alerting effectiveness