Sr Domain\Technical Architect
India
Job Description
Noida, Uttar Pradesh

Job Summary

Business Development Group, HCBU, HCLTech www.hcltech.com
Digital Foundation / Full-Time

We are HCLTech, one of the fastest-growing large tech companies in the world and home to 220,000+ people across 60 countries, supercharging progress through industry-leading capabilities centered around Digital, Engineering and Cloud. The driving force behind that work, our people, are diverse, creative, and passionate, raising the bar for excellence on a regular basis. We, in turn, work hard to bring out the best in them as we strive to help them find their spark and become the best version of themselves that they can be.

If all this sounds like an environment you’ll thrive in, then you’re in the right place. Join us on our journey in advancing the technological world through innovation and creativity.

AI Infrastructure Engineer - L3

The Role

The AI Infrastructure Engineer (L3) provides advanced engineering and architectural expertise for high‑performance AI and ML infrastructure. This role focuses on building, optimizing, and scaling GPU/accelerator environments and distributed systems for large‑scale training and inference workloads.

Competency Focus: High‑performance computing (HPC), distributed systems, Kubernetes, GPU orchestration, cloud optimization
Keywords: Nvidia GPU Infrastructure, Kubernetes, GPU Cluster Administrator, Infrastructure SME, RCA

Responsibilities:

Deploy, configure, and manage GPU and AI accelerator platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU).
Troubleshoot GPU hardware and software issues, including failures, thermal throttling, PCIe/NVLink topology, and driver conflicts.
Install, upgrade, and maintain GPU software stacks, including drivers, CUDA, cuDNN, TensorRT, and firmware.
Perform capacity planning and resource optimization for AI training, fine‑tuning, and inference workloads.
Optimize Linux systems (Ubuntu, RHEL, Rocky) for AI/HPC workloads through NUMA, kernel, and clock tuning.
Manage distributed and high‑performance storage systems, including BeeGFS, Lustre, Ceph,

Key Responsibilities

Operate high‑bandwidth, low‑latency networks, including InfiniBand, RoCE, RDMA, and NVLink.
Administer Kubernetes GPU clusters, leveraging NVIDIA GPU Operator, device plugins, MIG, and node feature discovery.
Support AI and HPC orchestration platforms, including Kubeflow, Ray, MLflow, and Slurm/PBS.
Configure and manage GPU scheduling and sharing strategies, such as node pools, quotas, job queues, and fair‑share policies.
Optimize distributed training workflows using NCCL, PyTorch Distributed, Horovod, and DeepSpeed.
Operate and tune LLM and inference runtimes, including vLLM, Triton Inference Server, and TensorRT‑LLM.
Monitor and tune GPU utilization, memory allocation, and container-level performance.
Automate cluster provisioning and operations using Terraform, Helm, Kustomize, and GitOps (ArgoCD/Flux).
Build automation for GPU diagnostics, node onboarding, and model deployment workflows.
Implement observability and telemetry using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
Lead deep‑dive root cause analysis for GPU, network, storage, and orchestration issues.
Provide L3 support and work with L2/L1 teams for escalations.
Drive production readiness, patching, hotfix rollout, and reliability improvements across AI infrastructure.
Troubleshoot and handle escalations for complex platform failures.
Debug NCCL hangs and GPU fabric issues in depth, and coordinate with OEM and support vendors on critical issues.
Review RCAs, architecture documents, and change plans.
Act as technical advisor to leadership and customers.

Qualifications & Experience

Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field
8–12 years of overall infrastructure or platform engineering experience
4–6 years of specialized experience supporting AI/ML workloads
Demonstrated experience in large‑scale GPU/accelerated computing and distributed systems
Strong experience in Kubernetes, containerization, and orchestration tools
Understanding of AI workloads and MLOps

Certifications Required

NVIDIA Certified Associate – AI Infrastructure
NVIDIA NPN Certification
NVIDIA Base Command Manager certification
AWS Solutions Architect Associate
CKA – Certified Kubernetes Administrator
CKAD – Certified Kubernetes Application Developer

How You’ll Grow

At HCLTech, the growth of an L3 AI Infrastructure Engineer is closely aligned with the organization’s competency framework, which emphasizes technical excellence, collaboration, continuous learning, and Ideapreneurship. At this level, engineers expand their impact through cross‑team collaboration, working with application, data science, cloud, SRE, security, and FinOps teams to design, operate, and optimize scalable AI platforms that directly support customer and business requirements. Regular vendor interaction further accelerates growth, as L3 engineers engage with OEMs and technology partners to resolve critical platform issues, lead joint root cause analyses, and contribute to roadmap and early‑access discussions, strengthening HCLTech’s ecosystem value and delivery confidence.
Sponsored certifications play a key role in enhancing future‑ready skills, enabling engineers to deepen expertise in GPU platforms, Kubernetes, cloud‑native AI, and HPC technologies while applying this knowledge to live customer environments and mentori

Why HCLTech?

At HCLTech, you'll supercharge your potential. You'll find your career. And you'll find your spark. All at a place that knows that helping its customers stay on top starts by putting its people first.

HCLTech is a global technology company, home to more than 226,300 people across 60 countries, delivering industry-leading capabilities centered around digital, engineering, cloud and AI, powered by a broad portfolio of technology services and products. We work with clients across all major verticals, providing industry solutions for Financial Services, Manufacturing, Life Sciences and Healthcare, Technology and Services, Telecom and Media, Retail and CPG, and Public Services. Consolidated revenues as of 12 months ending December 2025 totaled $14.5 billion.

Benefits

At HCLTech, we believe in empowering our employees with comprehensive benefits that support their professional growth and enhance their well-being. When you sign up for a career with us, you gain access to:

Industry-benchmarked compensation
Best-in-class healthcare benefits
Personal time off
Maternity and paternity benefits
Access to skills / higher education programs/resources
Discounts on products and services via Benefit Box
Participate in CSR programs and live life with a purpose
Opportunities to grow and advance your career

Note: The benefits listed above vary depending on the nature of your employment and the country where you work. Some benefits may be available in some countries but not in all.