Job Description
Position Overview
We are seeking an experienced Infrastructure Engineer to architect and manage our AI computing infrastructure. The ideal candidate will have extensive experience in building and scaling ML infrastructure, with particular emphasis on distributed training systems and GPU cluster management.
Key Responsibilities
Design and implement high-performance computing infrastructure for large-scale AI model training
Manage and optimize GPU clusters for distributed training workloads
Build and maintain container orchestration systems for ML workflows
Implement efficient resource allocation and scheduling systems
Design and maintain monitoring and alerting systems for compute infrastructure
Optimize infrastructure costs while maintaining performance
Collaborate with ML teams to support their computing needs
Ensure system reliability, security, and scalability
Required Qualifications
Master's degree in Computer Science, Systems Engineering, or related field
8+ years of experience in infrastructure engineering, with a focus on ML/AI infrastructure
Strong experience with:
GPU cluster management and optimization
Kubernetes and container orchestration
Linux system administration
Infrastructure as Code (IaC)
Proven track record in building large-scale computing systems
Experience with major cloud providers (AWS, GCP, Azure, Alibaba Cloud, Tencent Cloud, etc.)
Preferred Qualifications
Experience with ML infrastructure at major tech companies
Knowledge of distributed training systems (PyTorch DDP, Horovod)
Familiarity with ML frameworks and their infrastructure requirements
Experience with high-performance networking (InfiniBand, RDMA)
Background in performance optimization and troubleshooting
Understanding of ML workload characteristics
Bilingual proficiency (English/Chinese)
Technical Skills
Computing Infrastructure
GPU Clusters: NVIDIA DGX, GPU management tools
Distributed Systems: Slurm, Kubernetes
ML Platforms: Kubeflow, Ray
Job Scheduling: YARN, Slurm
Cloud & Networking
Cloud Platforms:
International: AWS, GCP, Azure
China: Alibaba Cloud, Tencent Cloud
Networking: InfiniBand, RDMA, TCP/IP optimization
Load Balancing: HAProxy, NGINX
Infrastructure Management
Container Technologies: Docker, Kubernetes, Singularity
IaC: Terraform, Ansible, CloudFormation
CI/CD: Jenkins, GitLab CI
Monitoring: Prometheus, Grafana, ELK Stack
Development
Languages: Python, Go, Shell scripting
Version Control: Git
Documentation: Markdown, Confluence
What We Offer
Opportunity to build cutting-edge AI infrastructure
Competitive salary and equity package
Access to latest hardware and technologies
Professional development opportunities
Comprehensive health benefits
Learning and conference budget
Location
Hong Kong (on-site, Hong Kong Science and Technology Park)
Expected Impact
Design and implement next-generation AI computing infrastructure
Optimize resource utilization and cost efficiency
Improve training speed and efficiency for AI models
Build scalable and reliable systems
Projects You'll Work On
Building automated GPU cluster management systems
Implementing efficient resource scheduling for ML workloads
Optimizing distributed training infrastructure
Setting up monitoring and observability systems
Designing disaster recovery and backup solutions
Work Location
Address: 317-318, Building 10W, Hong Kong Science Park, Sha Tin District, Hong Kong
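Posted By
Mr. Zhang, HR
Video Rebirth Limited
- Computer software
- 11-20 employees
- Wholly foreign-owned enterprise / foreign company representative office
- 317-318, Building 10W, Hong Kong Science Park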