Ming-Hung Chen works on enabling large-scale AI/ML and high-performance computing (HPC) workload support in IBM Cloud, covering large-scale computing system design, hypervisor and operating system optimization, software-defined networking, communication protocols, AI application acceleration and fault tolerance, and composable systems. Several of his contributions to the cloud control plane and system design have been integrated into the IBM Cloud production environment.
Computers have never been more important to the world. At IBM Research, we’re designing new systems that provide flexible, secure computing environments — from bits to neurons and qubits. We’re working on innovations in hybrid cloud infrastructure, operating systems, and software. Our goal is to create technologies that improve performance, security, and ease of use across hybrid and multi-cloud computing. We want to enable clients to dynamically compose best-of-breed services and applications freely and frictionlessly across distributed computing environments and accelerate data-driven innovations.
As the foundational technologies for composability, such as CXL and PCIe Gen5/6, mature, we plan to re-evaluate them to understand the trade-offs between performance impact and flexibility, specifically for distributed AI workloads. This project will allow the participant to try out multiple external enclosures from different vendors with high-end datacenter-grade accelerators. A comprehensive evaluation, covering the performance and limitations of existing solutions, could be valuable to both academia and industry and could be published as a paper. The participant may also develop a resource management mechanism and/or investigate potential security concerns in existing composable solutions.
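A first step in such an evaluation could be comparing the negotiated PCIe link speed and width of direct-attached versus enclosure-attached accelerators, since the external link often caps achievable bandwidth. Below is a minimal sketch, assuming a Linux host that exposes these attributes under sysfs; the per-lane bandwidth table and helper names are illustrative, not part of any framework:

```python
import glob
import os

# Approximate per-lane one-direction bandwidth in GB/s for each reported link
# speed (8 GT/s = Gen3, 16 GT/s = Gen4, 32 GT/s = Gen5; values account for
# line-encoding overhead).
SPEED_GBPS_PER_LANE = {
    "2.5 GT/s PCIe": 0.25,
    "5.0 GT/s PCIe": 0.5,
    "8.0 GT/s PCIe": 0.985,
    "16.0 GT/s PCIe": 1.969,
    "32.0 GT/s PCIe": 3.938,
}

def link_bandwidth_gbps(speed: str, width: str) -> float:
    """Estimate one-direction link bandwidth from sysfs-reported speed/width."""
    return SPEED_GBPS_PER_LANE.get(speed.strip(), 0.0) * int(width)

def survey_pci_links(root: str = "/sys/bus/pci/devices"):
    """Yield (device address, estimated GB/s) for every PCI device that
    exposes current_link_speed and current_link_width attributes."""
    for dev in sorted(glob.glob(os.path.join(root, "*"))):
        try:
            with open(os.path.join(dev, "current_link_speed")) as f:
                speed = f.read().strip()
            with open(os.path.join(dev, "current_link_width")) as f:
                width = f.read().strip()
        except OSError:
            continue  # device does not expose link attributes; skip it
        yield os.path.basename(dev), link_bandwidth_gbps(speed, width)
```

Comparing these estimates against measured transfer bandwidth (e.g., from a host-to-device copy benchmark) would show how much of the nominal link capacity each enclosure actually delivers.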
From June 9 to August 31, 2025 (adjustable at the discretion of the organisation)
More: https://research.ibm.com/hybrid-cloud
One of the open issues in distributed AI model training is how to deal with unexpected hardware failures. This project investigates state-of-the-art AI frameworks, e.g., PyTorch, and proposes a mechanism to handle common hardware failures such as GPU or node failures. We may implement a proof-of-concept prototype and test it on state-of-the-art GPU systems to validate the proposed solution. The results can be published as a paper, and the PoC source code may also be contributed to the open-source community.
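One common baseline such a mechanism could build on is periodic checkpointing with restart-on-failure. The sketch below uses plain Python with a simulated training step standing in for a real PyTorch loop; all names here (`GPUFailure`, `resilient_train`, etc.) are illustrative and not part of any framework:

```python
import os
import pickle

class GPUFailure(RuntimeError):
    """Stand-in for a hardware fault (e.g., an uncorrectable GPU error)."""

def train_step(state):
    # Placeholder for one optimizer step; a real loop would run a framework
    # training step here and update model/optimizer state.
    state["step"] += 1
    state["loss"] = 1.0 / state["step"]
    return state

def save_checkpoint(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def resilient_train(total_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Run training to total_steps, checkpointing every ckpt_every steps and
    rolling back to the last checkpoint when a (simulated) failure occurs."""
    state = {"step": 0, "loss": None}
    while state["step"] < total_steps:
        try:
            if fail_at is not None and state["step"] == fail_at:
                fail_at = None  # inject the simulated fault only once
                raise GPUFailure("simulated GPU fault")
            state = train_step(state)
            if state["step"] % ckpt_every == 0:
                save_checkpoint(state, ckpt_path)
        except GPUFailure:
            # Roll back to the last durable checkpoint and resume; if no
            # checkpoint exists yet, restart training from scratch.
            if os.path.exists(ckpt_path):
                state = load_checkpoint(ckpt_path)
            else:
                state = {"step": 0, "loss": None}
    return state
```

A production mechanism would additionally need to detect failures asynchronously, re-form the distributed process group, and redistribute work across the surviving GPUs, which is where the proposed research goes beyond this simple restart loop.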
From June 9 to August 31, 2025 (adjustable at the discretion of the organisation)