
GPU Memory Partitioning with the Nebuly MPS Nvidia Device Plugin

IT록흐 2025. 5. 7. 20:18

 

 

  • Issue

Testing the open-source Nvidia Device Plugin provided by Nebuly showed that dynamic partitioning works correctly when there is only one GPU. With two GPUs, however, OOM testing revealed that partitioning works on GPU 0 but not on GPU 1. For example, with memory partitioned at 4GB, a virtualized GPU backed by GPU 0 hits OOM after using only 4GB, while a virtualized GPU backed by GPU 1 goes past 4GB and consumes the entire physical GPU memory before OOM occurs.

 

 

  • Details

https://github.com/nebuly-ai/k8s-device-plugin/tree/v0.13.0?tab=readme-ov-file

 


 

 

I pulled the v0.13.0 repository published on GitHub, built the image myself, and ran the tests. With the self-built image, memory partitioning did not work even on GPU 0. When I tested with the image pulled via Helm, which was the same v0.13.0, GPU 0 was partitioned correctly but GPU 1 was not. (Something clearly looks off.)

 

 

  • Resolution

Issue )  Neither GPU 0 nor GPU 1 gets memory-partitioned.

Fix )   Changed ':' to '=' so that GPU 0 is partitioned correctly.

 

server.go

		memLimits := make([]string, 0)
		for _, mpsDevice := range requestedMPSDevices {
			// changed ':' => '=' : CUDA_MPS_PINNED_DEVICE_MEM_LIMIT expects "<index>=<size>"
			//limit := fmt.Sprintf("%s:%dG", mpsDevice.Index, mpsDevice.AnnotatedID.GetMemoryGB())
			limit := fmt.Sprintf("%s=%dG", mpsDevice.Index, mpsDevice.AnnotatedID.GetMemoryGB())
			memLimits = append(memLimits, limit)

 

 

The value of the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variable must use the format %s=%dG, not %s:%dG.

 

https://docs.nvidia.com/deploy/mps/index.html#architecture
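
For reference, the MPS documentation linked above describes CUDA_MPS_PINNED_DEVICE_MEM_LIMIT as a comma-separated list of <device-index>=<size> entries (e.g. 0=4G,1=8G). Below is a minimal standalone sketch of building such a value; the two limits are made up purely for illustration.

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Limit entries follow the documented "<device-index>=<size>" form.
	// These two example limits are illustrative, not taken from the plugin.
	limits := []string{
		fmt.Sprintf("%d=%dG", 0, 4), // pin device 0 to 4 GB
		fmt.Sprintf("%d=%dG", 1, 8), // pin device 1 to 8 GB
	}

	// A ':' separator ("0:4G") does not match the documented format,
	// which is why no memory limit was actually enforced before the fix.
	fmt.Println("CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=" + strings.Join(limits, ","))
	// Output: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=0=4G,1=8G
}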

 

 

 

 

Issue ) GPU 0 gets memory-partitioned, but GPU 1 does not.

Fix ) Removed the %s and hard-coded the index to 0.

		memLimits := make([]string, 0)
		for _, mpsDevice := range requestedMPSDevices {
			//limit := fmt.Sprintf("%s:%dG", mpsDevice.Index, mpsDevice.AnnotatedID.GetMemoryGB())
			//limit := fmt.Sprintf("%s=%dG", mpsDevice.Index, mpsDevice.AnnotatedID.GetMemoryGB())
			// only one GPU is visible inside the container, so its client-side index is always 0
			limit := fmt.Sprintf("0=%dG", mpsDevice.AnnotatedID.GetMemoryGB())
			memLimits = append(memLimits, limit)

 

 

The response that the Nvidia Device Plugin sends back to the kubelet looks like this:

2025/04/01 01:44:17 server.go ---- response : {map[CUDA_MPS_PINNED_DEVICE_MEM_LIMIT:1=8G CUDA_MPS_PIPE_DIRECTORY:/tmp/nvidia-mps NVIDIA_VISIBLE_DEVICES:GPU-4ddaaf11-730a-0acf-5233-a8f5a5b891e3] [&Mount{ContainerPath:/tmp/nvidia-mps,HostPath:/tmp/nvidia-mps,ReadOnly:false,}] [] map[] {} 0}
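
Mapped onto the device plugin API, that logged response corresponds roughly to the following. This is only an illustrative reconstruction using the ContainerAllocateResponse and Mount types from k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1; the variable name and construction are not the plugin's actual code.

package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func main() {
	// Illustrative reconstruction of the response logged above (values copied from the log).
	resp := &pluginapi.ContainerAllocateResponse{
		Envs: map[string]string{
			"CUDA_MPS_PINNED_DEVICE_MEM_LIMIT": "1=8G", // still keyed to the server-side index 1
			"CUDA_MPS_PIPE_DIRECTORY":          "/tmp/nvidia-mps",
			"NVIDIA_VISIBLE_DEVICES":           "GPU-4ddaaf11-730a-0acf-5233-a8f5a5b891e3",
		},
		Mounts: []*pluginapi.Mount{
			{ContainerPath: "/tmp/nvidia-mps", HostPath: "/tmp/nvidia-mps", ReadOnly: false},
		},
	}
	fmt.Printf("%+v\n", resp)
}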

 

 

The kubelet takes the environment variables, mounts, annotations, and other information from the Nvidia Device Plugin and creates the container with them.

Looking at the environment variables,

NVIDIA_VISIBLE_DEVICES:GPU-4ddaaf11-730a-0acf-5233-a8f5a5b891e3

If NVIDIA_VISIBLE_DEVICES lists only one device, the client side sees exactly one GPU, so that GPU's index should be 0.

 

 

However,

 

the value CUDA_MPS_PINNED_DEVICE_MEM_LIMIT:1=8G asks MPS to pin 8GB on GPU index 1. That index 1 is the server-side index, not the client-side index seen inside the container. The Nebuly Nvidia Device Plugin also restricts each container to a single MPS device by default.

 

server.go ( https://github.com/NVIDIA/k8s-device-plugin  )

		// if failRequestsGreaterThanOne is true and more than one MPS device is requested, return an error ( only one MPS device can be allocated per container )
		if plugin.config.Sharing.MPS.FailRequestsGreaterThanOne && len(requestedMPSDevices) > 1 {
			return nil, fmt.Errorf("request for '%v: %v' too large: maximum request size for shared resources is 1", plugin.rm.Resource(), len(req.DevicesIDs))
		}
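
Since the plugin rejects requests for more than one shared device, a container only ever sees a single GPU. One way to sanity-check from inside a test pod that this lone GPU shows up at index 0 is a small NVML program; the sketch below assumes the go-nvml bindings (github.com/NVIDIA/go-nvml), which this post does not actually use.

package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// With NVIDIA_VISIBLE_DEVICES set to a single UUID, the container runtime
	// exposes only that GPU, so the count here is 1 and the device is at index 0.
	count, _ := nvml.DeviceGetCount()
	fmt.Println("visible GPUs:", count)

	device, _ := nvml.DeviceGetHandleByIndex(0)
	uuid, _ := device.GetUUID()
	fmt.Println("index 0 UUID:", uuid)
}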

 

 

Therefore, since every container is allocated only one GPU, the GPU index inside the container is fixed at 0. After removing the %s and hard-coding 0, memory partitioning was applied correctly on GPU 1 as well.
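
For completeness, here is a sketch of how the corrected limit presumably flows into the environment variable seen in the kubelet response. The envs map name and the comma join are assumptions about the surrounding server.go code, not a copy of it.

		// Sketch only: with FailRequestsGreaterThanOne enabled, requestedMPSDevices
		// holds exactly one entry, so this produces a single "0=<N>G" limit.
		memLimits := make([]string, 0, len(requestedMPSDevices))
		for _, mpsDevice := range requestedMPSDevices {
			memLimits = append(memLimits, fmt.Sprintf("0=%dG", mpsDevice.AnnotatedID.GetMemoryGB()))
		}
		// assumed: the limits are joined and exposed to the container as
		// CUDA_MPS_PINNED_DEVICE_MEM_LIMIT, matching the single-entry format in the
		// logged response (which would read "0=8G" after the fix)
		envs["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = strings.Join(memLimits, ",")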

 

 

 

Test

 

Pod A - 4GB allocated on GPU 0

Pod B - 8GB allocated on GPU 1

Pod C - 8GB allocated on GPU 1

 

Log of Pod A, allocated 4GB on GPU 0 => OOM occurred once it went past 4GB

 

root@k8s-m1:~/mingu/mps/test-pod$ kubectl logs mps-test-pod-1
Allocated Memory: 1.00 GB, Reserved Memory: 1.00 GB
2025-04-01 02:33:02,264 - Allocated Memory: 1.00 GB, Reserved Memory: 1.00 GB
Allocated Memory: 2.00 GB, Reserved Memory: 2.00 GB
2025-04-01 02:33:03,268 - Allocated Memory: 2.00 GB, Reserved Memory: 2.00 GB
Allocated Memory: 3.00 GB, Reserved Memory: 3.00 GB
2025-04-01 02:33:04,271 - Allocated Memory: 3.00 GB, Reserved Memory: 3.00 GB
gradually_increase_memory Error
2025-04-01 02:33:05,282 - Memory allocation failed: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 778.05 MiB is free. Process 296294 has 28.06 MiB memory in use. Process 298284 has 3.24 GiB memory in use. Of the allocated memory 3.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Memory allocation failed: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 778.05 MiB is free. Process 296294 has 28.06 MiB memory in use. Process 298284 has 3.24 GiB memory in use. Of the allocated memory 3.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

 

 

Log of Pod B, allocated 8GB on GPU 1 => OOM occurred once it went past 8GB

root@k8s-m1:~/mingu/mps/test-pod$ kubectl logs mps-test-pod-2
Allocated Memory: 1.00 GB, Reserved Memory: 1.00 GB
2025-04-01 02:33:08,300 - Allocated Memory: 1.00 GB, Reserved Memory: 1.00 GB
Allocated Memory: 2.00 GB, Reserved Memory: 2.00 GB
2025-04-01 02:33:09,303 - Allocated Memory: 2.00 GB, Reserved Memory: 2.00 GB
Allocated Memory: 3.00 GB, Reserved Memory: 3.00 GB
2025-04-01 02:33:10,305 - Allocated Memory: 3.00 GB, Reserved Memory: 3.00 GB
Allocated Memory: 4.00 GB, Reserved Memory: 4.00 GB
2025-04-01 02:33:11,308 - Allocated Memory: 4.00 GB, Reserved Memory: 4.00 GB
Allocated Memory: 5.00 GB, Reserved Memory: 5.00 GB
2025-04-01 02:33:12,311 - Allocated Memory: 5.00 GB, Reserved Memory: 5.00 GB
Allocated Memory: 6.00 GB, Reserved Memory: 6.00 GB
2025-04-01 02:33:13,314 - Allocated Memory: 6.00 GB, Reserved Memory: 6.00 GB
Allocated Memory: 7.00 GB, Reserved Memory: 7.00 GB
2025-04-01 02:33:14,317 - Allocated Memory: 7.00 GB, Reserved Memory: 7.00 GB
gradually_increase_memory Error
2025-04-01 02:33:15,329 - Memory allocation failed: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 778.05 MiB is free. Process 296294 has 28.06 MiB memory in use. Process 298381 has 7.24 GiB memory in use. Of the allocated memory 7.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Memory allocation failed: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 778.05 MiB is free. Process 296294 has 28.06 MiB memory in use. Process 298381 has 7.24 GiB memory in use. Of the allocated memory 7.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
 

 

Log of Pod C, allocated 8GB on GPU 1 => OOM occurred once it went past 5GB ( the GPU has 12GB in total, but Pod B had already taken 8GB first, so Pod C hit OOM while using 5GB )

root@k8s-m1:~/mingu/mps/test-pod$ kubectl logs mps-test-pod-3
Allocated Memory: 1.00 GB, Reserved Memory: 1.00 GB
2025-04-01 02:33:09,176 - Allocated Memory: 1.00 GB, Reserved Memory: 1.00 GB
Allocated Memory: 2.00 GB, Reserved Memory: 2.00 GB
2025-04-01 02:33:10,178 - Allocated Memory: 2.00 GB, Reserved Memory: 2.00 GB
Allocated Memory: 3.00 GB, Reserved Memory: 3.00 GB
2025-04-01 02:33:11,181 - Allocated Memory: 3.00 GB, Reserved Memory: 3.00 GB
Allocated Memory: 4.00 GB, Reserved Memory: 4.00 GB
2025-04-01 02:33:12,183 - Allocated Memory: 4.00 GB, Reserved Memory: 4.00 GB
Allocated Memory: 5.00 GB, Reserved Memory: 5.00 GB
2025-04-01 02:33:13,185 - Allocated Memory: 5.00 GB, Reserved Memory: 5.00 GB
gradually_increase_memory Error
2025-04-01 02:33:14,198 - Memory allocation failed: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacty of 11.76 GiB of which 249.69 MiB is free. Process 296294 has 28.06 MiB memory in use. Process 298381 has 6.24 GiB memory in use. Process 298450 has 5.24 GiB memory in use. Of the allocated memory 5.00 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

 
