sysgpu: Vulkan memory allocation for buffers is broken #1253

Open

opened 2024-08-24 20:53:50 +00:00 by CoffeeImpliesCode · 5 comments

CoffeeImpliesCode commented

2024-08-24 20:53:50 +00:00

(Migrated from github.com)

For me (running NixOS with the nvidia-drm driver), all examples except run-triangle crash at line 1190:
https://github.com/hexops/mach/blob/bfa3b069f7d1e1e6b8cb5e0d9199e397931cec2e/src/sysgpu/vulkan.zig#L1188-L1192
with api error Result.error_out_of_device_memory. This was weird, because my GPU (NVIDIA GeForce GTX 1660 Ti) has 6GiB of memory and only ~450MiB were in use at the time. Printing out the arguments requirements.size and mem_type_index yields

Vkd: Allocate 67108864b on index 5
Vkd: Allocate 33554432b on index 5
Vkd: Allocate 33554432b on index 5
Vkd: Allocate 4194304b on index 5
Vkd: Allocate 128b on index 5
Vkd: Allocate 67108864b on index 5
Vkd: Allocate 67108864b on index 5
thread 26160 panic: api error

so after allocating 196MiB of GPU memory, all in mem_type_index 5, allocating another 64MiB fails with OoM. Enumerating the available memory types in findBestAllocator yields

Typeof 0: vk.MemoryType{ .property_flags = vk.MemoryPropertyFlags{}, .heap_index = 1 }
Typeof 1: vk.MemoryType{ .property_flags = vk.MemoryPropertyFlags{ .device_local_bit }, .heap_index = 0 }
Typeof 2: vk.MemoryType{ .property_flags = vk.MemoryPropertyFlags{ .device_local_bit }, .heap_index = 0 }
Typeof 3: vk.MemoryType{ .property_flags = vk.MemoryPropertyFlags{ .host_visible_bit, .host_coherent_bit }, .heap_index = 1 }
Typeof 4: vk.MemoryType{ .property_flags = vk.MemoryPropertyFlags{ .host_visible_bit, .host_coherent_bit, .host_cached_bit }, .heap_index = 1 }
Typeof 5: vk.MemoryType{ .property_flags = vk.MemoryPropertyFlags{ .device_local_bit, .host_visible_bit, .host_coherent_bit }, .heap_index = 2 }

of which findBestAllocator selects the "desired" combination of .device_local_bit and .host_visible_bit, giving CPU-writable GPU memory. Cross-referencing with vulkaninfo gives info on the 3 memory heaps:

	memoryHeaps[0]:
		size   = 6442450944 (0x180000000) (6.00 GiB)
		budget = 5766578176 (0x157b70000) (5.37 GiB)
		usage  = 0 (0x00000000) (0.00 B)
		flags: count = 1
			MEMORY_HEAP_DEVICE_LOCAL_BIT
	memoryHeaps[1]:
		size   = 25182732288 (0x5dd020000) (23.45 GiB)
		budget = 25182732288 (0x5dd020000) (23.45 GiB)
		usage  = 0 (0x00000000) (0.00 B)
		flags:
			None
	memoryHeaps[2]:
		size   = 257949696 (0x0f600000) (246.00 MiB)
		budget = 241500160 (0x0e650000) (230.31 MiB)
		usage  = 16449536 (0x00fb0000) (15.69 MiB)
		flags: count = 1
			MEMORY_HEAP_DEVICE_LOCAL_BIT

so heap_index 0 is my 6GiB of GPU RAM, 1 is (a portion of) my CPU RAM, and 2 is a 246MiB CPU-accessible GPU memory region (230MiB budget), which explains the crash when requesting 260MiB for buffers.
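The arithmetic can be checked with a small Python model of the log above. This is only a sketch: `first_failing_request` is a hypothetical helper, and it ignores allocation granularity and the ~15.69MiB the heap already reports as used.

```python
MIB = 1024 * 1024

# Sizes and budget of memoryHeaps[2], copied from the vulkaninfo output above.
HEAP2_SIZE = 257949696
HEAP2_BUDGET = 241500160

# Requested sizes in bytes, in the order they appear in the log.
requests = [67108864, 33554432, 33554432, 4194304, 128, 67108864, 67108864]


def first_failing_request(requests, capacity):
    """Hypothetical helper: return (index, bytes already allocated) for the
    first request that no longer fits into `capacity`, or None if all fit."""
    used = 0
    for index, size in enumerate(requests):
        if used + size > capacity:
            return index, used
        used += size
    return None


index, used = first_failing_request(requests, HEAP2_SIZE)
print(index, used // MIB)      # prints: 6 196
print(sum(requests) // MIB)    # prints: 260
```

The seventh request (index 6) is the first one that does not fit, whether the limit is the heap size or its budget, which matches the point where the log ends in a panic.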

Hard-coding the selection to CPU memory makes the examples run fine. In my understanding this memory will be re-streamed to the GPU every single frame, making this version slow, and I'm not sure from browsing whether that is within the Vulkan spec or a nicety of the NVIDIA driver. Selecting the main GPU memory (predictably) fails later with Result.error_memory_map_failed, as it is not CPU-writable.

Reading up a bit on Vulkan memory management, this is all by design: CPU-accessible GPU memory shouldn't really exist (afaict), and (based on some light reading) the intended approach is to allocate both CPU and GPU memory, write the data into the CPU allocation, and then do a CPU-to-GPU memory transfer so the buffer data ends up in fast GPU memory.
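To illustrate what that would mean for the memory types listed above, here is a minimal Python model of the selection step. It is a sketch under assumptions: `pick_type` and its "largest heap wins" rule are hypothetical and not the logic of `findBestAllocator`; only the flag values (taken from the Vulkan headers) and the type and heap tables (taken from the output above) are real.

```python
# VkMemoryPropertyFlagBits values from the Vulkan headers.
DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT, HOST_CACHED = 0x1, 0x2, 0x4, 0x8

# (property_flags, heap_index) per memory type, copied from the listing above.
MEMORY_TYPES = [
    (0, 1),
    (DEVICE_LOCAL, 0),
    (DEVICE_LOCAL, 0),
    (HOST_VISIBLE | HOST_COHERENT, 1),
    (HOST_VISIBLE | HOST_COHERENT | HOST_CACHED, 1),
    (DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT, 2),
]

# Heap sizes in bytes, from the vulkaninfo output above.
HEAP_SIZES = [6442450944, 25182732288, 257949696]


def pick_type(required, avoid=0):
    """Hypothetical helper: among the types that have every `required` flag
    and none of the `avoid` flags, return the index of the one on the
    largest heap (lowest index on ties), or None if nothing matches."""
    candidates = [
        (HEAP_SIZES[heap], -index)
        for index, (flags, heap) in enumerate(MEMORY_TYPES)
        if flags & required == required and flags & avoid == 0
    ]
    if not candidates:
        return None
    return -max(candidates)[1]


# The combined request described above can only land on the small heap 2.
combined = pick_type(DEVICE_LOCAL | HOST_VISIBLE)
# Staging pattern: a CPU-writable type for uploads, a GPU-only type for the buffer.
staging = pick_type(HOST_VISIBLE | HOST_COHERENT, avoid=DEVICE_LOCAL)
device = pick_type(DEVICE_LOCAL, avoid=HOST_VISIBLE)
print(combined, staging, device)  # prints: 5 3 1
```

With this split, the upload buffer would come from heap 1 (CPU RAM) and the final buffer from heap 0 (the 6GiB of GPU RAM). Real code would additionally have to filter the candidates by the memoryTypeBits reported by vkGetBufferMemoryRequirements and record the copy with something like vkCmdCopyBuffer; both are left out here.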

Please excuse if I just mostly restated obvious facts, I just wanted to present the bug in the order I understood it. Cheers.

emidoots commented

2024-08-25 20:37:22 +00:00

(Migrated from github.com)

thanks for the detailed write-up!

❤️ 1

ronald-mz commented

2024-10-21 17:41:00 +00:00

(Migrated from github.com)

@CoffeeImpliesCode

I think there are still some changes to be made around finding the best allocator, but would you mind checking whether you still run out of memory after https://github.com/hexops/mach/pull/1289?

CoffeeImpliesCode commented

2024-10-25 16:03:20 +00:00

(Migrated from github.com)

@RonaldZielaznicki nope, getting the same issue on the main branch with the latest nominated Zig version. (Also, I'm hitting #1275 on GNOME, so I can only test this on a wlroots WM, but that should not be related.)

ronald-mz commented

2024-10-31 19:44:55 +00:00

(Migrated from github.com)

@CoffeeImpliesCode Yeah, using Wayland (mutter) on GNOME has a problem with decorations right now. We've added a proper error for that until libdecor gets introduced.

And for some reason, I'd thought you were getting this with triangle as well. But your first sentence counters that.

But, had a thought that might be worth exploring. Does your computer have multiple devices? For instance, integrated graphics on the CPU as well as a dedicated card?

There's another bit of logic that checks the current performance mode of a computer and selects a device based off of that.

CoffeeImpliesCode commented

2024-11-05 00:03:38 +00:00

(Migrated from github.com)

> But, had a thought that might be worth exploring. Does your computer have multiple devices? For instance, integrated graphics on the CPU as well as a dedicated card?

No, there is only one "VGA compatible controller" on lspci, which is the NVIDIA GPU.

> There's another bit of logic that checks the current performance mode of a computer and selects a device based off of that.

I usually run on the "performance" power profile as set with powerprofilesctl, so I don't think this is a throttling issue.

As I said, the current pattern (as of when I last looked at the memory management code 3 months ago) of requesting memory that is both CPU-accessible and GPU-accessible in the buffer memory selection goes against the spirit of modern graphics APIs: they want to make that transfer explicit (as it is one of the primary bottlenecks). So I wouldn't say that the NVIDIA GPU driver is lying, and other vendors probably allow this behavior for compat reasons, but the memory management really should be handled correctly.