Acceleration Structure Conversion #790
Merged
Note that the pointer/build param encoding shouldn't be on the CPU side, but don't touch anything for now. Also fix a typo, change the SRange to a std::span, and add a default SPIR-V optimizer if none is provided to the asset converter.
…their storage. Change more stuff to span in `ICPUBottomLevelAccelerationStructure`. Use a semantically better typedef/alias in `ILogicalDevice::createBottomLevelAccelerationStructure`.
Comment on lines +1103 to +1104
| // finally the contents | ||
| //TODO: hasher << lookup.asset->getContentHash(); |
Member
Author
note to self: need to make the ICPUBottomLevelAccelerationStructure an IPreHashed
Comment on lines +2392 to +2512
| using mem_prop_f = IDeviceMemoryAllocation::E_MEMORY_PROPERTY_FLAGS; | ||
| const auto deviceBuildMemoryTypes = device->getPhysicalDevice()->getMemoryTypeBitsFromMemoryTypeFlags(mem_prop_f::EMPF_DEVICE_LOCAL_BIT); | ||
| const auto hostBuildMemoryTypes = device->getPhysicalDevice()->getMemoryTypeBitsFromMemoryTypeFlags(mem_prop_f::EMPF_DEVICE_LOCAL_BIT|mem_prop_f::EMPF_HOST_WRITABLE_BIT|mem_prop_f::EMPF_HOST_CACHED_BIT); | ||
|
|
||
| constexpr bool IsTLAS = std::is_same_v<AssetType,ICPUTopLevelAccelerationStructure>; | ||
| accelerationStructureParams[IsTLAS].resize(gpuObjects.size()); | ||
| for (auto& entry : conversionRequests) | ||
| for (auto i=0ull; i<entry.second.copyCount; i++) | ||
| { | ||
| const auto* as = entry.second.canonicalAsset; | ||
| const auto& patch = dfsCache.nodes[entry.second.patchIndex.value].patch; | ||
| const bool motionBlur = as->usesMotion(); | ||
| // we will need to temporarily store the build input buffers somewhere | ||
| size_t inputSize = 0; | ||
| ILogicalDevice::AccelerationStructureBuildSizes sizes = {}; | ||
| { | ||
| const auto buildFlags = patch.getBuildFlags(as); | ||
| if constexpr (IsTLAS) | ||
| { | ||
| AssetVisitor<GetDependantVisit<ICPUTopLevelAccelerationStructure>> visitor = { | ||
| {visitBase}, | ||
| {asset,uniqueCopyGroupID}, | ||
| patch | ||
| }; | ||
| if (!visitor()) | ||
| continue; | ||
| const auto instanceCount = as->getInstances().size(); | ||
| sizes = device->getAccelerationStructureBuildSizes(patch.hostBuild,buildFlags,motionBlur,instanceCount); | ||
| inputSize = (motionBlur ? sizeof(IGPUTopLevelAccelerationStructure::DevicePolymorphicInstance):sizeof(IGPUTopLevelAccelerationStructure::DeviceStaticInstance))*instanceCount; | ||
| } | ||
| else | ||
| { | ||
| const uint32_t* pMaxPrimitiveCounts = as->getGeometryPrimitiveCounts().data(); | ||
| // the code here is not pretty, but DRY-ing this is for later | ||
| if (buildFlags.hasFlags(ICPUBottomLevelAccelerationStructure::BUILD_FLAGS::GEOMETRY_TYPE_IS_AABB_BIT)) | ||
| { | ||
| const auto geoms = as->getAABBGeometries(); | ||
| if (patch.hostBuild) | ||
| { | ||
| const std::span<const IGPUBottomLevelAccelerationStructure::Triangles<const IGPUBuffer>> cpuGeoms = { | ||
| reinterpret_cast<const IGPUBottomLevelAccelerationStructure::Triangles<const IGPUBuffer>*>(geoms.data()),geoms.size() | ||
| }; | ||
| sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,cpuGeoms,pMaxPrimitiveCounts); | ||
| } | ||
| else | ||
| { | ||
| const std::span<const IGPUBottomLevelAccelerationStructure::Triangles<const ICPUBuffer>> cpuGeoms = { | ||
| reinterpret_cast<const IGPUBottomLevelAccelerationStructure::Triangles<const ICPUBuffer>*>(geoms.data()),geoms.size() | ||
| }; | ||
| sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,cpuGeoms,pMaxPrimitiveCounts); | ||
| // TODO: check if the strides need to be aligned to 4 bytes for AABBs | ||
| for (const auto& geom : geoms) | ||
| if (const auto aabbCount=*(pMaxPrimitiveCounts++); aabbCount) | ||
| inputSize = core::roundUp(inputSize,sizeof(float))+aabbCount*geom.stride; | ||
| } | ||
| } | ||
| else | ||
| { | ||
| core::map<uint32_t,size_t> allocationsPerStride; | ||
| const auto geoms = as->getTriangleGeometries(); | ||
| if (patch.hostBuild) | ||
| { | ||
| const std::span<const IGPUBottomLevelAccelerationStructure::Triangles<const IGPUBuffer>> cpuGeoms = { | ||
| reinterpret_cast<const IGPUBottomLevelAccelerationStructure::Triangles<const IGPUBuffer>*>(geoms.data()),geoms.size() | ||
| }; | ||
| sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,cpuGeoms,pMaxPrimitiveCounts); | ||
| } | ||
| else | ||
| { | ||
| const std::span<const IGPUBottomLevelAccelerationStructure::Triangles<const ICPUBuffer>> cpuGeoms = { | ||
| reinterpret_cast<const IGPUBottomLevelAccelerationStructure::Triangles<const ICPUBuffer>*>(geoms.data()),geoms.size() | ||
| }; | ||
| sizes = device->getAccelerationStructureBuildSizes(buildFlags,motionBlur,cpuGeoms,pMaxPrimitiveCounts); | ||
| // TODO: check if the strides need to be aligned to 4 bytes for AABBs | ||
| for (const auto& geom : geoms) | ||
| if (const auto triCount=*(pMaxPrimitiveCounts++); triCount) | ||
| { | ||
| switch (geom.indexType) | ||
| { | ||
| case E_INDEX_TYPE::EIT_16BIT: | ||
| allocationsPerStride[sizeof(uint16_t)] += triCount*3; | ||
| break; | ||
| case E_INDEX_TYPE::EIT_32BIT: | ||
| allocationsPerStride[sizeof(uint32_t)] += triCount*3; | ||
| break; | ||
| default: | ||
| break; | ||
| } | ||
| size_t bytesPerVertex = geom.vertexStride; | ||
| if (geom.vertexData[1]) | ||
| bytesPerVertex += bytesPerVertex; | ||
| allocationsPerStride[geom.vertexStride] += geom.maxVertex; | ||
| } | ||
| } | ||
| for (const auto& entry : allocationsPerStride) | ||
| inputSize = core::roundUp<size_t>(inputSize,entry.first)+entry.first*entry.second; | ||
| } | ||
| } | ||
| } | ||
| if (!sizes) | ||
| continue; | ||
| // this is where it gets a bit weird, we need to create a buffer to back the acceleration structure | ||
| IGPUBuffer::SCreationParams params = {}; | ||
| constexpr size_t MinASBufferAlignment = 256u; | ||
| params.size = core::roundUp(sizes.accelerationStructureSize,MinASBufferAlignment); | ||
| params.usage = IGPUBuffer::E_USAGE_FLAGS::EUF_ACCELERATION_STRUCTURE_STORAGE_BIT|IGPUBuffer::E_USAGE_FLAGS::EUF_SHADER_DEVICE_ADDRESS_BIT; | ||
| // concurrent ownership if any | ||
| const auto outIx = i+entry.second.firstCopyIx; | ||
| const auto uniqueCopyGroupID = gpuObjUniqueCopyGroupIDs[outIx]; | ||
| const auto queueFamilies = inputs.getSharedOwnershipQueueFamilies(uniqueCopyGroupID,as,patch); | ||
| params.queueFamilyIndexCount = queueFamilies.size(); | ||
| params.queueFamilyIndices = queueFamilies.data(); | ||
| // we need to save the buffer in a side-channel for later | ||
| auto& out = accelerationStructureParams[IsTLAS][baseOffset+entry.second.firstCopyIx+i]; | ||
| out = { | ||
| .storage = device->createBuffer(std::move(params)), | ||
| .scratchSize = sizes.buildScratchSize, | ||
| .motionBlur = motionBlur, | ||
| .compactAfterBuild = patch.compactAfterBuild, | ||
| .inputSize = inputSize | ||
| }; |
Member
Author
this needs some love from me
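For readers following the size bookkeeping in the hunk above, here is a minimal sketch of how the build-input size accumulates when blocks of different strides are packed one after another. `roundUpTo` and `packedInputSize` are hypothetical names standing in for `core::roundUp` and the loop over `allocationsPerStride`; this is not the converter's actual code, only the arithmetic under the assumption that each block must start at a multiple of its own stride.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical stand-in for core::roundUp:
// rounds `value` up to the next multiple of `alignment`.
inline std::size_t roundUpTo(std::size_t value, std::size_t alignment)
{
    assert(alignment != 0);
    return ((value + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: keys are strides in bytes, values are element counts.
// Each stride's block starts at an offset that is a multiple of that stride,
// so padding may be added between consecutive blocks.
inline std::size_t packedInputSize(const std::map<std::uint32_t, std::size_t>& elementCountPerStride)
{
    std::size_t inputSize = 0;
    for (const auto& entry : elementCountPerStride)
        inputSize = roundUpTo(inputSize, entry.first) + entry.first * entry.second;
    return inputSize;
}
```

The same rounding helper covers the storage-buffer case in the hunk, where the acceleration structure size is rounded up to `MinASBufferAlignment` (256 bytes) before the buffer is created.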
Comment on lines +2953 to +2957
| // This gets deferred till AFTER the Buffer Memory Allocations and Binding for Acceleration Structures | ||
| if constexpr (!std::is_same_v<AssetType,ICPUBottomLevelAccelerationStructure> && !std::is_same_v<AssetType,ICPUTopLevelAccelerationStructure>) | ||
| dfsCache.for_each([&](const instance_t<AssetType>& instance, dfs_cache<AssetType>::created_t& created)->void | ||
| { | ||
| auto& stagingCache = std::get<SReserveResult::staging_cache_t<AssetType>>(retval.m_stagingCaches); |
Member
Author
need to pack up the lambda and defer it
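One way to "pack up the lambda and defer it" can be sketched with a small queue of type-erased callables. `DeferredWork` is a hypothetical class, not something from the PR; the sketch only assumes the deferred work takes no arguments and must run later in submission order (here, after buffer memory has been allocated and bound).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical container for work that has to wait until a later stage.
class DeferredWork
{
    public:
        // Store the callable; nothing runs yet.
        void defer(std::function<void()> task)
        {
            assert(task); // reject empty std::function objects
            m_tasks.push_back(std::move(task));
        }

        // Run everything queued so far in submission order, return how many tasks ran.
        std::size_t flush()
        {
            // Take ownership first so a task that defers more work
            // does not invalidate the iteration.
            std::vector<std::function<void()>> tasks = std::move(m_tasks);
            m_tasks.clear();
            std::size_t ran = 0;
            for (auto& task : tasks)
            {
                task();
                ++ran;
            }
            return ran;
        }

        bool empty() const { return m_tasks.empty(); }

    private:
        std::vector<std::function<void()>> m_tasks;
};
```

Note that the lambda in the hunk captures by reference (`[&]`), so deferring it this way is only safe if everything it captures outlives the call to `flush()`.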
Comment on lines +3251 to +3304
| // Deal with Deferred Creation of Acceleration structures | ||
| { | ||
| for (auto asLevel=0; asLevel<2; asLevel++) | ||
| { | ||
| // each of these stages must have a barrier in between | ||
| size_t scratchSizeFullParallelBuild = 0; | ||
| size_t scratchSizeFullParallelCompact = 0; | ||
| // we collect the stats AFTER making sure that the BLAS / TLAS can actually be created | ||
| for (const auto& deferredParams : accelerationStructureParams[asLevel]) | ||
| { | ||
| // buffer failed to create/allocate | ||
| if (!deferredParams.storage.get()) | ||
| continue; | ||
| IGPUAccelerationStructure::SCreationParams baseParams; | ||
| { | ||
| auto* buf = deferredParams.storage.get(); | ||
| const auto bufSz = buf->getSize(); | ||
| using create_f = IGPUAccelerationStructure::SCreationParams::FLAGS; | ||
| baseParams = { | ||
| .bufferRange = {.offset=0,.size=bufSz,.buffer=smart_refctd_ptr<IGPUBuffer>(buf)}, | ||
| .flags = deferredParams.motionBlur ? create_f::MOTION_BIT:create_f::NONE | ||
| }; | ||
| } | ||
| smart_refctd_ptr<IGPUAccelerationStructure> as; | ||
| if (asLevel) | ||
| { | ||
| as = device->createBottomLevelAccelerationStructure({baseParams,deferredParams.maxInstanceCount}); | ||
| } | ||
| else | ||
| { | ||
| as = device->createTopLevelAccelerationStructure({baseParams,deferredParams.maxInstanceCount}); | ||
| } | ||
| // note that in order to compact an AS you need to allocate a buffer range whose size is known only after the build | ||
| const auto buildSize = deferredParams.inputSize+deferredParams.scratchSize; | ||
| // sizes for building 1-by-1 vs parallel, note that | ||
| retval.m_minASBuildScratchSize = core::max(buildSize,retval.m_minASBuildScratchSize); | ||
| scratchSizeFullParallelBuild += buildSize; | ||
| if (deferredParams.compactAfterBuild) | ||
| scratchSizeFullParallelCompact += deferredParams.scratchSize; | ||
| // triangles, AABBs or Instance Transforms will need to be supplied from VRAM | ||
| // TODO: also mark somehow that we'll need a BUILD INPUT READ ONLY BUFFER WITH XFER usage | ||
| if (deferredParams.inputSize) | ||
| retval.m_queueFlags |= IQueue::FAMILY_FLAGS::TRANSFER_BIT; | ||
| } | ||
| // | ||
| retval.m_maxASBuildScratchSize = core::max(core::max(scratchSizeFullParallelBuild,scratchSizeFullParallelCompact),retval.m_maxASBuildScratchSize); | ||
| } | ||
| // | ||
| if (retval.m_minASBuildScratchSize) | ||
| { | ||
| retval.m_queueFlags |= IQueue::FAMILY_FLAGS::COMPUTE_BIT; | ||
| retval.m_maxASBuildScratchSize = core::max(core::max(scratchSizeFullParallelBLASBuild,scratchSizeFullParallelBLASCompact),core::max(scratchSizeFullParallelTLASBuild,scratchSizeFullParallelTLASCompact)); | ||
| } | ||
| } |
Member
Author
needs some love from me
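The two scratch bounds tracked in the hunk above can be illustrated in isolation. The sketch below assumes one level (all BLASes or all TLASes) and uses hypothetical types `DeferredBuild` and `ScratchBounds` that mirror only the fields the diff reads. The minimum is what building one structure at a time needs; the maximum is what building the whole level in parallel needs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical mirror of the per-structure fields used in the diff.
struct DeferredBuild
{
    std::size_t inputSize;     // bytes of build input staged in device memory
    std::size_t scratchSize;   // bytes of build scratch reported by the device
    bool compactAfterBuild;    // whether a compaction pass follows the build
};

struct ScratchBounds
{
    std::size_t minSize = 0; // enough to build the structures one by one
    std::size_t maxSize = 0; // enough to build the whole level at once
};

inline ScratchBounds computeScratchBounds(const std::vector<DeferredBuild>& level)
{
    ScratchBounds out;
    std::size_t parallelBuild = 0;
    std::size_t parallelCompact = 0;
    for (const auto& build : level)
    {
        const std::size_t buildSize = build.inputSize + build.scratchSize;
        // serial case: only the largest single build has to fit
        out.minSize = std::max(out.minSize, buildSize);
        // parallel case: every build needs its own region
        parallelBuild += buildSize;
        if (build.compactAfterBuild)
            parallelCompact += build.scratchSize;
    }
    // build and compaction are separate stages, so the larger of the two decides
    out.maxSize = std::max(parallelBuild, parallelCompact);
    return out;
}
```

A caller covering both levels would take the maximum of each bound across the BLAS and TLAS results, since the levels are built in separate stages.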
…ice and Host build requests separately
Also update comments about what ends up in `m_gpuObjects`
Comment on lines +75 to +82
| #ifndef _NBL_DEBUG | ||
| if (!params.optimizer) | ||
| { | ||
| using pass_e = asset::ISPIRVOptimizer::E_OPTIMIZER_PASS; | ||
| // shall we do others? | ||
| params.optimizer = core::make_smart_refctd_ptr<asset::ISPIRVOptimizer>({EOP_STRIP_DEBUG_INFO}); | ||
| } | ||
| #endif |
Member
Author
@kevyuu we should move this to your ISPIRVDebloater (or trimmer as I'd like to call it) and make it an option to not run the SPIR-V optimizer multiple times for no reason
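The fallback in the hunk above (install a default optimizer only when the caller supplied none, and only outside debug builds) can be sketched with stand-in types. Everything here is hypothetical: `Pass`, `Optimizer`, `ConverterParams` and `applyDefaultOptimizer` are illustration names, `std::shared_ptr` stands in for the engine's ref-counted pointer, and a runtime `debugBuild` flag replaces the `#ifndef _NBL_DEBUG` guard so the behaviour is easy to exercise.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the optimizer pass enum and optimizer object.
enum class Pass { StripDebugInfo };

struct Optimizer
{
    explicit Optimizer(std::vector<Pass> passList) : passes(std::move(passList)) {}
    std::vector<Pass> passes;
};

struct ConverterParams
{
    std::shared_ptr<Optimizer> optimizer; // null means the caller did not provide one
};

// Install a strip-debug-info optimizer only if none was given and this is not a debug build.
inline void applyDefaultOptimizer(ConverterParams& params, bool debugBuild)
{
    if (debugBuild || params.optimizer)
        return;
    params.optimizer = std::make_shared<Optimizer>(std::vector<Pass>{Pass::StripDebugInfo});
}
```

Keeping the check in one function like this would also give a single place to add the opt-out the comment asks for, so the optimizer is not run more than once.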
Description
Conversion of ICPU BLAS and TLAS to IGPU including building.
We may need to support a list of IGPUBLAS in IGPUTLAS for sanity/lifetime coupling, but only if update/rebuild is not allowed or something (need to make a separate issue out of it because I have no clue how that's gonna be structured).
Testing
TODO list: