8. Shaders
A shader specifies programmable operations that execute for each vertex, control point, tessellated vertex, primitive, fragment, or workgroup in the corresponding stage(s) of the graphics and compute pipelines.
Graphics pipelines include vertex shader execution as a result of primitive assembly, followed, if enabled, by tessellation control and evaluation shaders operating on patches, geometry shaders, if enabled, operating on primitives, and fragment shaders, if present, operating on fragments generated by rasterization. In this specification, vertex, tessellation control, tessellation evaluation, and geometry shaders are collectively referred to as vertex processing stages and occur in the logical pipeline before rasterization. The fragment shader occurs logically after rasterization.
Only the compute shader stage is included in a compute pipeline. Compute shaders operate on compute invocations in a workgroup.
Shaders can read from input variables, and read from and write to output variables. Input and output variables can be used to transfer data between shader stages, or to allow the shader to interact with values that exist in the execution environment. Similarly, the execution environment provides constants that describe capabilities.
Shader variables are associated with execution environment-provided inputs and outputs using built-in decorations in the shader. The available decorations for each stage are documented in the following subsections.
8.1. Shader Modules
Shader modules contain shader code and one or more entry points. Shaders are selected from a shader module by specifying an entry point as part of pipeline creation. The stages of a pipeline can use shaders that come from different modules. The shader code defining a shader module must be in the SPIR-V format, as described by the Vulkan Environment for SPIR-V appendix.
Shader modules are represented by VkShaderModule handles:
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkShaderModule)
To create a shader module, call:
VkResult vkCreateShaderModule(
VkDevice device,
const VkShaderModuleCreateInfo* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkShaderModule* pShaderModule);
-
device is the logical device that creates the shader module.
-
pCreateInfo is a pointer to an instance of the VkShaderModuleCreateInfo structure.
-
pAllocator controls host memory allocation as described in the Memory Allocation chapter.
-
pShaderModule points to a VkShaderModule handle in which the resulting shader module object is returned.
Once a shader module has been created, any entry points it contains can be used in pipeline shader stages as described in Compute Pipelines and Graphics Pipelines.
If the shader stage fails to compile, VK_ERROR_INVALID_SHADER_NV will
be generated, and the compile log will be reported back to the application by
VK_EXT_debug_report if that extension is enabled.
The VkShaderModuleCreateInfo structure is defined as:
typedef struct VkShaderModuleCreateInfo {
VkStructureType sType;
const void* pNext;
VkShaderModuleCreateFlags flags;
size_t codeSize;
const uint32_t* pCode;
} VkShaderModuleCreateInfo;
-
sType is the type of this structure.
-
pNext is NULL or a pointer to an extension-specific structure.
-
flags is reserved for future use.
-
codeSize is the size, in bytes, of the code pointed to by pCode.
-
pCode points to code that is used to create the shader module. The type and format of the code is determined from the content of the memory addressed by pCode.
typedef VkFlags VkShaderModuleCreateFlags;
VkShaderModuleCreateFlags is a bitmask type for setting a mask, but is
currently reserved for future use.
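As a hedged illustration of the codeSize and pCode members, the following sketch performs basic sanity checks on a SPIR-V blob before it would be passed to vkCreateShaderModule. The helper name spirv_blob_looks_valid is hypothetical and is not part of the Vulkan API. The sketch assumes that codeSize is a non-zero multiple of 4, that a module header is five 32-bit words, and that the first word is the SPIR-V magic number 0x07230203 in host byte order. It does not replace full SPIR-V validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SPIR-V magic number: the first word of every SPIR-V module. */
#define SPIRV_MAGIC 0x07230203u

/* Hypothetical helper, not part of the Vulkan API.
 * Returns 1 if the blob passes basic checks on pCode and codeSize,
 * and 0 otherwise. */
static int spirv_blob_looks_valid(const uint32_t *pCode, size_t codeSize)
{
    if (pCode == NULL)
        return 0;
    /* codeSize is expressed in bytes and must cover whole 32-bit words. */
    if (codeSize == 0 || codeSize % sizeof(uint32_t) != 0)
        return 0;
    /* The module header (magic, version, generator, bound, schema)
     * occupies five words. */
    if (codeSize < 5 * sizeof(uint32_t))
        return 0;
    return pCode[0] == SPIRV_MAGIC;
}
```

An application could run such a check on data loaded from disk and skip shader module creation when it fails.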
To use a VkValidationCacheEXT to cache shader validation results, add
a VkShaderModuleValidationCacheCreateInfoEXT to the pNext chain
of the VkShaderModuleCreateInfo structure, specifying the cache object
to use.
The VkShaderModuleValidationCacheCreateInfoEXT struct is defined as:
typedef struct VkShaderModuleValidationCacheCreateInfoEXT {
VkStructureType sType;
const void* pNext;
VkValidationCacheEXT validationCache;
} VkShaderModuleValidationCacheCreateInfoEXT;
-
sType is the type of this structure.
-
pNext is NULL or a pointer to an extension-specific structure.
-
validationCache is the validation cache object from which the results of prior validation attempts will be read, and to which new validation results for this VkShaderModule will be written (if not already present).
To destroy a shader module, call:
void vkDestroyShaderModule(
VkDevice device,
VkShaderModule shaderModule,
const VkAllocationCallbacks* pAllocator);
-
device is the logical device that destroys the shader module.
-
shaderModule is the handle of the shader module to destroy.
-
pAllocator controls host memory allocation as described in the Memory Allocation chapter.
A shader module can be destroyed while pipelines created using its shaders are still in use.
8.2. Shader Execution
At each stage of the pipeline, multiple invocations of a shader may execute simultaneously. Further, invocations of a single shader produced as the result of different commands may execute simultaneously. The relative execution order of invocations of the same shader type is undefined. Shader invocations may complete in a different order than that in which the primitives they originated from were drawn or dispatched by the application. However, fragment shader outputs are written to attachments in rasterization order.
The relative execution order of invocations of different shader types is largely undefined. However, when invoking a shader whose inputs are generated from a previous pipeline stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate input values for all required inputs.
8.3. Shader Memory Access Ordering
The order in which image or buffer memory is read or written by shaders is largely undefined. For some shader types (vertex, tessellation evaluation, and in some cases, fragment), even the number of shader invocations that may perform loads and stores is undefined.
In particular, the following rules apply:
-
Vertex and tessellation evaluation shaders will be invoked at least once for each unique vertex, as defined in those sections.
-
Fragment shaders will be invoked zero or more times, as defined in that section.
-
The relative execution order of invocations of the same shader type is undefined. A store issued by a shader when working on primitive B might complete prior to a store for primitive A, even if primitive A is specified prior to primitive B. This applies even to fragment shaders; while fragment shader outputs are always written to the framebuffer in rasterization order, stores executed by fragment shader invocations are not.
-
The relative execution order of invocations of different shader types is largely undefined.
Note
The above limitations on shader invocation order make some forms of synchronization between shader invocations within a single set of primitives unimplementable. For example, having one invocation poll memory written by another invocation assumes that the other invocation has been launched and will complete its writes in finite time.
The Memory Model appendix defines the terminology and rules for how to correctly communicate between shader invocations, such as when a write is Visible-To a read, and what constitutes a Data Race.
Applications must not cause a data race.
8.4. Shader Inputs and Outputs
Data is passed into and out of shaders using variables with input or output
storage class, respectively.
User-defined inputs and outputs are connected between stages by matching
their Location decorations.
Additionally, data can be provided by or communicated to special functions
provided by the execution environment using BuiltIn decorations.
In many cases, the same BuiltIn decoration can be used in multiple
shader stages with similar meaning.
The specific behavior of variables decorated as BuiltIn is documented
in the following sections.
8.5. Task Shaders
Task shaders operate in conjunction with the mesh shaders to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. Their primary purpose is to create a variable number of subsequent mesh shader invocations.
Task shaders are invoked via the execution of the programmable mesh shading pipeline.
The task shader has no fixed-function inputs other than variables identifying the specific workgroup and invocation. The only fixed output of the task shader is a task count, identifying the number of mesh shader workgroups to create. The task shader can write additional outputs to task memory, which can be read by all of the mesh shader workgroups it created.
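The following host-side C sketch models how a task count could be derived within one task shader workgroup. It is an illustration only: the workgroup size of 32, the per-meshlet visibility flags, and the function name task_count_for_group are assumptions made for this example. The survivors array stands in for data written to task memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed workgroup size for this illustration. */
#define TASK_GROUP_SIZE 32u

/* Hypothetical model of one task shader workgroup: invocation i tests
 * meshlet i. The return value plays the role of the task count, that is,
 * the number of mesh shader workgroups to create. The indices of the
 * surviving meshlets are written to survivors, which stands in for
 * task memory read by those mesh shader workgroups. */
static uint32_t task_count_for_group(const uint8_t visible[TASK_GROUP_SIZE],
                                     uint32_t survivors[TASK_GROUP_SIZE])
{
    uint32_t count = 0;
    assert(visible != NULL && survivors != NULL);
    for (uint32_t i = 0; i < TASK_GROUP_SIZE; ++i) {
        if (visible[i])
            survivors[count++] = i;
    }
    return count;
}
```

In this model, a workgroup in which no meshlet is visible yields a task count of zero and therefore creates no mesh shader workgroups.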
8.5.1. Task Shader Execution
Task workloads are formed from groups of work items called workgroups and
processed by the task shader in the current graphics pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Task shaders execute in global workgroups which are divided into a number
of local workgroups with a size that can be set by assigning a value to
the LocalSize execution mode or via an object decorated by the
WorkgroupSize decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
8.6. Mesh Shaders
Mesh shaders operate in workgroups to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. Each workgroup emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.
Mesh shaders are invoked via the execution of the programmable mesh shading pipeline.
The only inputs available to the mesh shader are variables identifying the specific workgroup and invocation and, if applicable, any outputs written to task memory by the task shader that spawned the mesh shader’s workgroup. The mesh shader can operate without a task shader as well.
The invocations of the mesh shader workgroup write an output mesh, comprising a set of primitives with per-primitive attributes, a set of vertices with per-vertex attributes, and an array of indices identifying the mesh vertices that belong to each primitive. The primitives of this mesh are then processed by subsequent graphics pipeline stages, where the outputs of the mesh shader form an interface with the fragment shader.
8.6.1. Mesh Shader Execution
Mesh workloads are formed from groups of work items called workgroups and
processed by the mesh shader in the current graphics pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Mesh shaders execute in global workgroups which are divided into a number
of local workgroups with a size that can be set by assigning a value to
the LocalSize execution mode or via an object decorated by the
WorkgroupSize decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
The global workgroups may be generated explicitly via the API, or implicitly through the task shader’s work creation mechanism.
8.7. Vertex Shaders
Each vertex shader invocation operates on one vertex and its associated vertex attribute data, and outputs one vertex and associated data. Graphics pipelines using primitive shading must include a vertex shader, and the vertex shader stage is always the first shader stage in the graphics pipeline.
8.7.1. Vertex Shader Execution
A vertex shader must be executed at least once for each vertex specified by a draw command. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view. During execution, the shader is presented with the index of the vertex and instance for which it has been invoked. Input variables declared in the vertex shader are filled by the implementation with the values of vertex attributes associated with the invocation being executed.
If the same vertex is specified multiple times in a draw command (e.g. by including the same index value multiple times in an index buffer) the implementation may reuse the results of vertex shading if it can statically determine that the vertex shader invocations will produce identical results.
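To make the effect of reuse concrete, the following sketch counts the distinct index values in an index buffer. Under the assumptions of a single instance, a single view, and an implementation that reuses every repeated vertex, this count is the smallest number of vertex shader invocations the draw could produce. The function name count_unique_indices is hypothetical, and the quadratic scan is chosen for brevity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the distinct values in an index buffer. */
static size_t count_unique_indices(const uint32_t *indices, size_t count)
{
    size_t unique = 0;
    assert(indices != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        int seen = 0;
        /* Look for an earlier occurrence of the same index value. */
        for (size_t j = 0; j < i; ++j) {
            if (indices[j] == indices[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            ++unique;
    }
    return unique;
}
```

For two triangles that share an edge, indexed as 0, 1, 2, 2, 1, 3, the draw specifies six vertices but only four distinct ones, so an implementation may execute the vertex shader anywhere from four times upward.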
Note
It is implementation-dependent when and if results of vertex shading are
reused, and thus how many times the vertex shader will be executed.
This is true also if the vertex shader contains stores or atomic operations
(see
8.8. Tessellation Control Shaders
The tessellation control shader is used to read an input patch provided by
the application and to produce an output patch.
Each tessellation control shader invocation operates on an input patch
(after all control points in the patch are processed by a vertex shader) and
its associated data, and outputs a single control point of the output patch
and its associated data, and can also output additional per-patch data.
The input patch is sized according to the patchControlPoints member of
VkPipelineTessellationStateCreateInfo, as part of input assembly.
The size of the output patch is controlled by the OpExecutionMode
OutputVertices specified in the tessellation control or tessellation
evaluation shaders, which must be specified in at least one of the shaders.
The size of the input and output patches must each be greater than zero and
less than or equal to
VkPhysicalDeviceLimits::maxTessellationPatchSize.
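The size constraint above can be sketched as a simple predicate. The function name patch_size_is_valid is hypothetical. The second parameter is assumed to hold the value an application has read from VkPhysicalDeviceLimits::maxTessellationPatchSize.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if a patch size (input or output) is
 * greater than zero and does not exceed the device limit. */
static int patch_size_is_valid(uint32_t patchSize,
                               uint32_t maxTessellationPatchSize)
{
    return patchSize > 0 && patchSize <= maxTessellationPatchSize;
}
```

The same check applies independently to the input patch size and to the output patch size.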
8.8.1. Tessellation Control Shader Execution
A tessellation control shader is invoked at least once for each output vertex in a patch. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.
Inputs to the tessellation control shader are generated by the vertex
shader.
Each invocation of the tessellation control shader can read the attributes
of any incoming vertices and their associated data.
The invocations corresponding to a given patch execute logically in
parallel, with undefined relative execution order.
However, the OpControlBarrier instruction can be used to provide
limited control of the execution order by synchronizing invocations within a
patch, effectively dividing tessellation control shader execution into a set
of phases.
Tessellation control shaders will read undefined values if one invocation
reads a per-vertex or per-patch attribute written by another invocation at
any point during the same phase, or if two invocations attempt to write
different values to the same per-patch output in a single phase.
8.9. Tessellation Evaluation Shaders
The tessellation evaluation shader operates on an input patch of control points and their associated data, and a single input barycentric coordinate indicating the invocation’s relative position within the subdivided patch, and outputs a single vertex and its associated data.
8.9.1. Tessellation Evaluation Shader Execution
A tessellation evaluation shader is invoked at least once for each unique vertex generated by the tessellator. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.
8.10. Geometry Shaders
The geometry shader operates on a group of vertices and their associated data assembled from a single input primitive, and emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.
8.10.1. Geometry Shader Execution
A geometry shader is invoked at least once for each primitive produced by the tessellation stages, or at least once for each primitive generated by primitive assembly when tessellation is not in use. A shader can request that the geometry shader runs multiple instances. A geometry shader is invoked at least once for each instance. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.
8.11. Fragment Shaders
Fragment shaders are invoked as the result of rasterization in a graphics pipeline. Each fragment shader invocation operates on a single fragment and its associated data. With few exceptions, fragment shaders do not have access to any data associated with other fragments and are considered to execute in isolation of fragment shader invocations associated with other fragments.
8.11.1. Fragment Shader Execution
For each fragment generated by rasterization, a fragment shader may be invoked. A fragment shader must not be invoked if the Early Per-Fragment Tests cause it to have no coverage. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.
Furthermore, if it is determined that a fragment generated as the result of rasterizing a first primitive will have its outputs entirely overwritten by a fragment generated as the result of rasterizing a second primitive in the same subpass, and the fragment shader used for the fragment has no other side effects, then the fragment shader may not be executed for the fragment from the first primitive.
Relative ordering of execution of different fragment shader invocations is not defined.
For each fragment generated by a primitive, the number of times the fragment shader is invoked is implementation-dependent, but must obey the following constraints:
-
Each covered sample is included in a single fragment shader invocation.
-
When sample shading is not enabled, there is at least one fragment shader invocation.
-
When sample shading is enabled, the minimum number of fragment shader invocations is as defined in Shading Rate Image and Sample Shading.
When there is more than one fragment shader invocation per fragment, the association of samples to invocations is implementation-dependent.
In addition to the conditions outlined above for the invocation of a fragment shader, a fragment shader invocation may be produced as a helper invocation. A helper invocation is a fragment shader invocation that is created solely for the purposes of evaluating derivatives for use in non-helper fragment shader invocations. Stores and atomics performed by helper invocations must not have any effect on memory, and values returned by atomic instructions in helper invocations are undefined.
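As an illustration of where helper invocations come from, the sketch below assumes an implementation that shades fragments in 2x2 quads and launches a helper invocation for every uncovered pixel of a partially covered quad. This is a common strategy but is an assumption of the example, not a requirement stated here. The function name helper_invocations_in_quad is hypothetical.

```c
#include <assert.h>

/* Hypothetical model: bits 0..3 of coverage mark which pixels of a 2x2
 * quad are covered by the primitive. Returns the number of helper
 * invocations needed so that all four quad positions have a value
 * available for derivative evaluation. */
static unsigned helper_invocations_in_quad(unsigned coverage)
{
    unsigned covered = 0;
    coverage &= 0xFu;
    for (unsigned bit = 0; bit < 4; ++bit)
        covered += (coverage >> bit) & 1u;
    /* A quad with no covered pixel produces no invocations at all. */
    return covered == 0 ? 0u : 4u - covered;
}
```

Under this model, a primitive that touches a single pixel of a quad results in one non-helper invocation and three helper invocations.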
If the render pass has a fragment density map attachment, more than one
fragment shader invocation may be invoked for each covered sample.
Stores and atomics performed by these additional invocations have the normal
effect.
Such additional invocations are only produced if
VkPhysicalDeviceFragmentDensityMapPropertiesEXT::fragmentDensityInvocations
is VK_TRUE.
Note
Implementations may generate these additional fragment shader invocations in order to make transitions between fragment areas with different fragment densities smoother.
8.11.2. Early Fragment Tests
An explicit control is provided to allow fragment shaders to enable early
fragment tests.
If the fragment shader specifies the EarlyFragmentTests
OpExecutionMode, the per-fragment tests described in
Early Fragment Test Mode are performed prior to
fragment shader execution.
Otherwise, they are performed after fragment shader execution.
If the fragment shader additionally specifies the PostDepthCoverage
OpExecutionMode, the value of a variable decorated with the
SampleMask built-in
reflects the coverage after the early fragment tests.
Otherwise, it reflects the coverage before the early fragment tests.
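The ordering difference can be modeled on the host as follows. The sketch assumes a less-than depth test with depth writes enabled and a fragment shader that performs exactly one store. PixelState and process_fragment are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pixel state for this model. */
typedef struct {
    float depth;      /* value currently held in the depth attachment */
    unsigned stores;  /* stores performed by fragment shader invocations */
} PixelState;

/* Processes one fragment. With earlyFragmentTests set, the depth test
 * runs before the shader, so a failing fragment never executes it.
 * Otherwise the shader runs first and its store takes effect even if
 * the fragment later fails the depth test. */
static void process_fragment(PixelState *px, float fragDepth,
                             int earlyFragmentTests)
{
    int passes;
    assert(px != NULL);
    passes = fragDepth < px->depth;
    if (earlyFragmentTests && !passes)
        return;               /* rejected before shader execution */
    px->stores += 1;          /* fragment shader executes */
    if (passes)
        px->depth = fragDepth;
}
```

In this model, an occluded fragment leaves the store counter unchanged only when the early tests are enabled.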
8.11.3. Fragment Shader Interlock
In normal operation, it is possible for more than one fragment shader invocation to be executed simultaneously for the same pixel if there are overlapping primitives. If the fragmentShaderSampleInterlock, fragmentShaderPixelInterlock, or fragmentShaderShadingRateInterlock features are enabled, it is possible to define a critical section within the fragment shader that is guaranteed to not run simultaneously with another fragment shader invocation for the same sample(s) or pixel(s). It is also possible to control the relative ordering of execution of these critical sections across different fragment shader invocations.
If the FragmentShaderSampleInterlockEXT, FragmentShaderPixelInterlockEXT,
or FragmentShaderShadingRateInterlockEXT capabilities are declared in
the fragment shader, the OpBeginInvocationInterlockEXT and
OpEndInvocationInterlockEXT instructions must be used to delimit a
critical section of fragment shader code.
To ensure each invocation of the critical section is executed in
primitive order, declare one of the
PixelInterlockOrderedEXT, SampleInterlockOrderedEXT, or
ShadingRateInterlockOrderedEXT execution modes.
If the order of execution of each invocation of the critical section does
not matter, declare one of the PixelInterlockUnorderedEXT,
SampleInterlockUnorderedEXT, or ShadingRateInterlockUnorderedEXT
execution modes.
The PixelInterlockOrderedEXT and PixelInterlockUnorderedEXT
execution modes provide mutual exclusion in the critical section for any
pair of fragments corresponding to the same pixel, or pixels if the fragment
covers more than one pixel.
With sample shading enabled, these execution modes are treated like
SampleInterlockOrderedEXT or SampleInterlockUnorderedEXT
respectively.
The SampleInterlockOrderedEXT and SampleInterlockUnorderedEXT
execution modes only provide mutual exclusion for pairs of fragments that
both cover at least one common sample in the same pixel; these are
recommended for performance if shaders use per-sample data structures.
If these execution modes are used in single-sample mode they are treated
like PixelInterlockOrderedEXT or PixelInterlockUnorderedEXT
respectively.
The ShadingRateInterlockOrderedEXT and
ShadingRateInterlockUnorderedEXT execution modes provide mutual
exclusion for pairs of fragments that both have at least one common sample
in the same pixel, even if none of the common samples are covered by both
fragments.
With sample shading enabled, these execution modes are treated like
SampleInterlockOrderedEXT or SampleInterlockUnorderedEXT
respectively.
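The difference between the pixel and sample interlock modes can be sketched as a predicate over two fragments. The sketch assumes multisampling without sample shading, and that each fragment covers exactly one pixel. It leaves the shading rate modes out. The types and the function name critical_sections_exclude are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two families of execution modes modeled here. */
typedef enum { INTERLOCK_PIXEL, INTERLOCK_SAMPLE } InterlockMode;

/* Hypothetical fragment description: pixel position and coverage mask. */
typedef struct {
    int x, y;
    uint32_t coverage;  /* bit i set if sample i is covered */
} Fragment;

/* Returns 1 if the critical sections of the two fragments are mutually
 * exclusive under the given mode, and 0 if they may run simultaneously. */
static int critical_sections_exclude(InterlockMode mode, Fragment a, Fragment b)
{
    if (a.x != b.x || a.y != b.y)
        return 0;                          /* different pixels */
    if (mode == INTERLOCK_PIXEL)
        return 1;                          /* same pixel is sufficient */
    return (a.coverage & b.coverage) != 0; /* need a common covered sample */
}
```

Two fragments in the same pixel with disjoint coverage masks exclude each other under the pixel modes but not under the sample modes, which is why the sample modes can perform better with per-sample data structures.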
8.12. Compute Shaders
Compute shaders are invoked via vkCmdDispatch and vkCmdDispatchIndirect commands. In general, they have access to similar resources as shader stages executing as part of a graphics pipeline.
Compute workloads are formed from groups of work items called workgroups and
processed by the compute shader in the current compute pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Compute shaders execute in global workgroups which are divided into a
number of local workgroups with a size that can be set by assigning a
value to the LocalSize execution mode or via an object decorated by the
WorkgroupSize decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
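The relationship between a problem size, the local workgroup size, and the number of local workgroups to dispatch can be sketched as follows for one dimension. Both function names are hypothetical. The second function assumes the usual relationship in which a global invocation index is the workgroup index multiplied by the local size, plus the local invocation index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of local workgroups needed to cover items
 * work items when each workgroup contains localSize invocations
 * (rounded up, written to avoid overflow in the addition). */
static uint32_t group_count(uint32_t items, uint32_t localSize)
{
    assert(localSize > 0);
    return items / localSize + (items % localSize != 0 ? 1u : 0u);
}

/* Hypothetical helper: one-dimensional global invocation index. */
static uint32_t global_invocation_id(uint32_t workgroupId, uint32_t localSize,
                                     uint32_t localId)
{
    return workgroupId * localSize + localId;
}
```

For 1000 work items and a local size of 64, group_count returns 16, which launches 1024 invocations. The 24 surplus invocations in the last workgroup would need a bounds check in the shader.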
8.13. Interpolation Decorations
Interpolation decorations control the behavior of attribute interpolation in
the fragment shader stage.
Interpolation decorations can be applied to Input storage class
variables in the fragment shader stage’s interface, and control the
interpolation behavior of those variables.
Inputs that could be interpolated can be decorated by at most one of the following decorations: Flat (no interpolation) or NoPerspective (linear interpolation, for lines and polygons).
Fragment input variables decorated with neither Flat nor
NoPerspective use perspective-correct interpolation (for
lines and
polygons).
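The difference between the two interpolation behaviors can be sketched for a line segment with endpoints a and b. In this sketch, fa and fb are the attribute values at the endpoints, wa and wb are the clip-space w coordinates of the endpoints, and t is the position along the segment in the range 0 to 1. The function names are hypothetical, and the formulas are written from the standard definition of perspective-correct interpolation rather than taken from this chapter.

```c
#include <assert.h>

/* Hypothetical helper: perspective-correct interpolation along a line.
 * Attribute values are divided by w before blending, and the result is
 * renormalized by the blended 1/w. */
static double interp_perspective(double fa, double fb,
                                 double wa, double wb, double t)
{
    double num, den;
    assert(wa != 0.0 && wb != 0.0);
    num = (1.0 - t) * fa / wa + t * fb / wb;
    den = (1.0 - t) / wa + t / wb;
    return num / den;
}

/* Hypothetical helper: linear interpolation, as used for inputs
 * decorated with NoPerspective. */
static double interp_noperspective(double fa, double fb, double t)
{
    return (1.0 - t) * fa + t * fb;
}
```

With fa = 0, fb = 1, wa = 1, and wb = 3, the midpoint t = 0.5 yields 0.25 with perspective correction and 0.5 without it. When wa equals wb, both functions agree.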
The presence of and type of interpolation is controlled by the above
interpolation decorations as well as the auxiliary decorations Centroid
and Sample.
A variable decorated with Flat will not be interpolated.
Instead, it will have the same value for every fragment within a triangle.
This value will come from a single provoking
vertex.
A variable decorated with Flat can also be decorated with
Centroid or Sample, which will mean the same thing as decorating
it only as Flat.
For fragment shader input variables decorated with neither Centroid nor
Sample, the assigned variable may be interpolated anywhere within the
fragment and a single value may be assigned to each sample within the
fragment.
If a fragment shader input is decorated with Centroid, a single value
may be assigned to that variable for all samples in the fragment, but that
value must be interpolated to a location that lies in both the fragment and
in the primitive being rendered, including any of the fragment’s samples
covered by the primitive.
Because the location at which the variable is interpolated may be different
in neighboring fragments, and derivatives may be computed by computing
differences between neighboring fragments, derivatives of centroid-sampled
inputs may be less accurate than those for non-centroid interpolated
variables.
If
VkPipelineViewportShadingRateImageStateCreateInfoNV::shadingRateImageEnable
is enabled, implementations may estimate derivatives using differencing
without dividing by the distance between adjacent sample locations when the
fragment size is larger than one pixel.
The PostDepthCoverage
execution mode does not affect the determination of the centroid location.
If a fragment shader input is decorated with Sample, a separate value
must be assigned to that variable for each covered sample in the fragment,
and that value must be sampled at the location of the individual sample.
When rasterizationSamples is VK_SAMPLE_COUNT_1_BIT, the fragment
center must be used for Centroid, Sample, and undecorated
attribute interpolation.
Fragment shader inputs that are signed or unsigned integers, integer
vectors, or any double-precision floating-point type must be decorated with
Flat.
When the VK_AMD_shader_explicit_vertex_parameter device extension is
enabled, inputs can also be decorated with the CustomInterpAMD
interpolation decoration, including fragment shader inputs that are signed
or unsigned integers, integer vectors, or any double-precision
floating-point type.
Inputs decorated with CustomInterpAMD can only be accessed by the
extended instruction InterpolateAtVertexAMD, which allows accessing the
value of the input for individual vertices of the primitive.
When the fragmentShaderBarycentric feature is enabled, inputs can
also be decorated with the PerVertexNV interpolation decoration, including
fragment shader inputs that are signed or unsigned integers, integer
vectors, or any double-precision floating-point type.
Inputs decorated with PerVertexNV can only be accessed using an extra
array dimension, where the extra index identifies one of the vertices of the
primitive that produced the fragment.
8.14. Ray Generation Shaders
A ray generation shader is similar to a compute shader.
Its main purpose is to execute ray tracing queries using OpTraceNV
instructions and process the results.
8.14.1. Ray Generation Shader Execution
One ray generation shader is executed per ray tracing dispatch.
Its location in the shader binding table (see Shader
Binding Table for details) is passed directly into vkCmdTraceRaysNV
using the raygenShaderBindingTableBuffer and
raygenShaderBindingOffset parameters.
8.15. Intersection Shaders
Intersection shaders enable the implementation of arbitrary, application-defined geometric primitives. An intersection shader for a primitive is executed whenever its axis-aligned bounding box is hit by a ray.
A built-in intersection shader for triangle primitives is used
automatically whenever geometry of type VK_GEOMETRY_TYPE_TRIANGLES_NV
is specified.
Like other ray tracing shader domains, an intersection shader operates on a
single ray at a time.
It also operates on a single primitive at a time.
It is therefore the purpose of an intersection shader to compute the
ray-primitive intersections and report them.
To report an intersection, the shader calls the OpReportIntersectionNV
instruction.
An intersection shader communicates with any-hit and closest hit shaders by generating attribute values that they can read. Intersection shaders cannot read or modify the ray payload.
8.15.1. Intersection Shader Execution
The order in which intersections are found along a ray, and therefore the order in which intersection shaders are executed, is unspecified.
The intersection shader of the closest AABB which intersects the ray is guaranteed to be executed at some point during traversal, unless the ray is forcibly terminated.
8.16. Any-Hit Shaders
The any-hit shader is executed after the intersection shader reports an
intersection that lies within the current [tmin,tmax] of the ray.
The main use of any-hit shaders is to programmatically decide whether or not
an intersection will be accepted.
The intersection will be accepted unless the shader calls the
OpIgnoreIntersectionNV instruction.
8.16.1. Any-Hit Shader Execution
The order in which intersections are found along a ray, and therefore the order in which any-hit shaders are executed, is unspecified.
The any-hit shader of the closest hit is guaranteed to be executed at some point during traversal, unless the ray is forcibly terminated.
8.17. Closest Hit Shaders
Closest hit shaders have read-only access to the attributes generated by the
corresponding intersection shader, and can read or modify the ray payload.
They also have access to a number of system-generated values.
Closest hit shaders can call OpTraceNV to recursively trace rays.
8.17.1. Closest Hit Shader Execution
Exactly one closest hit shader is executed when traversal is finished and an intersection has been found and accepted.
8.18. Miss Shaders
Miss shaders can access the ray payload and can trace new rays through the
OpTraceNV instruction, but cannot access attributes since they are not
associated with an intersection.
8.18.1. Miss Shader Execution
A miss shader is executed instead of a closest hit shader if no intersection was found during traversal.
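The interaction between intersection, any-hit, closest hit, and miss shaders can be modeled on the host as follows. The sketch is an assumption-laden illustration: Candidate and resolve_ray are hypothetical names, each candidate stands for one intersection reported by an intersection shader, and the ignored flag stands for the decision an any-hit shader would make. Accepting a hit is modeled as shrinking the upper bound of the ray interval.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one reported intersection. */
typedef struct {
    double t;     /* distance along the ray */
    int ignored;  /* nonzero if the any-hit shader ignores the intersection */
} Candidate;

/* Returns the index of the candidate for which the closest hit shader
 * would run, or -1 if the miss shader would run instead. Candidates may
 * be supplied in any order. */
static int resolve_ray(const Candidate *c, size_t n, double tmin, double tmax)
{
    int closest = -1;
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (c[i].t < tmin || c[i].t > tmax)
            continue;        /* outside the current [tmin,tmax] interval */
        if (c[i].ignored)
            continue;        /* rejected by the any-hit shader */
        tmax = c[i].t;       /* accepted: later hits must be at least as close */
        closest = (int)i;
    }
    return closest;
}
```

Because each accepted hit narrows the interval, the selected candidate does not depend on the order in which candidates are visited, provided that no two accepted candidates share the same distance. This is consistent with the traversal order being unspecified.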
8.19. Callable Shaders
Callable shaders can access a callable payload that works similarly to ray payloads to do subroutine work.
8.19.1. Callable Shader Execution
A callable shader is executed by calling OpExecuteCallableNV from an
allowed shader stage.
8.20. Static Use
A SPIR-V module declares a global object in memory using the OpVariable
instruction, which results in a pointer x to that object.
A specific entry point in a SPIR-V module is said to statically use that
object if that entry point’s call tree contains a function that contains a
memory instruction or image instruction with x as an id operand.
See the “Memory Instructions” and “Image Instructions” subsections of
section 3 “Binary Form” of the SPIR-V specification for the complete list
of SPIR-V memory instructions.
Static use is not used to control the behavior of variables with Input
and Output storage.
The effects of those variables are applied based only on whether they are
present in a shader entry point’s interface.
8.21. Invocation and Derivative Groups
An invocation group (see the subsection “Control Flow” of section 2 of
the SPIR-V specification) for a compute shader is the set of invocations in
a single local workgroup.
For graphics shaders, an invocation group is an implementation-dependent
subset of the set of shader invocations of a given shader stage which are
produced by a single drawing command.
For indirect drawing commands with drawCount greater than one,
invocations from separate draws are in distinct invocation groups.
Note
Because the partitioning of invocations into invocation groups is implementation-dependent and not observable, applications generally need to assume the worst case of all invocations in a draw belonging to a single invocation group.
A derivative group (see the subsection “Control Flow” of section 2 of
the SPIR-V 1.00 Revision 4 specification) is a set of invocations which are
used together to compute a derivative.
For a fragment shader, a derivative group is generated by a single primitive
(point, line, or triangle) and includes any helper invocations needed to
compute derivatives.
If the subgroupSize field of VkPhysicalDeviceSubgroupProperties
is at least 4, a derivative group for a fragment shader corresponds to a
single subgroup quad.
Otherwise, a derivative group is the set of invocations generated by a
single primitive.
A derivative group for a compute shader is a single local workgroup.
Derivative values are undefined for a sampled image instruction if the instruction is in flow control that is not uniform across the derivative group.
8.22. Subgroups
A subgroup (see the subsection “Control Flow” of section 2 of the SPIR-V 1.3 Revision 1 specification) is a set of invocations that can synchronize and share data with each other efficiently. An invocation group is partitioned into one or more subgroups.
Subgroup operations are divided into various categories as described in VkSubgroupFeatureFlagBits.
8.22.1. Basic Subgroup Operations
The basic subgroup operations allow two classes of functionality within
shaders: elect and barrier.
Invocations within a subgroup can choose a single invocation to perform
some task for the subgroup as a whole using elect.
Invocations within a subgroup can perform a subgroup barrier to ensure the
ordering of execution or memory accesses within a subgroup.
Barriers can be performed on buffer memory accesses, WorkgroupLocal
memory accesses, and image memory accesses to ensure that any results
written are visible by other invocations within the subgroup.
An OpControlBarrier can also be used to perform a full execution
control barrier.
A full execution control barrier will ensure that each active invocation
within the subgroup reaches a point of execution before any are allowed to
continue.
8.22.2. Vote Subgroup Operations
The vote subgroup operations allow invocations within a subgroup to compare values across a subgroup. The types of votes enabled are:
- Do all active subgroup invocations agree that an expression is true?
- Do any active subgroup invocations evaluate an expression to true?
- Do all active subgroup invocations have the same value of an expression?
Note: These operations are useful in combination with control flow because they let developers check whether conditions match across the subgroup and choose potentially faster code paths in these cases.
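The three vote questions can be modeled on the CPU as a rough illustration. This is only a sketch: the function names and the array-per-subgroup representation are assumptions made for the example, not part of SPIR-V or Vulkan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CPU model of one subgroup: element i belongs to invocation i, and
   active[i] says whether that invocation takes part in the vote.
   All names here are hypothetical. */

/* True if every active invocation evaluated the predicate to true. */
bool vote_all(const bool *pred, const bool *active, size_t n) {
    assert(pred != NULL && active != NULL);
    for (size_t i = 0; i < n; ++i)
        if (active[i] && !pred[i]) return false;
    return true;
}

/* True if at least one active invocation evaluated the predicate to true. */
bool vote_any(const bool *pred, const bool *active, size_t n) {
    assert(pred != NULL && active != NULL);
    for (size_t i = 0; i < n; ++i)
        if (active[i] && pred[i]) return true;
    return false;
}

/* True if all active invocations hold the same value. */
bool vote_all_equal(const uint32_t *value, const bool *active, size_t n) {
    assert(value != NULL && active != NULL);
    bool seen = false;
    uint32_t first = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!active[i]) continue;
        if (!seen) { first = value[i]; seen = true; }
        else if (value[i] != first) return false;
    }
    return true;
}
```

Inactive invocations are skipped in every loop, which mirrors the wording "active subgroup invocations" in the list above.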
8.22.3. Arithmetic Subgroup Operations
The arithmetic subgroup operations allow invocations to perform scan and reduction operations across a subgroup. For reduction operations, each invocation in a subgroup will obtain the same result of these arithmetic operations applied across the subgroup. For scan operations, each invocation in the subgroup will perform an inclusive or exclusive scan, cumulatively applying the operation across the invocations in a subgroup in an implementation-defined order. The operations supported are add, mul, min, max, and, or, xor.
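The difference between a reduction, an inclusive scan and an exclusive scan can be sketched on the CPU for the add operation. The function names are hypothetical, and the scan order used here (ascending invocation index) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduction: every invocation would receive this same total. */
uint32_t subgroup_reduce_add(const uint32_t *in, size_t n) {
    assert(in != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i) total += in[i];
    return total;
}

/* Inclusive scan: out[i] = in[0] + ... + in[i]. */
void subgroup_inclusive_add(const uint32_t *in, uint32_t *out, size_t n) {
    assert(in != NULL && out != NULL);
    uint32_t running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}

/* Exclusive scan: out[i] = in[0] + ... + in[i-1]; out[0] is the
   identity of the operation (0 for add). */
void subgroup_exclusive_add(const uint32_t *in, uint32_t *out, size_t n) {
    assert(in != NULL && out != NULL);
    uint32_t running = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = running;
        running += in[i];
    }
}
```

For the inputs 1, 2, 3, 4 the reduction is 10, the inclusive scan is 1, 3, 6, 10 and the exclusive scan is 0, 1, 3, 6.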
8.22.4. Ballot Subgroup Operations
The ballot subgroup operations allow invocations to perform more complex votes across the subgroup. The ballot functionality allows all invocations within a subgroup to provide a boolean value and get as a result what each invocation provided as their boolean value. The broadcast functionality allows values to be broadcast from an invocation to all other invocations within the subgroup, given that the invocation to be broadcast from is known at pipeline creation time.
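As an informal CPU model, a ballot can be viewed as packing one bit per invocation into a mask, and a broadcast as copying one invocation's value to all others. The names below are hypothetical, and the sketch assumes at most 32 invocations so that one 32-bit word holds the whole ballot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Ballot: bit i of the result is set if invocation i is active and
   provided true. Assumes n <= 32 (an assumption of this sketch). */
uint32_t subgroup_ballot(const bool *pred, const bool *active, size_t n) {
    assert(pred != NULL && active != NULL && n <= 32);
    uint32_t mask = 0;
    for (size_t i = 0; i < n; ++i)
        if (active[i] && pred[i]) mask |= (uint32_t)1 << i;
    return mask;
}

/* Broadcast: every invocation receives the value held by invocation src. */
void subgroup_broadcast(const uint32_t *in, uint32_t *out, size_t n, size_t src) {
    assert(in != NULL && out != NULL && src < n);
    for (size_t i = 0; i < n; ++i) out[i] = in[src];
}
```

Each invocation can then test individual bits of the mask to learn what any other invocation provided.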
8.22.5. Shuffle Subgroup Operations
The shuffle subgroup operations allow invocations to read values from other invocations within a subgroup.
8.22.6. Shuffle Relative Subgroup Operations
The shuffle relative subgroup operations allow invocations to read values from other invocations within the subgroup relative to the current invocation in the group. The relative operations supported allow data to be shifted up and down through the invocations within a subgroup.
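Shifting data up and down can be sketched on the CPU as reading from a neighbor at a fixed offset. The function names are hypothetical. When the source index falls outside the subgroup, this sketch keeps the invocation's own value; that fallback is an assumption made for the example, not behavior stated in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shuffle up: invocation i reads the value of invocation i - delta.
   in and out must not alias. */
void subgroup_shuffle_up(const uint32_t *in, uint32_t *out, size_t n, size_t delta) {
    assert(in != NULL && out != NULL && in != out);
    for (size_t i = 0; i < n; ++i)
        out[i] = (i >= delta) ? in[i - delta] : in[i];
}

/* Shuffle down: invocation i reads the value of invocation i + delta.
   The bound is written as delta < n - i to avoid overflow in i + delta. */
void subgroup_shuffle_down(const uint32_t *in, uint32_t *out, size_t n, size_t delta) {
    assert(in != NULL && out != NULL && in != out);
    for (size_t i = 0; i < n; ++i)
        out[i] = (delta < n - i) ? in[i + delta] : in[i];
}
```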
8.22.7. Clustered Subgroup Operations
The clustered subgroup operations allow invocations to perform an operation among partitions of a subgroup, such that the operation is only performed within the subgroup invocations within a partition. The partitions for clustered subgroup operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time. The operations supported are add, mul, min, max, and, or, xor.
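A clustered reduction can be modeled on the CPU as an independent reduction over each consecutive power-of-two block of invocations. The function name is hypothetical, and the sketch assumes the subgroup size is a multiple of the cluster size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clustered add: each invocation receives the sum over its own cluster
   of `cluster` consecutive invocations. */
void subgroup_clustered_add(const uint32_t *in, uint32_t *out, size_t n, size_t cluster) {
    assert(in != NULL && out != NULL);
    assert(cluster != 0 && (cluster & (cluster - 1)) == 0); /* power of two */
    assert(n % cluster == 0);
    for (size_t base = 0; base < n; base += cluster) {
        uint32_t total = 0;
        for (size_t i = 0; i < cluster; ++i) total += in[base + i];
        for (size_t i = 0; i < cluster; ++i) out[base + i] = total;
    }
}
```

With eight invocations holding 1 through 8 and a cluster size of 4, the first four invocations receive 10 and the last four receive 26.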
8.22.8. Quad Subgroup Operations
The quad subgroup operations allow clusters of 4 invocations (a quad) to
share data efficiently with each other.
For fragment shaders, if the subgroupSize field of
VkPhysicalDeviceSubgroupProperties is at least 4, each quad
corresponds to one of the groups of four shader invocations used for
derivatives.
For compute shaders using the DerivativeGroupQuadsNV or
DerivativeGroupLinearNV execution modes, each quad corresponds to one
of the groups of four shader invocations used for
derivatives.
The invocations in each quad are ordered to have attribute values of
$P_{i_0,j_0}$, $P_{i_1,j_0}$, $P_{i_0,j_1}$, and $P_{i_1,j_1}$, respectively.
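Given the ordering above, the position of an invocation within its quad encodes the i coordinate in bit 0 and the j coordinate in bit 1. Under that reading, a neighbor swap reduces to an XOR on the index, as the following CPU sketch shows. The enum and function names are hypothetical, and the enum values are XOR masks chosen for this example, not SPIR-V direction operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR masks applied to the position within a quad (hypothetical names). */
enum quad_dir {
    QUAD_HORIZONTAL = 1, /* flip i: 0 <-> 1, 2 <-> 3 */
    QUAD_VERTICAL   = 2, /* flip j: 0 <-> 2, 1 <-> 3 */
    QUAD_DIAGONAL   = 3  /* flip both: 0 <-> 3, 1 <-> 2 */
};

/* Each invocation reads the value of its neighbor in the given direction.
   n must be a multiple of 4 so that every quad is complete. */
void quad_swap(const uint32_t *in, uint32_t *out, size_t n, enum quad_dir dir) {
    assert(in != NULL && out != NULL && in != out);
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i ^ (size_t)dir];
}
```

Because the mask only touches the low two bits, the source index always stays inside the same quad.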
8.22.9. Partitioned Subgroup Operations
The partitioned subgroup operations allow a subgroup to partition its invocations into disjoint subsets and to perform scan and reduce operations among invocations belonging to the same subset. The partitions for partitioned subgroup operations are specified by a ballot operation and can be computed at runtime. The operations supported are add, mul, min, max, and, or, xor.
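A partitioned reduction can be sketched on the CPU by giving each invocation a ballot-style bitmask of the invocations in its subset. The function name is hypothetical, and the sketch assumes at most 32 invocations and that the masks describe disjoint subsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* partition[i] is a bitmask with bit j set when invocation j belongs to
   the same subset as invocation i. Each invocation receives the sum over
   its own subset. */
void subgroup_partitioned_add(const uint32_t *in, const uint32_t *partition,
                              uint32_t *out, size_t n) {
    assert(in != NULL && partition != NULL && out != NULL && n <= 32);
    for (size_t i = 0; i < n; ++i) {
        uint32_t total = 0;
        for (size_t j = 0; j < n; ++j)
            if ((partition[i] >> j) & 1u) total += in[j];
        out[i] = total;
    }
}
```

Unlike clusters, the subsets need not be contiguous: in the usage below, invocations 0 and 2 form one subset and invocations 1 and 3 form the other.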
8.23. Cooperative Matrices
A cooperative matrix type is a SPIR-V type where the storage for and computations performed on the matrix are spread across a set of invocations such as a subgroup. These types give the implementation freedom in how to optimize matrix multiplies.
SPIR-V defines the types and instructions, but does not specify rules about what sizes/combinations are valid, and it is expected that different implementations may support different sizes.
To enumerate the supported cooperative matrix types and operations, call:
VkResult vkGetPhysicalDeviceCooperativeMatrixPropertiesNV(
    VkPhysicalDevice physicalDevice,
    uint32_t* pPropertyCount,
    VkCooperativeMatrixPropertiesNV* pProperties);
- physicalDevice is the physical device.
- pPropertyCount is a pointer to an integer related to the number of cooperative matrix properties available or queried.
- pProperties is either NULL or a pointer to an array of VkCooperativeMatrixPropertiesNV structures.
If pProperties is NULL, then the number of cooperative matrix
properties available is returned in pPropertyCount.
Otherwise, pPropertyCount must point to a variable set by the user to
the number of elements in the pProperties array, and on return the
variable is overwritten with the number of structures actually written to
pProperties.
If the value referenced by pPropertyCount is less than the number of
cooperative matrix properties available, at most that many structures will
be written, and VK_INCOMPLETE will be returned instead of VK_SUCCESS to
indicate that not all the available cooperative matrix properties were
returned.
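The count-then-fetch contract described above can be sketched without the Vulkan headers by using a stand-in enumerator. Everything named mock_* below, along with fetch_all and the three sample entries, is hypothetical and exists only so the sketch compiles on its own; real code would call vkGetPhysicalDeviceCooperativeMatrixPropertiesNV with a VkPhysicalDevice instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef enum { MOCK_SUCCESS, MOCK_INCOMPLETE } mock_result;
typedef struct { uint32_t m, n, k; } mock_props; /* stand-in for the real structure */

static const mock_props available[] = { {16, 16, 16}, {16, 8, 8}, {8, 8, 4} };
#define AVAILABLE_COUNT 3u

/* Mimics the contract: NULL array returns the count; otherwise writes at
   most *count entries and reports how many were written. */
mock_result mock_enumerate(uint32_t *count, mock_props *props) {
    assert(count != NULL);
    if (props == NULL) {
        *count = AVAILABLE_COUNT;
        return MOCK_SUCCESS;
    }
    uint32_t written = *count < AVAILABLE_COUNT ? *count : AVAILABLE_COUNT;
    memcpy(props, available, written * sizeof *props);
    *count = written;
    return written < AVAILABLE_COUNT ? MOCK_INCOMPLETE : MOCK_SUCCESS;
}

/* Caller side: query the count, allocate, then fetch. Returns a malloc'd
   array that the caller frees, or NULL on failure. */
mock_props *fetch_all(uint32_t *out_count) {
    uint32_t count = 0;
    if (mock_enumerate(&count, NULL) != MOCK_SUCCESS) return NULL;
    mock_props *props = malloc(count * sizeof *props);
    if (props == NULL) return NULL;
    if (mock_enumerate(&count, props) != MOCK_SUCCESS) {
        free(props);
        return NULL;
    }
    *out_count = count;
    return props;
}
```

Passing an array that is too small is not an error in this model: the call fills what fits and signals the shortfall through the return code.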
Each VkCooperativeMatrixPropertiesNV structure describes a single
supported combination of types for a matrix multiply/add operation
(OpCooperativeMatrixMulAddNV).
The multiply can be described in terms of the following variables and types
(in SPIR-V pseudocode):
%A is of type OpTypeCooperativeMatrixNV %AType %scope %MSize %KSize
%B is of type OpTypeCooperativeMatrixNV %BType %scope %KSize %NSize
%C is of type OpTypeCooperativeMatrixNV %CType %scope %MSize %NSize
%D is of type OpTypeCooperativeMatrixNV %DType %scope %MSize %NSize
%D = %A * %B + %C // using OpCooperativeMatrixMulAddNV
A matrix multiply with these dimensions is known as an MxNxK matrix multiply.
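As a scalar reference for the dimensions involved, the same operation can be written as a plain loop nest. This is only a sketch: the function name, the row-major layout and the use of float for every matrix are assumptions of the example, and it says nothing about how an implementation distributes the work across invocations.

```c
#include <assert.h>
#include <stddef.h>

/* D = A * B + C with row-major storage.
   A is M x K, B is K x N, C and D are M x N. */
void matmul_add(size_t M, size_t N, size_t K,
                const float *A, const float *B, const float *C, float *D) {
    assert(A != NULL && B != NULL && C != NULL && D != NULL);
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = C[m * N + n];
            for (size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            D[m * N + n] = acc;
        }
    }
}
```

KSize is the shared inner dimension: it is the column count of A and the row count of B, and it does not appear in the shape of C or D.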
The VkCooperativeMatrixPropertiesNV structure is defined as:
typedef struct VkCooperativeMatrixPropertiesNV {
    VkStructureType sType;
    void* pNext;
    uint32_t MSize;
    uint32_t NSize;
    uint32_t KSize;
    VkComponentTypeNV AType;
    VkComponentTypeNV BType;
    VkComponentTypeNV CType;
    VkComponentTypeNV DType;
    VkScopeNV scope;
} VkCooperativeMatrixPropertiesNV;
- sType is the type of this structure.
- pNext is NULL or a pointer to an extension-specific structure.
- MSize is the number of rows in matrices A, C, and D.
- KSize is the number of columns in matrix A and rows in matrix B.
- NSize is the number of columns in matrices B, C, and D.
- AType is the component type of matrix A, of type VkComponentTypeNV.
- BType is the component type of matrix B, of type VkComponentTypeNV.
- CType is the component type of matrix C, of type VkComponentTypeNV.
- DType is the component type of matrix D, of type VkComponentTypeNV.
- scope is the scope of all the matrix types, of type VkScopeNV.
If some types are preferred over other types (e.g. for performance), they should appear earlier in the list enumerated by vkGetPhysicalDeviceCooperativeMatrixPropertiesNV.
At least one entry in the list must have power of two values for all of
MSize, KSize, and NSize.
Possible values for VkScopeNV include:
typedef enum VkScopeNV {
    VK_SCOPE_DEVICE_NV = 1,
    VK_SCOPE_WORKGROUP_NV = 2,
    VK_SCOPE_SUBGROUP_NV = 3,
    VK_SCOPE_QUEUE_FAMILY_NV = 5,
    VK_SCOPE_MAX_ENUM_NV = 0x7FFFFFFF
} VkScopeNV;
- VK_SCOPE_DEVICE_NV corresponds to SPIR-V Device scope.
- VK_SCOPE_WORKGROUP_NV corresponds to SPIR-V Workgroup scope.
- VK_SCOPE_SUBGROUP_NV corresponds to SPIR-V Subgroup scope.
- VK_SCOPE_QUEUE_FAMILY_NV corresponds to SPIR-V QueueFamilyKHR scope.
All enum values match the corresponding SPIR-V value.
Possible values for VkComponentTypeNV include:
typedef enum VkComponentTypeNV {
    VK_COMPONENT_TYPE_FLOAT16_NV = 0,
    VK_COMPONENT_TYPE_FLOAT32_NV = 1,
    VK_COMPONENT_TYPE_FLOAT64_NV = 2,
    VK_COMPONENT_TYPE_SINT8_NV = 3,
    VK_COMPONENT_TYPE_SINT16_NV = 4,
    VK_COMPONENT_TYPE_SINT32_NV = 5,
    VK_COMPONENT_TYPE_SINT64_NV = 6,
    VK_COMPONENT_TYPE_UINT8_NV = 7,
    VK_COMPONENT_TYPE_UINT16_NV = 8,
    VK_COMPONENT_TYPE_UINT32_NV = 9,
    VK_COMPONENT_TYPE_UINT64_NV = 10,
    VK_COMPONENT_TYPE_MAX_ENUM_NV = 0x7FFFFFFF
} VkComponentTypeNV;
- VK_COMPONENT_TYPE_FLOAT16_NV corresponds to SPIR-V OpTypeFloat 16.
- VK_COMPONENT_TYPE_FLOAT32_NV corresponds to SPIR-V OpTypeFloat 32.
- VK_COMPONENT_TYPE_FLOAT64_NV corresponds to SPIR-V OpTypeFloat 64.
- VK_COMPONENT_TYPE_SINT8_NV corresponds to SPIR-V OpTypeInt 8 1.
- VK_COMPONENT_TYPE_SINT16_NV corresponds to SPIR-V OpTypeInt 16 1.
- VK_COMPONENT_TYPE_SINT32_NV corresponds to SPIR-V OpTypeInt 32 1.
- VK_COMPONENT_TYPE_SINT64_NV corresponds to SPIR-V OpTypeInt 64 1.
- VK_COMPONENT_TYPE_UINT8_NV corresponds to SPIR-V OpTypeInt 8 0.
- VK_COMPONENT_TYPE_UINT16_NV corresponds to SPIR-V OpTypeInt 16 0.
- VK_COMPONENT_TYPE_UINT32_NV corresponds to SPIR-V OpTypeInt 32 0.
- VK_COMPONENT_TYPE_UINT64_NV corresponds to SPIR-V OpTypeInt 64 0.
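The mapping listed above is regular enough to decode arithmetically from the enum values. The sketch below does so without the Vulkan headers; component_info and describe_component are hypothetical names, and the function takes the raw numeric value of a VkComponentTypeNV as given in the enum definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned bits;  /* width operand of OpTypeFloat or OpTypeInt */
    bool is_float;  /* true for OpTypeFloat, false for OpTypeInt */
    bool is_signed; /* signedness operand of OpTypeInt; true for floats */
} component_info;

/* Decodes a component type value (0..10). Returns false for any other
   value, in which case *out is left untouched. */
bool describe_component(uint32_t type, component_info *out) {
    static const unsigned int_widths[4] = {8, 16, 32, 64};
    assert(out != NULL);
    if (type <= 2) {          /* FLOAT16, FLOAT32, FLOAT64 */
        out->bits = 16u << type;
        out->is_float = true;
        out->is_signed = true;
        return true;
    }
    if (type <= 6) {          /* SINT8 .. SINT64 */
        out->bits = int_widths[type - 3];
        out->is_float = false;
        out->is_signed = true;
        return true;
    }
    if (type <= 10) {         /* UINT8 .. UINT64 */
        out->bits = int_widths[type - 7];
        out->is_float = false;
        out->is_signed = false;
        return true;
    }
    return false;
}
```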
8.24. Validation Cache
Validation cache objects allow the result of internal validation to be reused, both within a single application run and between multiple runs. Reuse within a single run is achieved by passing the same validation cache object when creating supported Vulkan objects. Reuse across runs of an application is achieved by retrieving validation cache contents in one run of an application, saving the contents, and using them to preinitialize a validation cache on a subsequent run. The contents of the validation cache objects are managed by the validation layers. Applications can manage the host memory consumed by a validation cache object and control the amount of data retrieved from a validation cache object.
Validation cache objects are represented by VkValidationCacheEXT
handles:
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkValidationCacheEXT)
To create validation cache objects, call:
VkResult vkCreateValidationCacheEXT(
    VkDevice device,
    const VkValidationCacheCreateInfoEXT* pCreateInfo,
    const VkAllocationCallbacks* pAllocator,
    VkValidationCacheEXT* pValidationCache);
- device is the logical device that creates the validation cache object.
- pCreateInfo is a pointer to a VkValidationCacheCreateInfoEXT structure that contains the initial parameters for the validation cache object.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pValidationCache is a pointer to a VkValidationCacheEXT handle in which the resulting validation cache object is returned.
|
Note
Applications can track and manage the total host memory size of a
validation cache object using the |
Once created, a validation cache can be passed to the
vkCreateShaderModule command as part of the
VkShaderModuleCreateInfo pNext chain.
If a VkShaderModuleValidationCacheCreateInfoEXT object is part of the
VkShaderModuleCreateInfo::pNext chain, and its
validationCache field is not VK_NULL_HANDLE, the implementation
will query it for possible reuse opportunities and update it with new
content.
The use of the validation cache object in these commands is internally
synchronized, and the same validation cache object can be used in multiple
threads simultaneously.
|
Note
Implementations should make every effort to limit any critical sections to
the actual accesses to the cache, which is expected to be significantly
shorter than the duration of the |
The VkValidationCacheCreateInfoEXT structure is defined as:
typedef struct VkValidationCacheCreateInfoEXT {
    VkStructureType sType;
    const void* pNext;
    VkValidationCacheCreateFlagsEXT flags;
    size_t initialDataSize;
    const void* pInitialData;
} VkValidationCacheCreateInfoEXT;
- sType is the type of this structure.
- pNext is NULL or a pointer to an extension-specific structure.
- flags is reserved for future use.
- initialDataSize is the number of bytes in pInitialData. If initialDataSize is zero, the validation cache will initially be empty.
- pInitialData is a pointer to previously retrieved validation cache data. If the validation cache data is incompatible (as defined below) with the device, the validation cache will be initially empty. If initialDataSize is zero, pInitialData is ignored.
typedef VkFlags VkValidationCacheCreateFlagsEXT;
VkValidationCacheCreateFlagsEXT is a bitmask type for setting a mask,
but is currently reserved for future use.
Validation cache objects can be merged using the command:
VkResult vkMergeValidationCachesEXT(
    VkDevice device,
    VkValidationCacheEXT dstCache,
    uint32_t srcCacheCount,
    const VkValidationCacheEXT* pSrcCaches);
- device is the logical device that owns the validation cache objects.
- dstCache is the handle of the validation cache to merge results into.
- srcCacheCount is the length of the pSrcCaches array.
- pSrcCaches is an array of validation cache handles, which will be merged into dstCache. The previous contents of dstCache are included after the merge.
Note: The details of the merge operation are implementation dependent, but implementations should merge the contents of the specified validation caches and prune duplicate entries.
Data can be retrieved from a validation cache object using the command:
VkResult vkGetValidationCacheDataEXT(
    VkDevice device,
    VkValidationCacheEXT validationCache,
    size_t* pDataSize,
    void* pData);
- device is the logical device that owns the validation cache.
- validationCache is the validation cache to retrieve data from.
- pDataSize is a pointer to a value related to the amount of data in the validation cache, as described below.
- pData is either NULL or a pointer to a buffer.
If pData is NULL, then the maximum size of the data that can be
retrieved from the validation cache, in bytes, is returned in
pDataSize.
Otherwise, pDataSize must point to a variable set by the user to the
size of the buffer, in bytes, pointed to by pData, and on return the
variable is overwritten with the amount of data actually written to
pData.
If pDataSize is less than the maximum size that can be retrieved by
the validation cache, at most pDataSize bytes will be written to
pData, and vkGetValidationCacheDataEXT will return
VK_INCOMPLETE.
Any data written to pData is valid and can be provided as the
pInitialData member of the VkValidationCacheCreateInfoEXT
structure passed to vkCreateValidationCacheEXT.
Two calls to vkGetValidationCacheDataEXT with the same parameters
must retrieve the same data unless a command that modifies the contents of
the cache is called between them.
Applications can store the data retrieved from the validation cache, and
use these data, possibly in a future run of the application, to populate new
validation cache objects.
The results of validation, however, may depend on the vendor ID, device ID,
driver version, and other details of the device.
To enable applications to detect when previously retrieved data is
incompatible with the device, the initial bytes written to pData must
be a header consisting of the following members:
| Offset | Size | Meaning |
|---|---|---|
| 0 | 4 | length in bytes of the entire validation cache header written as a stream of bytes, with the least significant byte first |
| 4 | 4 | a VkValidationCacheHeaderVersionEXT value written as a stream of bytes, with the least significant byte first |
| 8 | | a layer commit ID expressed as a UUID, which uniquely identifies the version of the validation layers used to generate these validation results |
The first four bytes encode the length of the entire validation cache header, in bytes. This value includes all fields in the header including the validation cache version field and the size of the length field.
The next four bytes encode the validation cache version, as described for VkValidationCacheHeaderVersionEXT. A consumer of the validation cache should use the cache version to interpret the remainder of the cache header.
If pDataSize is less than what is necessary to store this header,
nothing will be written to pData and zero will be written to
pDataSize.
Possible values of the second group of four bytes in the header returned by vkGetValidationCacheDataEXT, encoding the validation cache version, are:
typedef enum VkValidationCacheHeaderVersionEXT {
    VK_VALIDATION_CACHE_HEADER_VERSION_ONE_EXT = 1,
    VK_VALIDATION_CACHE_HEADER_VERSION_MAX_ENUM_EXT = 0x7FFFFFFF
} VkValidationCacheHeaderVersionEXT;
- VK_VALIDATION_CACHE_HEADER_VERSION_ONE_EXT specifies version one of the validation cache.
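An application that stores cache data between runs might check the header before reusing a blob. The following is a minimal sketch of such a check, not a definitive implementation: the type and function names are hypothetical, and the 16-byte UUID length is an assumption of this sketch (it matches VK_UUID_SIZE in the Vulkan headers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_UUID_SIZE 16u /* assumption: same as VK_UUID_SIZE */

typedef struct {
    uint32_t length;                /* bytes 0..3, little-endian */
    uint32_t version;               /* bytes 4..7, little-endian */
    uint8_t uuid[CACHE_UUID_SIZE];  /* layer commit ID, from byte 8 */
} cache_header;

/* Decodes four bytes stored least significant byte first. */
static uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns true if data begins with a version-one header that fits
   inside the blob; fills *out (which may be partially written on failure). */
bool parse_cache_header(const uint8_t *data, size_t size, cache_header *out) {
    assert(data != NULL && out != NULL);
    if (size < 8 + CACHE_UUID_SIZE) return false;
    out->length = read_le32(data);
    out->version = read_le32(data + 4);
    if (out->version != 1) return false;                 /* only version one is defined */
    if (out->length < 8 + CACHE_UUID_SIZE) return false; /* header too short */
    if (out->length > size) return false;                /* header exceeds blob */
    memcpy(out->uuid, data + 8, CACHE_UUID_SIZE);
    return true;
}

/* True if the stored layer commit ID matches the one expected now. */
bool cache_is_compatible(const cache_header *h, const uint8_t *expected_uuid) {
    assert(h != NULL && expected_uuid != NULL);
    return memcmp(h->uuid, expected_uuid, CACHE_UUID_SIZE) == 0;
}
```

Decoding byte by byte keeps the check independent of host endianness, which matters because the header fields are defined as byte streams with the least significant byte first.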
To destroy a validation cache, call:
void vkDestroyValidationCacheEXT(
    VkDevice device,
    VkValidationCacheEXT validationCache,
    const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the validation cache object.
- validationCache is the handle of the validation cache to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.