Extend DeviceReduce::Sum with requirements API for opt-in reproducibility
Works only for input iterator value_type of float/double
Future work:
Support half/bfloat by upconverting to float/double
Support custom types containing floating-point values, using something similar to the decomposer approach we used for radix sort to decompose a custom type into something like tuple<float, double, ...>.
Extend BlockReduce with reproducible algorithm
Extend to Scan
For scan, we'd ideally like to fit the aggregate state in 128 bits. That would be tricky: for k=2 (from @maddyscientist's algorithm) the state alone already needs 128 bits. We could potentially reserve 2 of the bits in the aggregator type for the decoupled look-back signaling (see [FEA]: Intrusive Decoupled Look-Back #220).
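To make the custom-type item above more concrete, here is a minimal host-only sketch of what a decomposer could look like. It assumes a callable that exposes the floating-point members of a type as a tuple of references, in the spirit of the radix sort decomposer. The names `sample_t`, `sample_decomposer`, and `accumulate` are hypothetical and not part of CUB. The member-wise `+=` is only a placeholder for whatever reproducible accumulator would actually be used.

```cpp
#include <tuple>

// Hypothetical custom type mixing floating-point members.
struct sample_t {
    float weight;
    double energy;
};

// Hypothetical decomposer: exposes the members as a tuple of references,
// i.e. something like tuple<float&, double&>.
struct sample_decomposer {
    std::tuple<float&, double&> operator()(sample_t& s) const {
        return std::tie(s.weight, s.energy);
    }
};

// Member-wise accumulation through the decomposed view.
// Plain += is a stand-in; a reproducible implementation would
// substitute its own accumulator for each member.
void accumulate(sample_t& acc, sample_t x) {
    sample_decomposer decompose;
    auto a = decompose(acc);
    auto b = decompose(x);
    std::get<0>(a) += std::get<0>(b);
    std::get<1>(a) += std::get<1>(b);
}
```

The point of the sketch is that the reduction never needs to know the layout of `sample_t`; it only sees the tuple of floating-point references the decomposer returns.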
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
I would like reproducible reductions for floating-point values.
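As a small illustration of the underlying problem: floating-point addition is not associative, so the order in which a reduction combines its inputs can change the result. The host-only sketch below sums the same values in two different orders. The function names are hypothetical, and the reversed loop merely stands in for the different combination order a parallel reduction might pick from run to run.

```cpp
#include <vector>

// Sum the values left to right, as a sequential reduction would.
float sum_forward(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}

// Sum the same values right to left, standing in for a different
// combination order chosen by a parallel reduction.
float sum_backward(const std::vector<float>& v) {
    float acc = 0.0f;
    for (auto it = v.rbegin(); it != v.rend(); ++it) {
        acc += *it;
    }
    return acc;
}
```

For the input `{1.0f, 1e8f, -1e8f}` under IEEE 754 single precision (no fast-math flags), the forward sum absorbs the `1.0f` into `1e8f` and ends at zero, while the backward sum cancels the large values first and keeps the `1.0f`.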
Describe the solution you'd like
@maddyscientist has a proof-of-concept implementation here: https://github.com/maddyscientist/reproducible_floating_sums/tree/feature/cuda
MVP:
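Extend DeviceReduce::Sum with requirements API for opt-in reproducibility, working only for input iterator value_type of float/double.
Future work: half/bfloat support, custom types decomposed into something like tuple<float, double, ...>, BlockReduce, and Scan.
Describe alternatives you've considered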
No response
Additional context
No response
Tasks