Timeline for answer to Near constant time rotate that does not violate the standards by Peter Cordes
Current License: CC BY-SA 4.0
Post Revisions
24 events
| when toggle format | what | by | license | comment | |
|---|---|---|---|---|---|
| Jun 24, 2025 at 6:39 | history | edited | Peter Cordes | CC BY-SA 4.0 |
replace goo.gl shortlink
|
| May 23, 2017 at 12:33 | history | edited | URL Rewriter Bot |
replaced http://stackoverflow.com/ with https://stackoverflow.com/
|
|
| Oct 7, 2015 at 20:06 | vote | accept | jww | ||
| Aug 17, 2015 at 23:06 | comment | added | Peter Cordes | @jww: Yes, the accepted answer is a correct answer. But mine is IMO a better correct answer. Not all sufficient answers are equal. | |
| Aug 17, 2015 at 22:51 | comment | added | jww | Peter - I agree; I think you are right. I've had more time to research it, and this is the "portable rotate pattern". Intel has not sounded in yet (see Circular rotate that does not violate C/C++ standard? on the Intel forums). The problem I have is the accepted answer does answers the question I asked. Let me think about what to do.... | |
| Aug 17, 2015 at 20:49 | comment | added | Peter Cordes | @jww: I think this is a better answer than the one you accepted. It compiles to the same ideal code with a wider range of compilers. (i.e. gcc, not just clang.) I haven't tested MSVC with either. I've pointed a lot of other rotate questions at this answer. meta.stackoverflow.com/questions/302695/…, so it'd be nice if this was marked accepted. (unless it's terrible on MSVC or something, in which case it shouldn't be until we have a workaround.) | |
| Aug 17, 2015 at 20:13 | comment | added | Peter Cordes |
@supercat: Intel and AMD special-case RCR/L r, 1. SnB: 2 cycles, with any other count running in 8 cycles. If variable-latency operations weren't such a problem for the out-of-order machinery, RCL r, 2 could probably run in only one more cycle without more execution-unit complexity. This isn't the only case where Sandybridge has a higher-than-probably-needed latency for something, just to standardize possible latencies. As you say, special-casing other counts, or count-from-a-register, isn't going to be worth it.
|
|
| Aug 17, 2015 at 20:05 | comment | added | supercat | @PeterCordes: It's sufficiently rare for code to use RCR/RCL with any shift amount other than 1 that I don't think there's much need for such instructions to run in constant time for other shift values. Hardware to compute RCR with an arbitrary 8-bit shift count in constant time would be significantly more complicated than hardware which could support all five of the operations whose result is unaffected by carry-in, and I don't think it would be nearly useful enough to justify such cost. | |
| Aug 17, 2015 at 19:13 | history | edited | Peter Cordes | CC BY-SA 3.0 |
un-complicate by using `unsigned int` for the rotate count. assert message logic fix. update godbolt link with wikipedia link.
|
| Aug 17, 2015 at 17:33 | history | edited | Peter Cordes | CC BY-SA 3.0 |
link the community-wiki question
|
| Aug 17, 2015 at 16:33 | history | edited | Peter Cordes | CC BY-SA 3.0 |
clarify wording
|
| Jul 24, 2015 at 21:48 | comment | added | Peter Cordes | Having the standard say that sounds like a sensible way to get fairly nice behaviour, but without requiring anything that would take extra instructions. It's annoying how much stuff the C standard leaves undefined. Some of it is to avoid assuming two's complement, unless that's finally changed. (I think I read that adding atomics to the standards finally resulted in standardizing some two's complement behaviour, but I forget the details.) | |
| Jul 24, 2015 at 21:36 | comment | added | Peter Cordes | I'm going by Agner Fog's table of instruction timings. (agner.org/optimize) Modern CPUs have a big enough transistor budget that they can afford constant-time shifters. (e.g. en.wikipedia.org/wiki/Barrel_shifter). With pipelined execution units, it's actually a big problem to not know when the results will be ready. Handling cases where multiple results are ready in the same cycle from the same ALU takes extra transistors, and I think is usually done by buffering the results until the next cycle. (else you'd need 2x the register write ports). | |
| Jul 24, 2015 at 19:37 | comment | added | supercat | @PeterCordes: In the 80386, rcl/rcr were performed iteratively, so rotating an 8-bit register by 31 would take 30 more cycles than rotating by 1. Did you time shifts of different variable amounts? Otherwise, can you see any reason the C standard shouldn't say that a shift precisely equal to the bit size of the operand must yield either the original value or the result of shifting by size-1 and then shifting one more, with the choice made in arbitrary fashion? There are few platforms where such a rule would impose any cost, and I can think of none where the cost would be meaningful. | |
| Jul 23, 2015 at 17:11 | comment | added | Peter Cordes |
Heh, it looks like they had to maintain backwards compat with 286 for rcl/rcr. If operand size is 8 or 16, the count is masked and then mod 9 or mod 17. Otherwise (32 or 64bit) it's just masked. I didn't realize there were so many weird semantics. RCR r,cl is 3x slower than ROR r,cl on Haswell, 17 times on Steamroller. (Partly because ROR r,cl is 1 cycle latency/ 2/cycle throughput on AMD, but RCL is 17 macro-ops, 7 cycle latency.)
|
|
| Jul 23, 2015 at 16:28 | comment | added | supercat | ...and I suspect Intel wanted to avoid having an instruction take too long if N was 255. On the other hand, the behavior of shifting a 32-bit register right through carry by 32 or 33 bits would be different from shifting right by any smaller amount (including zero or one), and so masking yields totally wrong semantics. Interestingly, the ARM allows shift-by-N and 32-bit rotate-by-N, but only allows 32+1-bit rotate through carry to be done one bit at a time. | |
| Jul 23, 2015 at 16:20 | comment | added | supercat | @PeterCordes: On the 8086, the shift count is specified in an 8-bit register. Consequently, even if Intel had included the hardware to saturate shifts of length (regsize) up to 255, shifts that are slightly larger than multiples of 256 would have their lengths masked. That having been said, there would have been considerable value in having shifts with lengths up to 255 behave in saturating fashion; the value of extending that to longer shifts would be much less. BTW, the rotate-through-carry instructions were much slower on the 80836 than the other shifts (extra cycles proportional to N)... | |
| Jul 20, 2015 at 16:05 | comment | added | Peter Cordes | So you'd get zero from a shift count higher than register width? Intel's vector shifts do work this way. Masking (rather than saturating) the count makes sense for a rotate, but it does seem silly for shifts. | |
| Jul 20, 2015 at 15:49 | comment | added | supercat |
Although x86 presently defines their shift operators as masking the size, that was a behavioral change in the 80386 [IMHO not a good one]. On earlier processors, shifting was done mod 256 (since the shift amount was specified using a byte register); on processors like the ARM where shifts of 32-255 bits are equivalent to (but faster than) a series of one-bit shifts, such semantics can be useful in graphics programming (e.g. row1 |= dat >> y; row2 |= dat >> (y-32). In any case, on a vintage 8086 or 80286, the behavior of shift instructions uses the whole value of CL.
|
|
| Jul 18, 2015 at 23:15 | comment | added | jww | Thank you very much. I know John from his (and Peng Li's) Integer Overflow Checker (long before Clang had sanitizers). I used to use it extensively when self test would fail after compiling with Intel's ICC. ICC was ruthless about removing this sort of undefined behavior. | |
| Jul 18, 2015 at 7:43 | history | edited | Peter Cordes | CC BY-SA 3.0 |
Link to pabigot's more C++ified version
|
| Jul 18, 2015 at 7:26 | history | edited | Peter Cordes | CC BY-SA 3.0 |
test different type widths. n &= mask to avoid UD with NDEBUG.
|
| Jul 18, 2015 at 6:02 | history | edited | Peter Cordes | CC BY-SA 3.0 |
use CHAR_BIT. Tested 64b version, and ARM
|
| Jul 18, 2015 at 5:38 | history | answered | Peter Cordes | CC BY-SA 3.0 |