Timeline for answer to Near constant time rotate that does not violate the standards by Peter Cordes

Current License: CC BY-SA 4.0

Post Revisions

24 events

when toggle format	what		by	license	comment
Jun 24, 2025 at 6:39	history	edited	Peter Cordes	CC BY-SA 4.0	replace goo.gl shortlink
May 23, 2017 at 12:33	history	edited	URL Rewriter Bot		replaced http://stackoverflow.com/ with https://stackoverflow.com/
Oct 7, 2015 at 20:06	vote	accept	jww
Aug 17, 2015 at 23:06	comment	added	Peter Cordes		@jww: Yes, the accepted answer is a correct answer. But mine is IMO a better correct answer. Not all sufficient answers are equal.
Aug 17, 2015 at 22:51	comment	added	jww		Peter - I agree; I think you are right. I've had more time to research it, and this is the "portable rotate pattern". Intel has not sounded in yet (see Circular rotate that does not violate C/C++ standard? on the Intel forums). The problem I have is the accepted answer does answers the question I asked. Let me think about what to do....
Aug 17, 2015 at 20:49	comment	added	Peter Cordes		@jww: I think this is a better answer than the one you accepted. It compiles to the same ideal code with a wider range of compilers. (i.e. gcc, not just clang.) I haven't tested MSVC with either. I've pointed a lot of other rotate questions at this answer. meta.stackoverflow.com/questions/302695/…, so it'd be nice if this was marked accepted. (unless it's terrible on MSVC or something, in which case it shouldn't be until we have a workaround.)
Aug 17, 2015 at 20:13	comment	added	Peter Cordes		@supercat: Intel and AMD special-case `RCR/L r, 1`. SnB: 2 cycles, with any other count running in 8 cycles. If variable-latency operations weren't such a problem for the out-of-order machinery, `RCL r, 2` could probably run in only one more cycle without more execution-unit complexity. This isn't the only case where Sandybridge has a higher-than-probably-needed latency for something, just to standardize possible latencies. As you say, special-casing other counts, or count-from-a-register, isn't going to be worth it.
Aug 17, 2015 at 20:05	comment	added	supercat		@PeterCordes: It's sufficiently rare for code to use RCR/RCL with any shift amount other than 1 that I don't think there's much need for such instructions to run in constant time for other shift values. Hardware to compute RCR with an arbitrary 8-bit shift count in constant time would be significantly more complicated than hardware which could support all five of the operations whose result is unaffected by carry-in, and I don't think it would be nearly useful enough to justify such cost.
Aug 17, 2015 at 19:13	history	edited	Peter Cordes	CC BY-SA 3.0	un-complicate by using `unsigned int` for the rotate count. assert message logic fix. update godbolt link with wikipedia link.
Aug 17, 2015 at 17:33	history	edited	Peter Cordes	CC BY-SA 3.0	link the community-wiki question
Aug 17, 2015 at 16:33	history	edited	Peter Cordes	CC BY-SA 3.0	clarify wording
Jul 24, 2015 at 21:48	comment	added	Peter Cordes		Having the standard say that sounds like a sensible way to get fairly nice behaviour, but without requiring anything that would take extra instructions. It's annoying how much stuff the C standard leaves undefined. Some of it is to avoid assuming two's complement, unless that's finally changed. (I think I read that adding atomics to the standards finally resulted in standardizing some two's complement behaviour, but I forget the details.)
Jul 24, 2015 at 21:36	comment	added	Peter Cordes		I'm going by Agner Fog's table of instruction timings. (agner.org/optimize) Modern CPUs have a big enough transistor budget that they can afford constant-time shifters. (e.g. en.wikipedia.org/wiki/Barrel_shifter). With pipelined execution units, it's actually a big problem to not know when the results will be ready. Handling cases where multiple results are ready in the same cycle from the same ALU takes extra transistors, and I think is usually done by buffering the results until the next cycle. (else you'd need 2x the register write ports).
Jul 24, 2015 at 19:37	comment	added	supercat		@PeterCordes: In the 80386, rcl/rcr were performed iteratively, so rotating an 8-bit register by 31 would take 30 more cycles than rotating by 1. Did you time shifts of different variable amounts? Otherwise, can you see any reason the C standard shouldn't say that a shift precisely equal to the bit size of the operand must yield either the original value or the result of shifting by size-1 and then shifting one more, with the choice made in arbitrary fashion? There are few platforms where such a rule would impose any cost, and I can think of none where the cost would be meaningful.
Jul 23, 2015 at 17:11	comment	added	Peter Cordes		Heh, it looks like they had to maintain backwards compat with 286 for `rcl/rcr`. If operand size is 8 or 16, the count is masked and then mod 9 or mod 17. Otherwise (32 or 64bit) it's just masked. I didn't realize there were so many weird semantics. `RCR r,cl` is 3x slower than `ROR r,cl` on Haswell, 17 times on Steamroller. (Partly because `ROR r,cl` is 1 cycle latency/ 2/cycle throughput on AMD, but `RCL` is 17 macro-ops, 7 cycle latency.)
Jul 23, 2015 at 16:28	comment	added	supercat		...and I suspect Intel wanted to avoid having an instruction take too long if N was 255. On the other hand, the behavior of shifting a 32-bit register right through carry by 32 or 33 bits would be different from shifting right by any smaller amount (including zero or one), and so masking yields totally wrong semantics. Interestingly, the ARM allows shift-by-N and 32-bit rotate-by-N, but only allows 32+1-bit rotate through carry to be done one bit at a time.
Jul 23, 2015 at 16:20	comment	added	supercat		@PeterCordes: On the 8086, the shift count is specified in an 8-bit register. Consequently, even if Intel had included the hardware to saturate shifts of length (regsize) up to 255, shifts that are slightly larger than multiples of 256 would have their lengths masked. That having been said, there would have been considerable value in having shifts with lengths up to 255 behave in saturating fashion; the value of extending that to longer shifts would be much less. BTW, the rotate-through-carry instructions were much slower on the 80836 than the other shifts (extra cycles proportional to N)...
Jul 20, 2015 at 16:05	comment	added	Peter Cordes		So you'd get zero from a shift count higher than register width? Intel's vector shifts do work this way. Masking (rather than saturating) the count makes sense for a rotate, but it does seem silly for shifts.
Jul 20, 2015 at 15:49	comment	added	supercat		Although `x86` presently defines their shift operators as masking the size, that was a behavioral change in the 80386 [IMHO not a good one]. On earlier processors, shifting was done mod 256 (since the shift amount was specified using a byte register); on processors like the ARM where shifts of 32-255 bits are equivalent to (but faster than) a series of one-bit shifts, such semantics can be useful in graphics programming (e.g. row1 \|= dat >> y; row2 \|= dat >> (y-32). In any case, on a vintage 8086 or 80286, the behavior of shift instructions uses the whole value of CL.
Jul 18, 2015 at 23:15	comment	added	jww		Thank you very much. I know John from his (and Peng Li's) Integer Overflow Checker (long before Clang had sanitizers). I used to use it extensively when self test would fail after compiling with Intel's ICC. ICC was ruthless about removing this sort of undefined behavior.
Jul 18, 2015 at 7:43	history	edited	Peter Cordes	CC BY-SA 3.0	Link to pabigot's more C++ified version
Jul 18, 2015 at 7:26	history	edited	Peter Cordes	CC BY-SA 3.0	test different type widths. n &= mask to avoid UD with NDEBUG.
Jul 18, 2015 at 6:02	history	edited	Peter Cordes	CC BY-SA 3.0	use CHAR_BIT. Tested 64b version, and ARM
Jul 18, 2015 at 5:38	history	answered	Peter Cordes	CC BY-SA 3.0

toggle format

Collectives™ on Stack Overflow

Timeline for answer to Near constant time rotate that does not violate the standards by Peter Cordes

Current License: CC BY-SA 4.0

Post Revisions