AMD says it will not fix the issue, primarily because nearly three years without turning off a server is rather a long time. The workaround is either to reboot the server before it reaches 1,044 days of uptime, which restarts the 1,044-day "timer," or to disable the CC6 sleep state.
But Tom’s Hardware says you can learn a lot about the chip from the flaw.
The issue stems from the core failing to exit the CC6 sleep state, but AMD says the timing of the failure could vary depending on spread spectrum and the REFCLK frequency, the reference clock that helps the chip keep track of time.
The TSC ticks at 2800 MHz, and 2800 * 10**6 ticks per second over 1042.5 days almost equals 0x380000000000000, which has too many zeros to be a coincidence.
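The arithmetic is easy to check. The following is a minimal Python sketch, assuming the TSC runs at exactly 2,800 MHz and a day is 86,400 seconds; the variable names are illustrative and do not come from AMD's documentation.

```python
# Sanity check of the quoted arithmetic; all names here are illustrative.
SECONDS_PER_DAY = 86_400
TSC_HZ = 2800 * 10**6            # assumption: TSC ticks at exactly 2,800 MHz
THRESHOLD = 0x380000000000000    # the round hex value quoted in the text

# 1042.5 days expressed in whole seconds (10425 tenths of a day).
seconds = 10425 * SECONDS_PER_DAY // 10
ticks = TSC_HZ * seconds

# Relative gap between the tick count and the hex value.
relative_gap = abs(ticks - THRESHOLD) / THRESHOLD

# Days needed to reach the hex value at the assumed frequency.
days_to_threshold = THRESHOLD / (TSC_HZ * SECONDS_PER_DAY)

print(f"ticks after 1042.5 days: {ticks:#x}")
print(f"threshold:               {THRESHOLD:#x}")
print(f"relative gap:            {relative_gap:.2e}")
print(f"days to threshold:       {days_to_threshold:.4f}")
```

Under these assumptions the two values agree to well under one part in a million. The hex value itself is 7 × 2^55, so only a handful of high-order bits are set, which is what makes it look like a counter threshold rather than an arbitrary number.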
With billions of transistors in play, issues are inevitable: it isn't uncommon for a chip to have a thousand or more errata/bugs corrected in newer chip steppings or with firmware tweaks before launch. These errata can encompass all types of bugs, from security holes to malfunctioning flags and cache tags that don't operate correctly, and the chipmakers do their best to stamp them out before launch.
However, some errata always remain, even in shipping chips. For instance, Intel's 8th-gen chips, which launched in 2017, still have more than 150 listed errata. It is unclear how many errata the Rome chips have had in total because AMD has removed the listings for errata that have been solved. However, we do know that 39 errata remain, which doesn't seem too bad against the Intel backdrop.
AMD did not find the bug earlier because 1,044 days is longer than its validation and qualification cycles. It also isn't clear whether accelerated aging testing, which often involves running the equipment at higher-than-usual temperatures over long periods to simulate the aging process, could catch the bug.
The AMD EPYC Rome chips launched in August 2019, so perhaps some of AMD's customers have already encountered the issue the hard way, in deployment.