June 27, 2012

Sharing - Leap Second Bug in RedHat

The UTC time standard, which is widely used for international timekeeping on computer systems, uses the international standard definition of the second, based on atomic clocks. However, the duration of one mean solar day is slightly longer than 86,400 seconds (a UTC day). The purpose of a leap second is to compensate for this drift by scheduling days with 86,401 or 86,399 international standard seconds.
Because the Earth's rotation speed varies in response to natural events, UTC leap seconds are irregularly spaced and unpredictable. The last leap second was inserted at the end of 31 December 2008, immediately after 23:59:59 UTC. Leap seconds are defined in UTC, so they are timezone independent and occur around the world at the same moment, regardless of local time.
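
To make the arithmetic concrete, here is a small Python sketch (not part of the original note) of how the 2012-06-30 leap second looks in POSIX timestamps. POSIX time counts every day as exactly 86,400 seconds, so the extra second 23:59:60 has no timestamp of its own. The variable names are only illustrative.

```python
import calendar

SECONDS_PER_DAY = 24 * 60 * 60           # 86400, an ordinary UTC day
LEAP_DAY_SECONDS = SECONDS_PER_DAY + 1   # 86401, a day with a positive leap second

# calendar.timegm does plain arithmetic on the fields, so second=60 is accepted.
last_second = calendar.timegm((2012, 6, 30, 23, 59, 59))
leap_second = calendar.timegm((2012, 6, 30, 23, 59, 60))
next_midnight = calendar.timegm((2012, 7, 1, 0, 0, 0))

# The leap second collapses onto the following midnight in POSIX time.
print(leap_second == next_midnight)
print(next_midnight - last_second)
```

Because the timestamps cannot tell 23:59:60 apart from the following 00:00:00, a system clock has to repeat or step a second to absorb the leap second.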

By default in modern kernels, the kernel can make new process scheduling decisions every 1/1000th of a second (1 ms). This 1 ms interval is called a kernel "tick".

* How the event is triggered *
  1. The system inherits a leap second flag from an upstream NTP server with knowledge of the upcoming leap second. (This occurs on the day of the leap second event and cannot be unset.)
  2. At 23:59:59 UTC on the leap second day, the kernel sees the leap second flag and causes 23:59:59 UTC to occur twice.
  3. In order to process the leap second event, a lock is acquired to access the current time.
  4. While processing the leap second, the kernel issues a printk to notify the user that the leap second has occurred.
  5. The printk triggers klogd to wake up so that it can process the new kernel message.
  6. klogd attempts to acquire a lock to access the current kernel time.

If step 3 happens on the same core and during the same tick as step 6, a deadlock occurs on xtime_lock.
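
The shape of that deadlock can be sketched in user-space Python. This is only an analogy under loud assumptions: a non-reentrant threading.Lock stands in for xtime_lock, the function names are made up for illustration, and the real kernel locking primitive and code paths differ. The timeout exists only so the demo terminates instead of hanging.

```python
import threading

# Stand-in for the kernel's xtime_lock (hypothetical model, not kernel code).
# threading.Lock is not reentrant: a holder that asks for it again will block.
xtime_lock = threading.Lock()

def read_time_for_log():
    """Step 6: the logging path needs the current time, so it wants the lock."""
    got_lock = xtime_lock.acquire(timeout=0.1)  # timeout only so the demo ends
    if got_lock:
        xtime_lock.release()
    return got_lock

def handle_leap_second():
    """Step 3: take the lock, then (steps 4-5) log while still holding it."""
    with xtime_lock:
        return read_time_for_log()

print("logging path got the lock:", handle_leap_second())
```

Without the timeout, the inner acquire would wait forever for a lock that its own caller holds, which is the hang described above. Called on its own, outside the leap second path, read_time_for_log() gets the lock without trouble.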

* Likelihood of Occurrence *

It's exceptionally unlikely that the triggering events would happen as required to cause a hang. The issue proved extremely difficult to trigger during reproduction attempts, even when those attempts artificially introduced high printk loads to try to provoke the hang.

* Workarounds *

Updating to kernel-2.6.9-89.EL (RHEL4), kernel-2.6.18-164.el5 (RHEL5), or any later RHEL kernel is the most reliable way to avoid any impact from this bug, which has been patched in these kernel versions. If your environment includes systems with a kernel version lower than those patched kernels and you remain concerned despite the low probability of encountering this issue, several workarounds are available to further reduce the risk of encountering this bug.

  1. Manually adjust the system time so that 2012-06-30T23:59:59 UTC never occurs.
  2. Disable NTP clients on the affected system at least a full day ahead of the leap second so that the leap second flag is never inherited. Then re-enable NTP on those systems after the leap second has occurred. It's important to ensure that the tzdata package installed on the system has not been updated to include the 2012-06-30 leap second, as the system can inherit the leap second flag from the tzdata file as well, even if NTP is disabled.
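
To decide whether a given host needs a workaround at all, the running kernel release can be compared against the two patched versions named above. The following Python sketch is one possible way to do that. The helper name is_patched and its parsing rules are my own assumptions, not a Red Hat tool, and it deliberately returns None for any kernel series the note does not mention.

```python
import platform
import re

# First patched release per kernel series, taken from the versions named above.
PATCHED = {
    (2, 6, 9): (89,),     # RHEL4: kernel-2.6.9-89.EL
    (2, 6, 18): (164,),   # RHEL5: kernel-2.6.18-164.el5
}

def is_patched(release):
    """Return True/False for the two known series, None for anything else.

    `release` is a string such as "2.6.18-164.el5" (the format of `uname -r`).
    Hypothetical helper: only the leading numeric fields are compared.
    """
    match = re.match(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    base = tuple(int(part) for part in match.group(1).split("."))
    build = tuple(int(part) for part in match.group(2).split("."))
    minimum = PATCHED.get(base)
    if minimum is None:
        return None  # a series this note does not cover; check it separately
    return build >= minimum

if __name__ == "__main__":
    print(platform.release(), "->", is_patched(platform.release()))
```

For example, is_patched("2.6.18-128.1.1.el5") returns False, so that host would be a candidate for one of the workarounds, while is_patched("2.6.18-164.el5") returns True.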
