mirror of
				git://git.yoctoproject.org/linux-yocto.git
				synced 2025-10-22 23:13:01 +02:00 
			
		
		
		
	 4c093cbb89
			
		
	
	
		4c093cbb89
		
	
	
	
	
		
			
			Non-scalar time was removed from the ktime hybrid union in v3.17, and the union itself followed suit in v4.10. Make it clear that ktime_t is always a 64bit scalar type, to avoid confusing the casual reader. While at it, fix a spelling mistake. Fixes:24e4a8c3e8("ktime: Kill non-scalar ktime_t implementation for 2038") Fixes:2456e85535("ktime: Get rid of the union") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/59250a3d1c2c827b5c1833169a6e652ca6a784e6.1683021785.git.geert+renesas@glider.be Signed-off-by: Jonathan Corbet <corbet@lwn.net>
		
			
				
	
	
		
			174 lines
		
	
	
		
			8.4 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
			
		
		
	
	
			174 lines
		
	
	
		
			8.4 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
	
	
| ======================================================
 | |
| hrtimers - subsystem for high-resolution kernel timers
 | |
| ======================================================
 | |
| 
 | |
| This patch introduces a new subsystem for high-resolution kernel timers.
 | |
| 
 | |
| One might ask the question: we already have a timer subsystem
 | |
| (kernel/timers.c), why do we need two timer subsystems? After a lot of
 | |
| back and forth trying to integrate high-resolution and high-precision
 | |
| features into the existing timer framework, and after testing various
 | |
| such high-resolution timer implementations in practice, we came to the
 | |
| conclusion that the timer wheel code is fundamentally not suitable for
 | |
| such an approach. We initially didn't believe this ('there must be a way
 | |
| to solve this'), and spent a considerable effort trying to integrate
 | |
| things into the timer wheel, but we failed. In hindsight, there are
 | |
| several reasons why such integration is hard/impossible:
 | |
| 
 | |
| - the forced handling of low-resolution and high-resolution timers in
 | |
|   the same way leads to a lot of compromises, macro magic and #ifdef
 | |
|   mess. The timers.c code is very "tightly coded" around jiffies and
 | |
|   32-bitness assumptions, and has been honed and micro-optimized for a
 | |
|   relatively narrow use case (jiffies in a relatively narrow HZ range)
 | |
|   for many years - and thus even small extensions to it easily break
 | |
|   the wheel concept, leading to even worse compromises. The timer wheel
 | |
|   code is very good and tight code, there's zero problems with it in its
 | |
|   current usage - but it is simply not suitable to be extended for
 | |
|   high-res timers.
 | |
| 
 | |
| - the unpredictable [O(N)] overhead of cascading leads to delays which
 | |
|   necessitate a more complex handling of high resolution timers, which
 | |
|   in turn decreases robustness. Such a design still leads to rather large
 | |
|   timing inaccuracies. Cascading is a fundamental property of the timer
 | |
|   wheel concept, it cannot be 'designed out' without inevitably
 | |
|   degrading other portions of the timers.c code in an unacceptable way.
 | |
| 
 | |
| - the implementation of the current posix-timer subsystem on top of
 | |
|   the timer wheel has already introduced a quite complex handling of
 | |
|   the required readjusting of absolute CLOCK_REALTIME timers at
 | |
|   settimeofday or NTP time - further underlying our experience by
 | |
|   example: that the timer wheel data structure is too rigid for high-res
 | |
|   timers.
 | |
| 
 | |
| - the timer wheel code is most optimal for use cases which can be
 | |
|   identified as "timeouts". Such timeouts are usually set up to cover
 | |
|   error conditions in various I/O paths, such as networking and block
 | |
|   I/O. The vast majority of those timers never expire and are rarely
 | |
|   recascaded because the expected correct event arrives in time so they
 | |
|   can be removed from the timer wheel before any further processing of
 | |
|   them becomes necessary. Thus the users of these timeouts can accept
 | |
|   the granularity and precision tradeoffs of the timer wheel, and
 | |
|   largely expect the timer subsystem to have near-zero overhead.
 | |
|   Accurate timing for them is not a core purpose - in fact most of the
 | |
|   timeout values used are ad-hoc. For them it is at most a necessary
 | |
|   evil to guarantee the processing of actual timeout completions
 | |
|   (because most of the timeouts are deleted before completion), which
 | |
|   should thus be as cheap and unintrusive as possible.
 | |
| 
 | |
| The primary users of precision timers are user-space applications that
 | |
| utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
 | |
| users like drivers and subsystems which require precise timed events
 | |
| (e.g. multimedia) can benefit from the availability of a separate
 | |
| high-resolution timer subsystem as well.
 | |
| 
 | |
| While this subsystem does not offer high-resolution clock sources just
 | |
| yet, the hrtimer subsystem can be easily extended with high-resolution
 | |
| clock capabilities, and patches for that exist and are maturing quickly.
 | |
| The increasing demand for realtime and multimedia applications along
 | |
| with other potential users for precise timers gives another reason to
 | |
| separate the "timeout" and "precise timer" subsystems.
 | |
| 
 | |
| Another potential benefit is that such a separation allows even more
 | |
| special-purpose optimization of the existing timer wheel for the low
 | |
| resolution and low precision use cases - once the precision-sensitive
 | |
| APIs are separated from the timer wheel and are migrated over to
 | |
| hrtimers. E.g. we could decrease the frequency of the timeout subsystem
 | |
| from 250 Hz to 100 HZ (or even smaller).
 | |
| 
 | |
| hrtimer subsystem implementation details
 | |
| ----------------------------------------
 | |
| 
 | |
| the basic design considerations were:
 | |
| 
 | |
| - simplicity
 | |
| 
 | |
| - data structure not bound to jiffies or any other granularity. All the
 | |
|   kernel logic works at 64-bit nanoseconds resolution - no compromises.
 | |
| 
 | |
| - simplification of existing, timing related kernel code
 | |
| 
 | |
| another basic requirement was the immediate enqueueing and ordering of
 | |
| timers at activation time. After looking at several possible solutions
 | |
| such as radix trees and hashes, we chose the red black tree as the basic
 | |
| data structure. Rbtrees are available as a library in the kernel and are
 | |
| used in various performance-critical areas of e.g. memory management and
 | |
| file systems. The rbtree is solely used for time sorted ordering, while
 | |
| a separate list is used to give the expiry code fast access to the
 | |
| queued timers, without having to walk the rbtree.
 | |
| 
 | |
| (This separate list is also useful for later when we'll introduce
 | |
| high-resolution clocks, where we need separate pending and expired
 | |
| queues while keeping the time-order intact.)
 | |
| 
 | |
| Time-ordered enqueueing is not purely for the purposes of
 | |
| high-resolution clocks though, it also simplifies the handling of
 | |
| absolute timers based on a low-resolution CLOCK_REALTIME. The existing
 | |
| implementation needed to keep an extra list of all armed absolute
 | |
| CLOCK_REALTIME timers along with complex locking. In case of
 | |
| settimeofday and NTP, all the timers (!) had to be dequeued, the
 | |
| time-changing code had to fix them up one by one, and all of them had to
 | |
| be enqueued again. The time-ordered enqueueing and the storage of the
 | |
| expiry time in absolute time units removes all this complex and poorly
 | |
| scaling code from the posix-timer implementation - the clock can simply
 | |
| be set without having to touch the rbtree. This also makes the handling
 | |
| of posix-timers simpler in general.
 | |
| 
 | |
| The locking and per-CPU behavior of hrtimers was mostly taken from the
 | |
| existing timer wheel code, as it is mature and well suited. Sharing code
 | |
| was not really a win, due to the different data structures. Also, the
 | |
| hrtimer functions now have clearer behavior and clearer names - such as
 | |
| hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
 | |
| equivalent to timer_delete() and timer_delete_sync()] - so there's no direct
 | |
| 1:1 mapping between them on the algorithmic level, and thus no real
 | |
| potential for code sharing either.
 | |
| 
 | |
| Basic data types: every time value, absolute or relative, is in a
 | |
| special nanosecond-resolution 64bit type: ktime_t.
 | |
| (Originally, the kernel-internal representation of ktime_t values and
 | |
| operations was implemented via macros and inline functions, and could be
 | |
| switched between a "hybrid union" type and a plain "scalar" 64bit
 | |
| nanoseconds representation (at compile time). This was abandoned in the
 | |
| context of the Y2038 work.)
 | |
| 
 | |
| hrtimers - rounding of timer values
 | |
| -----------------------------------
 | |
| 
 | |
| the hrtimer code will round timer events to lower-resolution clocks
 | |
| because it has to. Otherwise it will do no artificial rounding at all.
 | |
| 
 | |
| one question is, what resolution value should be returned to the user by
 | |
| the clock_getres() interface. This will return whatever real resolution
 | |
| a given clock has - be it low-res, high-res, or artificially-low-res.
 | |
| 
 | |
| hrtimers - testing and verification
 | |
| -----------------------------------
 | |
| 
 | |
| We used the high-resolution clock subsystem on top of hrtimers to verify
 | |
| the hrtimer implementation details in praxis, and we also ran the posix
 | |
| timer tests in order to ensure specification compliance. We also ran
 | |
| tests on low-resolution clocks.
 | |
| 
 | |
| The hrtimer patch converts the following kernel functionality to use
 | |
| hrtimers:
 | |
| 
 | |
|  - nanosleep
 | |
|  - itimers
 | |
|  - posix-timers
 | |
| 
 | |
| The conversion of nanosleep and posix-timers enabled the unification of
 | |
| nanosleep and clock_nanosleep.
 | |
| 
 | |
| The code was successfully compiled for the following platforms:
 | |
| 
 | |
|  i386, x86_64, ARM, PPC, PPC64, IA64
 | |
| 
 | |
| The code was run-tested on the following platforms:
 | |
| 
 | |
|  i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
 | |
| 
 | |
| hrtimers were also integrated into the -rt tree, along with a
 | |
| hrtimers-based high-resolution clock implementation, so the hrtimers
 | |
| code got a healthy amount of testing and use in practice.
 | |
| 
 | |
| 	Thomas Gleixner, Ingo Molnar
 |