~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~ [ freetext search ] ~ [ file search ] ~

Linux Cross Reference
Linux/Documentation/DocBook/kernel-locking.tmpl

Version: ~ [ 2.4.0 ] ~
Architecture: ~ [ i386 ] ~ [ alpha ] ~ [ m68k ] ~ [ mips ] ~ [ ppc ] ~ [ sparc ] ~ [ sparc64 ] ~

  1 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>
  2 
  3 <book id="LKLockingGuide">
  4  <bookinfo>
  5   <title>Unreliable Guide To Locking</title>
  6   
  7   <authorgroup>
  8    <author>
  9     <firstname>Paul</firstname>
 10     <othername>Rusty</othername>
 11     <surname>Russell</surname>
 12     <affiliation>
 13      <address>
 14       <email>rusty@linuxcare.com</email>
 15      </address>
 16     </affiliation>
 17    </author>
 18   </authorgroup>
 19 
 20   <copyright>
 21    <year>2000</year>
 22    <holder>Paul Russell</holder>
 23   </copyright>
 24 
 25   <legalnotice>
 26    <para>
 27      This documentation is free software; you can redistribute
 28      it and/or modify it under the terms of the GNU General Public
 29      License as published by the Free Software Foundation; either
 30      version 2 of the License, or (at your option) any later
 31      version.
 32    </para>
 33       
 34    <para>
 35      This program is distributed in the hope that it will be
 36      useful, but WITHOUT ANY WARRANTY; without even the implied
 37      warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 38      See the GNU General Public License for more details.
 39    </para>
 40       
 41    <para>
 42      You should have received a copy of the GNU General Public
 43      License along with this program; if not, write to the Free
 44      Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
 45      MA 02111-1307 USA
 46    </para>
 47       
 48    <para>
 49      For more details see the file COPYING in the source
 50      distribution of Linux.
 51    </para>
 52   </legalnotice>
 53  </bookinfo>
 54 
 55  <toc></toc>
 56   <chapter id="intro">
 57    <title>Introduction</title>
 58    <para>
 59      Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
 60      Locking issues.  This document describes the locking systems in
 61      the Linux Kernel as we approach 2.4.
 62    </para>
 63    <para>
 64      It looks like <firstterm linkend="gloss-smp"><acronym>SMP</acronym>
 65      </firstterm> is here to stay; so everyone hacking on the kernel 
 66      these days needs to know the fundamentals of concurrency and locking 
 67      for SMP.
 68    </para>
 69 
 70    <sect1 id="races">
 71     <title>The Problem With Concurrency</title>
 72     <para>
 73       (Skip this if you know what a Race Condition is).
 74     </para>
 75     <para>
 76       In a normal program, you can increment a counter like so:
 77     </para>
 78     <programlisting>
 79       very_important_count++;
 80     </programlisting>
 81 
 82     <para>
 83       This is what they would expect to happen:
 84     </para>
 85 
 86     <table>
 87      <title>Expected Results</title>
 88 
 89      <tgroup cols=2 align=left>
 90 
 91       <thead>
 92        <row>
 93         <entry>Instance 1</entry>
 94         <entry>Instance 2</entry>
 95        </row>
 96       </thead>
 97 
 98       <tbody>
 99        <row>
100         <entry>read very_important_count (5)</entry>
101         <entry></entry>
102        </row>
103        <row>
104         <entry>add 1 (6)</entry>
105         <entry></entry>
106        </row>
107        <row>
108         <entry>write very_important_count (6)</entry>
109         <entry></entry>
110        </row>
111        <row>
112         <entry></entry>
113         <entry>read very_important_count (6)</entry>
114        </row>
115        <row>
116         <entry></entry>
117         <entry>add 1 (7)</entry>
118        </row>
119        <row>
120         <entry></entry>
121         <entry>write very_important_count (7)</entry>
122        </row>
123       </tbody>
124 
125      </tgroup>
126     </table>
127 
128     <para>
129      This is what might happen:
130     </para>
131 
132     <table>
133      <title>Possible Results</title>
134 
135      <tgroup cols=2 align=left>
136       <thead>
137        <row>
138         <entry>Instance 1</entry>
139         <entry>Instance 2</entry>
140        </row>
141       </thead>
142 
143       <tbody>
144        <row>
145         <entry>read very_important_count (5)</entry>
146         <entry></entry>
147        </row>
148        <row>
149         <entry></entry>
150         <entry>read very_important_count (5)</entry>
151        </row>
152        <row>
153         <entry>add 1 (6)</entry>
154         <entry></entry>
155        </row>
156        <row>
157         <entry></entry>
158         <entry>add 1 (6)</entry>
159        </row>
160        <row>
161         <entry>write very_important_count (6)</entry>
162         <entry></entry>
163        </row>
164        <row>
165         <entry></entry>
166         <entry>write very_important_count (6)</entry>
167        </row>
168       </tbody>
169      </tgroup>
170     </table>
171 
172     <para>
173       This overlap, where what actually happens depends on the
174       relative timing of multiple tasks, is called a race condition.
175       The piece of code containing the concurrency issue is called a
176       critical region.  And especially since Linux starting running
177       on SMP machines, they became one of the major issues in kernel
178       design and implementation.
179     </para>
180     <para>
181       The solution is to recognize when these simultaneous accesses
182       occur, and use locks to make sure that only one instance can
183       enter the critical region at any time.  There are many
184       friendly primitives in the Linux kernel to help you do this.
185       And then there are the unfriendly primitives, but I'll pretend
186       they don't exist.
187     </para>
188    </sect1>
189   </chapter>
190 
191   <chapter id="locks">
192    <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
193 
194    <para>
195      There are two main types of kernel locks.  The fundamental type
196      is the spinlock 
197      (<filename class=headerfile>include/asm/spinlock.h</filename>), 
198      which is a very simple single-holder lock: if you can't get the 
199      spinlock, you keep trying (spinning) until you can.  Spinlocks are 
200      very small and fast, and can be used anywhere.
201    </para>
202    <para>
203      The second type is a semaphore
204      (<filename class=headerfile>include/asm/semaphore.h</filename>): it
205      can have more than one holder at any time (the number decided at
206      initialization time), although it is most commonly used as a
207      single-holder lock (a mutex).  If you can't get a semaphore,
208      your task will put itself on the queue, and be woken up when the
209      semaphore is released.  This means the CPU will do something
210      else while you are waiting, but there are many cases when you
211      simply can't sleep (see <xref linkend="sleeping-things">), and so
212      have to use a spinlock instead.
213    </para>
214    <para>
215      Neither type of lock is recursive: see
216      <xref linkend="techniques-deadlocks">.
217    </para>
218  
219    <sect1 id="uniprocessor">
220     <title>Locks and Uniprocessor Kernels</title>
221 
222     <para>
223       For kernels compiled without <symbol>CONFIG_SMP</symbol>, spinlocks 
224       do not exist at all.  This is an excellent design decision: when
225       no-one else can run at the same time, there is no reason to
226       have a lock at all.
227     </para>
228 
229     <para>
230       You should always test your locking code with <symbol>CONFIG_SMP</symbol>
231       enabled, even if you don't have an SMP test box, because it
232       will still catch some (simple) kinds of deadlock.
233     </para>
234 
235     <para>
236       Semaphores still exist, because they are required for
237       synchronization between <firstterm linkend="gloss-usercontext">user 
238       contexts</firstterm>, as we will see below.
239     </para>
240    </sect1>
241 
242    <sect1 id="rwlocks">
243     <title>Read/Write Lock Variants</title>
244 
245     <para>
246       Both spinlocks and semaphores have read/write variants:
247       <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>. 
248       These divide users into two classes: the readers and the writers.  If 
249       you are only reading the data, you can get a read lock, but to write to 
250       the data you need the write lock.  Many people can hold a read lock,
251       but a writer must be sole holder.
252     </para>
253 
254     <para>
255       This means much smoother locking if your code divides up
256       neatly along reader/writer lines.  All the discussions below
257       also apply to read/write variants.
258     </para>
259    </sect1>
260 
261     <sect1 id="usercontextlocking">
262      <title>Locking Only In User Context</title>
263 
264      <para>
265        If you have a data structure which is only ever accessed from
266        user context, then you can use a simple semaphore
267        (<filename>linux/asm/semaphore.h</filename>) to protect it.  This 
268        is the most trivial case: you initialize the semaphore to the number 
269        of resources available (usually 1), and call
270        <function>down_interruptible()</function> to grab the semaphore, and 
271        <function>up()</function> to release it.  There is also a 
272        <function>down()</function>, which should be avoided, because it 
273        will not return if a signal is received.
274      </para>
275 
276      <para>
277        Example: <filename>linux/net/core/netfilter.c</filename> allows 
278        registration of new <function>setsockopt()</function> and 
279        <function>getsockopt()</function> calls, with
280        <function>nf_register_sockopt()</function>.  Registration and 
281        de-registration are only done on module load and unload (and boot 
282        time, where there is no concurrency), and the list of registrations 
283        is only consulted for an unknown <function>setsockopt()</function>
284        or <function>getsockopt()</function> system call.  The 
285        <varname>nf_sockopt_mutex</varname> is perfect to protect this,
286        especially since the setsockopt and getsockopt calls may well
287        sleep.
288      </para>
289    </sect1>
290 
291    <sect1 id="lock-user-bh">
292     <title>Locking Between User Context and BHs</title>
293 
294     <para>
295       If a <firstterm linkend="gloss-bh">bottom half</firstterm> shares 
296       data with user context, you have two problems.  Firstly, the current 
297       user context can be interrupted by a bottom half, and secondly, the 
298       critical region could be entered from another CPU.  This is where
299       <function>spin_lock_bh()</function> 
300       (<filename class=headerfile>include/linux/spinlock.h</filename>) is 
301       used.  It disables bottom halves on that CPU, then grabs the lock.
302       <function>spin_unlock_bh()</function> does the reverse.
303     </para>
304 
305     <para>
306       This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
307       </acronym></firstterm> as well: the spin lock vanishes, and this macro 
308       simply becomes <function>local_bh_disable()</function>
309       (<filename class=headerfile>include/asm/softirq.h</filename>), which 
310       protects you from the bottom half being run.
311     </para>
312    </sect1>
313 
314    <sect1 id="lock-user-tasklet">
315     <title>Locking Between User Context and Tasklets/Soft IRQs</title>
316 
317     <para>
318       This is exactly the same as above, because
319       <function>local_bh_disable()</function> actually disables all 
320       softirqs and <firstterm linkend="gloss-tasklet">tasklets</firstterm>
321       on that CPU as well.  It should really be 
322       called `local_softirq_disable()', but the name has been preserved 
323       for historical reasons.  Similarly,
324       <function>spin_lock_bh()</function> would now be called 
325       spin_lock_softirq() in a perfect world.
326     </para>
327    </sect1>
328 
329    <sect1 id="lock-bh">
330     <title>Locking Between Bottom Halves</title>
331 
332     <para>
333       Sometimes a bottom half might want to share data with
334       another bottom half (especially remember that timers are run
335       off a bottom half).
336     </para>
337 
338     <sect2 id="lock-bh-same">
339      <title>The Same BH</title>
340 
341      <para>
342        Since a bottom half is never run on two CPUs at once, you
343        don't need to worry about your bottom half being run twice
344        at once, even on SMP.
345      </para>
346     </sect2>
347 
348     <sect2 id="lock-bh-different">
349      <title>Different BHs</title>
350 
351      <para>
352        Since only one bottom half ever runs at a time once, you
353        don't need to worry about race conditions with other bottom
354        halves.  Beware that things might change under you, however,
355        if someone changes your bottom half to a tasklet.  If you
356        want to make your code future-proof, pretend you're already
357        running from a tasklet (see below), and doing the extra
358        locking.  Of course, if it's five years before that happens,
359        you're gonna look like a damn fool.
360      </para>
361     </sect2>
362    </sect1>
363 
364    <sect1 id="lock-tasklets">
365     <title>Locking Between Tasklets</title>
366 
367     <para>
368       Sometimes a tasklet might want to share data with another
369       tasklet, or a bottom half.
370     </para>
371 
372     <sect2 id="lock-tasklets-same">
373      <title>The Same Tasklet</title>
374      <para>
375        Since a tasklet is never run on two CPUs at once, you don't
376        need to worry about your tasklet being reentrant (running
377        twice at once), even on SMP.
378      </para>
379     </sect2>
380 
381     <sect2 id="lock-tasklets-different">
382      <title>Different Tasklets</title>
383      <para>
384        If another tasklet (or bottom half, such as a timer) wants
385        to share data with your tasklet, you will both need to use
386        <function>spin_lock()</function> and
387        <function>spin_unlock()</function> calls.  
388        <function>spin_lock_bh()</function> is
389        unnecessary here, as you are already in a a tasklet, and
390        none will be run on the same CPU.
391      </para>
392     </sect2>
393    </sect1>
394 
395    <sect1 id="lock-softirqs">
396     <title>Locking Between Softirqs</title>
397 
398     <para>
399       Often a <firstterm linkend="gloss-softirq">softirq</firstterm> might 
400       want to share data with itself, a tasklet, or a bottom half.
401     </para>
402 
403     <sect2 id="lock-softirqs-same">
404      <title>The Same Softirq</title>
405 
406      <para>
407        The same softirq can run on the other CPUs: you can use a
408        per-CPU array (see <xref linkend="per-cpu">) for better
409        performance.  If you're going so far as to use a softirq,
410        you probably care about scalable performance enough
411        to justify the extra complexity.
412      </para>
413 
414      <para>
415        You'll need to use <function>spin_lock()</function> and 
416        <function>spin_unlock()</function> for shared data.
417      </para>
418     </sect2>
419 
420     <sect2 id="lock-softirqs-different">
421      <title>Different Softirqs</title>
422 
423      <para>
424        You'll need to use <function>spin_lock()</function> and 
425        <function>spin_unlock()</function> for shared data, whether it 
426        be a timer (which can be running on a different CPU), bottom half, 
427        tasklet or the same or another softirq.
428      </para>
429     </sect2>
430    </sect1>
431   </chapter>
432 
433   <chapter id="hardirq-context">
434    <title>Hard IRQ Context</title>
435 
436    <para>
437      Hardware interrupts usually communicate with a bottom half,
438      tasklet or softirq.  Frequently this involves putting work in a
439      queue, which the BH/softirq will take out.
440    </para>
441 
442    <sect1 id="hardirq-softirq">
443     <title>Locking Between Hard IRQ and Softirqs/Tasklets/BHs</title>
444 
445     <para>
446       If a hardware irq handler shares data with a softirq, you have
447       two concerns.  Firstly, the softirq processing can be
448       interrupted by a hardware interrupt, and secondly, the
449       critical region could be entered by a hardware interrupt on
450       another CPU.  This is where <function>spin_lock_irq()</function> is 
451       used.  It is defined to disable interrupts on that cpu, then grab 
452       the lock. <function>spin_unlock_irq()</function> does the reverse.
453     </para>
454 
455     <para>
456       This works perfectly for UP as well: the spin lock vanishes,
457       and this macro simply becomes <function>local_irq_disable()</function>
458       (<filename class=headerfile>include/asm/smp.h</filename>), which 
459       protects you from the softirq/tasklet/BH being run.
460     </para>
461 
462     <para>
463       <function>spin_lock_irqsave()</function> 
464       (<filename>include/linux/spinlock.h</filename>) is a variant
465       which saves whether interrupts were on or off in a flags word,
466       which is passed to <function>spin_lock_irqrestore()</function>.  This 
467       means that the same code can be used inside an hard irq handler (where
468       interrupts are already off) and in softirqs (where the irq
469       disabling is required).
470     </para>
471    </sect1>
472   </chapter>
473 
474   <chapter id="common-techniques">
475    <title>Common Techniques</title>
476 
477    <para>
478      This section lists some common dilemmas and the standard
479      solutions used in the Linux kernel code.  If you use these,
480      people will find your code simpler to understand.
481    </para>
482 
483    <para>
484      If I could give you one piece of advice: never sleep with anyone
485      crazier than yourself.  But if I had to give you advice on
486      locking: <emphasis>keep it simple</emphasis>.
487    </para>
488 
489    <para>
490      Lock data, not code.
491    </para>
492 
493    <para>
494      Be reluctant to introduce new locks.
495    </para>
496 
497    <para>
498      Strangely enough, this is the exact reverse of my advice when
499      you <emphasis>have</emphasis> slept with someone crazier than yourself.
500    </para>
501 
502    <sect1 id="techniques-no-writers">
503     <title>No Writers in Interrupt Context</title>
504 
505     <para>
506       There is a fairly common case where an interrupt handler needs
507       access to a critical region, but does not need write access.
508       In this case, you do not need to use
509       <function>read_lock_irq()</function>, but only
510       <function>read_lock()</function> everywhere (since if an interrupt 
511       occurs, the irq handler will only try to grab a read lock, which 
512       won't deadlock).  You will still need to use 
513       <function>write_lock_irq()</function>.
514     </para>
515 
516     <para>
517       Similar logic applies to locking between softirqs/tasklets/BHs
518       which never need a write lock, and user context: 
519       <function>read_lock()</function> and
520       <function>write_lock_bh()</function>.
521     </para>
522    </sect1>
523 
524    <sect1 id="techniques-deadlocks">
525     <title>Deadlock: Simple and Advanced</title>
526 
527     <para>
528       There is a coding bug where a piece of code tries to grab a
529       spinlock twice: it will spin forever, waiting for the lock to
530       be released (spinlocks, rwlocks and semaphores are not
531       recursive in Linux).  This is trivial to diagnose: not a
532       stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
533       problem.
534     </para>
535 
536     <para>
537       For a slightly more complex case, imagine you have a region
538       shared by a bottom half and user context.  If you use a
539       <function>spin_lock()</function> call to protect it, it is 
540       possible that the user context will be interrupted by the bottom 
541       half while it holds the lock, and the bottom half will then spin 
542       forever trying to get the same lock.
543     </para>
544 
545     <para>
546       Both of these are called deadlock, and as shown above, it can
547       occur even with a single CPU (although not on UP compiles,
548       since spinlocks vanish on kernel compiles with 
549       <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption 
550       in the second example).
551     </para>
552 
553     <para>
554       This complete lockup is easy to diagnose: on SMP boxes the
555       watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
556       (<filename>include/linux/spinlock.h</filename>) will show this up 
557       immediately when it happens.
558     </para>
559 
560     <para>
561       A more complex problem is the so-called `deadly embrace',
562       involving two or more locks.  Say you have a hash table: each
563       entry in the table is a spinlock, and a chain of hashed
564       objects.  Inside a softirq handler, you sometimes want to
565       alter an object from one place in the hash to another: you
566       grab the spinlock of the old hash chain and the spinlock of
567       the new hash chain, and delete the object from the old one,
568       and insert it in the new one.
569     </para>
570 
571     <para>
572       There are two problems here.  First, if your code ever
573       tries to move the object to the same chain, it will deadlock
574       with itself as it tries to lock it twice.  Secondly, if the
575       same softirq on another CPU is trying to move another object
576       in the reverse direction, the following could happen:
577     </para>
578 
579     <table>
580      <title>Consequences</title>
581 
582      <tgroup cols=2 align=left>
583 
584       <thead>
585        <row>
586         <entry>CPU 1</entry>
587         <entry>CPU 2</entry>
588        </row>
589       </thead>
590 
591       <tbody>
592        <row>
593         <entry>Grab lock A -&gt; OK</entry>
594         <entry>Grab lock B -&gt; OK</entry>
595        </row>
596        <row>
597         <entry>Grab lock B -&gt; spin</entry>
598         <entry>Grab lock A -&gt; spin</entry>
599        </row>
600       </tbody>
601      </tgroup>
602     </table>
603 
604     <para>
605       The two CPUs will spin forever, waiting for the other to give up
606       their lock.  It will look, smell, and feel like a crash.
607     </para>
608 
609     <sect2 id="techs-deadlock-prevent">
610      <title>Preventing Deadlock</title>
611 
612      <para>
613        Textbooks will tell you that if you always lock in the same
614        order, you will never get this kind of deadlock.  Practice
615        will tell you that this approach doesn't scale: when I
616        create a new lock, I don't understand enough of the kernel
617        to figure out where in the 5000 lock hierarchy it will fit.
618      </para>
619 
620      <para>
621        The best locks are encapsulated: they never get exposed in
622        headers, and are never held around calls to non-trivial
623        functions outside the same file.  You can read through this
624        code and see that it will never deadlock, because it never
625        tries to grab another lock while it has that one.  People
626        using your code don't even need to know you are using a
627        lock.
628      </para>
629 
630      <para>
631        A classic problem here is when you provide callbacks or
632        hooks: if you call these with the lock held, you risk simple
633        deadlock, or a deadly embrace (who knows what the callback
634        will do?).  Remember, the other programmers are out to get
635        you, so don't do this.
636      </para>
637     </sect2>
638 
639     <sect2 id="techs-deadlock-overprevent">
640      <title>Overzealous Prevention Of Deadlocks</title>
641 
642      <para>
643        Deadlocks are problematic, but not as bad as data
644        corruption.  Code which grabs a read lock, searches a list,
645        fails to find what it wants, drops the read lock, grabs a
646        write lock and inserts the object has a race condition.
647      </para>
648 
649      <para>
650        If you don't see why, please stay the fuck away from my code.
651      </para>
652     </sect2>
653    </sect1>
654 
655    <sect1 id="per-cpu">
656     <title>Per-CPU Data</title>
657       
658     <para>
659       A great technique for avoiding locking which is used fairly
660       widely is to duplicate information for each CPU.  For example,
661       if you wanted to keep a count of a common condition, you could
662       use a spin lock and a single counter.  Nice and simple.
663     </para>
664 
665     <para>
666       If that was too slow [it's probably not], you could instead
667       use a counter for each CPU [don't], then none of them need an
668       exclusive lock [you're wasting your time here].  To make sure
669       the CPUs don't have to synchronize caches all the time, align
670       the counters to cache boundaries by appending
671       `__cacheline_aligned' to the declaration
672       (<filename class=headerfile>include/linux/cache.h</filename>). 
673       [Can't you think of anything better to do?]
674     </para>
675 
676     <para>
677       They will need a read lock to access their own counters,
678       however.  That way you can use a write lock to grant exclusive
679       access to all of them at once, to tally them up.
680     </para>
681    </sect1>
682 
683    <sect1 id="brlock">
684     <title>Big Reader Locks</title>
685 
686     <para>
687       A classic example of per-CPU information is Ingo's `big
688       reader' locks 
689       (<filename class=headerfile>linux/include/brlock.h</filename>).  These 
690       use the Per-CPU Data techniques described above to create a lock which 
691       is very fast to get a read lock, but agonizingly slow for a write
692       lock.
693     </para>
694 
695     <para>
696       Fortunately, there are a limited number of these locks
697       available: you have to go through a strict interview process
698       to get one.
699     </para>
700    </sect1>
701 
702    <sect1 id="lock-avoidance-rw">
703     <title>Avoiding Locks: Read And Write Ordering</title>
704 
705     <para>
706       Sometimes it is possible to avoid locking.  Consider the
707       following case from the 2.2 firewall code, which inserted an
708       element into a single linked list in user context:
709     </para>
710 
711     <programlisting>
712         new-&gt;next = i-&gt;next;
713         i-&gt;next = new;
714     </programlisting>
715 
716     <para>
717       Here the author (Alan Cox, who knows what he's doing) assumes
718       that the pointer assignments are atomic.  This is important,
719       because networking packets would traverse this list on bottom
720       halves without a lock.  Depending on their exact timing, they
721       would either see the new element in the list with a valid 
722       <structfield>next</structfield> pointer, or it would not be in the 
723       list yet.
724     </para>
725 
726     <para>
727       Of course, the writes <emphasis>must</emphasis> be in this
728       order, otherwise the new element appears in the list with an
729       invalid <structfield>next</structfield> pointer, and any other 
730       CPU iterating at the wrong time will jump through it into garbage.  
731       Because modern CPUs reorder, Alan's code actually read as follows:
732     </para>
733       
734     <programlisting>
735         new-&gt;next = i-&gt;next;
736         wmb();
737         i-&gt;next = new;
738     </programlisting>
739 
740     <para>
741       The <function>wmb()</function> is a write memory barrier 
742       (<filename class=headerfile>include/asm/system.h</filename>): neither 
743       the compiler nor the CPU will allow any writes to memory after the 
744       <function>wmb()</function> to be visible to other hardware
745       before any of the writes before the <function>wmb()</function>.
746     </para>
747 
748     <para>
749       As i386 does not do write reordering, this bug would never
750       show up on that platform.  On other SMP platforms, however, it
751       will.
752     </para>
753 
754     <para>
755       There is also <function>rmb()</function> for read ordering: to ensure 
756       any previous variable reads occur before following reads.  The simple
757       <function>mb()</function> macro combines both 
758       <function>rmb()</function> and <function>wmb()</function>.
759     </para>
760 
761     <para>
762       Any atomic operation is defined to act as a memory barrier
763       (ie. as per the <function>mb()</function> macro).  Also,
764       spinlock operations act as partial barriers: operations after
765       gaining a spinlock will never be moved to precede the
766       <function>spin_lock()</function> call, and operations before
767       releasing a spinlock will never be moved after the
768       <function>spin_unlock()</function> call.
769       <!-- Manfred Spraul <manfreds@colorfullife.com>
770            24 May 2000 2.3.99-pre9 -->
771     </para>
772    </sect1>
773 
774    <sect1 id="lock-avoidance-atomic-ops">
775     <title>Avoiding Locks: Atomic Operations</title>
776 
777     <para>
778       There are a number of atomic operations defined in
779       <filename class=headerfile>include/asm/atomic.h</filename>: these 
780       are guaranteed to be seen atomically from all CPUs in the system, thus 
781       avoiding races. If your shared data consists of a single counter, say, 
782       these operations might be simpler than using spinlocks (although for
783       anything non-trivial using spinlocks is clearer).
784     </para>
785 
786     <para>
787       Note that the atomic operations are defined to act as both
788       read and write barriers on all platforms.
789     </para>
790    </sect1>
791 
792    <sect1 id="ref-counts">
793     <title>Protecting A Collection of Objects: Reference Counts</title>
794 
795     <para>
796       Locking a collection of objects is fairly easy: you get a
797       single spinlock, and you make sure you grab it before
798       searching, adding or deleting an object.
799     </para>
800 
801     <para>
802       The purpose of this lock is not to protect the individual
803       objects: you might have a separate lock inside each one for
804       that.  It is to protect the <emphasis>data structure
805       containing the objects</emphasis> from race conditions.  Often
806       the same lock is used to protect the contents of all the
807       objects as well, for simplicity, but they are inherently
808       orthogonal (and many other big words designed to confuse).
809     </para>
810 
811     <para>
812       Changing this to a read-write lock will often help markedly if
813       reads are far more common that writes.  If not, there is
814       another approach you can use to reduce the time the lock is
815       held: reference counts.
816     </para>
817 
818     <para>
819       In this approach, an object has an owner, who sets the
820       reference count to one.  Whenever you get a pointer to the
821       object, you increment the reference count (a `get' operation).
822       Whenever you relinquish a pointer, you decrement the reference
823       count (a `put' operation).  When the owner wants to destroy
824       it, they mark it dead, and do a put.
825     </para>
826 
827     <para>
828       Whoever drops the reference count to zero (usually implemented
829       with <function>atomic_dec_and_test()</function>) actually cleans 
830       up and frees the object.
831     </para>
832 
833     <para>
834       This means that you are guaranteed that the object won't
835       vanish underneath you, even though you no longer have a lock
836       for the collection.
837     </para>
838 
839     <para>
840       Here's some skeleton code:
841     </para>
842 
843     <programlisting>
844         void create_foo(struct foo *x)
845         {
846                 atomic_set(&amp;x-&gt;use, 1);
847                 spin_lock_bh(&amp;list_lock);
848                 ... insert in list ...
849                 spin_unlock_bh(&amp;list_lock);
850         }
851 
852         struct foo *get_foo(int desc)
853         {
854                 struct foo *ret;
855 
856                 spin_lock_bh(&amp;list_lock);
857                 ... find in list ...
858                 if (ret) atomic_inc(&amp;ret-&gt;use);
859                 spin_unlock_bh(&amp;list_lock);
860 
861                 return ret;
862         }
863 
864         void put_foo(struct foo *x)
865         {
866                 if (atomic_dec_and_test(&amp;x-&gt;use))
867                         kfree(foo);
868         }
869 
870         void destroy_foo(struct foo *x)
871         {
872                 spin_lock_bh(&amp;list_lock);
873                 ... remove from list ...
874                 spin_unlock_bh(&amp;list_lock);
875 
876                 put_foo(x);
877         }
878     </programlisting>
879 
880     <sect2 id="helpful-macros">
881      <title>Macros To Help You</title>
882      <para>
883        There are a set of debugging macros tucked inside
884        <filename class=headerfile>include/linux/netfilter_ipv4/lockhelp.h</filename>
885        and <filename class=headerfile>listhelp.h</filename>: these are very
886        useful for ensuring that locks are held in the right places to protect
887        infrastructure.
888      </para>
889     </sect2>
890    </sect1>
891    
892    <sect1 id="sleeping-things">
893     <title>Things Which Sleep</title>
894 
895     <para>
896       You can never call the following routines while holding a
897       spinlock, as they may sleep.  This also means you need to be in
898       user context.
899     </para>
900 
901     <itemizedlist>
902      <listitem>
903       <para>
904         Accesses to 
905         <firstterm linkend="gloss-userspace">userspace</firstterm>:
906       </para>
907       <itemizedlist>
908        <listitem>
909         <para>
910           <function>copy_from_user()</function>
911         </para>
912        </listitem>
913        <listitem>
914         <para>
915           <function>copy_to_user()</function>
916         </para>
917        </listitem>
918        <listitem>
919         <para>
920           <function>get_user()</function>
921         </para>
922        </listitem>
923        <listitem>
924         <para>
925           <function> put_user()</function>
926         </para>
927        </listitem>
928       </itemizedlist>
929      </listitem>
930 
931      <listitem>
932       <para>
933         <function>kmalloc(GFP_KERNEL)</function>
934       </para>
935      </listitem>
936 
937      <listitem>
938       <para>
939       <function>down_interruptible()</function> and
940       <function>down()</function>
941       </para>
942       <para>
943        There is a <function>down_trylock()</function> which can be
944        used inside interrupt context, as it will not sleep.
945        <function>up()</function> will also never sleep.
946       </para>
947      </listitem>
948     </itemizedlist>
949 
950     <para>
951      <function>printk()</function> can be called in
952      <emphasis>any</emphasis> context, interestingly enough.
953     </para>
954    </sect1>
955 
956    <sect1 id="sparc">
957     <title>The Fucked Up Sparc</title>
958 
959     <para>
960       Alan Cox says <quote>the irq disable/enable is in the register
961       window on a sparc</quote>.  Andi Kleen says <quote>when you do
962       restore_flags in a different function you mess up all the
963       register windows</quote>.
964     </para>
965 
966     <para>
967       So never pass the flags word set by 
968       <function>spin_lock_irqsave()</function> and brethren to another 
969       function (unless it's declared <type>inline</type>.  Usually no-one 
970       does this, but now you've been warned.  Dave Miller can never do 
971       anything in a straightforward manner (I can say that, because I have
972       pictures of him and a certain PowerPC maintainer in a compromising 
973       position).
974     </para>
975    </sect1>
976 
977    <sect1 id="racing-timers">
978     <title>Racing Timers: A Kernel Pastime</title>
979 
980     <para>
981       Timers can produce their own special problems with races.
982       Consider a collection of objects (list, hash, etc) where each
983       object has a timer which is due to destroy it.
984     </para>
985 
986     <para>
987       If you want to destroy the entire collection (say on module
988       removal), you might do the following:
989     </para>
990 
991     <programlisting>
992         /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
993            HUNGARIAN NOTATION */
994         spin_lock_bh(&amp;list_lock);
995 
996         while (list) {
997                 struct foo *next = list-&gt;next;
998                 del_timer(&amp;list-&gt;timer);
999                 kfree(list);
1000                 list = next;
1001         }
1002 
1003         spin_unlock_bh(&amp;list_lock);
1004     </programlisting>
1005 
1006     <para>
1007       Sooner or later, this will crash on SMP, because a timer can
1008       have just gone off before the <function>spin_lock_bh()</function>, 
1009       and it will only get the lock after we 
1010       <function>spin_unlock_bh()</function>, and then try to free
1011       the element (which has already been freed!).
1012     </para>
1013 
1014     <para>
1015       This can be avoided by checking the result of 
1016       <function>del_timer()</function>: if it returns
1017       <returnvalue>1</returnvalue>, the timer has been deleted.  
1018       If <returnvalue>0</returnvalue>, it means (in this
1019       case) that it is currently running, so we can do:
1020     </para>
1021 
1022     <programlisting>
1023         retry:  
1024                 spin_lock_bh(&amp;list_lock);
1025 
1026                 while (list) {
1027                         struct foo *next = list-&gt;next;
1028                         if (!del_timer(&amp;list-&gt;timer)) {
1029                                 /* Give timer a chance to delete this */
1030                                 spin_unlock_bh(&amp;list_lock);
1031                                 goto retry;
1032                         }
1033                         kfree(list);
1034                         list = next;
1035                 }
1036 
1037                 spin_unlock_bh(&amp;list_lock);
1038     </programlisting>
1039 
1040     <para>
1041       Another common problem is deleting timers which restart
1042       themselves (by calling <function>add_timer()</function> at the end 
1043       of their timer function).  Because this is a fairly common case 
1044       which is prone to races, you can put a call to
1045       <function>timer_exit()</function> at the very end of your timer function,
1046       and user <function>del_timer_sync()</function> 
1047       (<filename class=headerfile>include/linux/timer.h</filename>)
1048       to handle this case.  It returns the number of times the timer 
1049       had to be deleted before we finally stopped it from adding itself back 
1050       in.
1051     </para>
1052    </sect1>
1053   </chapter>
1054 
1055   <chapter id="references">
1056    <title>Further reading</title>
1057 
1058    <itemizedlist>
1059     <listitem>
1060      <para>
1061        <filename>Documentation/spinlocks.txt</filename>: 
1062        Linus Torvalds' spinlocking tutorial in the kernel sources.
1063      </para>
1064     </listitem>
1065 
1066     <listitem>
1067      <para>
1068        Unix Systems for Modern Architectures: Symmetric
1069        Multiprocessing and Caching for Kernel Programmers:
1070      </para>
1071 
1072      <para>
1073        Curt Schimmel's very good introduction to kernel level
1074        locking (not written for Linux, but nearly everything
1075        applies).  The book is expensive, but really worth every
1076        penny to understand SMP locking. [ISBN: 0201633388]
1077      </para>
1078     </listitem>
1079    </itemizedlist>
1080   </chapter>
1081 
1082   <chapter id="thanks">
1083     <title>Thanks</title>
1084 
1085     <para>
1086       Thanks to Telsa Gwynne for DocBooking, neatening and adding
1087       style.
1088     </para>
1089 
1090     <para>
1091       Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
1092       Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul and Tim
1093       Waugh for proofreading, correcting, flaming, commenting.
1094     </para>
1095 
1096     <para>
1097       Thanks to the cabal for having no influence on this document.
1098     </para>
1099   </chapter>
1100 
1101   <glossary id="glossary">
1102    <title>Glossary</title>
1103 
1104    <glossentry id="gloss-bh">
1105     <glossterm>bh</glossterm>
1106      <glossdef>
1107       <para>
1108         Bottom Half: for historical reasons, functions with
1109         `_bh' in them often now refer to any software interrupt, e.g.
1110         <function>spin_lock_bh()</function> blocks any software interrupt 
1111         on the current CPU.  Bottom halves are deprecated, and will 
1112         eventually be replaced by tasklets.  Only one bottom half will be 
1113         running at any time.
1114      </para>
1115     </glossdef>
1116    </glossentry>
1117 
1118    <glossentry id="gloss-hwinterrupt">
1119     <glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
1120     <glossdef>
1121      <para>
1122        Hardware interrupt request.  <function>in_irq()</function> returns 
1123        <returnvalue>true</returnvalue> in a hardware interrupt handler (it 
1124        also returns true when interrupts are blocked).
1125      </para>
1126     </glossdef>
1127    </glossentry>
1128 
1129    <glossentry id="gloss-interruptcontext">
1130     <glossterm>Interrupt Context</glossterm>
1131     <glossdef>
1132      <para>
1133        Not user context: processing a hardware irq or software irq.
1134        Indicated by the <function>in_interrupt()</function> macro 
1135        returning <returnvalue>true</returnvalue> (although it also
1136        returns true when interrupts or BHs are blocked).
1137      </para>
1138     </glossdef>
1139    </glossentry>
1140 
1141    <glossentry id="gloss-smp">
1142     <glossterm><acronym>SMP</acronym></glossterm>
1143     <glossdef>
1144      <para>
1145        Symmetric Multi-Processor: kernels compiled for multiple-CPU
1146        machines.  (CONFIG_SMP=y).
1147      </para>
1148     </glossdef>
1149    </glossentry>
1150 
1151    <glossentry id="gloss-softirq">
1152     <glossterm>softirq</glossterm>
1153     <glossdef>
1154      <para>
1155        Strictly speaking, one of up to 32 enumerated software
1156        interrupts which can run on multiple CPUs at once.
1157        Sometimes used to refer to tasklets and bottom halves as
1158        well (ie. all software interrupts).
1159      </para>
1160     </glossdef>
1161    </glossentry>
1162 
1163    <glossentry id="gloss-swinterrupt">
1164     <glossterm>Software Interrupt / Software IRQ</glossterm>
1165     <glossdef>
1166      <para>
1167        Software interrupt handler.  <function>in_irq()</function> returns 
1168        <returnvalue>false</returnvalue>; <function>in_softirq()</function>
1169        returns <returnvalue>true</returnvalue>.  Tasklets, softirqs and 
1170        bottom halves all fall into the category of `software interrupts'.
1171      </para>
1172     </glossdef>
1173    </glossentry>
1174 
1175    <glossentry id="gloss-tasklet">
1176     <glossterm>tasklet</glossterm>
1177     <glossdef>
1178      <para>
1179        A dynamically-registrable software interrupt,
1180        which is guaranteed to only run on one CPU at a time.
1181      </para>
1182     </glossdef>
1183    </glossentry>
1184 
1185    <glossentry id="gloss-up">
1186     <glossterm><acronym>UP</acronym></glossterm>
1187     <glossdef>
1188      <para>
1189        Uni-Processor: Non-SMP.  (CONFIG_SMP=n).
1190      </para>
1191     </glossdef>
1192    </glossentry>
1193 
1194    <glossentry id="gloss-usercontext">
1195     <glossterm>User Context</glossterm>
1196     <glossdef>
1197      <para>
1198        The kernel executing on behalf of a particular
1199        process or kernel thread (given by the <function>current()</function>
1200        macro.)  Not to be confused with userspace.  Can be interrupted by 
1201        software  or hardware interrupts.
1202      </para>
1203     </glossdef>
1204    </glossentry>
1205 
1206    <glossentry id="gloss-userspace">
1207     <glossterm>Userspace</glossterm>
1208     <glossdef>
1209      <para>
1210        A process executing its own code outside the kernel.
1211      </para>
1212     </glossdef>
1213    </glossentry>      
1214 
1215   </glossary>
1216 </book>
1217 

~ [ source navigation ] ~ [ diff markup ] ~ [ identifier search ] ~ [ freetext search ] ~ [ file search ] ~

This page was automatically generated by the LXR engine.
Visit the LXR main site for more information.