Subject: Network Stack Locking

This is a status email; don't sweat it.

The high-level view, for those less willing to wade through a greater
level of detail, is that we have substantial work in progress with a
lot of our bases covered, and that we're looking for broader exposure
for the work.  We've been merging smaller parts of the work
(supporting infrastructure, fine-grained locking for specific leaf
dependencies), and are starting to think about larger scale merging
over the next month or two.  There are some known serious issues in
the current work, but we've also identified some areas that need
attention outside of the stack in order to make serious progress on
merging.  There are also some important tasks that require owners
moving forward, and a solicitation for help in those areas.  I don't
attempt to capture everything in this e-mail, in particular details
such as locking strategies.  You will find patch URLs and Perforce
references.

As many of you are aware, I've become the latest inheritor of the
omnibus "Network Stack Locking" task of SMPng.  This work has a
pretty long history that I won't attempt to go into here, other than
to observe that the vast majority of the work discussed in this
e-mail is the product of significant contributions by others,
including Jonathan Lemon, Jennifer Yang, Jeffrey Hsu, Sam Leffler,
and a large number of other contributors (many of whom are named in
recent status reports).

The goal of this e-mail is to provide a bit of high-level information
about what is going on to increase awareness, solicit involvement in
a variety of areas, and throw around words like "merge schedule".
Warning: this is a work in progress, and you will find rough parts.
It is being worked on actively, but by bringing it up during the
process, we can improve the work.  If you see things that scare you,
that's a reasonable response.

Now into the details.  Those following the last few status reports
will know that recent work has focused on the following areas:

- Introducing and refining data-based locking for the top levels of
  the network stack (sockets, socket buffers, et al).

- Refining and testing locking for lower pieces of the stack that
  already have locking.

- Locking for UNIX domain sockets, FIFOs, etc.

- Iterating through pseudo-interfaces and network interfaces to
  identify and correct locking problems.

- Allowing Giant to be conditionally acquired across the entire stack
  using a Giant Toggle Switch (a minimal sketch of the idea follows
  the branch information below).

- Addressing interactions with tightly coupled support infrastructure
  for the stack, including the MAC Framework, kqueue, sigio,
  select(), and general signaling primitives, et al.

- Investigating, and in many cases locking, less popular/less widely
  used stack components that were previously unaddressed, such as
  IPv6, netatalk, netipx, et al.

- Some local changes used to monitor and assert locks at a finer
  granularity than in the main tree.  Specifically, sampling of
  callouts and timeouts to measure what we're grabbing Giant for,
  and, in certain branches, the addition of a great many assertions.

This work is occurring in a number of Perforce branches.  The primary
branch that is actively worked on is "rwatson_netperf", which may be
found at the following path:

  //depot/users/rwatson/netperf/...

Additional work is taking place to explore socket locking issues in:

  //depot/users/rwatson/net2/...

A number of other developers have branches off of these branches to
explore locking for particular subsystems.
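Since the Giant Toggle Switch comes up repeatedly below (it is the
idea behind the "debug.mpsafenet" knob mentioned later in this mail),
here is a minimal sketch of the concept.  The identifier names are
illustrative rather than the exact ones in the patch set:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /*
     * Illustrative only: a global toggle decides whether network
     * entry points wrap themselves in Giant.  With the toggle on,
     * the stack behaves as before; with it off, the stack relies
     * solely on its own fine-grained locks.
     */
    static int net_giant_enabled = 1;    /* flipped via a sysctl/tunable */

    #define NET_GIANT_LOCK() do {                                   \
            if (net_giant_enabled)                                  \
                    mtx_lock(&Giant);                               \
    } while (0)

    #define NET_GIANT_UNLOCK() do {                                 \
            if (net_giant_enabled)                                  \
                    mtx_unlock(&Giant);                             \
    } while (0)

Wrapping the stack's entry points in a pair like this lets the same
kernel run with or without Giant over the stack, so the fine-grained
locking and its assertions can be exercised broadly before Giant is
removed for good.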
There are also some larger unintegrated patch sets for data-based NFS
locking, fixing the user space build, etc.

You can find a non-Perforce version at:

  http://www.watson.org/~robert/freebsd/netperf/

This includes a basic change log and incrementally generated patches,
work sets, etc.  Perforce is the preferred way to get at the work, as
it provides easier access to my working notes, the ability to
maintain local changes, the most recent version, etc.  I try to drop
patches fairly regularly -- several times a week against HEAD -- but
due to travel to BSDCan, I'm about two weeks behind.  I hope to make
substantial headway this weekend in updating the patch set and
integrating a number of recent socket locking changes from various
work branches.

This is a work in progress and has a number of known issues,
including some lock order reversal problems, known deficiencies in
socket locking coverage of socket variables, etc.  However, it is
being reviewed and worked on by an increasingly broad population of
FreeBSD developers, so I wanted to move to a more general patch
posting process and attempt to identify additional "hired hands" for
areas that require additional work.  Here are the current known tasks
and their owners:

  Task                            Developer
  ----                            ---------
  Sockets                         Robert Watson
  Synthetic network interfaces    Robert Watson
  Netinet6                        George Neville-Neil
  Netatalk                        Robert Watson
  Netipx                          Robert Watson
  Interface Locking               Max Laier, Luigi Rizzo,
                                  Maurycy Pawlowski-Wieronski,
                                  Brooks Davis
  Routing Cleanup                 Luigi Rizzo
  KQueue (subsystem lock)         Brian Feldman
  KQueue (data locking)           John-Mark Gurney
  NFS Server (subsystem lock)     Robert Watson
  NFS Server (data locking)       Rick Macklem
  SPPP                            Roman Kurakin
  Userspace build                 Roman Kurakin
  VFS/fifofs interactions         Don Lewis
  Performance measurement         Pawel Jakub Dawidek

And of course, I can't neglect to mention the on-going work of Kris
Kennaway to test these changes on high-load systems :-).

Some noted absences in the above, and areas where I'd like to see
additional people helping out, are:

- Reviewing Netgraph modules for correct interactions with locking in
  the remainder of the system.  I've started pushing some locking
  into ng_ksocket.c and ng_socket.c, and some of the basic
  infrastructure that needed it, but each module will need to be
  reviewed for correct locking.

- ATM -- Harti? :-)

- Network device drivers -- some have locking, some have correct
  locking, and some have potential interactions with other pieces of
  the system (such as the USB stack).  Note that for a driver to work
  correctly with a Giant-free system, it must be safe to invoke
  ifp->if_start() without holding Giant, and if_start() must be aware
  that it cannot acquire Giant without creating a lock order issue (a
  sketch of this constraint follows this list).  It's OK for
  if_input() to be called with Giant held, although that is generally
  undesirable.  Some drivers also have locking that is commented out
  by default due to the use of recursive locks, but I'm not sure that
  is necessarily a sufficient problem not to just turn the locking
  on.

- Complete coverage of synthetic/pseudo-interfaces.  In particular,
  careful attention to if_gif and other "cross-layer" and
  protocol-aware pieces.

- mbuma -- Bosko's work looks good to me; we need to make sure all
  the pieces work with each other.  Getting down to one large memory
  allocator would be great.  I'm interested in exploring uniprocessor
  optimizations here -- I notice that a lot of the locks acquired in
  profiling are for memory allocation.  Exploring critical sections,
  per-CPU variables/caching, and CPU pinning all seem like reasonable
  approaches to reducing synchronization costs here.
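To make the if_start() constraint above concrete, here is a minimal
sketch of the shape a Giant-free transmit routine might take.  The
"xx" driver, its softc layout, and the xx_encap_and_transmit() helper
are hypothetical; only the general pattern matters:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /* Hypothetical per-device state, guarded by the driver's mutex. */
    struct xx_softc {
            struct mtx      sc_mtx;         /* driver lock, not Giant */
            /* ... hardware state ... */
    };

    static void xx_encap_and_transmit(struct xx_softc *, struct mbuf *);

    /*
     * Hypothetical if_start routine: it may be entered without
     * Giant, so it relies only on the driver's own mutex, and it
     * must never acquire Giant itself (lock order).
     */
    static void
    xx_start(struct ifnet *ifp)
    {
            struct xx_softc *sc = ifp->if_softc;
            struct mbuf *m;

            mtx_lock(&sc->sc_mtx);
            for (;;) {
                    IF_DEQUEUE(&ifp->if_snd, m);
                    if (m == NULL)
                            break;
                    xx_encap_and_transmit(sc, m);
            }
            mtx_unlock(&sc->sc_mtx);
    }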
Note that there are some serious issues with the current locking
changes:

- Socket locking is deficient in a number of ways -- primarily in
  that there are several important socket fields that are currently
  insufficiently or inconsistently synchronized.  I'm in the throes
  of correcting this, but that requires a line-by-line review of all
  use of sockets, which will take me at least another week or two to
  complete.  I'm also addressing some races between listen sockets
  and the sockets hung off of them during the new connection setup
  and accept process.  Currently there is no defined lock order
  between multiple sockets, and if possible I'd like to keep it that
  way.

- Based on the BSD/OS strategy, there are two mutexes on a socket:
  each socket buffer (send, receive) has a mutex, and the basic
  socket fields are locked using SOCK_LOCK(), which actually uses the
  receive socket buffer mutex (a minimal sketch of this arrangement
  follows this list).  This reduces locking overhead while helping to
  address ordering issues in the upward and downward paths.  However,
  there are also some issues of locking correctness and redundancy,
  and I'm looking into these as part of an overall review of the
  strategy.  It's worth noting that the BSD/OS snapshot we have has
  substantially incomplete and non-functional socket locking, so
  unlike some other pieces of the network stack, it was not possible
  to adopt the strategy wholesale.  In the long term, the socket
  locking model may require substantial revision.

- Per some recent discussions on -CURRENT, I've been exploring
  mitigating locking costs by coalescing activities on multiple
  packets -- i.e., effectively passing queues of packet chains across
  API boundaries, as well as creating local work queues.  It's a bit
  early to commit to this approach because the performance numbers
  have not confirmed the benefit, but it's important to keep the
  approach in mind across all other locking work, as it trades work
  queue latency against synchronization cost.  My earlier
  experimentation occurred at the end of 2003, so I hope to revisit
  this now that more of the locking is in place to offer us
  advantages in preemption and parallelism.

- The patches enable net.isr.enable by default, which provides
  inbound packet parallelism by processing packets to completion in
  the ithread.  This has other down sides, and while we should
  provide the option, I think we should continue to support forcing
  use of the netisr.  One of the problems with the netisr approach is
  how to accomplish inbound processing parallelism without
  sacrificing the currently strong ordering properties, the loss of
  which could cause bad TCP behavior, etc.  We should seriously
  consider at least some aspects of Jeffrey Hsu's work on DragonFly
  to explore providing for multiple netisrs bound to CPUs, then
  directing traffic based on protocol-aware hashing that permits us
  to maintain sufficient ordering to meet higher level protocol
  requirements while avoiding the cost of maintaining full ordering.
  This isn't something we have to do immediately, but exploiting
  parallelism requires both effective synchronization and effective
  balancing of load.

  In the short term, I'm less interested in the avoidance of data
  synchronization adopted in the DragonFly approach: I'd like to see
  that approach validated on a larger chunk of the stack (i.e.,
  across the more incestuous pieces of the network stack), and to see
  performance numbers that confirm the claims.  The approach we're
  currently taking is tried and true across a broad array of systems
  (almost every commercial UNIX vendor, for example), and offers many
  benefits (such as a very strong assertion model).  However, as
  aspects of the DFBSD approach are validated (or not, as the case
  may be), we should consider adopting things as they make sense.
  The approaches offer quite a bit of promise, but are also very
  experimental and will require a lot of validation, needless to say.

- There are still some serious issues in the timely processing and
  scheduling of device driver interrupts, and these affect
  performance in a number of ways.  They also change the degree of
  effective coalescing of interrupts, making it harder to evaluate
  strategies to lower costs.  These issues aren't limited to the
  network stack work, but I wanted to make sure they were on the list
  of concerns.

- There are issues relating to upcalls from the socket layer: while
  many consumers of sockets simply sleep for wakeups on socket
  pointers, so_upcall() permits the network stack to "upcall" into
  other components of the system.  I believe this was introduced
  initially for the NFS server, to allow initial processing of RPCs
  to occur in the netisr rather than waiting on a context switch to
  the NFS server threads.  However, it's now also used for accept
  sockets, and I'm aware of outstanding changes that modify the NFS
  client to use it as well.  We need to establish what locks will be
  held over the upcall, if any, and what expectations are in place
  for implementers of upcall functions.  At the very least, they have
  to be MPSAFE, but there are also potential lock order issues.

- Locking for KQueue is critical to success.  Without locking down
  the event infrastructure, we can't remove Giant from the many
  interesting pieces of the network stack.  KQueue is an example of a
  high level of incestuousness between layers, and will require
  careful handling.  Brian's approach adopts a single subsystem lock
  for KQueue and as such offers a low-hanging-fruit approach, but
  comes at a number of costs, not least of which are loss of
  parallelism and functionality.  John-Mark's approach appears to
  offer more granular locking with higher parallelism, but at the
  cost of complexity.  I've not yet had the opportunity to review
  either in any detail, but I know Brian has an integrated work
  branch in Perforce that combines his locking with the locking in
  rwatson_netperf, and is performing testing.  There's obviously more
  work to go here, and it is required to get to "Giant-free
  operation".
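Since the two-mutex arrangement above can be hard to visualize, here
is a minimal sketch of the idea.  It assumes the BSD/OS-style layout
described above; treat the field and macro spellings as illustrative
rather than as the exact definitions in the patch set:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /*
     * Each socket buffer carries its own mutex; general socket
     * fields borrow the receive buffer's mutex via SOCK_LOCK().
     */
    struct sockbuf {
            struct mtx      sb_mtx;         /* per-buffer lock */
            /* ... buffer contents, counters, flags ... */
    };

    struct socket {
            struct sockbuf  so_rcv;         /* receive buffer */
            struct sockbuf  so_snd;         /* send buffer */
            /* ... general socket fields ... */
    };

    #define SOCKBUF_LOCK(sb)        mtx_lock(&(sb)->sb_mtx)
    #define SOCKBUF_UNLOCK(sb)      mtx_unlock(&(sb)->sb_mtx)
    #define SOCKBUF_LOCK_ASSERT(sb) mtx_assert(&(sb)->sb_mtx, MA_OWNED)

    /* General socket state shares the receive buffer's mutex. */
    #define SOCK_LOCK(so)           SOCKBUF_LOCK(&(so)->so_rcv)
    #define SOCK_UNLOCK(so)         SOCKBUF_UNLOCK(&(so)->so_rcv)
    #define SOCK_LOCK_ASSERT(so)    SOCKBUF_LOCK_ASSERT(&(so)->so_rcv)

One consequence of this layout is that the send and receive paths can
proceed in parallel on a single socket, while code touching general
socket state contends only with the receive path.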
For more complete changes and history, I would refer you to the last
few FreeBSD Status Reports on network stack locking.  I would also
encourage you to contact me if you would like to claim some section
of the stack for work, so that I can coordinate activities.  These
patch sets have been pounded heavily in a wide variety of
environments, but there are several known issues, so I would
recommend using them cautiously.

In terms of merging: I've been gradually merging a lot of the
infrastructure pieces as I went along.  The next big chunks to
consider merging are:

- Socket locking.  This needs to wait until I'm happier with the
  strategy.

- UNIX domain socket locking.  This is probably an early candidate,
  but because of potential interactions with the socket locking
  changes, I've been deferring the merge.

- NFS server locking.  I had planned to merge the current subsystem
  lock quickly, but then Rick turned up with fine-grained data-based
  locking of the NFS server, and NFSv4 server code, when I asked him
  for review of the subsystem lock, so I've been holding off.

- Additional general infrastructure, such as more pseudo-interface
  locking, fifofs stuff, etc.

I'll continue on the gradual incremental merge path I have been on
for the past few months.  It's obviously desirable to get things
merged as soon as they are ready, even with Giant remaining over the
stack, so that we can get broad exercising of the locking assertions
under INVARIANTS and WITNESS (a brief example follows).  As such,
over the next month I anticipate an increasing number of merges, and
increasing usability of "debug.mpsafenet" in the main tree.
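As an example of why merging with Giant still in place is useful:
under INVARIANTS, assertions in the style of the SOCK_LOCK_ASSERT()
sketch earlier catch unlocked access to socket state even while Giant
still covers the stack, and WITNESS reports lock order problems as
the new locks are exercised.  A hypothetical consumer, building on
that sketch (the helper name is made up for illustration):

    /*
     * Hypothetical helper: callers must hold the socket lock.
     * Under INVARIANTS the assertion panics on violation, so the
     * locking discipline is tested long before Giant is removed.
     */
    static void
    soupdate_state(struct socket *so, int state)
    {
            SOCK_LOCK_ASSERT(so);   /* enforce the documented contract */
            so->so_state |= state;
    }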
Turning off Giant will likely lead to problems for some time to come,
but the sooner we get exposure, the better life will be.  We've done
a lot of heavy testing of common code paths, but working out the edge
cases will take some time.  We're prepared to live in a world with a
dual-mode stack for some period, but that has to be an interim
measure.

So I guess the upshot is: "Stuff is going on, be aware, volunteer to
help!"

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research