Fixing the delay when errors occur in serio resume
While testing out various platforms I came across a peculiar resume delay on an ivy bridge laptop. The 3.2 version of analyzesuspend just showed an empty block of about 1.5 seconds in the resume_complete phase. It seems that the delay wasn't actually happening inside of a device_complete callback, but rather in the dpm_complete call one layer above (very curious). So I patched the kernel to expand the trace coverage so I could identify the culprit (and pushed the changes upstream). It turns out the delay was caused by a serio device. On this laptop there are two i8042 peripherals: keyboard and mouse. Both are configured as serio devices and are resumed in series one after the other. The keyboard resumes in about 40ms, but the mouse seems to take nearly 2 seconds! Here's the full timeline with some additional kprobe data instrumented to get the full picture of what's happening.
When serio devices are resumed, a reconnect command is queued up asynchronously in kseriod. The serio_handle_event call is the worker thread that instantiates to take care of the two reconnect events (keyboard and mouse). It first calls the serio_reconnect_port function for the keyboard (the smallest green block in the 2nd row of resume) and succeeds after about 40ms. It then calls serio_reconnect_port for the mouse and runs for much longer. After about 260ms the mouse reconnect fails (not visible in the timeline). When this happens, the serio driver initiates a full rescan for the device which entails a disconnect and re-initialize (the blocks in red).The first red block is the call which disconnects the device. The second red block is the serio_find_driver call which fully re-initializes the device (and this is where the trouble happens).
The serio_find_driver call locks the device object before re-initializing it. The problem is the PM subsystem is still executing, and it also needs to get a mutex lock on the device in order to begin the resume complete phase. The PM dpm_complete function loops through all devices in the resume queue, checking for and calling their complete callbacks one by one. This check reaches serio1 just a few milliseconds after the start of the resume complete phase (the border between dark yellow and light yellow). The mutex lock attempt fails and the resume complete phase is delayed until the serio1 device's mutex lock is released. Thus the resume is backed up for everyone until serio_find_driver is finished!
One possible solution is to push the serio error handling to the complete phase instead of forcing the system to do it all in resume. Serio doesn't currently implement a complete callback. The serio device struct can include a simple resume_error flag which is set to indicate that steps need to be taken later to finish up. A complete call can be added which checks this flag and issues the disconnect/reinitialize to kseriod. The following is a timeline on the same system with the kernel patched in this way.
This completely removes the effects of the serio resume error from the user's perspective. It also presents a vast performance improvement (from 2300 ms to 980 ms!). The serio_handle_event call happens the same as before, but the reconnect function simply returns after the error occurs. The error is handled just as before, but it occurs in the "post resume" phase, which is execution that happens after user mode has been restored.
The only minor drawback is that the mouse will take slightly longer to come back online, but as you can see it's only on the order of a couple hundred milliseconds. I think this falls under the threshold of what a user would notice.