Given the overhead in the fdsync() and open() system calls when Sendmail creates queue entries with a single queue directory, I've set about tracking down what that bottleneck is.

To confirm that these system calls were the ones most dramatically affected by multiple queue directories I re-ran the tests, but with 10 queue directories. Everything else about the environment was the same.
This was test 11, and the raw data for test 11 is available.
These next charts show the results from this test compared against the results from test 10. This first chart compares the average execution time of each system call. The results from test 10 are in green, those from test 11 in red.
As expected given the results from yesterday, this shows the dramatic difference when using multiple queue directories.
Totalling up the execution time for the system calls paints a similar picture. Again, this compares test 10 against test 11, with the test 10 results in green.
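For reference, per-syscall averages and totals like these can be gathered with the syscall provider. This is only a minimal sketch, not necessarily the script used for tests 10 and 11, and the execname filter on "sendmail" is an assumption about how the load was generated.

```
#!/usr/sbin/dtrace -s

/*
 * Sketch only: average and total elapsed time per system call
 * made by sendmail.
 */

syscall:::entry
/execname == "sendmail"/
{
	self->ts[probefunc] = timestamp;
}

syscall:::return
/self->ts[probefunc]/
{
	@avg[probefunc] = avg(timestamp - self->ts[probefunc]);
	@total[probefunc] = sum(timestamp - self->ts[probefunc]);
	self->ts[probefunc] = 0;
}
```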
It's obvious that fdsync() and open() should be instrumented in more detail. To do this I created queue-create-fbt.d. This script uses the fbt provider (hence the name) to trace the kernel functions called by a particular syscall.
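The idea is simple: flag the thread when it enters the syscall of interest, and while the flag is set, sum the time spent in each kernel function it calls. Here's a minimal sketch of that approach; the real queue-create-fbt.d may differ in detail, and again the filter on execname == "sendmail" is an assumption.

```
#!/usr/sbin/dtrace -s

/*
 * Sketch only: time the kernel functions called while a sendmail
 * thread is inside fdsync(2).  queue-create-fbt.d may differ.
 */

syscall::fdsync:entry
/execname == "sendmail"/
{
	self->in_syscall = 1;
}

fbt:::entry
/self->in_syscall/
{
	self->fts[probefunc] = timestamp;
}

fbt:::return
/self->in_syscall && self->fts[probefunc]/
{
	/* Accumulate elapsed time per kernel function */
	@elapsed[probefunc] = sum(timestamp - self->fts[probefunc]);
	self->fts[probefunc] = 0;
}

syscall::fdsync:return
/self->in_syscall/
{
	self->in_syscall = 0;
}
```

The same skeleton instruments open() by swapping the syscall probe names.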
Test 12 - 1 queue directory, 1 concurrent sender, fdsync()
View raw data
The purpose of this test is to see what fdsync() is doing in the simple case of 1 queue directory and 1 sender. We already know that this is very fast. These results can then be compared against a second test that uses multiple concurrent senders to see which kernel functions are taking the most time.

Test 13 - 1 queue directory, 1 concurrent sender, open()
View raw data
This is identical to the previous test, except that calls initiated by open() were instrumented.

Test 14 - 1 queue directory, 10 concurrent senders, fdsync()
View raw data
This is the same as test 12, but with 10 concurrent senders rather than 1.
Test 15 - 1 queue directory, 10 concurrent senders, open()
View raw data
This is the same as test 13, but with 10 concurrent senders rather than 1.
Results - fdsync()
I've taken the results from tests 12 and 14, sorted them by total elapsed time for each function call, and plotted the results against each other.
This chart shows the top function calls and their elapsed time. The results from test 12 are in green, those from test 14 are in red. If a function only has a green bar (e.g., push_dirty_bp) it indicates that this function was in the top 15 for test 12, but not for test 14. Similarly, the presence of just a red bar (e.g., mutex_vector_enter) indicates that this function was in the top 15 for test 14 but not test 12.

Well, we can see here where much of the additional overhead has come from. Calls to mutex_vector_enter(), non-existent in test 12, dominate the results in test 14.

Given that fdsync() cannot return until the data has been written to disk, this would suggest that multiple fdsync() calls to files in the same directory will serialise, one behind the other, presumably synchronising on a mutex to do so.
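One way to check that theory beyond the fbt data would be the lockstat provider, which fires a probe whenever a thread blocks on an adaptive mutex. This is a hypothetical sketch, not one of the scripts used for these tests:

```
#!/usr/sbin/dtrace -s

/*
 * Sketch only: sum the time sendmail threads spend blocked on
 * adaptive mutexes, keyed by the kernel stack that blocked.
 */

lockstat:::adaptive-block
/execname == "sendmail"/
{
	/* arg1 is the blocking time in nanoseconds */
	@blocked[stack()] = sum(arg1);
}
```

If the serialisation theory is right, the hottest stacks should sit under the fdsync() code path.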
Results - open()
This chart shows the results from tests 13 and 15, plotted in the same way.
It's apparent here that much of the overhead is in code related to locking. Almost two seconds is spent in cv_wait(), which suspends the calling thread. There's increased time spent in splx(), which is normally associated with blocking interrupts, and additional time spent in lock_set_spl() and disp_lock_enter(), whose names strongly suggest lock contention.

If I had more time I'd dig in to the OpenSolaris kernel code to extract more details about this. I'll leave that as an exercise for the reader.
What happens with multiple queue directories?
The last thing to look at is how the results change when multiple queue directories are used.

To that end I repeated tests 14 and 15, but with 10 queue directories. These are tests 16 and 17.
Here's the fdsync() comparison. The 1 sender/1 queue directory data is in green, the 10 senders/1 queue directory data in red, and the 10 senders/10 queue directories data in yellow.

I confess, I'm not entirely sure what's going on here. The results from comparing test 10 against test 11 clearly showed that fdsync() had a lower cost with multiple queue directories. Yet here many of the kernel function calls take longer. The total time spent executing these calls is also greater.
Test | Descr. | Time (ns) |
---|---|---|
10 | Total time spent in fdsync() | 3,255,859,582,382 |
14 | Total time spent in kernel functions called by fdsync() | 285,960,093,086 |
11 | Total time spent in fdsync() | 551,254,035,562 |
16 | Total time spent in kernel functions called by fdsync() | 324,482,491,954 |
I imagine that this means there is additional work being carried out by fdsync() that is not being accounted for by the kernel function calls. I'm not entirely sure how I'd go about tracking that down, but it's going to be an excellent question for the DTrace mailing list...

The data gathered instrumenting kernel calls made by open() makes a little more sense. Here's the chart.

Here, at least, it's obvious that in the 10-queue-directory case (yellow) open() is not spending significantly more time in the kernel than in the 1 queue/1 sender case (green).

I'll ask some DTrace gurus about the fdsync() discrepancies, and report back.
Did you actually get any results on the fdsync() strangeness?
But anyway: great series of posts (just stumbled over it while checking some background on fdsync() performance).
Greetings
Bernd