Day 30 of 60: What are the single queue directory bottlenecks? (pt 2)

Having established that there's a significant increase in the amount of time taken by the fdsync() and open() system calls when Sendmail creates queue entries with a single queue directory, I've set about tracking down what that bottleneck is.



To confirm that these system calls were the ones most dramatically affected by multiple queue directories, I re-ran the tests with 10 queue directories. Everything else about the environment was the same.

This was test 11, and the raw data for test 11 is available.

These next charts show the results from this test compared against the results from test 10. This first chart compares the average execution time of each system call. The results from test 10 are in green, those from test 11 in red.

Test 11, average syscall exec time compared against test 10

As expected given the results from yesterday, this shows the dramatic difference when using multiple queue directories.

Totalling up the execution time for the system calls paints a similar picture. Again, this compares test 10 against test 11, with the test 10 results in green.

Test 11, total syscall exec time compared against test 10

It's obvious that fdsync() and open() should be instrumented in more detail.

To do this I created queue-create-fbt.d. This script uses the fbt provider (hence the name) to trace the kernel functions that are called by a particular syscall.
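
I haven't reproduced queue-create-fbt.d here, but the general technique looks something like the following sketch: flag a thread while it is inside the syscall of interest, and accumulate the elapsed time of every kernel function called on that thread in the meantime. Note that fdsync() and sendmail are assumptions here, the real script may differ, and enabling every fbt probe carries significant overhead of its own.

    #!/usr/sbin/dtrace -s

    /* Flag the thread when it enters the syscall we care about. */
    syscall::fdsync:entry
    /execname == "sendmail"/
    {
        self->tracing = 1;
    }

    /* Record an entry timestamp for each kernel function called on this thread. */
    fbt:::entry
    /self->tracing/
    {
        self->fstart[probefunc] = timestamp;
    }

    /* Accumulate elapsed time per kernel function (recursion is not handled). */
    fbt:::return
    /self->tracing && self->fstart[probefunc]/
    {
        @total[probefunc] = sum(timestamp - self->fstart[probefunc]);
        self->fstart[probefunc] = 0;
    }

    /* Stop recording when the syscall returns. */
    syscall::fdsync:return
    /self->tracing/
    {
        self->tracing = 0;
    }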

Test 12 - 1 queue directory, 1 concurrent sender, fdsync()



View raw data

The purpose of this test is to see what fdsync() is doing in the simple case of 1 queue directory and 1 sender. We already know that this is very fast. These results can then be compared against a second test that uses multiple concurrent senders to see which kernel functions are taking the most time.

Test 13 - 1 queue directory, 1 concurrent sender, open()



View raw data

This is identical to the previous test, except that calls initiated by open() were instrumented.

Test 14 - 1 queue directory, 10 concurrent senders, fdsync()



View raw data

This is the same as test 12, but with 10 concurrent senders rather than 1.

Test 15 - 1 queue directory, 10 concurrent senders, open()



View raw data

This is the same as test 13, but with 10 concurrent senders rather than 1.

Results - fdsync()



I've taken the results from tests 12 and 14, sorted them by total elapsed time for each function call, and plotted the results against each other.
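
However the sorting is actually done, DTrace can also trim an aggregation to its largest entries at print time. A hypothetical END clause like this (assuming a sum() aggregation named @total, as in the sketch above) keeps just the 15 functions with the largest totals:

    /* Print only the 15 functions with the largest accumulated time. */
    END
    {
        trunc(@total, 15);
        printa("%-32s %@d\n", @total);
    }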

This chart shows the top function calls and their elapsed time. The results from test 12 are in green, those from test 14 in red. If a function only has a green bar (e.g., push_dirty_bp), it was in the top 15 for test 12 but not for test 14. Similarly, a bar that is only red (e.g., mutex_vector_enter) indicates a function that was in the top 15 for test 14 but not test 12.

Test 12, total function call exec time compared against test 14

Well, we can see here where much of the additional overhead has come from. Calls to mutex_vector_enter(), non-existent in test 12, dominate the results in test 14.

Given that fdsync() cannot return until the data has been written to disk, this would suggest that multiple fdsync() calls on files in the same directory serialise, one behind the other, presumably synchronising on a mutex to do so.
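
One way to test that hypothesis, which I'll present only as a sketch, would be to count the kernel call paths that reach mutex_vector_enter() during the test (the stack depth here is arbitrary):

    /* Count the kernel call paths that end up contending on a mutex. */
    fbt::mutex_vector_enter:entry
    /execname == "sendmail"/
    {
        @paths[stack(10)] = count();
    }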

Results - open()



This chart shows the results from tests 13 and 15, plotted in the same way.

Test 13, total function call exec time compared against test 15

It's apparent here that much of the overhead is in code related to locking. Almost two seconds is spent in cv_wait(), which suspends the calling thread. There's increased time spent in splx(), which is normally associated with blocking on interrupts, and additional time spent in lock_set_spl() and disp_lock_enter(), whose names strongly suggest lock contention.
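
A possible follow-up, again just a sketch rather than something I've run, would be to attribute the time spent asleep in cv_wait() to the kernel stacks that called it:

    /* Attribute time spent asleep in cv_wait() to its kernel callers. */
    fbt::cv_wait:entry
    /execname == "sendmail"/
    {
        self->cvstart = timestamp;
    }

    fbt::cv_wait:return
    /self->cvstart/
    {
        @cvtime[stack(10)] = sum(timestamp - self->cvstart);
        self->cvstart = 0;
    }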

If I had more time I'd dig into the OpenSolaris kernel code to extract more details about this. I'll leave that as an exercise for the reader.

What happens with multiple queue directories?



The last thing to look at is to see how the results change when multiple queue directories are used.

To that end I repeated tests 14 and 15, but with 10 queue directories. These are tests 16 and 17.

Here's the fdsync() comparison. The 1 sender/1 queue directory data is in green, the 10 senders/1 queue directory data in red, and the 10 senders/10 queue directories data in yellow.

Test 12, total function call exec time compared against tests 14 and 16

I confess, I'm not entirely sure what's going on here. The results from comparing test 10 against test 11 clearly showed that fdsync() had a lower cost with multiple queue directories. Yet here many of the kernel function calls take longer.

The total time spent executing these calls is also greater.

Test  Description                                               Time (ns)
10    Total time spent in fdsync()                              3,255,859,582,382
14    Total time spent in kernel functions called by fdsync()   285,960,093,086
11    Total time spent in fdsync()                              551,254,035,562
16    Total time spent in kernel functions called by fdsync()   324,482,491,954


I imagine that this means there is additional work being carried out by fdsync() that is not being accounted for by the kernel function calls. I'm not entirely sure how to go about tracking that down, but it's going to be an excellent question for the DTrace mailing list...
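
One check worth making before asking would be to compare wall-clock time with on-CPU time across fdsync(): a large gap would suggest the missing time is spent sleeping rather than executing the traced functions. A rough sketch, assuming the same sendmail workload:

    /* Compare wall-clock and on-CPU time spent inside fdsync(). */
    syscall::fdsync:entry
    /execname == "sendmail"/
    {
        self->wall = timestamp;    /* wall-clock nanoseconds */
        self->cpu = vtimestamp;    /* on-CPU nanoseconds for this thread */
    }

    syscall::fdsync:return
    /self->wall/
    {
        @times["wall-clock"] = sum(timestamp - self->wall);
        @times["on-CPU"] = sum(vtimestamp - self->cpu);
        self->wall = 0;
        self->cpu = 0;
    }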

The data gathered by instrumenting the kernel functions called by open() makes a little more sense. Here's the chart.

Test 13, total function call exec time compared against tests 15 and 17

Here, at least, it's obvious that in the 10-queue-directory case (yellow) open() is not spending significantly more time in the kernel than in the 1 queue/1 sender case (green).

I'll ask some DTrace gurus about the fdsync() discrepancies, and report back.

1 comment:

  1. Did you actually get any results on the fdsync() strangeness?

    But anyway: great series of posts (just stumbled over it while checking some background on fdsync() performance).

    Greetings
    Bernd
