Identity theft

I've just discovered that I've been an unwitting participant in an identity theft.

But not, perhaps, in the way that you might imagine.

use.perl journals and full text feeds

One of the sites I make a point of reading regularly is use.perl, and in particular, the user journals / blogs. They don't take too long to read, and there's normally a couple of posts a day that teach me something I didn't know about Perl, or that highlight a new module that's doing something useful.

But there's a problem.

SVN::Web and Google Code Hosting

You're probably aware of Google's code hosting service. They use Subversion as their revision control system, so if you want to contribute to a project hosted there you really need a Subversion client.

I was reading the FAQ for the hosting service the other day and a particular entry struck me.

Planet Subversion

I've been experimenting with Plagger, a tool for plugging together chains of filters, pumping RSS/ATOM feeds in one end, and getting transformed output at the other end.

This doesn't have to be as simple as chaining a few XSLT transformations together, as Plagger filters can carry out additional actions (such as e-mailing the results to you, calling on the power of Perl modules to create summaries, and so on).
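To give a flavour of it, a Plagger config is just a YAML file listing the plugins to run, in order: a subscription plugin to pull the feeds in, any filters in the middle, and a publishing plugin at the end. Something like this (a simplified sketch, not my real config -- the feed URL and output path are placeholders, and you should check the plugin documentation for the exact option names):

    plugins:
      - module: Subscription::Config
        config:
          feed:
            - http://example.com/subversion-news.rss   # placeholder feed

      - module: Publish::Planet
        config:
          dir: /var/www/planet                         # placeholder output path
          title: Planet Subversion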

As a learning exercise, I've built Planet Subversion (edit: I've updated the URL to point to the official domain). This takes feeds from a number of different sources and builds a "Planet" site from them. And, of course, with Plagger being open source, it's easy to contribute any fixes back to the author.

Please let me know if you use Subversion and can recommend any other feeds to add.

If you'd like to produce your own aggregation site using Plagger, here's the config file that I'm using for Planet Subversion.

Trigger happy hosting / spam @ The Guardian

Two spam related pieces of information today.

The first concerns what happens if you're hosted at an ISP with an anti-spam policy, an itchy-trigger finger, and a support desk that is devoid of clue.

It appears as though the fine folk over at The Weekly had their infrastructure on a shared server at their ISP, HostingPlex. That same server was then used by a spammer to send spam, which was caught by SpamCop. Rather than track down the actual culprits, HostingPlex have locked The Weekly's account, and are demanding US$150 to reinstate access to the server, while ignoring repeated e-mails from The Weekly that contain what appears to be pretty straightforward evidence of their innocence.

I paraphrase somewhat; you can read The Weekly's side of the story for yourself.

In other news, "Thoughts on stopping spam" appeared, somewhat edited, as an article in The Guardian yesterday.

CAPTCHA farming

Charles Arthur's wondering why spam came through his CAPTCHA system, and concludes that people are probably being paid to sit there and fill out CAPTCHAs.

There are a couple of other possibilities. The first is that the CAPTCHA system he's using may have been broken; some OCR software can be surprisingly effective against common CAPTCHA images.

The second is that his CAPTCHAs are being reproduced on another site for humans to solve. The canonical example would be where a visitor to a porn site is shown a CAPTCHA and asked to solve it before they can, er, continue. Unbeknownst to them, however, the CAPTCHA is actually coming from Charles' system, and the solution is then used to send him spam. This is "CAPTCHA farming".

Searching for "CAPTCHA porn" turns up a number of stories about this over the past few years.

Thoughts on stopping spam

I was pinged on IRC earlier today by Robbie, who was having an e-mail discussion with Charles Arthur of the Guardian in response to this article on Six steps to stopping spam. Since I spend a lot of my day job doing anti-spam engineering for a large organisation, Robbie thought that I might have some useful comment.

I've fired an e-mail off, which I reproduce below, in the hope that it might be useful to a wider audience.

Day 59 of 60: Final thoughts

This system will be going back to Sun soon, while I wait to find out whether or not they've decided to grant me the system. In the meantime, here are some final thoughts on the last 59 days.

Day 59 of 60: Developer benchmarks (pt 4)

Yesterday's tests show that using gcc, on both FreeBSD and Solaris, yields a marked improvement over Sun's compiler in the time taken to compile Perl.

However, despite the big difference in compile times, the run times of Perl's test suite aren't dramatically affected. The worst performer (Perl on Solaris, compiled with Sun's cc with optimisation) is only 6% slower than the best (Perl on FreeBSD, compiled with gcc with optimisation). That test involved a great deal of I/O and process creation, and I thought that might be part of the reason for the differences. So I've been using a Perl-based application, SpamAssassin, to test whether or not there are big differences between the run times of the various Perl interpreters.
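Roughly, the comparison is just a matter of timing SpamAssassin over the same small batch of messages on each system, along these lines (a sketch -- the corpus path is a placeholder, and this isn't necessarily the exact harness I used):

    # push every message in the corpus through SpamAssassin and time the lot
    time sh -c 'for m in corpus/*.eml; do
        spamassassin < "$m" > /dev/null
    done'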

Day 58 of 60: Developer benchmarks (pt 3)

Yesterday I looked at performance compiling Sendmail on Solaris and FreeBSD using Sun's compiler (on Solaris) and gcc (on both systems).

In the tests gcc came out handily ahead, with gcc on FreeBSD being 16% faster than gcc on Solaris with low optimisation options, and 12% faster than gcc on Solaris with optimisation turned on. Sun's compiler took more than twice as long as gcc on either system.

Today I've been looking at the time taken to compile Perl on both systems, using both compilers, with and without optimisation.
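For reference, each data point is simply a timed build of the same Perl source tree, configured for the compiler and optimisation level under test, along these lines (a sketch; the flags shown are illustrative rather than the exact ones I used):

    # gcc, with optimisation
    sh Configure -des -Dcc=gcc -Doptimize='-O2'
    time make

    # Sun Studio cc, with optimisation
    make distclean
    sh Configure -des -Dcc=cc -Doptimize='-xO3'
    time make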

Day 57 of 60: Developer benchmarks (pt 2)

As I explained on day 55, I've been comparing GCC and the Sun Studio compiler on Solaris against GCC on FreeBSD, to see if there are any significant differences in the time taken to compile the applications, and if there are, whether that difference is reflected in the time taken by the applications to run. I used gcc 3.4.3 on both systems.

Day 55 of 60: Installing FreeBSD on a Sun Ultra 40

Fetching and installing FreeBSD was relatively painless. I downloaded the ISO image for FreeBSD 6.1 (AMD64 build), and wrote it to a blank CD.

When I reinstalled Solaris some weeks back I made sure to leave some space on the disk for FreeBSD. So then it was a case of booting from the CD.

This was my first time booting FreeBSD on AMD64, but the process was reassuringly similar to that on x86. FreeBSD is considerably more verbose at boot time than Solaris 10 is, being much more like earlier Solaris releases. Personally I find this to be no bad thing, at least as the default option.

It was then that I ran into a problem -- the keyboard on the machine is USB. It worked fine during the boot process, but once the kernel had started it was unresponsive.

Day 55 of 60: Developer benchmarks (pt 1)

The last week has been quite busy with work that's not related to this project. Mindful that the 60-day time limit is almost up, and aware that I've not done any actual benchmarking of this workstation -- that is, "How does Solaris on this hardware compare against another OS on the same system?" -- I've started doing some investigation.

Sun bill this machine as a developer workstation, so I thought I'd look at how speedy it is at carrying out tasks that developers do. I also thought it would be worthwhile carrying out a few performance benchmarks relating to a real-world application that I currently run on Solaris.

Day 46 of 60: Queue sort strategies

I've been looking at different queue sort strategies to see what their overhead is. Since all the messages are going to be delivered to a single host these results aren't necessarily going to be indicative of what you would see on a production server. However, they should serve to illustrate any inherent speed advantages of one sort strategy over another.
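The sort strategy is controlled by Sendmail's QueueSortOrder option, which is a one-line change in the .mc file for each run (a sketch -- check your Sendmail version's cf/README for the full list of values):

    dnl pick one of Host, Time, Filename, Priority, Random, Modification, None
    define(`confQUEUE_SORT_ORDER', `Host')dnl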

Read on for the results.

Day 43 of 60: Multiple queues, multiple queue runners (pt 3)

It's definitely a bug.

Specifically, in the default case, and contrary to the documentation, sendmail does not run one queue runner for every queue directory. It runs precisely one. I brought this up on the Sendmail mailing list, sendmail-2006@support.sendmail.org. The most recent message in that discussion (at the time of writing) follows.

Day 38 of 60: Multiple queues, multiple queue runners (pt 1)

I've started to get data about the effect of multiple queues with multiple queue runners.

As before I'm using 1, 5, 10, 20, 30, and 40 queue directories, and I'm instrumenting with queue-run-duration.d. This time I'm starting queue runners with the command sendmail -q30s. This will cause Sendmail to create a new queue runner to process the queue every 30 seconds.

The problem is that, even with 30,000 messages, Sendmail can process the whole lot in about 40 seconds, which doesn't give enough time for more than two queue runners to start.

So I'm using the -w option to smtp-sink(1) to insert a 1 second delay at the DATA stage. So (roughly) the first 30 messages go through at the rate of one per second. Then a second queue runner starts, and messages go at the rate of 2 a second, and so on.
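Putting the pieces together, the harness boils down to a throttled sink plus timed queue runners, roughly like this (the address, port, and backlog are illustrative, not necessarily what I'm actually using):

    # a dummy SMTP server that pauses 1 second at the DATA stage
    smtp-sink -c -w 1 127.0.0.1:25 1024 &

    # process the queue with a fresh queue runner every 30 seconds
    sendmail -q30s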

But it's still slow going.

As I write this it occurs to me that I could use DTrace to induce this slowdown, by using chill() to have the process pause for a tenth of a second at the start of every job run. That's something I may look at later. As Thursday is my normal day in London, look for updates on Friday.
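For the record, the idea would be something like the one-liner below. chill() is one of DTrace's destructive actions, so it needs -w, and the function name here is a guess at a plausible probe point rather than something I've checked (dowork is assumed):

    # pause for a tenth of a second at the start of every job run
    dtrace -w -p `pgrep -x sendmail` -n '
        pid$target::dowork:entry { chill(100000000); }
    '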

Day 38 of 60: Multiple queues, one queue runner

Today I'm looking at the results that I've obtained from the latest round of tests. These tests used sendmail -q to deliver 30,000 messages to a different zone. There were 10 runs of each test, and the different tests collected data on timings for 1, 5, 10, 20, 30, and 40 queue directories.
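In case it isn't obvious how the multi-queue configurations are set up: Sendmail's QueueDirectory option takes a trailing wildcard, so each test just needs the right number of directories and one line of configuration (paths here are illustrative):

    # create ten queue directories under the spool directory
    cd /var/spool/mqueue && mkdir q1 q2 q3 q4 q5 q6 q7 q8 q9 q10

    # in the .mc file: use every directory whose name starts with "q"
    define(`QUEUE_DIR', `/var/spool/mqueue/q*')dnl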

Day 37 of 60: Instrumenting queue processing time

Previously I've written about variables that may affect how rapidly Sendmail can process the mail queue. I've now started working to gather data on exactly how much influence these variables have.

Day 33 of 60: Strategies for processing the queue

Note: If you're not familiar with sendmail queues, the sendmail queue primer I wrote might be useful.

There are two aspects of mail queue management to consider with Sendmail. The first is the process that puts messages in the queue. I've looked at that in some detail already, and written a number of D scripts that should make it easy for you to instrument Sendmail on your production systems so you can decide how best to lay out your queue directories for optimal inbound performance.

The flip side of the coin is to try and answer the question "How do you maximise delivery from the queue?" This is a more complex question to answer, as there are far more variables under your control that affect it. Also, there's more variability when delivering mail, as you are at the mercy, to some extent, of each remote site -- how fast they process mail you send them, whether or not they're actually up, how much latency there is between you and them, the speed of DNS lookups, and so on.

So, what can we test?

Day 32 of 60: Complete instrumentation of queue creation

Or: "How do I use DTrace with programs that fork?"

With some help from the dtrace-discuss[1] mailing list I've now written a couple of D scripts that can trace what Sendmail is doing between probe points. There's a writeup, and sample output, below the fold.

[1] Note -- the forum archive doesn't seem to link to the discussion yet. When it does I'll update this link to point to the discussion. The subject was "Using pid provider when process forks".
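One way around the underlying problem -- the pid provider is tied to a single process and won't follow a fork -- is to predicate on execname with the proc and syscall providers instead, so that the parent and all of its children are caught in one go. A rough sketch (these probes are examples, not the scripts from the writeup):

    # report forks and file opens by any sendmail process
    dtrace -qn '
        proc:::create /execname == "sendmail"/ {
            printf("%d forked child %d\n", pid, args[0]->pr_pid);
        }
        syscall::open*:entry /execname == "sendmail"/ {
            printf("%d open %s\n", pid, copyinstr(arg0));
        }
    '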

Day 28 of 60: Instrumenting Sendmail queue file creation (pt 4)

Yesterday I looked at the effect of multiple queue directories when processing messages over a single connection.

Today I've been looking at how multiple queue directories can help when processing concurrent connections.

The methodology was identical to the previous tests. The only change was to the smtp-source(1) command line. The previous tests were run with -s 1, indicating one concurrent connection. These tests were run with -s 10, to force 10 concurrent connections.
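For reference, the smtp-source invocation for the concurrent case looks roughly like this (addresses and counts are placeholders; only the -s value changed between the two sets of tests):

    # many messages over 10 concurrent SMTP sessions
    smtp-source -c -s 10 -m 30000 \
        -f sender@test.example -t rcpt@test.example \
        127.0.0.1:25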

Day 27 of 60: Instrumenting Sendmail queue file creation (pt 3)

I've committed the first sets of results to the repository, in the aptly named results/ directory.

To refresh your memory, the question I intended to answer was:

does the number of queue directories (on a single disk) make a significant impact on the time taken to create new entries in the queue?


The results are quite surprising.

Disclaimer

I do not suggest that you use any results shown here to make decisions about your own production system(s).

There are a number of things about these tests that mean they're not directly applicable to the real world.

1. Everything’s running off one disk, and hence one disk controller.
2. I haven’t gone out of my way to tune the Solaris install.
3. The SMTP traffic is being generated by the same machine that’s receiving it.
4. The test machine is running a number of services that you wouldn’t expect to see on a real mail server. For instance, a GNOME desktop, web browser, and so on.

However, you should be able to use the methods shown here to generate data about your production systems, and make a decision accordingly.

Day 27 of 60: Instrumenting Sendmail queue file creation (pt 2)

It's time to run an instrumented Sendmail, throw some messages at it, and see how it performs. Specifically, does the number of queue directories (on a single disk) make a significant impact on the time taken to create new entries in the queue?

Day 26 of 60: Instrumenting Sendmail queue file creation (pt 1)

I've (finally) got Sendmail built, zones configured, DTrace working for functions declared static, and a mechanism for creating test SMTP sessions.

So it's time to start putting this together, instrumenting Sendmail, and seeing whether or not I can use this to prove (or disprove) some common advice given when configuring Sendmail.

First, I'm going to look at queue directories.

Day 26 of 60: smtp-source

It's the school holidays, and my two children have had friends staying over this past week, which has meant that there hasn't been much opportunity to work on this project, and even less opportunity to write about it. So these next few posts are going to be something of a catch-up.

I've previously documented running Sendmail in a zone. One of the things that I need for testing Sendmail is a source of messages, and an easy mechanism to get them to Sendmail over SMTP.

Day 20 of 60: Running Sendmail in the zones (pt 2)

I've now got Sendmail built and installed, and adjusted the SMF service so that it uses my local version of Sendmail (with DTrace probes) instead of the system version.
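Roughly speaking, the adjustment amounts to exporting the existing service, pointing its start method at the locally built binary, and re-importing it. A sketch of the shape of it (check the service FMRI and method details on your own system rather than trusting my memory):

    # dump the stock service definition
    svccfg export svc:/network/smtp:sendmail > my-sendmail.xml

    # ... edit the manifest / start method so it runs the locally built
    #     sendmail rather than /usr/lib/sendmail ...

    svccfg import my-sendmail.xml
    svcadm restart svc:/network/smtp:sendmail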

Day 19 of 60: Running Sendmail in the zones (pt 1)

Now that Sendmail is building and correctly installing into a custom directory, it's time to start looking at how to get my version of Sendmail used instead of the version that's supplied with Solaris.

For that, I need to delve into the Solaris Service Management Facility (SMF).

Day 19 of 60: M4 issues resolved, ministat updates

The issues with M4 have been resolved. A colleague, Andre Lucas, took up the challenge and worked out a fix which he describes in detail. And ministat's now looking much better. It's grown some useful new options, a lot of documentation, and can now (optionally) generate plots in colour. Look below the fold for two example plots.

Day 17 of 60: ministat

I've spent some of today porting some useful statistics reporting software from C to Perl.

ministat reads in two or more files of data and uses Student's t-test to determine whether there is a statistically significant difference between the means of the datasets. This is especially useful when comparing benchmark results.

For instance, I figure that this will be useful for comparing data from several Sendmail runs where the number of queue directories differs between the runs. It should highlight any significant differences between the different numbers of queue directories.
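Stripped of the file handling and reporting, the heart of the calculation is a pooled-variance Student's t statistic for each pair of datasets, something along these lines (a simplified sketch, not the actual ministat code; the sample data is made up):

    #!/usr/bin/perl
    # pooled-variance Student's t statistic for two samples
    use strict;
    use warnings;

    sub mean { my $s = 0; $s += $_ for @_; return $s / @_; }

    sub variance {
        my $m = mean(@_);
        my $s = 0;
        $s += ($_ - $m) ** 2 for @_;
        return $s / (@_ - 1);
    }

    sub t_statistic {
        my ($a, $b) = @_;    # two array refs of measurements
        my ($na, $nb) = (scalar @$a, scalar @$b);
        my $pooled = (($na - 1) * variance(@$a) + ($nb - 1) * variance(@$b))
                     / ($na + $nb - 2);
        return (mean(@$a) - mean(@$b)) / sqrt($pooled * (1 / $na + 1 / $nb));
    }

    # e.g. two sets of wall-clock timings, in seconds
    my @before = (42.1, 41.8, 43.0, 42.5, 42.2);
    my @after  = (39.9, 40.3, 40.1, 39.7, 40.4);
    printf "t = %.3f\n", t_statistic(\@before, \@after);

The real tool then compares the statistic against the critical value for the requested confidence level and degrees of freedom, and reports whether the difference is significant.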

Rather than explain more here, I'll point you at the code. There's documentation towards the end of the file. Note that this still needs some work -- there's no proper command line option handling at the moment, some of the documentation needs fleshing out, and I wouldn't use this as an example of good Perl code, as it still looks far too much like a Perl program that's been written in C.

When I've fixed that I'll put it up on CPAN.

Day 16 of 60: I strongly dislike M4

Which is a problem because the Sendmail build system is written in it.

Here's the problem I'm trying to solve.

Day 16 of 60: Reinstall

I've just had to carry out a complete reinstall of the OS, which was uncommonly tedious.

Day 15 of 60: Installing Sendmail

I've spent some time today getting Sendmail+DTrace to install properly.

This wasn't quite as straightforward as it could be, requiring a little build infrastructure hacking.

Day 14 of 60: Minor updates

I've been a bit busy with other work over the past few days, and haven't made quite as much progress as I'd like.

There are a few things that have moved forward though.

Day 10 of 60: First probes added to Sendmail

Following Monday's info dump about queues, I've spent some time over the last few days reading the DTrace documentation in detail. In particular, the Solaris Dynamic Tracing Guide. This is the DTrace handbook, with a great deal of information about how to use DTrace.

It also contains information about how to add custom DTrace probes to user applications. I was a bit surprised when I first read that section, as it's only a couple of pages long.

It turns out that adding DTrace probes really is that simple...
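For anyone who hasn't seen it, the recipe is: describe the probes in a small provider file, drop a macro into the C source wherever you want a probe to fire, and have the build run dtrace -G over the object files before the final link. A sketch, with illustrative names (these aren't the probes I've actually added, and qid here stands in for whatever argument you want to hand to the probe):

    /* sm_provider.d -- a hypothetical provider definition */
    provider sendmail {
        probe queue__entry__start(char *);
        probe queue__entry__done(char *);
    };

    /* in the C source, wherever the event of interest happens: */
    #include <sys/sdt.h>

        DTRACE_PROBE1(sendmail, queue__entry__start, qid);

    /* in the build, before linking:
     *     dtrace -G -s sm_provider.d file1.o file2.o ...
     * then link the generated object in with everything else.
     */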

Day 8 of 60: Sendmail queues

The time has come to start adding DTrace functionality to Sendmail. Of course, there's no point in just diving in and adding code left, right, and centre, so over the last couple of days I've been thinking about what I should be instrumenting first.

Day 5 of 60: DTrace mode for Emacs?

I'm just starting to get my feet wet with DTrace. Does anyone know of a decent Emacs mode for editing .d files?

Day 5 of 60: Building test zones

I've spent a bit of time today preparing some zones that I'll be using for testing my changes to Sendmail.
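The dance, as far as I've worked it out so far, goes roughly like this -- the zone name, paths, interface, and address are all placeholders:

    # describe the zone (at the interactive zonecfg prompt)
    zonecfg -z mailtest
        create
        set zonepath=/zones/mailtest
        add net
        set physical=bge0
        set address=192.168.1.50/24
        end
        commit
        exit

    # install its files, boot it, then attach to the console to answer
    # the first-boot configuration questions
    zoneadm -z mailtest install
    zoneadm -z mailtest boot
    zlogin -C mailtest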

Day 4 of 60: Sendmail first build

It's not much of a milestone, but I spent five minutes getting Sendmail to build on Solaris 10 with the Sun Studio 11 compiler.

Day 4 of 60: The learning zone

One of the new features in Solaris 10 that I'm interested in is zones. A zone is a lightweight virtualisation environment. Unlike VMware or Xen, the whole environment is not virtualised -- you still have one running OS kernel which arbitrates access to the hardware, for example. A zone is more like a separate instance of the userland, with its own IP address, users, running processes, and so on.

In this respect Solaris Zones are very similar to FreeBSD Jails, and if I was going to sum it up I might call it "chroot on steroids, with a much better management interface."

I'm quite familiar with FreeBSD's Jail system, much less so with Zones. I've offered up a Zone to some of the pkgsrc developers so they can experiment with pkgsrc on Solaris 10, and I'm planning on using Zones for testing the changes that I'll be making to Sendmail, so I need to learn how to create and manage them.

Day 2 of 60: Synergy

This is a plug for one of the handiest network programs I've used in a long time: Synergy.

Day 2 of 60: Blastwave

I wrote earlier about getting pkgsrc builds up and working.

Unfortunately, I wrote too soon.

Day 2 of 60: Importing and branching Sendmail

Now that I've started to get a development environment that I feel comfortable with, I've imported the latest release of Sendmail into my Subversion repository. This is publicly accessible, so you can follow along at home if you've got a Subversion client installed.

Day 2 of 60: SSH agent authentication

I've just configured the desktop to prompt me for my SSH credentials once, instead of on every connection, using ssh-agent and an X11 SSH password requestor.

This is bread-and-butter stuff that should be easy, made a little more complex by the documentation not being accessible on the Sun site. Since I couldn't find the correct incantation through Google, I'm documenting it here.
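It boils down to two things: make sure an agent is started for the X session, and force ssh-add to prompt through the graphical requestor rather than a terminal. Something like this (the askpass path is the sort of location I'd expect on Solaris 10 -- check where yours actually lives):

    # start an agent for this session; exports SSH_AUTH_SOCK and SSH_AGENT_PID
    eval `ssh-agent -s`

    # with no terminal on stdin and SSH_ASKPASS set, ssh-add prompts via X11
    SSH_ASKPASS=/usr/lib/ssh/ssh-askpass ssh-add < /dev/null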

Day 1 of 60: Fun and games with pkgsrc

In the years that I've been using Solaris, its support for third-party packages has, to my mind, always let it down. The open source community writes and releases software at a phenomenal rate, and systems like FreeBSD and Linux have developed a number of interesting ways to make it as easy as possible to get this software, install it, and (much more importantly) manage it once it's been installed.

Day 1 of 60: Tourist

I've spent a bit of time poking around the workstation and Solaris 10. I feel like a tourist in a Western European city. Everything's pretty familiar -- the roadsigns all look the same (although the font might be a bit different, and the speeds are marked in kph not mph) but there are things everywhere that remind you that you're not quite home.

Day 1 of 60: First boot

After a few phone calls along the way to find out where it had got to, the workstation arrived today a little after midday. I was greeted with two large boxes, one containing the machine itself, the other containing a far smaller box with the keyboard and mouse. As a friend has already remarked, there's a certain something about new computer smell.

ENOSERVER

No, I haven't given up already. Although the box was supposed to arrive on the 5th, there's been no sign of it so far. I did try using the tracking number that Sun sent me at mysun.sun.com.

Sadly that just gives me:
Thank you for your interest in Order Status. It will take three business days to activate your Order Status entitlements.

I've been on the phone to them, and they do assure me that it will arrive later today.

In the meantime, here's something else I've been working on -- scrollable commit timelines for SVN::Web.

raison d’être

It started when I read a number of posts at Jonathan Schwartz's blog (in order: here, here, here, here, and here).

Jonathan is Sun's CEO (although he wasn't at the time he started this series). The essence of it is that Sun are so stoked about their new hardware that:
So... here's an invitation to developers and customers that don't want to move to Solaris, want to stay on GNU/Linux, but still want to take advantage of Niagara's (or our Galaxy system's) energy efficiency - click here, we'll send you a Niagara or Galaxy system, free. Write a thorough*, public review (good or bad - we just care about the fidelity/integrity of what's written - to repeat, it can be a good review, or a poor review), we'll let you keep the system. Free.

That sounds like a good deal to me. So I started thinking about how I might take advantage of this offer.