Hardware and software I/O queues: the difference
Let's answer a simple question first: why do we need two types of I/O queue? Why both a hardware and a software queue?
The simplest answer is that, in some systems, one kind of queue can be absent (in other words: you have only one queue). Old IDE drives are a good example of that: in the absence of a hardware-assisted queue, the OS was in charge of all I/O sorting.
In the diametrically opposite case, a very high-end machine can have a very complex, purposely-configured hardware queue, and so it can disable the simplistic software queue inside the OS.
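On Linux, for example, this choice is exposed per device through sysfs: reading the scheduler file shows the available software schedulers with the active one in brackets, and writing "none" to it turns software-side reordering off. A minimal sketch (the device name sda is just an example; writing requires root):

```python
# Inspect and switch the software I/O scheduler for one block device.
# The device name "sda" is illustrative; adapt it to your system.
SCHED = "/sys/block/sda/queue/scheduler"

with open(SCHED) as f:
    # e.g. "[mq-deadline] kyber bfq none" -- active scheduler in brackets
    print("schedulers:", f.read().strip())

# Selecting "none" disables software-side reordering, leaving any
# request ordering entirely to the drive's hardware queue (needs root).
with open(SCHED, "w") as f:
    f.write("none")
```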
However, on many machines you will indeed find both queues enabled at the same time. To understand why, it is crucial to realize that they have different limitations, scope and visibility into the underlying hardware.
A modern disk exposes its storage as a contiguous string of storage blocks. This addressing scheme is called LBA, for logical block addressing. In short, the OS sees the disk as a long sequence of blocks starting at index 0 and ending at index N. To put it in a picture:
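Beyond the picture, the arithmetic behind this model is trivial: block i simply lives at byte offset i times the block size. A minimal sketch (the 512-byte block size is an assumption; many modern disks use 4096-byte sectors):

```python
# The OS-visible model: a disk is a flat array of fixed-size blocks.
BLOCK_SIZE = 512        # bytes per logical block (assumed; often 4096 today)
N = 1_000_000           # hypothetical last valid LBA

def lba_to_offset(lba: int) -> int:
    """Map a logical block address to its byte offset on the device."""
    assert 0 <= lba <= N, "LBA out of range"
    return lba * BLOCK_SIZE

print(lba_to_offset(9))  # block 9 starts at byte 4608
```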
Now imagine that a user-level application asks the OS to read blocks 1, 5, 2, 9, 4. A "dumb" OS would immediately send these requests to the disk but, as head seeks are slow, performance would be low. It would be much smarter to reorder the requested blocks sequentially: read 1, 2, 4, 5, 9 instead. This results in less head seeking and therefore in higher performance. Moreover, as CPU cycles and memory space are far cheaper than disk seeks, the software-controlled queue can be very big, accommodating even hundreds or thousands of requests. The software component called the "I/O scheduler" has this very precise duty: to organize and reorder requests to extract greater performance from the disks. Some OSs (such as Linux) even have multiple, selectable schedulers, so you can use the fastest one for a specific workload.
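As an illustration of the idea (a toy sketch, not the actual Linux implementation), here is the classic one-way "elevator" strategy: sort pending requests by LBA and serve them in a single sweep of the head:

```python
# Toy "elevator" I/O scheduler: serve requests in LBA order so the
# head sweeps across the disk instead of jumping back and forth.
def elevator_order(pending: list[int], head: int = 0) -> list[int]:
    """One ascending sweep starting from `head`, then a second sweep
    for the blocks that were behind the head when scheduling started."""
    ahead = sorted(b for b in pending if b >= head)
    behind = sorted(b for b in pending if b < head)
    return ahead + behind

print(elevator_order([1, 5, 2, 9, 4]))          # [1, 2, 4, 5, 9]
print(elevator_order([1, 5, 2, 9, 4], head=3))  # [4, 5, 9, 1, 2]
```

Real schedulers layer deadlines, fairness and request merging on top of this basic ordering, but the core win is the same: fewer and shorter seeks.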
However, let us face reality: physically, HDDs are not organized as a continuous stream of blocks; rather, they are organized in rotating platters. This means that not only head seek latency but also rotational delay (the time needed to rotate the platter up to the correct block) plays an important role in determining I/O speed. Yet the OS knows nothing about rotational latency: remember that the HDD announces itself as a continuous string of blocks. On the other hand, the internal HDD microcontroller surely knows about the rotating nature of its storage platters. This means that a hardware-assisted queue can noticeably increase I/O performance by further reordering requests. Again, a picture (taken from Wikipedia) can be very useful:
Do you see how, in the NCQ-enabled disk, the head travels a shorter total path? This is thanks to rotation-aware reordering of the requested blocks. However, SATA NCQ is only good for 32 requests (31 in fact, due to a bug in the specification). But if hardware queues are so smart, why don't we have gigantic queues in today's disks and ditch software queues entirely? The problem is that hardware queues are costly, because they need to be implemented, er, in hardware (or, better, inside microcontroller software which, in turn, requires faster and pricier hardware). However, HDD microcontrollers tend to be (and are required to be, due to low-cost requirements) quite simple in design. This cost/performance tradeoff results in the 32-entry queue of today's consumer disks.
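To get a feel for why rotation awareness matters, here is a back-of-the-envelope model (made-up numbers and a greedy policy, not real drive firmware): each request carries a track and an angular position, and the cost of serving it is the seek time plus the wait for the platter to rotate the target sector under the head. Reordering by this full cost can beat serving requests in arrival order:

```python
RPM = 7200
MS_PER_REV = 60_000 / RPM      # ~8.33 ms per platter revolution
SEEK_MS_PER_TRACK = 0.01       # assumed linear seek cost, made up

def service_cost(pos, req):
    """ms to go from head position `pos` to request `req`,
    both given as (track, angle-in-degrees)."""
    (t0, a0), (t1, a1) = pos, req
    seek = abs(t1 - t0) * SEEK_MS_PER_TRACK
    # the platter keeps spinning while the head seeks
    angle_now = (a0 + 360 * seek / MS_PER_REV) % 360
    rot_wait = ((a1 - angle_now) % 360) / 360 * MS_PER_REV
    return seek + rot_wait

def total_cost(pos, order):
    cost = 0.0
    for r in order:
        cost += service_cost(pos, r)
        pos = r
    return cost

def greedy_ncq(pos, pending):
    """Serve, at each step, the cheapest-to-reach pending request."""
    order, pending = [], list(pending)
    while pending:
        nxt = min(pending, key=lambda r: service_cost(pos, r))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

head = (0, 0)
reqs = [(5, 350), (5, 10), (0, 180), (10, 90)]
print("FIFO cost  :", round(total_cost(head, reqs), 2))
print("greedy cost:", round(total_cost(head, greedy_ncq(head, reqs)), 2))
```

For this toy input, the greedy rotation-aware order completes the same four requests in less than half the simulated time of the arrival order.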
It's worth noting that even on enterprise disks, where TCQ can be hundreds of requests in size, software queues (due to their simplicity) remain even bigger.
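On Linux you can compare the two depths directly through sysfs. A minimal sketch (sda is again an example name; the hardware file is exposed for SCSI/SATA-class devices):

```python
# Compare hardware and software queue depths of a block device on Linux.
def read_sysfs(path: str) -> str:
    with open(path) as f:
        return f.read().strip()

# Depth of the drive's hardware queue (NCQ/TCQ), as negotiated:
print("hardware:", read_sysfs("/sys/block/sda/device/queue_depth"))
# Size of the kernel's software queue feeding the I/O scheduler:
print("software:", read_sysfs("/sys/block/sda/queue/nr_requests"))
```

On a typical consumer SATA drive the first number is 32 (or 31), while the second is larger.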
One last thing to note: all this discussion about queues makes sense only if there are multiple outstanding requests for the disk subsystem. In a single-thread, single-request scenario there is nothing to reorder: the queues will always be empty. With current multi-core hardware and software (which often use multiple threads and concurrent, asynchronous I/O requests), even consumer-class workloads can expect a greater-than-one queue depth, but this discussion is clearly aimed at servers, where QD rises significantly.
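If you want to generate a queue depth greater than one yourself, the simplest way is to issue overlapping requests, for example from a small thread pool. A hedged sketch (the file name and sizes are made up, and the page cache may absorb many of these reads; purpose-built tools like fio are the proper way to drive high queue depths):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Issue reads concurrently so several requests are outstanding at once,
# giving both queues something to reorder. All names/sizes are made up.
PATH, BLOCK, STRIDE = "bigfile.bin", 4096, 4096 * 1024

fd = os.open(PATH, os.O_RDONLY)
offsets = [i * STRIDE for i in range(64)]   # deliberately scattered reads

with ThreadPoolExecutor(max_workers=16) as pool:
    chunks = list(pool.map(lambda off: os.pread(fd, BLOCK, off), offsets))
os.close(fd)
print(f"read {sum(len(c) for c in chunks)} bytes via 16 workers")
```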
So, how do hardware and software queues interact to give us greater speed? Let's see the data...