This case study examines a database management system that uses raw I/O; many databases provide the ability to perform I/O through the raw device rather than using block I/O through the buffer cache.
The system in this study is not used to run any other applications or for developing applications. The study uses a benchmark written in-house to tune the system's performance. It is understood that a benchmark is only an approximation to the demands that will be placed upon the database in real life.
The system configuration is:
The configuration of the disk drives is:
The database supports the main activities of an organization, and transaction response time is critical. Desirable performance goals for such a system are:
The database administrator runs mpsar to collect system
statistics every 5 minutes while running the benchmark:
mpsar -o /tmp/sar_op 300 25
After the benchmark run finishes, cpusar -u is run to
examine the activity of the operating system on each CPU.
For example, on the base processor, the command is:
cpusar -P 1 -u -f /tmp/sar_op
The output below illustrates the status of the first processor; reports for the other processors recorded similar results:
02:30:00 %usr %sys %wio %idle ... 02:55:00 34 14 25 27 03:00:00 32 11 32 25 03:05:00 27 8 41 24 03:10:00 31 11 32 25 ...
The information provided by cpusar indicates that the processors are spending a significant amount of time waiting for I/O. Since it is observed that the disk drive busy lights are on virtually all of the time, this indicates that the disks may not be keeping up with the rest of the system. It is known that the benchmark does not test the networking I/O subsystems since it generates activity only on the server. For this reason, network I/O can be eliminated as a possible cause.
From the available evidence, it looks as if the problem is
in the disk subsystem. Sometimes an I/O
problem is a symptom of an underlying memory shortage.
A more likely cause of the poor I/O performance
is the distribution of activity among the system's disks.
However, it is a good policy to
eliminate memory shortage as the origin of the
In this example, there is a high amount of RAM, so it is unlikely that memory is the root of the problem. To make certain, the administrator runs mpsar -r to check on the amount free memory and swap while the benchmark was being run:
02:30:00 freemem freeswp ... 02:55:00 2012 131072 03:00:00 2004 131072 03:05:00 2098 131072 03:10:00 2009 131072 ...There was approximately 8MB of free memory and no swap space used during the period when the benchmark was run.
The administrator now concentrates on examining disk usage. It is unlikely that buffer cache usage is causing a problem as this has already been tuned for the root and applications filesystems. The database itself does not use the buffer cache as it uses asynchronous I/O access to disk.
The mpsar -d command is run to show disk usage during the benchmark:
02:30:00 device %busy avque r+w/s blks/s avwait avserv ... 02:55:00 Sdsk-0 0.91 2.11 12.37 23.11 3.24 2.91 Sdsk-1 85.24 4.13 39.93 155.12 79.62 25.44 Sdsk-2 78.02 4.14 41.10 160.82 63.50 20.21 Sdsk-3 85.11 4.36 34.54 167.54 80.85 24.05 Sdsk-4 89.34 4.42 37.00 156.91 85.26 24.92 Sdsk-5 83.59 4.48 40.41 159.34 60.11 17.26These results indicate that the system is I/O bound in the disk subsystem. The average service time (
03:00:00 Sdsk-0 0.76 2.42 8.37 13.64 8.35 5.88 Sdsk-1 90.10 4.42 42.37 156.67 65.71 19.19 Sdsk-2 85.42 5.38 39.09 160.81 79.96 18.24 Sdsk-3 88.29 4.62 37.65 163.57 83.27 23.00 Sdsk-4 85.99 5.29 41.11 166.28 79.21 18.45 Sdsk-5 99.54 4.21 43.09 170.21 56.76 17.67
03:05:00 Sdsk-0 0.70 4.28 8.07 11.95 7.32 2.23 Sdsk-1 88.01 4.03 40.01 167.76 63.27 20.91 Sdsk-2 89.18 3.82 38.97 156.08 65.44 23.20 Sdsk-3 93.02 4.53 39.55 163.20 80.56 22.81 Sdsk-4 85.33 4.07 37.03 154.31 78.70 25.66 Sdsk-5 91.14 4.87 41.34 164.40 59.02 15.24
03:10:00 Sdsk-0 0.79 2.94 9.60 16.23 6.30 3.67 Sdsk-1 87.75 4.20 40.77 159.85 69.53 21.85 Sdsk-2 83.82 4.52 39.72 159.24 69.63 20.55 Sdsk-3 88.81 4.50 37.25 164.77 81.56 23.29 Sdsk-4 86.89 4.59 38.38 159.17 81.06 23.01 Sdsk-5 91.42 4.52 41.61 164.65 58.63 15.98 ...
avserv) on the root disk (Sdsk-0) is much lower than that on the other disks -- even though they are physically identical. There is no significant variation between activity on disks Sdsk-1 through Sdsk-5 -- this is as expected given their RAID configuration which is designed to balance activity across several disks.
The large service times to the disks containing the database tables, indexes, and journal logs seem to indicate that large disk seek times are having an impact here.
Database journal logs are written to sequentially so it may impair disk performance if they are striped across the same disks as the tables and indexes. The effect will be to mix requests to write to the logs with those to access the tables and indexes. The requests will then be spread across all the disks in the array, losing all coherency between requests to write to the logs. Instead, each log should be placed on a dedicated disk so that the next block to be written to will usually be immediately available without having to move the drive head. This also makes sense for reducing load on the other disks when you consider that most writes on a database server tend to be to the logs.
The recommendation for reconfiguring this system would be to move the journal logs to dedicated disks. Since requests to access tables and indexes are generally random, the maximum performance from the disks that they occupy will probably be obtained by leaving these in a RAID 5 configuration. Journal logs on a database server are usually assigned to a dedicated mirrored disk configuration, such as RAID 1, to ensure data integrity.
Once the disk layout has been reconfigured, the benchmark should be run again to test the hypothesis. If the performance of the dedicated journal log disks is still poor, the disks should be upgraded to ones with lower access times. These disks will also benefit from write-back caching provided that the integrity of the data is protected by a backup power supply.
There is little to be gained from tuning the operating system's buffer cache for a database management system that uses raw I/O. However, the database management system may maintain its own buffer caches which you can tune independently. Refer to the performance tuning documentation that was supplied with the database management system for more information. Some databases also maintain profiling files that indicate how many jobs are going to each device. You may also find this useful in balancing the load on the system.
If your database caches its internal information in IPC shared memory segments, you may need to increase the shared memory kernel parameters SHMMAX and SHMMNI. See ``Shared memory'', ``Shared memory parameters'' and your database documentation for more information.