Tuning I/O resources

Case study: unbalanced disk activity on a database server

This case study examines a database management system that uses raw I/O; many databases provide the ability to perform I/O through the raw device rather than using block I/O through the buffer cache.

The system in this study is not used to run any other applications or for developing applications. The study uses a benchmark written in-house to tune the system's performance. It is understood that a benchmark is only an approximation to the demands that will be placed upon the database in real life.

System configuration

The system configuration is:

Multiprocessor -- four 80486DX2 processors running at 66MHz.
EISA bus.
64MB of RAM.
64MB of swap space -- note that this system has been configured with a swap space that is smaller than the minimum of twice the amount of RAM (128MB) that is normally recommended.
Six 1GB SCSI-2 hard disk drives.
One Bus Mastering DMA Ethernet network card with a 16KB buffer and 32-bit wide data path.
Two terminal concentrators supporting a total of 128 users on a variety of ASCII terminals and personal computers running terminal emulation software.

Additional software that has been installed on the system includes SCO SMP License, a relational database management system, and transaction compression software.

The configuration of the disk drives is:

Sdsk-0: boot, swap, root, and applications divisions
Sdsk-1 through Sdsk-5: database tables, indexes, and journal logs configured as RAID 5 (striped with parity) to balance disk load

Defining a performance goal

The database supports the main activities of an organization, and transaction response time is critical. Desirable performance goals for such a system are:

Do not allow the system to start swapping or paging out.
Minimize time spent waiting for I/O.
Maximize the time that the database management processes spend on the CPUs.
Minimize system overhead.
Minimize contention for resources between the CPUs.
Ensure that sufficient kernel resources are available to the database management software (such as IPC facilities, STREAMS, and so on).
Maximize the amount of physical memory that is available for the database's internal work areas.

Collecting data

The database administrator runs mpsar to collect system statistics every 5 minutes while running the benchmark:

mpsar -o /tmp/sar_op 300 25

After the benchmark run finishes, cpusar -u is run to examine the activity of the operating system on each CPU. For example, on the base processor, the command is:

cpusar -P 1 -u -f /tmp/sar_op

The output below illustrates the status of the first processor; reports for the other processors recorded similar results:

   02:30:00    %usr    %sys    %wio   %idle
   ...
   02:55:00      34      14      25      27
   03:00:00      32      11      32      25
   03:05:00      27       8      41      24
   03:10:00      31      11      32      25
   ...

Formulating a hypothesis

The information provided by cpusar indicates that the processors are spending a significant amount of time waiting for I/O. Since it is observed that the disk drive busy lights are on virtually all of the time, this indicates that the disks may not be keeping up with the rest of the system. It is known that the benchmark does not test the networking I/O subsystems since it generates activity only on the server. For this reason, network I/O can be eliminated as a possible cause.

From the available evidence, it looks as if the problem is in the disk subsystem. Sometimes an I/O problem is a symptom of an underlying memory shortage. A more likely cause of the poor I/O performance is the distribution of activity among the system's disks. However, it is a good policy to eliminate memory shortage as the origin of the performance degradation.

Getting more specifics

In this example, there is a high amount of RAM, so it is unlikely that memory is the root of the problem. To make certain, the administrator runs mpsar -r to check on the amount free memory and swap while the benchmark was being run:

   02:30:00 freemem freeswp
   ...
   02:55:00    2012  131072
   03:00:00    2004  131072
   03:05:00    2098  131072
   03:10:00    2009  131072
   ...

There was approximately 8MB of free memory and no swap space used during the period when the benchmark was run.

The administrator now concentrates on examining disk usage. It is unlikely that buffer cache usage is causing a problem as this has already been tuned for the root and applications filesystems. The database itself does not use the buffer cache as it uses asynchronous I/O access to disk.

The mpsar -d command is run to show disk usage during the benchmark:

02:30:00 device %busy avque r+w/s blks/s avwait avserv ... 02:55:00 Sdsk-0 0.91 2.11 12.37 23.11 3.24 2.91 Sdsk-1 85.24 4.13 39.93 155.12 79.62 25.44 Sdsk-2 78.02 4.14 41.10 160.82 63.50 20.21 Sdsk-3 85.11 4.36 34.54 167.54 80.85 24.05 Sdsk-4 89.34 4.42 37.00 156.91 85.26 24.92 Sdsk-5 83.59 4.48 40.41 159.34 60.11 17.26

03:00:00 Sdsk-0 0.76 2.42 8.37 13.64 8.35 5.88 Sdsk-1 90.10 4.42 42.37 156.67 65.71 19.19 Sdsk-2 85.42 5.38 39.09 160.81 79.96 18.24 Sdsk-3 88.29 4.62 37.65 163.57 83.27 23.00 Sdsk-4 85.99 5.29 41.11 166.28 79.21 18.45 Sdsk-5 99.54 4.21 43.09 170.21 56.76 17.67

03:05:00 Sdsk-0 0.70 4.28 8.07 11.95 7.32 2.23 Sdsk-1 88.01 4.03 40.01 167.76 63.27 20.91 Sdsk-2 89.18 3.82 38.97 156.08 65.44 23.20 Sdsk-3 93.02 4.53 39.55 163.20 80.56 22.81 Sdsk-4 85.33 4.07 37.03 154.31 78.70 25.66 Sdsk-5 91.14 4.87 41.34 164.40 59.02 15.24

03:10:00 Sdsk-0 0.79 2.94 9.60 16.23 6.30 3.67 Sdsk-1 87.75 4.20 40.77 159.85 69.53 21.85 Sdsk-2 83.82 4.52 39.72 159.24 69.63 20.55 Sdsk-3 88.81 4.50 37.25 164.77 81.56 23.29 Sdsk-4 86.89 4.59 38.38 159.17 81.06 23.01 Sdsk-5 91.42 4.52 41.61 164.65 58.63 15.98 ...

These results indicate that the system is I/O bound in the disk subsystem. The average service time (avserv) on the root disk (Sdsk-0) is much lower than that on the other disks -- even though they are physically identical. There is no significant variation between activity on disks Sdsk-1 through Sdsk-5 -- this is as expected given their RAID configuration which is designed to balance activity across several disks.

Making adjustments to the system

The large service times to the disks containing the database tables, indexes, and journal logs seem to indicate that large disk seek times are having an impact here.

Database journal logs are written to sequentially so it may impair disk performance if they are striped across the same disks as the tables and indexes. The effect will be to mix requests to write to the logs with those to access the tables and indexes. The requests will then be spread across all the disks in the array, losing all coherency between requests to write to the logs. Instead, each log should be placed on a dedicated disk so that the next block to be written to will usually be immediately available without having to move the drive head. This also makes sense for reducing load on the other disks when you consider that most writes on a database server tend to be to the logs.

The recommendation for reconfiguring this system would be to move the journal logs to dedicated disks. Since requests to access tables and indexes are generally random, the maximum performance from the disks that they occupy will probably be obtained by leaving these in a RAID 5 configuration. Journal logs on a database server are usually assigned to a dedicated mirrored disk configuration, such as RAID 1, to ensure data integrity.

Once the disk layout has been reconfigured, the benchmark should be run again to test the hypothesis. If the performance of the dedicated journal log disks is still poor, the disks should be upgraded to ones with lower access times. These disks will also benefit from write-back caching provided that the integrity of the data is protected by a backup power supply.

Database performance tuning

There is little to be gained from tuning the operating system's buffer cache for a database management system that uses raw I/O. However, the database management system may maintain its own buffer caches which you can tune independently. Refer to the performance tuning documentation that was supplied with the database management system for more information. Some databases also maintain profiling files that indicate how many jobs are going to each device. You may also find this useful in balancing the load on the system.

Use of shared data segments

If your database caches its internal information in IPC shared memory segments, you may need to increase the shared memory kernel parameters SHMMAX and SHMMNI. See ``Shared memory'', ``Shared memory parameters'' and your database documentation for more information.