Automating frequent tasks

File tools

The following scripts manipulate files and report information about them.

Return the total size of a group of files

The following is a script called filesize:

   l "$@" | awk '
        { s += $5
          f = f " " $NF
        }
   END  { print s, "bytes in files:", f } '

The l command (equivalent to ls -l) returns a long listing, the fifth field of which contains the size of a file in bytes. This script obtains a long listing of each file in its argument list, and pipes it through a short awk script. For each line in its standard input, the script adds the fifth field of the line to the variable s and appends the last field (the filename) to the variable f; on reaching the end of the standard input, it prints s followed by a brief message and f.
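On a system that lacks the l command, the same idea can be sketched with ls -l. This is only an illustration: the function name filesize_sketch is an assumption, not part of the original script, and a filename containing spaces is reported by its last word only, because awk splits each line on whitespace.

```shell
# Hypothetical portable variant of filesize (not the original script):
# uses ls -ld in place of the l command.
filesize_sketch() {
    # -d lists a directory itself rather than its contents, so every
    # argument yields exactly one line and no "total" line appears.
    ls -ld "$@" | awk '
        { s += $5; f = f " " $NF }            # add size, collect name
        END { print s, "bytes in files:" f }  # f already starts with a space
    '
}
```

A call such as filesize_sketch *.c prints the combined size of the matching files followed by their names.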

Compress a batch of files concurrently

The compress(C) command can compress a batch of files listed as arguments; however, if you run compress in this way only one process is created, and it compresses each file consecutively.

The following code is a script called squeeze:

   ((jobcount=0)) ; rm -f squish.log
   for target in "$@"
   do
     # Parenthesize the increment: += binds less tightly than >
     if (( (jobcount += 1) > 18 ))
        then ((niceness = 18 ))
     else
        ((niceness = jobcount ))
     fi
     # Pause for a minute on reaching every eighteenth file
     ((jobcount % 18 != 0)) || sleep 60
     nice -${niceness} compress "${target}" && print "Finished compressing " \
       "${target}" >> squish.log &
     print "Started compressing " "${target}" "at niceness " \
       ${niceness} >> squish.log
   done
   print "finished launching jobs" >> squish.log

A concurrently running compress process is started for each file. However, if squeeze is run on a large directory, this could overload the system; therefore, squeeze uses nice(C) to lower the priority of each new process as the number of processes increases.

The first section of this script keeps track of the niceness (decrement in scheduling priority) with which each compress job is to be started:

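   if (( (jobcount += 1) > 18 ))
      then ((niceness = 18 ))
   else
      ((niceness = jobcount ))
   fi

The value of jobcount is incremented every time a new file compression job is started. The increment is parenthesized because += binds less tightly than >: written as jobcount+=1 > 18, the expression would add the result of the comparison 1 > 18 (that is, 0) to jobcount, and the counter would never advance. If jobcount exceeds 18, the niceness value is pegged at 18; otherwise, the niceness is equal to the number of files processed so far. (nice accepts a maximum value of 18; this construct places a bounds check on the argument passed to it.)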
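The bounds check can also be isolated in a small helper, which makes it easy to try out on its own. The following is a minimal sketch in portable shell syntax; the function name clamp_niceness is hypothetical and does not appear in squeeze.

```shell
# Hypothetical helper: limit a job counter to the maximum niceness of 18
# described in the text.
clamp_niceness() {
    count=$1
    if [ "$count" -gt 18 ]; then
        echo 18            # counter too large: peg the niceness at 18
    else
        echo "$count"      # otherwise niceness follows the counter
    fi
}
```

For example, clamp_niceness 5 prints 5, while clamp_niceness 25 prints 18.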

The following line pauses the script at regular intervals:

   ((jobcount % 18 != 0)) || sleep 60

If jobcount is not a multiple of 18 (that is, if there is a nonzero remainder when jobcount is divided by 18), the first command evaluates to true and the second command, which follows the logical OR, is not executed. Conversely, when jobcount is an exact multiple of 18, the first command evaluates ``0 != 0'', which is false, and so the second command (sleep 60) is executed. Thus, on reaching every eighteenth file, the script sleeps for one minute to give the earlier compression processes time to complete.
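The effect of the test can be checked without waiting for any sleep to elapse. The sketch below, written with the portable [ ] test and $(( )) arithmetic rather than the (( )) form used in squeeze, counts how often the right-hand side of the OR would run for a given number of jobs. The function name pauses_for is an assumption made for this illustration.

```shell
# Hypothetical helper: report how many pauses squeeze would take while
# launching the given number of jobs. It counts instead of sleeping.
pauses_for() {
    total=$1
    jobcount=0
    pauses=0
    while [ "$jobcount" -lt "$total" ]; do
        jobcount=$((jobcount + 1))
        # The command after || runs only when the test before it fails,
        # that is, when jobcount is an exact multiple of 18.
        [ $((jobcount % 18)) -ne 0 ] || pauses=$((pauses + 1))
    done
    echo "$pauses"
}
```

With 17 jobs no pause is taken; with 18 jobs there is one, and with 40 jobs there are two.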

The main work of the script is done by the following lines:

   nice -${niceness} compress "${target}" && print "Finished compressing " \
     "${target}" >> squish.log &
   print "Started compressing " "${target}" "at niceness " \
     ${niceness} >> squish.log

nice is used to start a compress process for each target file at the niceness level set by the if statement at the top of the loop. A logical AND connective is used to print a message to the file squish.log when the compression job completes successfully; the whole command line is executed as a background job. The shell then executes the next line, which prints a start message to the logfile, almost certainly before the compression process has begun. (This illustrates the asynchronous execution of processes.)
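The ordering of the log messages can be reproduced with a stand-in for compress. In the sketch below, sleep 1 plays the part of a slow job and false plays the part of a job that fails; the function name demo_async and the message texts are assumptions for illustration only, and echo is used in place of the Korn shell print command so that the sketch runs in other shells as well.

```shell
# Hypothetical demonstration of a background AND-list.
# Usage: demo_async logfile
demo_async() {
    log=$1
    : > "$log"                                   # start with an empty log
    # The whole AND-list runs in the background; the message is
    # appended only after sleep has completed successfully.
    sleep 1 && echo "finished job" >> "$log" &
    # The shell does not wait, so this line is logged first.
    echo "started job" >> "$log"
    # A job that fails never triggers the command after &&.
    false && echo "never logged" >> "$log" &
    wait                                         # let background jobs end
}
```

After demo_async has run, the logfile holds the start message followed by the finish message, and nothing from the failed job.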

It is well worth examining the logfile left after running squeeze on the contents of a directory. This illustrates how concurrent execution of processes can provide a significant performance improvement over sequential execution, despite the apparent complexity of ensuring that a rapid proliferation of tasks does not bring the system to its knees.

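You can adapt squeeze to run just about any simple filter job in parallel: define a function to do the operation you want, then use it in place of compress. Because nice can start only external commands, not shell functions, either call the function without nice or place the operation in a separate script.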
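As a minimal sketch of such an adaptation, the following functions run a filter over each file in the background and log each completion. Everything here is hypothetical: process_one (which writes an upper-case copy of a file to a new file with the suffix .up), run_parallel, and the log message are names chosen for the illustration. The function is called directly rather than through nice, and the niceness and pause logic of squeeze are omitted for brevity.

```shell
# Hypothetical filter: write an upper-case copy of file to file.up
process_one() {
    tr '[:lower:]' '[:upper:]' < "$1" > "$1.up"
}

# Hypothetical driver. Usage: run_parallel logfile file...
run_parallel() {
    log=$1
    shift
    for target in "$@"; do
        # One background job per file; log only on success
        process_one "$target" && echo "Finished $target" >> "$log" &
    done
    wait    # return only when every background job has ended
}
```

For example, run_parallel up.log *.txt creates an upper-case copy of each matching file and records one line per file in up.log.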