lex

Rules section

Each rule consists of a pattern to be searched for in the input, followed on the same line by an action to be performed when the pattern is matched. Because lexical analyzers are often used in conjunction with parsers, as in programming language compilation and interpretation, the patterns can be said to define the classes of tokens that may be found in the input.

Regular expressions

The patterns describing the classes of strings to be searched for are written using regular expressions in a notation similar to that used in awk(C) and sed(C). The terms ``pattern'' and ``regular expression'' are often used interchangeably. A regular expression is formed by concatenating characters and, usually, certain operators. This notation used with lex is summarized in the following list:

Actions

An action is a block of C code that is executed whenever the corresponding pattern in the lex specification is matched. Once the lex-generated lexical analyzer matches the regular expression in a rule, it performs the action written to the right of that pattern. Actions typically involve operations such as transforming the matched string, returning a token to a parser, or compiling statistics on the input.

The simplest action contains no statements at all. Input text that matches the pattern associated with a null action is ignored. A sequence of characters that does not match any pattern in the rules section is written to the standard output without being modified in any way. To cause lex to generate a lexical analyzer that prints everything in the input text with the exception of the word ``orange'', which is ignored, the following rules section is used:

   %%
   orange  ;
Note that there must be some white space (spaces or tabs) between the pattern and the semicolon.
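
The effect of this rule can be pictured with a short, hand-written C sketch. It is not the code that lex generates; drop_orange is a hypothetical helper that only mimics the behavior described above: text matching ``orange'' is discarded (the null action), and every other character is copied through unchanged (the default for unmatched input).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not lex output: copies `in` to `out`, dropping every
   occurrence of "orange". `out` must have room for strlen(in) + 1 bytes. */
void drop_orange(const char *in, char *out)
{
    static const char word[] = "orange";
    const size_t wlen = sizeof word - 1;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (strncmp(in, word, wlen) == 0) {
            in += wlen;        /* null action: the matched text is ignored */
        } else {
            *out++ = *in++;    /* no rule matches: the character is copied */
        }
    }
    *out = '\0';
}
```

Given the input ``an orange cat'', the sketch removes the word but leaves the blanks on either side of it. As with the lex rule, ``orange'' is also removed when it appears inside a longer word such as ``oranges''.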

You may want to print out a message noting that a string of text was found, or a message transforming the text in some way. To recognize the expression ``Amelia Earhart'', the following rule can be used:

   "Amelia Earhart"   printf("found Amelia's bookcase!\n");
To replace a lengthy medical term with its acronym, a rule such as this is called for:

   Electroencephalogram    printf("EEG");
In order to count the lines in a text file, the analyzer must recognize end-of-lines and increment a counter. The following rule is used for this purpose:

   %{
   int lineno=0;
   %}
   %%
   \n   lineno++;
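
This rule only counts; nothing in it reports the total. Because the newline now has an action of its own, it is also no longer copied to the output, while all other characters still are. A hand-written C sketch of that behavior follows; count_lines is a hypothetical helper used for illustration, not code produced by lex.

```c
#include <assert.h>
#include <stddef.h>

static int lineno = 0;    /* same counter name as in the lex definitions */

/* Hypothetical helper, not lex output: counts newlines in `in` and copies
   everything else to `out`, which must hold the whole input plus a NUL. */
void count_lines(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (*in == '\n')
            lineno++;          /* the rule's action replaces the default copy */
        else
            *out++ = *in;      /* unmatched characters pass through */
    }
    *out = '\0';
}
```

A real specification would typically print lineno once the input is exhausted, for example from the routine that calls the analyzer.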


NOTE: If an action consists of two or more C statements spread over two or more lines, the code must be enclosed in curly braces, '{' and '}'.

yytext, yyleng

When a character string matches some pattern in the lex specification, it is stored in a character array called yytext. The contents of this array may be operated on by the action associated with the pattern: it can be printed or manipulated as necessary. lex also provides a variable yyleng, which gives the number of characters matched by the pattern.

For example, the following rule directs the lexical analyzer to count the digit strings in an input text, printing the running total and the text of each string as soon as it is found:

   %{
   int digstringcount=0;
   %}
   %%
   [-+]?[0-9]+     {
                           digstringcount++;
                           printf("%d %s\n",digstringcount,yytext);
                   }
This specification matches digit strings with or without a leading sign: the ``?'' makes the preceding ``[-+]'' (a minus or a plus sign) optional, so negative strings and positive strings, signed or unsigned, are all recognized.
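
To make the roles of yytext and yyleng concrete, the following hand-written C sketch imitates the rule without using lex. The names match_text, match_len, match_digit_string, and count_digit_strings are hypothetical stand-ins chosen for this illustration; the 64-byte buffer is likewise an assumption of the sketch, not a limit taken from lex.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for lex's yytext and yyleng. */
static char match_text[64];
static int  match_len;

/* Tries to match [-+]?[0-9]+ at the start of s. On success, stores the
   matched text and its length and returns 1; otherwise returns 0.
   Sketch limitation: a match is cut off at 63 characters. */
static int match_digit_string(const char *s)
{
    int n = 0;
    int first_digit;

    if (s[n] == '-' || s[n] == '+')
        n++;                               /* optional sign */
    first_digit = n;
    while (isdigit((unsigned char)s[n]) && n < (int)sizeof match_text - 1)
        n++;                               /* one or more digits */
    if (n == first_digit)
        return 0;                          /* no digits: not a match */
    memcpy(match_text, s, (size_t)n);
    match_text[n] = '\0';
    match_len = n;
    return 1;
}

/* Scans input left to right, running the rule's action on every match and
   copying unmatched characters to standard output. Returns the count. */
int count_digit_strings(const char *input)
{
    int digstringcount = 0;

    assert(input != NULL);
    while (*input != '\0') {
        if (match_digit_string(input)) {
            digstringcount++;
            printf("%d %s\n", digstringcount, match_text);
            input += match_len;            /* skip past the matched text */
        } else {
            putchar(*input++);             /* default: echo the character */
        }
    }
    return digstringcount;
}
```

Called on ``a -12 b +7 c 345'', the sketch finds three digit strings; after the last match, match_text holds ``345'' and match_len is 3.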

ECHO

The macro ECHO is a shorthand way of printing out the text of the token. The two rules in the next example have the same effect:

   Jim|James       { ECHO; }
   Jim|James       { printf("%s",yytext); }

The following lex specification draws together several of the points discussed previously.

  1 %{
  2 int subprogcount = 0;
  3 int gstringcount = 0;
  4 %}
  5 %%
  6 -[0-9]+                printf("negative integer\n");
  7 "+"?[0-9]+             printf("positive integer\n");
  8 -0\.[0-9]+             printf("negative real number, no whole number part\n");
  9 rail[ ]+road           printf("railroad is one word\n");
 10 crook                  printf("Here's a crook!\n");
 11 function               subprogcount++;
 12 G[a-zA-Z]*             {
 13                        printf("may have a G word here: %s\n", yytext);
 14                        gstringcount++;
 15                        }
The first three rules (lines 6-8) recognize negative integers, positive integers, and negative real numbers between 0 and -1. The fourth rule (line 9) matches cases where one or more blanks intervene between the two syllables of the word ``railroad''. The fifth rule (line 10) matches the word ``crook'' and prints a warning. The rule recognizing ``function'' (line 11) increments a counter. The last rule (lines 12-15) illustrates a multiline action and the use of yytext.
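
One point the listing leaves implicit is how the analyzer chooses among lines 6-8 when more than one pattern could match, as with ``-0.25''. lex prefers the longest match and, when two matches are equally long, the rule listed first. The following hand-written C sketch models that choice for these three rules only; winning_rule and its helper functions are hypothetical names used for illustration and are not part of lex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of the run of digits at the start of s. */
static size_t digit_run(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Line 6: -[0-9]+  (returns the match length, or 0 for no match) */
static size_t match_negative_integer(const char *s)
{
    size_t d;
    if (s[0] != '-')
        return 0;
    d = digit_run(s + 1);
    return d ? 1 + d : 0;
}

/* Line 7: "+"?[0-9]+ */
static size_t match_positive_integer(const char *s)
{
    size_t sign = (s[0] == '+') ? 1 : 0;
    size_t d = digit_run(s + sign);
    return d ? sign + d : 0;
}

/* Line 8: -0\.[0-9]+ */
static size_t match_negative_fraction(const char *s)
{
    size_t d;
    if (s[0] != '-' || s[1] != '0' || s[2] != '.')
        return 0;
    d = digit_run(s + 3);
    return d ? 3 + d : 0;
}

/* Returns the listing line (6, 7, or 8) of the rule that wins at the start
   of s and stores the match length in *len; returns 0 if none matches. */
int winning_rule(const char *s, size_t *len)
{
    static const int lines[3] = { 6, 7, 8 };
    size_t lens[3];
    int i, best = -1;

    assert(s != NULL && len != NULL);
    lens[0] = match_negative_integer(s);
    lens[1] = match_positive_integer(s);
    lens[2] = match_negative_fraction(s);
    for (i = 0; i < 3; i++) {
        /* Only a strictly longer match wins, so an earlier rule keeps a tie. */
        if (lens[i] > 0 && (best < 0 || lens[i] > lens[best]))
            best = i;
    }
    *len = (best < 0) ? 0 : lens[best];
    return (best < 0) ? 0 : lines[best];
}
```

For ``-0.25'' the sketch selects line 8, which matches all five characters, over line 6, which could match only ``-0''.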

© 2003 Caldera International, Inc. All rights reserved.
SCO OpenServer Release 5.0.7 -- 11 February 2003