meta characters in regular expressions

clarification and correction from class today (21-mar-2005)

 

The main confusion in class today (on my part, and therefore probably on yours...) relates to the way the following three "meta-characters" are handled:

  1. dot (.):
    The dot (.) acts as a placeholder for any single character. For example, the regular expression A. matches anything that contains an A followed by a single character. (The A can be followed by more than a single character, but there must be at least one character after the A for the expression to match.)

  2. asterisk (*):
    The asterisk (*) refers to the character (or regular expression) immediately preceding it and looks for matches where there are zero or more of that character (or regular expression). For example, the regular expression A* matches anything that contains 0 or more A's. Since 0 A's is enough, A* on its own matches every line, even one with no A in it.

  3. plus (+):
    The plus (+) is treated like the asterisk in that it refers to the character (or regular expression) immediately preceding it. It looks for matches where there are ONE or more of the character (or regular expression) that immediately precedes the plus. For example, the regular expression A+ matches anything that contains 1 or more A's.
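As a quick check of the three descriptions above, here is a minimal sketch in Python. Python is used only for illustration (the class scripts used egrep and perl); the assumption is that `re.search`, like egrep, reports a match when the pattern occurs anywhere in the string. The sample strings are made up for this example.

```python
import re

def contains(pattern, text):
    """Return True if the pattern matches anywhere in text (egrep-style)."""
    return re.search(pattern, text) is not None

# dot: the A must be followed by at least one character
dot_results = [contains(r"A.", s) for s in ["AB", "ABC", "A"]]

# asterisk: zero A's is enough, so even a string with no A matches
star_results = [contains(r"A*", s) for s in ["AAA", "B", ""]]

# plus: at least one A is required
plus_results = [contains(r"A+", s) for s in ["AAA", "A", "B"]]

print(dot_results)   # [True, True, False]
print(star_results)  # [True, True, True]
print(plus_results)  # [True, True, False]
```

The only difference between the last two lists is the string with no A: the asterisk accepts it, the plus does not.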

Below is a table containing a bunch of examples.

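Note that these were all tested using this input file and either this sh script or perl script. The sh script uses egrep (extended grep). Plain old grep uses basic regular expressions, in which the plus (+) is an ordinary character rather than a meta-character, so the examples that use + would not work as described.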

 

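Here is the table of examples. The first row lists the 13 lines of the input file; each numbered entry then gives a regular expression followed by the input lines it matches:

 | AE | ABE | ABBE | A | E | B | AAE | AEE | AAAE | AAEEE | BE | BEE | BBEE |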

1. AE  AE                  AAE   AEE   AAAE   AAEEE           
matches lines that contain an A immediately followed by an E

2. A.E     ABE               AAE   AEE   AAAE   AAEEE           
matches lines that contain an A followed by any single character followed by an E

3. A*E  AE   ABE   ABBE      E      AAE   AEE   AAAE   AAEEE   BE   BEE   BBEE  
matches lines that contain 0 or more A's followed by an E
Note that "ABE" and "ABBE" match for the same reason that "BE", "BEE" and "BBEE" match.
The important thing is the E -- in these cases, the A doesn't matter.
The match occurs because we have 0 A's preceding the E.

4. AE*  AE   ABE   ABBE   A         AAE   AEE   AAAE   AAEEE           
matches lines that contain an A followed by 0 or more E's
As above, "ABE" and "ABBE" match because of the A.
These words match with an A immediately followed by 0 E's.
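To see where the match actually lands in examples 3 and 4, here is a minimal sketch in Python. Python is used only for illustration (the class scripts used egrep and perl); `re.search` also looks for the pattern anywhere in the line. It records the piece of each line that the pattern matched.

```python
import re

# Example 3: in ABE, ABBE and BE the match is just the E (0 A's before it).
star_a = {line: re.search(r"A*E", line).group()
          for line in ["ABE", "ABBE", "BE", "AAE"]}

# Example 4: in ABE and ABBE the match is just the A (0 E's after it).
star_e = {line: re.search(r"AE*", line).group()
          for line in ["ABE", "ABBE", "A", "AEE"]}

print(star_a)  # {'ABE': 'E', 'ABBE': 'E', 'BE': 'E', 'AAE': 'AAE'}
print(star_e)  # {'ABE': 'A', 'ABBE': 'A', 'A': 'A', 'AEE': 'AEE'}
```

Only in "AAE" and "AEE" does the starred character actually take part in the match; in the other lines it is matched zero times.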

5. A.*E  AE   ABE   ABBE            AAE   AEE   AAAE   AAEEE           
matches lines that contain an A followed by 0 or more of any character followed by an E

6. A*.E  AE   ABE   ABBE            AAE   AEE   AAAE   AAEEE   BE   BEE   BBEE  
matches lines that contain 0 or more A's followed by a single character followed by an E
Note that "AE" matches for the same reason that "BE" matches.
"BE" contains 0 A's before the E -- B is the single character.
In "AE", the A is counted as the single character and there are 0 A's before it.
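The way "AE" gets split up in example 6 can be checked with capture groups. This is a minimal sketch in Python (used only for illustration; the class scripts used egrep and perl). The parentheses are added here just to record what A* and the dot each matched; they do not change which lines match.

```python
import re

# Same pattern as example 6, with groups around A* and the dot.
pattern = re.compile(r"(A*)(.)E")

# For each line: (what A* matched, what the dot matched)
splits = {line: pattern.search(line).groups() for line in ["AE", "BE", "AAE"]}

print(splits["AE"])   # ('', 'A')  -> 0 A's, then A as the single character
print(splits["BE"])   # ('', 'B')  -> 0 A's, then B as the single character
print(splits["AAE"])  # ('A', 'A') -> one A, then A as the single character
```

In "AE" the A* first tries to take the A, but then nothing is left for the dot and the E, so it gives the A back and matches 0 A's instead.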

7. A.E*  AE   ABE   ABBE            AAE   AEE   AAAE   AAEEE           
matches lines containing an A followed by a single character followed by 0 or more E's
Note that "AE" matches because the E is counted as the single character, followed by 0 E's.
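Examples 5 and 7 match exactly the same lines from the input file, but the two expressions are not equivalent. Here is a minimal sketch in Python (used only for illustration; the class scripts used egrep and perl). The line "AB" is not in the class input file; it is added here as an assumption, just to separate the two patterns.

```python
import re

lines = ["AB", "ABBE", "AE", "A"]

# Example 5: A, then 0 or more of any character, then an E
any_then_e = [line for line in lines if re.search(r"A.*E", line)]

# Example 7: A, then exactly one of any character, then 0 or more E's
one_then_es = [line for line in lines if re.search(r"A.E*", line)]

print(any_then_e)   # ['ABBE', 'AE']
print(one_then_es)  # ['AB', 'ABBE', 'AE']
```

A.*E needs an E somewhere after the A, while A.E* only needs some character after the A, which is why "AB" shows up in the second list only.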

8. A*E.                       AEE      AAEEE      BEE   BBEE  
matches lines containing 0 or more A's followed by an E followed by a single character

9. A+E  AE                  AAE   AEE   AAAE   AAEEE           
matches lines containing one or more A's followed by an E

10. AE+  AE                  AAE   AEE   AAAE   AAEEE           
matches lines containing an A followed by one or more E's

11. A.+E     ABE   ABBE            AAE   AEE   AAAE   AAEEE          
matches lines containing an A followed by one or more of any character followed by an E
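The whole table can be reproduced with a short script. This is a minimal sketch in Python, not the sh or perl script used in class; it assumes the input file holds the 13 lines shown in the first row of the table, and it uses `re.search` to find the pattern anywhere in a line, as egrep does.

```python
import re

# Assumed contents of the input file, taken from the first row of the table.
INPUT_LINES = ["AE", "ABE", "ABBE", "A", "E", "B", "AAE",
               "AEE", "AAAE", "AAEEE", "BE", "BEE", "BBEE"]

# The 11 regular expressions from the table, in order.
PATTERNS = ["AE", "A.E", "A*E", "AE*", "A.*E", "A*.E",
            "A.E*", "A*E.", "A+E", "AE+", "A.+E"]

def matching_lines(pattern, lines):
    """Return the lines that contain a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

table = {pattern: matching_lines(pattern, INPUT_LINES) for pattern in PATTERNS}

for number, pattern in enumerate(PATTERNS, start=1):
    print(f"{number:2}. {pattern:5} {' '.join(table[pattern])}")
```

Changing PATTERNS or INPUT_LINES is an easy way to try out other combinations of the three meta-characters.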