Regular Expressions in perl

For those not familiar with perl patterns, aka "regular expressions", here is a brief synopsis. To fully understand how things match, you also need to know that each line of the index files holds the tune title twice: once in a "canonical" upper-case form with all non-letters dropped, and again in the full original form. Between the two are the URL and the X: index. Patterns can take advantage of this order. Anyway, here is how patterns work:

.*

The most useful pattern element is .*, which matches anything at all (including nothing). So early.*morn will match any line in which early is followed, anywhere later, by morn, in any capitalization. Since the title is in each line twice, this will also match titles with morn before early: the early is found in the first copy of the title and the morn in the second. Most of the time, this is the only pattern element you will need.
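The effect of the doubled title can be tried out with a small sketch. It uses Python's re module as a stand-in for perl, since .* behaves the same way in both. The index_line helper, the single-space separators and the example.org URL are assumptions made up for illustration; the real index file layout may differ.

```python
import re

def index_line(title, url, x_index):
    """Build a hypothetical index line: canonical title, URL, X: index, full title.

    The single-space separators are an assumption, not the real file format.
    """
    # Canonical form: drop all non-letters, then upper-case what is left.
    canonical = re.sub(r"[^A-Za-z]", "", title).upper()
    return f"{canonical} {url} {x_index} {title}"

def found(pattern, line):
    """True if the pattern matches anywhere in the line, ignoring capitalization."""
    return re.search(pattern, line, re.IGNORECASE) is not None

url = "http://example.org/tunes.abc"  # placeholder URL, not a real tune file
in_order = index_line("Early One Morning", url, 12)
reversed_order = index_line("Morning Comes Early", url, 13)

print(in_order)
print(found("early.*morn", in_order))               # early ... morn within one copy
print(found("early.*morn", reversed_order))         # EARLY in copy 1, Morn in copy 2
print(found("early.*morn", "Morning Comes Early"))  # bare title only: no match
```

The last call shows that the reversed title matches only because the line carries the title twice.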

Letters, digits and spaces

These represent themselves, as literal characters. The index files contain only spaces, not tabs. Note again that we ignore capitalization.

Metacharacters

These don't represent themselves, but stand for some special match. Examples are:

.: represents any single character. Thus the pattern de.il would match strings such as devil, de'il, de il, that is, any occurrence of de and il separated by exactly one character.
[...]: matches any single character from a list. Thus [abcd] matches any single character a, b, c or d. As a special feature, - between two characters means to match the entire range, so [A-Z] will match any single upper-case letter, and [0-9] will match any single digit. You can include ] in the list by putting it first, so [][] will match either of the bracket characters. A - can likewise be included by putting it first. Or either may be preceded by \, so [\-\]] will match a hyphen or a right bracket. The character \ is special inside [...], and is described below.
*: means any number (zero or more) of the preceding item. Thus ab*c will match ac, abc, abbc, and so on. [A-Za-z]* will match a string of zero or more letters.
+: means one or more of the preceding item. Thus ab+c will match abc, abbc, and so on, but it will not match ac.
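The metacharacters above can be checked with a short sketch. Python's re module is used here as a stand-in for perl, as these particular elements work the same way in both. The found helper and the sample titles are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text, ignoring capitalization."""
    return re.search(pattern, text, re.IGNORECASE) is not None

checks = [
    (r"de.il", "The Devil's Dream"),        # . stands for the v
    (r"de.il", "De'il Among the Tailors"),  # . stands for the apostrophe
    (r"de.il", "deil"),                     # nothing between de and il: no match
    (r"[0-9]", "Opus 7"),                   # any single digit
    (r"[][]", "Reel [version 2]"),          # ] placed first is a literal bracket
    (r"[\-\]]", "up-tempo"),                # escaped hyphen inside [...]
    (r"ab*c", "ac"),                        # * allows zero b's
    (r"ab+c", "ac"),                        # + needs at least one b: no match
    (r"ab+c", "abbc"),
]

for pattern, text in checks:
    print(f"{pattern:10} {text:26} {found(pattern, text)}")
```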

Escaped symbols

If preceded by \ (back-slash), letters have special meaning, and extend the list of special sorts of matches. Here are some of the more important escape sequences:

\s: matches any non-printing ("white space") character, such as space, tab, and the CR and LF line separators. Since the ABC index files contain spaces but not tabs, this is of limited use.
\w: matches "alphanumeric" characters, letters and digits, and the _ for obvious computing reasons. It is shorthand for [A-Za-z0-9_].
\b: matches a "word boundary". That is, it matches only if there is one of the \w characters on one side and not on the other. So \blow will match low or lowly, but not below.
\$: matches a single $. Not too useful here.
\.: matches a single dot.
\\: matches a single \ (backslash).

In general, \ before a non-alphanumeric character cancels any special meaning of that character, and causes an exact match. You should not use \ before a letter or digit unless you know the special meaning of that sequence, because the result will usually not match sensibly.
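A few of these escapes in action, again as a sketch in Python's re module rather than perl itself (the escapes shown behave the same in both). The found helper and the sample strings are made up for illustration.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text, ignoring capitalization."""
    return re.search(pattern, text, re.IGNORECASE) is not None

# \b: low must start at a word boundary.
print(found(r"\blow", "Low Down"))   # True
print(found(r"\blow", "lowly"))      # True
print(found(r"\blow", "below"))      # False: the l follows a letter

# \. matches only a real dot, while a bare . matches any character.
print(found(r"stanford.edu", "stanfordxedu"))       # True
print(found(r"stanford\.edu", "stanfordxedu"))      # False
print(found(r"stanford\.edu", "www.stanford.edu"))  # True

# \s and \w: one white-space character followed by one or more word characters.
print(found(r"jig\s\w+", "Jig 42"))  # True

# \$ matches a literal dollar sign.
print(found(r"\$", "cost: $5"))      # True
```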

Groupings

The perl pattern match allows use of parentheses to surround a chunk of the pattern, marking it for later use. This is of limited use with this tune matching service, but there is one situation where it is useful: When combined with the | symbol, which means "or", you can give alternatives. Thus the pattern Charl(ie|ey|es) will match Charlie, Charley, or Charles. This pattern could also be written Charl(ie|e(y|s)). Or you could just use charl.*.
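The three spellings of the pattern can be compared side by side. This is a sketch in Python's re module, whose grouping and | syntax match perl's for these cases. The list of names is made up, with Charlotte added to show that charl.* is looser than the grouped forms.

```python
import re

flat = re.compile(r"Charl(ie|ey|es)", re.IGNORECASE)
nested = re.compile(r"Charl(ie|e(y|s))", re.IGNORECASE)
loose = re.compile(r"charl.*", re.IGNORECASE)

names = ["Charlie", "Charley", "Charles", "Charlotte"]

# For each name, record whether each of the three patterns matches it.
results = {
    name: (
        flat.search(name) is not None,
        nested.search(name) is not None,
        loose.search(name) is not None,
    )
    for name in names
}

for name, (is_flat, is_nested, is_loose) in results.items():
    print(f"{name:10} flat={is_flat} nested={is_nested} loose={is_loose}")
```

The two grouped forms agree on every name, while charl.* also accepts Charlotte, which may or may not be what you want.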

Examples

jenn(y|ie).*charl(ie|ey|es)
This will find all the "Jenny's Welcome To Charlie" tunes in their various variant spellings. Actually, any title with both names (in either order) will be shown, due to the repetition of the title in the index files.
stanford.edu.*de[v']*il
This takes advantage of the fact that the tune indexes have the URL before the tune's full title, and looks for entries on a Stanford University machine that have various forms of "devil" in their names.
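Both examples can be run against a few made-up index lines. As before, this is a sketch in Python's re module rather than perl. The index_line helper, the single-space layout and every URL below (including the stanford.edu host name) are invented for illustration and do not point at real tune files.

```python
import re

def index_line(title, url, x_index):
    """Hypothetical index line: canonical title, URL, X: index, full title."""
    canonical = re.sub(r"[^A-Za-z]", "", title).upper()
    return f"{canonical} {url} {x_index} {title}"

# Made-up sample entries; the URLs are placeholders.
lines = [
    index_line("Jenny's Welcome to Charlie", "http://example.org/reels.abc", 1),
    index_line("Charley's Welcome to Jennie", "http://example.org/reels.abc", 2),
    index_line("The De'il Among the Tailors", "http://abc.stanford.edu/tunes.abc", 3),
    index_line("The Devil's Dream", "http://example.org/reels.abc", 4),
]

def select(pattern):
    """Return the index lines that the pattern matches, ignoring capitalization."""
    return [line for line in lines if re.search(pattern, line, re.IGNORECASE)]

jenny_hits = select(r"jenn(y|ie).*charl(ie|ey|es)")
devil_hits = select(r"stanford.edu.*de[v']*il")

for line in jenny_hits:
    print("jenny/charlie:", line)
for line in devil_hits:
    print("stanford devil:", line)
```

The second entry has Charley before Jennie but is still selected, because JENNIE in the canonical title is followed by Charley in the full title. The fourth entry is a devil tune but is skipped by the second pattern because its URL is not on the Stanford host.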

For more details, or to learn about perl (which is the main language behind the Web), visit O'Reilly's perl web site or the Perl Institute's web site. They have full manuals online, plus the perl sources, and executables for many common computer systems, all available free. There are also several well-written books on the language, which aren't free, but are a good investment for any programmer. (I expect that few if any musicians will get this far unless they are also computer programmers. ;-)