Wiki
Clone wikiinf225public / Exercise 1
Please let me know if you find mistakes (in particular, some characters might have disappeared during Markdown formatting). Bugfixers so far: Andreas (1).
1.1 Regular Expressions
Write regular expressions for (at least some of) the following. The valid/invalid examples are list of literal strings you might paste directly into your Java or Python (or similar) program for testing your regular expressions.
-
The language of Norwegian sheep. Sheep can say
bæ
,bæææ
,bæbæbææææ
, and so on, with at least one 'æ' following a 'b'. Examples:Valid strings:
"bæ", "bæbæ", "bææææbææææbæbæ", "bæbæbæbæbæ"
Invalid strings:
"", "b", "bø", "ææææ", "æ", "bbæææ", "quack"
-
Traditional C comments. A C comment starts with
/*
and ends with*/
and cannot be nested. Java/C++-style//
-comments are not supported. Examples:Valid strings:
"/* foo */", "/** */", "/*/* /**/"
Invalid strings:
"/*", "// foo", "/* /* */ */", "/*/"
-
String literals, according to the following rules (C-like): they begin and end with a double quote ("), and any double or single quote must be escaped by a preceding backslash (\). The backslash itself must also be escaped. For example,
"foo"
,"fo\"o"
and"fo\\"
and"\\\""
are legal strings, and"ba""
,"b\jar"
are illegal.Valid strings:
"\"foo\"", "\"fo\\\"o\"", "\"\""
(i.e.,"foo"
,"fo\"o"
,""
without extra quotes)Invalid strings:
"\"", "\"f\"o\"", "\"\\\\\\\""
(i.e.,"
,"f"o"
,"\\\"
without extra quotes) -
Add the following to the string literal definition: a string may contain
\xNN
or\uNNNN
, where N is any hexadecimal digit (this would be used to insert arbitrary characters in hexadecimal ASCII or Unicode notation). -
Integer constants: an optional
+
or-
, followed by decimal digits.Valid strings:
"1234", "+1", "-0", "0"
Invalid strings:
"+-0", "+", "a", "", "0.0"
-
Numeric constants in scientific notation, consisting of:
- an optional sign (
+
or-
) - zero or more decimal digits
- a decimal point (
.
) - zero or more decimal digits
- an exponent part, consting of
- an upper or lowercase
e
- an optional sign (
+
or-
) - one or more decimal digits Addtional rules: the numbers should contain at least one digit either before or after the decimal point; and at least one of the decimal point or the exponent must be present.
- an upper or lowercase
Valid strings:
"-2.0", "2e5", "314e-2", ".0", "1."
Invalid strings:
".", "1", "1e2e4", "1e2.0", "1.+10e2"
- an optional sign (
-
Internet names and IP addresses. Find the exact syntax rules online or make something up; in general, a host or domain name has one or more labels separated by dots (
.
), possibly ending with a dot. An IPv4 address in dotted-decimal notation contains four numbers from 0-255, separated by dots (.
), and an IPv6 address contains:- Eight groups of four hexadecimal digits, separated by colons (
:
) - Leading zeros may be omitted from groups
- Multiple groups of zeros may be replaced by a double colon (
::
) – this is only legal once in an address
- Eight groups of four hexadecimal digits, separated by colons (
Extra Regular Expression Shorthands
The following shorthands are supported by many, but not all tools, and may be useful:
\s
or[:space:]
any space (incl tabs and newlines)\S
any non-space\w
any word-character (alphanumeric plus underscore)\W
any non-word-character\d
or[:digit:]
a digit\D
a non-digit
The [:foo:]
style is typically used by standard Unix tools, and only works inside character classes; e.g., [[:space:]]
. Perl, Java, Python and many other languages support the \x
convention.
1.2 Regular Expressions and Tools
- Try your regular expressions from 1.1 on different tools (see below for examples).
- Check that they actually work and match what you think they did.
- Do you have to modify your expressions to make them fit the syntax of the tool?
Matching vs Searching
There's often a distinction between matching and searching. Matching a regular expression means checking that a string conforms to the grammar specified by the regular expression. Typically the match starts at the beginning of the string; the string may or may not be allowed to have trailing characters that are not matched. Searching will try to identify one (or more) substrings that match a regular expressions.
If you want to be sure your regular expression is matched against the entire string, insert a ^
(matches beginning of string) and $
(end of string) at the beginning/end of the regular expression. Whether 'beginning' or 'end' refers to a line or the entire string depends on your tool (e.g., grep
will match lines; String.matches()
will match the entire string by default). For example:
Starts with a comment: ^//
An empty string/line: ^$
A string/line of spaces: ^ *$
Regular Expressions in Java
You can match regular expressions in Java using the String.matches()
method:
#!java assertTrue("foo".matches("fo+"));
In more complicated cases, you'll want to use the Pattern
class, which lets you precompile a regular expression (makes it faster!) and also lets you do more advanced things.
Java supports a wide range of regular expression operators.
Regular Expressions in Python
In Python, you'll find tools for regulare expressions in the re
package:
#!python import re re.match("fo+", "f")
re.match()
gives you a Match
object on success, which you can use to find out more about the match (start / end of matching region, information about subgroups (things that are in parentheses in the expression) and so on). If there is no match, you get None
.
For example, in Python 2.x, you might test a whole bunch of strings this way:
#!python for s in ["a", "foo", "bææ"]: print s, re.match("fo+", s) != None
Python also has a rich regular expression language, similar to that found in Perl.
Regular Expressions in Unix Tools
You can use grep
(1) to find matches in files:
#!bash
grep regexp file1...
This will find all (or at least most – it will be confused by comments and line breaks) productions in a Rascal grammar:
#!bash grep '^[ \t]*syntax[ \t][ \t]*[A-Z][a-zA-Z0-9_]*' MyLang.rsc
Similarly, sed
(1) will allow you to do simple text maniulations:
#!bash sed -e 's/FOO/BAR/g' < input > output
FOO
by BAR
in input
and writes the result to output
.
Both sed
and grep
support a limited subset of regular expressions (e.g., they don't support +
). The variant egrep
(1) is better, but still somewhat different from what you find in Python, Java, Perl, etc.
1.3 Grammars
Consider this very simple grammar (in Rascal syntax notation:
start syntax Program = Expr; syntax Expr = ID // variables | NUM // integers | Expr "+" Expr // addition | Expr "*" Expr // multiplication | ID "(" Expr ")" // function call | "(" Expr ")" // parentheses ;
Program
, and we define a single non-termial Expr
for expressions, with multiple alternatives (separated by |
). In addition to the literal operator symbols, we have two terminals ID
and NUM
. Their definitions are:
// identifiers // y !<< x means 'x' must not be preceeded by 'y' // x !>> y means 'x' must not by followed by 'y' // so, this means that an identifier is a sequence of one // or more letters or underscores, with no additional // letters or underscores before or after lexical ID = [a-zA-Z_] !<< [a-zA-Z_]+ !>> [a-zA-Z_]; // numbers lexical NUM = [0-9] !<< [0-9]+ !>> [0-9];
[a-zA-Z_]+
and [0-9]+
. The stuff before !<<
and after !>>
is just special Rascal notation for saying that there shouldn't be any leftover letters/digits before/after the identifier or number.
-
Write a few syntactically valid expressions. Make sure you use all the "features" of the language, including nested expressions.
-
Draw parse trees for your expressions.
-
Draw parse trees for the following expressions. Do any of the expressions have multiple parse trees? If so, draw all the trees.
a + b * 2 f(g(x)) a + b + c (a + b) * 2 (f((2)))
1.4 Parsing with Rascal
Do this if you have extra time, want to get a head start or don't have anything better to do in your weekend.
-
Follow the Rascal installation instructions.
-
With the Simple grammar still loaded in the console, try parsing all the example expressions you created. Things in the concrete object syntax (the grammar you've supplied, as opposed to the grammar for Rascal itself) are written in back-quotes (
`a + b`
). You can tell Rascal which non-terminal to expect by prefixing the concrete syntax fragment with the name of the non-termial in parentheses:
rascal>(Program)`x` Program: `x` Tree: appl(prod(sort("Program"),[sort("Expr")],{}),[appl(prod(sort("Expr"),[lex("ID")],{}),[appl(prod(lex("ID"),[conditional(iter(\char-class([range(65,90),range(95,95),range(97,122)])),{\not-precede(\char-class([range(65,90),range(95,95),range(97,122)])),\not-follow(\char-class([range(65,90),range(95,95),range(97,122)]))})],{}),[appl(regular(iter(\char-class([range(65,90),range(95,95),range(97,122)]))),[char(120)])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)] rascal>(Expr)`x` Expr: `x` Tree: appl(prod(sort("Expr"),[lex("ID")],{}),[appl(prod(lex("ID"),[conditional(iter(\char-class([range(65,90),range(95,95),range(97,122)])),{\not-precede(\char-class([range(65,90),range(95,95),range(97,122)])),\not-follow(\char-class([range(65,90),range(95,95),range(97,122)]))})],{}),[appl(regular(iter(\char-class([range(65,90),range(95,95),range(97,122)]))),[char(120)])[@loc=|stdin:///|(7,1,<1,7>,<1,8>)]])[@loc=|stdin:///|(7,1,<1,7>,<1,8>)]])[@loc=|stdin:///|(7,1,<1,7>,<1,8>)] rascal>(ID)`x` ID: `x` Tree: appl(prod(lex("ID"),[conditional(iter(\char-class([range(65,90),range(95,95),range(97,122)])),{\not-precede(\char-class([range(65,90),range(95,95),range(97,122)])),\not-follow(\char-class([range(65,90),range(95,95),range(97,122)]))})],{}),[appl(regular(iter(\char-class([range(65,90),range(95,95),range(97,122)]))),[char(120)])[@loc=|stdin:///|(5,1,<1,5>,<1,6>)]])[@loc=|stdin:///|(5,1,<1,5>,<1,6>)]
Try running unparse(...)
on the result. You should get the input string back.
Updated