Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
COMP2100/6442
Tokenizers and
Parsers?
Why do we have to
learn this?
2Introduction and Agenda
In this lab you will be building your own calculator!
This calculator will take in a mathematical string, e.g. “1 + 2”, then tokenize it and evaluate it.
To achieve this, you will need to accomplish the following two tasks:
Task 1 (non-assessable but mandatory): Complete the ‘next()’ method of Tokenizer.java
Task 2 (2 marks): Complete the three parse methods found in Parser.java
However, before going into the tasks we would like to address a common question amongst
students who first see this lab:
“Why are we learning about tokenizers and parser?”
It seems some students believe these concepts are irrelevant… The truth is: we are
surrounded by tokenizers and parsers!
3Why Learn About Tokenizers
Tokenization
Tokenization breaks up raw text into ‘tokens’ oftentimes more useful to a program than just the raw
text.
Here are just a few uses of tokenization:
1. Security (so you are not sending around raw text).
2. Machine learning (divides information into meaningful pieces).
3. Search engines (tokenizes user search input).
4. And more!
Note that in this lab, tokens are Java objects. That is, the INT token holding value ‘15’ is actually an
instance of the class Token whose type is INT and value is the string “15”.
4Why Learn About Parsers
Parsing
The general definition of parsing is: “analysing a string of symbols according to the rules of a formal grammar” (from Wikipedia).
An alternative (which is more alike what we will be doing in this lab) is:
Parsing takes in a series of symbols (tokens in our case) conforming to a grammar and separates them into more easily
processed components.
Here are just a few uses of parsing:
1. Compilers (oftentimes used to breaks data into small components but has more uses in compilers).
2. Interpreters (parses the source code).
3. Webpages (html and xml is parsed)
4. And more!
In the task 2 slides, we have provided a step by step explanation of the parsing process to help you understand how this all
works.
5Reinventing the Calculator
One day, our alien friend set out to make their own calculator. In the process
however, they would need to learn about tokenizers and parsers.
I am sick of these old fashioned
calculators. I want a calculator
with emoji addition and
multiplication.
I want a calculator that can
compute “happy emoji + sad
emoji = neutral”.
First, try and get your
calculator to do
simple arithmetic,
then consider emojis.
I guess you are right… It must
be able to do simple math first.
6Task 1: Tokenizing Input
Your goal in this task is to complete the ‘next()’ method inside Tokenizer.java.
You can view tokenizing as simply a matter of pattern matching:
- ‘+’ returns Token.Type.ADD with value ‘+’
- ‘-’ returns Token.Type.SUB with value ‘-’
- ‘123’ returns Token.Type.INT with value ‘123’
- IllegalTokenException if the character does not match any token type.
For example: ‘1 + 2’ becomes: INT(1) ADD INT(2).
To help you both see this in action and test your code, we have provided a main
method in Tokenizer.java which will output your tokenized input! I can put in
anything for it to
tokenize! (with
some exceptions).
Task 1 Complete
7
I am truely, an
entrepreneur...
8Task 2: Parsing the Tokens
You must complete task 1 before you can complete task 2.
The overall objective of this task is to implement a parser for the following grammar:
::= | + | -
::= | * | /
::= | !
::= | ( )
This takes the form of completing the three parse methods within Parser.java. If any series of tokens provided does not conform to the
grammar, you must return an IllegalProductionException.
For a better understanding of what is happening here, we have provided a step by step explanation of the parsing process that you can
follow in the next couple slides.
Note that the symbols: “exp”,”term”,”factor” and “coefficient” in this grammer don’t necessarily correspond with their formal meanings in
mathematics.
9Task 2: Step by Step Explanation of Parsing (Part 1)
Let’s first define three things:
1. Grammar: set of (recursive) rules used to generate patterns of strings.
2. Non-terminal symbol: any symbol that can be replaced with other symbols according to a production rule.
3. Terminal symbol: any symbol which cannot be replaced with other symbols (has no production rule).
In our case, is a non-terminal symbol but the string ‘12’ is a terminal symbol. Notice as well, that the production rule governing is
recursive as it can create another symbol.
For something to be parsed, it must conform to the grammar the parse is built for. So let’s see how we can check if a string conforms to a
particular grammar. Let’s say you are given the following string: ‘12 + 1 * 2’.
How can we use the grammar below to make this string (starting with exp)?
::= | + | -
::= | * | /
::= ! |
::= | ( )
Try this out yourself, the next slide will give you the answer.
Pen and paper might
help here!
10
Task 2: Step by Step Explanation of Parsing (Part 2)
‘12 + 1 * 2’ can be generated using our grammar in these steps:
1.
2. + (using rule 1)
3. + (using rule 2 and 1)
4. + * (using rule 3 and 2)
5. 12 + * (using rule 4,3 and 2)
6. 12 + 1 * (using rule 4 and 3)
7. 12 + 1 * 2 (using rule 4)
At step 7 we no longer have any non-terminal symbols so the process finishes. We were able to use the
grammar to form the string. What this tells us is that the string conforms to our grammar and can be parsed.
11
Task 2: Step by Step Explanation of Parsing (Part 3)
Now that we know the string, ‘12 + 1 * 2’, conforms to our grammar, we will attempt to tokenize it
and parse it.
The tokenization process (as we can see with our task 1 tokenizer) provides us with the tokens:
INT(‘12’) ADD INT(‘1’) MUL INT(‘2’)
Using these tokens, we can form what is called a ‘parse tree’. In our parse tree, operations connect
two values (e.g. ‘+’ connects two integers).
Try to form a parse tree from the above series of tokens yourself. The next slide will give you the
answer. Attempting is part of
the learning process.
So no peeking until
you are done!
12
Task 2: Step by Step Explanation of Parsing (Part 4)
The series of tokens below form the given parse tree.
INT(‘12’) ADD INT(‘1’) MUL INT(‘2’)
An important note here is that when evaluating a parse tree, the lower operations will be
evaluated first. Thus, MUL will be evaluated before ADD.
ADD
INT(‘12’) MUL
INT(‘1’) INT(‘2’)
13
Task 2: Step by Step Explanation of Parsing (Part 5)
Now let us consider what this parsing would actually look like in code.
In task 2 you are provided with many classes, a few of which will be useful in the
next step (the inputs are provided as well):
- AddExp(Term, Exp) returned by parseExp() according to the production rule.
- MulExp(Factor, Term) returned by parseTerm() according to the production rule.
- IntExp(Integer) returned by parseCoefficient() according to the production rule.
Try to use the above to parse the tokens: INT(‘12’) ADD INT(‘1’) MUL INT(‘2’). Keep
in mind that we start from parseExp() and build from there.