Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
CSI4104 Compiler Design, Document Version 1.3
1 Introduction In the first assignment we will implement a scanner by hand. This scanner will read a character stream from a MiniC source file and translate it to a sequence of MiniC tokens. This is known as the lexical analysis part of a compiler. The reason for writing the scanner by hand is that this experience will allow you to thoroughly understand the underlying principles of scanner construction. Based on this understanding, implementing a scanner using scanner generators like JLex or JavaCC will be straight-forward to you. To keep the effort reasonable, you are provided with a skeleton scanner that you can extend.
Before you start, you should carefully read the MiniC Language Specification (also available in LearnUs). The MiniC Language Specification explains in detail about the characters, tokens and comments that we want our MiniC scanner to recognize. Please note that at this stage it is not necessary to fully understand the MiniC grammar; the grammar is only provided to give you an initial idea how MiniC tokens will be composed to form MiniC programs (more on that topic in Assignment 2 on Syntax Analysis). 2 Environment You can solve this assignment on our server (csi4104.cs.yonsei.ac.kr) or any computer that has a Java compiler and a Java virtual machine (JVM) installed. We provide a VirtualBox Linux image that contains the development environment. You can run this image locally on your computer as described in our VirtualBox_HowTo guide provided in the ``Resources’’ folder on LearnUs.
Please note: • All assignments will be tested and marked on the server. • It is mandatory to submit your code on the server (as described in Section 9 below.) • It is highly recommended that you test your programs on the server before submitting. • Programs that do not compile on the server or exhibit other problems such as entering infinite loops, crashing ... will without exception receive zero score. • TAs are not allowed to hand-edit submissions to make them compile-able or fix other problems in the submitted code. 2
3 Code structure of the MiniC compiler For this and all subsequent assignments, we’ll use the directory structure on the following page.
Every Java component of the MiniC compiler that you will develop during this semester will reside in its own Java package. All packages are sub-packages of the MiniC package. The MiniC package resides in directory MiniC, and all sub-packages reside in subdirectories of directory MiniC.
We provide you with a skeleton compiler that you can extend to support complete lexical analysis of MiniC source code. You will find the skeleton compiler on the server in the archive file /opt/ccugrad/Assignment1/CSI4104_Assignment_1.tgz. Copy this archive to your home-directory and untar it (e.g., in the shell, type tar xvzf CSI4104_Assignment_1.tgz). This will create the above directory structure in your home-directory. The directory structure is already populated with the files provided with the skeleton compiler. In the following description, all paths are relative to the top-level directory CSI4104_Assignment_1:
▪ MiniC/MiniC.java: this is the main program (driver) of our MiniC compiler. ▪ MiniC/Scanner/Scanner.java: a skeleton scanner to be completed by you. ▪ MiniC/Scanner/Token.java: definitions of all MiniC tokens, and a method for separating keywords from identifiers. ▪ MiniC/Scanner/SourceFile.java: handling of source files. ▪ MiniC/Scanner/SourcePos.java: a class to handle source positions (line and column numbers) of tokens. ▪ build.gradle, gradle, gradlew, gradle.bat, build: files and directories related to the Gradle open-source build automation system, which we use to compile the MiniC compiler (see below instructions on how to compile). MiniC Scanner Parser ASTGen SemanticAnalysis CodeGen solutions tst base testcases CSI4104_Assignment_[1-5] 3 ▪ MiniC/Scanner/tst/base/testcases: several testcases that you can use to test your scanner. You are encouraged to come up with more testcases to test your scanner in every possible way. ▪ MiniC/Scanner/tst/base/solutions: solutions for the provided testcases. Your completed scanner shall produce the output as shown in the solutions. Please note that this assignment gives some freedom with respect to lexical errors (see Section ‘Lexical errors’ below). Testcases containing lexical errors need not correspond byte-by-byte to the provided solutions, as long as they conform to the provided guidelines. ▪ MiniC/scripts/scannertest.sh: this is a script that you can use to automatically test your scanner. The script will run your scanner on every testcase in the Scanner/tst/base/testcases directory and compare the output to the corresponding solutions in Scanner/tst/base/solutions. Please note: this script is provided to you to help debugging your scanner. The second purpose is to show you how large testing tasks can be automated by scripts. For the remaining assignments, you will have to create such scripts by yourself (e.g., by adapting the scanner testscript). Please read the comments at the beginning of the script file so that you understand how to run the script. The script is provided as is, without guarantees about its functionality and correctness. Please use at your own risk, and do make sure that you also test your scanner manually, to have sufficient confidence that your scanner works correctly.
Note: for this assignment it is only necessary to modify file Scanner.java. You must not modify Token.java or SourcePos.java, otherwise automated testing of your scanner might no longer work. The compiler driver in MiniC.java will repeatedly call your scanner on an input file specified on the command line. With each call, your scanner is required to return the next token from the input file.
File Token.java defines a class Token, which we use to represent MiniC tokens. For every token in the input stream, your scanner is expected to provide the corresponding Token object. The constructor of the Token class expects three pieces of information: 1) the token kind, represented as an int (e.g., Token.ID, TOKEN.INTLITERAL); 2) the lexeme of the token, represented as a string (e.g., “counter”, “22”) 3) the position of the token in the program. Positions are represented as objects of class SourcePos. Class Token makes it easy to distinguish keywords from identifiers: initially the scanner classifies all keywords as identifiers. The constructor of class Token compares the lexeme of the presumptive identifier with the list of MiniC keywords. In case the presumptive identifier is contained in the list of MiniC keywords, the constructor will change the token kind accordingly (see file Token.java).
The class SourcePos from file SourcePos.java contains four instance variables: StartCol, EndCol, StartLine and EndLine. These variables are used to record the 4 position of a token within a MiniC program. In MiniC, line and column numbers start from 1. Variable StartCol denotes the column where a token starts, variable EndCol denotes the column where a token ends, and StartLine and EndLine denote the line number where the token occurred (Note: in MiniC, no token can span multiple lines, therefore StartLine=EndLine for every token.).
With string literals, StartCol and EndCol are set to the occurrences of the enclosing quotes: “Hi there, this is a string token...\n” ^ ^ | | StartCol EndCol
If a string contains illegal escape sequences (i.e., an escape sequence other than \n), it is up to you to decide on the lexeme for this string (see also the section on lexical errors below). However, you must set StartCol and EndCol accordingly.
When the end of file is reached, the parser returns an end-of-file token (Token.EOF). This has already been implemented for you in Scanner.java.
Error Token: Token.ERROR is used like all other tokens: its spelling and position is set according to the input. For example, if the character @ appears in the input, the scanner will create a Token object using new Token (Token.ERROR, “@”, src_pos), where src_pos specifies the position of character @ in the input. 4 Example Let us assume the following MiniC input program (starting at line 1, column 1 of the input source file):
int x; x = 1.4e+2;
Then your scanner should deliver the following sequence of tokens:
5 Longest match It is important for our MiniC scanner to always consume the longest possible match (maximal munch). The examples in the following table should point that out.
6 Input Token(s) Comments >= ``>=’’ One token GREATEREQ, rather than two tokens GREATER and ASSIGN. == ``==’’ One token EQ instead of two tokens ASSIGN. // An end-of-line comment, rather than two tokens ``/’’ and ``/’’. Note: the scanner throws away comments. So everything starting from // to the end of the line should be consumed and thrown away. @ Token.ERROR 2.2+2 ``2.2’’ ``+’’ `2’’ Token.FLOATLITERAL (``2.2’’), Token.PLUS (``+’’), Token.INTLITERAL (``2’’) 2.2e+2 ``2.2e+2’’ Token.FLOATLITERAL (``2.2e+2’’) 2.2e+ 2 ``2.2’’ ``e’’ ``+’’ ``2’’ Token.FLOATLITERAL (``2.2’’), Token.ID (``e’’), Token.PLUS (``+’’), Token.INTLITERAL (``2’’)
The scanner should discard whitespace and comments. The scanner must always return an object of class Token as defined in file Token.java. 6 Lexical errors A MiniC scanner should detect four kinds of lexical errors: 1) illegal characters, i.e., characters that cannot be part of any token, 2) un-terminated comments, 3) un-terminated strings, 4) illegal escape sequences in strings, for example ``\y’’.
In Case (1), your scanner should simply return an error token. Please refer to the introduction of the error token above regarding the lexeme that you should provide with the error token. Your scanner must “consume” the lexeme belonging to the error token, so that scanning can proceed when your scanner is called next time.