The STAT GU4206/GR5206 Midterm is open notes, open book(s), and open computer; online resources are
allowed. Students are required to be physically present during the exam. The TA/instructor will be available
to answer questions during the exam. Students are not allowed to communicate with any other people
regarding the exam with the exception of the instructor (Gabriel Young) and course TAs. This includes
emailing fellow students, using WeChat and other similar forms of communication. If there is any suspicion of
one or more students cheating, further investigation will take place. If students do not follow the guidelines,
they will receive a zero on the exam and potentially face more severe consequences. The exam will be posted
on Canvas at 10:05AM. Students are required to submit both the .pdf (or .html) and .Rmd files on Canvas by
12:40PM. If students fail to knit the pdf or html file, the TA will take off a significant portion of the grade.
Students will also be significantly penalized for late exams. If for some reason you are unable to upload the
completed exam on Canvas by 12:40PM, then immediately email the markdown file to the course TA.
Important: If you have a bug in your code then RMarkdown will not knit. I highly recommend that you
comment out any non-working code. That way your file will knit and you will not be penalized for only
uploading the Rmd file.
Part I - Character data and regular expressions
Consider the following toy dataset strings_data.csv. This dataset has 461 rows (or length 461 using
readLines) and consists of random character strings.
char_data <- readLines("strings_data.csv")
head(char_data,8)
## [1] "\"strings\""
## [2] "\"rmJgFZUGKsBlvmuUOuWnFUyziiyWEEhiRROlJJXRXxOwp\""
## [3] "\"bacUqblSKDopCEAYWdgD\""
## [4] "\"qsPuSJdkmv\""
## [5] "\"RXAnEoHlliMllHMPFTcv\""
## [6] "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\""
## [7] "\"rtoMgy0.36bRrnA9.454goQIJGroup_IMCRp\""
## [8] "\"CqdniznveOdQRhMyctjUEULimqmQjV\""
length(char_data)
## [1] 461
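As the output above shows, readLines returns the header line as the first element and keeps the literal double quotes around every string. The following is a minimal sketch of that behavior, assuming a hypothetical three-line stand-in file (the names tmp, raw_lines and clean are illustrative and not part of the exam code):

```r
# Hypothetical stand-in for strings_data.csv: a header plus two strings.
tmp <- tempfile(fileext = ".csv")
writeLines(c("\"strings\"", "\"qsPuSJdkmv\"", "\"bacUqblSKDopCEAYWdgD\""), tmp)

# readLines keeps the header as element 1 and the quotes as characters.
raw_lines <- readLines(tmp)

# Drop the header and strip the literal double quotes.
clean <- gsub("\"", "", raw_lines[-1], fixed = TRUE)

length(raw_lines)
clean
```

By the same logic, the length of 461 reported for char_data counts the header line as one of its elements.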
Among the 461 cases, several rows contain numeric digits and a specific string of the form “Group_Letter”,
where “Letter” is a single uppercase letter. For example, the 6th element contains the symbols
“0.2”, “9.454”, “Group_I”.
char_data[6]
## [1] "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\""
c("0.2","9.454","Group_I")
## [1] "0.2" "9.454" "Group_I"
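One possible way to pull these three pieces out of the single example string is sketched below with base R regular expressions. The patterns and the names s, nums and grp are assumptions chosen for this one string, not a prescribed solution, and they have not been checked against the full dataset.

```r
# The 6th element as printed above (the quotes are part of the string).
s <- "\"SBolTFf0.2nMoQ9.454lKlgjQZGroup_IOMLFgXj\""

# Numbers: optional minus sign, one or more digits, optional decimal part.
nums <- regmatches(s, gregexpr("-?[0-9]+(\\.[0-9]+)?", s))[[1]]

# Group label: "Group_" followed by a single uppercase letter.
grp <- regmatches(s, regexpr("Group_[A-Z]", s))

c(nums, grp)
```

For this string the result matches the vector shown above. Note that gregexpr returns every match while regexpr returns only the first, which is why the group label uses regexpr.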
Problem 1
Your task is to extract the numeric digits and the group variable from this character string vector. Notes:
1. The first number x is a single digit followed by a period and at least one digit. There are a few cases
where the first number is only a single digit without a period.
2. The second number y is one or two digits followed by a period and at least one digit. Note that the
second number can be negative or positive.