FIT5197 Assignment 1 (25 marks)

Contents
1 Details
2 Probabilities in Cards (2 marks)
  2.1 A special flush (1 mark)
  2.2 No repeats (1 mark)
3 PDF and Expectations (3 Marks)
  3.1 Plot (1/2 mark)
  3.2 Mean (1/2 mark)
  3.3 Variance (1 mark)
  3.4 Skewness (1 mark)
4 Distributions (2 marks)
  4.1 Model (1 mark)
  4.2 Checking (1 mark)
5 Entropy (3 Marks)
  5.1 Conditional probabilities (1 mark)
  5.2 Entropies (1 mark)
  5.3 Coding (1 mark)
6 Maximum likelihood estimation of parameters (3 marks)
  6.1 Model (1 mark)
  6.2 Maximum likelihood fitting (2 marks)
7 Central Limit Theorem (7 marks)
  7.1 Sampling distribution (2 marks)
  7.2 Simulation (2 marks)
  7.3 Plotting normality (3 marks)
Submission due date: by 11:59pm on Friday 12 April 2019 (end of Week 6)

1 Details

Marks
This assignment contains 6 questions. There are 25 marks in total for the assignment and it counts for 25% of your overall grade for the unit. Of the 25 marks, 3 are awarded for code quality and 2 for presentation of results, for instance use of plots. That leaves 20 marks for individual answers. You must show all working, so we can see how you obtain your answer. Marks are assigned for working as well as the correct answer.

Your solutions
Please put your name or student number on the first page of your solutions. Do not copy questions in your solutions. Only include question numbers. If you use any external resources for developing your results, please remember to provide the link to the source. If an extension has been given, submission after the due date is allowed with no penalty. If no extension has been given, assignments submitted after the due date will be penalised 5% per day, up to a maximum of 10 days late.

Submitting your assignment on Moodle
Please submit your assignment through Moodle by uploading a Word or PDF document as well as the R markdown file you used to generate results. If you choose to knit the Word/PDF document directly from an R markdown file, you will need to type LaTeX equations for Questions 1, 2 and 5. Find more information about using LaTeX in R markdown files here. You may also find the R markdown cheatsheet useful. You can also work with Word and R markdown separately. In this case you will need to type your answers in Word and also copy R code (using the font Courier New), results and figures to the Word document. We will mark your submission mostly using your Word/PDF document. However, you need to make sure your R markdown file is executable in case we need to check your code.

Code quality marks
Your R code will be reviewed for conciseness, efficiency, explainability and quality.
Inline documentation, for instance, should demarcate key sections and explain more obscure operations, but shouldn't be overly verbose. Out of the 25 marks, 3 will be awarded for code quality.

Presentation marks
Your presentation of results using R will be reviewed: how well do you use plots or other means of ordering and conveying results? Out of the 25 marks, 2 will be awarded for presentation using R.

2 Probabilities in Cards (2 marks)

Consider a regular deck of cards with no jokers (13 cards per suit, 4 suits), giving 52 cards. Suppose we draw a 5 card hand, so 5 cards without replacement. For each answer, write out the full calculation in R to show working. Note there are $52!/47!$ different 5 card hands if ordering of the draw is considered, and each is equally likely. If ordering of the draw is ignored, there are $\binom{52}{5}$ different 5 card hands.

2.1 A special flush (1 mark)

What is the probability of getting a royal flush but where the cards ordered by rank have alternating colour? That is, order the cards as 10, J, Q, K, A and then check to see they have alternating colour. Note that a proper royal flush is all of one suit, but we have changed that to alternating colour. So, for example, "red 10, black J, red Q, black K, red A" is OK but "red J, black 10, red Q, black K, red A" is not OK because once reordered by rank the alternating colour no longer holds. Note the order in which the cards are drawn from the pack is not considered.

HINT: This event is defined ignoring the order of the draw, so count out the number of such hands (ignoring the order of the draw), and divide by $\binom{52}{5}$.

2.2 No repeats (1 mark)

What is the probability that in the sequence of cards, as they are drawn, no rank occurs twice in a row? So, ignoring the suit, the following are allowed: A, 10, 4, J, 10 or A, 10, A, 4, A, but the following are not allowed: A, A, 10, 4, A (A repeated in positions 1 and 2), A, 4, 10, 10, J (10 repeated in positions 3 and 4).
HINT: This event is defined using the order of the draw, so count out the number of such hands, and divide by $52!/47!$.

3 PDF and Expectations (3 Marks)

Let X have the PDF given by a function with a different negative and positive part.

f(x) = 12

You can use Wolfram Alpha to do the definite integrals, for instance https://www.wolframalpha.com/input/?i=integral+(1-x)%5E3+from+0+to+1

3.1 Plot (1/2 mark)

Draw the plot in R.

3.2 Mean (1/2 mark)

Find E(X). Why is it not zero?

3.3 Variance (1 mark)

Find the variance, Var(X).

3.4 Skewness (1 mark)

Find the skewness, using the formula in the lecture notes. Interpret the value.

4 Distributions (2 marks)

One study has evaluated a number of leukaemia records in a rural area. The population of the area was 35,000. In a year there were 16 leukaemia cases identified, of which 4 were not local residents but tourists or new immigrants (of which there are not many). In a general population, the annual rate of leukaemia is typically about one in 10,000.

4.1 Model (1 mark)

Describe the model you recommend to use for the counts, and estimate the parameters using suitable point estimates.

4.2 Checking (1 mark)

Also, consider the hypothesis "the annual rate of leukaemia in the area is 1/10,000". Assume this is the rate for the residents only. Plot the distribution over counts under this hypothesis. Where does your data lie, and do you think it is consistent with the hypothesis?

5 Entropy (3 Marks)

In this question, we will use a modified version of the Titanic dataset from the Kaggle competition, Titanic: Machine Learning from Disaster. The dataset includes information about passenger characteristics as well as whether they survived the disaster.
Import the Titanic data using the following R code:

    df <- read.csv("Titanic.csv", header=TRUE, sep=",")

Now Survived is Boolean, so convert it to a truth value with:

    df[['Survived']] <- df[['Survived']]==1

5.1 Conditional probabilities (1 mark)

Compute tables for the frequency estimates of P(Survived), P(Survived|Pclass = val) and P(Survived|Gender = val), for different vals. Do the computation in R. But it's OK to present the final table as a separate Word table (since it might be hard to lay out in R). What does this tell you about survival?

5.2 Entropies (1 mark)

Calculate the entropy (using log2()) of Survived, H(Survived), the conditional entropy of Survived given Pclass, H(Survived|Pclass), and the conditional entropy of Survived given Gender, H(Survived|Gender). Do not use an entropy function but write the code yourself. Use the R functions table() and prop.table() to gather stats and form probabilities from the data frame. What do these three entropies tell you about Survived?

5.3 Coding (1 mark)

Consider the joint space (Survived, Pclass), which has six outcomes: (True, 1), (True, 2), (True, 3), (False, 1), (False, 2), (False, 3). Develop an efficient binary prefix code to transmit these outcomes. Would it be adequate to just provide the codelengths, or is a code needed too? Justify your answer.

6 Maximum likelihood estimation of parameters (3 marks)

One of the central problems of sensory neuroscience is to separate the recordings of background physiological processes that are irrelevant (noise) from neural responses that are of experimental interest (signal). This is by no means an easy task, as the signals that neurons produce when they fire are extremely weak and more random. It is therefore of particular interest to examine the randomness of neural signals, as this allows researchers to study the brain at a cellular level. Let's assume that we have conducted one experiment and recorded the spike signals from one particular neuron for a duration of 15 seconds.
After some data processing, we can obtain spike signals with data given by a time in seconds and a spike size, similar to the following data and in Figure 1.

    n <- 30
    times <- c(0.8670763, 1.2550631, 1.3463051, 2.6999393, 3.5238785,
               4.8215638, 4.8502006, 5.2372364, 5.3201143, 6.2835730,
               7.6961491, 8.0164785, 8.6279902, 9.1390150, 9.5136710,
               9.9207854, 9.9795974, 10.0242579, 10.1622076, 10.5968354,
               11.6766725, 12.3441424, 12.7731282, 12.8911034, 13.0458095,
               13.4280567, 14.2443711, 14.4219672, 14.7461019, 14.7726211)
    spikes <- c(0.220136914, 1.252061356, 0.943525370, 0.907732787, 1.157388806,
                0.342485956, 0.291760012, 0.556866189, 0.738992636, 0.690779640,
                0.425849738, 0.876344116, 1.248761245, 0.697514552, 0.174445203,
                1.376500202, 0.731507303, 0.483036515, 0.650835440, 1.106788259,
                0.587840538, 0.978983532, 1.179754064, 0.941462421, 0.749840071,
                0.005994156, 0.664525928, 0.816033621, 0.483828371, 0.524253461)

6.1 Model (1 mark)

Let us assume that the rate of signals remains constant over time, and the size of each signal is independent of time too. If the rate of the signals remains constant over time, which distribution would be most suitable to model the number of spike signals over 15 seconds? Why? Briefly answer this question in a sentence or two. Also, while we don't know enough to suggest a distribution for spike sizes, what properties should it have?

Figure 1: Spike data.

6.2 Maximum likelihood fitting (2 marks)

Using the model above, what is the log-likelihood function for the number of spike signals over the experiment period, and what is the maximum likelihood estimate for its parameters? You're told that a candidate distribution for spike sizes is the Weibull with shape given by 0.7 and unknown scale, between 0.5 and 2. This is supported in R using the [dpqr]weibull() functions. One can do maximum likelihood fitting using the Weibull density on the unknown parameter.
Use the optimize() function for that, so something like

    optimize(fn, c(minvalue, maxvalue), maximum = TRUE, tol = .Machine$double.eps^0.25)

7 Central Limit Theorem (7 marks)

Assume that we draw random integers from a Poisson distribution with rate one of $\lambda_1 = 1$, $\lambda_2 = 5$, or $\lambda_3 = 20$.

7.1 Sampling distribution (2 marks)

According to the Central Limit Theorem, what is the limiting distribution for the sample mean, for the three rates $\lambda_1$, $\lambda_2$, $\lambda_3$, when we have sample sizes of 10, 100, 1000 and 10000? Give the theory, then compute the parameter values in R.

Bonus question for HD students giving 1 bonus mark (added to the final mark for Assignment 1 only if the final mark is 24 or less): what is the limiting distribution for the sample variance? This is not really a solvable problem, so approximate it for just $\lambda_3 = 20$.

7.2 Simulation (2 marks)

Experimentally justify the result in the CLT that says the sample mean has a mean given by the population mean and a variance given by the population variance divided by the sample size. See the CLT Theorem in Lecture 4. Use simulation with sample sizes of 10, 100 and 1000. For each sample size, use 50000 simulations to compute samples and their means. From these means, compute the mean and variance of the sample means, and discuss how the results reflect the CLT. Plot the results (3 sample sizes and 3 rates with mean and SD) to demonstrate any effects you want to discuss.

7.3 Plotting normality (3 marks)

When the rate is $\lambda_1 = 1$ or $\lambda_2 = 5$ and the sample size is 10 or 100, obtain the z scores of the sample means (from 50000 simulations). Plot their distributions in a histogram with the theoretical Gaussian curve overlaid. Note that for sample size 100, the plots overlay very nicely. But what happens with sample size 10? Explain the differences between the four plots. For each simulation, the z score of the mean can be calculated as

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

where $\bar{X}$ is the mean of the sample, $\mu$ is the population mean, $\sigma$ is the population SD and $n$ is the sample size.
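
As a rough sketch of how the z score of the sample mean specialises to the Poisson setting of this question, the block below assumes only the standard facts that a Poisson($\lambda$) variable has mean $\lambda$ and variance $\lambda$. Here $n$ denotes the sample size and $\bar{X}$ the sample mean. It is meant as an illustration of the notation, not as a model answer.

```latex
\begin{align*}
\mu &= \lambda, \qquad \sigma^2 = \lambda, \\
\bar{X} &\approx \mathcal{N}\!\left(\lambda,\ \frac{\lambda}{n}\right)
  \quad \text{for large } n, \\
z &= \frac{\bar{X} - \lambda}{\sqrt{\lambda / n}} .
\end{align*}
```

For instance, with $\lambda = 5$ and $n = 100$ the standard deviation of the sample mean would be $\sqrt{5/100} \approx 0.224$.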