Date: 2023-07-18 09:10

MTHM506/COMM511 - Statistical Data Modelling

Referral Assessment

Deadline: 12:00pm (noon) 31st July 2023

This assignment consists of two sections and you should attempt both sections. Both sections are worth 50 marks

each and will be combined to give a total of 100 marks for this assignment. Please submit one pdf via eBART

containing your solutions - it should be written up using word processing software (e.g. LaTeX, R Markdown, or

Word).

You are expected to work independently - strict disciplinary action will be taken for any plagiarism. Late submissions

will also be penalised. The data required for this assignment, datasets_refdef.RData, can be loaded into R using the

load() function.
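As a minimal sketch of how load() behaves, the example below writes a toy .RData file to a temporary directory and reads it back. The object toy and the file name toy_example.RData are made up for illustration and are not part of the assignment data; the same call pattern would apply to datasets_refdef.RData, assuming it sits in the working directory.

```r
# Self-contained sketch: write a toy .RData file, then load it back.
# The toy object and the temporary path are illustrative assumptions,
# not the assignment data.
toy <- data.frame(x = 1:3, y = c(2, 4, 6))
path <- file.path(tempdir(), "toy_example.RData")
save(toy, file = path)

# Loading into a fresh environment avoids overwriting workspace objects.
# load() returns (invisibly) the names of the objects it restored.
env <- new.env()
loaded_names <- load(path, envir = env)
print(loaded_names)
str(env$toy)
```

Inspecting the returned names is a quick way to see which data frames a file contains before using them.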

Section A - Exercises

In this section, you are required to answer a series of exercises based on the module. Note that the questions are

organised in the order we covered the topics, and not in order of difficulty. Therefore it is advised that you read

through the questions first, and start working on those that you feel more comfortable with. Solutions are expected

to be concise, well structured and well presented. Commented R code (e.g. ‘model <- glm(...)’) and the outcomes/plots

should form part of your solutions. Do not display too much raw R output (e.g. don’t display the full output of

‘summary(model)’), but edit this down to the essentials. Be sure to include justification for each step of your analyses,

providing comments alongside your R code to explain what you are doing and add appropriate titles and labelled axes

to your plots. Handwritten solutions will be accepted where mathematical descriptions are required, but a

professional word-processed submission is preferred.

Question 1

The data frame dengue contains a response variable y, the count of weekly dengue fever cases in Rio de

Janeiro. The time series starts in the 36th week of 2012 and runs to the 15th week of 2013:


Suppose for these data we wish to consider the model:

Yi ~ NegBin(μi,θ) Yi independent

log(μi) = β0 + β1xi

where xi is time (in weeks). The goal is to capture the temporal structure of the disease outbreak in 2013. The

Negative Binomial distribution with mean μ and dispersion parameter θ (note the R functions dnbinom, qnbinom

etc. call θ the size) has pmf:

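$$p(y_i;\mu_i,\theta) = \binom{y_i+\theta-1}{y_i}\left(\frac{\theta}{\theta+\mu_i}\right)^{\theta}\left(\frac{\mu_i}{\theta+\mu_i}\right)^{y_i}$$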
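The link between this parameterisation and R's built-in functions can be checked numerically. The sketch below codes the pmf with the binomial coefficient written through gamma functions, so that θ need not be an integer, and compares it with dnbinom() called with size = θ and mu = μ. The helper name nb_pmf and the values of mu and theta are arbitrary choices for illustration only.

```r
# Negative Binomial pmf in the mean/dispersion form, computed on the log scale:
#   Gamma(y + theta) / (Gamma(theta) * y!) *
#   (theta / (theta + mu))^theta * (mu / (theta + mu))^y
# nb_pmf is a hypothetical helper name used only in this sketch.
nb_pmf <- function(y, mu, theta) {
  log_p <- lgamma(y + theta) - lgamma(theta) - lgamma(y + 1) +
    theta * log(theta / (theta + mu)) +
    y * log(mu / (theta + mu))
  exp(log_p)
}

y <- 0:10
mu <- 3.5      # arbitrary illustrative mean
theta <- 2.2   # arbitrary illustrative dispersion ("size" in R)

# Compare with R's built-in density under the (mu, size) parameterisation.
max_abs_diff <- max(abs(nb_pmf(y, mu, theta) - dnbinom(y, size = theta, mu = mu)))
print(max_abs_diff)
```

Working on the log scale with lgamma() avoids overflow for large counts, which matters when the counts run into the thousands.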

(a) [2 marks] Write down the log-likelihood ℓ(β0,β1,θ;y,x).

(b) [1 mark] Write an R function mylike() which evaluates the negative log-likelihood (i.e. −ℓ(β0,β1,θ;y,x)) for any

values of the three parameters.

(c) [4 marks] Use the R function nlm() in association with your function mylike() to numerically minimise the negative

log-likelihood and report the maximum likelihood estimates for the model parameters. Provide some evidence of

how you chose sensible starting values.

[Figure: plot of the weekly dengue case counts against Time (x-axis 0 to 30, y-axis 0 to 4000).]

(d) [3 marks] Estimate the standard errors and construct 90% confidence intervals for β0 and β1.

(e) [2 marks] Test the hypothesis that β1 = 0 at the 5% significance level (not using a confidence interval) and

compute the associated p-value of the test.

(f) [3 marks] This model can also be fit within the GLM framework. Fit this model using glm.nb() and compare

(qualitatively) the estimates and the associated standard errors for β0, β1 and θ to those from parts (c) and

(d).

(g) [3 marks] Produce the residual plots and comment appropriately.

(h) [1 mark] Does the model fit well with respect to the saturated model?

(i) [6 marks] Produce a plot of the associated mean relationship, the 95% confidence intervals and the 95%

prediction intervals on a scatter plot of y versus x. Comment on the appropriateness of the model and

suggest a possible way forward for this model.

Question 2

The data frame titanic relates to 1309 passengers on the last voyage of the ocean liner ‘Titanic’. The response variable

survived is a binary variable where the value 1 means the passenger survived the sinking. The data frame also contains

predictors relating to passenger class (1st, 2nd, 3rd), gender, age and the fare amount each passenger paid. Passenger

names are also available (for interest, rather than for modelling).

(a) [6 marks] Fit a Binomial GLM with logit link for survived, with age, pclass and gender as predictors, as well as

all the associated two-way interactions. Reduce the model if and as appropriate using the AIC in conjunction with

the R function drop1(). Make sure to perform relevant model checking.

(b) [6 marks] Interpret the final model in terms of parameter estimates and their significance.

Question 3

The data frame carbonD contains monthly observations of CO2 concentrations from 1959 to 1997, measured at

Mauna Loa (Hawaii). The variables are: co2 (CO2 concentrations in parts per million), month (month of

measurement), year (year of measurement) and timeStep (unique time variable). A scatter plot of co2 against

timeStep suggests CO2 is increasing over time with a seasonal cycle:


(a) [3 marks] Write down (mathematically) a plausible GAM to describe this data set.

(b) [5 marks] Fit the suggested GAM, making sure to perform all relevant model checks.

(c) [2 marks] Plot estimates of any smooth functions in your model and comment appropriately.

(d) [3 marks] Use the model to predict CO2 for the year 1998 and produce a plot of this along with 95%

prediction intervals.

Section B - Project

In this section, you are required to conduct an independent analysis using Generalized Additive Models

(GAMs). You should write a report detailing your analyses and results and presenting a conclusion. Your report is expected

to be concise, well structured and well presented. It should comprise at most two sides of text and should have no

more than six figures and/or three tables. Figures, tables or R code are not included in the page limit. You must use

A4 paper and a font size of at least 11 points, while lines must be single spaced. No credit will be awarded for

additional pages of text. Ensure all figures have appropriate titles, axes and captions. Commented R code (e.g.

‘model <- glm(...)’) and the outcomes/plots should not form part of your report but should be included as

appendices at the end.

There are 50 marks in total, and a brief outline of the marking criteria is given below with approximate marks:

[10 marks] Understanding and exploration of both the problem and the data.

[5 marks] Thoroughness and rigour, e.g. clear mathematical description of models.

[10 marks] Clear exposition of the steps you took in model fitting and exposition of a final model.

[10 marks] Clear presentation and interpretation of results.

[5 marks] Critical review of the analysis.

[10 marks] Clarity and conciseness in writing and tidy presentation of R code and associated plots.

You are required to analyse the daily trends in Nitrogen Dioxide (NO2) from an air pollution monitor in Harlington,

London. The monitor is situated just North of Heathrow Airport and records the daily average

NO2 for twelve years between 1st January 2010 and 31st December 2021 along with some meteorological variables. A

line plot of the daily average NO2 over that period can be seen below.


The data frame no2_data holds this information in the following variables:

1. site: Site name

2. date: Date of measurement (in yyyy-mm-dd format)

[Figure: line plot of the daily average NO2 (y-axis 0 to 150) over time (x-axis 2011 to 2022).]

3. doy: Day of the year (1 - 1st January, 2 - 2nd January, 3 - 3rd January etc.)

4. month: Month of the year (1 - January, 2 - February, 3 - March etc.)

5. year: Year (2010, 2011, 2012, etc.)

6. no2: Daily average NO2 (in micrograms per cubic metre, μg/m³)

7. air_temp: Daily average temperature (in degrees Celsius, °C)

8. ws: Daily average wind speed (in metres per second, m/s)

9. wd: Daily average wind direction (in degrees, 0° - wind blowing from the North, 90° - wind blowing from the

East, 180° - wind blowing from the South, 270° - wind blowing from the West)

The aim is to use this data to build one model (using the GAM framework) and use this to answer the following

questions:

Do any of the meteorological variables (7-9 above) significantly affect the daily average NO2? If so, in what

way?

Is there a within year seasonal trend in NO2 concentrations? If so, when are concentrations typically at their

highest/lowest?

London has implemented a number of measures to reduce air pollution. Has there been a noticeable

downward trend in NO2 concentrations between 2010 and 2021?

The British government implemented a series of lockdowns in 2020 and 2021 as a response to the COVID-19

pandemic. Was there a sudden change in NO2 concentrations in 2020 and 2021 as a result of the pandemic? If

so, update and use your model to estimate the decrease in NO2 concentrations.

The World Health Organization (WHO) publishes guidelines for maximum daily average NO2 concentrations (25 μg/m³)

in order to protect human health. Do the outputs from your model indicate the daily limits

were exceeded? If so, how many times? Is the number of days decreasing each year?

When building a model, make sure to perform all relevant model checks. Note that you can use the predict()

function with type = "terms" to extract the individual predicted smoothed functions in R.
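A minimal sketch of the predict() call with type = "terms", using simulated data rather than no2_data. It assumes the mgcv package is installed and uses made-up variable names (x1, x2, y). The call returns one column per model term, and adding the intercept to the row sums recovers the linear predictor.

```r
library(mgcv)  # assumed to be installed

# Simulated data for illustration only; not the assignment data.
set.seed(1)
n <- 200
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- sin(2 * pi * sim$x1) + 2 * sim$x2^2 + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x1) + s(x2), data = sim, method = "REML")

# One column per model term; the intercept is not included in any column.
terms_hat <- predict(fit, type = "terms")
print(colnames(terms_hat))

# Adding the intercept to the row sums recovers the linear predictor.
eta <- rowSums(terms_hat) + coef(fit)["(Intercept)"]
max_gap <- max(abs(eta - predict(fit, type = "link")))
print(max_gap)
```

Each column can then be plotted against its covariate to inspect the estimated smooth on its own.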

