#### MAT10251 STATISTICAL ANALYSIS PROJECT

# MAT10251 STATISTICAL ANALYSIS

This project leads you through a statistical analysis of residential property data from a given non-capital city or town in Australia.

The data for this project was obtained from http://www.realestate.com.au/buy during January 2020.

Part A covers parts of Topics 1 and 2, Part B parts of Topics 5 to 9.

**You will need to work on this project throughout Session 1.**

**Project Data**

The data for this project can be accessed from the MySCU site for MAT10251 in Task 2 – Project in Assessment Tasks and Submission under **ASSESSMENT**.

The data set provided contains 10 randomly chosen samples of size 100.

**To obtain your data**

(1) Click on the Project Data file. This will download an Excel file.

(2) Select the 4 columns (**Price $000** to **Type**) of data for the sample specified by the last digit of your student ID number.

(3) Copy this into a new Excel file.

There are 10 sample data sets, each of four (4) columns (**Price $000** to **Type**).

Your sample number matches the last digit of your SCU student ID number. For example, if your student ID number ends in 2, your sample is Sample 2 and you will be analysing residential property data from Gold Coast, Queensland, in columns K to N (cells K2:N102).

**Project Situation**

Your statistical analysis of residential property data is to enable you to answer questions from a relative who is seeking to buy a property in the particular town or city of your sample and has asked you for information and advice.

In each part of the project you are required to analyse your sample data in response to given questions and provide a written answer. You can assume that each written answer is a part of a letter or email to your relative.

**Project Preparation**

You are expected to use Excel when completing the project.

Your written answers presenting findings and conclusions should be considered as a part of a letter or email to your relative. Each written answer should be a Word document into which your Excel output has been copied.

In addition, your statistical workings for Part B should appear as appendices to your written answer. This should include all necessary steps and appropriate Excel output.

Each part of the project should be submitted as a **SINGLE** Word document, with appropriate Excel output added.

**Note:**

- You should not need to read beyond the study guide and textbook to complete the project.

**Project Submission**

- Each part of the project should be submitted as a **SINGLE** Word file with Excel output.
- The given **cover sheets** should be the first pages of your submitted project and are not part of the page limit.
- **DO NOT** submit your appendices, which are not part of the page or word limit, for Part B as a separate file.
- Ensure that the page setup of your submitted document is A4 Portrait, with an appropriate format so that it is easily readable if printed.
- Use line spacing of at least 1.5.
- Please name your file

“Family Name_First Name_Part_A/B_Campus”

For example: Jayne_Nicola_Part_A_Lismore

**Penalties For**

**Incorrect Sample**

- If you use a sample that does not correspond to the last digit of your student ID number, to be entered on the cover sheet, a maximum of two marks may be deducted, as this causes the marker extra work and frustration.

**Incorrect Format**

- If the page setup of your submitted Word file is not as required (that is, A4 Portrait, with at least 1.5 line spacing and an appropriate format so that it is easily readable if printed), or your project is not submitted as a single Word document, a maximum of two marks may be deducted, as this causes the marker extra work and frustration.
- If your submitted file is not a Word file, for example it is a pdf or a zip file, a maximum of two marks may be deducted, as this causes the marker extra work and frustration.
- In addition, if your file is not named as requested, or the required cover sheets are not included or not correctly completed, a maximum of two marks may also be deducted, as this can cause the marker extra work and frustration.

**MAT10251 STATISTICAL ANALYSIS**

**PROJECT – PART A**

**Due:** Week 4 Tuesday 24 March 2020

If you are a late enrolment in MAT10251, email Nicola Jayne (nicola.jayne@scu.edu.au) with the date you enrolled in MAT10251 for a revised due date.

**Value:** 10%

**Objectives:** 1 to 5

**Topics:** 1 and 2

**Purpose:** To

- introduce you to the project data, situation and Excel
- use Excel to graph data and calculate statistics
- interpret and communicate Excel results

**Part A Preliminary Analysis of Sample Data**

Your relative is interested in buying a property in the town or city specified by your sample and asks you to obtain information on the property prices in this location.

Your relative is considering purchasing a two or three bedroom property, as they are either downsizing since their children have left home or they are new home buyers. Therefore, they are interested in the typical price of two and three bedroom properties. They are also interested in the difference in price between units and houses and the relationship between number of bedrooms and price.

**Tasks – Part A Submission**

**Complete the following**

**1)** Download and save your data.

**2)** Download the Project Part A cover sheets, name and save this file as

“Family Name_First Name_Part_A_Campus”

**3)** Enter your Sample Number on page 2 of the Part A coversheets.

**4) Statistical Output:** For your sample perform the following tasks:

**a) Price of two and three bedroom properties**

Use **Price $000** (1st column of data) for **two and three bedroom** residential properties for sale to explore the typical price of a two or three bedroom residential property, by using Excel to:

- Construct a frequency histogram or polygon for the price of two and three bedroom residential properties.
- Calculate descriptive statistics for the price of two and three bedroom residential properties.

**Notes:**

- The required data for two and three bedroom residential properties are in the first rows of your sample.
- Analyse the price of two and three bedroom properties together. Do **NOT** separate into two or three bedrooms or into units or houses.
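Excel remains the required tool for this task, but the calculations behind the histogram and the descriptive statistics can be sketched in a few lines of Python. The sketch below is illustrative only: the prices are hypothetical values (not taken from the project data), and the class start, width and count passed to `class_frequencies` are assumptions you would choose to suit your own sample.

```python
import statistics

# Hypothetical prices in $000, for illustration only (not project data).
prices = [285, 310, 299, 420, 355, 330, 275, 390, 345, 365]


def describe(values):
    """Return sample (not population) descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1 divisor)
        "min": min(values),
        "max": max(values),
    }


def class_frequencies(values, start, width, n_classes):
    """Count values in adjacent classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts


summary = describe(prices)
# Assumed classes: 250 to <300, 300 to <350, 350 to <400, 400 to <450.
frequencies = class_frequencies(prices, start=250, width=50, n_classes=4)
print(summary)
print(frequencies)
```

Because the classes are adjacent, a histogram drawn from these frequencies has no gaps between classes of non-zero frequency, and the frequencies sum to the sample size, which is a quick check that no property has been left out.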

**b) Difference between unit and house prices**

Use **Price $000** (1st column of data) and **Type** (4th column of data) for **all 100** residential properties for sale to explore the difference between unit and house prices, by using Excel to:

- Construct separate boxplots, on the same plot or separately, for house prices and for unit prices.

**Hint:** Sort data on **Type** to obtain two samples: one for house prices and the other for unit prices.
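A boxplot is drawn from a five-number summary (minimum, first quartile, median, third quartile, maximum). As a rough illustration of what the hint asks for, the sketch below splits hypothetical rows by type and computes that summary for each group. The data are invented for illustration, and quartile conventions differ between tools, so the quartiles may not match Excel or PhStat output exactly.

```python
import statistics

# Hypothetical (price in $000, type) rows, for illustration only.
rows = [
    (450, "House"), (520, "House"), (610, "House"), (480, "House"), (700, "House"),
    (300, "Unit"), (340, "Unit"), (280, "Unit"), (390, "Unit"), (320, "Unit"),
]


def split_by_type(data):
    """Group prices by property type (the equivalent of sorting on Type)."""
    groups = {}
    for price, kind in data:
        groups.setdefault(kind, []).append(price)
    return groups


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


summaries = {kind: five_number_summary(prices)
             for kind, prices in split_by_type(rows).items()}
for kind, summary in summaries.items():
    print(kind, summary)
```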

**c) Relationship between number of bedrooms and price**

Explore the relationship between number of bedrooms and price using **Number of Bedrooms** (2nd column of data) as the independent variable and **Price $000** (1st column of data) as the dependent variable for **all 100** properties by using Excel to:

- Construct a scatter plot for number of bedrooms and price
- Calculate the correlation coefficient for number of bedrooms and price.
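For reference, the correlation coefficient that Excel reports can be reproduced from its definition. The sketch below uses hypothetical bedroom counts and prices (not project data), with number of bedrooms as the independent variable and price as the dependent variable.

```python
import math

# Hypothetical data, for illustration only.
bedrooms = [1, 2, 2, 3, 3, 4, 4, 5]                  # independent variable (x)
prices = [250, 310, 295, 380, 410, 450, 520, 600]    # dependent variable (y), $000


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(bedrooms, prices)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship, the sign gives its direction, and the scatter plot is still needed to judge its shape.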

**5) Written Answer – Email or letter**

Using the instructions given on pages 4 and 5 of the Part A coversheets, introduce your data and the results of your preliminary investigation of residential property prices.

This should be three to five pages and 400 to 800 words.

Use an appropriate style, without statistical jargon and equations, to clearly communicate your results.

**6)** Complete Coversheets 1 and 2, save and submit Part A of the project online using the Project Part A link in Submit Project by the due date, Tuesday 24 March 2020.

**Marking Criteria – Part A**

**Read the marking criteria carefully and consider them when preparing your Part A Submission.**

See the marking and feedback sheet, page 3 Part A coversheets, for allocation of marks.

**Statistical Calculations**

- To obtain full marks your **graphs** and **plots** must be correct, including correct labels on both axes and a title.
- Marks will be deducted if:
  - Graph or plot is incorrect. Examples:
    - Gaps between classes of non-zero frequency in a histogram for continuous data.
    - Incorrect independent and dependent variables in a scatter plot.
    - Line graph instead of a histogram.
  - Excel, PhStat, Excel Workbooks, or similar, is not used.
  - Axes incorrectly or not labelled.
  - No title.
  - For a histogram or frequency polygon inappropriate classes are used.
  - Scale on axes distorts graphs.
- To obtain full marks for **descriptive statistics** copy the output table of the Descriptive Statistics command in Data Analysis or the Descriptive Summary and/or Boxplot command in PhStat or Descriptive workbook. You may delete unnecessary statistics in these tables.
- Marks will be deducted if any descriptive statistics are incorrect, so check:
  - Your sample size.
  - Whether you are calculating sample statistics or population parameters.

**Written Answer – Email or letter**

- 400 to 800 words and three to five pages – marks will be deducted if this is greatly exceeded.
- To obtain full marks your written answer must:
  - Be well structured.
  - Clearly communicate the results of the Excel output in language appropriate for your audience.
  - Include appropriate graphs and plots with appropriate statistics.
  - Provide information on the average/typical price of a two or three bedroom residential property, how the prices of two and three bedroom residential properties vary and any pattern in those prices.
  - Provide information on the difference in price of units and houses.
  - Provide information on the relationship between number of bedrooms and the price of residential properties. Comment on the strength, shape and sign of the relationship.
- Marks will be deducted if:
  - There is little or no comment on, or interpretation of, the Excel output.
  - Unnecessary statistical jargon and equations appear.
  - It is confusing or not readable.
  - It is handwritten.
- For each major spelling and/or grammatical error half a mark will be deducted, up to a maximum of two marks.
- Also up to two marks may be deducted for poor structure and/or presentation.

**MAT10251 STATISTICAL ANALYSIS**

**PROJECT – PART B**

**Due:** Week 11 Sunday 17 May 2020

**Value:** 25%

**Objectives:** 1 to 5

**Topics:** 5 to 9

**Purpose:** To apply your knowledge of statistical inference and regression to answer questions about residential property prices by analysing the data and communicating the results.

**Part B Further Analysis of Data – Using Statistical Inference and Regression Analysis**

In response to your letter or email in Part A, your relative asks for further information and clarification. You use the graphs and statistics obtained in Part A and techniques from statistical inference and regression and correlation to provide this information.

**Part B Submission**

You should submit a single Word document consisting of:

- Part B coversheets.
- Written answer either as a letter or an email or emails. See instructions on page 4 of Part B coversheets.
- Appendices for Part B which contain full statistical working for the required statistical tasks. These should follow the format given on pages 5 of Part B coversheets.

**Part B Preparation**

Graphs and statistics from Part A are required in the statistical and written answers in Part B. Therefore, check these and make any required corrections.

While the submission date for Part B is Sunday 17 May 2020, you should be working on Part B during Weeks 6 to 11.

The following timetable is recommended:

- Question 1, covering Topic 5, should be completed in Week 6
- Question 2, covering Topic 6, should be completed in Week 8
- Question 3, covering Topic 7, should be attempted in Week 9
- Question 4, covering Topic 8, should be attempted in Week 10
- Question 5, covering Topic 9, should be attempted in Week 11

**Task 1 Part B – Appendices: Statistical Inference and Regression and Correlation Tasks (38 marks)**

The following statistical tasks should appear as appendices to your written answers. These should include all necessary steps and appropriate Excel output.

These appendices should come after your written answer within your single Word document for Part B.

**Statistical Inference**

Choose a level of significance for any hypothesis tests and a level of confidence for any confidence intervals. Enter these values on page 2 of the Part B coversheets along with the sample number from Part A.

For your sample, answer the following questions using appropriate statistical inference and regression techniques.

**Question 1 – Topic 5 (5.5 marks)**

Your relative is considering buying a unit, since your previous research showed that units appear to be cheaper than houses. However, your relative is concerned that if they only consider units their choice will be limited.

To explore whether their choice will be limited if they restrict their search to units, use **Type** (4th column of your data) for **all 100** residential properties for sale and an appropriate statistical inference technique to:

- Estimate the population proportion of residential properties for sale, in the location and state specified by your sample, which are units.

**Hint:** Sort data on **Type** to enable you to easily count the number of properties in your sample which are units.
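One common way to estimate a population proportion from a sample is a confidence interval based on the normal approximation. The sketch below illustrates that calculation only. The count of 38 units out of 100 is hypothetical, the 95% level of confidence is an assumption (you choose your own level), and you should confirm from your study guide that this is the technique expected for Topic 5.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n                               # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


# Hypothetical count, for illustration only: 38 of 100 sampled properties are units.
p_hat, lower, upper = proportion_confidence_interval(38, 100, confidence=0.95)
print(round(p_hat, 3), round(lower, 3), round(upper, 3))
```

In the written answer the interval would be reported in plain language, for example as a range of percentages of properties for sale that are likely to be units.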

**Question 2 – Topic 6 (7.5 marks)**

Your relative has a maximum of $330,000 to purchase a residential property. If the average price of two or three bedroom residential properties is more than this your relative will consider the location to be too expensive.

Explore whether your relative will find the location specified by your sample too expensive by using **Price $000** (1st column of data) for **two and three bedroom** residential properties for sale, your output from Part A, and an appropriate statistical inference technique to answer the following question:

- In the location specified by your sample, is the mean two and three bedroom residential property price more than $330,000?

**Notes:**

- The required data for two and three bedroom residential properties are in the first rows of your sample.
- If you have sorted your data on **Type** in Question 1, download your data again.
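A question of the form "is the mean more than $330,000?" is often framed as a one-sided hypothesis test, with a null hypothesis that the mean is at most 330 ($000) and an alternative that it is greater. The sketch below shows how the test statistic would be computed under that framing. The prices are hypothetical, the choice of a t test is an assumption you would need to justify from your Part A output, and the critical value or p-value still has to be obtained from t tables or Excel.

```python
import math
import statistics


def one_sample_t(values, hypothesised_mean):
    """Return the t statistic and degrees of freedom for a one-sample t test."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)           # sample standard deviation
    standard_error = sd / math.sqrt(n)
    t = (mean - hypothesised_mean) / standard_error
    return t, n - 1


# Hypothetical prices in $000, for illustration only (not project data).
prices = [310, 345, 360, 325, 390, 335, 370, 355, 340, 380]
t_stat, df = one_sample_t(prices, hypothesised_mean=330)
print(round(t_stat, 3), df)
```

The null hypothesis would be rejected only if the t statistic exceeds the upper-tail critical value for the chosen level of significance and these degrees of freedom.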

**Question 3 – Topic 7 (6 marks)**

From your previous research you have shown that units appear to be cheaper than houses. Your relative asks you to estimate how much they would save if they purchased a unit instead of a house.

To provide a justified answer use **Price $000** (1st column of data) and **Type** (4th column of data) for **all 100** properties for sale, your output from Part A and an appropriate statistical inference technique to answer the following question.

- Estimate the mean difference in price between units and houses for sale in the location specified by your sample.

**Hint:** Sort data on **Type** to obtain two samples: one for house prices and the other for unit prices.
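As an illustration of how a difference between two means can be estimated from the two samples in the hint, the sketch below computes the point estimate (house mean minus unit mean) and an interval around it. It is a sketch under stated assumptions: the prices are hypothetical, the standard error uses the unequal-variances form (whether pooled or unequal variances is appropriate is for you to decide and justify), and `t_critical` is a placeholder that must be replaced by the critical value for your own degrees of freedom and level of confidence.

```python
import math
import statistics


def mean_difference_interval(sample_a, sample_b, t_critical):
    """Interval for mean(sample_a) - mean(sample_b), unequal-variances standard error."""
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    standard_error = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    margin = t_critical * standard_error
    return diff, diff - margin, diff + margin


# Hypothetical prices in $000, for illustration only (not project data).
house_prices = [450, 520, 610, 480, 700]
unit_prices = [300, 340, 280, 390, 320]

# Placeholder critical value: look up the real one for your data and confidence level.
diff, lower, upper = mean_difference_interval(house_prices, unit_prices, t_critical=2.0)
print(diff, round(lower, 1), round(upper, 1))
```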

**Questions 4 and 5 Simple and Multiple Linear Regression (19 marks)**

Your relative asks what factors influence the price of a residential property and if you can estimate the price of a residential property from these factors.

To answer this you develop a simple linear regression model to estimate price from number of bedrooms and a multiple linear regression model to estimate price from number of bedrooms, number of bathrooms and type. Then you choose and interpret the linear model that best fits your data.

**Question 4 Simple Linear Regression Model Topic 8**

Use **Number of Bedrooms** (2nd column of data) as the independent variable and **Price $000** (1st column of data) as the dependent variable for **all 100** properties and your output from Part A to develop and then explore a simple linear relationship between the two variables by:

- Calculating the least squares regression line, correlation coefficient and coefficient of determination.
- Interpreting the gradient and vertical intercept of the simple linear regression equation.
- Interpreting the coefficient of determination.
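The three quantities listed above all come from the same sums of squares. The sketch below computes the least squares gradient, vertical intercept and coefficient of determination for hypothetical data (not project data), so that the Excel Regression output can be related back to the underlying formulas.

```python
# Hypothetical data, for illustration only.
bedrooms = [1, 2, 3, 4, 5]             # independent variable (x)
prices = [220, 300, 350, 450, 480]     # dependent variable (y), $000


def least_squares_line(x, y):
    """Return (gradient, intercept, r_squared) of the least squares line y = a + b x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # square of the correlation coefficient
    return gradient, intercept, r_squared


gradient, intercept, r_squared = least_squares_line(bedrooms, prices)
print(gradient, intercept, round(r_squared, 4))
```

For these invented numbers the gradient of 67 would be read as an estimated increase of about $67,000 in price for each additional bedroom, the intercept is the estimated price at zero bedrooms (which may have no practical meaning), and the coefficient of determination is the proportion of the variation in price explained by number of bedrooms.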

**Question 5 Multiple Linear Regression Model Topic 9**

To explore whether being a house or unit and number of bathrooms also influence price, add **Number of Bathrooms** (3rd column of data) and **Type** (4th column of data) as additional independent variables to the simple linear regression model in Question 4. Then develop and explore the relationship between price and the three independent variables by:

- Calculating the multiple regression equation and coefficient of determination.
- Interpreting the values of the multiple regression coefficients.
- Interpreting the value of the coefficient of determination. Compare the value with the corresponding value for the simple linear regression model.

Then determine the best model to estimate price by:

- Using appropriate tests to determine which independent variables make a significant contribution to the regression model.
- Then state or calculate the simple or multiple regression equation which best fits the data.
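Because **Type** is a category rather than a number, it has to be converted to a numeric variable before it can enter a regression. One common approach, assumed here, is a 0/1 dummy variable (1 for a house, 0 for a unit). The sketch below applies that coding and fits the multiple regression by solving the normal equations in plain Python. All data are hypothetical and were constructed to follow an exact linear rule, so the fit recovers the rule and the coefficient of determination is 1, which real data will not do. The significance tests on individual coefficients are not shown and would come from the Excel Regression output.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_multiple_regression(predictors, y):
    """Return ([intercept, b1, b2, ...], r_squared) from the normal equations."""
    design = [[1.0] + list(row) for row in predictors]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coeffs = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coeffs, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coeffs, 1 - ss_res / ss_tot


# Hypothetical rows (bedrooms, bathrooms, type), for illustration only.
properties = [(2, 1, "Unit"), (3, 1, "House"), (3, 2, "House"), (4, 2, "House"),
              (2, 2, "Unit"), (1, 1, "Unit"), (4, 3, "House"), (3, 2, "Unit")]
# Prices ($000) built from: 100 + 60*bedrooms + 40*bathrooms + 50*is_house.
prices = [260, 370, 410, 470, 300, 200, 510, 360]

# Dummy coding of Type: 1 for a house, 0 for a unit.
predictors = [(beds, baths, 1 if kind == "House" else 0)
              for beds, baths, kind in properties]
coefficients, r_squared = fit_multiple_regression(predictors, prices)
print([round(c, 3) for c in coefficients], round(r_squared, 4))
```

With this coding, the coefficient on the dummy variable is the estimated price difference between a house and a unit with the same number of bedrooms and bathrooms.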

**Notes:**

- You may need to transform or manipulate the given data, before using Excel for the corresponding statistical calculations.
- Use Excel for all statistical calculations. You do not need to repeat any Excel calculations by hand. However, make sure that you define your random variables and include any steps not given by Excel. For example, in a hypothesis test include the null and alternative hypotheses, along with the decision to reject or not reject the null hypothesis.
- Mention any assumptions you need to make and, where appropriate, justify these from Part A output.
- In Question 4 fit a linear model even if from your scatter plot you decide that a non-linear relationship better fits the data or that no apparent relationship exists. However, mention this in your written answer and/or corresponding appendix.
- Comment on why a test or confidence interval has been chosen. Where appropriate include and refer to Part A output.
- Make sure you interpret confidence intervals and write conclusions to hypothesis tests.

**Task 2 – Written Answer – Email or letter (12 marks)**

For Questions 1, 2 and 3, and for Questions 4 and 5 combined, present the results of your calculations, with your interpretation and conclusions, as either a letter or email/emails to your relative.

Use the instructions given on page 4 of the Part B coversheets.

This should be 400 to 900 words and two to five pages.

It should be submitted as a Word file with Excel output included.

Make sure you:

- Introduce each question and put it in context
- Answer each question in non-statistical language.
- Present the result of your calculations and tests without unnecessary statistical jargon
- Include a conclusion which answers the given question.

In particular, for Questions 4 and 5

- Include and justify the best model.
- Discuss and interpret the values of the regression coefficients and coefficient of determination of the best model.

**Marking Criteria – Part B**

**Read these marking criteria carefully and consider them when preparing Part B.**

See the marking and feedback sheet, page 3 Part B coversheets, for allocation of marks.

**Statistical Calculations**

- For **statistical inference calculations** (Questions 1, 2, 3 and 5) marks will be given for:
  - Choice of appropriate statistical technique/s.
  - Random variable/s defined.
  - Correct hypotheses for tests.
  - Correct Excel output.
  - Correct interpretation of results.
- For **regression coefficients and coefficient of determination** (Questions 4 and 5) use either:
  - The Regression command in Data Analysis and copy resultant tables.
  - Or the simple/multiple regression command in PhStat and copy the resultant tables.
  - Or the Simple Linear and Multiple Regression workbooks and copy the resultant tables.
- For **regression coefficients and coefficient of determination** (Questions 4 and 5) marks will be deducted if Excel is not used and also for incorrect equations or coefficients, so check:
  - Your independent and dependent variables.
  - Your sample size.

**Written Answer – Email/Emails or Letter**

- 400 to 900 words and two to five pages – marks will be deducted if this is greatly exceeded.
- To obtain full marks your written answer must:
  - Be well structured and analysed.
  - Clearly communicate the results of the Excel output in language appropriate for your audience.
  - Include an introduction to each question and a conclusion.
  - Include appropriate Excel output.
  - Answer the questions in non-statistical language.
- Marks will be deducted if:
  - There is little or no comment on, or interpretation of, the Excel output.
  - Unnecessary statistical jargon and equations appear.
  - It is confusing or not readable.
- For each major spelling and/or grammatical error half a mark will be deducted, up to a maximum of two marks.
- Also up to two marks may be deducted for poor structure and presentation.

- For Questions 1 to 3 the following rubric will be used. Marks for Questions 4 and 5 combined are shown in parentheses.

| Level | Mark | Description |
| --- | --- | --- |
| Poor | 0 (0) | Question not introduced and/or results not presented. Confused response. Incorrect and/or inconsistent comments and conclusions. Unnecessary statistical jargon, especially symbols, equations and definitions (copied from the textbook). Question unanswered. |
| Acceptable | 1 (2) | Question introduced and results presented. Minimal interpretation and/or conclusions on how to use the information and/or only minimally relates information obtained to residential property prices. Only minor errors and inconsistencies in comments and conclusions. Question answered. |
| More than acceptable | 2 (4) | Results presented and questions introduced and answered, clearly and concisely. Includes interpretation and/or conclusions on how to use the information and/or relates information obtained to residential property prices. No errors or inconsistencies in comments and conclusions. Questions answered and justified. |