ICT583 Data Science Applications
Exercise 1: Data manipulation
- This exercise must be done individually by each student.
- Write your answers in report format. Clearly label each question and sub-question number, and give your code followed by a screenshot of your results.
- Submit one .R file along with the report so that we can run it to check your work. Make sure the output of your code matches the answers in your report. For example, if your answers show three separate data frames, your code should produce the same three separate data frames.
- Code should be easy to read and understand. Only include code and comments necessary for the exercise.
Each of the following tasks can be performed using a single data verb function.
1. Find the average of one of the variables: summarise()
2. Add a new column that is the ratio between two variables: mutate()
3. Sort the cases in descending order of a variable: arrange() with desc()
4. Create a new data table that includes only those cases that meet a criterion: filter()
5. From a data table with three categorical variables A, B, and C, and a quantitative variable X, produce a data frame that has the same cases but only the variables A and X: select()
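The five verbs above can be sketched on a small made-up data frame. This is only an illustration: the data frame toy, its values, the extra variable Y, and the threshold in filter() are assumptions and are not part of the exercise data.

```r
library(dplyr)

# Hypothetical toy data: three categorical variables (A, B, C) and two
# quantitative variables (X, Y). None of this comes from the exercise data.
toy <- data.frame(
  A = c("a1", "a2", "a1", "a2"),
  B = c("b1", "b1", "b2", "b2"),
  C = c("c1", "c2", "c1", "c2"),
  X = c(10, 20, 30, 40),
  Y = c(2, 4, 5, 8),
  stringsAsFactors = FALSE
)

# 1. Average of one variable: summarise() collapses the table to one row.
avg_x <- toy %>% summarise(mean_X = mean(X))

# 2. New column holding the ratio between two variables.
with_ratio <- toy %>% mutate(ratio = X / Y)

# 3. Sort the cases in descending order of X.
sorted <- toy %>% arrange(desc(X))

# 4. Keep only the cases that meet a criterion (threshold chosen arbitrarily).
big_x <- toy %>% filter(X > 15)

# 5. Same cases, but only the variables A and X.
a_and_x <- toy %>% select(A, X)
```

Each result is stored in its own object, so the script produces one separate data frame per task.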
- Use the nycflights13 package and the flights data frame to answer the following question: What plane (specified by the tailnum variable) traveled the most times from New York City airports in 2013? (20 points)
- Use the nycflights13 package and the weather table to answer the following questions: On how many days was there precipitation in the New York area in 2013? Were there differences in the mean visibility (visib) based on the day of the week and/or month of the year? (20 points)
- Define two new variables in the Teams data frame from the Lahman package: batting average (BA) and slugging percentage (SLG). Batting average is the ratio of hits (H) to at-bats (AB), and slugging percentage is total bases divided by at-bats. To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run. (20 points)
- Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969. (20 points)
- Create a factor called election that divides the yearID into four-year blocks that correspond to U.S. presidential terms. During which term were the most home runs hit? Hint: use the seq() function. (20 points)
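As a rough sketch of the techniques behind the last three tasks, the example below works on a made-up data frame rather than on the real Teams table. It assumes the Lahman column names H, X2B, X3B, HR, and AB; the rows in toy_teams, the helper column TB, and the break points passed to seq() are illustrative assumptions. In particular, how the four-year blocks should line up with presidential terms is left for you to decide and justify.

```r
library(dplyr)

# Hypothetical rows that borrow Lahman's Teams column names:
# H = hits, X2B = doubles, X3B = triples, HR = home runs, AB = at-bats.
toy_teams <- data.frame(
  yearID = c(1997, 2000, 2001, 2004, 2005),
  H   = c(100, 120, 110, 130, 90),
  X2B = c(20, 25, 22, 30, 15),
  X3B = c(2, 3, 1, 4, 2),
  HR  = c(10, 15, 12, 20, 8),
  AB  = c(400, 400, 400, 400, 400)
)

toy_teams <- toy_teams %>%
  mutate(
    BA = H / AB,
    # H already counts every hit once, so a double adds 1 extra base,
    # a triple 2 extra bases, and a home run 3 extra bases.
    TB = H + X2B + 2 * X3B + 3 * HR,
    SLG = TB / AB,
    # cut() with breaks from seq() turns years into a factor of four-year
    # blocks. Intervals are right-closed by default, so 1997 to 2000 share
    # a block. The start year 1996 is an assumption made for this toy data.
    election = cut(yearID, breaks = seq(1996, 2008, by = 4))
  )

# Top rows ranked by slugging percentage (top 2 here; the exercise asks for 15).
top_slg <- toy_teams %>%
  arrange(desc(SLG)) %>%
  head(2)

# Total home runs per four-year block, largest first.
hr_by_block <- toy_teams %>%
  group_by(election) %>%
  summarise(total_HR = sum(HR)) %>%
  arrange(desc(total_HR))
```

To restrict a ranking to a range of seasons, a filter() on yearID can be placed before arrange().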