
Assignment 2: Apache Spark Programming and Optimisation

Group Assignment (15%) 10/05/2023

Introduction

This is the second part of the practical assignment of DATA3404, in which you translate SQL queries and data analysis tasks into Apache Spark Python programs and optimise them in Databricks.

You will find links to online documentation, data, and hints on the tools and schema needed for this assignment in the 'Assignment' section of the Canvas modules.

Data Set Description

This assignment is based on the same Aviation On-time scenario as Assignment 1, which includes data about airports, airlines, aircrafts, and flights. We will provide a slightly cleaner data set this time, but otherwise the schema is the same (the first attribute of each relation is its primary key):

Airports (airport_code, airport_name, city, state, country)

Aircrafts (tail_number, manufacturer, model, aircraft_type, year)

Airlines (carrier_code, name, country)

Flights (flight_id, carrier_code, flight_number, flight_date, origin, destination, tail_number, scheduled_departure_time, scheduled_arrival_time, actual_departure_time, actual_arrival_time, distance)

You will be provided with scaffold Jupyter notebook files for setting up your Databricks workspace, similar to Assignment 1. Note that this scaffold will download and prepare new files which have been pre-processed to have fewer data anomalies. As such, you can apply a more stringent schema.
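
For instance, a stricter schema can be declared explicitly when reading a file. The sketch below uses the Aircrafts relation; the column types and the DBFS path are assumptions to verify against the downloaded data, not part of the provided scaffold.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumed types for the Aircrafts columns; verify against the actual files.
aircrafts_schema = StructType([
    StructField("tail_number",   StringType(),  False),
    StructField("manufacturer",  StringType(),  True),
    StructField("model",         StringType(),  True),
    StructField("aircraft_type", StringType(),  True),
    StructField("year",          IntegerType(), True),
])

# Hypothetical path; use the location prepared by the bootstrap notebook.
aircrafts = spark.read.csv("dbfs:/data/aircrafts.csv",
                           header=True, schema=aircrafts_schema)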

Downloading Datasets

As part of the tutorial in Week 11, we have provided an Assignment 2 Bootstrap notebook to handle downloading the data into DBFS. Note that this dataset has been cleaned, so it differs from the Assignment 1 data.


Questions

There are two group tasks to be solved:

In the first task, you are provided with some SQL code that is similar to an Assignment 1 question. You are asked to rewrite it as Python Spark code and then optimise it.

In the second task, you are provided with a question which you must answer with Python Spark code that you must then also optimise.

Task 1: SQL Code conversion

Rewrite the following SQL query as Python Spark code, using the Python DataFrame API. Then optimise your code using Spark's explain() functionality.

1. Determine the names of the top-5 US airlines with the most Boeing aircrafts.

SELECT A.name, COUNT(B.tail_number)
FROM Airlines A JOIN Flights F USING (carrier_code)
     JOIN Aircrafts B USING (tail_number)
WHERE A.country = 'USA' AND B.manufacturer = 'Boeing'
GROUP BY A.name
ORDER BY COUNT(B.tail_number) DESC
LIMIT 5

Conversion: rewrite as Apache Spark Python code (pySpark)

Optimisation: Optimise this query. Compare the optimisations on small/medium/large datasets, both in terms of the query plan and in terms of runtime (you can use the sparkmeasure package for this, cf. Tutorial Week 11).
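
For orientation, a minimal conversion sketch is given below. It assumes the three tables have been loaded into DataFrames named airlines, flights and aircrafts (hypothetical names; bind them to whatever the bootstrap notebook provides) and shows where explain() fits in; it is a starting point, not the optimised solution.

from pyspark.sql import functions as F

# Direct translation of the SQL above; airlines/flights/aircrafts are
# hypothetical DataFrame names for the three loaded tables.
top5_boeing = (airlines
    .join(flights, "carrier_code")
    .join(aircrafts, "tail_number")
    .filter((F.col("country") == "USA") & (F.col("manufacturer") == "Boeing"))
    .groupBy("name")
    .agg(F.count("tail_number").alias("num_boeing"))
    .orderBy(F.col("num_boeing").desc())
    .limit(5))

top5_boeing.explain(True)         # inspect the logical and physical plans
top5_boeing.show(truncate=False)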

Task 2: PySpark Code to determine "Airports with largest total delay, and their top-5 delayed airlines"

In this second task, write an Apache Spark program that determines, for a given year and country (both user-specified), the top-10 airports with the largest total annual departure delay (in minutes) and lists them in descending order of total delay. For each of those airports, further list the five airlines contributing most to this overall departure delay in the given year, together with the percentage each contributed to this delay. List those airlines as a comma-separated string enclosed in '[' and ']' (indicating a list), with each airline listed with both its name and the percentage of its delay contribution, in descending order of percentage.

The output should have the following tab-delimited (\t) format:

airport name \t total departure delay \t [(airline1, percentage), (airline2, percentage), ...,

(airline5, percentage)]

Deliverable: write as an Apache Spark Python code (pySpark)
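
One possible structure for this program is sketched below. It assumes DataFrames named flights, airports and airlines (hypothetical names), that flight_date is a date column, and that a dep_delay column holding the departure delay in minutes has already been derived from the scheduled and actual departure times; the ranking of airlines within each airport uses a window function.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical parameter values; whether negative (early) departures should be
# clamped to zero before summing is an interpretation choice.
year_param, country_param = 2005, "USA"

# Total departure delay per (airport, airline) for the chosen year and country.
per_airline = (flights
    .filter(F.year("flight_date") == year_param)
    .join(airports.filter(F.col("country") == country_param),
          F.col("origin") == F.col("airport_code"))
    .join(airlines, "carrier_code")
    .groupBy("airport_name", "name")
    .agg(F.sum("dep_delay").alias("airline_delay")))

# Rank airlines within each airport by their share of the airport's total delay.
by_airport = Window.partitionBy("airport_name")
ranked = (per_airline
    .withColumn("airport_delay", F.sum("airline_delay").over(by_airport))
    .withColumn("pct", F.round(100 * F.col("airline_delay") / F.col("airport_delay"), 2))
    .withColumn("rn", F.row_number().over(by_airport.orderBy(F.col("airline_delay").desc())))
    .filter(F.col("rn") <= 5))

# Build the "[(airline, pct), ...]" string and keep the 10 worst airports.
result = (ranked
    .withColumn("entry", F.concat(F.lit("("), F.col("name"), F.lit(", "),
                                  F.col("pct").cast("string"), F.lit(")")))
    .groupBy("airport_name", "airport_delay")
    .agg(F.sort_array(F.collect_list(F.struct("pct", "entry")), asc=False).alias("top5"))
    .withColumn("airlines", F.concat(F.lit("["),
                                     F.concat_ws(", ", F.col("top5.entry")),
                                     F.lit("]")))
    .orderBy(F.col("airport_delay").desc())
    .limit(10)
    .select("airport_name", "airport_delay", "airlines"))

# Emit in the required tab-delimited format.
for row in result.collect():
    print(f"{row['airport_name']}\t{row['airport_delay']}\t{row['airlines']}")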

Optimisation: Optimise your program. Compare the optimisations on small/medium/large datasets

both in terms of the query plan, and in terms of runtime performance (e.g. using StageMetrics from

the sparkmeasure package).
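
For the runtime side of the comparison, a small sparkmeasure sketch could look like the following (it assumes the sparkmeasure library is installed on the cluster and that result is the DataFrame produced by your query):

from sparkmeasure import StageMetrics

# Measure a single, first-time execution of the query.
stage_metrics = StageMetrics(spark)
stage_metrics.begin()
result.show(truncate=False)       # trigger the query you want to measure
stage_metrics.end()
stage_metrics.print_report()      # stage-level metrics: elapsed time, shuffle I/O, etc.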


Optimisation Recommendations and Report contents

You may use the following approaches as guidelines for your analysis and report:

• Analyse the operation of your queries using the techniques covered for SQL and Spark.

• Identify something that can be improved, and change the query to support that and thus to be more efficient.

• Ensure the output is unchanged – and that you anticipate it would be unchanged for any valid inputs, i.e. do not rely on the specific data you have available for testing.

• If you are unable to identify an improvement, you should write something that is equivalent in output but involves a different execution.

• Observe an improvement from the SQL execution plan, the Spark DAG, and the sparkmeasure analysis.

• You may wish to use the larger flights datasets to check this, particularly the large dataset – this will make it clear whether there is a speed improvement. Note that these versions have not been cleaned as the flights small cleaned dataset has. To achieve similar results, dropna should be applied to the dataframe when it is loaded, the column names should be corrected, and inner joins should be used when connecting with the airline/airport/aircraft data (a loading sketch appears after this list).

• In the event that any bugs in the provided queries are discovered, they should be treated as working – that is, your modification should retain any bugged behaviour rather than try to correct it to the stated question.

• The cache clearing in the Spark code should be retained – do not use caching that requires the queries to be run multiple times to achieve efficiency. The efficiency improvement should come from the query being run for the first time. Caching within the query itself is fine.
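
As referenced above, a minimal loading sketch for the larger, uncleaned flights files could look as follows. The DBFS path and the assumption that the file has exactly the twelve Flights columns are illustrative only; adapt them to the bootstrap notebook's layout.

# Hypothetical path; use the location prepared by the Assignment 2 Bootstrap notebook.
flights_large = (spark.read.csv("dbfs:/data/flights_large.csv",
                                header=True, inferSchema=True)
    .toDF("flight_id", "carrier_code", "flight_number", "flight_date", "origin",
          "destination", "tail_number", "scheduled_departure_time",
          "scheduled_arrival_time", "actual_departure_time",
          "actual_arrival_time", "distance")    # correct the column names
    .dropna())                                  # drop rows with missing values

# Inner joins (the default) when connecting to the reference data exclude flights
# that have no matching airline/airport/aircraft row, mirroring the cleaned data.
flights_with_airlines = flights_large.join(airlines, "carrier_code", "inner")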

Note that each task should be approached by the group as a whole. Each task answer should have

the authors who contributed to that version indicated in a comment.

Note also that there will be quite a lot that you might identify and change – you do not need

to optimise these queries to be perfectly efficient! All you need to do is to identify one aspect of

inefficiency in each question and correct it. However, there are a couple of things that are the same

for both tasks, particularly with Spark – you will need to use a different approach for the two rather

than cutting and pasting an improvement that applies to both tasks for your improvement to count.

Deliverables and Submission Details

There are three deliverables per group:

1. a brief report/documentation outlining your outputs, optimisations and observations;

2. source code: a Jupyter notebook as a single .DBC archive or notebook source file that answers the given two tasks; and

3. a demo in Week 12/13 where you will be quizzed on the contribution of each member.

Here are the specifics associated with each deliverable item.


Report

Filename recommendation: data3404_y23s1_assignment2_tutgroupname_assignmentgroupnum.pdf

• Your group name, including tutorial code and group number.

• The answers (output) you receive for each question when executing against all three (small/medium/large) datasets.

• A short explanation of 'what is going on', in the general sense, in the Spark code of your solutions for each Task.

• A group contribution statement with the following headings:

  – Your group members' names and SIDs

  – Whether they contributed meaningfully to tasks and the report (yes or no)

• This does not have a strict page limit, but you should keep it relevant to ensure feedback can be useful. In particular:

  – Do not include large code dumps. You are already submitting all of your code. Use 1 or 2 line snippets if you absolutely must. Remember that including your code is not explaining your code.

  – Do not keep writing more in the hope that some of what you include will be correct. You are more likely to run into issues including incorrect information than you are to gain by including correct information.

Jupyter notebook or DBC Archive

Filename recommendation: data3404_y23s1_assignment2_tutgroupname_assignmentgroupnum.ipynb

A single Jupyter notebook file (or .DBC archive) that contains all of your completed tasks. This file

must be able to be run attached to a Databricks Community cluster that has had the Assignment

Bootstrap notebook run on it, and no other configurations made.

Due Date: All deliverables are due in Week 13, no later than Sunday 28th May. Late submission

penalty: -5% of the marks per day late. The marking rubric is in Canvas.

Students must retain electronic copies of their submitted assignment files and databases, as the

unit coordinator may request to inspect these files before marking of an assignment is completed. If

these assignment files are not made available to the unit coordinator when requested, the marking

of this assignment may not proceed.

All the best!

Group member participation

This is a group assignment. The mark awarded for your assignment is conditional on you being

able to explain any of your answers to your tutor or the lecturers if asked.

If your group is experiencing difficulties with the content, you should ask on Ed (use a private

post if you need to discuss code or report writing directly).

Level of contribution                                                      Proportion of final grade received
No participation                                                           0%
Minor contributor, but at least some understanding of group submission    50%
Major contributor to the group's submission                               100%


