
School of Computing Sciences

Module: CMP-6002B Machine Learning

Assignment: Classification with Decision Trees

Date set: Tuesday 11th March 2022

Value: 50%

Date due: Wednesday 18th May 2022 3pm

Returned by: Wednesday 17th June 2022

Submission: Blackboard

Learning outcomes

The students will learn about implementing variants of decision trees and ensemble techniques, and the general issues involved with implementing any classifier. They will better understand how to evaluate and compare classifiers on a range of data and for a specific problem. They will appreciate the difficulty of presenting technical results in a way that supports or refutes a hypothesis. The exercises will improve the following transferable skills: programming in Java; analysis of data and results; writing clearly and concisely; and presenting data coherently. The students will also learn the basics of how to use GitHub to work with open source software.

Specification

Overview

The task for this assignment is to implement components of decision trees and ensembles, then to evaluate and compare them on a range of real-world data.


Part 1: Building a Tree by Hand (10%)

The purpose of this exercise is to make sure you understand how the basic algorithms work and to provide you with a bench check for later implementations.

The data in Table 1 shows a training data set for a classification problem: predicting the region of a whisky from three taste attributes, Peaty, Woody and Sweet.

Table 1: Whisky Region Classification Example

  Peaty  Woody  Sweet  Region
  yes    no     yes    Islay
  yes    yes    yes    Islay
  yes    no     no     Islay
  yes    no     no     Islay
  no     yes    no     Islay
  no     yes    yes    Speyside
  no     yes    yes    Speyside
  no     yes    yes    Speyside
  no     no     yes    Speyside
  no     no     yes    Speyside

Construct decision trees for this data using

1. the chi-squared statistic as the splitting criterion;

2. the information gain as the splitting criterion.

The stopping criterion for both trees should be to stop at a pure node (a node of just one class) or when there are no more attributes to use. Include all calculations and draw the final decision trees.
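As a bench check for the hand calculations, the following minimal plain-Java sketch (independent of any coursework code; the class and method names are illustrative only) computes the information gain of the root split on Peaty from Table 1:

```java
public class PeatyBenchCheck {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy (in bits) of a two-class node with probability p for class one;
    // 0 * log2(0) is taken to be 0.
    static double entropy(double p) {
        if (p == 0.0 || p == 1.0) return 0.0;
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    // Information gain of splitting the Table 1 data on Peaty.
    static double informationGainPeaty() {
        double root = entropy(5.0 / 10.0);   // root node: 5 Islay, 5 Speyside
        double hYes = entropy(4.0 / 4.0);    // Peaty = yes: 4 Islay, 0 Speyside
        double hNo  = entropy(1.0 / 6.0);    // Peaty = no:  1 Islay, 5 Speyside
        return root - (4.0 / 10.0) * hYes - (6.0 / 10.0) * hNo;
    }

    public static void main(String[] args) {
        System.out.printf("information gain for Peaty = %.3f%n",
                informationGainPeaty());     // prints 0.610
    }
}
```

Your worked solution should show the same value for this split before moving on to the full trees.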


Part 2: Implement Variants of Decision Trees and Ensembles (50%)

This exercise involves implementing, debugging and testing variants of decision tree classifiers. A Java code base, tsml, is provided. To get started, you need to clone the ml6002b-coursework branch of the tsml GitHub repository; there is guidance on how to do this on Blackboard. Please follow the method and class names specified in the instructions. Failure to do so may result in a marks penalty.

Static functions to help assess attribute quality (15%)

Implement a class called AttributeMeasures in the package ml_6002b_coursework that contains four static methods, each of which measures the quality of an attribute split at a node. They should all take a two-dimensional array of integers as an argument and return a double. You can assume that the rows represent the different values of the attribute being assessed, and the columns the class counts. The methods are:

1. measureInformationGain returns the information gain for the contingency table.

2. measureInformationGainRatio returns the information gain ratio for the contingency table.

3. measureGini returns the Gini measure for the contingency table.

4. measureChiSquared returns the chi-squared statistic for the contingency table.

The formal definitions for information gain, gain ratio, Gini and chi-squared are given in the lecture on Decision Trees, and I want you to follow those formulae. Comment the code to indicate how you have dealt with any edge cases. Include a main method test harness that tests each function by finding each measure for the attribute Peaty in terms of the class. Print each measure to the console, in the form “measure <insert> for Peaty = <insert>”.
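One of the four measures can be sketched in plain Java as follows. The contingency-table layout (rows = attribute values, columns = class counts) follows the specification above; the handling of zero expected counts is an assumed edge-case policy that you should decide on and document in your own code:

```java
public class ChiSquaredSketch {

    /** Chi-squared statistic for a contingency table: rows are attribute
     *  values, columns are class counts. Expected counts come from the
     *  row totals and the overall class distribution. */
    public static double measureChiSquared(int[][] table) {
        int rows = table.length, cols = table[0].length, total = 0;
        int[] rowSum = new int[rows];
        int[] colSum = new int[cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += table[i][j];
                colSum[j] += table[i][j];
                total += table[i][j];
            }
        }
        double chi = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowSum[i] * (double) colSum[j] / total;
                if (expected > 0) {   // assumed edge case: skip empty rows/columns
                    chi += Math.pow(table[i][j] - expected, 2) / expected;
                }
            }
        }
        return chi;
    }

    public static void main(String[] args) {
        // Peaty vs Region from Table 1: rows yes/no, columns Islay/Speyside.
        int[][] peaty = {{4, 0}, {1, 5}};
        System.out.printf("measure chi-squared for Peaty = %.3f%n",
                measureChiSquared(peaty));   // prints 6.667
    }
}
```

The value 20/3 ≈ 6.667 doubles as a bench check against your Part 1 hand calculations.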

Alternative attribute selection methods for CourseworkTree (15%)

Your task is to facilitate using your attribute measures as selection mechanisms within the CourseworkTree classifier provided in the package ml_6002b_coursework. It is advisable to revisit the lab sheet on decision trees and to look at the original source for ID3. There is an interface called AttributeSplitMeasure that is used in CourseworkTree. You need to implement variants of this interface and use them within CourseworkTree. The interface contains a single abstract method, computeAttributeQuality, which takes an Instances and an Attribute and should return the quality of the attribute passed. There is also a default method splitData to split Instances by the Attribute.

1. Implement and test the skeleton class IGAttributeSplitMeasure so that the split is performed using information gain or information gain ratio. The choice of measure should be controlled by a boolean variable called useGain.

2. Implement and test a class ChiSquaredAttributeSplitMeasure that implements AttributeSplitMeasure and measures the quality using the chi-squared statistic.

3. Implement and test a class GiniAttributeSplitMeasure that implements AttributeSplitMeasure and measures the quality using the Gini index statistic.


4. Configure CourseworkTree so that it can be used with any of the four attribute split measures you have implemented (including both information gain and information gain ratio). Adjust CourseworkTree so that the split criterion can be set through setOptions.

5. Currently, CourseworkTree can only be used with nominal attributes. Implement a method in AttributeSplitMeasure called splitDataOnNumeric that makes a binary split of instances into those below the value and those above the value. This should be done prior to measuring the attribute quality. Please research how best to do this.
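The choice of split value in item 5 is deliberately left open. One simple option (an assumption for illustration, not the required design) is to split on the attribute mean; the median or an information-gain-optimal cut point are common alternatives worth researching. The partition logic, with Weka's Instances replaced by an array of values and index lists so the sketch is self-contained:

```java
import java.util.ArrayList;
import java.util.List;

public class NumericSplitSketch {

    /** Partition the indices of a numeric attribute into those at or below
     *  a threshold and those above it; splitDataOnNumeric would apply the
     *  same idea to two sets of Weka Instances. */
    public static List<List<Integer>> splitOnNumeric(double[] values, double threshold) {
        List<Integer> below = new ArrayList<>();
        List<Integer> above = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] <= threshold) below.add(i);
            else above.add(i);
        }
        List<List<Integer>> split = new ArrayList<>();
        split.add(below);
        split.add(above);
        return split;
    }

    public static void main(String[] args) {
        double[] v = {0.2, 3.1, 1.5, 4.0};
        double mean = (0.2 + 3.1 + 1.5 + 4.0) / 4.0;   // 2.2
        List<List<Integer>> split = splitOnNumeric(v, mean);
        System.out.println("below: " + split.get(0) + "  above: " + split.get(1));
        // prints below: [0, 2]  above: [1, 3]
    }
}
```

Whatever threshold rule you adopt, apply it before measuring attribute quality, as the specification requires.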

The majority of the marks will be for correctness. However, some consideration will be given to efficiency and code quality. Include a main method in all three AttributeSplitMeasure classes that prints out the split criterion value for Peaty, Woody and Sweet over the whole data set from Part 1. Print in the form “measure <insert> for attribute <insert> splitting diagnosis = <insert>”. Add a main method to the class DecisionTreeCoursework that loads the problem optdigits (all nominal attributes) from the directory test_data (do not use absolute file paths for this), performs a random split, then builds classifiers using the IG, chi-squared and Gini split criteria, outputting the test accuracy of each to the console in the form “DT using measure <insert> on optdigits problem has test accuracy = <insert>”. Repeat this for the data set ChinaTown (all continuous attributes).

Implement an Ensemble Classifier (20%)

The task is to implement an ensemble classifier that can be used with variants of your enhanced CourseworkTree classifier. You must implement this classifier from scratch rather than use any existing classifiers in tsml. Implement a classifier, TreeEnsemble, that extends AbstractClassifier and consists of an ensemble of CourseworkTree classifiers stored in an array or List. The size of the ensemble should be controlled by an integer attribute numTrees, set to a default value of 50. TreeEnsemble should be in the package ml_6002b_coursework.

1. The method buildClassifier should construct a new set of instances for each base classifier of the ensemble by selecting a random subset (without replacement) of the attributes. The proportion of attributes to select should be a parameter (defaulting to 50%). The TreeEnsemble will need to store which attributes are used with which classifier, in order to recreate the attribute selections in classifyInstance and distributionForInstance.

2. Further diversity should be injected into the ensemble by randomising the decision tree parameters. This diversification should include the attribute selection mechanisms you have implemented, but can also involve other tree parameters such as stopping conditions.

3. Implement classifyInstance so that it returns the majority vote of the ensemble, i.e. classify a new instance with each of the decision trees, count how many classifiers predict each class value, then return the class that receives the largest number of votes.

4. Implement distributionForInstance so that it returns the proportion of votes for each class.

5. Implement an option where, rather than counting votes, the ensemble averages the probability distributions of the base classifiers. This should be controlled by a boolean attribute called averageDistributions.

6. Implement a main method in the class TreeEnsemble that loads the problems optdigits and ChinaTown, prints out the test accuracy, and also prints out the probability estimates for the first five test cases.
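The voting logic behind steps 3 and 4 can be sketched independently of Weka. The helper names below are illustrative, not part of the required API; in TreeEnsemble the predictions would come from calling classifyInstance on each base CourseworkTree:

```java
public class VotingSketch {

    /** Majority vote over base-classifier predictions (class indices);
     *  ties are broken in favour of the lower class index. */
    public static int majorityVote(int[] predictions, int numClasses) {
        int[] votes = new int[numClasses];
        for (int p : predictions) votes[p]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) best = c;
        }
        return best;
    }

    /** Proportion of votes for each class, as distributionForInstance
     *  would return it. */
    public static double[] voteDistribution(int[] predictions, int numClasses) {
        double[] dist = new double[numClasses];
        for (int p : predictions) dist[p]++;
        for (int c = 0; c < numClasses; c++) dist[c] /= predictions.length;
        return dist;
    }

    public static void main(String[] args) {
        int[] preds = {0, 1, 1, 0, 1};   // five trees, two classes
        System.out.println(majorityVote(preds, 2));   // prints 1
        System.out.println(java.util.Arrays.toString(voteDistribution(preds, 2)));
        // prints [0.4, 0.6]
    }
}
```

With averageDistributions set, you would instead sum each tree's distributionForInstance output and divide by numTrees.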


Evaluation of Decision Tree and Ensemble Classifiers (40%)

Your task is to perform a series of classification experiments and write them up as a research paper. Your experiments will address the question of whether the variations you have implemented for decision trees improve performance over a range of problems, and whether your ensemble is a good approach for a specific case study data set. You have been assigned a specific data set (see Blackboard) and a link to further information. Please note that for this part of the coursework we mark the paper itself, not the code used to generate the results. We advise you to reserve significant time for reading and checking your paper, and recommend you ask someone else to read it before submission. Imagine you are writing a paper for a wide audience, not just for the marker. Aim for a tight focus on the specific questions the paper addresses. We have provided two sets of data sets for these experiments, in the files UCIDiscrete.zip and UCIContinuous.zip. A list of the problems in these files is given in the otherwise empty class DatasetLists.java. You will be assigned a completely different problem from timeseriesclassification.com for the case study.

There are four issues to investigate:

1. Decision Trees: Test whether there is any difference in average accuracy for the attribute selection methods on the classification problems we have provided. Compare your versions of DecisionTree to the Weka ID3 and J48 classifiers.

2. Tree Ensemble vs Tree Tuning: Test whether tuning CourseworkTree, including choosing the attribute selection criterion, is better than ensembling. It is up to you to select the ranges of values to tune and ensemble over, but you should ensure that the range of parameters you tune over is the same as the one you use to ensemble. Perform this experiment with the proportion of attributes set to 100%, then repeat the experiment with the proportion set to 50%. You can use any code available in tsml to help you do the tuning.

3. Compare your ensemble against a range of built-in Weka classifiers, including other ensembles, on the provided classification problems.

4. Perform a case study on your assigned data set to propose which classifier, from those you have used, would be best for this particular problem.

Experiments

You should compare classifiers based on accuracy (or, equivalently, error) and any other metrics described in the evaluation lecture that you feel are appropriate. You can use the built-in classifier evaluation tools provided in tsml, or any other mechanism. You should think about the experimental design for each experiment, including deciding on a sampling method (train/test split, cross-validation or resampling) and performance measures. The choice of classifiers for part 3 is up to you, but we recommend you include random forest and rotation forest in the list. Include a table of all parameter settings of all classifiers in the paper. You can also compare the time each algorithm takes to build a model. You can use both the discrete data and the continuous data, although it may be advisable to split the analysis into two parts if you use both. For the case study, the data you have been given has all continuous attributes. You should experiment with the data in this format, and compare the results to those obtained by first discretising the data so that it is converted to nominal attributes. The conversion can be done using the Weka filter Discretize or by any other means.
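Weka's unsupervised Discretize filter defaults to equal-width binning. As a plain-Java illustration of what that conversion does (the class name, bin count and data here are assumptions for the example, not part of the coursework):

```java
public class EqualWidthBins {

    /** Map continuous values to bin indices 0..numBins-1 using equal-width
     *  intervals over [min, max] -- the idea behind equal-width
     *  unsupervised discretisation. */
    public static int[] discretize(double[] values, int numBins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / numBins;
        int[] bins = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = width == 0 ? 0 : (int) ((values[i] - min) / width);
            bins[i] = Math.min(b, numBins - 1);   // the maximum falls in the last bin
        }
        return bins;
    }

    public static void main(String[] args) {
        double[] v = {0.0, 2.5, 5.0, 7.5, 10.0};
        System.out.println(java.util.Arrays.toString(discretize(v, 4)));
        // prints [0, 1, 2, 3, 3]
    }
}
```

In practice you would apply the Weka filter to the whole data set rather than reimplement this; the sketch only shows what the nominal values you get back represent.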


The Write Up

You should write a paper in LaTeX using the style file available on Blackboard. The paper should be called “An experimental assessment of decision trees and decision tree ensembles”. It should be around 4 pages, and should not be longer than 8 pages, including references and appendices. Your paper should include the following sections:

1. Introduction: start with the aims of the paper, a description of the hypotheses you are testing and an overview of the structure. Include your prior beliefs as to what you think the outcome of the tests will be, with some rationalisation as to why. So, for example, do you think attribute evaluation will make a difference? Can you find any published literature that claims to answer any of the questions?

2. Data Description: an overview describing the data, including data characteristics summarised in a table (number of attributes, number of train/test cases, number of classes, class distribution). For your case study data set you should give more detail, including a description of the source of the data and the nature of the problem the data describes. Look for references for similar types of problem. Do not simply copy the data description from timeseriesclassification.com.

3. Classifier Description: include a description of your CourseworkTree classifier, including details of the design choices you made and the data structures you employed. You should include an example of how to use the classifier with the refinements you have included. Also provide an overview of the other classifiers you use in the case study, including references.

4. Results: a description of the experimental procedure you used and details of the results. Remember, graphs are good. There should be a subsection for each hypothesis. The case study section should go into greater depth; remember, accuracy is not the only criterion.

5. Conclusions: how do you answer the questions posed? Are there any biases in your experiments and, if so, how would you improve or refine them?

Presenting the output of experiments is a key transferable skill in any technical area. I stress again that for this section we mark the paper, not the experimental code. Avoid writing a simple description of what you did. We do not need to know about any blind alleys you went down or problems you encountered, unless they are relevant to the experiments (e.g. memory constraints). Please also take care to include references where appropriate, and to check the formatting of the references.

These experiments will require a reasonable amount of computation. We suggest you store the results of experiments in the format used on the module, to avoid data loss in the case of computer failure. You can remotely log on to the lab machines and run experiments on this hardware if yours is limited.

Submission Requirements

You should submit a single zipped file containing your solutions to parts 1, 2 and 3 to the Blackboard submission point.

For part 1, submit a single PDF with your worked solution. For part 2, submit a zipped IntelliJ project with your code. For part 3, submit a single PDF of your paper. A template format is available here:

https://www.overleaf.com/read/ycgpvxbgwmkf


Feel free to copy this project. The Blackboard submission portal will go live one week before the deadline.

Plagiarism and collusion: Stack Overflow and other similar forums (such as the module discussion board) are valuable resources for learning and completing formative work. However, for assessed work such as this assignment, you may not post questions relating to coursework solutions in such forums. Using code from other sources, or written by others, without acknowledgement is plagiarism, which is not allowed (General Regulation 18).

Marking Scheme

1. Part 1. Decision trees by hand (10 marks)

2. Part 2. Implement tree variants (50 marks)

3. Part 3. Evaluate Classifiers (40 marks)

Total: 100 Marks, 50% of the overall module marks


