
Assignment 6

Large-Scale Supervised Learning

MET CS777

Description

In this assignment, you will implement regularized logistic regression to classify text documents.

Data

You will be dealing with a training data set that consists of around 170,000 text documents (7.6 million lines of text), and a test/evaluation data set that consists of 18,700 text documents (almost exactly one million lines of text). All but around 6,000 of these text documents are Wikipedia pages; the remaining documents are descriptions of Australian court cases and rulings. At a high level, your task is to build a classifier that can automatically figure out whether a text document is an Australian court case or not. We have prepared three data sets for you to use.

1.   The Training Data Set (1.9 GB of text). This is the set to train the logistic regression model.

2.   The Testing Data Set (200 MB of text). This is the set to evaluate the model.

3.   The Small Data Set (37.5 MB of text). This is the set to use for training and testing the model locally before trying to do anything in the cloud.

AWS

Small data set (37.5 MB): s3://metcs777-sp24/data/Assignment6_SmallTrainingData.txt
Large training data (1.9 GB): s3://metcs777-sp24/data/Assignment6_TrainingData.txt
Test data set (200 MB): s3://metcs777-sp24/data/Assignment6_TestingData.txt

Table 1: Data sets on AWS S3 - URLs

Some Data Details to Be Aware Of. You should download and look at the SmallTrainingData.txt file before you begin. You will see that the contents are sort of a pseudo-XML, where each text document begins with a <doc id=...> tag and ends with </doc>. Note that the doc id of an Australian legal case always starts with AU. You will be trying to figure out whether a document is an Australian legal case by looking only at the contents of the document and the document id.
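As a starting point, here is a minimal sketch (not the required solution) of how one might parse the input into (docId, text) pairs and derive the label from the id prefix. The file path and variable names are placeholders, and the assumption that each document sits on a single line should be verified against the small data set.

```python
import re
from pyspark import SparkContext

sc = SparkContext(appName="Assignment6")

# Placeholder path; point it at the small data set while testing locally.
lines = sc.textFile("Assignment6_SmallTrainingData.txt")

def parse_doc(line):
    # Assumption: each document sits on one line wrapped in <doc id="..."> ... </doc>;
    # if documents span multiple lines, stitch the lines between the tags first.
    m = re.search(r'id\s*=\s*"([^"]+)"', line)
    if m is None:
        return None
    text = re.sub(r"<[^>]+>", " ", line)   # drop the surrounding tags
    return (m.group(1), text)

docs = lines.map(parse_doc).filter(lambda kv: kv is not None)

# Label is 1 for an Australian court case (doc id starts with "AU"), 0 otherwise.
labeled = docs.map(lambda kv: (kv[0], kv[1], 1 if kv[0].startswith("AU") else 0))
```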

Tasks

Task 1 (10 points): Data Preparation

First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words in the training corpus. This dictionary is essentially an RDD that has the words as keys and the relative frequency position of each word as the value. For example, the value is zero for the most frequent word and 19,999 for the least frequent word in the dictionary.
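One possible way to build this dictionary in PySpark is sketched below; it assumes the `docs` RDD of (docId, text) pairs from the parsing sketch above and a crude regex tokenizer, both of which are assumptions you may want to refine.

```python
import re

def tokenize(text):
    # Crude regex tokenizer; assumption, refine as needed.
    return re.findall(r"[a-z]+", text.lower())

word_counts = (docs.flatMap(lambda kv: tokenize(kv[1]))
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

# Keep the 20,000 most frequent words; position 0 is the most frequent word.
top_words = word_counts.top(20000, key=lambda wc: wc[1])
dictionary = sc.parallelize([(word, pos) for pos, (word, count) in enumerate(top_words)])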

Next, you will convert each of the documents in the training set to a TF ("term frequency") vector that has 20,000 entries. For example, for a particular document, the 177th entry in this vector is a double that captures the frequency of the 177th most common corpus word in that document. Likewise, the first entry in the vector is a double that captures the frequency, in that document, of the most common word in the corpus.
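A sketch of this conversion, assuming the `labeled` RDD and broadcast dictionary from the earlier sketches, might look like the following; normalizing counts by document length is one reasonable TF definition, not the only one.

```python
import numpy as np

# word -> position lookup, collected to the driver and broadcast to the workers
dict_map = sc.broadcast(dict(dictionary.collect()))

def to_tf(doc):
    doc_id, text, label = doc
    tokens = tokenize(text)
    counts = np.zeros(20000)
    for w in tokens:
        pos = dict_map.value.get(w)
        if pos is not None:
            counts[pos] += 1.0
    # One reasonable TF definition: counts normalized by document length.
    return (doc_id, label, counts / max(len(tokens), 1))

tf_vectors = labeled.map(to_tf).cache()
```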

Then create the TF-IDF matrix based on the top 20,000 words, similar to the previous assignments.
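A possible TF-IDF computation on top of the previous sketch is shown below; it assumes `tf_vectors` covers the combined training and test corpus (see the notes below) and uses a standard log-IDF with a +1 guard, which is an assumption rather than a prescribed formula.

```python
import numpy as np

num_docs = tf_vectors.count()

# Document frequency per dictionary position (how many documents contain the word).
df = tf_vectors.map(lambda t: (t[2] > 0).astype(float)).reduce(lambda a, b: a + b)
idf = np.log(num_docs / (df + 1.0))      # +1 guards against division by zero

idf_b = sc.broadcast(idf)
tfidf = tf_vectors.map(lambda t: (t[0], t[1], t[2] * idf_b.value)).cache()
```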

To get credit for this task, report the average TF value of the words "applicant", "and", "attack", "protein", and "court", separately for court documents and for Wikipedia documents (averaged over documents). Your code needs to print these outputs. This gives five numbers for Wikipedia documents and five numbers for the court cases. Print these values for the large training data set.

Report how long the task takes to run.
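The required printout could be produced along the following lines, assuming the `tf_vectors` RDD and broadcast `dict_map` from the sketches above; the variable names are placeholders.

```python
probe_words = ["applicant", "and", "attack", "protein", "court"]
probe_pos = {w: dict_map.value.get(w) for w in probe_words}

for label_value, name in [(1, "court"), (0, "Wikipedia")]:
    subset = tf_vectors.filter(lambda t, lv=label_value: t[1] == lv)
    n = subset.count()
    tf_sums = subset.map(lambda t: t[2]).reduce(lambda a, b: a + b)
    for w in probe_words:
        pos = probe_pos[w]
        avg = tf_sums[pos] / n if (pos is not None and n > 0) else float("nan")
        print("Average TF of '%s' in %s documents: %f" % (w, name, avg))
```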

Notes:

• You need to compute the TF-IDF vectors over the whole data set, including both training and test data.

• While implementing your script on your laptop, you can set the dimension size to a smaller number and then change it back when you run on the cloud.

Task 2 (50 Points): Learning the Model

You will then use a gradient descent algorithm to learn a logistic regression model that can decide whether a document describes an Australian court case or not. Your model should use L2 regularization; you can vary the regularization parameter to see how it impacts the results and get a sense of the extent of the regularization. We have enough data that you might find the regularization is not that important.

You should run your gradient descent until the L2 norm of the difference in the parameter vector across iterations is very small.
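A minimal sketch of such a batch gradient descent loop with L2 regularization is given below; the learning rate, regularization strength, iteration cap, and tolerance are placeholder values to tune, and `tfidf` is assumed to be the (docId, label, vector) RDD from Task 1.

```python
import numpy as np

D = 20000
w = np.zeros(D)
learning_rate = 0.01      # placeholder values; tune these
reg_lambda = 0.1
tolerance = 1e-6

train = tfidf.map(lambda t: (t[1], t[2])).cache()   # (label, feature vector)
n = train.count()

for iteration in range(500):
    w_b = sc.broadcast(w)

    def point_gradient(point):
        y, x = point
        p = 1.0 / (1.0 + np.exp(-np.dot(x, w_b.value)))   # sigmoid
        return (p - y) * x                                 # gradient of the log loss

    gradient = train.map(point_gradient).reduce(lambda a, b: a + b) / n
    gradient += 2.0 * reg_lambda * w                       # L2 penalty term
    w_new = w - learning_rate * gradient

    delta = np.linalg.norm(w_new - w)                      # L2 norm of the change
    w = w_new
    if delta < tolerance:
        break
```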

Once you have completed this task, you will get credit by printing out the five words with the largest regression coefficients. These are the five words most strongly associated with an Australian court case.

Report how long the task takes to run.
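Assuming the learned coefficient vector `w` from the training loop and the broadcast dictionary from Task 1, the required printout could look like this sketch:

```python
import numpy as np

# Map dictionary positions back to words and print the five largest coefficients.
pos_to_word = {pos: word for word, pos in dict_map.value.items()}
top5 = np.argsort(w)[-5:][::-1]
for pos in top5:
    print(pos_to_word.get(int(pos)), w[pos])
```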

Notes:

• Remember that this is a classification problem, with label 1 for Australian court cases and 0 otherwise.

• Cache the important RDDs. You do not want to re-compute them each time you make a pass through the data during learning. You can create one script for the multiple assignment tasks with different outputs.

• Pass the whole data set in each iteration, not a small sample of it.

• In general, when debugging your code, the first step is to verify that the loss function is decreasing as the learning progresses.

• The key is that each entry in the RDD should hold the ENTIRE normalized TF vector for a document (this can be stored in any reasonable data structure; it might make sense to store only the non-zero entries in a map from int to double; see the sketch after this list). Then, any loop happens WITHIN the map/reduce lambdas you build.

• First implement your logistic regression without regularization. Then implement it with regularization, do some tests, and find a good regularization value.
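To illustrate the sparse-storage note above, here is a small sketch (under the same naming assumptions as before) of keeping only non-zero entries in a dict and computing the per-document gradient contribution inside the map lambda; the `merge` helper is what a reduce over these sparse dicts would use.

```python
import numpy as np

def to_sparse(vec):
    # Keep only the non-zero entries as {position: value}.
    return {i: v for i, v in enumerate(vec) if v != 0.0}

sparse_data = tfidf.map(lambda t: (t[1], to_sparse(t[2]))).cache()

def sparse_point_gradient(point, w):
    y, x = point
    margin = sum(v * w[i] for i, v in x.items())        # sparse dot product
    p = 1.0 / (1.0 + np.exp(-margin))
    return {i: (p - y) * v for i, v in x.items()}        # sparse gradient piece

def merge(a, b):
    # Combine two sparse gradient contributions inside the reduce.
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out
```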

Task 3 (40 Points): Evaluation of the Learned Model

After training the model, it is time to evaluate it. Use your model to predict which of the test documents are Australian court cases. To get credit for this task, you need to compute the F1 score obtained by your classifier.
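One way to compute the F1 score is sketched below, assuming `test_tfidf` is an RDD of (docId, label, vector) built for the test set with the training dictionary and IDF, `w` is the learned coefficient vector, and a 0.5 decision threshold (an assumption).

```python
import numpy as np

w_b = sc.broadcast(w)

def predict(t):
    doc_id, y, x = t
    p = 1.0 / (1.0 + np.exp(-np.dot(x, w_b.value)))
    return (doc_id, y, 1 if p > 0.5 else 0)

preds = test_tfidf.map(predict).cache()

tp = preds.filter(lambda t: t[1] == 1 and t[2] == 1).count()
fp = preds.filter(lambda t: t[1] == 0 and t[2] == 1).count()
fn = preds.filter(lambda t: t[1] == 1 and t[2] == 0).count()

precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
print("F1 score:", f1)
```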

Next, look at the text of three false positives that your model produced (that is, Wikipedia articles that your model predicted to be Australian court cases). Write a paragraph describing why you think your model made these mistakes.

Is this related to bad documents from the Australian legal system?

If you don't have three false positives, just use the ones that you have (if any).
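A small sketch of pulling up to three false positives for inspection, assuming the `preds` RDD above and a hypothetical `test_docs` RDD of (docId, text) pairs for the test set:

```python
# Take up to three false positives and print a snippet of each for inspection.
fp_ids = (preds.filter(lambda t: t[1] == 0 and t[2] == 1)
               .map(lambda t: t[0])
               .take(3))

for doc_id, text in test_docs.filter(lambda kv: kv[0] in fp_ids).collect():
    print(doc_id, text[:500])
```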

Important Considerations

Machines to Use

Be aware that you can choose virtually any configuration for your cluster: a different number of machines, and different configurations of those machines. Note that each setting will cost you differently.

Pricing information is available at: http://aws.amazon.com/elasticmapreduce/pricing/. Since this is real money, it makes sense to develop your code and run your jobs locally on your laptop, using the small data set. Once things are working, you'll move to a cluster.

One option: you can run your Spark jobs over the "large" data using 3 worker machines with 4 cores and 8 GB RAM each.

●     As you can see from the list prices, the cost per hour is not much, but IT WILL ADD UP QUICKLY IF YOU FORGET TO SHUT OFF YOUR MACHINES. Be very careful and stop your machines as soon as you are done working. You can always come back and start your machines, or create new ones easily, when you begin your work again.

●     Another thing to be aware of is that cloud services charge you when you move data around. To avoid such charges, do everything in the northeast region, where the data is.

●     You should document your code very well and as much as possible.

Your code should be developed and run on a Unix-based operating system such as Linux or macOS.

Submission Guidelines

Naming Convention:

METCS777-Assignment6-[TaskX-Y]-FIRST+LASTNAME.[pdf/py/ipynb]

Where:

o [TaskX-Y] does not apply to .pdf files

o No space between first and last name

Files:

o Create one PDF document that has screenshots of the running results of all coding problems. For each task, copy and paste the results that your last Spark job saved in the bucket. Also, for each Spark job, include a screenshot of the Spark History. Explain the results clearly and precisely.

Figure 1: Screenshot of Spark History

o Include output file for each task.

o Please submit each file separately (DO NOT ZIP them!!!).

For example, a sample submission of John Doe's Assignment 6 should consist of the following files:

o METCS777-Assignment6-JohnDoe.pdf

o METCS777-Assignment6-Task1-3-JohnDoe.ipynb

o METCS777-Assignment6-Task1-JohnDoe.py

o METCS777-Assignment6-Task1-Output-JohnDoe.txt

o …

Evaluation Criteria for Coding Tasks

Criteria | Excellent | Good | Fair | Poor | Points
Correctness | Code accurately completes all tasks | Code completes most tasks correctly | Code shows understanding but has inaccuracies | Code fails most tasks | 40%
Efficiency | Highly optimized code | Somewhat optimized code | Code works but not optimized | Inefficient code | 20%
Code Structure and Organization | Exceptionally well-organized code | Mostly organized code | Somewhat disorganized code | Poorly structured code | 20%
Error Handling and Data Cleaning | Robust error handling and data cleaning | Handles most data issues | Some issues with error handling | Poor error handling and data cleaning | 10%
Reporting Processing Time | Accurate processing time reported | Mostly accurate processing time | Significant inaccuracies in time reporting | Inaccurate or no time reporting | 10%
Total | | | | | 100%




