联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2019-10-02 10:07

CSCI446/946 - Spring Session 2019 Page 1

University of Wollongong

School of Computing and Information Technology

CSCI446/946 Big Data Analytics Spring 2019

Assignment 2 (Due: 9 October 2019, Wednesday) 20 marks

Aim

This assignment is intended to provide basic experience in conducting text analytics experiments with R. After

having completed this assignment you should know how to perform text classification, topic modeling, and sentiment

analysis.

Preliminaries

Read through the lecture notes and recommended readings on text analysis. Study all example programs therein so

that you fully understand these techniques and know how to perform them with R.

Task 1 – Text Classification (6 marks)

The 20 Newsgroups data set is a benchmark for text classification. It consists of approximately 20,000 newsgroup

documents, which have been categorised into 20 different newsgroups. Information on this dataset can be obtained

from the webpage http://qwone.com/~jason/20Newsgroups/ . Download the “20news-bydate-matlab.tgz” from

this webpage and unzip it to obtain the training and testing data sets. Train the Na?ve Bayes classifier with the

training data set and test it on the testing data set.

In your report, you need to

1. Describe this 20 Newsgroups data set.

2. Describe how each document is represented in your implementation.

3. Describe Na?ve Bayes classifier and how you use it to classify the 20 Newsgroups data set.

4. Report the classification accuracy and plot the confusion matrix.

5. Attach your code at the end of the report.

Task 2 – Topic Modeling (6 marks)

Perform LDA topic modeling on the Reuters-21578 corpus using R (or Python) and LDA. The NLTK has already

come with the Reuters-21578 corpus. To import this corpus, enter the following comment in the Python prompt:

from nltk.corpus import reuters

R comes with an lda package that has built-in functions. The LDA has also been implemented by several Python

libraries such as gensim. Either use one such package/library or implement your own LDA to perform topic

modeling on the Reuters-21578 corpus.

In your report, you need to

1. Describe the Reuters-21578 corpus.

2. Describe how each document is represented in your implementation.

3. Describe the whole procedure on applying LDA to this corpus to perform topic modeling.

4. Describe the parameter setting that you use in the LDA and explain their meanings.

5. Describe the output of your code and visualize the obtained topics in appropriate ways.

6. Attach your code at the end of the report.

Task 3 – Sentiment Analysis (8 marks)

Choose a topic of your interest, such as a movie, a celebrity, or any buzz word. Then collect 200 tweets related to this

topic. Hand-tag them as positive, neutral, or negative. Next, randomly split them into 150 tweets as the training set

and the remaining 50 as the testing set. Run one or more classifiers (such as Na?ve Bayes, Maximum Entropy, or

Support Vector Machines) over these tweets to perform sentiment analysis. Report the classification accuracy and

CSCI446/946 - Spring Session 2019 Page 2

plot the confusion matrix. When you run more than one classifiers, find methods to evaluate which classifier

performs better than the others. (* It is not compulsory for the students of CSCI446 to run more than one

classifier.)

In your report, you need to

1. Describe the procedure of collecting the tweets and manually tagging them.

2. Describe the statistics of the obtained data set.

3. Describe how you represent each tweet for classification.

4. For each classifier, describe its working principle, classification procedure, and parameter setting.

5. For each classifier, report the classification accuracy and plot the confusion matrix.

6. (CSCI946 only) When you run more than one classifiers, report which classifier performs better than the

others and describe the methods you use to reach this conclusion.

7. Attach your code at the end of the report.

Submit:

Important:

1. The report must be in PDF format.

2. The report shall contain sufficient and detailed description, explanation, justification and

discussion. Marks will be deducted for a BRIEF report.

3. Sufficient annotation shall be provided in your code to make it easy to understand.

Neatly print your report and code (i.e. first the report then the code) on A4 pages with an appropriate cover sheet and

hand it in during the lecture on the 9th of October 2019. Make sure your report and code are correctly formatted and

titled. (Marks will be deducted for untidy or incorrectly formatted work.) Also, submit your report and the source

code in a Zipped file named A2.zip via the submit link provided in the Moodle site.

Note: Failure of your code to run may attract zero marks. Code or reports considered to be unreasonably same due to

copying will attract zero marks. You may be requested to demonstrate and explain your program when necessary.

Marks will be awarded for correct design, implementation and style. Any request for an extension of the submission

deadline or demonstration time limit must be made to the Subject Coordinator before the submission deadline.

Supporting documentation must accompany the request for any extension. Late assignment submissions without

granted extension will be marked but the mark awarded will be reduced by 25% of the assignment mark for each day

(including weekends) late.

--- END ---


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp