
COMP3009J – Information Retrieval

Programming Assignment

This assignment is worth 30% of the final grade for the module.

Due Date: Sunday 28th May 2023 at 23:55 (i.e. before beginning of Week 15)

Before you begin, download and extract the files "small_corpus.zip" and "large_corpus.zip" from Brightspace. These contain several files that you will need to complete this assignment. The README.md file in each describes the files contained in the archive and their format. (The README.md is a Markdown file. Although you can open and read it as plain text, a Markdown editor like Remarkable (https://remarkableapp.github.io/ - Windows or Linux) or MacDown (https://macdown.uranusjr.com/ - macOS) is recommended.)

Both corpora are in the same format, except for the relevance judgments. For the small corpus, all documents not included in the relevance judgments have been judged non-relevant. For the large corpus, documents not included in the relevance judgments have not been judged.

You should first try the following tasks using the small corpus. If you can do this successfully, then you should move to the large corpus. The large corpus will require more efficient programs.

Part 1: BM25 Model

For this assignment you are required to implement the BM25 model of Information Retrieval. You must create a program (using Python) that can do the following.

1. Extract the documents contained in the document collections provided. You must divide the documents into terms in an appropriate way. The strategy must be documented in your source code comments.

2. Perform stopword removal. A list of stopwords to use is contained in the stopwords.txt file that is provided in the "files" directory.

3. Perform stemming. For this task, you may use the porter.py code in the "files" directory. (An illustrative preprocessing sketch covering steps 1-3 appears after this list.)

4. The first time your program runs, it should create an appropriate index so that IR using the BM25 method may be performed. Here, an index is any data structure that is suitable for performing retrieval later. This will require you to calculate the appropriate weights and do as much pre-calculation as you can. This should be stored in an external file in some human-readable format. Do not use database systems (e.g. MySQL, SQL Server, SQLite, etc.) for this. (An illustrative indexing and scoring sketch also appears after this list.)


5. The other times your program runs, it should load the index from this file, rather than processing the document collection again.

6. Run queries according to the BM25 model. This can be done in two ways:

- In "interactive" mode, a user can manually type in queries and see the first 15 results in their command line, sorted beginning with the highest similarity score. The output should have three columns: the rank, the document's ID, and the similarity score. A sample run of the program is contained later in this document. The user should continue to be prompted to enter further queries until they type "QUIT".

- In "automatic" mode, the standard queries should be read from the "queries.txt" file (in the "files" directory). This file has a query on each line, beginning with its query ID. The results should be printed into a file named "results.txt", which should include four columns: query ID, rank, document ID and similarity score.
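The following is a minimal preprocessing sketch covering steps 1-3, not a required implementation. It assumes that stopwords.txt lists one word per line and that the provided porter.py exposes a PorterStemmer class with a stem(word) method; check the provided files for their actual format and interface, and document whichever tokenisation strategy you actually choose.

import os
import re
import porter  # assumes porter.py has been made importable (e.g. copied beside your script or added to sys.path)

# Load the stopword list (assumed to be one word per line in files/stopwords.txt).
with open(os.path.join("files", "stopwords.txt"), "r", encoding="utf-8") as f:
    STOPWORDS = {line.strip().lower() for line in f if line.strip()}

STEMMER = porter.PorterStemmer()  # assumed class and method names; check porter.py

def preprocess(text):
    # Lowercase, split on non-alphanumeric characters, remove stopwords, then stem.
    # This is only one possible tokenisation strategy.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]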
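Below is a rough sketch of one common BM25 variant, for illustration only (it is not the required design). It assumes documents have already been preprocessed into term lists, uses the common defaults k1 = 1.2 and b = 0.75 (use whatever parameter values and idf formulation the module specifies), and stores the index as JSON using only the standard library.

import json
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # common defaults; use the values given in the module

def build_index(docs):
    # docs: dict mapping doc_id -> list of preprocessed terms
    doc_len = {d: len(terms) for d, terms in docs.items()}
    avg_len = sum(doc_len.values()) / len(doc_len)
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    for d, terms in docs.items():
        for term, tf in Counter(terms).items():
            postings[term][d] = tf
    return {"doc_len": doc_len, "avg_len": avg_len, "postings": postings}

def save_index(index, path="index.json"):
    # Human-readable external index file; no database system involved.
    # Note that JSON turns numeric document ids into strings when reloaded.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)

def bm25_scores(query_terms, index):
    # Score every document containing at least one query term.
    doc_len, avg_len, postings = index["doc_len"], index["avg_len"], index["postings"]
    n_docs = len(doc_len)
    scores = defaultdict(float)
    for term in query_terms:
        containing = postings.get(term, {})
        n_t = len(containing)
        if n_t == 0:
            continue
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))  # one common idf formulation
        for d, tf in containing.items():
            denom = tf + K1 * (1 - B + B * doc_len[d] / avg_len)
            scores[d] += idf * tf * (K1 + 1) / denom
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

Pre-computing document lengths, the average length and document frequencies at indexing time, rather than per query, is what keeps query response times reasonable on the large corpus.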

It is ESSENTIAL that this can be run as a standalone program, without requiring an IDE such as IDLE, PyCharm, etc.

You can assume that your program will be run in the same directory as the README.md file (i.e. the current directory will have the "documents" and "files" directories in it). Do not use absolute paths in your code.

Non-standard libraries (other than the Porter stemmer provided) may not be used.
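For example, paths can be built relative to the current working directory with the standard library only (the directory and file names below come from the assignment; the constant names are illustrative):

import os

DOCUMENTS_DIR = "documents"
STOPWORDS_PATH = os.path.join("files", "stopwords.txt")
QUERIES_PATH = os.path.join("files", "queries.txt")
QRELS_PATH = os.path.join("files", "qrels.txt")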

Users should be able to select the appropriate mode by using command line arguments, e.g.:

- python search_small_corpus.py -m interactive

Or for automatic mode:

- python search_small_corpus.py -m automatic
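One way to handle this flag with only the standard library is argparse; the sketch below is illustrative, and the two mode functions are hypothetical placeholders for your own code:

import argparse

def run_interactive_mode():
    # Hypothetical placeholder: prompt for queries until "QUIT" is typed.
    while True:
        query = input("Enter query: ")
        if query.strip() == "QUIT":
            break
        # ... score the query with BM25 and print the top 15 results ...

def run_automatic_mode():
    # Hypothetical placeholder: read files/queries.txt and write results.txt.
    pass

parser = argparse.ArgumentParser(description="BM25 retrieval over the small corpus")
parser.add_argument("-m", "--mode", choices=["interactive", "automatic"], required=True)
args = parser.parse_args()

if args.mode == "interactive":
    run_interactive_mode()
else:
    run_automatic_mode()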

Part 2: Evaluation

For this part, your program should process your results.txt file (that was created during automatic mode above) to evaluate the effectiveness of the BM25 approach.

The user should be able to run the program using the following command:

- python evaluate_small_corpus.py

Based on the results.txt file, your program should calculate and print the following evaluation metrics (based on the relevance judgments contained in the "qrels.txt" file in the "files" directory); an illustrative sketch of these computations follows the list:

- Precision

- Recall

- P@10

- R-precision

- MAP

- bpref
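The sketch below shows one common formulation of these metrics for a single query. It is illustrative only: it assumes the ranked results and relevance judgments have already been parsed into the structures shown, and the exact bpref variant expected should be checked against the lecture notes. MAP is the mean of the per-query average precisions, and for the small corpus every document not listed as relevant counts as judged non-relevant (see the note at the start of this document).

def precision_recall(ranked_ids, relevant):
    # Precision and recall over the full ranked list returned for the query.
    retrieved_relevant = sum(1 for d in ranked_ids if d in relevant)
    precision = retrieved_relevant / len(ranked_ids) if ranked_ids else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at(ranked_ids, relevant, k):
    # P@k; R-precision is precision_at(ranked_ids, relevant, len(relevant)).
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant) / k if k else 0.0

def average_precision(ranked_ids, relevant):
    # Mean of the precision values at each rank where a relevant document appears.
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranked_ids, relevant, judged_nonrelevant):
    # One common bpref formulation: only judged documents contribute.
    R, N = len(relevant), len(judged_nonrelevant)
    if R == 0:
        return 0.0
    nonrel_seen, total = 0, 0.0
    for doc_id in ranked_ids:
        if doc_id in relevant:
            penalty = min(nonrel_seen, R) / min(R, N) if N > 0 else 0.0
            total += 1.0 - penalty
        elif doc_id in judged_nonrelevant:
            nonrel_seen += 1
    return total / R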

What you should submit

Submission of this assignment is through Brightspace. You should submit a single .zip archive containing:

- Part 1: Python programs to run the queries.

o search_small_corpus.py

o search_large_corpus.py (if you have attempted the large corpus)

- Part 2: Python files to perform the evaluation.

o evaluate_small_corpus.py

o evaluate_large_corpus.py (if you have attempted the large corpus)

- A README.txt or README.md file that describes what your program can do (in particular it should mention whether the program will work on both corpora or only the small one).

Sample Run (Interactive)

$ python search_small_corpus.py -m interactive

Loading BM25 index from file, please wait.

Enter query: library information conference

Results for query [library information conference]

1 928 0.991997

2 1109 0.984280

3 1184 0.979530

4 309 0.969075

5 533 0.918940

6 710 0.912594

7 388 0.894091

8 1311 0.847748

9 960 0.845044

10 717 0.833753

11 77 0.829261

12 1129 0.821643

13 783 0.817639

14 1312 0.804034

15 423 0.795264

Enter query: QUIT

Note: In all of these examples, the results and similarity scores were generated at random for illustration purposes, so they are not correct scores.

Sample Run (Evaluation)

$ python evaluate_small_corpus.py

Evaluation results:

Precision: 0.138

Recall: 0.412

R-precision: 0.345

P@10: 0.621

MAP: 0.253

bpref: 0.345

Grading Rubric

Below are the main criteria that will be applied for the major grades (A, B, C, etc.). Other aspects will be taken into account to decide minor grades (i.e. the difference between B+, B, B-, etc.).

- Readability and organisation of code (including use of appropriate functions, variable names, helpful comments, etc.).

- Quality of solution (including code efficiency, presence of minor bugs, avoiding absolute paths, etc.).

Questions should be sent to david.lillis@ucd.ie or posted in the Brightspace forum.

Passing Grades

"D" Grade

Good implementation of the primary aspects of Information Retrieval, using the small corpus. This includes extracting the documents from the document collection, preprocessing (stemming and stopword removal), indexing and retrieval. The solution may contain some implementation errors. It is clear that the student has correctly understood the Information Retrieval process.

"C" Grade

Good implementation of the primary aspects of Information Retrieval, using the small corpus. The program can also save and load the index to/from an external file, as appropriate.

"B" Grade

Correct implementation of all sections of the assignment using the small corpus (some minor implementation errors will be tolerated). It is clear that the student has understood both the information retrieval process and the evaluation process. Note: This means that evaluation is only taken into account if your search program can successfully retrieve documents for evaluation.

"A" Grade

Excellent implementation of all sections of the assignment, including choice of appropriate efficient data structures and efficient programming. The efficiency of the programs will be measured using the large corpus. In particular, a response to a query must be returned in a reasonable amount of time, although efficiency is important in indexing also. Note: This means that working efficiently on the large corpus is only taken into account if your code can successfully work with the small corpus.

Failing Grades

"ABS" Grade

No submission received.

"NM" Grade

No relevant work attempted.

"G" Grade

Wholly unacceptable, little or no evidence of meaningful work attempted.

"F" Grade

Some evidence of work attempted, but little (if any) functionality operates in the correct manner.

"E" Grade

Clear evidence that work has been attempted on implementing retrieval using BM25, but there are serious errors in implementation, or in understanding of the process.

Other notes

1. This is an individual assignment. All code submitted must be your own work. Submitting the work of somebody else or generated by AI tools such as ChatGPT is plagiarism, which is a serious academic offence. Be familiar with the UCD Plagiarism Policy and the UCD School of Computer Science Plagiarism Policy.

2. If you have questions about what is or is not plagiarism, ask!

Document Version History

v1.0: 2023-05-08, Initial Version.

v1.1: 2023-05-15, Updated requirements for the output format of automatic mode.

