CS512 Assignment
Coding Assignment
Before Diving In
All answers must be in PDF format.
This is an individual assignment. You can discuss this assignment on Piazza, including the performance your implementation achieves on the provided datasets, but please do not
work together or share code.
Reference libraries or programs can be found online.
You can use C/C++, Java, or Python 2/3 as your programming language. More detailed project organization guidance can be found at the end of the assignment.
Late policy:
10% off for one day (Oct. 30th, 11:59PM)
20% off for two days (Oct. 31st, 11:59PM)
40% off for three days (Nov. 1st, 11:59PM)
A section titled Frequently Asked Questions can be found at the end of the assignment, and we will keep updating it based on questions from Piazza and office hours.
Please first read through the entire assignment description before you start.
Problem Description
As we learned in class, traditional topic modeling can suffer from non-informative topics and overlapping semantics between different topics. In response, discriminative topic mining incorporates user guidance in the form of category names and retrieves representative, discriminative phrases during embedding learning. In the meantime, these category-name guided text
embeddings can be further utilized to train a high-quality weakly-supervised classifier.
Specifically, you need to finish the following five steps.
Step 1: Download the training datasets on news and movies (see the Problem Data section below), and use AutoPhrase to extract high-quality phrases.
Step 2: Write or adopt CatE on the segmented corpus to find representative terms for each category.
Step 3: Perform weakly-supervised text classification with only class label names or keywords. Test the classifier on two datasets.
Step 4: Investigate the results and propose your own way of using prompting with pre-trained language models to improve them. Implement your method and compare it with the one from Step 3 (one possible prompting sketch is given right after this list).
Step 5: Submit your implementation, results, and a short report to Canvas.
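As a hedged illustration of what Step 4 could look like, the sketch below scores each category name as the fill for a masked slot in a simple prompt template using a pre-trained masked language model. The template, the model choice (bert-base-uncased), the character truncation, and the helper name prompt_classify are illustrative assumptions, not part of the assignment; your own Step 4 method can look completely different.

# Hypothetical prompting baseline for Step 4: ask a masked LM which category
# name best fills a [MASK] slot in a simple template. Requires `transformers`.
from transformers import pipeline

# Category names from news_category.txt; the template is an assumption.
CATEGORIES = ["politics", "sports", "business", "technology"]
TEMPLATE = "This news article is about {mask}. {doc}"

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def prompt_classify(doc):
    """Return the category whose name the masked LM prefers for this document."""
    text = TEMPLATE.format(mask=unmasker.tokenizer.mask_token, doc=doc[:1000])
    # `targets` restricts the fill-mask scores to the four category names.
    preds = unmasker(text, targets=CATEGORIES)
    return max(preds, key=lambda p: p["score"])["token_str"].strip()

if __name__ == "__main__":
    print(prompt_classify("The senate passed the new budget bill after a long debate."))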
Problem Data
You can find the problem data at this link. Here are the files you will see when you download the data.
Name     #Documents   Category names         Training Text       Testing Text       Validation Labels
News     120000       news_category.txt      news_train.txt      news_test.txt      first 100 of news_train_labels.txt
Movies   25000        movies_category.txt    movies_train.txt    movies_test.txt    first 100 of movies_train_labels.txt
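If it helps to get started, here is a minimal loading sketch. It assumes one document per line in the training/testing text files and one name per line in the category and label files; check the downloaded files before relying on this.

# Minimal data-loading sketch; the per-line file format is an assumption.
from pathlib import Path

def read_lines(path):
    """Return non-empty, stripped lines of a text file."""
    return [line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines() if line.strip()]

categories = read_lines("news_category.txt")             # e.g. politics, sports, business, technology
train_docs = read_lines("news_train.txt")
test_docs  = read_lines("news_test.txt")
val_labels = read_lines("news_train_labels.txt")[:100]   # labels for the first 100 training docs

print(len(train_docs), "training docs,", len(test_docs), "test docs,", len(categories), "categories")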
Step 1: Adopt AutoPhrase to extract high quality phrases
In this step, you will need to utilize AutoPhrase to extract high-quality phrases from train.txt of both provided datasets. The extracted phrase list looks like the following (the example here is different from the
homework test data):
Score Phrase
0.9857636285 lung nodule
0.9850002116 presidential election
0.9834895762 wind turbines
0.9834120003 ifip wg
....
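If you use the reference AutoPhrase implementation, its ranked phrase list follows the score-TAB-phrase format shown above. A minimal parsing sketch is below; the output path and the 0.5 score threshold are illustrative assumptions, not required values.

# Sketch for reading a ranked phrase list (score <TAB> phrase per line).
def load_quality_phrases(path, min_score=0.5):
    phrases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue
            score, phrase = float(parts[0]), parts[1]
            if score >= min_score:
                phrases.append((score, phrase))
    return phrases

quality = load_quality_phrases("models/news/AutoPhrase.txt")  # path is an assumption
print(len(quality), "phrases kept, top:", quality[:5])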
Step 2: Compute category name guided embedding on segmented corpus
Use your segmentation model to parse the same corpus; the recommended parameters for segmentation are HIGHLIGHT_MULTI=0.7 and HIGHLIGHT_SINGLE=1.0. An example segmented
corpus looks like this:
... is presented of the use of <phrase>spatial data structures</phrase> in <phrase>spatial databases</phrase>. The focus is on <phrase>hierarchical data structures</phrase>, including a number of variants of <phrase>quadtrees</phrase>, which sort the data with respect to the space occupied by it. Such techniques are known as <phrase>spatial indexing</phrase> methods. <phrase>Hierarchical data structures</phrase> are based on the principle of <phrase>recursive decomposition</phrase>.
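AutoPhrase's phrasal segmentation marks mined phrases with <phrase>...</phrase> tags, as in the example above. Before embedding learning, it is common to join each tagged phrase into a single underscore-connected token (matching the term format shown in the output example below). One possible preprocessing sketch, with file names as placeholders:

# Turn <phrase>...</phrase> spans into single underscore-joined tokens
# (e.g. "spatial_data_structures") so each phrase acts as one word during
# embedding training. File names are assumptions.
import re

PHRASE_TAG = re.compile(r"<phrase>(.*?)</phrase>")

def untag(line):
    return PHRASE_TAG.sub(lambda m: m.group(1).strip().replace(" ", "_"), line)

with open("segmentation.txt", encoding="utf-8") as fin, \
        open("segmented_corpus.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(untag(line))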
Then you will write your own CatE or refer to an existing implementation to compute the phrase embeddings and perform category-guided phrase mining. You will need to submit each category and its top-10
representative terms in {category_name}_terms.txt.
For example, in technology_terms.txt, the first line should be the category name and its embedding, and the following 10 lines should be the representative terms and their embeddings:
technology 0.720378 -0.312077 0.811608 ... 1.096724
terms_of_usage_privacy_policy_code_ 1.439691 0.508672 -0.958150 ... -1.277346
...
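A sketch of how such a file could be written once you have a category's embedding and its top terms is below; top_terms and embedding are placeholders for whatever your CatE step produces.

# Dump a category and its top-10 terms, one "term value value ..." row per line,
# in the format shown above. `embedding` maps a term to a vector (placeholder).
def write_terms_file(category, top_terms, embedding, out_dir="."):
    path = "{}/{}_terms.txt".format(out_dir, category)
    with open(path, "w", encoding="utf-8") as f:
        for term in [category] + list(top_terms[:10]):
            vec = " ".join("{:.6f}".format(x) for x in embedding[term])
            f.write("{} {}\n".format(term, vec))

# Example (hypothetical embeddings):
# write_terms_file("technology", top_terms=tech_terms, embedding=cate_vectors)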
Tip: You can concatenate train.txt and test.txt into a larger corpus for phrase mining and category-name guided embeddings.
Step 3: Document Classification with CatE embeddings
In this step, you will need to build a weakly-supervised classifier (e.g., WeSTClass, LOTClass) on top of the term embeddings or topic keywords you obtained in the previous steps. The
only supervision is the category names provided with the datasets.
For example, in news, we have the following category names in news_category.txt:
politics
sports
business
technology
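WeSTClass and LOTClass are full systems; purely as a hedged illustration of the weak-supervision idea, the sketch below pseudo-labels each document by cosine similarity between its averaged term embedding and each category's embedding. A supervised classifier could then be trained on these pseudo-labels. All names and dimensions here are placeholders for whatever Step 2 produced.

# Illustrative pseudo-labeling step (not WeSTClass/LOTClass themselves):
# assign each document to the category whose embedding is closest to the
# document's average word/phrase embedding.
import numpy as np

def doc_vector(tokens, embedding, dim):
    vecs = [embedding[t] for t in tokens if t in embedding]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(a @ b / denom)

def pseudo_label(doc_tokens, categories, embedding, dim=100):
    d = doc_vector(doc_tokens, embedding, dim)
    return max(categories, key=lambda c: cosine(d, embedding[c]))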
To help you validate your results, we also provide labels for the first 100 documents in both datasets in news_train_labels.txt and movies_train_labels.txt. Feel free to discuss the validation
performance you get in this step on Piazza.
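For instance, a minimal validation sketch (assuming one label per line in the label file, and a predict(doc) -> category function from your own classifier) could look like:

# Check accuracy on the 100 labeled validation documents.
def validate(predict, docs, label_path="news_train_labels.txt", n=100):
    with open(label_path, encoding="utf-8") as f:
        gold = [line.strip() for line in f][:n]
    preds = [predict(doc) for doc in docs[:n]]
    correct = sum(p == g for p, g in zip(preds, gold))
    print("validation accuracy: {}/{} = {:.3f}".format(correct, len(gold), correct / len(gold)))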