COMP9517: Computer Vision
Group Project Specification
Maximum Marks Achievable: 40
The group project is worth 40% of the total course mark.
Introduction
The goal of the group project is to work together with peers in a team of 5 students to solve a
computer vision problem and present the solution in both oral and written form.
Group members can meet with their assigned tutors once per week in Weeks 6-10 during the
usual consultation session hours to discuss progress and get feedback.
The group project is to be completed by each group separately. Do not copy ideas or any
materials from other groups. If you use publicly available methods or software for some of the
tasks, these must be properly attributed/referenced.
Note that high marks are given only to groups that develop methods not previously used for
the project task. We do not expect you to develop everything from scratch, but the more you
use existing code (which will be checked), the lower the mark. We do expect you to show
creativity and build on ideas taught in the course or from computer vision literature.
Description
In order for autonomous vehicles to navigate safely and accurately in natural environments,
it is important that they are able to recognise the different types of scenarios and objects they
may encounter along the way. For example, a vehicle may need to proceed more cautiously
when travelling through sand or mud, or when there are many trees around, than when
driving over gravel or asphalt in a clear area, while water must be avoided at all times.
Compared to urban environments, perception in natural environments is more challenging, as
these generally contain highly irregular and unstructured elements.
Project work is in Weeks 6-10 with deliverables due in Week 10.
Deadline for submission is Friday 2 August 2024 18:00:00 AET.
Instructions for online submission will be posted closer to the deadline.
Refer to the separate marking criteria for detailed information on marking.
The first step toward comprehensive scene understanding is to perform fine-grained semantic
segmentation of the images captured by the vehicle’s cameras. That is, to assign a label to
each and every pixel in the images, indicating to which class that pixel belongs.
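For illustration only (this is not part of the required deliverables), the snippet below shows what
such a per-pixel labelling looks like in code: a segmentation method produces a class score for
every pixel, and the label map is the per-pixel argmax. The random score array is merely a
placeholder for the output of an actual method.

```python
import numpy as np

# Illustration only: 'scores' stands in for the per-pixel class scores produced
# by some segmentation method, with shape (num_classes, height, width).
num_classes, height, width = 15, 1512, 2016
scores = np.random.rand(num_classes, height, width).astype(np.float32)

# The predicted label map assigns each pixel the class with the highest score.
label_map = scores.argmax(axis=0)   # shape (height, width), one class index per pixel

print(label_map.shape)              # (1512, 2016)
print(np.unique(label_map))         # class indices present in the prediction
```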
Task
The goal of this group project is to develop and compare different computer vision methods
for semantic segmentation of images from natural environments.
Dataset
The dataset to be used in the group project is called WildScenes (see links and references at
the end of this document). This is a recently released multimodal dataset consisting of five
sequences of 2D images recorded with a normal video camera during traversals through two
forests: Venman National Park and Karawatha Forest Park, Brisbane, Australia. The dataset
also contains 3D point cloud representations of the same scenes recorded using a lidar
scanner, but in this group project we will ignore that part of the dataset and use only the 2D
images. In total, the dataset has 9,306 images of size 2,016 x 1,512 pixels. Each and every one
of the images has been manually annotated.
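For illustration, the following sketch shows how an image and its per-pixel annotation might be
loaded as arrays. The file paths and folder names used here are assumptions and may not match
the actual layout of the WildScenes release; adapt them to your local copy of the dataset.

```python
import numpy as np
from PIL import Image

# Hypothetical paths; the real directory layout and file names may differ.
image_path = "WildScenes/WildScenes2d/V-01/image/0001.png"
label_path = "WildScenes/WildScenes2d/V-01/indexLabel/0001.png"

image = np.array(Image.open(image_path))    # expected shape (1512, 2016, 3), RGB
labels = np.array(Image.open(label_path))   # expected shape (1512, 2016), one class index per pixel

print(image.shape, labels.shape)
print("classes present in this annotation:", np.unique(labels))
```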
Methods
Many traditional, machine learning, and deep learning-based computer vision methods could
be used for this task. You are challenged to use concepts taught in the course and other
techniques from literature to develop your own methods and test their performance. At least
two different methods must be developed and tested.
Although we do not expect you to develop everything from scratch, we do expect to see some
new combination of methods, or modifications of existing methods, or the use of more state-
of-the-art methods that have not been tried before for the given task.
As there are virtually infinitely many possibilities here, it is impossible to give detailed criteria,
but as a general guideline, the more you develop things yourself rather than copy straight
from elsewhere, the better. In any case, always cite your sources.
Training
If your methods require training (that is, if you use supervised rather than unsupervised
segmentation approaches), you can use the same procedure as the creators of the WildScenes
dataset for splitting the dataset into training, validation, and test subsets. In their paper
(see references below) they describe the procedure in detail and provide code for this in their
GitHub repository (see references below). The procedure ensures that the training, validation,
and test subsets have a uniform class distribution.
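If you want to experiment before adopting the official split, the sketch below shows a simple
random split plus a class-distribution check. This is only an illustrative placeholder: the official
WildScenes code (see their GitHub repository) additionally balances the class distributions across
the three subsets.

```python
import numpy as np

def split_dataset(image_ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Simple random split into training, validation, and test subsets.
    Illustrative only: the official WildScenes split code also balances
    the per-class pixel distributions across the subsets."""
    rng = np.random.default_rng(seed)
    ids = np.array(image_ids)
    rng.shuffle(ids)
    n_train = int(train_frac * len(ids))
    n_val = int(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def class_distribution(label_maps, num_classes=15):
    """Fraction of pixels per class; useful for checking that the three
    subsets end up with similar class distributions."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for labels in label_maps:
        counts += np.bincount(np.asarray(labels).ravel(), minlength=num_classes)[:num_classes]
    return counts / counts.sum()
```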
Even if your methods do not require training, they may have hyperparameters that you need
to fine-tune to get optimal performance. In that case, too, you must use the training set, not
the test set, because using (partly) the same data for both training/fine-tuning and testing
leads to biased results that are not representative of actual performance.
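As a sketch of this workflow, the loop below tunes a single hyperparameter of a placeholder
method using only a (randomly generated) validation subset; the test set is not touched until the
final evaluation. The method and scoring function are dummies standing in for your own methods
and for the mean IoU evaluation described under Testing.

```python
import numpy as np

def segment(image, threshold=0.5):
    # Placeholder "method": thresholds the mean intensity into two classes.
    # Replace with one of your own segmentation methods.
    return (image.mean(axis=-1) > threshold * 255).astype(np.int64)

def validation_score(method, images, annotations):
    # Placeholder score: fraction of correctly labelled pixels; in the real
    # project, use mean IoU on the validation subset (see Testing below).
    return float(np.mean([np.mean(method(im) == gt) for im, gt in zip(images, annotations)]))

rng = np.random.default_rng(0)
val_images = [rng.integers(0, 256, (32, 32, 3)) for _ in range(4)]
val_labels = [rng.integers(0, 2, (32, 32)) for _ in range(4)]

best_score, best_threshold = -1.0, None
for threshold in [0.3, 0.5, 0.7]:
    score = validation_score(lambda im: segment(im, threshold=threshold), val_images, val_labels)
    if score > best_score:
        best_score, best_threshold = score, threshold

print("best threshold %.1f (validation score %.3f)" % (best_threshold, best_score))
# Only after the hyperparameters are fixed should the method be run once on the test set.
```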
Testing
To assess the performance of each of your methods, compare the segmented images
quantitatively with the manually annotated (labelled) images by calculating the intersection
over union (IoU), also known as the Jaccard similarity coefficient (JSC), for each class and then
taking the mean over all classes in the whole test set. Notice that although the annotations
contain more classes, only 15 classes are to be used for evaluation (see further details in the
supplementary material of the paper referenced below).
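For reference, a minimal mean IoU computation could look as follows. Whether to accumulate
intersections and unions over the whole test set (as done here) or to average per image, and how
unlabelled or ignored pixels are handled, should follow the evaluation protocol of the WildScenes
paper; treat this as a sketch only.

```python
import numpy as np

def mean_iou(predictions, annotations, num_classes=15):
    """Mean intersection over union (Jaccard index) over all classes.

    'predictions' and 'annotations' are iterables of integer label maps of
    equal shape. Per-class intersections and unions are summed over all
    images first; classes that never occur are excluded from the mean.
    """
    intersection = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for pred, target in zip(predictions, annotations):
        pred = np.asarray(pred).ravel()
        target = np.asarray(target).ravel()
        for c in range(num_classes):
            intersection[c] += np.sum((pred == c) & (target == c))
            union[c] += np.sum((pred == c) | (target == c))
    present = union > 0
    per_class = intersection[present] / union[present]
    return per_class, float(per_class.mean())

# Tiny usage example with random label maps (replace with real method outputs):
rng = np.random.default_rng(0)
preds = [rng.integers(0, 15, (64, 64)) for _ in range(3)]
gts = [rng.integers(0, 15, (64, 64)) for _ in range(3)]
per_class, miou = mean_iou(preds, gts)
print("per-class IoU:", np.round(per_class, 3))
print("mean IoU: %.3f" % miou)
```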
Show these quantitative results in your video presentation and written report (see
deliverables below). Also show representative examples of successful segmentations as well
as examples where your methods failed (no method generally yields perfect results). Give
some explanation why you believe your methods failed in these cases.
Furthermore, discuss whether and why your methods performed better or worse than the
methods already evaluated by the creators of the WildScenes dataset (as reported in the
paper referenced below). And, finally, discuss some potential directions for future research to
further improve the segmentation performance for this dataset.
Practicalities
The WildScenes dataset is about 100 GB in total. However, only the 2D images and annotations
are needed for this project, which amounts to about 50 GB. Still, this may be challenging in
terms of memory usage and computation time if you are planning to use your own laptop or
desktop computer for training and testing. To keep computations manageable, you are free
to use only a subset of the data, for example 50%, 40%, or 30% (again, use a correct splitting
procedure to ensure uniform class distributions). Of course, you can expect the performance
of your methods to go down accordingly, but as long as you clearly report your approach, this
will not negatively impact your project mark.
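One possible (illustrative) way to subsample the images while roughly preserving the class balance
is to group images by their dominant class and sample the same fraction from each group, as
sketched below. A more faithful approach would reuse the per-class balancing from the official
WildScenes split code.

```python
import numpy as np

def subsample_by_dominant_class(image_ids, label_maps, fraction=0.5, seed=0):
    """Keep roughly 'fraction' of the images while approximately preserving
    the class balance, by grouping images on their dominant (most frequent)
    class and sampling the same fraction from each group. Rough illustration
    only; not the official WildScenes procedure."""
    rng = np.random.default_rng(seed)
    dominant = np.array([np.bincount(np.asarray(lbl).ravel()).argmax() for lbl in label_maps])
    keep = []
    for c in np.unique(dominant):
        group = np.flatnonzero(dominant == c)
        n_keep = max(1, int(round(fraction * len(group))))
        keep.extend(rng.choice(group, size=n_keep, replace=False))
    return [image_ids[i] for i in sorted(keep)]

# Usage with random placeholder annotations:
rng = np.random.default_rng(1)
ids = ["img_%04d" % i for i in range(20)]
labels = [rng.integers(0, 15, (32, 32)) for _ in range(20)]
print(subsample_by_dominant_class(ids, labels, fraction=0.5))
```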
Deliverables
The deliverables of the group project are 1) a video presentation, 2) a written report, and 3)
the code. The deliverables are to be submitted by only one member of the group, on behalf
of the whole group (we do not accept submissions from multiple group members). More
detailed information on the deliverables:
Video
Each group must prepare a video presentation of at most 10 minutes showing their work. The
presentation must start with an introduction of the problem and then explain the used
methods, show the obtained results, and discuss these results as well as ideas for future
improvements. For this part of the presentation, use PowerPoint slides to support the
narrative. Following this part, the presentation must include a demonstration of the
methods/software in action. Of course, some methods may take a long time to compute, so
you may record a live demo and then edit it to stay within time.
The entire presentation must be in the form of a video (720p or 1080p MP4 format) of at most
10 minutes (anything beyond that will be ignored). All group members must present (points
may be deducted if this is not the case), but it is up to you to decide who presents which part
(introduction, methods, results, discussion, demonstration). In order for us to verify that all
group members are indeed presenting, each student presenting their part must be visible in a
corner of the presentation (live recording, not a static head shot), and when they start
presenting, they must mention their name.