Data Warehousing and Integration
Assessment
This assignment aims for you to apply your knowledge of XML, Json, data warehousing and integration.
You complete the assignment in a group of 2 students whom you choose yourself. If you decide to do it individually, you must complete all parts. Please complete the people.txt file by adding the usernames, names, and the responsible parts of group members before you start the work. You can backup your work by sending the report periodically to an email run by a third-party company like outlook or gmail.
If you do the assignment in a group, only one of you submits.
The assignment is reasonably open. As a result, justification in decision points is important. Make sure that your writing is logical and cohesive so that your answers are not misunderstood.
Assignment Tasks
Task 1: Retrieve data. (1 mark)
Retrieve all publications of a major type and a minor type as defined below from dblp.xml:
Major types:
A: article
B: inproceedings
Minor types:
1: mastersthesis
2: phdthesis
3: www
4: proceedings
5: book
6: incollection
From the major type choose A if the last digit of your student ID (like 110123123) is an even number (including 0) and B if the last digit is an odd number.
From the minor type, choose n = (d % 6)+1 where d is the last digit of your student id and n is the number against the type you should choose. For example, if the last digit is 8, n=(8%6)+1=2+1=3, the student should use book as his/her minor type.
Your code to do this should be in extract.xql.
The retrieved data should be in mydblp.xml.
Task 2: Design a star schema for publication analyses (8 marks)
a) Publication data can be analyzed for different purposes and in different ways to quantitively measure the amount of research outputs of organizations and/or individuals, journal popularity and impact, hot research directions etc.
Design a star schema for this purpose. The analysis subject and its metric are given. The subject is publication performance, and the metric for the subject is the number of publications. You design dimension tables for the subject based on your mydblp.xml.
Describe the fact table and dimension tables in the following format.
facttable(SK1,Sk2,…,Skn [, kpi])
constraints like keys and foreign keys
dim1(SK1, … )
constraints
…
dimn (SKn,…)
constraints
In the above schema, ‘SK’ means surrogate key, ‘…’ means more, kpi means a metric of the number of publications. [,kpi] means kip is optional. When kpi is not appearing, you must use an aggregate function to derive a metric, and this definition must be listed between the fact table schema and its constraints. Please replace the place holders for table names and attributes with your own.
b) For each dimension table, describe how the dimension is helpful in an analysis.
Here are some notes about dimension tables: If a dimension table does not support any analysis, it is not needed. Time must be represented in one of the dimensions. A dimension should model an entity type and an entity type should be modelled in one dimension table only. For example, you should put the author information and journal information in one dimension.
c) Create your fact table and dimension tables in an oracle database with all keys and foreign keys defined. Please note that your table and attribute names mut be consistent in the design and in the implementation. Add table creation statements to the report.
Display the schema of each table by ‘select * from tablename;’, take a screenshot of only the schema line as below and add it to the report after the table name as shown in the following.