FIT5202 - Data processing for Big Data
Assignment 2B: Using real-time streaming data to predict potential
customers
Worth: 10% of the final marks
Background
MonPG provides loan services to its customers and is interested in selling more of its
Top-up loan services to its existing customers. They have hired us as Analytics Engineers to
develop a model that identifies potential customers who may take up Top-up services in
the future. In addition, they want us to help them integrate the machine learning models into
their streaming platform, using Apache Kafka and Apache Spark Streaming to handle real-time
data from the company and recommend these services.
In the Part A assignment, we only processed the static data and built the machine learning model.
In this Part B, we need to create proof-of-concept streaming applications to
demonstrate the integration of the machine learning models, Kafka and Spark Streaming,
and create a visualization to provide some decision support.
File Structure
The files required for this assignment are available on Moodle under the Assessment 2B section.
The description of the files is summarized in the table below:
What you need to achieve
MonPG requires a proof-of-concept application to ingest the new data and predict
new top-up customers. To achieve this, you need to simulate the streaming data production
using Kafka, and then build a streaming application that ingests the data and integrates the
machine learning model (provided to you) to monitor the number of possible top-up
services in real time.
A compulsory interview will also be arranged in Week 12, after the submission, to
discuss your proof-of-concept application.
Architecture
The overall architecture of the assignment setup is represented by the following figure.
Fig 1: Overall architecture for assignment 2 (part B components updated)

In Part B of Assignment 2, you have three main tasks: producing the streaming data, processing
the streaming data, and visualizing the data.
1. In Task 1, to produce the streaming data for both data files, you can use the csv module,
the pandas library, or other libraries to read and publish the data to the Kafka stream.
2. In Task 2, the streaming application needs to use Spark Structured Streaming
together with PySpark ML / DataFrames to process the data streams (see the sketch after this list).
3. In Task 3, you can use the csv module, the pandas library, or other libraries to read
the data from the Kafka stream and visualize it.
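As a rough, non-authoritative sketch of how Task 2 could start, the snippet below reads a Kafka stream into a Spark Structured Streaming DataFrame. It assumes a broker at localhost:9092, a hypothetical topic name 'customer', hypothetical column names, and that the spark-sql-kafka connector package matching your Spark build is available; the real topic names, schema, and sink come from the assignment files.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('Assignment-2B-Task2').getOrCreate()

# Subscribe to a topic produced in Task 1 (the topic name here is an assumption)
raw = (spark.readStream
       .format('kafka')
       .option('kafka.bootstrap.servers', 'localhost:9092')
       .option('subscribe', 'customer')
       .load())

# Kafka delivers the payload as bytes; cast it to a string and parse the JSON into columns
schema = StructType([
    StructField('ID', StringType()),    # hypothetical column names
    StructField('ts', StringType()),
    # ... remaining columns from the data file
])
parsed = (raw.selectExpr('CAST(value AS STRING) AS json')
          .select(F.from_json('json', schema).alias('data'))
          .select('data.*'))

# The provided model (e.g. a loaded PipelineModel) can then score each micro-batch:
# predictions = pipeline_model.transform(parsed)

query = (parsed.writeStream             # console sink used only for illustration
         .format('console')
         .outputMode('append')
         .start())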
Please follow the steps to document the processes and write the code in Jupyter Notebooks.
Getting Started
● Download the data and models from Moodle.
● Create an Assignment-2B-Task1_producer.ipynb file for data production
● Create an Assignment-2B-Task2_spark_streaming.ipynb file for consuming and
processing data using Spark Structured Streaming
● Create an Assignment-2B-Task3_consumer.ipynb file for consuming the count
data using Kafka
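For orientation only, a minimal sketch of a Task 3-style consumer is shown below. It assumes the kafka-python client, a broker at localhost:9092, a hypothetical topic 'topup_counts' written by the Task 2 application, and JSON-encoded records with hypothetical 'ts' and 'count' fields; your actual topic and message layout depend on what your Task 2 notebook writes back to Kafka.

import json

import matplotlib.pyplot as plt
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    'topup_counts',                          # hypothetical topic written by Task 2
    bootstrap_servers='localhost:9092',      # assumed broker address
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),
)

timestamps, counts = [], []
plt.ion()                                    # interactive mode so the plot can refresh
fig, ax = plt.subplots()

for message in consumer:
    record = message.value                   # e.g. {'ts': ..., 'count': ...} (assumed layout)
    timestamps.append(record['ts'])
    counts.append(int(record['count']))

    ax.clear()                               # redraw the running count of predicted top-ups
    ax.plot(timestamps, counts)
    ax.set_xlabel('timestamp')
    ax.set_ylabel('predicted top-up customers')
    fig.canvas.draw()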
1. Producing the data (10%)
In this task, we will implement an Apache Kafka producer to simulate real-time data
transfer from one repository to another.
Important:
- Do not use Spark in this task
- In this part, all columns should be string type
Your program should send a random number (10~30, including 10 and 30) of client records
every 5 seconds to the Kafka stream, in 2 different topics based on their origin files.
- For example, if the customer IDs in the first random batch are 1, 2, and 3, you should also
send their bureau data to the bureau topic.
- For every batch of data, you need to add a new column 'ts' containing the current timestamp.
The data in the same batch should have the same timestamp.
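To illustrate the mechanics only (not the required solution), a minimal producer sketch is shown below. It assumes the kafka-python client, a broker at localhost:9092, hypothetical file names customer.csv and bureau.csv, a hypothetical key column 'ID', and a hypothetical 'customer' topic alongside the bureau topic mentioned above; adjust these to the actual files and topic names you are given. Note that csv.DictReader already keeps every value as a string, which matches the requirement above.

import csv
import json
import random
import time
from datetime import datetime

from kafka import KafkaProducer  # kafka-python client


def read_rows(path):
    # Read a CSV file into a list of dicts; all values stay as strings.
    with open(path) as f:
        return list(csv.DictReader(f))


producer = KafkaProducer(
    bootstrap_servers='localhost:9092',          # assumed broker address
    value_serializer=lambda d: json.dumps(d).encode('utf-8'),
)

customers = read_rows('customer.csv')            # hypothetical file name
bureau = read_rows('bureau.csv')                 # hypothetical file name

while customers:
    batch_size = random.randint(10, 30)          # 10~30 records, inclusive
    batch, customers = customers[:batch_size], customers[batch_size:]
    ts = str(int(datetime.now().timestamp()))    # one timestamp shared by the whole batch

    batch_ids = {row['ID'] for row in batch}     # 'ID' is an assumed key column
    for row in batch:
        row['ts'] = ts
        producer.send('customer', value=row)     # hypothetical topic name
    for row in bureau:                           # matching bureau records go to their own topic
        if row['ID'] in batch_ids:
            row['ts'] = ts
            producer.send('bureau', value=row)

    producer.flush()
    time.sleep(5)                                # one batch every 5 seconds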