IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Secure Data Storage and Searching for
Industrial IoT by Integrating Fog Computing and
Cloud Computing
Abstract—With the fast development of industrial Internet
of things (IIoT), a large amount of data is being generated
continuously by different sources. Storing all the raw data
in the IIoT devices locally is unwise considering that the end
devices’ energy and storage spaces are strictly limited. In
addition, the devices are unreliable and vulnerable to many
threats because the networks may be deployed in remote
and unattended areas. In this paper, we discuss the emerg-
ing challenges in the aspects of data processing, secure
data storage, efficient data retrieval and dynamic data collection
in IIoT. Then, we design a flexible and economical
framework to solve the problems above by integrating
fog computing and cloud computing. Based on the time latency
requirements, the collected data are processed and
stored by the edge server or the cloud server. Specifically,
all the raw data are first preprocessed by the edge server
and then the time-sensitive data (e.g., control information)
are used and stored locally. The non-time-sensitive data
(e.g., monitored data) are transmitted to the cloud server
to support data retrieval and mining in the future. A series
of experiments and simulations are conducted to evaluate
the performance of our scheme. The results illustrate that
the proposed framework can greatly improve the efficiency
and security of data storage and retrieval in IIoT.
Index Terms—Cloud computing, fog computing, indus-
trial Internet of things (IIoT), secure data storage and
retrieval.
Fujian University of Technology, Fuzhou 350118, China, with the School
of Mathematics and Computer Science, Wuhan Polytechnic University,
Wuhan 430023, China, with the Department of Electrical Engineering,
National Dong Hwa University, Hualien 974, Taiwan, and also with the
Department of Computer Science and Information Engineering, National
Ilan University, I-Lan 26041, Taiwan (e-mail:
[email protected]).
B. K. Bhargava is with the Department of Computer Science, Purdue
University, West Lafayette, IN 47906 USA (e-mail:
[email protected]).
Color versions of one or more of the figures in this paper are available
I. INTRODUCTION
AS WE step into the Internet of things (IoT) era, terabytes of data with different sources and structures are being produced
worldwide every day. In recent years, IoT has been widely
used in the industrial field [1]–[5] and hence industrial IoT
(IIoT) appears. The generated data of IIoT are of great value
and they can be used to run the networks or extract knowledge
and rules. How to process, store, and manage the data securely
and efficiently is a great challenge. Fortunately, fog computing
and cloud computing provide us with an opportunity to solve these
problems properly. Fog is close to the networks, i.e., the sources
of the data, and it can access the data in a time-efficient manner.
Consequently, the time-limited data should be processed and
stored locally to run the network normally [6], [7]. However,
storing all the data in the edge servers is unwise considering the
low stability and reliability. Moreover, retrieving and mining the
data stored by numerous edge servers in a distributed manner
is impractical. Cloud computing is treated as a promising IT
infrastructure, which can gather and organize huge IT resources
to support on-demand access service in a flexible and economi-
cal manner [8]. Pushed by the data storage requirement of IIoT
and attracted by these excellent features of cloud computing,
an intuitive approach is outsourcing the non-time-sensitive data
to the cloud [9]–[14] while guaranteeing both the security and
searchability of the data. Note that, though quite a large portion
of the data is stored in the cloud, the whole system needs to
employ the edge server as a fundamental tool. In fact, cloud
computing and edge computing are interdependent, and
together they form a service continuum between the
cloud and the end devices of IIoT [15]–[17].
In this paper, we design a data processing framework for IIoT
by integrating the functions of data preprocessing, storage and
retrieval based on both the fog computing and cloud computing.
The overall data processing system of IIoT consists of five main
entities as shown in Fig. 1: IIoT, Edge server, Proxy server,
Cloud server and Data users. The black arrows in the left half
of the figure represent the process of data collection, processing, and
outsourcing. The red arrows, mainly in the right half of the figure,
represent the process of secure data query. The IIoT continuously
collects data from physical environments and then sends the data
to the edge server. The time-sensitive data are first extracted and
processed by the edge server, and the data are then dropped
if they will not be used in the future. However, some archived
data need to be preprocessed and uploaded to the cloud server
for storage and retrieval. The proxy server is responsible for
improving the quality of the data generated by a set of networks and
making the data suitable for being stored in the cloud server.
Moreover, the data need to be encrypted by the proxy server
while maintaining both security and searchability, which
will be discussed in Section IV-C. When an authorized data
user wants to obtain some specific historical data, the user only needs
to build a trapdoor with the help of the proxy server and then
send the trapdoor to the cloud server. Based on the trapdoor, the
cloud server searches the encrypted index structures with a search
engine to get the result and sends the encrypted data to the data
user. Finally, the data user decrypts the search result to get the
plaintext data.
4520 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 14, NO. 10, OCTOBER 2018
Fig. 1. System of data process, storage, and retrieval in IIoT.
To store the data in a secure, searchable, and dynamic manner,
there are many challenges and considerations in the process of
designing our framework and they are presented as follows.
1) In general, the end devices of IIoT are redundantly de-
ployed and they may collect redundant, heterogeneous,
dynamic, one-sided and inaccurate data [18]. As a con-
sequence, the raw data are not suitable for being stored
and retrieved.
2) The ultimate goal of storing the data in the cloud is reusing
them in the future and hence searching a specific set of
data efficiently and accurately is an essential requirement
for the data users.
3) For security, we need to protect the data’s confidentiality
without noticeably reducing usability. Specifically,
privacy-preserving schemes should be designed and em-
ployed in the data storage system.
4) Considering that the data are dynamically collected and
some new data may be generated in the future, the data
in the cloud need to be organized dynamically and mean-
while the index structure also needs to support dynamic
update.
In this paper, we discuss the emerging challenges and basic so-
lutions to move the IIoT forward. Specifically, we design a data
processing framework for IIoT and attempt to solve the above
problems properly. The rest of this paper is organized as follows:
We first summarize the related work of our scheme in Section II
and present the network and system models in Section III.
We discuss the considerations when refining the raw data and
design an object-oriented data index structure based on retrieval
features (RF) and a privacy-preserving data retrieval scheme
based on the secure kNN algorithm [19] in Section IV. Moreover, a
dynamic data collection method is also provided. The security
of our framework is analyzed in Section V. Performance evalu-
ation by both experiments and simulation is given in Section VI.
Finally, we conclude this paper in Section VII.
II. RELATED WORK
In this section, we mainly introduce the existing schemes of
data storage for IoT. Recently, with the emergence of IoT and
cloud computing, many IoT data storage schemes have been designed
based on cloud computing in the literature. Jiang et al.
proposed a data storage framework [12] enabling efficient
storage of both structured and unstructured data. The frame-
work combines and extends multiple existing databases such
as Hadoop, NoSQL database and relational database to store
and manage diverse types of IoT data. The data users can ac-
cess the stored data through the interfaces provided by the cloud
server. However, a disadvantage of this framework is the long la-
tency which is an inherent property of cloud-based data storage
schemes. A framework including frontend layer, middle layer,
and backend layer [9] is proposed to seamlessly integrate IoT
data storage schemes into existing enterprise information systems.
This method can be easily accepted by the data owners consid-
ering that existing information systems are mature. To store a
huge amount of heterogeneous data, a hybrid approach [13] is
proposed to optimize data storage and retrieval which couples
the document and object-oriented strategies. Moreover, some
implementation details are also discussed. Kim et al. [14] de-
signed a polynomial-time algorithm for efficiently downloading
the packages from the cloud to the IoT devices. This approach
can compute the amount of power allocation based on buffer
backlog and the state of communication links to improve the
overall performance.
Besides cloud computing, the fog computing technology
is also employed to store and share the data in IoT. To support
the latency sensitive data processing and storage, an efficient
data sharing scheme [6] that allows smart devices to share
the data with others at the edge of the IoT is proposed. In
addition, the data users can search and retrieve data of interest
by keywords and their secret keys. Simulation results
demonstrate that the proposed scheme has the potential to be
effectively used in the IoT. However, the size of the network is
strictly limited in the scheme and in addition it is impractical to
store a large amount of data for further processing and mining
considering the efficiency and security problems.
FU et al.: SECURE DATA STORAGE AND SEARCHING FOR INDUSTRIAL IOT BY INTEGRATING FOG COMPUTING AND CLOUD COMPUTING 4521
Similar to
our framework proposed in this paper, some other schemes also
attempt to combine the fog computing and cloud computing to
improve the quality of service in terms of latency, security, and
flexibility. Sharma et al. [20] discussed the advantages of cloud
computing and edge computing, respectively. In summary, the
cloud computing can construct a shared pool of computing
and storage resources and the edge computing can process the
data in real time. By combining these two techniques, the pro-
posed framework can obtain the network-wide knowledge by
exploiting the historical information stored in the cloud center
and the knowledge can be used to guide the edge computing to
satisfy various performance requirements of heterogeneous IoT
networks. An attribute-based encryption scheme is proposed in
[21] to make full use of edge servers. The collected data are first
encrypted by the edge server before being outsourced to the
cloud server. Experimental results illustrate that the edge servers
bear a large portion of the workload. However, this scheme does
not support efficient data search and hence the functionalities
are limited. Choi et al. [22] drew lessons on designing an
operating system from the long history of operating systems
and designed a distributed operating system specifically for the
IoT, i.e., FogOS, which can manage both the cloud resources
and fog resources. In addition, FogOS is also a platform
for incentivizing and connecting individually owned IoT
devices.
III. NETWORK AND THREAT MODELS
We assume that a large number of IIoT terminal nodes are
randomly deployed in an area of interest to monitor the surrounding
physical environment such as the work status of machines
or the gas density in a factory. Each node consists of a power
module, a perceptive module, a data processing module, a communication
module, etc. We further assume that each pair of sensor
nodes can negotiate a common session key to securely commu-
nicate with each other. The nodes in the network can transmit
the data to the edge server in a relay manner by employing
proper routing algorithms. The edge server is assumed to be
stronger than the common IIoT nodes in both power and com-
puting capability. We further assume that the edge server can
preprocess the raw data of IIoT efficiently and execute compli-
cated instructions to run the network properly. The edge server
is connected to the proxy server and cloud server by wired or
wireless links. The edge server and proxy server are assumed to
be honest to the IIoT. This is reasonable considering that they
are closer to the data sources in IIoT and they can be controlled
by the network operators in general. For example, we may de-
ploy an IIoT to monitor the status of industrial machines where
an edge server is employed to run the network. Apparently, the
edge server is deployed locally and it is totally controlled by
the industrial factory. However, the cloud server is public and
it is assumed to be “honest-but-curious” which is similar to the
models in [23] and [24]. Specifically, the cloud server executes
the instructions honestly but is curious to infer
and analyze all the data received from the proxy server and data
users.
IV. FRAMEWORK OF SECURE DATA STORAGE AND
SEARCHING FOR IIOT
A. Data Integration and Fusion
Data integration and fusion is the most important foundation
of the whole framework and it is briefly discussed as follows. In
IIoT, the data are collected from multiple sources such as radio
frequency identification (RFID), GPS devices and smart meters,
and the data carriers can be messages, pictures, videos, numer-
ical data, etc. Even for one type of the data carriers, such as the
numerical data, the specific data models are various [25], includ-
ing probability model, fuzzy set model, possibility model, rough
set model, D-S evidence theory model, etc. Though integrating
massive structured, semistructured, and unstructured data into a
unified framework is a huge challenge, it is meaningful to merge
the data and create a comprehensive and meaningful view for
future utility [26]. Specifically, the data need to be first trans-
formed to a unified resource description framework and then
fused to eliminate the redundant data.
A novel concept called Information Object is proposed in
[27] to model the data coming from several sources and transfer
them to a unified structure for storage and mining. Further, an
event information management platform is designed to collect
and analyze heterogeneous data streams. In [28], a resource
description framework called heterogeneous event processing
(HEP) is proposed in which the representations of relational
and XML event streams are integrated. To decrease the storage
space in the cloud server and communication burdens of the
network, the data need to be fused at different levels according
to the requirements of the data users. An elaborate survey of
data fusion techniques at numerical level is presented in [25].
Moreover, Thing Broker [29] is designed to integrate totally
different IoT objects by employing abstracts to represent the
objects, while maintaining simple and flexible interfaces for
various applications.
The low quality of the data is another challenge, and some
false and missing values often appear because the end devices
are often unstable and unreliable. To guarantee the completeness
of the dataset, the imperfect data should be eliminated by the
outlier detection technique. In addition, the missing data values
should be predicted and filled in by the edge server. A
missing data prediction scheme for IoT is achieved in [30] by
implementing the least mean square dual prediction algorithm,
where the optimal step size is obtained by minimizing the mean-square
deviation.
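As a rough illustration of predicting missing readings with a least mean square (LMS) filter, the sketch below fills `None` entries in a series with one-step-ahead predictions. It is a plain textbook LMS predictor, not the dual prediction scheme of [30]; the function name `lms_fill`, the filter order, and the fixed step size `mu` are assumptions made for this example, and readings are assumed to be scaled to roughly [0, 1] so that the fixed step size stays stable.

```python
def lms_fill(readings, order=3, mu=0.1):
    """Fill missing (None) readings with LMS one-step-ahead predictions.

    Hypothetical helper: `order` past values form the filter input and
    `mu` is a fixed step size (not the optimized step size of [30]).
    """
    weights = [0.0] * order
    filled = []
    for value in readings:
        if len(filled) < order:
            # Warm-up: too little history to predict, so repeat the last value.
            if value is None:
                if not filled:
                    raise ValueError("series cannot start with a missing value")
                value = filled[-1]
            filled.append(value)
            continue
        window = filled[-order:]
        prediction = sum(w * x for w, x in zip(weights, window))
        if value is None:
            # Missing reading: use the prediction and leave the weights alone.
            filled.append(prediction)
        else:
            # Observed reading: standard LMS update w <- w + mu * e * x.
            error = value - prediction
            weights = [w + mu * error * x for w, x in zip(weights, window)]
            filled.append(value)
    return filled


# A constant signal with one dropped sample.
readings = [1.0] * 50 + [None] + [1.0] * 5
filled = lms_fill(readings)
```

In this toy series the filter has converged by the time the gap appears, so the filled value is very close to 1.0. Real IIoT signals would need the step size and order tuned to the data.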
B. Object-Oriented Data Organization and Retrieval
After getting the high-quality data, an effective index struc-
ture needs to be built in order to improve the search efficiency. In
this paper, the data are organized around the monitored objects
of the IIoT. For example, a set of smart devices may be deployed
in an industrial machine to monitor its work status and thus all
the generated data about the machine share the same identifier
which is related to the monitored machine. This is reasonable
considering that the data fragments are meaningless unless
they are collected together to describe the machine. To
identify the monitored objects, an easy method is assigning se-
rial numbers to them and we can also add some basic description
information into the identifiers, such as the categories. How to
encrypt the object identifiers and search for the data of interest based
on the encrypted identifiers will be discussed in Section IV-C,
where an Identifier-Adelson-Velsky-Landis tree is built to im-
prove the search efficiency.
Besides searching the data by identifiers, the data users may
also want to search the data of the monitored objects by some
features. For example, the data users may query the monitored
data about the buildings around Purdue University which have
been used for about ten years. We assume that the top-k relevant
buildings’ data are needed. To support the feature-based data
query, we can describe each monitored object $O_i$ by an
$n$-dimensional attribute vector $AV_i$, which is another identifier of
the stored data. An $n$-dimensional feature dictionary $A$ with $n$
important features is employed to regulate the feature vectors.
For object $O_i$, $AV_i[j]$ is the value on feature $A[j]$ and the default
value is set as 0. Similarly, a query request is mapped to a query
vector Q. An open issue is how to transform the feature values
to numerical values and define the relevance scores between the
feature vectors and query vectors. In this paper, we assume that
the value of each feature ranges from 0 to 1 and the relevance
score between a query vector and a feature vector is defined as
the inner product of the two vectors.
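The inner-product relevance score and the top-k selection can be sketched as follows. This is a plain, unencrypted illustration only; the object identifiers, the three-feature vectors, and the helper names `relevance` and `top_k` are hypothetical and do not come from the paper.

```python
import heapq


def relevance(query, feature_vector):
    """Relevance score: inner product of a query vector and a feature vector."""
    if len(query) != len(feature_vector):
        raise ValueError("vectors must share the same dimension n")
    return sum(q * a for q, a in zip(query, feature_vector))


def top_k(query, objects, k):
    """Return the k (object_id, score) pairs with the highest relevance."""
    scored = ((oid, relevance(query, av)) for oid, av in objects.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical attribute vectors over a 3-feature dictionary, values in [0, 1].
objects = {
    "O1": [0.5, 0.25, 0.0],
    "O2": [1.0, 0.0, 0.5],
    "O3": [0.0, 0.75, 0.5],
}
Q = [1.0, 0.0, 0.5]
result = top_k(Q, objects, 2)  # [("O2", 1.25), ("O1", 0.5)]
```

A linear scan like this costs one inner product per stored object, which is the cost an index structure is meant to reduce.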
To improve the search efficiency of feature-based data query,
we design an RF tree to organize the objects’ feature vectors
as hierarchical clusters. Given $K$ $n$-dimensional feature vectors
$AV_1, AV_2, \ldots, AV_K$ in a cluster, the RF vector of the cluster
is defined as a quadruple $(K, LS, SS, V_{\max})$, where $K$ is the
number of feature vectors in the cluster, $LS$ is the linear sum of
the $K$ vectors, $SS$ is the square sum of the $K$ vectors, and $V_{\max}$ is
defined as $V_{\max}[j] = \max(AV_1[j], AV_2[j], \ldots, AV_K[j])$, where
$AV_i[j]$ is the $j$th dimension of vector $AV_i$. Based on an RF vector,
the center and radius of the cluster can be easily calculated as
discussed in [31]. Specifically, the center of the cluster can be
calculated as follows:
$$c = LS/K. \tag{1}$$
The radius of the cluster is defined as follows:
$$R = \sqrt{\sum_{j=1}^{K} (AV_j - c)^2 / K} \tag{2}$$
and it can be calculated based on the RF vector as follows:
$$R = \sqrt{\left(SS - LS^2/K\right)/K}. \tag{3}$$
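The step from (2) to (3) is an expansion of the square. The derivation below reads $(AV_j - c)^2$ as a squared Euclidean norm, $SS$ as the sum of squared norms of the $K$ vectors, and $LS^2$ as $\lVert LS \rVert^2$; this reading is an interpretation, since the text does not spell it out.

```latex
\begin{aligned}
\sum_{j=1}^{K} \lVert AV_j - c \rVert^2
  &= \sum_{j=1}^{K} \lVert AV_j \rVert^2
     - 2\, c \cdot \sum_{j=1}^{K} AV_j
     + K \lVert c \rVert^2 \\
  &= SS - 2\, \frac{LS}{K} \cdot LS + K\, \frac{\lVert LS \rVert^2}{K^2} \\
  &= SS - \frac{\lVert LS \rVert^2}{K}.
\end{aligned}
```

Dividing by $K$ and taking the square root gives (3), so the radius can be obtained from the RF vector without revisiting the individual feature vectors.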
As a consequence, the RF vector is an important summariza-
tion about the cluster.
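A minimal sketch of the RF quadruple and of Eqs. (1)-(3) is given below. It assumes that $SS$ is a scalar (the sum of squared norms) and that $LS^2$ means the squared norm of $LS$; the class and function names are illustrative, not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class RF:
    K: int        # number of feature vectors in the cluster
    LS: list      # linear sum of the vectors, per dimension
    SS: float     # square sum (assumption: scalar sum of squared norms)
    Vmax: list    # per-dimension maximum over the cluster


def build_rf(vectors):
    """Summarize a non-empty list of equal-length vectors as an RF quadruple."""
    n = len(vectors[0])
    LS = [sum(v[j] for v in vectors) for j in range(n)]
    SS = sum(x * x for v in vectors for x in v)
    Vmax = [max(v[j] for v in vectors) for j in range(n)]
    return RF(len(vectors), LS, SS, Vmax)


def center(rf):
    """Eq. (1): c = LS / K."""
    return [s / rf.K for s in rf.LS]


def radius(rf):
    """Eq. (3): R = sqrt((SS - |LS|^2 / K) / K), clamped against rounding error."""
    ls_sq = sum(s * s for s in rf.LS)
    return math.sqrt(max(rf.SS - ls_sq / rf.K, 0.0) / rf.K)


def radius_direct(vectors):
    """Eq. (2): root mean squared distance of the vectors from the center."""
    c = center(build_rf(vectors))
    return math.sqrt(sum(math.dist(v, c) ** 2 for v in vectors) / len(vectors))


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rf = build_rf(vectors)
```

For these four corner points, `radius(rf)` and `radius_direct(vectors)` agree up to floating-point rounding, which is the point of keeping only the summary.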
An RF tree is presented in Fig. 2. It can be observed that the
tree is height balanced, which follows from
the construction process of the tree presented
below. The structure of the tree is mainly controlled by
three parameters: branching factors $B_1$, $B_2$ and a threshold $T$.
A node that represents a macrocluster is called a nonleaf node,
and a node that represents a microcluster is called a leaf node.
Each nonleaf node $NL_i$ contains at most $B_1$ child nodes and it
is denoted as $[RF, RF_1, \mathit{child}_1, \ldots, RF_{B_1}, \mathit{child}_{B_1}]$, where $RF$
is the RF vector of the whole cluster, $RF_i$ is the RF vector of the
$i$th subcluster and $\mathit{child}_i$ is the pointer to the $i$th subcluster. A leaf
node $L_i$ contains at most $B_2$ feature vectors and it is defined
as $[RF, \mathit{child}_1, \ldots, \mathit{child}_{B_2}]$, where $RF$ is the RF vector of
the cluster and $\mathit{child}_i$ is the pointer to the $i$th feature vector in the
cluster. Further, all the feature vectors in a leaf node must satisfy
a threshold requirement, i.e., the radius of a microcluster has to
be less than $T$.
Fig. 2. Structure of an RF tree.
The RF tree is constructed in an incremental manner and the
process of inserting a feature vector $AV_i$ into the RF tree is
presented as follows.
1) Identifying the leaf node: starting from the root node, $AV_i$
recursively descends the RF tree by choosing the closest
child node according to the Euclidean distances between
$AV_i$ and the centers of the clusters.
2) Modifying the leaf node: when $AV_i$ reaches a leaf node
$L_i$, we test whether node $L_i$ can “absorb” $AV_i$ without
violating the constraints of $B_2$ and $T$. If so, $AV_i$ is inserted
into $L_i$ and the RF vector of $L_i$ is updated to reflect this.
If not, we must split $L_i$ into two new leaf nodes. Node
splitting is done by choosing the farthest pair of feature
vectors as seeds, and redistributing the remaining vectors
based on the closest criteria. The RF vectors
of the two new leaf nodes need to be recalculated in this
case.
3) Modifying the path from the root node to the leaf node:
after inserting $AV_i$ into a leaf node, we need to update
the RF vectors of all the nodes on the path from the root
node to the leaf node $L_i$. In the absence of a split, this
simply involves updating RF vectors in order from the
leaf node to the root based on [31, Th. 4.1]. A leaf split
requires us to insert a new leaf node into the parent node. If
the parent node has space for the new leaf, we just need
to insert the new leaf node and then update the RF vector
of the parent node. In general, however, we may have to
split the parent node as well, and so on up to the root. If the
root is split, the tree height increases by 1.
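The three steps above can be sketched at the level of a single leaf. The code below is a simplified illustration, not the paper's implementation: RF vectors are plain tuples, $SS$ is assumed to be a scalar, the split is assumed to include the newly arrived vector, $T$ is assumed positive, and propagating splits to parent nodes is left out. All function names are hypothetical.

```python
import math
from itertools import combinations


def rf_of(vectors):
    """RF summary (K, LS, SS, Vmax) of a non-empty list of feature vectors."""
    n = len(vectors[0])
    LS = [sum(v[j] for v in vectors) for j in range(n)]
    SS = sum(x * x for v in vectors for x in v)  # assumption: scalar square sum
    Vmax = [max(v[j] for v in vectors) for j in range(n)]
    return len(vectors), LS, SS, Vmax


def rf_absorb(rf, av):
    """Step 3 without a split: fold one new vector into an existing RF summary."""
    K, LS, SS, Vmax = rf
    return (K + 1,
            [s + x for s, x in zip(LS, av)],
            SS + sum(x * x for x in av),
            [max(m, x) for m, x in zip(Vmax, av)])


def center(rf):
    K, LS, _, _ = rf
    return [s / K for s in LS]


def radius(rf):
    K, LS, SS, _ = rf
    return math.sqrt(max(SS - sum(s * s for s in LS) / K, 0.0) / K)


def closest_child(child_rfs, av):
    """Step 1: index of the child whose cluster center is nearest to av."""
    return min(range(len(child_rfs)),
               key=lambda i: math.dist(center(child_rfs[i]), av))


def split_leaf(vectors):
    """Seed two leaves with the farthest pair; send each other vector to the closer seed."""
    a, b = max(combinations(range(len(vectors)), 2),
               key=lambda p: math.dist(vectors[p[0]], vectors[p[1]]))
    left, right = [vectors[a]], [vectors[b]]
    for i, v in enumerate(vectors):
        if i not in (a, b):
            closer_to_a = math.dist(v, vectors[a]) <= math.dist(v, vectors[b])
            (left if closer_to_a else right).append(v)
    return left, right


def insert_into_leaf(leaf, av, B2, T):
    """Step 2: absorb av if the leaf keeps at most B2 vectors and radius < T, else split.

    Returns [leaf] after absorption or [left, right] after a split.
    """
    candidate = leaf + [av]
    if len(candidate) <= B2 and radius(rf_of(candidate)) < T:
        return [candidate]
    return list(split_leaf(candidate))


leaf = [[0.1, 0.1], [0.2, 0.1]]
absorbed = insert_into_leaf(leaf, [0.15, 0.12], B2=4, T=0.5)  # one leaf of three vectors
split = insert_into_leaf(leaf, [0.9, 0.9], B2=4, T=0.2)       # two leaves after a split
```

Because $K$, $LS$ and $SS$ are sums and $V_{\max}$ is a per-dimension maximum, `rf_absorb` updates a summary in time proportional to the dimension, without touching the vectors already stored. A full tree would call `closest_child` at every nonleaf level and handle parent splits when a node exceeds $B_1$ children.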
In the RF tree, all the feature vectors of the objects are orga-
nized based on their relative similarities. It is of high probability
that two similar vectors are assigned to the same cluster and this
property can greatly improve the search efficiency. For a query
vector Q provided by a data user, a parallel data search process
can be easily executed in the cloud server to get the top-k relevant
objects. Assume that there are $l$ processors $\{p_1, p_2, \ldots, p_l\}$