Investigation And Management Of Correlated Failures In Cloud Computing

Literature Review

Today the world of computation is moving towards a pay-as-per-use model because of the numerous benefits this model provides. Cloud computing services are therefore predicted to be the best option for the future of computing. Cloud computing is often compared with earlier, related services and underlying technologies, viz. utility computing, services provided via the internet using web browsers, Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), grid computing, and data centres. The advances in cloud computing stem mainly from recent advances in the Internet backbone and from high-performance, scalable infrastructure in web technologies and data centres.


The consumer pays only for the services consumed, based on metering of the services actually used (McAfee, 2011). Accordingly, this emergent technology is anticipated to become the fifth utility service available to consumers and the dominant computing delivery platform of the future (Daylami, 2015). In turn, the cloud computing service model is rapidly transforming the way organizations conduct business around the world, and it is quickly becoming the most utilized technology of the computing era, with wide organizational implications (Knoblauch, 2013).
According to Javadi et al. (2016), a failure is an event in which the system fails to operate according to its specifications. A system failure occurs when a system deviates from fulfilling the normal function for which it was intended. According to Google, each repair of a failure costs $100 for the technician's time plus 10% of the total cost of the server ($200), which comes to $300 per repair. The cumulative cost of repairing the hardware therefore exceeds its purchase cost after only 7 repairs. Sound knowledge of the types and causes of failure will help computer scientists and engineers to design more scalable algorithms and to deploy infrastructure in a more fault-tolerant way. This will help to reduce repair and replacement costs and engineering expenditure, and make computing, specifically service computing such as cloud computing, more reliable. This project will focus on failures, and in particular correlated failures, in cloud computing.
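The repair-cost figures quoted from Google can be checked with a few lines of arithmetic. The sketch below is only an illustration: the $2,000 server price is not stated in the text but is inferred from "10% of the total cost of server ($200)", and the function and parameter names are hypothetical.

```python
def repairs_until_cost_exceeds_purchase(server_cost, technician_cost=100, parts_percent=10):
    """Return (cost per repair, number of repairs whose total cost first exceeds the server price).

    All amounts are whole dollars. server_cost is an assumed figure, not a quoted one.
    """
    cost_per_repair = technician_cost + server_cost * parts_percent // 100
    # Smallest n such that n * cost_per_repair > server_cost.
    repairs_needed = server_cost // cost_per_repair + 1
    return cost_per_repair, repairs_needed


# Assumed server price of $2,000, inferred from the 10% = $200 figure.
cost_per_repair, repairs_needed = repairs_until_cost_exceeds_purchase(2000)
print(cost_per_repair, repairs_needed)
```

Under this assumption, six repairs cost $1,800, which is still below the purchase price, and the seventh brings the total to $2,100.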

Cloud computing is a potential breakthrough in remote storage and a revolution in Information and Communication Technology, providing a flexible and powerful computing environment through virtualization technology. By the end of 2016, cloud computing had grown into a market of more than $155 billion, driven by the growth of public cloud services. Although this virtualization technology has earned a solid reputation and trust through its dynamic, easy, reliable and fast features, it faces numerous challenges, primarily because of its complex architecture and large-scale operations. Given the scale and complexity of cloud data centers, the two most significant challenges are energy efficiency and reliability. Cloud computing system reliability is defined in the context of resource, security or service failures. Failures are inevitable because of the complexity of the cloud architecture. In a system with 100,000 processors, a failure occurs every couple of minutes. Failures typically arise from software faults, hardware faults and similar causes. When a service fails, the cost to both customers and providers is significant. A survey of data centers by P. Institute (2016) reports that the average cost of downtime experienced by each data center rose from roughly $500,000 in 2010 to $740,357, an increase of 38%. Businesses lose approximately $108,000 for every hour of downtime, and Information Technology outages result in more than $26.5 billion in lost revenue each year. Provisioning cloud resources accurately according to application demand plays a crucial role in making a cloud computing system energy efficient and reliable, because it minimizes errors. However, resource requirements are hard to predict accurately in cloud computing, whether before or during task or application submission. Provisioned resources are therefore usually either over-utilized or under-utilized. Average resource utilization in cloud-based data centers usually fluctuates between 6% and 12%.

Reliability and energy efficiency are the two key challenges for present cloud computing systems, and reliability problems are caused largely by failures, in particular correlated failures. Hence, the research problem focuses on correlated failures, which increase the number of issues and the complexity of managing cloud computing resources while meeting continuously varying customer demand for resources.


Summary of Research on Correlated Failures

Failure investigation and failure management have therefore become key considerations wherever reliability is a primary concern for sustaining the reputation of cloud computing systems. The research problem is how a cloud computing system can be maintained with reliable resources and services while reducing failures, especially correlated failures, which are a serious concern because they multiply the issues and failures affecting each of the data storage drives.

The research question is how to reduce, or even eliminate, the correlated failures that cause serious and multiplying storage drive failures in cloud computing systems, so that reliability is not compromised when cloud computing services are provided to customers.

The objective of this research is to investigate the causes of failures and correlated failures in cloud computing systems and to build a viable, realistic solution that minimizes or eliminates them.

The remainder of this study is organized as follows. Section 2 covers the literature review on the technical aspects of cloud computing services and system failures. Section 3 covers the methodology by which the research is conducted to achieve the objective. Section 4 summarizes the research on correlated failures and presents the fault tolerance methods and classes explored. Section 5 gives the conclusion of the research and study.

This section covers system failures in general, followed by cloud computing failures. The next subsection presents recent failures and their analysis. The following subsection covers failure correlation and the classification of correlated failures. A subsequent subsection presents fault tolerance for correlated failures. The final part covers fault tolerance in terms of classes, indicating solutions for the correlated failures of cloud computing.

Cloud computing and distributed computing systems use large-scale computing resources. Such platforms comprise thousands of nodes in federations of distributed clusters spread across geographic locations. At this scale, failures of nodes and networks are no longer exceptions, and they have a large effect on business operations and therefore on company profits. Such systems must tolerate failures, and their evolution has to take the reaction to failures into account.

A failure can be defined as an event in which the system fails to continue operating according to its intended specifications. In other words, a system failure occurs when a system deviates from fulfilling the regular, normal function for which it was intended.


Cloud computing is a flexible, scalable computing resource service offered for large-scale data and database storage, enabling a growing number of companies to run on clouds. The technology is a double-edged sword: cloud computing offers a high level of computing resources with ample flexibility and lets companies sustain large volumes of data, but even a few hours of outage can leave those companies with losses. Although cloud computing stands as a major service for storage requirements, failures are quite common.

Netflix is an American entertainment company that provides video on demand, streaming media and DVD by mail. The company was founded in 1997, and by 2017 its revenue had reached US $11.692 billion. Netflix's journey to the cloud started in 2008, following the corruption of a major database. The database corruption hit the DVD-by-mail service very hard, disrupting DVD shipping for three consecutive days. As a result, the company lost billions of dollars.

In another instance, Netflix lost $200,000 in just one hour of downtime. Netflix then designed its own cloud infrastructure so that applications can be switched between zones automatically, avoiding service disruptions when failures occur.

Later, a new approach was considered, which led the company's team to move to the AWS (Amazon Web Services) cloud. This helped the company make its data warehouse flexible enough to scale up or down.

On 29th June, a three-hour downtime was caused by a power outage. The outage uncovered issues with the company's own systems, which are overseen by Ariel Tseitlin and Greg Orzell, the technologists responsible for the set-up of Netflix's cloud.

After investigating the root causes of their biggest outages, they added resiliency patterns to mitigate such disruptions in the future. They identified the root cause of their latest outage as an edge case in an internal mid-tier load-balancing service.

One feature of Netflix's cloud infrastructure caused a cascading failure.

WhatsApp experienced a worldwide outage on 3rd May, 2017, from 4:50 PM EST to 5:54 PM EST. During this time the messaging service was unavailable to users. The company responded with a message that it was aware of the issue and was working to fix it as soon as possible.

Facebook's services went down in February 2018 for about 3 hours. Users in Brazil, Europe and Australia could not log into their accounts, and it was the second outage in the same week. After forcing users to log off, Facebook did not let them log back in, which gave the impression that their accounts were being viewed and used by other people. In the previous outage, two days earlier, users were able to log in but could not see the News Feed. Facebook immediately announced the outage on Twitter to calm the masses.

Later, Facebook explained that the behaviour came from functionality meant to guard user accounts against hackers. It subsequently confirmed that no actual security breach had occurred.

Amazon Web Services suffered a major and embarrassing blow that drew attention around the world. On February 28, 2017, an AWS engineer tried to debug a storage system called S3 in the data center in Virginia. In doing so, the engineer accidentally typed a command incorrectly. As a result, much of the internet, including enterprise platforms such as Trello, Quora and Slack, went down for a few hours.

The post-mortem found that the employee was using ‘an established playbook’ and intended to pull down a small number of servers that hosted subsystems for the billing process. The mistyped command instead removed a far broader swath of servers, taking offline one subsystem needed to serve particular data storage requests and another needed to allocate new storage.

The outage at a cloud service provider that owns close to one third of the global cloud market reignited the debate on the risks of the public cloud.

GitLab’s popular online code repository suffered an 18-hour service outage that ultimately could not be fully remediated. The problem arose when an employee removed a database directory from the wrong database server during maintenance procedures. Some customer production data was ultimately lost, including modifications to projects, comments, and accounts. “Our best estimate is that it affected roughly 5,000 projects, 5,000 comments and 700 new user accounts,” the company said in a post-mortem. In an apology to users, GitLab’s CEO said that “losing production data is unacceptable.”

Types of Occurrence-Based Failures in CCS: independent and correlated failures in CCS, a taxonomy of correlated failures, and how the recent failures link to the failure types by their properties of occurrence

On June 27, 2016, multiple social media feeds reported availability problems with Apple’s iCloud Backup service. Apple’s system status page said iCloud Backup was down for less than 1 percent of users. The problem, in which those affected could not restore iOS devices from previous backups, lasted for at least 36 hours. Although the restore process would hang without completing, new device backups could still be started to protect data.

The concept of correlation is associated with the interdependency of activities. A failure in one part of a system can cause multiple failures in other associated and related parts of the system. Eventually the entire system may fail, and in that context the failures are said to be correlated. In distributed computing systems such as grids and clouds, a set of computing components is known as a shared risk domain or shared risk group if a common failure affects multiple components in the set. The components are so called because they share a common failure risk, much as a communication medium affects the entire topology in a network. If the communication medium breaks down, data transfer among all the nodes that use that medium goes down (Pezoa & Hayat, 2014). Most research on ensuring a reliable cloud environment has assumed that failures are independently distributed (Mickens & Noble 2006). This assumption is simpler to work with, but it has proved error-prone in practice. A single faulty node can influence the health and operation of the entire system (Wang & Wang 2014). Failure correlation may not only make the entire system faulty, but also reduce the effectiveness of several fault tolerance mechanisms, such as replication, encoding schemes and backups (Rangarajan et al, 1998).
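To see why correlation weakens replication, consider a deliberately simple model. It is an assumption made for illustration and is not taken from the cited works: each of k replicas fails independently with probability p, and in addition a shared risk event with probability q takes down every replica in the group at once. The function names and the numeric values are hypothetical.

```python
def loss_probability_independent(p: float, k: int) -> float:
    """Probability that all k replicas fail when failures are independent."""
    return p ** k


def loss_probability_shared_risk(p: float, k: int, q: float) -> float:
    """Probability that all k replicas fail when they sit in one shared risk group.

    With probability q the common failure hits every replica; otherwise the
    replicas fail independently, each with probability p.
    """
    return q + (1.0 - q) * p ** k


# Illustrative values only: 1% per-replica failure, 3 replicas, 0.1% shared-risk event.
p, k, q = 0.01, 3, 0.001
independent = loss_probability_independent(p, k)
correlated = loss_probability_shared_risk(p, k, q)
print(f"independent failures: {independent:.2e}")
print(f"shared risk group:    {correlated:.2e}")
```

In this toy model the shared-risk term dominates as soon as q is larger than p ** k, so adding further replicas inside the same risk group brings little benefit. This is one way to read the observation that correlation reduces the effectiveness of replication.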

Correlated failures fall into two basic types, temporal and spatial.

Temporal correlation is based on time. It concerns reviewing the pattern of failure occurrences and finding periodicity in it. One of the best methods for detecting temporal correlation is the ACF (autocorrelation function). An ACF value close to 1 indicates periodicity, whereas a value near zero indicates that failure occurrences can be treated as random. However, failures in large distributed computing systems are not distributed uniformly across nodes. A very small fraction of the nodes, less than 4%, accounts for close to 70% of the failures in the system.

Experiments have also revealed a strong, time-varying correlation in the failure occurrence pattern on these nodes. Failure information taken from different failure traces is examined at various time lags through an autocorrelation function to measure the degree of failure correlation. The work plots the failure information at different lags, such as hours, days and weeks, so that repeated patterns can be found. In this way, failure behaviour in large distributed systems is measured as it varies over time. The authors propose a formal method to characterize repeating failure patterns and failure peaks, and to identify the periods that cause system downtime.
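The use of the ACF to detect periodicity can be sketched as follows. This is a minimal illustration using the standard sample autocorrelation estimator; the hourly failure counts are synthetic data invented for the example (a daily burst of failures), not values from any failure trace, and the function name is hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag (0 <= lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance_sum = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum


# Synthetic data: two weeks of hourly failure counts with a burst
# during the first six hours of every day.
hourly_failures = [5 if hour % 24 < 6 else 1 for hour in range(24 * 14)]

acf_at_24h = autocorrelation(hourly_failures, 24)  # lag equal to the daily cycle
acf_at_12h = autocorrelation(hourly_failures, 12)  # lag half a cycle out of phase
print(f"ACF at 24 h: {acf_at_24h:.3f}")
print(f"ACF at 12 h: {acf_at_12h:.3f}")
```

For this synthetic series the value at a 24-hour lag is close to 1, which matches the reading of the ACF given above, while the value at a 12-hour lag is negative because bursts line up with quiet periods. Repeating the calculation at lags of hours, days and weeks is the kind of scan described in the text.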

Spatial correlation is based on space. Spatially correlated failures occur on several nodes of one system within short time intervals.

The correlation of failure occurrences in space, within a failure burst, can be demonstrated either numerically or empirically. General numerical methods are required to show that failures are correlated in space (Gallet et al, 2010). The result is a numerical model, based on lognormal distributions of aspects such as the downtime caused by failures, group size and group arrival, that captures space correlation among failures occurring within very short time intervals. For example, a moving-window method is used to find correlation among series of failures in empirical data. The data can be taken from a public failure repository, the FTA (Failure Trace Archive). In the experiments conducted, seven of fifteen traces show strong correlation among failure occurrences, and the report challenges the usual assumption that component failures are distributed independently.

Space-correlated (spatially correlated) failures in cloud computing occur within very short time intervals and are more likely in tightly coupled systems. However, the investigation of space-correlated failures is hampered by the lack of information in failure traces: the same investigation shows that no failure trace records failures in sufficient detail to reconstruct failure groups. A numeric approach is therefore adopted, which groups failures based on their start and finish timestamps. Examples of the numeric approach are extending windows, time partitioning and moving windows.

  1. Extending Windows

In this approach, a group of failures is a maximal sub-sequence of events such that every two consecutive events occur at most Δ apart in time. The window size is therefore Δ, and the horizon is extended for each new event added to the group. This approach has been used to model the failures that can occur in Grid 5000.

  2. Time Partitioning

The time partitioning approach partitions time into windows of fixed size Δ, starting either from the first event in O, the set of failure events, or from a hypothetical time 0. This process is called space-correlated failure generation through time partitioning.

  3. Moving Windows

The moving windows process is iterative: starting from the first event in O, it generates space-correlated failures with time parameter Δ. At each step, a group generator F is selected, and a time window of size Δ moves to the event in O that is selected next (Gallet et al, 2010). The process completes once all the events in O have been selected, so the maximum number of correlated-failure groups equals the total number of events in O. This process is called space-correlated failure generation through moving windows. A single generation process is selected out of the three, motivated by the following two considerations. First, time partitioning is not selected, because it may introduce artificial time boundaries between consecutive failure events that belong to the same space-correlated failure, since each window starts at a multiple of Δ. The groups identified in this way may not correspond to naturally occurring groups, and may confuse the related algorithms and mechanisms. Moving and extending windows have no such problem. Second, the extending windows process could generate infinitely long groups, in which a failure occurs long after its group generator; this may reduce the efficiency of fault tolerance mechanisms that react to instantaneous bursts of failures. The moving windows process is therefore selected for modelling.

When all three approaches are applied to the same set of events, O, they generate different space-correlated failures (Kondo et al, 2010).
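The three grouping approaches can be sketched in a few lines each. This is an illustrative reading of the descriptions above rather than the cited authors' implementation: events are reduced to their start timestamps, a gap exactly equal to Δ is treated as inside the window, and the function names and sample timestamps are hypothetical.

```python
def time_partitioning(timestamps, delta, origin=None):
    """Group events into fixed windows of size delta, starting at origin
    (defaults to the first event)."""
    events = sorted(timestamps)
    if not events:
        return []
    start = events[0] if origin is None else origin
    windows = {}
    for t in events:
        windows.setdefault(int((t - start) // delta), []).append(t)
    return [windows[index] for index in sorted(windows)]


def extending_windows(timestamps, delta):
    """Group events so that consecutive events in a group are at most delta apart."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= delta:
            groups[-1].append(t)  # window extends from the latest event
        else:
            groups.append([t])
    return groups


def moving_windows(timestamps, delta):
    """Group events that fall within delta of the group generator (first event of the group)."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][0] <= delta:
            groups[-1].append(t)  # still inside the generator's window
        else:
            groups.append([t])  # t becomes the next group generator
    return groups


events = [0, 4, 8, 12, 16, 30]  # made-up failure start times
delta = 5
partitioned = time_partitioning(events, delta)
extended = extending_windows(events, delta)
moved = moving_windows(events, delta)
print(partitioned)
print(extended)
print(moved)
```

On this sample the extending window chains the first five events into one long group, time partitioning cuts them at multiples of Δ, and the moving window bounds every group to Δ after its generator, which illustrates why the three approaches produce different groups from the same event set.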

This section discusses existing fault tolerance for correlated failures and cloud computing failures, along with failures in distributed computing (de Oliveira et al, 2013).

Over the last decade, the cloud computing landscape has changed significantly, along with the mechanisms and methods of cloud failure tolerance. However, the data center model and the single-provider approach pose various challenges. A large data center consumes a great deal of energy to keep operating. In addition, like any centralized computing model, centralized cloud data centers are susceptible to single points of failure (Engelmann & Geist, 2005). Moreover, users and data centers are geographically distant, so data has to be transferred from its sources to the resources that process it in the data center. This means that an application that generates or uses personal or sensitive data may have to store that data in a country other than the one where it originated (Institute, 2016).

Various strategies are implemented to mitigate failures on the cloud. They include redundant compute systems within the data center, multiple zones, and backup data and backup data centers in individual zones. In recent years, however, alternative models have been proposed that use cloud infrastructure without relying on a single provider to operate the data centers (Incel et al, 2012). The trends in how the cloud is changing are illustrated by the multi-cloud, the micro cloud and cloudlet, the ad hoc cloud and the heterogeneous cloud. Facilitating multi-cloud and ad hoc cloud environments requires changes further up the stack, from the middleware layer.

Reliability remains a concern when adopting the cloud for remote computing and storage. Cloud failures have been reported that affected popular services such as Netflix and Dropbox. It should also be noted that one reported 49-minute outage cost the affected company more than $4 million in lost sales (Habak et al, 2015).

Losses from outages will continue to escalate with the rapid growth of e-commerce, as sudden, unplanned outages cannot be prevented entirely (Trapero et al, 2017). As infrastructure becomes more distributed, reliability will become more challenging. More effort has recently gone into designing more reliable cloud data centers and services.

To deal with hardware failures at the infrastructure level caused by natural disasters and targeted attacks, data and VMs are replicated rigorously across different geographical locations (Ferdousi et al, 2015). Proactive and reactive strategies for backing up VMs, which take network bandwidth and associated metrics into account, are now inherent to the design of cloud data centers.

To help deliver disaster-resilient cloud architectures, Microsoft has initiated FailSafe, which can be used by cloud applications. Disaster recovery is nevertheless an expensive operation, and the cost and time of recovery as a service after a failure need to be minimised (Nguyen et al, 2016). Single points of failure can be avoided through multi-region and multi-cloud architectures that scale both vertically, expanding beyond a single cloud data center and through the network, and horizontally, through geographical distribution (Wood et al, 2010).

Correlated failures are best countered by novel system architectures that enable clouds to distribute data centers geographically and that improve the sustainability of cloud and distributed systems (Lee et al, 2014).

In this space, significant contributions have been made through algorithms for geographically distributed data coordination, energy-aware and carbon-footprint-aware provisioning, and resource provisioning in data centers (Younge et al, 2010). Such algorithms reduce the probability of failures by maximising the use of green energy and minimising the energy consumed by data centers while meeting the QoS expectations of applications. Incorporating energy efficiency as a QoS metric has also been suggested recently (Wang et al, 2015). However, this risks violating Service Level Agreements, as VM management policies aim ever more rigorously at optimizing energy efficiency. There is thus a trade-off between energy efficiency and the performance of cloud resources.

Inter- and intra-data-center networking plays a vital role in setting up efficient data centers. Virtualising network functions through software-defined networking is an upcoming approach to managing key services offered through the network. However, energy consumption is not a key metric in current implementations (Yuan et al, 2017). Understanding the trade-off between network functions and energy consumption remains an open area. Addressing it would provide insights into developing cloud infrastructure that can become more distributed.

Failures and correlated failures in cloud and distributed computing can be reduced in the long run by making such systems more sustainable, for example by incorporating algorithms for application-aware management of the power states of computing servers. New methods and mechanisms are needed to manage the energy efficiency of cooling systems, networks and servers (Hameed et al, 2016). Such techniques leverage the interplay between IoT-enabled cooling systems and data center managers, who decide dynamically which resources to switch on and off in both space and time, based on workload forecasts.

Distributed File System over Clouds

Google File System (GFS) is a proprietary distributed file system developed by Google. It is designed to provide reliable, efficient access to data using large clusters of commodity servers (Duan et al, 2017). The system stores files by dividing them into chunks of 64 megabytes each, which can be stored, appended to, read and overwritten. It is optimized for high data throughput, even at the cost of latency, and for surviving failures of individual servers.
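The division of a file into fixed-size chunks can be illustrated with a short sketch. It is not GFS code: the function name is hypothetical, "64 megabytes" is interpreted here as 64 MiB, and the 200 MiB file size is an arbitrary example.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed interpretation of "64 megabytes" as 64 MiB


def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Return the (start, end) byte offsets of each fixed-size chunk of a file.

    The end offset is exclusive; the last chunk may be shorter than chunk_size.
    """
    return [
        (offset, min(offset + chunk_size, file_size))
        for offset in range(0, file_size, chunk_size)
    ]


# Arbitrary example: a 200 MiB file.
ranges = chunk_ranges(200 * 1024 * 1024)
for index, (start, end) in enumerate(ranges):
    print(f"chunk {index}: bytes {start} to {end - 1}")
```

Because each chunk is an independent unit, chunks can be placed on different servers, which is what lets a system of this kind survive the loss of an individual server.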

Ideally, the network infrastructure has to be fault-tolerant against many kinds of failure, including server failures, server rack failures and link outages. Existing unicast and multicast communications should not be affected, to the extent that the underlying physical connectivity allows. Resiliency is thus an important factor in ensuring the fault tolerance of cloud computing.

Controlling and increasing the fault tolerance of cloud computing requires corresponding mechanisms. Many conventional cloud applications have a two-tier architecture, in which front-end nodes such as user devices use cloud services, while the business logic and database are located in the cloud. The growing number of sensor-rich devices, such as tablets, smartphones and wearables, generates large volumes of data. According to Gartner, over 20 billion devices will be connected to the internet by 2020, generating data on a scale quoted as 43 trillion. This creates significant computing and networking challenges that degrade Quality of Service (QoS) and Quality of Experience (QoE), because the probability of general and correlated failures increases in both cloud and distributed computing. The proposed solution is to change the current cloud computing architecture and to set up fault tolerance mechanisms accordingly. The problem cannot be addressed by adding more centralized cloud data centers or by eliminating them. Instead, the computing ecosystem needs to be extended beyond cloud data centers towards the user. This includes resources at the network edge and resources voluntarily contributed by their owners, which are not typically considered in conventional cloud computing.

Large-scale applications that need increased fault tolerance call for new computing models that change the computing infrastructure. Four computing models are considered in this respect, namely fog and mobile edge computing, software-defined computing, volunteer computing and serverless computing, which may define new trends in future clouds (Shahri et al, 2014).

Volunteer Computing

Ad hoc clouds and cloudlets are emerging to accommodate more user-driven and mobile applications that benefit from computing closer to user devices. Unlike in a conventional data center, the availability of compute resources is not guaranteed in an ad hoc cloud or cloudlet. Upfront or pay-as-you-go payment for compute, storage or network resources is therefore not suitable (Caton et al, 2012). Instead, a crowd-funded approach is used, in which spare resources from users' computers or devices are volunteered to create an ad hoc cloud. Such a model can increase fault tolerance and can support applications that have a scientific or societal focus.

Volunteer cloud computing takes different forms. For instance, social network users may share their heterogeneous computing resources in the form of an ad hoc cloud, which is known as ‘social cloud computing’ (Costa et al, 2011). Reliable owners are rewarded through a reputation market within the social network. Gamification is another reported incentive. A related line of research is known as ‘peer-to-peer cloud computing’.

The main fault tolerance challenge in benefiting fully from volunteer cloud computing is to minimise the overheads of setting up a virtualized environment, given that the underlying hardware is ad hoc and heterogeneous.

Fog and Mobile Edge Computing

The premise of fog computing is to leverage the computing resources that exist on edge nodes, such as mobile base stations, routers and switches, or to integrate additional computing capability into such network nodes along the entire data path between user devices and cloud data centers. These nodes are resource constrained. The approach becomes viable if general-purpose computing is facilitated on existing edge nodes or on additional infrastructure such as cloudlets or micro clouds. Fog computing is applicable to use cases such as face recognition and online games (Ranjan & Zhao, 2013). Its benefit is that it minimises application latency and improves QoS and QoE for users by leveraging hierarchical networking and tapping into resources that are not traditionally used for general-purpose computing. Fog computing is therefore anticipated to help realise the vision of the Internet of Things. It is expected to work in conjunction with centralized clouds, so that more distributed computing is facilitated and fault tolerance is improved (Chen et al, 2016).

An important characteristic of fog computing is that it can be scaled vertically across computing tiers, so that only essential data traffic travels beyond the data source (Baset, 2012). Workloads can be offloaded from cloud data centers to edge nodes, or from user devices to edge nodes, so that data is processed near its source rather than in geographically distant locations.
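The idea that only essential data travels beyond its source can be illustrated with a minimal sketch. It assumes a hypothetical edge node that receives raw sensor readings, summarises them locally and forwards only the summary and any readings above a threshold; the function name `process_at_edge`, the threshold and the sample values are illustrative assumptions, not part of any cited system.

```python
from statistics import mean
from typing import Dict, List


def process_at_edge(readings: List[float], threshold: float) -> Dict[str, object]:
    """Summarise raw readings locally; forward only the summary and outliers.

    Hypothetical edge-node logic: the raw readings stay at the edge, and only
    this small dictionary would be sent on to the cloud data center.
    """
    if not readings:
        return {"count": 0, "mean": None, "alerts": []}
    alerts = [r for r in readings if r > threshold]
    return {"count": len(readings), "mean": mean(readings), "alerts": alerts}


# Illustrative values only: five temperature readings, one above the threshold.
raw = [21.0, 21.5, 22.0, 35.0, 21.5]
summary = process_at_edge(raw, threshold=30.0)
print(summary)
```

In this sketch the cloud receives one count, one mean and one alert instead of five raw values; the saving grows with the number of readings aggregated per message.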

Mobile edge computing (MEC) is similar to fog computing in that the network edge is employed (Buyya et al, 2011). However, it is limited to the mobile cellular network, and in this model the radio access network may be shared with the aim of reducing network congestion. Its application areas are data analytics, content delivery and computational offloading, all of which aim to improve response time.

Two challenges need to be addressed in order to realise fog computing and MEC. Firstly, management is complex: it involves multi-service service level agreements, the articulation of responsibilities, and a unified management platform, given that edge nodes may be owned by different parties. Secondly, security must be enhanced and privacy issues addressed when multiple nodes interact between the user device and the cloud data center (Stojmenovic et al, 2016). The Open Fog Consortium is a first step in this direction.

Serverless Computing

In conventional cloud computing, an application is hosted on a virtual machine (VM) that offers services to users. The owner of the service pays for the entire time the server application is hosted. The performance metrics of such an application are latency, scalability and elasticity, so development efforts focus on these three metrics. Idle time is not considered, and payment is per VM per hour because the VM has to keep running (McGrath & Brenner, 2017). However, decentralised data center infrastructure may have relatively less processing power, so it is not ideal for continuously hosting servers that remain idle for prolonged periods. Instead, a fog or MEC application is modularised with respect to the time taken to execute a module or the memory used by the application. This calls for different cost models, proportional to the memory consumed and the number of requests processed.
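The difference between the two cost models can be made concrete with a small calculation. The sketch below is an assumption-laden illustration: the function names, the pricing structure (a per-request fee plus a fee per gigabyte-second of memory) and every price are hypothetical and are not taken from any provider's price list.

```python
import math


def vm_cost(hours_running: float, price_per_vm_hour: float) -> float:
    """Conventional model: every started hour is billed, idle or not."""
    return math.ceil(hours_running) * price_per_vm_hour


def serverless_cost(requests: int, avg_seconds: float, memory_gb: float,
                    price_per_request: float, price_per_gb_second: float) -> float:
    """Serverless model: billed per request and per memory held while executing."""
    compute = requests * avg_seconds * memory_gb * price_per_gb_second
    return requests * price_per_request + compute


# Hypothetical workload: a mostly idle service handling 100,000 short requests
# in a 30-day month. All prices are made-up illustrative values.
monthly_vm = vm_cost(hours_running=720, price_per_vm_hour=0.05)
monthly_fn = serverless_cost(requests=100_000, avg_seconds=0.2, memory_gb=0.5,
                             price_per_request=0.000001,
                             price_per_gb_second=0.00002)
print(f"VM: {monthly_vm:.2f}  serverless: {monthly_fn:.2f}")
```

Under these assumed numbers the per-request model is far cheaper because the VM is billed for idle hours; for a service that is busy most of the time, the comparison can go the other way.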

Here, serverless does not mean that no servers are involved; it means that a server is not rented in the way a conventional cloud server is. The concerns of deploying the application on a VM, over- or under-provisioning resources, scalability and fault tolerance are abstracted away from the developer. Properties such as flexibility, cost and control are considered instead.

In this approach, the functions of an application are executed only when needed, without the application having to run continuously (Spillner, 2017). It is also known as event-based programming or Function-as-a-Service, since the execution of a function may be triggered by an event. Examples are AWS Lambda, Google Cloud Functions and IBM OpenWhisk. OpenLambda is an open-source project for realising serverless computing, and it is reported to achieve better fault tolerance.
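The event-triggered execution described above can be sketched in a few lines. This is a toy, in-process illustration only: the `FunctionRegistry` class, the event name `image_uploaded` and the handler are hypothetical, and the sketch does not use or imitate the API of AWS Lambda, Google Cloud Functions, OpenWhisk or OpenLambda.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], Any]


class FunctionRegistry:
    """Toy event dispatcher: functions run only when a matching event arrives."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.invocations = 0  # counts executions, the unit that would be billed

    def on(self, event_type: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a function for one event type."""
        def register(fn: Handler) -> Handler:
            self._handlers[event_type].append(fn)
            return fn
        return register

    def emit(self, event_type: str, payload: Dict[str, Any]) -> List[Any]:
        """Run every function registered for this event and collect results."""
        results = []
        for fn in self._handlers.get(event_type, []):
            self.invocations += 1
            results.append(fn(payload))
        return results


registry = FunctionRegistry()


@registry.on("image_uploaded")
def make_thumbnail(event: Dict[str, Any]) -> str:
    # Stand-in for real work; nothing runs until the event is emitted.
    return f"thumbnail-of-{event['name']}"


results = registry.emit("image_uploaded", {"name": "cat.png"})
unhandled = registry.emit("unknown_event", {})
print(results, unhandled, registry.invocations)
```

Nothing executes between events, which is the property that makes billing per invocation, rather than per running hour, possible.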

Fault tolerance can be improved through increased use of serverless computing, given the need to connect billions of devices to the network edge and to data centers. One challenge is the radical shift in the application properties a programmer must focus on: from latency, scalability and elasticity to those related to application modularity, such as control and flexibility. Another challenge is the development of programming models. The effects and trade-offs of using traditional external services along with serverless services need further investigation (Jararweh et al, 2016).

Software Defined Computing

In contrast to a traditional two-tier cloud architecture, emerging systems generate large amounts of traffic because of the increasing number of devices that the internet serves. Consequently, multiple cloud services need to transfer increasing volumes of data from one location to another. A dynamic architecture is needed to manage this efficiently. The Software Defined Networking (SDN) approach isolates the underlying network hardware from the components that control data traffic (Malawski, 2016). This abstraction allows the network control components to be programmed, which yields a dynamic network architecture.
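The separation of the control components from the forwarding hardware can be illustrated with a minimal sketch. The `Switch` and `Controller` classes, the addresses and the port names below are hypothetical and do not correspond to OpenFlow or to any real SDN controller API; the point is only that forwarding is a table lookup while routing decisions are made, and changed, in software elsewhere.

```python
from typing import Dict, Optional


class Switch:
    """Data plane: forwards by flow-table lookup and makes no decisions itself."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.flow_table: Dict[str, str] = {}  # destination -> output port

    def forward(self, destination: str) -> Optional[str]:
        # Returns None when no rule matches (a real switch might ask the controller).
        return self.flow_table.get(destination)


class Controller:
    """Control plane: holds the network view and programs the switches."""

    def __init__(self) -> None:
        self.switches: Dict[str, Switch] = {}

    def register(self, switch: Switch) -> None:
        self.switches[switch.name] = switch

    def install_rule(self, switch_name: str, destination: str, out_port: str) -> None:
        self.switches[switch_name].flow_table[destination] = out_port


controller = Controller()
s1 = Switch("s1")
controller.register(s1)

controller.install_rule("s1", "10.0.0.2", "port-1")
before = s1.forward("10.0.0.2")

# Reprogram the route in software, for example after a link on port-1 fails.
controller.install_rule("s1", "10.0.0.2", "port-2")
after = s1.forward("10.0.0.2")
print(before, after)
```

Because the route is changed without touching the switch logic, the same mechanism can be used to steer traffic around failed links, which is where the approach connects to fault tolerance.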

SDN development presents many opportunities and challenges. Firstly, developing hybrid SDNs, as opposed to purely centralised or distributed SDNs, is a challenge, and facilitating physically distributed protocols needs research (Kreutz et al, 2015). Secondly, techniques need to be developed that capture Quality of Service by considering both the network and the cloud infrastructure. Thirdly, interoperability with Information Centric Networking (ICN) needs to be facilitated, since cloud networks are expected to adopt ICN over SDN. Fourthly, mechanisms need to be developed that facilitate network virtualization at varied granularities (Hakiri et al, 2014).

As distributed cloud computing architectures emerge, the SDN approach can be applied beyond networking to compute and storage, and to resources beyond data centers, for effective delivery of cloud environments (Zhang et al, 2017). When this concept is applied to the compute, storage and network resources of a data center and beyond, it is referred to as Software Defined Computing (SDC). It allows physical resources to be adapted and easily reconfigured in order to deliver the agreed QoS metrics, including fault tolerance. The complexity of configuring and operating the infrastructure is thereby alleviated.

The project started by exploring the basic concepts of cloud computing. The research drew on resource material from various sources. The major source was online material, in addition to the University library; printed books were another important source. Standard, authoritative books related to the research subject were identified and studied in order to understand the concept of cloud computing and the related failures. Some of the books were obtained from libraries and others from the internet as PDF files. After an initial understanding of cloud computing and its possible failures had been gained, correlated failures were studied in detail. Statistics on cloud computing failures at large companies were explored to understand their effect on the respective companies and industries. The root causes of correlated failures were then explored to understand how the failures occur. Correlated failures were studied further, using the existing literature, to understand how one failure leads to others and, eventually, how the reliability of the computing system is affected.

After the correlated failures and their root causes had been studied, solutions were researched. Certain classes of solution have been developed for these problems; some of these classes were studied in detail and are presented here, and some of the solutions, in developed form, are presented as solutions to the correlated failure problem. The objective of the research, to explore the root causes of correlated failures in cloud computing and to propose solutions, is believed to have been met. The proposed solutions are expected to be implementable in the advanced cloud computing systems that serve the computing resource requirements of many large companies, so that the reliability of cloud resources and services improves.

Cloud computing needs to be well controlled with respect to outages and failures, whether from general causes or from correlated failures, because outages of business operations cause huge losses and reduce overall performance and profits proportionally. The general trend in addressing this problem is to distribute infrastructure across multiple providers and to decentralise computing away from the resources currently concentrated in data centers. This stands in contrast to the traditional cloud offering from a single provider. As a consequence, new methods, models and mechanisms are emerging to suit market demands.
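A short calculation shows why distributing a service across providers helps, and why correlated failures limit the gain. The sketch below rests on stated assumptions: each provider is available with probability `a`, the service is up if at least one replica is up, and correlation is modelled in the simplest possible way as a single common-cause event with probability `c` that takes down all replicas at once. The function names and all numbers are illustrative and are not drawn from the cited literature.

```python
def availability_independent(a: float, n: int) -> float:
    """Service is up unless all n independent replicas are down at once."""
    return 1.0 - (1.0 - a) ** n


def availability_common_cause(a: float, n: int, c: float) -> float:
    """As above, but a common-cause event (probability c) fails every replica."""
    return (1.0 - c) * availability_independent(a, n)


# Illustrative values: 99% availability per provider, 0.5% common-cause risk.
single = availability_independent(0.99, 1)
three_independent = availability_independent(0.99, 3)
three_correlated = availability_common_cause(0.99, 3, c=0.005)
print(single, three_independent, three_correlated)
```

With these assumed values, three independent replicas reduce the expected unavailability from 1% to about 0.0001%, whereas under the common-cause model the availability can never exceed 1 - c however many replicas are added. This is the sense in which correlated failures undermine redundancy that assumes independence.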

Still, incorporating resilient computing into distributed cloud applications remains challenging; it requires significant programming effort and is an open area of research (Couto et al, 2014).

Research is needed to find the trade-off between energy efficiency and the performance of cloud resources. This in turn demands new methods that incorporate resilience in the event of failures and outages.

Another important mechanism is to keep the temperature of the drives, and hence of the data centers, under control. Existing work focuses primarily on consolidating VMs to minimise the energy consumption of servers, but the network and cooling systems also account for a large proportion of the total energy consumed.


Cloud computing services stand as one of the best options for the future of computing. Cloud computing is closely associated with the existing service models of SaaS, PaaS and IaaS. As a major breakthrough in remote storage systems, cloud services are used for a range of purposes, from an individual's personal needs to the huge data storage needs of large companies such as Facebook, Twitter, AWS and Yahoo. Cloud computing is preferred because of its flexibility of use and because consumers pay only for the services they actually use. Although cloud computing systems are considerably reliable storage for many businesses, they are subject to failures. Even when failures are brought under control and corrected in a very short time, that time is enough to cause huge losses for larger companies, because of the demand for continuous online presence. This research therefore explores correlated failures, which are more severe than other failures in cloud computing. Correlated failures are usually caused by one or more drive failures and result in further drive failures within a few hours to days, eventually increasing the failure percentage of the cloud storage system.

The research question, how to reduce these correlated failures in cloud computing systems, was considered and researched. Recent failures were analysed for large companies such as Netflix, WhatsApp, Facebook, AWS, GitLab, Twitter and Apple iCloud. Each of these companies has incurred huge losses from failures of the cloud storage on which it relied. As a next step, failure correlation was studied in detail in terms of interdependent activities, and a broader range of correlated failures was studied and classified. Temporal correlation, which is based on time, and spatial correlation, which is based on space, were studied and presented along with their respective methods. Once the causes and methods were clear, solutions were explored in the form of fault tolerance for correlated failures, so that new classes can be defined and developed to reduce the failures of cloud systems. The existing literature was studied to see how fault tolerance can be obtained and improved in order to increase the overall reliability of cloud computing systems.


  1. Sams, S. L. 2011. Discovering hidden costs in your data centre: a CFO perspective.
  2. Nguyen, T. Shi, W. 2010. Improving resource efficiency in data centers using reputation-based resource selection, in: Green Computing Conference, International, Chicago, IL, USA, 2010, pp. 389.
  3. Cook, G. Horn, J. V. 2011. How dirty is your data? A look at the energy choices that power cloud computing.
  4. Shawish, A. Salama, M. 2014. Cloud computing: paradigms and technologies, in: Inter-cooperative Collective Intelligence: Techniques and Applications, Springer. pp. 39.
  5. Jula, A. Sundararajan, E. Othman, Z. 2014. Cloud computing service composition: A systematic literature review, Expert Systems with Applications 41 (8).
  6. Portnoy, M. 2012. Virtualization essentials, Vol. 19, John Wiley & Sons.
  7. Javadi, B. Thulasiraman, P. Buyya, R. 2013. “Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources”, The Journal of Supercomputing.
  8. Barroso, L. A. Clidaras, J. Hölzle, U. 2013. The datacenter as a computer: An introduction to the design of warehouse-scale machines, Synthesis Lectures on Computer Architecture 8 (3).
  9. Javadi, B. Abawajy, J. Buyya, R. 2012. “Failure-aware resource provisioning for hybrid cloud infrastructure”, Journal of Parallel and Distributed Computing 72 (10).
  10. Fu, S. 2010. “Failure-aware resource management for high-availability computing clusters with distributed virtual machines”, Journal of Parallel and Distributed Computing 70 (4).
  11. Engelmann, C. Geist, A. 2005. Super-scalable algorithms for computing on 100,000 processors.
  12. Institute, P. 2016. Cost of data center outages
  13. Pezoa, J. E. Hayat, M. M. 2014. Reliability of heterogeneous distributed computing systems in the presence of correlated failures, IEEE Transactions on Parallel and Distributed Systems 25 (4)
  14. Mickens, J. W. & Noble, B. D. 2006. Exploiting availability prediction in distributed systems, in: (NSDI’06), 3rd Symposium on Networked Systems Design and Implementation, San Jose, CA, USA, 2006, pp.86
  15. Wang, S.-S., Wang, S.-C. 2014. The consensus problem with dual failure nodes in a cloud computing environment, Information Sciences 279
  16. Rangarajan, S. Garg, S. Huang, Y. 1998. Checkpoints-on-demand with active replication, in: Seventeenth Symposium on Reliable Distributed Systems, IEEE, West Lafayette, Indiana, USA. pp. 75, 83
  17. Gallet, M. Yigitbasi, N. Javadi, B. Kondo, D. Iosup, A. Epema, D. 2010. A model for space-correlated failures in large-scale distributed systems, in: Euro-Par 2010-Parallel Processing, Springer, Ischia, Italy. pp. 88-100.
  18. Kondo, D. Javadi, B. Iosup, A. Epema, D. 2010. The Failure Trace Archive: Enabling comparative analysis of failures in diverse distributed systems, in: CCGRID, pp. 1–10.
  19. Gallet, M., Yigitbasi, N., Javadi, B., Kondo, D., Iosup, A., and Epema, D. 2010. A Model for Space-Correlated Failures in Large-Scale Distributed Systems, Delft University of Technology. Parallel and Distributed Systems Report Series, The Netherlands.
  20. Costa, F. Silva, L. Dahlin, M. 2011. Volunteer cloud computing: Mapreduce over the internet, in: IEEE International Symposium on Parallel and Distributed Processing Workshops, pp. 1855–1862
  21. Caton, S. Bubendorfer, K. Chard, K. Rana, O.F. 2012. Social cloud computing: A vision for socially motivated resource sharing, IEEE Trans. Serv. Comput. 5. 551–563.
  22. Shahri, A. Hosseini, M. Ali, R. Dalpiaz, F. 2014. Gamification for volunteer cloud computing, in: Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing. pp. 616–617
  23. Ranjan, R. Zhao, L. 2013. “Peer-to-peer service provisioning in cloud computing environments”, The Journal of Supercomputing 65 (1). 154–184.
  24. Chen, X. Jiao, L. Li, W. Fu, X. 2016. Efficient multi-user computation offloading for mobile-edge cloud computing, IEEE/ACM Trans. Netw. 24 (5)
  25. Baset, S.A. 2012. Cloud SLAs: Present and future, Oper. Syst. Rev. 46
  26. Buyya, R. Garg, S.K. Calheiros, R.N. 2011. SLA-oriented resource provisioning for cloud computing: Challenges, architecture, and solutions, in: Proceedings of the International Conference on Cloud and Service Computing. pp. 1–10.
  27. Stojmenovic, I. Wen, S. Huang, X. Luan, H. 2016. An overview of fog computing and its security issues, Concurr. Comput.: Pract. Exp. 28 (10). 2991–3005.
  28. Spillner, J. 2017. Snafu: Function-as-a-Service (FaaS) Runtime Design and Implementation. CoRR abs/1703.07562.
  29. McGrath, G. Brenner, P.R. 2017. Serverless computing: Design, implementation, and performance, in: IEEE 37th International Conference on Distributed Computing Systems Workshops. pp. 405–410.
  30. Malawski, M. 2016. Towards serverless execution of scientific workflows: HyperFlow case study, in: Workflows in Support of Large-Scale Science Workshop.
  31. Kreutz, D. Ramos, F.M.V. Veríssimo, P.E. Rothenberg, C.E. Azodolmolky, S. Uhlig, S. 2015. Software-defined networking: A comprehensive survey, Proc. IEEE 103 (1). 14–76.
  32. Zhang, Z. Bockelman, B. Carder, D.W. Tannenbaum, T. 2017. Lark: An effective approach for software-defined networking in high throughput computing clusters, Future Gener. Comput. Syst. 72.  105–117.
  33. Jararweh, Y. Al-Ayyoub, M. Darabseh, A. Benkhelifa, E. Vouk, M. Rindos, A. 2016. Software defined cloud, Future Gener. Comput. Syst. 58 (C). 56–74.
  34. Couto, R.D.S. Secci, S. Campista, M.E.M. Costa, L.H.M.K. 2014. Network design requirements for disaster resilience in IaaS clouds, IEEE Commun. Mag. 52 (10). 52–58
  35. Incel, O. Ghosh, A. Krishnamachari, B. Chintalapudi, K. 2012. Fast data collection in tree-based wireless sensor networks, IEEE Trans. Mob. Comput. 11 (1).
  36. de Oliveira, H.A.B.F. Ramos, H.S. Boukerche, A. Villas, L.A. de Araujo, R.B. Loureiro, A.A.F. 2013. DRINA: A lightweight and reliable routing approach for in-network aggregation in wireless sensor networks, IEEE Trans. Comput. 62. 676–689.
  37. Wang, S. Urgaonkar, R. Zafer, M. He, T. Chan, K. Leung, K.K. 2015. Dynamic service migration in mobile edge-clouds, in: IFIP Networking Conference,  pp. 1–9.
  38. Habak, K. Ammar, M. Harras, K.A. Zegura, E. 2015. Femto clouds: Leveraging mobile devices to provide cloud service at the edge, in: Proceedings of the 8th IEEE International Conference on Cloud Computing. pp. 9–16.
  39. Trapero, R. Modic, J. Stopar, M. Taha, A. Suri, N. 2017. A novel approach to manage cloud security SLA incidents, Future Gener. Comput. Syst. 72.
  40. Ferdousi, S. Dikbiyik, F. Habib, M.F. Tornatore, M. Mukherjee, B. 2015. “Disaster aware datacenter placement and dynamic content management in cloud networks”, IEEE/OSA Opt. Commun. Networking 7 (7).
  41. Nguyen, T.A. Kim, D.S. Park, J.S. 2016. Availability modeling and analysis of a data center for disaster tolerance, Future Gener. Comput. Syst. 56. 27–50.
  42. Wood, T. Cecchet, E. Ramakrishnan, K.K. Shenoy, P. van der Merwe, J. Venkataramani, A. 2010. Disaster recovery as a cloud service: Economic benefits & deployment challenges, in: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing
  43. Yuan, X. Min, G. Yang, L.T. Ding, Y. Fang, Q. 2017. A game theory-based dynamic resource allocation strategy in geo-distributed datacenter clouds, Future Gener. Comput. Syst. 76. 63–72.
  44. Hameed, A. Khoshkbarforoushha, A. Ranjan, R. Jayaraman, P.P. Kolodziej, J. Balaji, P. Zeadally, S. Malluhi, Q.M. Tziritas, N. Vishnu, A. Khan, S.U. Zomaya, A. 2016. A survey and taxonomy on energy efficient resource allocation techniques for cloud computing systems, Computing 98 (7). 751–774.
  45. Lee, C.-W. Hsieh, K.-Y. Hsieh, S.-Y. Hsiao, H.-C. 2014. A dynamic data placement strategy for hadoop in heterogeneous environments, Big Data Res. 1 (C). 14–22.
  46. Younge, A.J. von Laszewski, G. Wang, L. Lopez-Alarcon, S. Carithers, W. 2010. Efficient resource management for cloud computing environments, in: Proceedings of the International Conference on Green Computing, pp. 357–364.
  47. Duan, H. Chen, C. Min, G. Wu, Y. 2017. Energy-aware scheduling of virtual machines in heterogeneous cloud computing systems, Future Gener. Comput. Syst. 74. 142–150.
  48. Hakiri, A. Gokhale, A. Berthou, P. Schmidt, D.C. Gayraud, T. 2014. Software-defined networking: Challenges and research opportunities for Future Internet, Comput. Netw. 75, Part A. 453–471.