Technology is developing rapidly, and it gets smarter with every update. One such technology is the cloud. Google Cloud, AWS, Microsoft Azure, and other cloud providers have all gained significant popularity because of their potential. However, no matter how useful these platforms are, glitches can disrupt them and lead to various problems. One major issue that people run into with cloud technology is the cloud outage.
Glitches occasionally cause an outage in the cloud, and a handful of common causes are responsible for outages on any cloud platform.
We have highlighted below the issues that can lead to a cloud outage. Cloud vendors take all of them into account to ensure that the service consistently meets its SLAs with acceptable reliability.
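To make the SLA idea concrete, here is a minimal sketch of how an availability target translates into a downtime budget. The function name and the percentages are assumptions chosen for illustration; they are not the published SLA figures of any particular provider.

```python
def downtime_budget_minutes(availability_percent, days=30):
    """Minutes of downtime allowed in a period of `days` at the given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


# Hypothetical availability targets, chosen only for illustration.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumed targets, each extra "nine" cuts the allowed downtime by a factor of ten, which is why even a short outage can use up a whole month's budget.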
#1. Power Outage
Power loss is the most common cause of cloud service outages: the electricity that powers the underlying data centers becomes unavailable. The cloud operates on a vast scale. A single data center may consume tens to hundreds of megawatts of power, for which operators typically rely on the national grid or on power plants independently operated by third parties.
This makes the constant availability of sufficient electricity a challenge for data center companies, particularly as rapid growth and scalable market demands require a scalable power source, which is otherwise available only in limited quantity.
#2. Human Error
Companies hire skilled employees and teams that can resolve issues quickly. However, a single incorrect command can bring down an entire IT infrastructure service, even when strong protocols and systems are in place to prevent it.
#3. Cyber Attacks
Cyber-attacks such as Distributed Denial of Service (DDoS) overload data centers with incoming traffic, preventing some users from reaching the service through the corresponding network channels. Even when companies have strong security systems in place, attackers manage to exploit hidden loopholes that trigger protection mechanisms which isolate the services from legitimate users, leak data, or shut the service down altogether.
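One common building block for limiting traffic floods is per-client rate limiting. The sketch below uses a token bucket and is only an illustration: the class names, the rate, and the capacity are assumptions, and rate limiting on its own does not stop a large DDoS attack, which usually has to be filtered further upstream.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests and refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        # Refill according to the time elapsed since the last call.
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one token bucket per client identifier (hypothetical, e.g. an IP address)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

A caller would check `limiter.allow(client_id)` before handling each request and reject the request when it returns `False`, so one noisy client cannot consume capacity meant for everyone else.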
#4. Networking Issues
Cloud providers may partner with telecommunication service providers and government organizations that operate communication networks over vast distances. Problems on these networks are not confined to the cloud provider's own organization, and the cloud provider cannot manage them directly. In such cases, cloud vendors and customers depend on their telecommunication partners to ensure that service is restored.
#5. Maintenance
Cloud providers are responsible for the operation, maintenance, and administration of their IT infrastructure. End users pay only for the services they use, while vendors invest continually in service improvement. Both scheduled and unscheduled maintenance and upgrades are necessary. Maintenance may require suspending service, migrating workloads across data centers, or applying routine fixes that need a full system restart.
#6. Overload or System Failure
Cloud platforms serve a huge number of users, and providing them with adequate service requires a huge number of systems and components. Outages of this type occur when a platform is overloaded by users or when its systems fail.
In the most recent Google Cloud outage, overload was the main reason, and the outage created many problems for the maintenance team as well as for users.
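One general way to protect a service from overload is load shedding: rejecting new work once a limit is reached instead of letting every request slow down. The following is a minimal sketch under that assumption; the `LoadShedder` class and its limit are hypothetical and do not describe how any specific cloud platform handles overload.

```python
import threading


class LoadShedder:
    """Admits a request only while fewer than `max_in_flight` requests are being handled."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        # Returns False when the service is already at capacity.
        with self.lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        # Call once a previously admitted request has finished.
        with self.lock:
            if self.in_flight > 0:
                self.in_flight -= 1
```

A request handler would call `try_acquire()` first, return an error such as "try again later" when it fails, and call `release()` after finishing the work. Failing fast for some users keeps the service responsive for the rest.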
Google Cloud Outage
The most recent Google Cloud outage happened a few months ago. The cause was a failure to scale, which resulted in an outage of G Suite and Google Cloud Platform.
The Google Cloud outage affected several services, including Snap and Discord, as well as Google services such as Gmail and Nest. The problem was first reported by East Coast users in the U.S. in the afternoon, but reports from the outage monitor DownDetector suggest that more countries may have been affected.
On March 26th, Google’s cloud services in multiple regions, including BigQuery, Dataflow, Cloud Firestore, Dialogflow, Kubernetes Engine, App Engine, and Cloud Console, were entirely down for a total of 14 hours. There was some speculation in the weeks that followed, but Google’s internal investigation team then confirmed that the outage was caused by a lack of memory in the company’s cache servers.
In its statement, the company explained that the outage was caused by a bulk update of group memberships that expanded into an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real time. Processing of the backlog was degraded by a latent issue in the cache servers, which led them to run out of memory and caused all requests to IAM (Identity and Access Management) to time out.
The issue was temporarily exacerbated in several regions by emergency rollouts performed to mitigate the high memory usage. To resolve it, the company added more memory to the cache servers and restarted them. However, the problems did not stop there: a large amount of stale data had built up, which caused further issues and took several more hours to stabilize.
The systems were back up and operating at 05:55 AM UTC the following morning. As a precaution for the future, the team stated that it will enable batch processing with parallelization and process updates more frequently, because this outage created so many problems for users. The previous outage happened in November 2019; its cause was a failure of the underlying leader election system, which made components in the control plane lose and regain leadership in quick succession.
These issues were caused by excessive load or component failure, but most importantly, Google recovered from them as quickly as possible and continues to provide its services.
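The batch processing mentioned among the precautions can be pictured with a small sketch. It assumes, purely for illustration, that queued mutations are (group, member, action) tuples and that only the latest action per member matters. This is not Google's actual implementation, and the function names are hypothetical.

```python
def coalesce(mutations):
    """Keep only the latest action per (group, member) pair, in first-seen order."""
    latest = {}
    for group, member, action in mutations:
        latest[(group, member)] = action
    return [(group, member, action) for (group, member), action in latest.items()]


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical backlog of group-membership mutations.
queue = [
    ("eng", "alice", "add"),
    ("eng", "bob", "add"),
    ("eng", "alice", "remove"),
    ("ops", "carol", "add"),
]

for batch in batches(coalesce(queue), size=2):
    print(batch)
```

Collapsing redundant mutations before applying them shrinks the backlog, and fixed-size batches can then be handed to parallel workers instead of being applied one at a time.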
A cloud platform outage can create critical situations for its core users, because it can shut down all work that depends on the platform. It is therefore essential to keep cloud services well maintained, but unexpected events still sometimes cause outages.
In this article, we discussed the most recent Google Cloud Platform outage and the common causes of outages. As we have seen, many different causes can result in a shutdown, and there is no way to predict or prevent every outage. An outage can easily consume the time and effort of a cloud platform's maintenance teams.