Gremlin updated its “failure-as-a-service” platform with the ability to automatically identify Docker containers and to simulate real-world outages. The firm said the updates help organizations build more resilient container environments using chaos engineering.
The Gremlin software-as-a-service (SaaS) platform recreates common Docker container failures across three categories: resource, network, and state. It allows developers to see how the system reacts to failures, validates that defense mechanisms will work to prevent system outages, and minimizes the blast radius of testing for safe experimentation in production environments.
Gremlin CEO Kolton Andrus said the company’s platform was initially built to run in bare metal environments. The update now allows the platform to run in a containerized environment. “We are making containers first-class citizens,” Andrus said. “We see a lot of customers are beginning to look at migrating workloads to a Kubernetes and containerized world, but they need to build trust and confidence in those platforms before they make the move.”
Customers can deploy the Gremlin platform as a container to test a container pod, or it can be programmed to attack specific containers. This allows for more granularity in terms of probing for potential failures. It can also test multiple potential failure points at one time.
“We can max-out the memory or CPU of a system while attacking different points,” Andrus said. “This is more reflective of real-world environments.”
Blast Radius
This level of detail also allows for testing in production environments.
“You really need to be able to test in production,” Andrus said. “You can start in the [development] or testing phase but you ultimately want to test in production to see how that containerized platform is going to perform.”
With Gremlin, a company can target a specific container rather than an entire host. “At the host level you can have 20 containers, and if you attack that host and it breaks, you have 20 broken containers,” he said. “We can now just break one container and reduce that blast radius.”
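The arithmetic behind Andrus's example can be illustrated with a toy model. The sketch below is not Gremlin's API or implementation; the `Host` class, the `blast_radius` function, and the container names are hypothetical and exist only to show how targeting one container shrinks the set of affected workloads compared with attacking the whole host.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical model of a host running a set of containers."""
    name: str
    containers: List[str] = field(default_factory=list)


def blast_radius(host: Host, target: Optional[str] = None) -> List[str]:
    """Return the containers affected by a simulated attack.

    With no target, the attack is at the host level and every container
    on the host is affected. With a target, only that container is.
    """
    if target is None:
        return list(host.containers)
    if target not in host.containers:
        raise ValueError(f"{target!r} is not running on {host.name}")
    return [target]


# A host with 20 containers, as in the example quoted above.
host = Host("host-a", [f"container-{i}" for i in range(20)])

host_level = blast_radius(host)
container_level = blast_radius(host, "container-3")

print(len(host_level))       # 20
print(len(container_level))  # 1
```

In this model, a host-level attack touches all 20 containers, while a container-level attack touches only the one named.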
Andrus had previously worked on similar projects at Amazon.com and Netflix. While at Netflix, he worked with that firm’s open source Chaos Monkey platform. He explained that the Gremlin platform allows for more detailed control over testing compared with Chaos Monkey.
“[Chaos Monkey] has been great for generating awareness, but it just randomly breaks things, and I don’t agree with that strategy,” Andrus said. “There was a time for that but we are more focused on safer testing.”
The Gremlin platform is proprietary, which Andrus said was needed to ensure security and stability of the product. He did note that the company was looking to open source some aspects of the platform at some point. The firm is also looking at tighter integration with the Kubernetes container orchestration platform.
Andrus formed Gremlin last year with Matthew Forni, who serves as CTO. The company has secured $8.8 million in funding from Amplify Partners and Index Ventures.