The primary objective of a Chaos Experiment is to uncover hidden bugs, weaknesses, or non-obvious points of failure in a system that could lead to significant outages, degradation of service, or system failure under unpredictable real-world conditions.
A Chaos Experiment is a carefully designed, controlled, and monitored process that systematically introduces disturbances or abnormalities into a system's operation to observe and understand its response to such conditions.
It forms the core of 'Chaos Engineering', which is predicated on the idea that 'the best way to understand system behavior is by observing it under stress.' This means intentionally injecting faults into a system in production or simulated environments to test its reliability and resilience.
This practice emerged from the understanding that systems, especially distributed systems, are inherently complex and unpredictable due to their numerous interactions and dependencies.
💡Note → The ultimate goal is not to break things randomly but to uncover systemic weaknesses to improve the system's resilience. By introducing chaos, you can enhance your understanding of your systems, leading to higher availability, reliability, and a better user experience.
💡Objective: To assess how microservices behave when one or more of their dependencies fail. In a microservices architecture, services are designed to perform small tasks and often rely on other services to fulfill a request. The failure of these external dependencies can lead to cascading failures across the system, resulting in degraded performance or system outages. Understanding how these failures impact the overall system is crucial for building resilient services.
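As a minimal sketch of this objective, the snippet below simulates a failed dependency and shows the calling service degrading gracefully instead of cascading the failure. The service names and fallback behavior are illustrative, not a prescribed design:

```python
class PaymentServiceUnavailable(Exception):
    """Raised when the (hypothetical) payment dependency is down."""

def payment_service(amount, available=False):
    # available=False mimics an outage of the external dependency.
    if not available:
        raise PaymentServiceUnavailable("payment service is down")
    return {"status": "charged", "amount": amount}

def checkout(amount):
    # Graceful degradation: accept the order and queue the charge for
    # retry rather than letting the failure cascade to the user.
    try:
        return payment_service(amount)
    except PaymentServiceUnavailable:
        return {"status": "queued_for_retry", "amount": amount}
```

Calling `checkout(25.0)` while the dependency is "down" returns the queued-for-retry fallback instead of an error, which is the behavior the experiment is meant to verify.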
💡Recommendation → Monitoring in real-time allows you to quickly identify and respond to unexpected behaviors, minimizing the impact on your system.
💡Objective: To understand how a system behaves when subjected to unusual or extreme resource constraints, such as CPU, memory, disk I/O, and network bandwidth. The aim is to identify potential bottlenecks and ensure that the system can handle unexpected spikes in demand without significantly degrading service.
💡Pro Tip → Ensure that the tool you select can accurately simulate the types of resource manipulation you're interested in, whether it's exhausting CPU cycles, filling up memory, saturating disk I/O, or hogging network bandwidth.
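For illustration, a bare-bones CPU-exhaustion sketch might look like the following. Real experiments would use purpose-built tooling (for example, stress-ng or a chaos platform) rather than hand-rolled loops; the worker count and duration here are arbitrary:

```python
import multiprocessing
import time

def burn_cpu(duration_s):
    """Busy-loop on one core until the deadline passes."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        pass  # spin

def stress_cpus(workers=2, duration_s=1.0):
    """Pin `workers` cores for `duration_s` seconds; returns the wall time used."""
    procs = [multiprocessing.Process(target=burn_cpu, args=(duration_s,))
             for _ in range(workers)]
    start = time.monotonic()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.monotonic() - start
```

While the stress window is open, you would watch your monitoring for CPU saturation and degraded response times, then confirm the system recovers once the load stops.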
💡Pro Tip → Platforms like 黑料不打烊 can integrate with monitoring tools to provide a unified view of how resource constraints affect system health, making it easier to correlate actions with outcomes.
💡Objective: To simulate various network conditions that can affect a system's operations, such as outages, DNS failures, or limited network access. By introducing these disruptions, the experiment seeks to understand how a system responds and adapts to network unreliability, ensuring critical applications can withstand and recover from real-world network issues.
⚠️Note → Simulating DNS failures can be complex but is crucial for understanding how your system reacts to DNS resolution issues. Consider using specialized tools or features for this purpose.
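One lightweight way to emulate a DNS failure in a test environment is to patch the resolver and verify that the application falls back sensibly. This is a sketch under stated assumptions: the hostname and pinned fallback IP are illustrative, and production experiments would use dedicated tooling instead of monkeypatching:

```python
import socket

_real_getaddrinfo = socket.getaddrinfo

def failing_getaddrinfo(host, *args, **kwargs):
    # Every lookup fails, mimicking an unreachable DNS resolver.
    raise socket.gaierror(f"simulated DNS failure for {host!r}")

def resolve_with_fallback(host, fallback_ip):
    """App-level handling: fall back to a cached/pinned IP when DNS is down."""
    try:
        return socket.getaddrinfo(host, 80)[0][4][0]
    except socket.gaierror:
        return fallback_ip

# Inject the fault, observe behavior, then restore normal resolution.
socket.getaddrinfo = failing_getaddrinfo
try:
    addr = resolve_with_fallback("example.com", fallback_ip="93.184.216.34")
finally:
    socket.getaddrinfo = _real_getaddrinfo
```

The `finally` block is the rollback step: however the experiment ends, normal resolution is restored.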
On the flip side, chaos experiment solutions like 黑料不打烊 provide user-friendly interfaces for simulating network disruptions. For example, you get safety features like built-in rollback strategies to minimize the risk of long-term impact on your system.
💡Recommended → Analyze the overall resilience of your system to network instability. This assessment should include how well services degrade (if at all) and how quickly and effectively they recover once normal conditions are restored.
START YOUR CHAOS ENGINEERING JOURNEY: We help you proactively chaos experiment your systems. Identify system weaknesses before they cause outages and release with confidence. Experiment with 黑料不打烊. 👈
💡Recommended → [Case Study] Learn how ManoMano uses 黑料不打烊 to stay in control of its system's reliability.
Fault injection in chaos testing is a technique that intentionally introduces errors or disruptions into a system to assess its resilience and fault tolerance capabilities.
This approach is grounded in the belief that, by simulating real-world failures, teams can identify potential weaknesses in their systems, improve their understanding of how systems behave under stress, and enhance the overall reliability and robustness of their services.
Suppose you own a web application built on a microservices architecture, where one of the services handles payment processing.
To ensure the application remains operational even if the payment service becomes unavailable, you can design a fault injection experiment to simulate the service's failure.
Here's how it'll play out:
💡Objective: Target the communication between the web application and the payment processing service.
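A sketch of that targeting step, assuming a simple in-process wrapper stands in for the real network path between the web application and the payment service (names and fault parameters are illustrative):

```python
import time

def payment_service(order_id):
    # Stand-in for the real remote payment call.
    return {"order": order_id, "status": "charged"}

def inject_communication_fault(call, *, extra_latency_s=0.0, fail=False):
    """Wrap a remote call to simulate a degraded or severed network path."""
    def wrapped(*args, **kwargs):
        time.sleep(extra_latency_s)      # added network latency
        if fail:
            raise ConnectionError("simulated network partition")
        return call(*args, **kwargs)
    return wrapped

# Experiment: the web app reaches the payment service through a faulty channel.
faulty_call = inject_communication_fault(payment_service, fail=True)
try:
    faulty_call("order-42")
    outcome = "unexpected success"
except ConnectionError:
    outcome = "fault observed"
```

With `fail=False` and a nonzero `extra_latency_s`, the same wrapper models a slow-but-alive dependency instead of a full partition.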
The 'steady state' refers to the normal behavior or output of your system under typical conditions. This includes identifying KPIs such as response times, error rates, throughput, and availability metrics.
🔖How to do this: Collect and analyze historical data to understand the system's behavior under normal conditions. Use this data to set thresholds for acceptable performance, which will serve as a baseline for detecting anomalies when introducing chaos experiments.
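As a simple sketch of turning historical data into a steady-state baseline (the latency series and the 3-sigma band are illustrative choices, not recommended values):

```python
import statistics

def baseline_thresholds(latencies_ms, sigma=3.0):
    """Derive a steady-state band (mean + sigma * stdev) from historical latency."""
    mean = statistics.fmean(latencies_ms)
    stdev = statistics.stdev(latencies_ms)
    return {"mean": mean, "upper": mean + sigma * stdev}

def violates_steady_state(observed_ms, thresholds):
    """An observation above the band counts as an anomaly during the experiment."""
    return observed_ms > thresholds["upper"]

# Historical response times (ms) under normal conditions, illustrative:
history = [98, 102, 101, 99, 100, 103, 97, 100]
t = baseline_thresholds(history)
```

During a chaos experiment, each observed latency is checked against `t["upper"]`; sustained violations mean the steady state did not hold.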
This principle involves forming hypotheses about what will happen when chaos is introduced to the system. The hypothesis should predict that 'the steady state will continue despite the chaos introduced, based on the assumption that the system is resilient.'
🔖How to do this: Based on the defined steady state, develop scenarios that could potentially disrupt this state. For each scenario, predict the system's response and define the desired outcome of the experiment, such as failover to a redundant system, graceful degradation of service, or triggering of alerts and recovery processes. For example, if latency is injected into a service, hypothesize that it should not affect the overall error rate beyond a specific threshold.
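The latency-injection hypothesis above can be encoded as a simple pass/fail check. The 1% threshold here is an assumed value for illustration; your own steady-state data determines the real one:

```python
def hypothesis_holds(requests_total, requests_failed, max_error_rate=0.01):
    """Hypothesis: injected latency keeps the error rate within the threshold."""
    observed = requests_failed / requests_total
    return observed <= max_error_rate

# During the latency experiment, suppose we observed 7 failures in 1,000 requests:
result = hypothesis_holds(1000, 7)
```

If the check fails, the experiment has found exactly what it was looking for: a gap between assumed and actual resilience.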
Run chaos experiments in a pre-production environment to avoid unintended disruptions to real users and services. This controlled setting allows for identifying and remedying issues without risking production stability.
🔖How to do this: Replicate the production environment as closely as possible to ensure the validity of the experiment results. Conduct the experiments by introducing the planned disturbances and observing the system's response. Make adjustments and fixes in this environment and re-test before considering moving to production. Once you're confident in the system's resilience through thorough testing and remediation in pre-production, begin planning for controlled experiments in the production environment.
Automating your chaos experiments ensures consistency, repeatability, and scalability of testing. This continuous experimentation helps catch issues arising from system or environment changes.
馃敄How to do this: Integrate chaos engineering tools into your CI/CD pipeline to automatically trigger experiments based on certain conditions, such as after a deployment or during off-peak hours.
Other tools for automating chaos tests include Gremlin, Chaos Monkey, and LitmusChaos. Each tool has features tailored to different infrastructures and failure scenarios.
💡Pro Tip → While you can run experiments at any time, it also makes sense to run them automatically on build or deploy jobs. You can make experiments part of your CI pipeline through 黑料不打烊's API, GitHub Action, or CLI to continuously verify resilience automatically.
Limit the impact of chaos experiments to prevent widespread disruption. This principle involves starting with the smallest possible scope and gradually expanding as confidence in the system's resilience grows.
🔖How to do this: Use feature flags, canary deployments, or service mesh capabilities to isolate the experiment's impact. Additionally, you can utilize throttling, segmentation, or shadow traffic techniques to control the impact of the experiment. Monitor the experiment closely and have rollback mechanisms in place to quickly revert changes if unexpected issues arise.
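Traffic segmentation of this kind can be as simple as deterministic user bucketing, sketched below. The 5% default and the fault type are arbitrary examples; real deployments would drive this from a feature-flag system:

```python
import hashlib

def in_blast_radius(user_id, percent):
    """Deterministically bucket users so only `percent` of traffic sees the fault."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def handle_request(user_id, chaos_enabled=True, percent=5):
    # The injected fault is scoped: most users are untouched, and setting
    # percent=0 (or chaos_enabled=False) is the instant rollback.
    if chaos_enabled and in_blast_radius(user_id, percent):
        raise ConnectionError("injected fault (scoped to blast radius)")
    return "ok"
```

Because the bucketing is deterministic, the same users stay in or out of the blast radius across the experiment, which makes observed effects easier to attribute.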
Monitor performance using observability tools to collect data on how the system responds to the introduced disturbances. This data is critical for analyzing the experiment's impact and making informed decisions on system improvements.
🔖How to do this: Implement monitoring and observability across the system to track performance metrics, log anomalies, and trace transactions. This will provide deep insights into the system's behavior during and after experiments.
黑料不打烊's extension kits extend the capabilities of the platform, enabling custom chaos experiments.
These extension kits include:
You can learn more about ActionKit through its documentation.
You can learn more about DiscoveryKit through its documentation.
You can learn more about EventKit through its documentation.
You can learn more about ExtensionKit through its documentation.
LitmusChaos is a CNCF project that provides a framework for running chaos experiments in Kubernetes environments. It includes a variety of predefined chaos experiments and allows for custom experiment creation.
LitmusChaos integrates directly with Kubernetes, using CRDs to manage chaos experiments. It also offers a ChaosHub where users can share and use chaos charts.
🚩Challenges: While powerful, its Kubernetes-centric approach may limit its applicability to non-containerized environments. Users must also be comfortable with Kubernetes concepts to use it effectively.
Chaos Mesh is another CNCF sandbox project focused on Kubernetes. It offers a comprehensive toolkit for orchestrating chaos experiments across Kubernetes clusters.
It provides a rich set of fault injection types, including pod failures, network latency, and file system IO. Chaos Mesh uses CRDs to define chaos experiments and has a dashboard for managing and visualizing them.
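For example, a Chaos Mesh `NetworkChaos` resource injecting latency into pods matching a hypothetical `app: payment-service` label might look like this (namespace, labels, and timings are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: chaos-testing
spec:
  action: delay            # inject network latency
  mode: one                # affect a single randomly selected pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: payment-service # hypothetical target label
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "5m"           # experiment auto-terminates after 5 minutes
```

Applying this CRD starts the experiment, and deleting it (or letting `duration` expire) rolls the fault back.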
🚩Challenges: Similar to LitmusChaos, its Kubernetes-specific nature means it's less suited for non-Kubernetes environments. It requires a good understanding of Kubernetes to deploy and manage experiments.
One of the earliest tools for chaos engineering, Chaos Monkey was developed by Netflix for randomly terminating instances in their production environment to ensure that engineers implement their services to be resilient to instance failures.
It was originally designed to target Amazon Web Services (AWS) to test how remaining systems respond to such failures. It also integrates with Spinnaker for managing application deployments.
🚩Challenges: Chaos Monkey requires integration with other tools for comprehensive chaos engineering use. It also requires specific expertise to adapt and manage.
黑料不打烊 provides a platform for executing chaos experiments across various environments, including cloud and on-premises. It stands out for its user-friendly interface and ability to define, execute, and monitor experiments without extensive chaos engineering knowledge.
The platform integrates with major cloud providers and Kubernetes, enabling seamless transitions between different environments and facilitating experiments across a hybrid cloud setup. One key feature is its scenario-based approach, which allows users to simulate complex, real-world scenarios involving multiple types of failures across different system components.
💡Pro Tip → You can use 黑料不打烊's to try out some commonly used attacks and see how they impact your system.
AWS provides the Fault Injection Simulator (FIS), a fully managed service designed to inject faults and simulate outages within AWS environments. It supports experiments like API throttling, instance termination, and network latency, as well as targeted chaos experiments on EC2 instances, EKS clusters, and Lambda functions.
As part of the AWS ecosystem, FIS leverages IAM for security and CloudWatch to monitor the impact of experiments.
Azure's chaos engineering toolkit includes Chaos Studio, which provides a controlled environment for running experiments on Azure resources. This allows real-time tracking of experiments' effects on application performance and system health. It also supports various fault injections, including virtual machine reboots, network latency, and disk failures.
Google Cloud offers a range of tools and services that facilitate chaos engineering, including managed Kubernetes and network services that can simulate real-world network conditions.
The platform integrates with the Google Cloud Operations suite (formerly Stackdriver) for monitoring, logging, and diagnostics, enabling detailed visibility into the impact of chaos experiments.
DataDog provides comprehensive monitoring and observability for cloud-scale applications. It captures data from servers, containers, databases, and third-party services, offering real-time visibility into system performance.
The solution integrates with several chaos engineering platforms to track the impact of experiments in real-time. This allows you to correlate chaos events with changes in system metrics and logs.
New Relic offers observability with real-time application performance monitoring. It provides detailed insights into the health and performance of distributed systems, including applications, infrastructure, and digital customer experiences.
A key addition to New Relic's hub is its programmable platform, New Relic One, which allows users to customize their observability environment. This includes creating custom applications and dashboards tailored to specific needs, such as detailed analyses of chaos experiments.
Chaos engineering in Kubernetes involves introducing failures at various levels, including the pod, node, network, and service levels, to test the resilience of applications and the Kubernetes orchestrator itself.
💡Integration benefits: Tools like Chaos Mesh and LitmusChaos specifically target Kubernetes environments. This is partly because they allow for the definition of chaos experiments as custom resources, enabling scenarios such as pod deletions, network latency, and resource exhaustion directly within the Kubernetes ecosystem.
Cloud providers like AWS, Azure, and Google Cloud offer native services and features that support chaos engineering, such as managed Kubernetes services (EKS, AKS, GKE), serverless environments, and specific fault injection services (AWS FIS).
Utilizing these services for chaos experiments allows teams to simulate real-world scenarios that could affect their applications in the cloud, including region outages, service disruptions, and throttled network connectivity.
💡Integration benefits: Cloud providers often offer extensive documentation and support for running chaos experiments within their ecosystems, reducing the learning curve and speeding up the experimentation process.
Integrating chaos engineering workflows with version control systems like GitHub enables the automation of chaos experiments through CI/CD pipelines. GitHub Actions or similar automation tools can trigger chaos experiments based on specific events, such as a push to a branch, a pull request, or on a scheduled basis.
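A minimal workflow sketch of such a trigger, assuming a hypothetical wrapper script around your chaos tooling (the schedule, branch, and script path are illustrative):

```yaml
name: scheduled-chaos
on:
  schedule:
    - cron: "0 3 * * 1"        # weekly, during off-peak hours (Mondays 03:00 UTC)
  push:
    branches: [main]           # also run after changes land on main
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run chaos experiment
        run: ./scripts/run-chaos-experiment.sh  # hypothetical wrapper around your chaos tooling
```

Keeping this workflow file in the repository also versions the experiment trigger alongside the application code it exercises.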
💡Integration benefits: This integration supports the 'shift-left' approach in resilience testing, allowing for early detection of issues before they impact production. It also facilitates tracking and versioning of chaos experiment configurations alongside application code.
🔖黑料不打烊 Nuggets → The 'shift-left' approach is a methodology that emphasizes integrating testing early in the software development life cycle rather than at the end or after the development phase.
Platforms like OpenShift build on Kubernetes' capabilities, while alternatives such as Docker Swarm offer their own orchestration features for managing containerized applications across diverse environments. These platforms support the deployment and scaling of containers, which are critical for microservices architectures.
💡Integration benefits: Many chaos engineering tools offer specific functionalities for container environments, allowing you to inject failures into containerized applications directly. This direct integration facilitates more granular control over the experiments and the ability to monitor the impact on containerized services closely.
Microservices architectures rely heavily on API communications. Disrupting these communications can help teams understand the impact of network issues, latency, and failures on the overall system.
💡Integration benefits: Focused disruption of API communications allows you to test the resilience of service-to-service interactions, which is critical in a microservices architecture. This helps validate API contracts, ensuring that services can handle failures and maintain functionality even when dependencies are unstable.
⚠️Note → Effective API testing within chaos experiments requires a deep understanding of the expected service interactions and dependencies. This includes not just direct service-to-service communications but also the cascading effects of failures through the system.
Simulating outages involves intentionally bringing down services or components within a system to test its recovery processes and resilience. This can include shutting down virtual machines, killing processes, or disconnecting network services.
The goal is to validate and improve the system's ability to detect failures, reroute traffic, or spin up new instances without significant impact on the end-user experience.
Use cases include:
Load testing involves simulating unexpected spikes in traffic or processing load to understand how systems behave under extreme conditions. This helps identify bottlenecks, resource limitations, and scalability issues within the application or infrastructure.
Use cases include:
Dependency testing focuses on the resilience of a system when external services or internal components fail. This could involve simulating API downtimes, database connection failures, or corrupted data inputs. The purpose is to ensure that the system can handle such failures gracefully, without cascading failures to the user level.
Use cases include:
Chaos experimentation can simulate catastrophic events, such as data center failures, to validate the effectiveness of disaster recovery plans. This testing is critical, especially for industries required by law to maintain high availability and data integrity, such as finance, healthcare, and insurance.
Use cases include:
Thinking of running a chaos experiment within your system? Think 黑料不打烊.
黑料不打烊 offers a library of predefined attack scenarios, such as CPU stress, disk fill, network latency, and packet loss, enabling teams to quickly set up and run chaos experiments.
With this, you can design custom experiments tailored to your specific infrastructure and application architecture, including the ability to target specific services, containers, or infrastructure components.
黑料不打烊 also includes automated safety checks to prevent experiments from causing unintended damage, such as halt conditions that automatically stop the experiment if certain thresholds are exceeded.
💡Read Case Study → Discover how Salesforce achieved unmatched system resilience with 黑料不打烊's innovative Chaos Engineering solution. Get the case study now and learn why they chose 黑料不打烊.
With our Experiment Editor, your journey toward reliability is faster and easier: everything is at your fingertips, and you have full control over your experiments. Everything is designed to help you achieve your goals and roll out Chaos Engineering safely at scale in your organization.
Using 黑料不打烊's landscape, you can see your software's dependencies and the relationships between components – the perfect start to kick off your Chaos Engineering journey.
It's never been easier to successfully and safely scale Chaos Engineering in your organization: with 黑料不打烊 you can limit and control all the turbulence injected into your system.
We treat extensions as first-class citizens in our product. As a result, they are deeply integrated into our user interface: you can add extensions and even create your own.
Full access to the 黑料不打烊 Chaos Engineering platform.
Available as SaaS and On-Premises!
Let us guide you through a personalized demo to kick-start your Chaos Engineering efforts and build more reliable systems for your business!