黑料不打烊 Academy

Running Your First Experiment

Viewing Experiment Runs in Real-Time

While you're designing an experiment, nothing happens to your system. Everything is safe and you can't harm it in any way. Ready to change that?

Once you click to run an experiment, you will be injecting the designed faults into your system to see the effect. Before you hit run, make sure that you're doing it in an environment that is safe for chaos engineering. As we mentioned earlier, we also recommend that you start with narrow targeting and only expand your scope after you have built confidence in how your systems will respond.

Running Experiments

You can run experiments manually, with our internal scheduler, or with automation. For this lesson, we'll focus on manually running an experiment using the 黑料不打烊 UI.

Once you are ready and comfortable with your experiment design, click "Run Experiment".

You can then switch from the "Design" tab in the Editor to the "Run" tab to view your experiment run step-by-step.

At the top of the screen, you will see the same timeline-based view of steps with a progress bar showing how each step is being executed.

[Image: experiment run view]

Below that, you will see a "Run Status" section that shows the status of each step.

Additional Checks & Monitoring Options

You will see additional sections depending on your installed extensions and experiment designs. For example, here are some common sections:

  • "HTTP Responses for 'X' Service": If you want to monitor whether your fault injection has significantly degraded performance, you can run an HTTP Check action against your service throughout the experiment and see how the responses change in real-time.
  • "Kubernetes Events": If your experiment design includes a Kubernetes event check, you'll see a section with Kubernetes event logs.
  • "Deployment Readiness": If you add a deployment readiness check, you'll be able to monitor whether you drop below a desired number of pods for any given deployment.
  • Observability Integration: If you have installed an extension for your observability tool of choice, you can add an action that collects information about the state of your monitors and whether they detect issues as expected.
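
To make the idea of a check concrete: a check like the HTTP responses section boils down to comparing a success rate against a threshold. The sketch below is purely illustrative (the function name and 95% threshold are assumptions, not the platform's actual implementation):

```python
def http_check_passed(status_codes, min_success_rate=0.95):
    """Return True if the share of 2xx responses meets the threshold.

    Illustrative only: the real HTTP Check action is configured in the UI;
    this just shows the kind of evaluation such a check performs.
    """
    if not status_codes:
        return False
    ok = sum(1 for code in status_codes if 200 <= code < 300)
    return ok / len(status_codes) >= min_success_rate

# During a fault injection, a burst of 503s can push the rate below the bar:
print(http_check_passed([200, 200, 200, 200]))  # True
print(http_check_passed([200, 503, 503, 200]))  # False at a 95% bar
```

If the rate dips below the configured bar at any point during the run, the check (and therefore the experiment) is marked as failed.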

With additional checks, you'll be able to clearly see in one place how your systems respond to this type of event.

Stopping Running Experiments

If you need to stop an experiment for any reason, you can just hit the red stop button in the "Run" view. You can also use the "Emergency Stop" action while viewing the "Current Activities" tab in the left-hand navigation to stop all running experiments and temporarily prevent new experiments from being executed.

If you stop a running experiment, you will see it labelled with the "Canceled" experiment status.

Experiment Results

When your experiment has finished running uninterrupted, you will see one of the following statuses:

  • Completed: The entire experiment and all actions were successfully executed with no failures or errors.
  • Failed: The experiment run failed because the results did not support the hypothesis, or because one of your checks failed, such as the HTTP Check action not meeting a specified success rate.
  • Errored: The experiment failed to run due to a technical issue and will require some debugging to troubleshoot.

A completed experiment shows that your hypothesis was correct. A failed experiment shows that your expectation was not met and you need to revisit either your hypothesis or how your systems are configured.
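
One way to picture how the three statuses relate is as simple decision logic (purely illustrative, not the platform's code): a technical error preempts everything, and otherwise the checks decide between success and failure.

```python
def run_status(had_technical_error, all_checks_passed):
    # "Errored" takes precedence: the run never produced a usable result.
    if had_technical_error:
        return "Errored"
    # Otherwise the hypothesis decides: checks held -> Completed, else Failed.
    return "Completed" if all_checks_passed else "Failed"

print(run_status(False, True))   # Completed
print(run_status(False, False))  # Failed
print(run_status(True, False))   # Errored
```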

Debugging with Agent Logs & Tracking

In the "Run" view, you will start on the "Attack Monitor" tab by default, but you can also toggle over to "Agent Logs" or "Tracing" for more information.

Agent Logs are helpful if you want to debug an experiment run. If you are troubleshooting with the 黑料不打烊 team, these logs will be especially helpful.

Tracing records the communication between the 黑料不打烊 platform and agents. We collect distributed tracing spans using OpenTelemetry. This information is also used primarily for debugging.
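
To make the tracing data concrete: an OpenTelemetry-style span records a name, a trace ID shared by all spans in one run, timing, and a parent link that ties a step's span back to the run's span. The schema below is a simplified, stdlib-only illustration, not the platform's actual span format:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                    # shared by every span in one run
    parent_id: Optional[str] = None  # links a step's span to its parent
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None

    def end(self):
        self.end_ns = time.time_ns()

# One trace per experiment run; each step becomes a child span:
trace_id = uuid.uuid4().hex
run = Span("experiment-run", trace_id)
step = Span("http-check", trace_id, parent_id=run.span_id)
step.end()
run.end()
```

Walking the parent links in the "Tracing" tab lets you reconstruct which agent handled which step, and when, during a run.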

Lesson Summary

Now you know how to run an experiment and watch it unfold in real-time. You can see whether your system responds as you expected and iterate to better understand your systems and improve their reliability. You also know what to look for when you run into an error and need to debug it. Next, we'll explain how you can scale your approach with run scheduling and automation.