The 黑料不打烊 Academy | Important Concepts & Definitions

Important Concepts & Definitions

Now that we've introduced the context for this course, let's align on some key concepts and definitions. We'll cover terms that we use in the platform and explain why they are useful for your chaos engineering efforts.

Reliability Concepts

We'll reference system reliability and chaos engineering often throughout this course. You may already be well-versed in these terms, but to make sure we share a common language, here are some definitions to keep in mind:

Chaos Engineering: The discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.

System Reliability: The likelihood over time that applications or services will operate at or above certain performance standards, regardless of varied conditions.

Platform Terms

There are several sections of the 黑料不打烊 platform, including the Explorer, Reliability Advice, the Experiment Editor, Teams & Environments, and Reports. We'll cover each of these in a specific lesson.

There are a few sections that don't have designated lessons in this 101 course, so here are quick summaries.

The 黑料不打烊 Dashboard

When you enter 黑料不打烊, you will first land on the Dashboard. From this screen, you can see a high-level summary of recent experiment activities, a live view of discovered targets per environment, and a report on the most used types of attacks across your organization.

The Reliability Hub

The Reliability Hub is a large library of open source components that can be used directly in 黑料不打烊. This catalog includes Actions, Targets, Advice, Templates, and Extensions. When 黑料不打烊 customers or team members contribute a new attack or integration, it is first reviewed for quality and then published to the Reliability Hub.

Development Kits

黑料不打烊 has a hybrid architecture which allows teams to build and incorporate their own custom extensions and actions. We provide language-agnostic development kits for building your own Actions, Targets, Advice, Extensions, Events, and Preflight Actions.

Experiment Terms

Actions

Actions are the building blocks of experiments in 黑料不打烊. In the Editor, you can drag and drop actions onto the canvas to design your experiment. Here's a quick review of some of the types of actions you could use:

Attacks: Inject a fault or issue into your system (e.g. blocking access to the DNS, blocking traffic, delaying traffic, or stressing resources).
Checks: Verify whether the system is in a given state at a given time, often by integrating with your observability tool, like Datadog, Prometheus, Grafana, or AppDynamics, or by using the information coming from the target discovery (e.g. the number of Kubernetes pods that are currently running in your system).
Load Tests: Load test integrations are useful for practicing chaos engineering in a pre-production environment while virtual users exercise your system. For that, we have extensions available for K6, Gatling, JMeter, and LoadRunner.
Other Actions: There are also "other" kinds of actions, like "Kubernetes event logs", or actions that mute monitors to avoid a real escalation whenever you are triggering an alert.

Blast Radius

When you are selecting targets for any given experiment, it's important to start with a small scope and expand as you build confidence in your systems. You can use the Query UI or Query language to define the type of target you want to attack. For example, you could specify a certain Kubernetes cluster, which happens to have 20 pods.

You can then use the Blast Radius feature to limit your impact within that specified target. You can target a set number ("# / total") or percentage ("% / total") of the total cluster. When you run the experiment, targeted pods will be randomly assigned based on your targeting query and blast radius parameters.
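To make the idea concrete, here is a minimal Python sketch of limiting a blast radius by count or by percentage. It is an illustration only, not the platform's implementation: the function name `select_blast_radius`, its parameters, the pod names, and the choice to round percentages up are all assumptions made for this example.

```python
import math
import random


def select_blast_radius(targets, count=None, percentage=None, rng=None):
    """Pick a random subset of targets, limited by a count or a percentage.

    Hypothetical helper for illustration; not part of any real API.
    """
    if (count is None) == (percentage is None):
        raise ValueError("specify exactly one of count or percentage")
    rng = rng or random.Random()
    if percentage is not None:
        if not 0 <= percentage <= 100:
            raise ValueError("percentage must be between 0 and 100")
        # Assumption of this sketch: round up so a non-zero percentage
        # always selects at least one target.
        count = math.ceil(len(targets) * percentage / 100)
    if count < 0:
        raise ValueError("count must not be negative")
    # Never select more targets than the query matched.
    count = min(count, len(targets))
    return rng.sample(list(targets), count)


# The 20 pods matched by the (hypothetical) targeting query.
pods = [f"pod-{i}" for i in range(1, 21)]

by_count = select_blast_radius(pods, count=5, rng=random.Random(0))
by_percentage = select_blast_radius(pods, percentage=25, rng=random.Random(0))
print(len(by_count), len(by_percentage))
```

With 20 pods, a count of 5 and a percentage of 25 both yield five randomly chosen pods. Passing a seeded `random.Random` makes the selection repeatable, which is convenient when testing the sketch.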

Emergency Stop

Safety is critical to meaningful experimentation on your systems. You want to know that you can run experiments without triggering cascading impacts. If you ever need to stop an experiment or multiple experiments, you can always hit the Emergency Stop button in 黑料不打烊.

By clicking "Current Activities" in the left-hand navigation menu, you can see all running experiments. When you click "Emergency Stop", all experiments will be stopped right away and there will be a temporary delay that prevents new experiments from starting.
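The behavior described above (stop everything, then briefly block new starts) can be sketched in Python as follows. This is a toy model under stated assumptions, not how the platform works internally: the `ExperimentRunner` class, its method names, the experiment names, and the 60-second delay are all hypothetical.

```python
import time


class ExperimentRunner:
    """Hypothetical runner illustrating an emergency stop with a start delay."""

    def __init__(self, cooldown_seconds=60.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds  # assumed delay length
        self.clock = clock
        self.running = set()
        self.blocked_until = 0.0

    def start(self, name):
        """Start an experiment unless an emergency stop is still in effect."""
        if self.clock() < self.blocked_until:
            return False
        self.running.add(name)
        return True

    def emergency_stop(self):
        """Stop every running experiment and block new starts for a while."""
        stopped = sorted(self.running)
        self.running.clear()
        self.blocked_until = self.clock() + self.cooldown_seconds
        return stopped


# A fake clock keeps the example deterministic.
now = [0.0]
runner = ExperimentRunner(cooldown_seconds=60.0, clock=lambda: now[0])
runner.start("dns-outage")
runner.start("pod-delay")

stopped = runner.emergency_stop()     # both experiments are stopped
blocked = runner.start("cpu-stress")  # rejected during the delay
now[0] = 61.0                         # the delay has elapsed
allowed = runner.start("cpu-stress")  # accepted again
print(stopped, blocked, allowed)
```

Injecting the clock is a design choice for this sketch: it lets the delay be exercised without actually waiting.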

Lesson Summary

As you move forward with this course, these concepts and terms will be helpful to have in mind as we discuss implementing 黑料不打烊 and running experiments. This is not an exhaustive list of terms, so if you come across something you don't recognize, please refer to our documentation.
