Intel Telemetry Meets Containers

JoshHilliker · 10-15-2020

It’s been a little while since I last posted, but the flow of cool telemetry for the Modern Autonomous Data Center has not slowed down, to say the least. In past blogs I covered the what, the why, Intel Telemetry, the how, and OTP startup, so let’s start there. One Telemetry Package [OTP] is now called Intel® Telemetry Collector; it has been extended and its metrics have been improved. We’ve also posted a few videos on use cases; watch the Modern Autonomous Data Center track.

Let’s shift gears to containers. Samantha and I have been hearing for a while, out on the virtual road, that we need to think more about container telemetry. Specifically, what can Intel do to help you obtain the right platform telemetry for each container or workload? This post is all about container telemetry, so let’s jump in!

First, let’s talk about containers. A container is a standardized unit of software: a lightweight, standalone, executable package that includes everything needed to run an application (code, runtime, system tools, and system libraries). Containers are ideal for cloud native applications, stateless microservices, and scale-out applications. They enable developing new applications for the cloud, running single-tenant clusters, and maximizing the number of applications that can be deployed per server (i.e. packing more compute into each node, or better said, “increasing density”).

Second, as Mr. David Shade reminded me recently, if we talk about containers, we have to talk about how they are deployed to get the complete picture, and that journey starts with cloud orchestration. Cloud orchestration is the automated coordination and management of cloud resources. Examples of orchestrators are Apache Mesos, Docker Swarm, and Kubernetes. These tools scale containers up and down and also help roll out new management containers to do things like telemetry (of course I’m going to use that example). For example, you can create one telemetry container that collects the key “must have” metrics and push it to all nodes, with each node pushing its data to (or having it pulled by) the management node for reporting.
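
To make the “one telemetry container on every node” idea a bit more concrete, here is a minimal sketch for Kubernetes, where a DaemonSet is the object that schedules one copy of a pod on each node. Everything specific in it is an assumption for illustration only: the cadvisor-perf name, the image reference you pass in (a cAdvisor image you built yourself with perf support), and the location of the perf config inside that image.

```bash
#!/usr/bin/env bash
# Print a DaemonSet manifest that runs one telemetry container per node.
# $1 is the container image: a placeholder you supply, not an official image.
# The perf config path in "args" is assumed to be baked into that image.
telemetry_daemonset_manifest() {
  local image="$1"
  cat <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor-perf
spec:
  selector:
    matchLabels:
      app: cadvisor-perf
  template:
    metadata:
      labels:
        app: cadvisor-perf
    spec:
      containers:
      - name: cadvisor
        image: ${image}
        args: ["-perf_events_config=/etc/cadvisor/perf.json"]
        securityContext:
          privileged: true
        ports:
        - containerPort: 8080
          hostPort: 8080
        volumeMounts:
        - {name: rootfs, mountPath: /rootfs, readOnly: true}
        - {name: sys, mountPath: /sys, readOnly: true}
        - {name: var-run, mountPath: /var/run, readOnly: true}
        - {name: docker, mountPath: /var/lib/docker, readOnly: true}
      volumes:
      - {name: rootfs, hostPath: {path: /}}
      - {name: sys, hostPath: {path: /sys}}
      - {name: var-run, hostPath: {path: /var/run}}
      - {name: docker, hostPath: {path: /var/lib/docker}}
EOF
}

# Usage (needs kubectl and access to a cluster):
#   telemetry_daemonset_manifest "registry.example.com/cadvisor-perf:dev" | kubectl apply -f -
```

With hostPort set, each node exposes its own collector on :8080, which is what lets a management node pull from every node in turn.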

Third, now we can dive into container telemetry, specifically Google cAdvisor. cAdvisor (Container Advisor) gives container users an understanding of the resource usage and performance characteristics of their running containers. It is a running daemon that collects, aggregates, processes, and exports information about running containers. More specifically, it captures resource usage per control group (cgroup), which can be applied to containers or VMs, but in this blog we are focusing only on containers. For each container it keeps resource isolation parameters, historical resource usage, and histograms of complete historical resource usage. This data is exported both per container and machine-wide. We have worked to integrate performance counter support into the cAdvisor repo, so you can pull selected perf events for each container and for the host node as well. Here is a reference architecture we are leveraging to capture, collect, store, and visualize.
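
Since everything cAdvisor reports is keyed by cgroup, it can help to see which cgroup a process actually lives in. The kernel exposes this in /proc/<pid>/cgroup, one line per hierarchy in the form hierarchy-ID:controllers:path. The helper below is a small sketch that pulls the path out of that format; the function name is made up for this post.

```bash
# cgroup_path: read /proc/<pid>/cgroup content on stdin and print the cgroup path.
# $1 is a cgroup v1 controller name such as "cpu". With no argument, fall back to
# the unified (cgroup v2) entry, whose controller field is empty ("0::/path").
cgroup_path() {
  awk -F: -v ctrl="${1:-}" '
    {
      path = $3
      # Rejoin the path if it happens to contain ":" itself.
      for (i = 4; i <= NF; i++) path = path ":" $i
      if (ctrl == "" && $2 == "") { print path; exit }
      n = split($2, names, ",")
      for (i = 1; i <= n; i++)
        if (ctrl != "" && names[i] == ctrl) { print path; exit }
    }'
}

# Usage:
#   cgroup_path cpu < /proc/self/cgroup    # cgroup v1
#   cgroup_path < /proc/self/cgroup        # cgroup v2
```

For a Docker container on cgroup v1 the path typically looks like /docker/<container-id>, which is the kind of value you will see identifying a container in cAdvisor’s output.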

How do you get started? Here’s a reference script that pulls down cAdvisor with the perf support pull request applied, builds it, and starts cAdvisor with perf events support:

# Clone the cAdvisor source into a local directory.
# DIR is not set by this snippet; defaulting to the current directory is an assumption.
DIR="${DIR:-$PWD}"
echo "Download cAdvisor source..."
CADVISOR_SRC="${DIR}/build/src"
CADVISOR_COMMIT=a6e4fcb
mkdir -p "$CADVISOR_SRC" && git clone https://github.com/google/cadvisor "$CADVISOR_SRC"
cd "$CADVISOR_SRC"
# Pin to a particular commit and apply the patch (see https://github.com/google/cadvisor/pull/2611)
git reset --hard "$CADVISOR_COMMIT" && curl https://patch-diff.githubusercontent.com/raw/google/cadvisor/pull/2611.patch | git apply

# Still in ${DIR}/build/src from above: build with perf support (the libpfm tag is the important part)
GO_FLAGS="-tags=libpfm,netgo" make build

# Start cAdvisor with perf events support
sudo ./cadvisor -perf_events_config=perf/testing/perf.json

After starting cAdvisor, you can browse an individual node directly on port :8080 and dig into its containers; the management node, however, will scrape :8080/metrics into your TSDB. Here’s a sample graph of the cycles per instruction metric. The series labeled {id="/"} carries the metrics for the root (“main”) cgroup.
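
If you want a cycles per instruction number without a dashboard, it can be derived straight from the /metrics text. The sketch below assumes the perf counters are exported as container_perf_events_total with event and id labels (check your own /metrics output, since names can differ between builds) and that your perf config collects events named cycles and instructions. The function name is made up for this post.

```bash
# cpi_for_cgroup: read Prometheus text-format metrics on stdin and print
# cycles / instructions for one cgroup id ($1), summed across all CPUs.
# Assumes the metric name container_perf_events_total and events named
# "cycles" and "instructions"; adjust both to match your setup.
cpi_for_cgroup() {
  awk -v id="$1" '
    BEGIN { want = "id=\"" id "\"" }
    index($0, "container_perf_events_total{") == 1 && (index($0, "{" want) || index($0, "," want)) {
      value = $0
      sub(/^.*[}][ \t]*/, "", value)   # drop the metric name and label set
      split(value, f, " ")             # f[1] = sample value, f[2] = optional timestamp
      if (index($0, "event=\"cycles\""))       cycles += f[1]
      if (index($0, "event=\"instructions\"")) instr  += f[1]
    }
    END {
      if (instr > 0) printf "%.3f\n", cycles / instr
      else           print "n/a"
    }'
}

# Usage:
#   curl -s http://localhost:8080/metrics | cpi_for_cgroup "/"
```

Note that this divides raw counter totals since the counters started, so it is a lifetime average; a graph in a TSDB would normally divide the rates of the two counters over a time window instead.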

What’s next? “AI for Infrastructure” is top-of-mind for me. Getting there faster; doing more, while doing less.