Google Open Source Blog: observability


OpenTelemetry's First Release Candidates

Wednesday, October 21, 2020

OpenTelemetry has hit another milestone with the tracing specification reaching release candidate status.

With the specification now ready to go, expect to see tracing release candidates of the official APIs and SDKs over the next few weeks, along with updated exporters for Cloud Trace. In the coming months the same will follow for the metrics specification, followed by metrics release candidates of the APIs and SDKs and Cloud Monitoring exporters, followed by the project’s general availability. At this point we’ll switch our default application metrics and distributed tracing instrumentation from OpenCensus to OpenTelemetry.

This is exciting news for Google Cloud customers, as OpenTelemetry will enable even better observability experiences, both with Cloud Monitoring and Cloud Trace, or the third party monitoring and operations tools of your choice.

Originally posted on the OpenTelemetry blog.

Monday, March 30, 2020

OpenTelemetry and OpenCensus have been a critical part of our goal of making platforms like Kubernetes more observable and more manageable. This has been a multi-year journey for us, from creating OpenCensus and growing it into a core part of major web services’ observability stack, to our announcement of OpenTelemetry last year and the rapid growth of the OpenTelemetry community.

Beta is a big milestone for OpenTelemetry, as developers can now use the SDKs, integrations, and Collector to capture distributed traces and metrics from their applications and send them to backends like Prometheus, Jaeger, Cloud Monitoring, Cloud Trace, and others for analysis. This is a great time to try out OpenTelemetry and get involved in the observability community, whether you’re looking to improve your visibility into production services, give your users performance data from client libraries that you maintain, or join a rapidly growing open source project!

To learn more, please read our official community announcement, which is copied below:

Co-authored by maintainers, community contributors, and members of the OpenTelemetry governance committee.

OpenTelemetry has just begun its first wave of beta releases, starting with the Collector and the Erlang, Go, Java, JavaScript, and Python SDKs, followed by the .Net SDK and Java auto-instrumentation agent. This means that you can begin integrating OpenTelemetry into your applications and client libraries to capture app-level metrics and distributed traces.

If you’re not already familiar with OpenTelemetry, the project provides a single set of language-specific APIs, SDKs, agents, and other components that you can use to collect distributed traces, metrics, and related metadata from your applications. In addition to its core capabilities, much of OpenTelemetry’s utility comes from integrations for HTTP and RPC libraries, storage clients, etc. that allow developers to capture critical observability data from their applications with almost zero effort. After capturing these signals, each OpenTelemetry component can export them to your backends of choice, including Prometheus, Jaeger, Zipkin, Azure Monitor, Dynatrace, Google Cloud Monitoring + Trace, Honeycomb, Lightstep, New Relic, and Splunk.

This first beta release includes:

APIs and SDKs for Erlang, Go, Java, JavaScript, and Python, which include the interfaces and implementations that you need to define and create distributed traces and metrics, manage sampling and context propagation, etc. The .Net API + SDK will follow shortly.
Language-specific API integrations for at least one popular HTTP framework, gRPC, and at least one popular storage client, which can be enabled with one line of code, and will automatically capture relevant traces and metrics and handle context propagation.
Language-specific exporters that allow SDKs to send captured traces and metrics to any supported backends.
The OpenTelemetry Collector, which can receive data from OpenTelemetry SDKs and other sources, and then export this telemetry to any supported backend.
Auto-Instrumentation for Java that captures telemetry from 47 Java libraries and frameworks without requiring any modification to your application.
Documentation for each component including getting started guides.
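
To make "context propagation" from the list above concrete, here is a minimal sketch of carrying a trace identity across a service boundary in an HTTP header. It assumes the W3C Trace Context `traceparent` header layout (version, trace ID, span ID, flags); the announcement does not say which wire format each SDK uses. `SpanContext`, `inject`, and `extract` are hypothetical names for illustration, not the OpenTelemetry API.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed wire format: W3C Trace Context "traceparent",
# laid out as version-traceid-spanid-flags in lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical stand-in for the identifiers a tracing SDK carries."""
    trace_id: str  # 32 hex characters, shared by every span in the trace
    span_id: str   # 16 hex characters, unique per span
    sampled: bool


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace with fresh random identifiers."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """A child keeps the trace ID and sampling decision but gets its own span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Client side: write the current context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Server side: read the caller's context, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

A client would call `inject` before sending a request, and the receiving service would call `extract` followed by `child_of`, so that spans from both services end up in the same trace. The integrations described above do this kind of work on the developer's behalf.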

As these and subsequent OpenTelemetry components enter beta (requirements and release plan), we are declaring that they are ready for integration. This means that service developers can begin to include OpenTelemetry in their applications and that maintainers of storage, RPC, etc. clients should start testing the OpenTelemetry APIs to provide better observability for their users.

However, this does come with some caveats:

Each OpenTelemetry component will likely undergo several beta releases in the coming weeks — this is simply the first.
While functional, beta components have not gone through thorough testing or benchmarking and they are not intended for production workloads.
While we aim to avoid any major changes to the OpenTelemetry APIs between beta and GA release candidates, we cannot guarantee that there will not be any changes during this period.
Some functionality is still missing from the first beta and will be added in subsequent releases; this is documented in each component’s GitHub repository.

In the coming weeks, you can expect additional beta releases from the first wave of OpenTelemetry components and others. In particular, we expect the API + SDK for .Net and the Java auto-instrumentation agent to be ready soon. Eventually, components will reach a level of maturity and testing where we’ll feel confident in naming them a release candidate (RC), after which we will not make any breaking changes to the APIs for that component.

This beta milestone is a huge accomplishment for the OpenTelemetry community, and every contributor should be proud of the fact that OpenTelemetry is now working and ready to integrate with. This is a great opportunity for the maintainers of client libraries to begin integrating with the OpenTelemetry APIs, for end-users to start integrating it into their services, and for anyone interested in contributing to join our rapidly growing community by joining our mailing lists, Gitter chats, and the monthly community meeting!

By Morgan McLean, Product Manager

Easier and More Powerful Observability with the OpenCensus Agent and Collector

Tuesday, June 4, 2019

The OpenCensus project has grown considerably over the past year, and we’ve had several significant announcements in the first half of 2019, including the project’s merger into OpenTelemetry. The features discussed in this post will move into OpenTelemetry over the coming months.

For those who aren’t already familiar with the project, OpenCensus provides a set of libraries that collect distributed traces and application metrics from your applications and send them to your backend of choice. Google announced OpenCensus one year ago, and the community has since grown to include large companies, major cloud providers, and APM vendors. OpenCensus includes integrations with popular web, RPC, and storage clients out of the box, along with exporters that allow it to send captured telemetry to your backend of choice.

We’ve recently enhanced OpenCensus with two new components: an optional agent that can manage exporters and collect system telemetry from VMs and containers, and an optional collector service that offers improved trace sampling and metrics aggregation. While we’ve already demonstrated these components on stage at Kubecon and Next, we’re now ready to show them more broadly and encourage their use.

The OpenCensus Agent

The OpenCensus agent is an optional component that can be deployed to each of your virtual machines or Kubernetes pods. The agent receives traces and metrics from OpenCensus libraries, collects system telemetry, and can even capture data from a variety of different sources including Zipkin, Jaeger, and Prometheus. The agent has several benefits over exporting telemetry directly from the OpenCensus libraries:

The OpenCensus libraries export data to the OpenCensus agent by default, meaning that backend exporters can be reconfigured on the agent without having to rebuild or redeploy your application. This provides value for applications with high deployment costs and for PaaS-like platforms that have OpenCensus already integrated.
While the OpenCensus libraries collect application-level telemetry, the agent also captures system metrics like CPU and memory consumption and exports these to your backend of choice.
You can configure stats aggregations without redeploying your application.
The OpenCensus agent will host z-pages. While we originally made these a part of the libraries, we’ll be moving this functionality to the agent. This should result in a higher quality diagnostic page experience, as we’ll no longer have to reimplement the pages in each language.
The OpenCensus agent uses the same exporters already written for the Go OpenCensus library.
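
The first benefit above, changing backends without touching the application, can be pictured with a toy model. This is a sketch under stated assumptions: `Agent`, `App`, and the exporter callables are hypothetical and do not reflect the OpenCensus agent's real configuration or protocol. The point is only that the application holds a reference to the agent and never to a backend.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]
Exporter = Callable[[List[Span]], None]


class Agent:
    """Toy per-host agent; not the OpenCensus agent's real interface."""

    def __init__(self) -> None:
        self._exporters: Dict[str, Exporter] = {}

    def configure(self, exporters: Dict[str, Exporter]) -> None:
        # Backends are swapped here, on the agent, not in the application.
        self._exporters = dict(exporters)

    def receive(self, spans: List[Span]) -> None:
        # Fan each received batch out to every configured backend.
        for export in self._exporters.values():
            export(list(spans))


class App:
    """Application code only knows about the agent, never about a backend."""

    def __init__(self, agent: Agent) -> None:
        self._agent = agent

    def handle_request(self, name: str) -> None:
        self._agent.receive([{"name": name}])
```

Calling `agent.configure(...)` with a different exporter mapping redirects all later telemetry, while `App` stays unchanged and is never rebuilt or redeployed.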

While directly exporting to a backend will remain a common use case for developers, we expect most larger deployments to start using the OpenCensus agent. The agent is currently in beta and we expect it to reach production-ready quality and feature completeness later this year.

The OpenCensus Collector

The OpenCensus collector and agent share the same source code and are both optional – the key difference is how they’re deployed. While the agent runs on each host, the collector is deployed as a service and receives data from OpenCensus libraries and agents running across multiple hosts. The collector has several benefits over exporting telemetry directly from the OpenCensus libraries or agent:

Intelligent (tail based) trace sampling is one of the biggest benefits of the collector. By configuring your OpenCensus libraries to sample all traces and send all spans to the collector, you can have the collector perform sampling decisions after the fact! For example, you can configure the collector to sample the slowest one percent of traces at 30%, traces with errors at 100%, and all other traces at 1%!
The collector performs well and can be sharded across multiple instances. Current performance scales linearly across cores, allowing 10,000 spans to be collected per 1.2 cores.
The collector can be used as a proxy for other telemetry sources. In addition to receiving data from OpenCensus libraries and agents, Zipkin clients, Jaeger clients, and Prometheus clients, the service can be used to receive telemetry from client applications running on the web or on mobile devices.
The collector will soon host z-pages for your entire cluster. This is simply an expansion of the z-page functionality that we’ve added to the OpenCensus agent.
The collector can be used to apply telemetry policies across your entire application including adding span attributes such as region to all spans received, stripping personally identifiable information by removing or overwriting span attributes, or mapping span attributes to different names.
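
The tail-based sampling example above (errors at 100%, the slowest one percent at 30%, everything else at 1%) can be sketched as follows. This is an illustration only, assuming the collector has a batch of completed traces in hand; `Trace`, `slow_threshold`, `sampling_rate`, and `tail_sample` are hypothetical names, not the collector's configuration or code.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    trace_id: str
    duration_ms: float
    has_error: bool


def slow_threshold(durations: List[float], slowest_fraction: float = 0.01) -> float:
    """Duration at or above which a trace counts among the slowest fraction."""
    ordered = sorted(durations)
    # Always treat at least one trace as "slow", even in small batches.
    n_slow = max(1, round(len(ordered) * slowest_fraction))
    return ordered[-n_slow]


def sampling_rate(trace: Trace, threshold_ms: float) -> float:
    """Pick the keep-probability after the trace has finished."""
    if trace.has_error:
        return 1.0    # keep every trace that contains an error
    if trace.duration_ms >= threshold_ms:
        return 0.30   # keep 30% of the slowest traces
    return 0.01       # keep 1% of everything else


def tail_sample(traces: List[Trace], rng: Optional[random.Random] = None) -> List[Trace]:
    """Decide which completed traces to keep."""
    if not traces:
        return []
    rng = rng or random.Random()
    threshold = slow_threshold([t.duration_ms for t in traces])
    # rng.random() is in [0, 1), so a rate of 1.0 always keeps the trace.
    return [t for t in traces if rng.random() < sampling_rate(t, threshold)]
```

The decision depends on each trace's final duration and error status, which are only known once all of its spans have arrived. That is why the libraries have to send every span to the collector for this to work.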

When to Use Each

As mentioned above, both the agent and collector are optional, and we expect that some developers will continue to export traces and metrics directly from the OpenCensus libraries. However, we expect both to become quite popular in the following scenarios:

Many organizations don’t want to have to rebuild and redeploy their apps when they change exporters. The agent provides the flexibility to change exporters without having to modify and redeploy your code.
With the OpenCensus agent you can capture system metrics via the same pipeline used to extract application metrics and distributed traces from your application.
If you want to make trace sampling decisions more intelligently, you’ll need to start using the collector.
With the OpenCensus collector you can minimize egress points and support features including batching, queuing and retry. These features are important when sending tracing and metric data to SaaS-based backends.
Platform providers can include the OpenCensus agent and collector into their services, making them available out of the box to customers.
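
The batching, queuing, and retry features mentioned above can be pictured with a small sketch. It makes several assumptions: `BatchingExporter` and its parameters are hypothetical, the `send` callable stands in for a network call to a SaaS backend, and a failed send is signalled with `ConnectionError`. It is not the collector's implementation.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class BatchingExporter:
    """Toy batching/queuing/retry pipeline; all names are illustrative."""

    def __init__(self, send: Callable[[List[object]], None], batch_size: int = 3,
                 max_queue: int = 100, max_attempts: int = 3,
                 backoff_s: float = 0.0) -> None:
        self._send = send
        self._batch_size = batch_size
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        # Bounded queue: when full, the oldest items are dropped.
        self._queue: Deque[object] = deque(maxlen=max_queue)

    @property
    def pending(self) -> List[object]:
        return list(self._queue)

    def enqueue(self, item: object) -> None:
        self._queue.append(item)
        if len(self._queue) >= self._batch_size:
            self.flush()

    def flush(self) -> bool:
        """Send everything queued, in batches; False if the backend stayed down."""
        while self._queue:
            size = min(self._batch_size, len(self._queue))
            batch = [self._queue.popleft() for _ in range(size)]
            if not self._send_with_retry(batch):
                # Put the batch back at the front so a later flush can retry it.
                self._queue.extendleft(reversed(batch))
                return False
        return True

    def _send_with_retry(self, batch: List[object]) -> bool:
        for attempt in range(self._max_attempts):
            try:
                self._send(batch)
                return True
            except ConnectionError:
                # Exponential backoff between attempts.
                time.sleep(self._backoff_s * (2 ** attempt))
        return False
```

Batching reduces the number of outbound requests, the bounded queue absorbs short backend outages without unbounded memory growth, and retry with backoff avoids losing data on transient failures.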

Status

Both the agent and collector are currently in beta, though we know that several companies are already using them in their production services. We’re working towards the 1.0 release of each of these, and we expect this to occur by the end of Q2.

In the meantime, please join us on GitHub, Gitter, and in our monthly community meetings!

OpenTelemetry: The Merger of OpenCensus and OpenTracing

Tuesday, May 21, 2019

We’ve talked about OpenCensus a lot over the past few years, from the project’s initial announcement, its roots at Google, and partners (Microsoft, Dynatrace) joining the project, to the new functionality that we’re continually adding. The project has grown beyond our expectations and now sports a mature ecosystem with Google, Microsoft, Omnition, Postmates, and Dynatrace making major investments, and a broad base of community contributors.

We recently announced that OpenCensus and OpenTracing are merging into a single project, now called OpenTelemetry, which brings together the best of both projects and has a frictionless migration experience. We’ve made a lot of progress so far: we’ve established a governance committee, a Java prototype API + implementation, workgroups for each language, and an aggressive implementation schedule.

Today we’re highlighting the combined project at the keynote of Kubecon and announcing that OpenTelemetry is now officially part of the Cloud Native Computing Foundation! Full details are available in the CNCF’s official blog post, which we’ve copied below:

A Brief History of OpenTelemetry (So Far)

After many months of planning, discussion, prototyping, more discussion, and more planning, OpenTracing and OpenCensus are merging to form OpenTelemetry, which is now a CNCF sandbox project. The seed governance committee is composed of representatives from Google, Lightstep, Microsoft, and Uber, and more organizations are getting involved every day.

And we couldn't be happier about it – here’s why.

Observability, Outputs, and High-Quality Telemetry

Observability is a fashionable word with some admirably nerdy and academic origins. In control theory, “observability” measures how well we can understand the internals of a given system using only its external outputs. If you’ve ever deployed or operated a modern, microservice-based software application, you have no doubt struggled to understand its performance and behavior, and that’s because those “outputs” are usually meager at best. We can’t understand a complex system if it’s a black box. And the only way to light up those black boxes is with high-quality telemetry: distributed traces, metrics, logs, and more.

So how can we get our hands – and our tools – on precise, low-overhead telemetry from the entirety of a modern software stack? One way would be to carefully instrument every microservice, piece by piece, and layer by layer. While this would literally work, it’s also a complete non-starter – we’d spend as much time on the measurement as we would on the software itself! We need telemetry as a built-in feature of our services.

The OpenTelemetry project is designed to make this vision a reality for our industry, but before we describe it in more detail, we should first cover the history and context around OpenTracing and OpenCensus.

OpenTracing and OpenCensus

In practice, there are several flavors (or “verticals” in the diagram) of telemetry data, and then several integration points (or “layers” in the diagram) available for each. Broadly, the cloud-native telemetry landscape is dominated by distributed traces, timeseries metrics, and logs; and end-users typically integrate with a thin instrumentation API or via straightforward structured data formats that describe those traces, metrics, or logs.

For several years now, there has been a well-recognized need for industry-wide collaboration in order to amortize the shared cost of software instrumentation. OpenTracing and OpenCensus have led the way in that effort, and while each project made different architectural choices, the biggest problem with either project has been the fact that there were two of them. And, further, that the two projects weren’t working together and striving for mutual compatibility.

Having two similar-yet-not-identical projects out in the world created confusion and uncertainty for developers, and that made it harder for both efforts to realize their shared mission: built-in, high-quality telemetry for all.

Getting to One Project

If there’s a single thing to understand about OpenTelemetry, it’s that the leadership from OpenTracing and OpenCensus are co-committed to migrating their respective communities to this single and unified initiative. Although all of us have numerous ideas about how we could boil the ocean and start from scratch, we are resisting those impulses and focusing instead on preparing our communities for a successful transition; our priorities for the merger are clear:

Straightforward backwards compatibility with both OpenTracing and OpenCensus (via software bridges)
Minimizing the time where OpenTelemetry, OpenTracing, and OpenCensus are being co-developed: we plan to put OpenTracing and OpenCensus into “readonly mode” before the end of 2019.
And, again, to simplify and standardize the telemetry solutions available to developers.

In many ways, it’s most accurate to think of OpenTelemetry as the next major version of both OpenTracing and OpenCensus. Like any version upgrade, we will try to make it easy for both new and existing end-users, but we recognize that the main benefit to the ecosystem is the consolidation itself – not some specific and shiny new feature – and we are prioritizing our own efforts accordingly.

How you can help

OpenTelemetry’s timeline is an aggressive one. While we have many open-source and vendor-licensed observability solutions providing guidance, we will always want as many end-users involved as possible. The single most valuable thing any end-user can do is also one of the easiest: check out the actual work we’re doing and provide feedback via GitHub, Gitter, email, or whatever feels easiest.

Of course we also welcome code contributions to OpenTelemetry itself, code contributions that add OpenTelemetry support to existing software projects, documentation, blog posts, and the rest of it. If you’re interested, you can sign up to join the integration effort by filling in this form.

By Ben Sigelman, co-creator of OpenTracing and member of the OpenTelemetry governing committee, and Morgan McLean, Product Manager for OpenCensus at Google since the project’s inception

OpenMetrics project accepted into CNCF Sandbox

Friday, August 10, 2018

For the past several months, engineers from Google Cloud, Prometheus, and other vendors have been aligning on OpenMetrics, a specification for metrics exposition. Today, the project was formally announced and accepted into the CNCF Sandbox, and we’re currently working on ways to support OpenMetrics in OpenCensus, a set of uniform tracing and stats libraries that work with multiple vendors’ services. This multi-vendor approach works to put architectural choices in the hands of developers.

OpenMetrics stems from the stats formats used inside of Prometheus and Google’s Monarch time-series infrastructure, which underpins both Stackdriver and internal monitoring applications. As such, it is designed to be immediately familiar to developers and capable of operating at extreme scale. With additional contributions and review from AppOptics, Cortex, Datadog, InfluxData, Sysdig, and Uber, OpenMetrics has begun the cross-industry collaboration necessary to drive adoption of a new specification.

OpenCensus provides automatic instrumentation, APIs, and exporters for stats and distributed traces across C++, Java, Go, Node.js, Python, PHP, Ruby, and .Net. Each OpenCensus library allows developers to automatically capture distributed traces and key RPC-related statistics from their applications, add custom data, and export telemetry to their back-end of choice. Google has been a key collaborator in defining the OpenMetrics specification, and we’re now focusing on how to best implement this inside of OpenCensus.

“Google has a history of innovation in the metric monitoring space, from its early success with Borgmon, which has been continued in Monarch and Stackdriver. OpenMetrics embodies our understanding of what users need for simple, reliable and scalable monitoring, and shows our commitment to offering standards-based solutions,” said Sumeer Bhola, Lead Engineer on Monarch and Stackdriver at Google.

For more information about OpenMetrics, please visit openmetrics.io. For more information about OpenCensus and how you can quickly enable trace and metrics collection from your application, please visit opencensus.io.

By Morgan McLean, Product Manager for OpenCensus and Stackdriver APM

OpenCensus: A Stats Collection and Distributed Tracing Framework

Wednesday, January 17, 2018

Today we’re pleased to announce the release of OpenCensus, a vendor-neutral open source library for metric collection and tracing. OpenCensus is built to add minimal overhead and be deployed fleet wide, especially for microservice-based architectures.

The Need for Instrumentation & Observability

As a startup, often the focus is to get an initial version of the product out the door, rapidly prototype and iterate with customers. Most startups start out with monolithic applications as a simple model-view-controller (MVC) web application. As the customer base, code, and number of engineers increase, they migrate from monolithic architecture to a microservices architecture. A microservices architecture has its advantages, but often makes debugging more challenging as traditional debugging and monitoring tools don’t always work in these environments or are designed for monolithic use cases. When operating multiple microservices with strict service level objectives (SLOs), you need insights into the root cause of reliability and performance problems.

Not having proper instrumentation and observability can result in lost engineering hours, violated SLOs and frustrated customers. Instead, diagnostic data should be collected from across the stack. This data can be used for incident management to identify and debug potential bottlenecks or for system tuning and performance improvement.

OpenCensus

At Google scale, an instrumentation layer with minimal overhead is a requirement. As Google grew, we realized the importance of having a highly efficient tracing and stats instrumentation library that could be deployed fleet wide.

OpenCensus is the open source version of Google’s Census library, written based on years of optimization experience. It aims to make the collection and submission of app metrics and traces easier for developers. It is a vendor neutral, single distribution of libraries that automatically collects traces and metrics from your app, displays them locally, and sends them to analysis tools. OpenCensus currently supports Prometheus, SignalFx, Stackdriver and Zipkin.

Developers can use this powerful, out-of-the-box library to instrument microservices and send data to any supported backend. For an Application Performance Management (APM) vendor, OpenCensus provides free instrumentation coverage with minimal work, and affords customers a simple setup experience.

Below are Stackdriver Trace and Monitor screenshots showing traces generated from a demo app, which calls Google’s Cloud Bigtable API and uses OpenCensus.

We’d love to hear your feedback on OpenCensus. Try using it in your app, tell us about your success story, and help by contributing to our existing language-specific libraries, or by creating one for a not-yet-supported language. You can also help us integrate OpenCensus with new APM tools!

We hope you find this as useful as we have. Visit opencensus.io for more information.

By Pritam Shah, Census team
