Open Telemetry in NestJs (and React)

Open Telemetry is good enough to use in production projects now and most cloud providers and telemetry services have integrated open telemetry into their products.

In this article I'll briefly describe what Open Telemetry is, how it can help you build better products and systems and why you should consider using it now.

In the second half of the article I'll describe how to set up Open Telemetry in JavaScript applications with the Open Telemetry collector and some backends like Zipkin, Prometheus and Grafana.

What is Open Telemetry?

Telemetry in a software context means the metrics, events, logging and tracing generated by an application or a whole distributed system while it is running.

This data is used to improve our applications. Product managers, DevOps and developers can monitor a full distributed system from a customer perspective. We can detect issues in code early and alert on them. We can find the sources of problems quickly despite the complexity in modern systems.

Without telemetry data, finding the root cause of an issue in Service3 below could be very difficult and time-consuming.

Modern application

With telemetry available you can correlate calls between services and any logs the developer(s) added. You can use those almost as a callstack to debug your problem in a distributed system.

Tracing with Open Telemetry

There have been products and services to do this in the market for decades but up until now there wasn't a standard. So you typically had to instrument your application and systems with proprietary libraries from the service providers.

The transmission of telemetry data often used custom propagation patterns and data models. These were incompatible with other providers so it was difficult to build universal tooling for working with telemetry data.

Open Telemetry standardises how you instrument your system by providing vendor-neutral instrumentation libraries with common terminology and usage.

Open Telemetry standardises propagation providers and it gives you vendor-neutral infrastructure to collect the telemetry data and forward on to any provider your organisation supports now or in the future - with no changes to the code in your system!

Why is open telemetry worth investigating now

Open Telemetry is a massive project from the Cloud Native Computing Foundation (CNCF) with many component parts.

CNCF projects are given a status based on their level of industry adoption in the "Crossing the Chasm" chart.

Chasm Chart (source: cncf)

Open telemetry is currently in the incubating stage. This is because some SDKs are still being perfected before entering release candidate stage.

Open Telemetry uses "Stable" to describe APIs that are fixed and essentially production ready. As of writing this post - tracing is stable, metrics is in release candidate stage and logging is in draft stage.

The Open Telemetry observability framework is currently being adopted by early adopters in production software. All of the major telemetry producers and services are adopting the standard for their offerings. Many already have full support for tracing.

For vendors that don't support open telemetry yet there are often collectors or exporters available to temporarily convert open telemetry data to the proprietary format of the vendor until they have implemented support.

The following vendors have excellent support for open telemetry data generation and collection today.

Honeycomb
Datadog
New Relic
Azure App Insights
AWS X-Ray (ADOT)
GCP Cloud Monitoring and Cloud Trace

How Open Telemetry helps you build better systems

Open telemetry standardises the components needed for observability in a software system. These components are typically:

Instrumentation
Propagation
Collection
Backends

This diagram is a messy hodge-podge! Each colour-coded section is described in detail below, so read on for it to make more sense.

Open telemetry data flows

Instrumentation

Automatic instrumentation is already provided by open telemetry libraries that are available for most major software development languages and frameworks like .Net, Java, Go and Javascript/NodeJs. Depending on the framework's standard library there is significant auto-instrumentation. Most frameworks have their http and grpc clients automatically instrumented, for example.

There is also wide support for common libraries used in each ecosystem - e.g. for the postgres database, the Npgsql library on .Net and the pg library on NodeJs each have an existing open telemetry instrumentation library you can just plug in to your code.

If you're using a service mesh sidecar service like Dapr, Istio or Linkerd they usually support an open telemetry exporter, or an OpenCensus exporter in the case of Linkerd. The open telemetry collector can accept these formats for you.

The instrumentation libraries are extremely "pluggable". They are designed so you can inject functionality you need to gradually move your existing code to open telemetry.

The pluggable components include instrumentation trace "exporters" for popular formats like Jaeger, Zipkin and Open Telemetry itself. For example, if you have been using a Jaeger backend then you can keep it and change your instrumentation to open telemetry gradually.

Propagation

Propagation is the method by which you trace a request around your stateless systems. Because they are stateless you have to include the trace context identifier in the request somehow.

Because the most popular message transport mechanisms these days are http based, the preferred propagation mechanisms are based around http capabilities - headers.

The proprietary observability systems all use different headers at the moment. For example AWS X-Ray uses X-Amzn-Trace-Id and Dynatrace uses x-dynatrace. Open telemetry supports a few different standards like B3 but it has standardised on the W3C propagation standard.

The W3C propagation standard was created in conjunction with all the major cloud providers and will be the recommended standard in the future. W3C propagation standard uses http headers traceparent and tracestate to propagate the trace context in your requests.

W3C trace propagation
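
For illustration, here is a minimal sketch of reading and writing the traceparent header for version 00 of the W3C format. The function names (parseTraceParent, formatTraceParent, childTraceParent) and the randomHex helper are hypothetical and only exist for this example. In a real application the W3CTraceContextPropagator does this work for you.

```typescript
// The fields carried in a version 00 `traceparent` header:
// version-traceid-parentid-flags, all lowercase hex.
interface TraceParent {
  version: string; // 2 hex chars, "00" today
  traceId: string; // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the id of the calling span
  sampled: boolean; // lowest bit of the 2 hex char trace-flags field
}

const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return undefined;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero ids are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return undefined;
  }
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Only the sampled flag is written back; version 00 defines no other flags.
function formatTraceParent(ctx: TraceParent): string {
  return `${ctx.version}-${ctx.traceId}-${ctx.parentId}-${ctx.sampled ? "01" : "00"}`;
}

// Hypothetical helper. Math.random is good enough for a sketch, not for production ids.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Continue the caller's trace when a valid header arrived, otherwise start a new one.
function childTraceParent(incoming: string | undefined): TraceParent {
  const caller = incoming ? parseTraceParent(incoming) : undefined;
  if (caller) {
    // Same trace id, new span id: this is what links the two services together.
    return { ...caller, parentId: randomHex(16) };
  }
  return { version: "00", traceId: randomHex(32), parentId: randomHex(16), sampled: true };
}
```

A service would call childTraceParent with the incoming header value and send formatTraceParent of the result on its own outgoing requests. Because the trace id never changes along the way, a backend can stitch the spans from every service into one trace.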

The standardisation of these headers means that all cloud middleware can be written by the large cloud vendors to automatically support the propagation of those headers. This increases the depth of your tracing, especially in modern cloud-native systems.

Because the headers are now standard it means a cloud native service that originates a request (a timed trigger for a serverless function for example) can initiate tracing for you, right from the origin of a request.

Open Telemetry Collection

The open telemetry collector is a service you run in your system. It receives telemetry data in many formats, processes this data and then exports it to one or more destinations (backends).

The collector service is not required if you only use one backend and it is supported by all of your instrumentation sources. But it's highly recommended because it can handle retries, batching, encryption and removing sensitive data.

The collector is designed as a "pluggable" architecture so you can create a telemetry handler that exactly matches the needs of your system. The components are:

Receivers - you can receive from multiple sources e.g. OTEL format, AWS X-Ray format (e.g. the lambda decorators), directly from kafka
Processors - Filtering sensitive data, cleaning unwanted traces, batching to exporters, modification of traces
Exporters - you can export to multiple destinations - X-Ray, honeycomb, zipkin, jaeger. These are not to be confused with instrumentation trace exporters.
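
To make the three roles concrete, here is a conceptual sketch of a pipeline in TypeScript. It is only a model of the idea: the real collector is a separate service configured through yaml, and every name here (SpanRecord, deleteAttribute, dropByName, createPipeline) is made up for the example.

```typescript
// Minimal span shape for this sketch. Real spans carry far more fields.
interface SpanRecord {
  name: string;
  attributes: Record<string, string>;
}

type SpanProcessor = (spans: SpanRecord[]) => SpanRecord[];
type SpanExporter = (spans: SpanRecord[]) => void;

// Processor: remove one attribute key from every span, e.g. a sensitive value.
const deleteAttribute = (key: string): SpanProcessor => (spans) =>
  spans.map((span) => {
    const attributes: Record<string, string> = {};
    for (const k of Object.keys(span.attributes)) {
      if (k !== key) attributes[k] = span.attributes[k];
    }
    return { name: span.name, attributes };
  });

// Processor: drop unwanted spans entirely, e.g. health checks.
const dropByName = (unwanted: string): SpanProcessor => (spans) =>
  spans.filter((span) => span.name !== unwanted);

// A pipeline takes what a receiver hands it, runs each processor in order
// and then fans the result out to every exporter.
function createPipeline(processors: SpanProcessor[], exporters: SpanExporter[]) {
  return (received: SpanRecord[]): void => {
    const processed = processors.reduce((spans, run) => run(spans), received);
    exporters.forEach((send) => send(processed));
  };
}

// Two in-memory "backends" standing in for e.g. zipkin and a second vendor.
const firstBackend: SpanRecord[] = [];
const secondBackend: SpanRecord[] = [];

const tracePipeline = createPipeline(
  [deleteAttribute("password"), dropByName("healthcheck")],
  [(spans) => firstBackend.push(...spans), (spans) => secondBackend.push(...spans)]
);

tracePipeline([
  { name: "GET /products", attributes: { user: "a", password: "secret" } },
  { name: "healthcheck", attributes: {} },
]);
```

After the call both backends hold the single "GET /products" span without its password attribute. Adding a new backend is one more entry in the exporter list, which is the same property that makes the real collector useful for trying out vendors.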

The flexibility of the collector means you can collect from your existing instrumented systems while adding support for open telemetry systems. You can export to your existing backend while adding support for others if you want to try them out. This is all configured easily via yaml.

The collector supports modern transport standards between instrumentation trace exporters and collector receivers like protobuf over grpc or http. Support for JSON in http is experimental.
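
As a small sketch of what this means for configuration, the helpers below build the addresses an instrumentation exporter would point at. The ports are the OTLP defaults (4317 for gRPC, 4318 for http) and the /v1/... paths are the standard OTLP http paths. The function and type names are hypothetical, and the host "otel-collector" is just the docker compose service name used later in this article.

```typescript
type OtlpSignal = "traces" | "metrics" | "logs";

// Default OTLP ports. A collector can be configured to listen elsewhere.
const OTLP_GRPC_PORT = 4317;
const OTLP_HTTP_PORT = 4318;

// Content types used by OTLP over http.
const OTLP_HTTP_CONTENT_TYPES = {
  protobuf: "application/x-protobuf",
  json: "application/json",
};

// OTLP over http uses one path per signal: /v1/traces, /v1/metrics, /v1/logs.
function otlpHttpUrl(host: string, signal: OtlpSignal, port: number = OTLP_HTTP_PORT): string {
  return `http://${host}:${port}/v1/${signal}`;
}

// OTLP over gRPC uses a single endpoint for all signals.
function otlpGrpcEndpoint(host: string, port: number = OTLP_GRPC_PORT): string {
  return `${host}:${port}`;
}

const traceUrl = otlpHttpUrl("otel-collector", "traces");
```

Here traceUrl is "http://otel-collector:4318/v1/traces", which is the kind of value passed as the url of the OTLP http trace exporter in the NestJs configuration further down.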

How to integrate open telemetry in a NestJs application

The best way to learn is to set up an application with open telemetry so let's do that!

The code for this example is available here in this repository: https://github.com/darraghoriordan/dapr-inventory

See packages/products-api for the node js example.

Install Open Telemetry dependencies

yarn add @opentelemetry/api @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-zipkin @opentelemetry/sdk-node @opentelemetry/semantic-conventions

// or

npm i @opentelemetry/api @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-zipkin @opentelemetry/sdk-node @opentelemetry/semantic-conventions

Create Open Telemetry Configuration

Create a new class to hold the configuration. We configure many of the instrumentation components discussed so far. There are inline comments in the code that should be helpful.

The important thing here is the getNodeAutoInstrumentations. This automatically detects and instruments many popular node libraries by monkey patching them. If performance or package size is a concern (e.g. for lambda functions) then you might only include instrumentations you actually need.

The auto instrumentations are listed on the npm package page: https://www.npmjs.com/package/@opentelemetry/auto-instrumentations-node

import { W3CTraceContextPropagator } from "@opentelemetry/core";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";
import { Resource } from "@opentelemetry/resources";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Set an internal logger for open telemetry to report any issues to your console/stdout
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN);

export const initTelemetry = (config: {
  appName: string;
  telemetryUrl: string;
}): void => {
  // create an exporter that sends to an open telemetry collector. We create this collector instance locally using docker compose.
  const exporter = new OTLPTraceExporter({
    url: config.telemetryUrl, // e.g. "http://otel-collector:4318/v1/traces",
  });

  // We add some common meta data to every trace. The service name is important.
  const resource = Resource.default().merge(
    new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: config.appName,
      application: config.appName,
    })
  );

  // We use the node trace provider provided by open telemetry
  const provider = new NodeTracerProvider({ resource });

  // The batch span processor is more efficient than the simple span processor. It batches sends to
  // the exporter you have configured
  provider.addSpanProcessor(new BatchSpanProcessor(exporter));

  // Initialize the propagator
  provider.register({
    propagator: new W3CTraceContextPropagator(),
  });

  // Registering instrumentations / plugins
  registerInstrumentations({
    instrumentations: getNodeAutoInstrumentations(),
  });
};
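
If performance or package size matters, one option is to switch off the instrumentations you don't need. The sketch below assumes the configuration map that getNodeAutoInstrumentations accepts, keyed by instrumentation package name. Which instrumentations are bundled depends on the package version you install, so treat the two names here as examples rather than a recommended list.

```typescript
// Keys are the npm package names of the individual instrumentations.
// The two entries below are only examples of instrumentations you might not need.
const instrumentationOverrides: Record<string, { enabled: boolean }> = {
  // File system spans tend to be noisy in an API service.
  "@opentelemetry/instrumentation-fs": { enabled: false },
  "@opentelemetry/instrumentation-dns": { enabled: false },
};

// In initTelemetry the overrides would be passed to the existing call:
//
// registerInstrumentations({
//   instrumentations: getNodeAutoInstrumentations(instrumentationOverrides),
// });
```

Disabling an instrumentation this way stops it from patching its library, but the package is still installed. To reduce package size as well you would depend on the individual instrumentation packages you need and list those instead of calling getNodeAutoInstrumentations.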

Run Telemetry Configuration on NestJs startup

Note it is vital that you run the open telemetry configuration before anything else bootstraps in your NestJs application.

You must run the initialisation before even importing any NestJs libraries in your bootstrapping method.

import { initTelemetry } from "./core-logger/OpenTelemetry";
// ----- this has to come before imports! -------
initTelemetry({
  appName: process.env.OPEN_TELEMETRY_APP_NAME || "",
  telemetryUrl: process.env.OPEN_TELEMETRY_URL || "",
});
console.log("initialised telemetry");
// -------------

// follow with your nest js imports and bootstrapping....

import { ClassSerializerInterceptor, ValidationPipe } from "@nestjs/common";
import { NestFactory, Reflector } from "@nestjs/core";
// ... etc

const app = await NestFactory.create(MainModule, { cors: true });
// ... etc
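
The ordering requirement comes from how monkey patching works. The sketch below is a simplified, dependency-free model of the problem: httpLibrary stands in for a real library and instrument stands in for an instrumentation package. None of these names are real open telemetry APIs.

```typescript
// A stand-in for a library such as an http client.
const httpLibrary = {
  request: (url: string): string => `response from ${url}`,
};

const recordedSpans: string[] = [];

// Code that loaded BEFORE instrumentation keeps a reference to the original function.
const requestCapturedEarly = httpLibrary.request;

// "Instrumentation": replace the library function with a wrapper that records a span.
function instrument(library: typeof httpLibrary): void {
  const original = library.request;
  library.request = (url: string): string => {
    recordedSpans.push(`GET ${url}`);
    return original(url);
  };
}

instrument(httpLibrary);

// This call bypasses the wrapper, so no span is recorded for it.
requestCapturedEarly("http://early.example");

// This call goes through the wrapper and is recorded.
httpLibrary.request("http://late.example");
```

Only the second request ends up in recordedSpans. The real instrumentations hook into module loading rather than patching an object like this, but the consequence is similar: anything imported before initTelemetry runs may hold unpatched code and silently produce no traces.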

Custom instrumentation

That's it for instrumenting our app. Super simple.

If you need to you can create custom trace spans.

Here is an example of a method with a custom span.

async getAllProductsScan(): Promise<ProductDto[]> {
    // get a tracer (assuming `import opentelemetry from "@opentelemetry/api"`)
    const tracer = opentelemetry.trace.getTracer("basic");
    // create a span
    const span = tracer.startSpan("getAllProductsScan");
    try {
        // do some work
        const products = await this.client.send(
            new ScanCommand({TableName: "products"})
        );
        // add some meta data to the span
        span.setAttribute("thisAttribute", "this is a value set manually");
        span.addEvent("got the data from store", {
            ["manualEventAttribute"]: "this is a value",
        });
        return (products.Items || []).map((i) => {
            return this.mapOne(i);
        });
    } finally {
        // finalise the span, even if the scan or the mapping throws
        span.end();
    }
}
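
If you create many custom spans, a small wrapper can make sure every span is ended and that failures are recorded on it. This is a sketch of a hypothetical withSpan helper, not something the open telemetry packages provide under that name. SpanLike and TracerLike are minimal stand-ins, modelled on the Span and Tracer interfaces from @opentelemetry/api, so that the example compiles on its own.

```typescript
// Minimal structural types for the sketch. The real interfaces have more members.
interface SpanLike {
  recordException(error: Error): void;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Hypothetical helper: run `work` inside a span that is always ended.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  work: (span: SpanLike) => Promise<T>
): Promise<T> {
  const span = tracer.startSpan(name);
  try {
    return await work(span);
  } catch (error) {
    // Attach the failure to the span, then let the caller handle it as usual.
    span.recordException(error instanceof Error ? error : new Error(String(error)));
    throw error;
  } finally {
    // Runs on success and on failure, so the span is never left open.
    span.end();
  }
}
```

With a helper like this the body of getAllProductsScan could be passed as the work callback, and the span is closed even when the database call throws.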

Implementing Open Telemetry in a React application

Implementing open telemetry in React is similar to the node js implementation. You'll see that the trace provider changes to one that works in a browser context.

There are some interesting engineering issues with storing context across asynchronous methods in a browser javascript context that have been resolved in nodejs - you can read more about it on the npm page for the zone.js package - https://www.npmjs.com/package/zone.js.

The instrumentations for a web browser application are different to a nodejs application also. In this case we only trace calls to the browser fetch functionality.

In this development environment I have set up a local api gateway with envoy.

The envoy gateway exposes a single host that routes all requests. This includes any frontend traces, which pass through the envoy gateway to the open telemetry collector. This makes handling cors locally much easier for front end developers.

import { WebTracerProvider } from "@opentelemetry/sdk-trace-web";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { W3CTraceContextPropagator } from "@opentelemetry/core";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { Resource } from "@opentelemetry/resources";
import { ZoneContextManager } from "@opentelemetry/context-zone";
import { FetchInstrumentation } from "@opentelemetry/instrumentation-fetch";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

export const initInstrumentation = () => {
  const exporter = new OTLPTraceExporter({
    url: `${import.meta.env.VITE_API_HOST}/trace`,
  });

  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "DaprInventoryClient",
    application: "DaprInventoryClient",
  });

  const provider = new WebTracerProvider({ resource });
  provider.addSpanProcessor(new BatchSpanProcessor(exporter));

  // Initialize the provider
  provider.register({
    propagator: new W3CTraceContextPropagator(),
    contextManager: new ZoneContextManager(),
  });

  // Registering instrumentations / plugins
  registerInstrumentations({
    instrumentations: [
      new FetchInstrumentation({
        propagateTraceHeaderCorsUrls: [/.+/], // this is too broad for production
        clearTimingResources: true,
      }),
    ],
  });
};
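
For production you would normally limit trace header propagation to the APIs you control. A minimal sketch of building such a list follows. The traceHeaderUrlPatterns helper and the example origin are hypothetical. The resulting array is meant to be passed as propagateTraceHeaderCorsUrls.

```typescript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: one anchored pattern per trusted API origin.
// The origin must be followed by "/", "?", "#" or the end of the url, so
// "https://api.example.com.evil.test" does not match "https://api.example.com".
function traceHeaderUrlPatterns(origins: string[]): RegExp[] {
  return origins.map((origin) => new RegExp(`^${escapeRegExp(origin)}([/?#]|$)`));
}

const trustedApiPatterns = traceHeaderUrlPatterns(["https://api.example.com"]);
```

The patterns deliberately have no g flag. A global regex keeps state between calls to test, which can make the same url match on one request and fail on the next.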

Call this initialisation before bootstrapping the react application.

// bootstrap the instrumentation
initInstrumentation();

const queryClient = new QueryClient();

ReactDOM.createRoot(document.getElementById("root") as HTMLElement).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <App />
    </QueryClientProvider>
  </React.StrictMode>
);

Configuring the Open Telemetry collector

The open telemetry collector is run as a docker container with a configuration file attached as a volume. A few parts of the configuration are worth mentioning here.

You have to set cors origins for anything that needs them when connecting to the collector over http. I use the wildcard here, but that would be bad practice on a production system unless you're in a completely isolated environment where external calls are passed through a gateway and/or firewall. Even then you might consider setting origins correctly.

The zipkin exporter actually pushes traces to zipkin, it literally exports. But the prometheus exporter is an endpoint on the collector that the prometheus server will poll for data.

These two paradigms are quite different! It's important to understand what "exporter" means for each one you configure, especially when allocating ports on the open telemetry collector.
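
The difference between the two styles can be sketched in a few lines of TypeScript. This is a conceptual model only, with made-up class names. It is not how the collector implements either exporter.

```typescript
interface DataPoint {
  name: string;
  value: number;
}

// Push style, like the zipkin exporter: data is sent out as soon as it is exported.
// The collector is the client here, so it needs the backend's address.
class PushExporter {
  private readonly send: (batch: DataPoint[]) => void;

  constructor(send: (batch: DataPoint[]) => void) {
    this.send = send;
  }

  export(batch: DataPoint[]): void {
    this.send(batch);
  }
}

// Pull style, like the prometheus exporter: data is held until somebody asks for it.
// The collector is the server here, so it needs a port that the scraper can reach.
class PullExporter {
  private readonly latest = new Map<string, number>();

  export(batch: DataPoint[]): void {
    for (const point of batch) {
      this.latest.set(point.name, point.value);
    }
  }

  // Called whenever the scraper polls the endpoint.
  scrape(): string {
    const lines: string[] = [];
    this.latest.forEach((value, name) => lines.push(`${name} ${value}`));
    return lines.join("\n");
  }
}

const receivedByBackend: DataPoint[] = [];
const pushExporter = new PushExporter((batch) => receivedByBackend.push(...batch));
const pullExporter = new PullExporter();

pushExporter.export([{ name: "requests_total", value: 1 }]);
pullExporter.export([{ name: "requests_total", value: 1 }]);
pullExporter.export([{ name: "requests_total", value: 2 }]);
```

After these calls the push backend already holds the data point, while the pull exporter only hands over "requests_total 2" when scrape is called. That is why a pull style exporter is configured with an address to listen on rather than an address to send to.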

As of writing, open telemetry log support is still in development, so I log directly to Seq from all of my applications rather than going through the otel collector.

receivers:
  otlp:
    protocols:
      grpc:
        include_metadata: true
      http:
        cors:
          allowed_origins:
            - "*"
          allowed_headers:
            - "*"
        include_metadata: true
  zipkin:

processors:
  batch:
  attributes:
    actions:
      - key: seq
        action: delete # remove sensitive element
exporters:
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"
  logging:
    loglevel: debug
  prometheus:
    endpoint: "127.0.0.1:9091" # this is weird because the exporter is actually an endpoint that must be scraped
extensions:
  health_check:
  pprof:
  zpages:

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp, zipkin]
      processors: [batch]
      exporters: [zipkin]

    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    # logs:
    #     receivers: [otlp]
    #     processors: [batch]
    #     exporters: [zipkin]

Configuring the infrastructure

I use docker compose to run infrastructure locally but in production you might use any infrastructure.

Telemetry docker-compose.yaml - I use this file only to set the container images and tags to pull.

version: "3.4"

services:
  seq:
    image: datalust/seq:latest
  zipkin:
    image: openzipkin/zipkin-slim
  otel-collector:
    image: ${REGISTRY:-daprinventory}/otelcollector:${TAG:-latest}
    build:
      context: ./packages/otel-collector
    depends_on:
      - grafana
      - pushgateway
  prometheus:
    image: prom/prometheus:v2.35.0
    restart: unless-stopped
    depends_on:
      - pushgateway
      - alertmanager
  alertmanager:
    image: prom/alertmanager:v0.24.0
    restart: unless-stopped
    depends_on:
      - pushgateway
  pushgateway:
    image: prom/pushgateway:v1.4.3
    restart: unless-stopped
  grafana:
    image: grafana/grafana:9.0.5
    restart: unless-stopped
    depends_on:
      - prometheus

These are the local overrides for development.

The most notable configuration here is the ports! So many ports. Find yourself a system with ranges you can roughly remember when exposing any required ports.

Pay close attention to the prometheus volumes and grafana volumes. They are a bit complex in how they're configured.

Alert manager and prometheus push gateway are additions to the prometheus service. They're not really required in development, especially for my little demo application but very likely are required in production.

version: "3.4"

services:
  seq:
    environment:
      - ACCEPT_EULA=Y
    ports:
      - "5340:80"
  zipkin:
    ports:
      - "5411:9411"
  prometheus:
    volumes:
      - ./.docker-compose/.persist/prometheus/runtime:/prometheus
      - ./packages/prometheus:/etc/prometheus
    command:
      - "--web.listen-address=0.0.0.0:9090"
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      # - "--web.console.libraries=/etc/prometheus/console_libraries"
      # - "--web.console.templates=/etc/prometheus/consoles"
      # - "--storage.tsdb.retention.time=200h"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
      - "--web.enable-remote-write-receiver"
      - "--web.page-title=DaprInventoryTimeseries"
      - "--log.level=debug"
    ports:
      - "9090:9090"
  alertmanager:
    volumes:
      - ./packages/alertmanager:/etc/alertmanager
    command:
      - "--config.file=/etc/alertmanager/config.yml"
      - "--storage.path=/alertmanager"
    restart: unless-stopped
    ports:
      - "9093:9093"
  pushgateway:
    expose:
      - "9091:9091"
    ports:
      - "9091:9091"
  grafana:
    volumes:
      - ./.docker-compose/.persist/grafana:/var/lib/grafana
      - ./packages/grafana:/etc/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=auser
      - GF_SECURITY_ADMIN_PASSWORD=apassword
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_LOG_LEVEL=info
    ports:
      - "9000:3000"
  otel-collector:
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "1888:1888" # pprof extension
      - "8888:8888" # Prometheus metrics exposed by the collector
      - "8889:8889" # Prometheus exporter metrics
      - "13133:13133" # health_check extension
      - "4317:4317" # OTLP gRPC receiver
      - "4318:4318" # OTLP http receiver
      - "55679:55679" # zpages extension
      - "5414:9411" # zipkin receiver

These are the main open telemetry configurations for the local development setup.

It's incredible that we can have all of this running locally, and have full telemetry locally.

We can switch to a managed service like AWS X-Ray for production environments just by changing some yaml configuration in the open telemetry collector.

Conclusion

Open telemetry is now supported by all of the major cloud providers and most observability providers.

Even though it is still in the incubating stage you should start moving telemetry to it, especially for any new distributed applications.

Some of the configuration is tricky but once it's working it's incredibly powerful. The instrumentation ecosystem is only going to get better as the entire industry converges on open telemetry.

Let me know if you have any questions about open telemetry on NestJs!

Darragh ORiordan