Securing APIs with OAuth2 Introspection

In today’s interconnected world, APIs are the backbone of modern apps. Protecting these APIs and ensuring only authorised users access sensitive data is now more crucial than ever. While many authentication and authorisation methods exist, OAuth2 Introspection stands out as a robust and flexible approach. In this post, we will explore what OAuth2 Introspection is, why we should use it, and how to implement it in our .NET apps.

Before we dive into the technical details, let’s remind ourselves why API security is so important. Think about it: APIs often handle the most sensitive stuff. If those APIs are not well protected, we are basically opening the door to some nasty consequences. Data breaches? Yep. Regulatory fines (GDPR, HIPAA, you name it)? Potentially. Not to mention, losing the trust of our users. A secure API shows that we value their data and are committed to keeping it safe. And, of course, it helps prevent the bad guys from exploiting vulnerabilities to steal data or cause all sorts of trouble.

The most common method of securing APIs is using access tokens as proof of authorization. These tokens, typically in the form of JWTs (JSON Web Tokens), are passed by the client to the API with each request. The API then needs a way to validate these tokens to verify that they are legitimate and haven’t been tampered with. This is where OAuth2 Introspection comes in.

OAuth2 Introspection

OAuth2 Introspection, standardised as OAuth 2.0 Token Introspection in RFC 7662, is a mechanism for validating bearer tokens in an OAuth2 environment. We can think of it as a secure lookup service for our access tokens. It allows an API to query an auth server, which is also the “issuer” of the token, to determine the validity and attributes of a given token.

The workflow of an OAuth2 Introspection request.

To illustrate the process, the diagram above visualises the flow of an OAuth2 Introspection request. The Client sends the bearer token to the Web API, which then forwards it to the auth server via the introspection endpoint. The auth server validates the token and returns a JSON response, which is then processed by the Web API. Finally, the Web API grants (or denies) access to the requested resource based on the token validity.
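To make the flow above more concrete, here is a minimal Python sketch of the two steps the Web API performs: building the introspection request and reading the answer. It assumes the auth server follows RFC 7662 (a form-encoded POST carrying a `token` parameter, and a JSON response with a boolean `active` field) and that the API authenticates with HTTP Basic credentials. The function names are hypothetical and only for illustration; the actual HTTP call is left out.

```python
import base64
import json
from urllib.parse import urlencode


def build_introspection_request(token: str, client_id: str, client_secret: str):
    """Return (headers, body) for a form-encoded introspection POST."""
    # Assumption: the auth server accepts HTTP Basic client authentication.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/json",
    }
    # "token" is required; "token_type_hint" is an optional hint for the server.
    body = urlencode({"token": token, "token_type_hint": "access_token"})
    return headers, body


def is_token_active(response_text: str) -> bool:
    """Grant access only when the response explicitly says "active": true."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # An unreadable answer is treated as a denial.
        return False
    return isinstance(payload, dict) and payload.get("active") is True
```

Treating anything other than an explicit `"active": true` as a denial keeps the API on the safe side when the auth server returns an error or an unexpected body.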

Introspection vs. Direct JWT Validation

You might be thinking, “Isn’t this just how we normally validate a JWT?” Well, yes… and no. What is the difference, and why is there a special term, “Introspection”, for this?

With direct JWT validation, we check the token ourselves, verifying its signature, expiry, and sometimes audience, without contacting the auth server. Introspection takes a different approach: we ask the auth server about the token status on each request. This leads to different pros and cons, which we will explore next.

With OAuth2 Introspection, we gain several key advantages. First, it works with various token formats (JWTs, opaque tokens, etc.) and auth server implementations. Furthermore, because the validation logic resides on the auth server, security policies are applied consistently and managed in one place. Most importantly, token revocation becomes straightforward (e.g., if a user changes their password or a client is compromised), because a revoked token is reported as inactive the next time it is introspected. In contrast, a self-contained JWT remains valid until it expires unless we build extra revocation machinery. The main cost is an extra network call to the auth server, which adds latency and makes the API dependent on the auth server being available.
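Because introspection means calling the auth server for each request, a common mitigation is to cache the result for a short time. The following Python sketch shows the idea with a small time-to-live cache. The class name, the `introspect` callback, and the injectable clock are all hypothetical, and a real implementation would likely store a hash of the token rather than the token itself.

```python
import time
from typing import Callable, Dict, Tuple


class IntrospectionCache:
    """Remember introspection results for ttl_seconds to avoid repeated calls."""

    def __init__(self, introspect: Callable[[str], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._introspect = introspect  # performs the real call to the auth server
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def is_active(self, token: str) -> bool:
        now = self._clock()
        cached = self._entries.get(token)
        if cached is not None and now < cached[1]:
            # Still fresh: answer from the cache without a network call.
            return cached[0]
        active = self._introspect(token)
        self._entries[token] = (active, now + self._ttl)
        return active
```

The trade-off is freshness: a revoked token can still be accepted until its cache entry expires, so the time-to-live should stay short.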

.NET Implementation

Now, let’s see how to implement OAuth2 Introspection in a .NET Web API using the AddOAuth2Introspection extension method from the IdentityModel.AspNetCore.OAuth2Introspection NuGet package.

The core configuration lives in our Program.cs file, where we set up the authentication and authorisation services.

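// ... (previous code for building the app)

builder.Services.AddAuthentication("Bearer")
    .AddOAuth2Introspection("Bearer", options =>
    {
        // The auth server endpoint that the API calls to validate each token.
        options.IntrospectionEndpoint = "<Auth server base URL>/connect/introspect";
        // Credentials the API uses to authenticate itself to the auth server.
        options.ClientId = "<Client ID>";
        options.ClientSecret = "<Client Secret>";

        // Allow plain HTTP to the auth server. Keep this true in production
        // unless the network path is otherwise trusted.
        options.DiscoveryPolicy = new IdentityModel.Client.DiscoveryPolicy
        {
            RequireHttps = false,
        };
    });

builder.Services.AddAuthorization();

// ... (rest of the Program.cs)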

The code above configures the authentication service to use the “Bearer” scheme, which is the standard for bearer tokens. AddOAuth2Introspection(…) is where the magic happens: it registers the OAuth2 Introspection authentication handler and points it at IntrospectionEndpoint, the URL our API sends each token to for validation.

Usually, RequireHttps should be true in production. However, when the API and the auth server are both deployed to the same Elastic Container Service (ECS) cluster and communicate internally within the AWS network, we can set it to false. In that setup, the Application Load Balancer (ALB) handles TLS termination and the internal communication between services happens over HTTP, so we disable RequireHttps in the DiscoveryPolicy. This simplifies the setup, and traffic from the outside world to our ALB is still secured by HTTPS. Bear in mind that the internal hop is then unencrypted, so this is only appropriate when we trust the network between the two services.

Finally, to secure our API endpoints and require authentication, we can simply use the [Authorize] attribute, as demonstrated below.

[ApiController]
[Route("[controller]")]
[Authorize]
public class MyController : ControllerBase
{
    [HttpGet("GetData")]
    public IActionResult GetData()
    {
        ...
    }
}

Wrap-Up

OAuth2 Introspection is a powerful and flexible approach for securing our APIs, providing a centralised way to validate bearer tokens and manage access. By understanding the process, implementing it correctly, and following best practices, we can significantly improve the security posture of our apps and protect our valuable data.

Observing Orchard Core: Traces with Grafana Tempo and ADOT

In the previous article, we discussed how to build a custom monitoring pipeline in which Grafana, running on Amazon EC2, receives metrics and logs, two of the three pillars of observability, from Orchard Core on Amazon ECS. Today, we will talk about the third pillar of observability: traces.

Source Code

The CloudFormation templates and relevant C# source code discussed in this article are available on GitHub as part of the Orchard Core Basics Companion (OCBC) Project: https://github.com/gcl-team/Experiment.OrchardCore.Main.

Lisa Jung, senior developer advocate at Grafana, talks about the three pillars in observability (Image Credit: Grafana Labs)

About Grafana Tempo

To capture and visualise traces, we will use Grafana Tempo, an open-source, scalable, and cost-effective tracing backend developed by Grafana Labs. Unlike other tracing tools, Tempo does not require an index, making it easy to operate and scale.

We choose Tempo because it is fully compatible with OpenTelemetry, the open standard for collecting distributed traces, which ensures flexibility and vendor neutrality. In addition, Tempo seamlessly integrates with Grafana, allowing us to visualise traces alongside metrics and logs in a single dashboard.

Finally, being a Grafana Labs project means Tempo has strong community backing and continuous development.

About OpenTelemetry

With a solid understanding of why Tempo is our tracing backend of choice, let’s now dive deeper into OpenTelemetry, the open-source framework we use to instrument our Orchard Core app and generate the trace data Tempo collects.

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project and a vendor-neutral, open standard for collecting traces, metrics, and logs from our apps. This makes it an ideal choice for building a flexible observability pipeline.

OpenTelemetry provides SDKs for instrumenting apps across many programming languages, including C# via the .NET SDK, which we use for Orchard Core.

OpenTelemetry uses the standard OTLP (OpenTelemetry Protocol) to send telemetry data to any compatible backend, such as Tempo, allowing seamless integration and interoperability.

Both Grafana Tempo and OpenTelemetry are projects under the CNCF umbrella. (Image Source: CNCF Cloud Native Interactive Landscape)

Setup Tempo on EC2 With CloudFormation

It is straightforward to deploy Tempo on EC2.

Let’s walk through the EC2 UserData script that installs and configures Tempo on the instance.

First, we download the Tempo release binary, extract it, move it to a proper system path, and ensure it is executable.

wget https://github.com/grafana/tempo/releases/download/v2.7.2/tempo_2.7.2_linux_amd64.tar.gz
tar -xzvf tempo_2.7.2_linux_amd64.tar.gz
mv tempo /usr/local/bin/tempo
chmod +x /usr/local/bin/tempo

Next, we create a basic Tempo configuration file at /etc/tempo.yaml to define how Tempo listens for traces and where it stores trace data.

echo "
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/traces
" > /etc/tempo.yaml

Let’s break down the configuration file above.

The http_listen_port sets the HTTP port (3200) for Tempo’s internal web server. This port is used for health checks and Prometheus metrics.

After that, we configure where Tempo listens for incoming trace data. In the configuration above, we enabled OTLP receivers via both gRPC and HTTP, the two protocols that OpenTelemetry SDKs and agents use to send data to Tempo. Here, the ports 4317 (gRPC) and 4318 (HTTP) are standard for OTLP.

Last but not least, for demonstration purposes, we use the simplest storage option, local storage, which writes trace data to the EC2 instance disk under /tmp/tempo/traces. This is fine for testing or small setups, but for production we will likely want to use a service like Amazon S3.

In addition, since we are using local storage on EC2, we can easily SSH into the EC2 instance and directly inspect whether traces are being written. This is incredibly helpful during debugging. What we need to do is to run the following command to see whether files are being generated when our Orchard Core app emits traces.

ls -R /tmp/tempo/traces

The configuration above is intentionally minimal. As our setup grows, we can explore advanced options like remote storage, multi-tenancy, or even scaling with Tempo components.

Each flushed trace block (folder with UUID) contains a data.parquet file, which holds the actual trace data.
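If we prefer a script over reading the `ls -R` output by eye, the same check can be sketched in Python. The sketch below walks the storage path and lists every folder that contains a data.parquet file. The function name is hypothetical, and it makes no assumption about how deep the block folders sit under the storage path.

```python
import os
from typing import List


def find_trace_blocks(root: str) -> List[str]:
    """Return the folders under root that contain a data.parquet file."""
    blocks = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "data.parquet" in filenames:
            blocks.append(dirpath)
    return sorted(blocks)


# Usage on the Tempo instance (path taken from the configuration above):
#   for block in find_trace_blocks("/tmp/tempo/traces"):
#       print(block)
```

An empty result after the app has been emitting traces for a while suggests that the traces are not reaching Tempo, or that no block has been flushed to disk yet.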

Finally, we create a systemd unit file so that Tempo starts on boot and automatically restarts if it crashes.

cat <<EOF > /etc/systemd/system/tempo.service
[Unit]
Description=Grafana Tempo service
After=network.target

[Service]
ExecStart=/usr/local/bin/tempo -config.file=/etc/tempo.yaml
Restart=always
RestartSec=5
User=root
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reexec
systemctl daemon-reload
systemctl enable --now tempo

This systemd service ensures that Tempo runs in the background and automatically starts up after a reboot or a crash. This setup is crucial for a resilient observability pipeline.

Did You Know: When we SSH into an EC2 instance running Amazon Linux 2023, we will be greeted by a cockatiel in ASCII art! (Image Credit: OMG! Linux)

Understanding OTLP Transport Protocols

In the previous section, we configured Tempo to receive OTLP data over both gRPC and HTTP. Both transport protocols are supported by OTLP, and each comes with its own strengths and trade-offs. Let’s break them down.

Ivy Zhuang from Google gave a presentation on gRPC and Protobuf at gRPConf 2024. (Image Credit: gRPC YouTube)

Tempo has native support for gRPC, and many OpenTelemetry SDKs default to using it. gRPC is a modern, high-performance transport protocol built on top of HTTP/2. It is the preferred option when performance is critical. gRPC also supports streaming, which makes it ideal for high-throughput scenarios where telemetry data is sent continuously.

However, gRPC is not natively supported in browsers, so it is not ideal for frontend or web-based telemetry collection unless a proxy or gateway is used. In such scenarios, we will normally choose HTTP which is browser-friendly. HTTP is a more traditional request/response protocol that works well in restricted environments.

Since we are collecting telemetry from a server-side app, namely Orchard Core running on ECS, gRPC is typically the better choice due to its performance benefits and native support in Tempo.

Please note that gRPC requires HTTP/2, and some environments, for example IoT devices and embedded systems, might not have mature gRPC client support. For that reason, OTLP over HTTP is often preferred in simpler or constrained systems.

Daniel Stenberg, Senior Network Engineer at Mozilla, sharing about HTTP/2 at GOTO Copenhagen 2015. (Image Credit: GOTO Conferences YouTube)

gRPC allows multiplexing over a single connection using HTTP/2. Hence, in gRPC, all telemetry signals, i.e. logs, metrics, and traces, can be sent concurrently over one connection. However, with HTTP, each telemetry signal needs a separate POST request to its own endpoint as listed below to enforce clean schema boundaries, simplify implementation, and stay aligned with HTTP semantics.

  • Logs: /v1/logs;
  • Metrics: /v1/metrics;
  • Traces: /v1/traces.

In HTTP, since each signal has its own POST endpoint with its own protobuf schema in the body, there is no need for the receiver to guess what is in the body.
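The per-signal endpoints listed above can be captured in a small lookup. The following Python sketch builds the full OTLP/HTTP URL for a given signal. The function name and the example host are hypothetical; only the three paths come from the list above.

```python
OTLP_HTTP_PATHS = {
    "logs": "/v1/logs",
    "metrics": "/v1/metrics",
    "traces": "/v1/traces",
}


def otlp_http_url(base_url: str, signal: str) -> str:
    """Return the OTLP/HTTP endpoint for one telemetry signal."""
    try:
        path = OTLP_HTTP_PATHS[signal]
    except KeyError:
        raise ValueError(f"unknown telemetry signal: {signal!r}") from None
    # Tolerate a trailing slash on the base URL.
    return base_url.rstrip("/") + path


# Example with a hypothetical host:
#   otlp_http_url("http://tempo.example.internal:4318", "traces")
```

Since Tempo is a tracing backend, only the traces endpoint is relevant when sending data to it over HTTP.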

AWS Distro for OpenTelemetry (ADOT)

Now that we have Tempo running on EC2 and understand the OTLP protocols it supports, the next step is to instrument our Orchard Core app to generate and send trace data.

The following code snippet shows what a typical direct integration with Tempo might look like in an Orchard Core app.

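using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services
    .AddOpenTelemetry()
    // Name under which this app's traces appear in Tempo.
    .ConfigureResource(resource => resource.AddService(serviceName: "cld-orchard-core"))
    .WithTracing(tracing => tracing
        // Create a span for every incoming HTTP request.
        .AddAspNetCoreInstrumentation()
        // Send spans straight to Tempo's OTLP gRPC receiver.
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://<tempo-ec2-host>:4317");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        })
        // Also print spans to the console, which helps while debugging.
        .AddConsoleExporter());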

This approach works well for simple use cases during the development stage, but it comes with trade-offs that are worth considering. Firstly, we couple our app directly to the observability backend, which reduces flexibility. Secondly, central management becomes harder when we scale to many services or environments.

This is where AWS Distro for OpenTelemetry (ADOT) comes into play.

The ADOT collector. (Image credit: ADOT technical docs)

ADOT is a secure, AWS-supported distribution of the OpenTelemetry project that simplifies collecting and exporting telemetry data from apps running on AWS services, for example our Orchard Core on ECS now. ADOT decouples our apps from the observability backend, provides centralised configuration, and handles telemetry collection more efficiently.
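One practical consequence of this decoupling is that the app no longer needs to know where the backend lives. It only needs the address of a collector, which can be supplied through the environment. The Python sketch below illustrates the idea using OTEL_EXPORTER_OTLP_ENDPOINT, the environment variable defined by the OpenTelemetry specification for the OTLP endpoint. The function name and the fallback address (a collector assumed to listen locally on the default OTLP gRPC port) are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

# Assumption: a collector listening next to the app on the default OTLP gRPC port.
DEFAULT_OTLP_ENDPOINT = "http://localhost:4317"


def resolve_otlp_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the OTLP endpoint from the environment, falling back to the default."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    return endpoint or DEFAULT_OTLP_ENDPOINT
```

With this in place, pointing the app at a different collector or backend becomes a deployment setting rather than a code change.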

Sidecar Pattern

We can deploy the ADOT in several ways, such as running it on a dedicated node or ECS service to receive telemetry from multiple apps. We can also take the sidecar approach which cleanly separates concerns. Our Orchard Core app will focus on business logic, while a nearby ADOT sidecar handles telemetry collection and forwarding. This mirrors modern cloud-native patterns and gives us more flexibility down the road.

The sidecar pattern running in Amazon ECS. (Image Credit: AWS Open Source Blog)

The following CloudFormation snippet shows how we deploy ADOT as a sidecar in ECS. The collector config is stored in AWS Systems Manager Parameter Store under /myapp/otel-collector-config and injected via the AOT_CONFIG_CONTENT environment variable. This keeps the collector configuration out of the task definition and decoupled from the app.

ecsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Ref ServiceName
    NetworkMode: awsvpc
    ExecutionRoleArn: !GetAtt ecsTaskExecutionRole.Arn
    TaskRoleArn: !GetAtt iamRole.Arn
    ContainerDefinitions:
      - Name: !Ref ServiceName
        Image: !Ref OrchardCoreImage
        ...

      - Name: adot-collector
        Image: public.ecr.aws/aws-observability/aws-otel-collector:latest
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Sub "/ecs/${ServiceName}-log-group"
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: adot
        Essential: false
        Cpu: 128
        Memory: 512
        HealthCheck:
          Command: ["/healthcheck"]
          Interval: 30
          Timeout: 5
          Retries: 3
          StartPeriod: 60
        Secrets:
          - Name: AOT_CONFIG_CONTENT
            ValueFrom: !Sub "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/otel-collector-config"

Deploy an ADOT sidecar on ECS to collect observability data from Orchard Core.

There are several interesting and important details in the CloudFormation snippet above that are worth calling out. Let’s break them down one by one.

Firstly, we choose awsvpc as the NetworkMode of the ECS task. In awsvpc mode, each ECS task receives its own ENI (Elastic Network Interface), which is great for network-level isolation between tasks. The containers within a task, i.e. our Orchard Core container and the ADOT sidecar, share that network interface, so our Orchard Core app can reach the sidecar over localhost, i.e. http://localhost:4317.

Secondly, we include a health check for the ADOT container. ECS runs this check periodically and reports when the collector becomes unhealthy, which makes problems visible without manual inspection. Because the sidecar is marked Essential: false in the snippet above, an unhealthy collector does not by itself stop the task. In November 2022, Paurush Garg from AWS added the healthcheck component with the new ADOT collector release, so we can simply specify that we will be using this healthcheck component in the configuration that we will discuss next.

Yes, the configuration! Instead of hardcoding the ADOT configuration into the task definition, we inject it securely at runtime using the AOT_CONFIG_CONTENT secret. This environment variable AOT_CONFIG_CONTENT is designed to enable us to configure the ADOT collector. It will override the config file used in the ADOT collector entrypoint command.
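The collector configuration itself is not reproduced in this article, so the following Python sketch only illustrates the general shape such a configuration could take: an OTLP receiver for the app, an OTLP exporter pointing at Tempo, the health check extension, and a traces pipeline that connects them. Every value here is an assumption, including the function name, the plain-text (insecure) connection to Tempo, and the choice to print JSON, which relies on JSON being a subset of YAML so that the result should be usable as the parameter value.

```python
import json


def build_collector_config(tempo_endpoint: str) -> dict:
    """Return a minimal, illustrative collector configuration as a dict."""
    return {
        # Lets the /healthcheck command in the task definition query the collector.
        "extensions": {"health_check": {}},
        # Accept traces from the app over OTLP gRPC.
        "receivers": {"otlp": {"protocols": {"grpc": {"endpoint": "0.0.0.0:4317"}}}},
        # Forward traces to Tempo's OTLP gRPC receiver (host:port, no TLS assumed).
        "exporters": {"otlp": {"endpoint": tempo_endpoint, "tls": {"insecure": True}}},
        "service": {
            "extensions": ["health_check"],
            "pipelines": {"traces": {"receivers": ["otlp"], "exporters": ["otlp"]}},
        },
    }


def render_collector_config(tempo_endpoint: str) -> str:
    """Serialise the configuration for storage in Parameter Store."""
    return json.dumps(build_collector_config(tempo_endpoint), indent=2)
```

Keeping the configuration in one parameter means that the exporter target can be changed without rebuilding the app image.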

The SSM Parameter for the environment variable AOT_CONFIG_CONTENT.

Wrap-Up

By now, we have completed the journey of setting up Grafana Tempo on EC2, exploring how traces flow through OTLP protocols like gRPC and HTTP, and understanding why ADOT is often the better choice in production-grade observability pipelines.

With everything connected, our Orchard Core app is now able to send traces into Tempo reliably. This will give us end-to-end visibility with OpenTelemetry and AWS-native tooling.

Observing Orchard Core: Metrics and Logs with Grafana and Amazon CloudWatch

I recently deployed an Orchard Core app on Amazon ECS and wanted to gain better visibility into its performance and health.

Instead of relying solely on basic Amazon CloudWatch metrics, I decided to build a custom monitoring pipeline, defined with CloudFormation, in which Grafana running on Amazon EC2 receives metrics and EMF (Embedded Metric Format) logs sent from Orchard Core on ECS.

In this post, I will walk through how I set this up from scratch, what challenges I faced, and how you can do the same.

Source Code

The CloudFormation templates and relevant C# source code discussed in this article are available on GitHub as part of the Orchard Core Basics Companion (OCBC) Project: https://github.com/gcl-team/Experiment.OrchardCore.Main.

Why Grafana?

In the previous post, where we set up Orchard Core on ECS, we talked about how we can send metrics and logs to CloudWatch. While it is true that CloudWatch offers us out-of-the-box infrastructure metrics and AWS-native alarms and logs, the dashboards CloudWatch provides are limited and not as customisable. Managing observability with just CloudWatch gets tricky when our apps span multiple AWS regions, accounts, or other cloud environments.

The GrafanaLive event in Singapore in September 2023. (Event Page)

If we are looking for a solution that is not tied to a single vendor like AWS, Grafana can be one of the options. Grafana is an open-source visualisation platform that lets teams monitor real-time metrics from multiple sources, like CloudWatch, X-Ray, Prometheus and so on, all in unified dashboards. It is lightweight, extensible, and ideal for observability in cloud-native environments.

Is Grafana the only solution? Definitely not! However, personally I still prefer Grafana because it is open-source and free to start. In this blog post, we will also see how easy it is to host Grafana on EC2 and integrate it directly with CloudWatch with no extra agents needed.

Three Pillars of Observability

In observability, there are three pillars, i.e. logs, metrics, and traces.

Lisa Jung, senior developer advocate at Grafana, talks about the three pillars in observability (Image Credit: Grafana Labs)

Firstly, logs are text records that capture events happening in the system.

Secondly, metrics are numeric measurements tracked over time, such as HTTP status code counts, response times, or ECS CPU and memory utilisation rates.

Finally, traces show the path a request takes as it moves through the system. Together, these three pillars form a strong observability foundation which can help us to identify issues faster, reduce downtime, and improve system reliability. This will ultimately support better user experience for our apps.

This is where we need a tool like Grafana because Grafana assists us to visualise, analyse, and alert based on our metrics, making observability practical and actionable.

Setup Grafana on EC2 with CloudFormation

It is straightforward to install Grafana on EC2.

Firstly, let’s define the security group that we will use for the EC2 instance.

ec2SecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow access to the EC2 instance hosting Grafana
    VpcId: {"Fn::ImportValue": !Sub "${CoreNetworkStackName}-${AWS::Region}-vpcId"}
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 22
        ToPort: 22
        CidrIp: 0.0.0.0/0 # Caution: SSH open to public, restrict as needed
      - IpProtocol: tcp
        FromPort: 3000
        ToPort: 3000
        CidrIp: 0.0.0.0/0 # Caution: Grafana open to public, restrict as needed
    Tags:
      - Key: Stack
        Value: !Ref AWS::StackName

The VPC ID is imported from the common network stack, cld-core-network, which we set up earlier. Please refer to the cld-core-network stack for details.

Please notice that, for demo purposes, both SSH (port 22) and Grafana (port 3000) are open to the world (0.0.0.0/0). It is important to protect access to the EC2 instance by adding a bastion host, VPN, or IP restriction later.

In addition, SSH should only be opened temporarily. It is there for when we need to log in to the EC2 instance and troubleshoot the Grafana installation manually.

Now, we can proceed to set up the EC2 instance with Grafana installed using the CloudFormation resource below.

ec2Instance:
  Type: AWS::EC2::Instance
  Properties:
    InstanceType: !Ref InstanceType
    ImageId: !Ref Ec2Ami
    NetworkInterfaces:
      - AssociatePublicIpAddress: true
        DeviceIndex: 0
        SubnetId: {"Fn::ImportValue": !Sub "${CoreNetworkStackName}-${AWS::Region}-publicSubnet1Id"}
        GroupSet:
          - !Ref ec2SecurityGroup
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        yum update -y
        yum install -y wget unzip
        wget https://dl.grafana.com/oss/release/grafana-10.1.0-1.x86_64.rpm
        yum install -y grafana-10.1.0-1.x86_64.rpm
        systemctl enable --now grafana-server
    Tags:
      - Key: Name
        Value: "Observability-Instance"

In the CloudFormation template above, we expect our users to access the Grafana dashboard directly over the Internet. Hence, we put the EC2 instance in a public subnet and assign an Elastic IP (EIP) to it, as demonstrated below, so that Grafana has a consistent, publicly accessible static IP.

ecsEip:
  Type: AWS::EC2::EIP

ec2EIPAssociation:
  Type: AWS::EC2::EIPAssociation
  Properties:
    AllocationId: !GetAtt ecsEip.AllocationId
    InstanceId: !Ref ec2Instance

For production systems, placing instances in public subnets and exposing them with a public IP requires us to have strong security measures in place. Otherwise, it is recommended to place our Grafana EC2 instance in a private subnet and expose it through an Application Load Balancer (ALB), with a NAT Gateway handling the instance’s outbound traffic, to reduce the attack surface.

Pump CloudWatch Metrics to Grafana

Grafana supports CloudWatch as a native data source.

With the appropriate AWS credentials and region, we can use an Access Key ID and Secret Access Key to grant Grafana access to CloudWatch. The user that the credentials belong to must have the AmazonGrafanaCloudWatchAccess policy.

The user that Grafana uses to access CloudWatch must have the AmazonGrafanaCloudWatchAccess policy.

However, using an AWS Access Key/Secret in the Grafana data source connection details is less secure and not ideal for EC2 setups. In addition, AmazonGrafanaCloudWatchAccess is a managed policy optimised for running Grafana as a managed service within AWS. Thus, it is recommended to create our own custom policy so that we can limit the permissions to only what is needed, as demonstrated with the following CloudFormation template.

ec2InstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Principal:
            Service: ec2.amazonaws.com
          Action: sts:AssumeRole

    Policies:
      - PolicyName: EC2MetricsAndLogsPolicy
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Sid: AllowReadingMetricsFromCloudWatch
              Effect: Allow
              Action:
                - cloudwatch:ListMetrics
                - cloudwatch:GetMetricData
              Resource: "*"
            - Sid: AllowReadingLogsFromCloudWatch
              Effect: Allow
              Action:
                - logs:DescribeLogGroups
                - logs:GetLogGroupFields
                - logs:StartQuery
                - logs:StopQuery
                - logs:GetQueryResults
                - logs:GetLogEvents
              Resource: "*"

Again, using our custom policy provides better control and follows the best practices of least privilege.

With IAM role, we do not need to provide AWS Access Key/Secret in Grafana connection details for CloudWatch as a data source.

Visualising ECS Service Metrics

Now that Grafana is configured to pull data from CloudWatch, ECS metrics like CPUUtilization and MemoryUtilization are available. We can proceed to create a dashboard and select the right namespace as well as the right metric name.

Setting up the diagram for memory utilisation of our Orchard Core app in our ECS cluster.

In the following dashboard, we show memory and CPU utilisation rates because they help us ensure that our ECS services are performing within safe limits, neither overusing nor underutilising resources.

Both ECS service metrics and container insights are displayed on Grafana dashboard.

Visualising ECS Container Insights Metrics

ECS Container Insights Metrics are deeper metrics like task counts, network I/O, storage I/O, and so on.

In the dashboard above, we can also see the Task Count. Task Count helps us make sure our services are running the right number of instances at all times.

Task Count by itself is not a cost metric, but if we consistently see high task counts with low CPU/memory usage, it indicates we can potentially consolidate workloads and reduce costs.

Instrumenting Orchard Core to Send Custom App Metrics

Now that we have seen how ECS metrics are visualised in Grafana, let’s move on to instrumenting our Orchard Core app to send custom app-level metrics. This will give us deeper visibility into what our app is really doing.

Metrics should be tied to business objectives. It’s crucial that the metrics you collect align with KPIs that can drive decision-making.

Metrics should be actionable. The collected data should help identify where to optimise, what to improve, and how to make decisions. For example, by tracking app-metrics such as response time and HTTP status codes, we gain insight into both performance and reliability of our Orchard Core. This allows us to catch slowdowns or failures early, improving user satisfaction.

SLA vs SLO vs SLI: Key Differences in Service Metrics (Image Credit: Atlassian)

By tracking response times and HTTP code counts at the endpoint level, we are measuring SLIs that are necessary to monitor if we are meeting our SLOs.

With clear SLOs and SLIs, we can then focus on what really matters from a performance and reliability perspective. For example, a common SLO could be “99.9% of requests to our Orchard Core API endpoints must be processed within 500ms.”
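To show how an SLI relates to the SLO quoted above, here is a small Python sketch. It computes the share of requests served within the latency threshold (the SLI) and compares it against the 99.9% target (the SLO). The function names and the sample numbers are made up for illustration.

```python
from typing import Sequence


def latency_sli(response_times_ms: Sequence[float], threshold_ms: float = 500.0) -> float:
    """Fraction of requests processed within threshold_ms."""
    if not response_times_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for t in response_times_ms if t <= threshold_ms)
    return fast / len(response_times_ms)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """True when the measured SLI reaches the SLO target."""
    return sli >= target


# Example: 998 fast requests and 2 slow ones out of 1,000.
sample = [120.0] * 998 + [700.0, 900.0]
sli = latency_sli(sample)  # 998 of 1,000 requests are within 500ms
slo_met = meets_slo(sli)   # below the 99.9% target, so the SLO is missed
```

In this example, just two slow requests out of a thousand are enough to miss a 99.9% target, which is why per-endpoint response times are worth tracking.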

In terms of sending custom app-level metrics from our Orchard Core to CloudWatch and then to Grafana, there are many approaches depending on our use case. If we are looking for simplicity and speed, CloudWatch SDK and EMF are definitely the easiest and most straightforward methods we can use to get started with sending custom metrics from Orchard Core to CloudWatch, and then visualising them in Grafana.

Using CloudWatch SDK to Send Metrics

We will start with creating a middleware called EndpointStatisticsMiddleware with AWSSDK.CloudWatch NuGet package referenced. In the middleware, we create a MetricDatum object to define the metric that we want to send to CloudWatch.

var metricData = new MetricDatum
{
    MetricName = metricName,
    Value = value,
    Unit = StandardUnit.Count,
    Dimensions = new List<Dimension>
    {
        new Dimension
        {
            Name = "Endpoint",
            Value = endpointPath
        }
    }
};

var request = new PutMetricDataRequest
{
    Namespace = "Experiment.OrchardCore.Main/Performance",
    MetricData = new List<MetricDatum> { metricData }
};

In the code above, we see new concepts like Namespace, Metric, and Dimension. They are foundational in CloudWatch. We can think of them as ways to organize and label our data to make it easy to find, group, and analyse.

  • Namespace: A container or category for our metrics. It helps to group related metrics together;
  • Metric: A series of data points for the thing we are measuring. In our example, the metrics could be Http2xxCount and Http4xxCount;
  • Dimension: A key-value pair that adds context to a metric.

If we do not define the Namespace, Metric, and Dimensions carefully when we send data, Grafana later will not find them, or our charts on the dashboards will be very messy and hard to filter or analyse.

In addition, as shown in the code above, we are capturing the HTTP status code for our Orchard Core endpoints. We then call PutMetricDataAsync to send the PutMetricDataRequest to CloudWatch asynchronously.
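The middleware logic that turns a status code into a metric name is not shown here, so the following Python sketch is only one plausible way to do it. It groups status codes by class, producing names such as Http2xxCount and Http4xxCount, and builds a payload shaped like the parameters of the CloudWatch PutMetricData API. The function names and the grouping rule are assumptions, and the sketch stops short of calling AWS.

```python
DEFAULT_NAMESPACE = "Experiment.OrchardCore.Main/Performance"


def status_metric_name(status_code: int) -> str:
    """Map an HTTP status code to a per-class metric name, e.g. 404 -> Http4xxCount."""
    if not 100 <= status_code <= 599:
        raise ValueError(f"not an HTTP status code: {status_code}")
    return f"Http{status_code // 100}xxCount"


def build_put_metric_data(endpoint_path: str, status_code: int,
                          namespace: str = DEFAULT_NAMESPACE) -> dict:
    """Build one count datapoint with the endpoint as its only dimension."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": status_metric_name(status_code),
                "Value": 1,
                "Unit": "Count",
                "Dimensions": [{"Name": "Endpoint", "Value": endpoint_path}],
            }
        ],
    }
```

Grouping by status class keeps the number of distinct metric names small, which also keeps the dashboards easy to filter.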

The HTTP status codes of each of our Orchard Core endpoints are now captured on CloudWatch.

In Grafana, when we configure a CloudWatch panel to show the HTTP status codes for each endpoint, the first thing we select is the Namespace, which is Experiment.OrchardCore.Main/Performance in our example. The Namespace tells Grafana which group of metrics to query.

After picking the Namespace, Grafana lists the available Metrics inside that Namespace. We pick the Metrics we want to plot, such as Http2xxCount and Http4xxCount. Finally, since we are tracking metrics by endpoint, we set the Dimension to Endpoint and select the specific endpoint we are interested in, as shown in the following screenshot.

Using EMF to Send Metrics

While using the CloudWatch SDK works well for sending individual metrics, EMF (Embedded Metric Format) offers a more powerful and scalable way to log structured metrics directly from our app logs.

Before we can use EMF, we must first ensure that the Orchard Core application logs from our ECS tasks are correctly sent to CloudWatch Logs. This is done by configuring the LogConfiguration inside the ECS TaskDefinition as we discussed last time.

# Unit 12: ECS Task Definition and Service
ecsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    ...
    ContainerDefinitions:
      - Name: !Ref ServiceName
        Image: !Ref OrchardCoreImage
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Sub "/ecs/${ServiceName}-log-group"
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: ecs
        ...

Once the ECS task is sending logs to CloudWatch Logs, we can start embedding custom metrics into the logs using EMF.

Instead of pushing metrics directly using the CloudWatch SDK, we write structured JSON messages into the container logs. CloudWatch then automatically detects these EMF messages and converts them into CloudWatch Metrics.

The following shows what a simple EMF log message looks like.

{
  "_aws": {
    "Timestamp": 1745653519000,
    "CloudWatchMetrics": [
      {
        "Namespace": "Experiment.OrchardCore.Main/Performance",
        "Dimensions": [["Endpoint"]],
        "Metrics": [
          { "Name": "ResponseTimeMs", "Unit": "Milliseconds" }
        ]
      }
    ]
  },
  "Endpoint": "/api/v1/packages",
  "ResponseTimeMs": 142
}

When a log message reaches CloudWatch Logs, CloudWatch scans the text and looks for a valid _aws JSON object anywhere inside the message. Thus, even if our log line has extra text before or after, as long as the EMF JSON is properly formatted, CloudWatch extracts it and publishes the metrics automatically.

An example of a log with EMF JSON in it on CloudWatch.

After CloudWatch extracts the EMF block from our log message, it automatically turns it into a proper CloudWatch Metric. These metrics are then queryable just like any normal CloudWatch metric and thus available inside Grafana too, as shown in the screenshot below.

Metrics extracted from logs containing EMF JSON are automatically turned into metrics that can be visualised in Grafana just like any other metric.

As we can see, using EMF is easier than going the CloudWatch SDK route because we do not need to change or add extra AWS infrastructure. With EMF, all our app does is write specially formatted JSON logs.

CloudWatch then automatically extracts the metrics from the logs that contain EMF JSON. The entire process requires no new service, no special SDK code, and no CloudWatch PutMetricData API calls.
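To make the idea concrete, here is a minimal sketch of how an app could produce the EMF message shown earlier using only System.Text.Json. The class name EmfLogger and the method BuildResponseTimeLog are hypothetical names chosen for illustration. The sketch assumes the same Namespace, Dimension, and metric name as in our example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper (not part of any AWS SDK) that builds one EMF log line.
public static class EmfLogger
{
    public static string BuildResponseTimeLog(
        string endpoint, double responseTimeMs, DateTimeOffset timestamp)
    {
        var payload = new Dictionary<string, object>
        {
            // The _aws object is the metadata that describes the metrics.
            ["_aws"] = new Dictionary<string, object>
            {
                ["Timestamp"] = timestamp.ToUnixTimeMilliseconds(),
                ["CloudWatchMetrics"] = new object[]
                {
                    new Dictionary<string, object>
                    {
                        ["Namespace"] = "Experiment.OrchardCore.Main/Performance",
                        ["Dimensions"] = new[] { new[] { "Endpoint" } },
                        ["Metrics"] = new object[]
                        {
                            new Dictionary<string, object>
                            {
                                ["Name"] = "ResponseTimeMs",
                                ["Unit"] = "Milliseconds"
                            }
                        }
                    }
                }
            },
            // The dimension value and the metric value sit at the root level.
            ["Endpoint"] = endpoint,
            ["ResponseTimeMs"] = responseTimeMs
        };

        return JsonSerializer.Serialize(payload);
    }
}
```

Writing the returned string to standard output, for example with Console.WriteLine(EmfLogger.BuildResponseTimeLog("/api/v1/packages", 142, DateTimeOffset.UtcNow)), is enough for the awslogs driver to forward it to CloudWatch Logs.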

Cost Optimisation with Logs vs Metrics

Logs are generally more expensive than metrics, especially when we store large amounts of data over time. The cost grows further when the logs are more detailed and kept for a longer retention period, because both increase the storage required.

Metrics are cheaper to store because they are aggregated data points that do not require the same level of detail as logs.

CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. However, compared to logs, metrics are still usually much cheaper at scale.

By embedding metrics into our log data via EMF, we are piggybacking metrics onto logs and letting CloudWatch extract the metrics without duplicating effort. However, when using EMF, we pay for both, i.e.

  1. Log ingestion and storage (for the raw logs);
  2. The extracted custom metric (for the metric).

Hence, when we are leveraging EMF, we should consider expiring the logs sooner if we only need the extracted metrics in the long term.

Granularity and Sampling

Granularity refers to how frequently the metric data is collected. Fine granularity provides more detailed insights but can lead to increased data volume and costs.

Sampling is a technique to reduce the amount of data collected by capturing only a subset of data points (especially helpful in high-traffic systems). However, the challenge is ensuring that we keep enough data to make informed decisions while keeping storage and processing costs manageable.
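As a rough illustration of sampling, the sketch below records only one out of every N observations. EveryNthSampler is a hypothetical class for this post, not something provided by AWS or ASP.NET Core. Deterministic every-Nth sampling is just one of several possible strategies.

```csharp
using System;
using System.Threading;

// Hypothetical sampler: lets through the 1st, (N+1)th, (2N+1)th, ... call.
public sealed class EveryNthSampler
{
    private readonly int _n;
    private long _counter;

    public EveryNthSampler(int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));
        _n = n;
    }

    // Interlocked keeps the counter correct when requests run on many threads.
    public bool ShouldSample() =>
        (Interlocked.Increment(ref _counter) - 1) % _n == 0;
}
```

With new EveryNthSampler(10), only every tenth response time would be recorded. If we sample counts this way, we need to scale them back up by N when reading the charts, so sampling tends to suit measurements such as latency better than exact counters.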

In our Orchard Core app above, the middleware currently calls PutMetricDataAsync on every request. This not only slows down our API but also costs more, because we pay for the requests that send custom metrics to CloudWatch. Thus, we usually “buffer” the metrics first and then send them in batches periodically. This can be done with, for example, a hosted service built on BackgroundService in ASP.NET Core, which flushes the metrics at a fixed interval.

using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;
using System.Collections.Concurrent;

public class MetricsPublisher(
    IAmazonCloudWatch cloudWatch,
    IOptions<MetricsOptions> options,
    ILogger<MetricsPublisher> logger) : BackgroundService
{
    // IOptions<T> exposes the configured settings through its Value property.
    private readonly MetricsOptions _options = options.Value;

    // Thread-safe buffer holding the metrics that are waiting to be flushed.
    private readonly ConcurrentBag<MetricDatum> _pendingMetrics = new();

    public void TrackMetric(string metricName, double value, string endpointPath)
    {
        _pendingMetrics.Add(new MetricDatum
        {
            MetricName = metricName,
            Value = value,
            Unit = StandardUnit.Count,
            Dimensions = new List<Dimension>
            {
                new Dimension
                {
                    Name = "Endpoint",
                    Value = endpointPath
                }
            }
        });
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        logger.LogInformation("MetricsPublisher started.");
        while (!stoppingToken.IsCancellationRequested)
        {
            // Wait for the configured interval, then send whatever has been buffered.
            await Task.Delay(TimeSpan.FromSeconds(_options.FlushIntervalSeconds), stoppingToken);
            await FlushMetricsAsync();
        }
    }

    private async Task FlushMetricsAsync()
    {
        if (_pendingMetrics.IsEmpty) return;

        const int MaxMetricsPerRequest = 1000;

        // Take at most MaxMetricsPerRequest items; the rest wait for the next flush.
        var metricsToSend = new List<MetricDatum>();
        var metricsCount = 0;
        while (_pendingMetrics.TryTake(out var datum))
        {
            metricsToSend.Add(datum);

            metricsCount += 1;
            if (metricsCount >= MaxMetricsPerRequest) break;
        }

        var request = new PutMetricDataRequest
        {
            Namespace = _options.Namespace,
            MetricData = metricsToSend
        };

        int attempt = 0;
        while (attempt < _options.MaxRetryAttempts)
        {
            try
            {
                await cloudWatch.PutMetricDataAsync(request);
                logger.LogInformation("Flushed {Count} metrics to CloudWatch.", metricsToSend.Count);
                break;
            }
            catch (Exception ex)
            {
                attempt++;
                logger.LogWarning(ex, "Failed to flush metrics. Attempt {Attempt}/{MaxAttempts}", attempt, _options.MaxRetryAttempts);
                if (attempt < _options.MaxRetryAttempts)
                    await Task.Delay(TimeSpan.FromSeconds(_options.RetryDelaySeconds));
                else
                    logger.LogError("Max retry attempts reached. Dropping {Count} metrics.", metricsToSend.Count);
            }
        }
    }

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        // Send the remaining buffered metrics before the service shuts down.
        logger.LogInformation("MetricsPublisher stopping.");
        await FlushMetricsAsync();
        await base.StopAsync(cancellationToken);
    }
}

In our Orchard Core API, each incoming HTTP request may run on a different thread. Hence, we need a thread-safe data structure like ConcurrentBag for storing the pending metrics.

Please take note that ConcurrentBag is designed to be an unordered collection. It does not maintain the order of insertion when items are taken from it. However, since the metrics we are sending are counts of HTTP status codes, the order in which the requests were processed does not matter.

In addition, the limit of MetricData that we can send to CloudWatch per request is 1,000. Thus, we have the constant MaxMetricsPerRequest to help us make sure that we retrieve and remove at most 1,000 metrics from the ConcurrentBag.
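One consequence of this cap is that a single flush sends at most one request, so anything beyond the first 1,000 buffered items waits for the next interval. The sketch below is an assumption on our side and not part of the implementation above. It shows one way to drain the whole bag into batches that each respect the limit. BatchDrainer and DrainInBatches are hypothetical names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical helper that splits buffered items into request-sized batches.
public static class BatchDrainer
{
    // Takes items until the bag is empty and groups them into
    // batches of at most maxBatchSize items each.
    public static List<List<T>> DrainInBatches<T>(ConcurrentBag<T> bag, int maxBatchSize)
    {
        if (maxBatchSize < 1) throw new ArgumentOutOfRangeException(nameof(maxBatchSize));

        var batches = new List<List<T>>();
        var current = new List<T>(maxBatchSize);

        while (bag.TryTake(out var item))
        {
            current.Add(item);
            if (current.Count == maxBatchSize)
            {
                batches.Add(current);
                current = new List<T>(maxBatchSize);
            }
        }

        // Keep the last, partially filled batch if there is one.
        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

Each returned batch could then be sent as its own PutMetricDataRequest. Whether that is worth the extra requests per flush depends on how much traffic the app receives between intervals.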

Finally, we can inject MetricsPublisher into our middleware EndpointStatisticsMiddleware so that it automatically tracks every API request.

Wrap-Up

In this post, we started by setting up Grafana on EC2 and connecting it to CloudWatch to visualise ECS metrics. After that, we explored two ways, i.e. the CloudWatch SDK and EMF logs, to send custom app-level metrics from our Orchard Core app.

Whether we are monitoring system health or reporting on business KPIs, Grafana with CloudWatch offers a powerful observability stack that is both flexible and cost-aware.

From Design to Implementation: Crafting Headless APIs in Orchard Core with Apidog

Last month, I had the opportunity to attend an online meetup hosted by the local Microsoft MVP Dileepa Rajapaksa from the Singapore .NET Developers Community, where I was introduced to Apidog.

During the session, Mohammad L. U. Tanjim, the Product Manager of Apidog, gave a detailed walkthrough of the API-First design and how Apidog can be used for this approach.

Apidog helps us to define, test, and document APIs in one place. Instead of manually writing Swagger docs and using a separate API tool, Apidog combines everything. This means frontend developers can get mock APIs instantly, and backend developers as well as QAs can get clear API specs with automatic testing support.

Hence, for the customised headless APIs, we will adopt an API-First design approach. This approach ensures clarity, consistency, and efficient collaboration between backend and frontend teams while reducing future rework.

Session “Build APIs Faster and Together with Apidog, ASP.NET, and Azure” conducted by Mohammad L. U. Tanjim.

API-First Design Approach

By designing APIs upfront, we reduce the likelihood of frequent changes that disrupt development. It also ensures consistent API behaviour and better long-term maintainability.

For our frontend team, with a well-defined API specification, they can begin working with mock APIs, enabling parallel development. This eliminates dependencies where frontend work is blocked by backend completion.

For the QA team, the API spec is important because it serves as a reference for automated testing. The QA engineers can validate API responses before implementation.

API Design Journey

In this article, we will embark on an API Design Journey by transforming a traditional travel agency in Singapore into an API-first system. To achieve this, we will use Apidog for API design and testing, and Orchard Core as a CMS to manage travel package information. Along the way, we will explore different considerations in API design, documentation, and integration to create a system that is both practical and scalable.

Many traditional travel agencies in Singapore still rely on manual processes. They store travel package details in spreadsheets, printed brochures, or even handwritten notes. This makes it challenging to update, search, and distribute information efficiently.

The reliance on physical posters and brochures of a travel agency is interesting in today’s digital age.

By introducing a headless CMS like Orchard Core, we can centralise travel package management while allowing different clients like mobile apps to access the data through APIs. This approach not only modernises the operations in the travel agency but also enables seamless integration with other systems.

API Design Journey 01: The Design Phase

Now that we understand the challenges of managing travel packages manually, we will build the API with Orchard Core to enable seamless access to travel package data.

Instead of jumping straight into coding, we will first focus on the design phase, ensuring that our API meets the business requirements. At this stage, we focus on designing endpoints, such as GET /api/v1/packages, to manage the travel packages. We also plan how we will structure the response.

Given the scope and complexity of a full travel package CMS, this article will focus on designing a subset of API endpoints, as shown in the screenshot below. This allows us to highlight essential design principles and approaches that can be applied across the entire API journey with Apidog.

Let’s start with eight simple endpoints.

For the first endpoint “Get all travel packages”, we design it with the following query parameters to support flexible and efficient result filtering, pagination, sorting, and text search. This approach ensures that users can easily retrieve and navigate through travel packages based on their specific needs and preferences.

GET /api/v1/packages?page=1&pageSize=20&sortBy=price&sortOrder=asc&destinationId=4&priceRange[min]=500&priceRange[max]=2000&rating=4&searchTerm=spa
Pasting the API path with query parameters to the Endpoint field will auto populate the Request Params section in Apidog.

As with the request section, the response can also be generated from a sample JSON that we expect the endpoint to return, as shown in the following screenshot.

As shown in the Preview, the response structure can be derived from a sample JSON.

In the screenshot above, the field “description” is marked as optional because it is the only property that does not appear in every entry in “data”.

Besides the success status, we also need to define a response for HTTP 400, an important status code that tells the client that something is wrong with their request.

By default, for generic error responses like HTTP 400, there are response components that we can directly use in Apidog.

The reason why we need HTTP 400 is that, instead of processing an invalid request and returning incorrect or unexpected results, our API should explicitly reject it, ensuring that the client knows what needs to be fixed. This improves both developer experience and API reliability.
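To show what rejecting an invalid request explicitly could look like, here is a small validation sketch for three of the query parameters of our first endpoint. The PackageQuery record and the PackageQueryValidator class are hypothetical. The upper bound of 100 for pageSize is also an assumption, since our API spec does not define one in this article.

```csharp
using System.Collections.Generic;

// Hypothetical model covering a subset of the query parameters.
public sealed record PackageQuery(int Page, int PageSize, string SortOrder);

public static class PackageQueryValidator
{
    // Returns a list of problems; an empty list means the query is valid.
    public static IReadOnlyList<string> Validate(PackageQuery query)
    {
        var errors = new List<string>();

        if (query.Page < 1)
            errors.Add("page must be 1 or greater.");

        // Assumed limit: at most 100 items per page.
        if (query.PageSize is < 1 or > 100)
            errors.Add("pageSize must be between 1 and 100.");

        if (query.SortOrder is not ("asc" or "desc"))
            errors.Add("sortOrder must be 'asc' or 'desc'.");

        return errors;
    }
}
```

In a controller, a non-empty list would typically be returned with an HTTP 400 response so that the client can see every problem at once, instead of fixing one parameter per round trip.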

After completing the endpoint for getting all travel packages, we also have another POST endpoint to search travel packages.

While GET is the standard method for retrieving data from an API, complex search queries involving multiple parameters, filters, or file uploads might require the use of a POST request. This is particularly true when dealing with advanced search forms or large amounts of data, which cannot be easily represented as URL query parameters. In these cases, POST allows us to send the parameters in the body of the request, ensuring the URL remains manageable and avoiding URL length limits.

For example, let’s assume this POST endpoint allows us to search for travel packages with the following body.

{
  "destination": "Singapore",
  "priceRange": {
    "min": 500,
    "max": 2000
  },
  "rating": 4,
  "amenities": ["pool", "spa"],
  "files": [
    {
      "fileType": "image",
      "file": "base64-encoded-image-content"
    }
  ]
}

We can also easily generate the data schema for the body by pasting this JSON as an example into Apidog, as shown in the screenshot below.

Setting up the data schema for the body of an HTTP POST request.
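On the backend side, the same schema can be mirrored with a few C# records. The following is only a sketch: the type names PackageSearchRequest, PriceRange, SearchFile, and PackageSearchParser are made up for illustration. The numeric types are also our assumption, because the sample JSON alone does not tell us whether prices or ratings may have decimals.

```csharp
using System.Text.Json;

// Hypothetical records mirroring the sample search body.
public sealed record PriceRange(decimal Min, decimal Max);

public sealed record SearchFile(string FileType, string File);

public sealed record PackageSearchRequest(
    string Destination,
    PriceRange PriceRange,
    double Rating,
    string[] Amenities,
    SearchFile[] Files);

public static class PackageSearchParser
{
    // Web defaults map camelCase JSON names to PascalCase properties.
    private static readonly JsonSerializerOptions Options =
        new(JsonSerializerDefaults.Web);

    public static PackageSearchRequest? Parse(string json) =>
        JsonSerializer.Deserialize<PackageSearchRequest>(json, Options);
}
```

In an ASP.NET Core controller we would normally not call the parser ourselves, because model binding of a [FromBody] parameter performs the same deserialisation. The explicit parser here simply makes the mapping between the JSON and the records visible.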

When making an HTTP POST request, the client sends data to the server. While JSON in the request body is common, there is also another format used in APIs, i.e. multipart/form-data (also known as form-data).

The form-data is used when the request body contains files, images, or binary data along with text fields. So, if our endpoint /api/v1/packages/{id}/reviews allows users to submit both text (review content and rating) and an image, using form-data is the best choice, as demonstrated in the following screenshot.

Setting up a request body which is multipart/form-data in Apidog.

API Design Journey 02: Prototyping with Mockups

When designing the API, it is common to debate, for example, whether reviews should be nested inside packages or treated as a separate resource. By using Apidog, we can quickly create mock APIs for both versions and test how they would work in different use cases. This helps us make a data-driven decision instead of having endless discussions.

Once our endpoint is created, Apidog automatically generates a mock API based on our defined API spec, as shown in the following screenshot.

A list of mock API URLs for our “Get all travel packages” endpoint.

Clicking on the “Request” button next to each of the mock API URLs will bring us to the corresponding mock response, as shown in the following screenshot.

Default mock response for HTTP 200 of our first endpoint “Get all travel packages”.

As shown in the screenshot above, some values in the mock response do not make sense, for example a negative id and destinationId, a rating outside the expected range of 1 to 5, “East” as the sorting direction, and so on. How could we fix them?

Firstly, we will set the id (and destinationId) to be any positive integer number starting from 1.

Setting id to be a positive integer number starting from 1.

Secondly, we update both the price and rating to be floats. In the following screenshot, we specify that the rating can be any float from 1.0 to 5.0 with a single fractional digit.

Apidog is able to generate an example based on our condition under “Preview”.

Finally, we will indicate that the sorting direction can only be either ASC or DESC, as shown in the following screenshot.

Configuring the possible value for the direction field.

With all the necessary mock value configurations in place, if we fetch the mock response again, we should get a response with more reasonable values, as demonstrated in the screenshot below.

Now the mock response looks more reasonable.

With the mock APIs, our frontend developers will be able to start building UI components without waiting for the backend to be completed. Also, as shown above, a mock API responds instantly, unlike real APIs that depend on database queries, authentication, or network latency. This makes UI development and unit testing faster.

Speaking of testing, some test cases are difficult to create with a real API. For example, what if an API returns an error (500 Internal Server Error)? What if there are thousands of travel packages? With a mock API, we can control the responses and simulate rare cases easily.

In addition, Apidog supports returning different mock data based on different request parameters. This makes the mock API more realistic and useful for developers. This is because if the mock API returns static data, frontend developers may only test one scenario. A dynamic mock API allows testing of various edge cases.

For example, our travel package API allows admins to see all packages, including unpublished ones, while regular users only see public packages. We can thus set up the mock so that different bearer tokens return different sets of mock data.

We are setting up the endpoint to return drafts when a correct admin token is provided in the request header with Mock Expectation.

With the Mock Expectation feature, Apidog can return custom responses based on request parameters as well. For instance, it can return normal packages when the destinationId is 1 and trigger an error when the destinationId is 2.

API Design Journey 03: Documenting Phase

With the endpoints properly designed in the two earlier phases, we can now proceed to create documentation that offers a detailed explanation of the endpoints in our API. This documentation will include information such as HTTP methods, request parameters, and response formats.

Fortunately, Apidog makes the documentation process smooth by integrating well within the API ecosystem. It also makes sharing easy, letting us export the documentation in formats like OpenAPI, HTML, and Markdown.

Apidog can export API spec in formats like OpenAPI, HTML, and Markdown.

We can also export our documentation on a per-folder basis to OpenAPI Specification in Overview, as shown below.

Custom export configuration for OpenAPI Specification.

We can also export the data as an offline document. Just click on the “Open URL” or “Permalink” button to view the raw JSON/YAML content directly in the Internet browser. We then can place the raw content into the Swagger Editor to view the Swagger UI of our API, as demonstrated in the following screenshot.

The exported content from Apidog can be imported to Swagger Editor directly.

Let’s say now we need to share the documentation with our team, stakeholders, or even the public. Our documentation thus needs to be accessible and easy to navigate. That is where exporting to HTML or Markdown comes in handy.

Documentation in Markdown format, generated by Apidog.

Finally, Apidog also allows us to conveniently publish our API documentation as a webpage. There are two options: Quick Share, for sharing parts of the docs with collaborators, and Publish Docs, for making the full documentation publicly available.

Quick Share is great for API collaborators because we can set a password for access and define an expiration time for the shared documentation. If no expiration is set, the link stays active indefinitely.

API spec presented as a website and accessible by the collaborators. It also enables collaborators to generate client code for different languages.

API Design Journey 04: The Development Phase

With our API fully designed, mocked, and documented, it is time to bring it to life with actual code. Since we have already defined information such as the endpoints, request format, and response formats, implementation becomes much more straightforward. Now, let’s start building the backend to match our API specifications.

Orchard Core generally supports two main approaches for designing APIs, i.e. Headless and Decoupled.

In the headless approach, Orchard Core acts purely as a backend CMS, exposing content via APIs without a frontend. The frontend is built separately.

In the decoupled approach, Orchard Core still provides APIs like in the headless approach, but it also serves some frontend rendering. It is a hybrid approach: some parts of the UI are rendered by Orchard Core using Razor Pages, while others rely on APIs.

So in fact, we can combine the strengths of both approaches to build customised headless APIs on Orchard Core, using services like IOrchardHelper to fetch content dynamically and IContentManager to give us full CRUD operations on content items. This is in fact the approach mentioned in the Orchard Core Basics Companion (OCBC) documentation.

For the endpoint of getting a list of travel packages, i.e. /api/v1/packages, we can define it as follows.

[ApiController]
[Route("api/v1/packages")]
public class PackageController(
    IOrchardHelper orchard,
    ...) : Controller
{
    [HttpGet]
    public async Task<IActionResult> GetTravelPackages()
    {
        var travelPackages = await orchard.QueryContentItemsAsync(q =>
            q.Where(c => c.ContentType == "TravelPackage"));

        ...

        return Ok(travelPackages);
    }

    ...
}

In the code above, we are using Orchard Core Headless CMS API and leveraging IOrchardHelper to query content items of type “TravelPackage”. We are then exposing a REST API (GET /api/v1/packages) that returns all travel packages stored as content items in the Orchard Core CMS.

API Design Journey 05: Testing of Actual Implementation

Let’s assume our Dev Server Base URL is localhost. This URL is set as a variable in the Develop Env, as shown in the screenshot below.

Setting Base URL for Develop Env on Apidog.

With the environment set up, we can now proceed to run our endpoint under that environment. As shown in the following screenshot, we are able to immediately validate the implementation of our endpoint.

Validated the GET endpoint under Develop Env.

The screenshot above shows that through API Validation Testing, the implementation of that endpoint has met all expected requirements.

API validation tests are not just for simple checks. The feature is great for handling complex, multi-step API workflows too. With them, we can chain multiple requests together, simulate real-world scenarios, and even run the same requests with different test data. This makes it easier to catch issues early and keep our API running smoothly.

Populate testing steps based on our API spec in Apidog.

In addition, we can also set up Scheduled Tasks, which is still in Beta now, to automatically run our test scenarios at specific times. This helps us monitor API performance, catch issues early, and ensure everything works as expected automatically. Plus, we can review the execution results to stay on top of any failures.

Result of running one of the endpoints on Develop Env.

Wrap-Up

Throughout this article, we have walked through the process of designing, mocking, documenting, implementing, and testing a headless API in Orchard Core using Apidog. By following an API-first approach, we ensure that our API is well-structured, easy to maintain, and developer-friendly.

With this approach, teams can collaborate more effectively and reduce friction in development. Now that the foundation is set, the next step could be integrating this API into a frontend app, optimising our API performance, or automating even more tests.

Finally, with .NET 9 moving away from built-in Swagger UI, developers now have to find alternatives to set up API documentation. As we can see, Apidog offers a powerful alternative, because it combines API design, testing, and documentation in one tool. It simplifies collaboration while ensuring a smooth API-first design approach.