The Blueprint Fallacy: A Case for Discrete Event Simulation in Modern Systems Architecture

Greetings from Taipei!

I just spent two days at the Hello World Dev Conference 2025 in Taipei, and beneath the hype around cloud and AI, I observed a single, unifying theme: The industry is desperately building tools to cope with a complexity crisis of its own making.

The agenda was a catalog of modern systems engineering challenges. The most valuable sessions were the “踩雷經驗” (landmine-stepping experiences), which offered hard-won lessons from the front lines.

A 2-day technical conference on AI, Kubernetes, and more!

However, these talks raised a more fundamental question for me. We are getting exceptionally good at building tools to detect and recover from failure, but are we getting any better at preventing it?

This post is not a simple translation of the Mandarin-language talks at a Taiwanese conference. It is my analysis of the patterns I observed. I have grouped the key talks I attended into three areas:

  • Cloud Native Infrastructure;
  • Reshaping Product Management and Engineering Productivity with AI;
  • Deep Dives into Advanced AI Engineering.

Feel free to dive into the section that interests you most.

Session: Smart Pizza and Data Observability

This session was led by Shuhsi (林樹熙), a Data Engineering Manager at Micron. Micron needs no introduction: they are a massive player in the semiconductor industry, and their smart manufacturing facilities are a prime example of where data engineering is mission-critical.

Micron in Singapore (Credit: Forbes)

Shuhsi’s talk, “Data Observability by OpenLineage,” started with a simple story he called the “Smart Pizza” anomaly.

He presented a scenario familiar to anyone in a data-intensive environment: A critical dashboard flatlines, and the next three hours are a chaotic hunt to find out why. In his “Smart Pizza” example, the culprit was a silent, upstream schema change.

Smart pizza dashboard anomaly.

His solution, OpenLineage, is a powerful framework for what we would call digital forensics. It is about building a perfect, queryable map of the crime scene after the crime has been committed. By creating a clear data lineage, it reduces the “Mean Time to Discovery” from hours of panic to minutes of analysis.

Let’s be clear: This is critical, valuable work. Like OpenTelemetry for applications, OpenLineage brings desperately needed order to the chaos of modern data pipelines.

However, it is a fundamentally reactive posture. It helps us find the bullet path through the body with incredible speed and precision, but our ultimate goal must be to predict the bullet's trajectory before the trigger is pulled. Data lineage minimises downtime. My work with simulation, which I describe in the next section, aims to prevent downtime entirely by modelling these complex systems to find the breaking points before they break.

Session: Automating a .NET Discrete Event Simulation on Kubernetes

My talk, “Simulation Lab on Kubernetes: Automating .NET Parameter Sweeps,” addressed the wall that every complex systems analysis eventually hits: Combinatorial explosion.

While the industry is focused on understanding past failures, my session was about building the Discrete Event Simulation (DES) engine that can calculate and prevent future ones.

A restaurant simulation game in Honkai Impact 3rd. (Source: 西琳 – YouTube)

To make this concrete, I used the analogy of a restaurant owner asking, “Should I add another table or hire another waiter?” The only way to answer this rigorously is to simulate thousands of possible futures. The math becomes brutal, fast: testing 50 different configurations with 100 statistical runs each requires 5,000 independent simulations. This is not a task for a single machine; it requires a computational army.

My solution is to treat Kubernetes not as a service host, but as a temporary, on-demand supercomputer. The strategy I presented had three core pillars:

  • Declarative Orchestration: The entire 5,000-run DES experiment is defined in a single, clean Argo Workflows manifest, transforming a potential scripting nightmare into a manageable, observable process.
  • Radical Isolation: Each DES run is containerised in its own pod, creating a perfectly clean and reproducible experimental environment.
  • Controlled Randomness: A robust seeding strategy is implemented to ensure that “random” events in our DES are statistically valid and comparable across the entire distributed system (a minimal sketch follows below).

The turnout for my DES session confirmed a growing hunger in our industry for proactive, simulation-driven approaches to engineering.
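
To make the “Controlled Randomness” pillar concrete, here is a minimal C# sketch of one possible seeding scheme. It is not the actual SNA API; the class, method, and base seed are illustrative. The idea is that each of the 5,000 runs derives its seed deterministically from its configuration ID and replication index, so any pod can reproduce any run and no two runs share a random stream.

using System;

public static class SeedPlanner
{
    // Combine the experiment seed, configuration ID, and replication index
    // into a stable 32-bit seed that is unique per run.
    public static int DeriveSeed(int baseSeed, int configurationId, int replicationIndex)
    {
        unchecked
        {
            int seed = baseSeed;
            seed = seed * 31 + configurationId;
            seed = seed * 31 + replicationIndex;
            return seed;
        }
    }

    public static void Main()
    {
        const int baseSeed = 20250913;            // experiment-wide seed (illustrative)
        const int configurations = 50;            // e.g. table/waiter combinations
        const int replicationsPerConfiguration = 100;

        for (int c = 0; c < configurations; c++)
        {
            for (int r = 0; r < replicationsPerConfiguration; r++)
            {
                var rng = new Random(DeriveSeed(baseSeed, c, r));
                // ... run one DES replication with rng inside its own pod ...
            }
        }

        Console.WriteLine($"Planned {configurations * replicationsPerConfiguration} runs.");
    }
}
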

The final takeaway was a strategic re-framing of a tool many of us already use. Kubernetes is more than a platform for web apps. It can also be a general-purpose compute engine capable of solving massive scientific and financial modelling problems. It is time we started using it as such.

Session: AI for BI

Denny’s (監舜儀) session on “AI for BI” illustrated a classic pain point: The bottleneck between business users who need data and the IT teams who provide it. The proposed solution was a natural language interface, FineChatBI, a tool designed to sit on top of existing BI platforms to make querying existing data easier.

Denny is introducing AI for BI.

His core insight was that the tool is the easy part. The real work is in building the “underground root system” which includes the immense challenge of defining metrics, managing permissions, and untangling data semantics. Without this foundation, any AI is doomed to fail.

Getting the underground root system right is important for building AI projects.

This is a crucial step forward in making our organisations more data-driven. However, we must also be clear about what problem is being solved.

This is a system designed to provide perfect, instantaneous answers to the question, “What happened?”

My work, and the next category of even more complex AI, begins where this leaves off. It seeks to answer the far harder question: “What will happen if…?” Sharpening our view of the past is essential, but the ultimate strategic advantage lies in the ability to accurately simulate the future.

Session: The Impossibility of Modeling Human Productivity

The presenter, Jugg (劉兆恭), is a well-known agile coach and the organiser of Agile Tour Taiwan 2020. His talk, “An AI-Driven Journey of Agile Product Development – From Inspiration to Delivery,” was a masterclass in moving beyond vanity metrics to understand and truly improve engineering performance.

Jugg started with a graph that every engineering lead knows in their gut. As a company grows over time:

  • The business grows (purple line, up);
  • Software architecture and complexity grow (first blue line, up);
  • The number of developers increases (second blue line, up);
  • Expected R&D productivity should grow (green line, up);
  • But paradoxically, the actual R&D productivity often stagnates or even declines (red line, down).

Jugg provided a perfect analogue for the work I do. He tackled the classic productivity paradox: Why does output stagnate even as teams grow? He correctly diagnosed the problem as a failure of measurement and proposed the SPACE framework as a more holistic model for this incredibly complex human system.

He was, in essence, trying to answer the same class of question I do: “If we change an input variable (team process), how can we predict the output (productivity)?”

This is where the analogy becomes a powerful contrast. Jugg’s world of human systems is filled with messy, unpredictable variables. His solutions are frameworks and dashboards. They are the best tools we have for a system that resists precise calculation.

This session reinforced my conviction that simulation is the most powerful tool we have for predicting performance in the systems we can actually control: Our code and our infrastructure. We do not have to settle for dashboards that show us the past because we can build models that calculate the future.

Session: Building a Map of “What Is” with GraphRAG

The most technically demanding session came from Nils (劉岦崱), a Senior Data Scientist at Cathay Financial Holdings. He presented GraphRAG, a significant evolution beyond the “Naive RAG” most of us use today.

Nils is explaining what a Naive RAG is.

He argued compellingly that simple vector search fails because it ignores relationships. By chunking documents, we destroy the contextual links between concepts. GraphRAG solves this by transforming unstructured data into a structured knowledge graph: a web of nodes (entities) and edges (their relationships).

Enhancing RAG-based application accuracy by constructing and leveraging knowledge graphs (Image Credit: LangChain)

In essence, GraphRAG is a sophisticated tool for building a static map of a known world. It answers the question, “How are all the pieces in our universe connected right now?” For AI customer service, this is a game-changer, as it provides a rich, interconnected context for every query.

This means our data now has an explicit, queryable structure. So, the LLM gets a much richer, more coherent picture of the situation, allowing it to maintain context over long conversations and answer complex, multi-faceted questions.

This session was a brilliant reminder that all advanced AI is built on a foundation of rigorous data modelling.

However, a map, no matter how detailed, is still just a snapshot. It shows us the layout of the city, but it cannot tell us how the traffic will flow at 5 PM.

This is the critical distinction. GraphRAG creates a model of a system at rest, and DES creates a model of a system in motion. One shows us the relationships, while the other lets us press play and watch how those relationships evolve and interact over time under stress. GraphRAG is the anatomy chart; simulation is the stress test.

Session: Securing the AI Magic Pocket with LLM Guardrails

Nils from Cathay Financial Holdings returned to the stage for Day 2, and this time he tackled one of the most pressing issues in enterprise AI: Security. His talk “Enterprise-Grade LLM Guardrails and Prompt Hardening” was a masterclass in defensive design for AI systems.

What made the session truly brilliant was his central analogy. As he put it, an LLM is a lot like Doraemon: a super-intelligent, incredibly powerful assistant with a “magic pocket” of capabilities. It can solve almost any problem you give it. But, just like in the cartoon, if you give it vague, malicious, or poorly thought-out instructions, it can cause absolute chaos. For a bank, preventing that chaos is non-negotiable.

Nils grounded the problem in the official OWASP Top 10 for LLM Applications.

There are two lines of defence: Guardrails and Prompt Hardening. The core of the strategy lies in understanding two distinct but complementary approaches:

  • Guardrails (The Fortress): An external firewall of input filters and output validators;
  • Prompt Hardening (The Armour): Internal defences built into the prompt to resist manipulation.

This is an essential framework for any enterprise deploying LLMs. It represents the state-of-the-art in building static defences.

While necessary, this defensive posture raises another important question for developers: How does the fortress behave under a full-scale siege?

A static set of rules can defend against known attack patterns. But what about the unknown unknowns? What about the second-order effects? Specifically:

  • Performance Under Attack: What is the latency cost of these five layers of validation when we are hit with 10,000 malicious requests per second? At what point does the defence itself become a denial-of-service vector?
  • Emergent Failures: When the system is under load and memory is constrained, does one of these guardrails fail in an unexpected way that creates a new vulnerability?

These are not questions a security checklist can answer. They can only be answered by a dynamic stress test. The X-Teaming Nils mentioned is a step in this direction, but a full-scale DES is the ultimate laboratory.

Nils’s techniques are a static set of rules designed to prevent failure. Simulation is a dynamic engine designed to induce failure in a controlled environment to understand a system’s true breaking points. He is building the armour, while my work with DES is about building the testing ground to see where that armour will break.

Session: Driving Multi-Task AI with a Flowchart in a Single Prompt

The final and most thought-provoking session was delivered by 尹相志, who presented a brilliant hack: Embedding a Mermaid flowchart directly into a prompt to force an LLM to execute a deterministic, multi-step process.

尹相志, Chief Technology Officer of 數據決策股份有限公司.

He offered a way past both the chaos of autonomous agents and the rigidity of external orchestrators like LangGraph. By teaching the LLM to read a flowchart, he effectively turns it into a reliable state machine executor. It is a masterful piece of engineering that imposes order on a probabilistic system.

Action Grounding Principles proposed by 相志.

What he has created is the perfect blueprint. It is a model of a process as it should run in a world with no friction, no delays, and no resource contention.

And in that, he revealed the final, critical gap in our industry's thinking.

A blueprint is not a stress test. A flowchart cannot answer the questions that actually determine the success or failure of a system at scale:

  • What happens when 10,000 users try to execute this flowchart at once and they all hit the same database lock?
  • What is the cascading delay if one step in the flowchart has a 5% chance of timing out?
  • Where are the hidden queues and bottlenecks in this process?

His flowchart is the architect’s beautiful drawing of an airplane. A DES is the wind tunnel. It is the necessary, brutal encounter with reality that shows us where the blueprint will fail under stress.

The ability to define a process is the beginning. The ability to simulate that process under the chaotic conditions of the real world is the final, necessary step to building systems that don’t just look good on paper, but actually work.

Final Thoughts and Key Takeaways from Taipei

My two days at the Hello World Dev Conference were not a tour of technologies. In fact, they were a confirmation of a dangerous blind spot in our industry.

From what I observed, the industry builds tools for digital forensics to map past failures. It sharpens those tools with AI to perfectly understand what just happened. It creates knowledge graphs to model systems at rest. It designs perfect, deterministic blueprints for how AI processes should work.

These are all necessary and brilliant advancements in the art of mapmaking.

However, the critical, missing discipline is the one that asks not “What is the map?”, but “What will happen to the city during the hurricane?” The hard questions of latency under load, failures, and bottlenecks are not found on any of these maps.

Our industry is full of brilliant mapmakers. The next frontier belongs to people who can model, simulate, and predict the behaviour of complex systems under stress, before the hurricane arrives.

That is why I am building SNA, my .NET-based Discrete Event Simulation engine.

Hello, Taipei. Taken from the window of the conference venue.

I am leaving Taipei with a notebook full of ideas, a deeper understanding of the challenges and solutions being pioneered by my peers in the Mandarin-speaking tech community, and a renewed sense of excitement for the future we are all building.

Building a Gacha Bot in Power Automate and MS Teams

Every agile team knows the “Support Hero” role, that one person designated to handle the interruptions of the day, bug reports, and urgent requests. In our team, we used a messy spreadsheet to track the rotation. People forgot whose turn it was, someone would be on leave, and the whole thing was a low-grade, daily friction point.

One day, a teammate had a brilliant idea: “What if we made it fun? What if we gamified it?”

He quickly prototyped a gacha bot using Power Automate that would randomly select the hero of the day. It was a huge hit. It turned a daily chore into a fun moment of team engagement. It was a perfect example of a small automation making a big impact on our culture.

Over time, as team members changed and responsibilities shifted, that original gacha bot was lost. The fun morning ritual disappeared, and we went back to the old, boring way. We all felt the difference.

Recently, I decided it was time to bring that spark back. I took the original, brilliant concept and decided to re-build it from the ground up as a robust, reusable, and shareable solution.

This post is a tribute to that original idea, and a detailed, step-by-step guide on how you can build a similar gacha bot for your own team. Let’s make our daily routines fun again.

How it Works: The Daily Gacha Ritual

Before we open the hood and look at the Power Automate engine, let me walk you through what my team actually experiences every morning at 10:00 AM.

It all starts with a message from the bot to the Microsoft Teams group of our team. The message says the following.

Hi, Louisa. You are the lucky Support Hero today. 

This is the moment of suspense. Everyone sees the ping. Louisa, one of our teammates, is now in the spotlight.

However, what if Louisa is on vacation, sipping a drink on a beach in Bali? The bot is prepared. Immediately following the announcement, it posts a second message which is an interactive Adaptive Card:

Is our teammate mentioned above working today?
[] Yes.
[] No.
[] I volunteer!
[Submit Status]

This is where the team interaction happens.

  • If Louisa is around, she proudly clicks ‘Yes.’ The card updates to say ‘Louisa has accepted the quest!’ and the ritual is over.
  • If Louisa is on leave, anyone on the team can click ‘No.’ This immediately triggers the bot to run the gacha again, announcing a new hero.
  • And my favourite part is that if someone else, for example Austin, is feeling particularly heroic that day, he can click ‘I volunteer!’ This lets him steal the spotlight and take on the role, giving Louisa a day off. The card updates to say ‘A new hero has emerged! Austin has volunteered for the quest!’

Within a minute, the daily chore is assigned, not through a boring spreadsheet, but through a fun, interactive, and slightly dramatic team ritual. It is a small thing, but it starts our day with a smile and a sense of shared fun.

Now that you have seen what it does, let’s build it.

Step 1: Define The Trigger

First, I set up a “Scheduled cloud flow” so that every morning at 10am, a message is sent to the team’s Microsoft Teams chat announcing who the lucky one is.

Second, I name the flow and define its starting date and time. As shown in the following screenshot, we set the recurrence to every day, starting from 1st Oct 2025, 00:00.

Please take note that in the step above, the “12am” is the starting time, not the time when this job will be executed daily. So in the first node of the flow itself, I have to define at what time, and in which time zone, the gacha bot will run. Since our daily support needs to be assigned in the morning, we will make it run at 10am every day, as shown in the screenshot below.

Step 2: Define Variables and Controls

After that, we add a new “Initialize Variable” node where we define the names of all the teammates.
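
For illustration, the teamMembers variable can be initialised as an Array with a value along the following lines. The names are placeholders; the name property is what the later Compose steps read.

[
  { "name": "Louisa" },
  { "name": "Austin" }
]
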

We also need another variable to later store the response of the user on the adaptive card, as shown in the screenshot below.

Since this gacha only makes sense on weekdays, I need a “Condition” block to check whether the day is a weekday or not. If it is a weekend, the bot will not send any message.

As shown in the screenshot above, what I do is check the value of dayOfWeek(convertFromUtc(utcNow(), 'Singapore Standard Time')), where dayOfWeek returns 0 for Sunday through 6 for Saturday.

Since there is nothing to be done when it is a weekend, we leave the “False” branch empty. For the “True” branch, we have a “Do Until” block because the gacha bot needs to keep selecting a name until someone clicks “Yes” or “Volunteer”. Hence, as shown in the screenshot below, the loop will keep running until responseChoice is not “No”.

Step 3: Inside the Loop

There are three important “Compose” data operations.

  • Generate Random Index: To generate a random integer from 0 up to, but not including, the number of team members (the upper bound of rand is exclusive).
    rand(0, length(variables('teamMembers')))
  • Select Random Teammate Object: The random number is used to pick the hero from the array.
    variables('teamMembers')[int(outputs('Compose:_Generate_Random_Index'))]
  • Get Name of Hero: Get the name of the person from the array.
    outputs('Compose:_Select_Random_Teammate_Object')['name']

After the three data operations are added, the flow now looks as shown below.

According to our designed workflow, after a hero is selected, we can send a message with the “Post message in a chat or channel” action to inform the team who has been selected by the gacha bot.

Next, we need to post an adaptive card to Microsoft Teams and wait for a response. In our case, since the adaptive card is posted to a group chat, we need to put the entire JSON below into the Message field.

{
  "type": "AdaptiveCard",
  "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
  "version": "1.4",
  "body": [
    {
      "type": "TextBlock",
      "text": "Daily Check-In",
      "wrap": true,
      "size": "Large",
      "weight": "Bolder"
    },
    {
      "type": "TextBlock",
      "text": "Please pick an option accordingly.",
      "wrap": true
    },
    {
      "type": "Input.ChoiceSet",
      "id": "userChoice",
      "style": "expanded",
      "isMultiSelect": false,
      "label": "Is our teammate mentioned above working today?",
      "choices": [
        {
          "title": "Yes.",
          "value": "Yes"
        },
        {
          "title": "No.",
          "value": "No"
        },
        {
          "title": "I volunteer!",
          "value": "Volunteer"
        }
      ]
    }
  ],
  "actions": [
    {
      "type": "Action.Submit",
      "title": "Submit Status"
    }
  ]
}

In short, the “Post adaptive card and wait for a response” action will be set up as shown in the following screenshot.

Step 4: Handle the User’s Response

Right after the adaptive card, I set up a “Switch” control to handle the user’s response.

If the response is “Yes”, a confirmation is sent to the Microsoft Teams group chat. If the response is “Volunteer”, before a confirmation message is sent, the bot needs to know who responded so that it can include the volunteer’s name. To do so, I use a “Get user profile (V2)” action with body/responder/userPrincipalName as the UPN, as shown in the screenshot below.

The Office 365 Users node will give us the friendly display name of the person who volunteers, as shown in the screenshot below.

Your Turn

So, what have we really built here? On the surface, it is just a simple Power Automate flow. However, the real product is not the bot; it is the daily moment of shared fun. We did not just automate a chore; we engineered a small spark of joy and human connection into our daily routine. We used technology to solve a human problem, not just a technical one.

Now, it is your turn.

Your mission, should you choose to accept it, is to find the single most boring, repetitive chore that your own team has to deal with. Find that small, grey corner of the life of your team, and ask yourself: “How can I make this fun?”

Together, we learn better.

Securing APIs with OAuth2 Introspection

In today’s interconnected world, APIs are the backbone of modern apps. Protecting these APIs and ensuring only authorised users access sensitive data is now more crucial than ever. While many authentication and authorisation methods exist, OAuth2 Introspection stands out as a robust and flexible approach. In this post, we will explore what OAuth2 Introspection is, why we should use it, and how to implement it in our .NET apps.

Before we dive into the technical details, let’s remind ourselves why API security is so important. Think about it: APIs often handle the most sensitive stuff. If those APIs are not well protected, we are basically opening the door to some nasty consequences. Data breaches? Yep. Regulatory fines (GDPR, HIPAA, you name it)? Potentially. Not to mention, losing the trust of our users. A secure API shows that we value their data and are committed to keeping it safe. And, of course, it helps prevent the bad guys from exploiting vulnerabilities to steal data or cause all sorts of trouble.

The most common method of securing APIs is using access tokens as proof of authorisation. These tokens, typically in the form of JWTs (JSON Web Tokens), are passed by the client to the API with each request. The API then needs a way to validate these tokens to verify that they are legitimate and have not been tampered with. This is where OAuth2 Introspection comes in.

OAuth2 Introspection

OAuth2 Introspection is a mechanism for validating bearer tokens in an OAuth2 environment. We can think of it as a secure lookup service for our access tokens. It allows an API to query an auth server, which is also the “issuer” of the token, to determine the validity and attributes of a given token.

The workflow of an OAuth2 Introspection request.

To illustrate the process, the diagram above visualises the flow of an OAuth2 Introspection request. The Client sends the bearer token to the Web API, which then forwards it to the auth server via the introspection endpoint. The auth server validates the token and returns a JSON response, which is then processed by the Web API. Finally, the Web API grants (or denies) access to the requested resource based on the token's validity.
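
Under the hood, this exchange is simply an authenticated POST of the token to the introspection endpoint, as defined in RFC 7662. The authentication handler we configure later does this for us, but a hand-rolled C# sketch (the endpoint and credentials are placeholders) makes the flow explicit:

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class IntrospectionDemo
{
    public static async Task<string> IntrospectAsync(string accessToken)
    {
        using var http = new HttpClient();

        // The Web API authenticates itself to the auth server with its own client credentials.
        var basic = Convert.ToBase64String(Encoding.UTF8.GetBytes("<Client ID>:<Client Secret>"));
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", basic);

        // RFC 7662: the token is sent as a form field to the introspection endpoint.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["token"] = accessToken
        });

        var response = await http.PostAsync("<Auth server base URL>/connect/introspect", form);
        response.EnsureSuccessStatusCode();

        // The JSON response contains "active": true/false plus optional claims such as sub and exp.
        return await response.Content.ReadAsStringAsync();
    }
}
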

Introspection vs. Direct JWT Validation

You might be thinking, “Isn’t this just how we normally validate a JWT token?” Well, yes… and no. What is the difference, and why is there a special term “Introspection” for this?

With direct JWT validation, we essentially check the token ourselves, verifying its signature, expiry, and sometimes audience. Introspection takes a different approach because it involves asking the auth server about the token status. This leads to differences in the pros and cons, which we will explore next.

With OAuth2 Introspection, we gain several key advantages. First, it works with various token formats (JWTs, opaque tokens, etc.) and auth server implementations. Furthermore, because the validation logic resides on the auth server, we get consistency and easier management of token revocation and other security policies. Most importantly, OAuth2 Introspection makes token revocation straightforward (e.g., if a user changes their password or a client is compromised). In contrast, revoking a JWT after it has been issued is significantly more complex.
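
For comparison, here is a minimal sketch of the direct JWT validation approach in ASP.NET Core, using the standard JwtBearer handler from the Microsoft.AspNetCore.Authentication.JwtBearer package (the authority and audience values are placeholders):

builder.Services.AddAuthentication("Bearer")
    .AddJwtBearer("Bearer", options =>
    {
        // The handler fetches the signing keys from the authority and validates the
        // token locally: signature, expiry, issuer, and audience.
        options.Authority = "<Auth server base URL>";
        options.Audience = "<API resource name>";
    });
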

.NET Implementation

Now, let’s see how to implement OAuth2 Introspection in a .NET Web API using the AddOAuth2Introspection authentication scheme.

The core configuration lives in our Program.cs file, where we set up the authentication and authorisation services.

// ... (previous code for building the app)

builder.Services.AddAuthentication("Bearer")
    .AddOAuth2Introspection("Bearer", options =>
    {
        options.IntrospectionEndpoint = "<Auth server base URL>/connect/introspect";
        options.ClientId = "<Client ID>";
        options.ClientSecret = "<Client Secret>";

        options.DiscoveryPolicy = new IdentityModel.Client.DiscoveryPolicy
        {
            RequireHttps = false,
        };
    });

builder.Services.AddAuthorization();

// ... (rest of the Program.cs)

The code above configures the authentication service to use the “Bearer” scheme, which is the standard for bearer tokens. AddOAuth2Introspection(…) is where the magic happens because it adds the OAuth2 Introspection authentication handler and points it at IntrospectionEndpoint, the URL our API will use to send the token for validation.

Usually, RequireHttps needs to be true in production. However, in situations where the API and the auth server are both deployed to the same Elastic Container Service (ECS) cluster and communicate internally within the AWS network, we can set it to false. Because the Application Load Balancer (ALB) handles the TLS/SSL termination and the internal communication between services happens over HTTP, we can safely disable RequireHttps in the DiscoveryPolicy for the introspection endpoint within the ECS cluster. This simplifies the setup without compromising security, as the communication from the outside world to our ALB is already secured by HTTPS.

Finally, to secure our API endpoints and require authentication, we can simply use the [Authorize] attribute, as demonstrated below.

[ApiController]
[Route("[controller]")]
[Authorize]
public class MyController : ControllerBase
{
    [HttpGet("GetData")]
    public IActionResult GetData()
    {
        ...
    }
}

Wrap-Up

OAuth2 Introspection is a powerful and flexible approach for securing our APIs, providing a centralised way to validate bearer tokens and manage access. By understanding the process, implementing it correctly, and following best practices, we can significantly improve the security posture of our apps and protect our valuable data.

References

Observing Orchard Core: Traces with Grafana Tempo and ADOT

In the previous article, we discussed how we can build a custom monitoring pipeline with Grafana running on Amazon ECS to receive metrics and logs, two of the observability pillars, sent from Orchard Core on Amazon ECS. Today, we will proceed to talk about the third pillar of observability: traces.

Source Code

The CloudFormation templates and relevant C# source code discussed in this article are available on GitHub as part of the Orchard Core Basics Companion (OCBC) Project: https://github.com/gcl-team/Experiment.OrchardCore.Main.

Lisa Jung, senior developer advocate at Grafana, talks about the three pillars in observability (Image Credit: Grafana Labs)

About Grafana Tempo

To capture and visualise traces, we will use Grafana Tempo, an open-source, scalable, and cost-effective tracing backend developed by Grafana Labs. Unlike other tracing tools, Tempo does not require an index, making it easy to operate and scale.

We choose Tempo because it is fully compatible with OpenTelemetry, the open standard for collecting distributed traces, which ensures flexibility and vendor neutrality. In addition, Tempo seamlessly integrates with Grafana, allowing us to visualise traces alongside metrics and logs in a single dashboard.

Finally, being a Grafana Labs project means Tempo has strong community backing and continuous development.

About OpenTelemetry

With a solid understanding of why Tempo is our tracing backend of choice, let’s now dive deeper into OpenTelemetry, the open-source framework we use to instrument our Orchard Core app and generate the trace data Tempo collects.

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) project and a vendor-neutral, open standard for collecting traces, metrics, and logs from our apps. This makes it an ideal choice for building a flexible observability pipeline.

OpenTelemetry provides SDKs for instrumenting apps across many programming languages, including C# via the .NET SDK, which we use for Orchard Core.

OpenTelemetry uses the standard OTLP (OpenTelemetry Protocol) to send telemetry data to any compatible backend, such as Tempo, allowing seamless integration and interoperability.

Both Grafana Tempo and OpenTelemetry are projects under the CNCF umbrella. (Image Source: CNCF Cloud Native Interactive Landscape)

Setup Tempo on EC2 With CloudFormation

It is straightforward to deploy Tempo on EC2.

Let’s walk through the EC2 UserData script that installs and configures Tempo on the instance.

First, we download the Tempo release binary, extract it, move it to a proper system path, and ensure it is executable.

wget https://github.com/grafana/tempo/releases/download/v2.7.2/tempo_2.7.2_linux_amd64.tar.gz
tar -xzvf tempo_2.7.2_linux_amd64.tar.gz
mv tempo /usr/local/bin/tempo
chmod +x /usr/local/bin/tempo

Next, we create a basic Tempo configuration file at /etc/tempo.yaml to define how Tempo listens for traces and where it stores trace data.

echo "
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
storage:
trace:
backend: local
local:
path: /tmp/tempo/traces
" > /etc/tempo.yaml

Let’s break down the configuration file above.

The http_listen_port sets the HTTP port (3200) for Tempo's internal web server. This port is used for health checks and Prometheus metrics.

After that, we configure where Tempo listens for incoming trace data. In the configuration above, we enabled OTLP receivers via both gRPC and HTTP, the two protocols that OpenTelemetry SDKs and agents use to send data to Tempo. Here, the ports 4317 (gRPC) and 4318 (HTTP) are standard for OTLP.

Last but not least, for demonstration purposes, we use the simplest storage option, local storage, to write trace data to the EC2 instance's disk under /tmp/tempo/traces. This is fine for testing or small setups, but for production we will likely want to use services like Amazon S3.

In addition, since we are using local storage on EC2, we can easily SSH into the EC2 instance and directly inspect whether traces are being written. This is incredibly helpful during debugging. What we need to do is to run the following command to see whether files are being generated when our Orchard Core app emits traces.

ls -R /tmp/tempo/traces

The configuration above is intentionally minimal. As our setup grows, we can explore advanced options like remote storage, multi-tenancy, or even scaling with Tempo components.

Each flushed trace block (folder with UUID) contains a data.parquet file, which holds the actual trace data.

Finally, we create a systemd unit file so that Tempo starts on boot and automatically restarts if it crashes.

cat <<EOF > /etc/systemd/system/tempo.service
[Unit]
Description=Grafana Tempo service
After=network.target

[Service]
ExecStart=/usr/local/bin/tempo -config.file=/etc/tempo.yaml
Restart=always
RestartSec=5
User=root
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reexec
systemctl daemon-reload
systemctl enable --now tempo

This systemd service ensures that Tempo runs in the background and automatically starts up after a reboot or a crash. This setup is crucial for a resilient observability pipeline.

Did You Know: When we SSH into an EC2 instance running Amazon Linux 2023, we will be greeted by a cockatiel in ASCII art! (Image Credit: OMG! Linux)

Understanding OTLP Transport Protocols

In the previous section, we configured Tempo to receive OTLP data over both gRPC and HTTP. These two transport protocols are supported by OTLP, and each comes with its own strengths and trade-offs. Let’s break them down.

Ivy Zhuang from Google gave a presentation on gRPC and Protobuf at gRPConf 2024. (Image Credit: gRPC YouTube)

Tempo has native support for gRPC, and many OpenTelemetry SDKs default to using it. gRPC is a modern, high-performance transport protocol built on top of HTTP/2. It is the preferred option when performance is critical. gRPC also supports streaming, which makes it ideal for high-throughput scenarios where telemetry data is sent continuously.

However, gRPC is not natively supported in browsers, so it is not ideal for frontend or web-based telemetry collection unless a proxy or gateway is used. In such scenarios, we will normally choose HTTP which is browser-friendly. HTTP is a more traditional request/response protocol that works well in restricted environments.

Since we are collecting telemetry from server-side apps like Orchard Core running on ECS, gRPC is typically the better choice due to its performance benefits and native support in Tempo.

Please take note that gRPC requires HTTP/2, and some environments, for example IoT devices and embedded systems, might not have mature gRPC client support. Hence, OTLP over HTTP is often preferred in simpler or constrained systems.

Daniel Stenberg, Senior Network Engineer at Mozilla, sharing about HTTP/2 at GOTO Copenhagen 2015. (Image Credit: GOTO Conferences YouTube)

gRPC allows multiplexing over a single connection using HTTP/2. Hence, in gRPC, all telemetry signals, i.e. logs, metrics, and traces, can be sent concurrently over one connection. However, with HTTP, each telemetry signal needs a separate POST request to its own endpoint as listed below to enforce clean schema boundaries, simplify implementation, and stay aligned with HTTP semantics.

  • Logs: /v1/logs;
  • Metrics: /v1/metrics;
  • Traces: /v1/traces.

In HTTP, since each signal has its own POST endpoint with its own protobuf schema in the body, there is no need for the receiver to guess what is in the body.
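
For reference, when we wire up the .NET exporter in the next sections, choosing OTLP over HTTP instead of gRPC would look roughly like this (the host is a placeholder; with the HttpProtobuf protocol, the .NET exporter expects the per-signal path in the endpoint):

.AddOtlpExporter(options =>
{
    // OTLP over HTTP uses port 4318 and the /v1/traces path for traces.
    options.Endpoint = new Uri("http://<tempo-ec2-host>:4318/v1/traces");
    options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.HttpProtobuf;
})
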

AWS Distro for OpenTelemetry (ADOT)

Now that we have Tempo running on EC2 and understand the OTLP protocols it supports, the next step is to instrument our Orchard Core app to generate and send trace data.

The following code snippet shows what a typical direct integration with Tempo might look like in an Orchard Core app.

builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService(serviceName: "cld-orchard-core"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri("http://<tempo-ec2-host>:4317");
            options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
        })
        .AddConsoleExporter());

This approach works well for simple use cases during the development stage, but it comes with trade-offs that are worth considering. Firstly, we couple our app directly to the observability backend, reducing flexibility. Secondly, central management becomes harder when we scale to many services or environments.

This is where AWS Distro for OpenTelemetry (ADOT) comes into play.

The ADOT collector. (Image credit: ADOT technical docs)

ADOT is a secure, AWS-supported distribution of the OpenTelemetry project that simplifies collecting and exporting telemetry data from apps running on AWS services, for example our Orchard Core on ECS now. ADOT decouples our apps from the observability backend, provides centralised configuration, and handles telemetry collection more efficiently.

Sidecar Pattern

We can deploy the ADOT collector in several ways, such as running it on a dedicated node or ECS service to receive telemetry from multiple apps. We can also take the sidecar approach, which cleanly separates concerns. Our Orchard Core app focuses on business logic, while a nearby ADOT sidecar handles telemetry collection and forwarding. This mirrors modern cloud-native patterns and gives us more flexibility down the road.

The sidecar pattern running in Amazon ECS. (Image Credit: AWS Open Source Blog)

The following CloudFormation template shows how we deploy ADOT as a sidecar in ECS using CloudFormation. The collector config is stored in AWS Systems Manager Parameter Store under /myapp/otel-collector-config, and injected via the AOT_CONFIG_CONTENT environment variable. This keeps our infrastructure clean, decoupled, and secure.

ecsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Ref ServiceName
    NetworkMode: awsvpc
    ExecutionRoleArn: !GetAtt ecsTaskExecutionRole.Arn
    TaskRoleArn: !GetAtt iamRole.Arn
    ContainerDefinitions:
      - Name: !Ref ServiceName
        Image: !Ref OrchardCoreImage
        ...

      - Name: adot-collector
        Image: public.ecr.aws/aws-observability/aws-otel-collector:latest
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Sub "/ecs/${ServiceName}-log-group"
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: adot
        Essential: false
        Cpu: 128
        Memory: 512
        HealthCheck:
          Command: ["/healthcheck"]
          Interval: 30
          Timeout: 5
          Retries: 3
          StartPeriod: 60
        Secrets:
          - Name: AOT_CONFIG_CONTENT
            ValueFrom: !Sub "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:parameter/otel-collector-config"

Deploy an ADOT sidecar on ECS to collect observability data from Orchard Core.

There are several interesting and important details in the CloudFormation snippet above that are worth calling out. Let’s break them down one by one.

Firstly, we choose awsvpc as the NetworkMode of the ECS task. In awsvpc mode, each ECS task receives its own ENI (Elastic Network Interface), which all containers in the task, i.e. our Orchard Core container and the ADOT sidecar, share. This gives us network-level isolation at the task level, and because the containers share the same network namespace, our Orchard Core app can reach the sidecar over the loopback interface, i.e. http://localhost:4317.
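
With that in mind, the only change to the Orchard Core instrumentation shown earlier is the exporter endpoint; instead of pointing at Tempo directly, we point at the local ADOT collector, for example:

.AddOtlpExporter(options =>
{
    // The ADOT sidecar runs in the same ECS task, so the app reaches it locally.
    options.Endpoint = new Uri("http://localhost:4317");
    options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
})
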

Secondly, we include a health check for the ADOT container. ECS will use this health check to restart the container if it becomes unhealthy, improving reliability without manual intervention. In November 2022, Paurush Garg from AWS added the healthcheck component with the new ADOT collector release, so we can simply specify that we will be using this healthcheck component in the configuration that we will discuss next.

Yes, the configuration! Instead of hardcoding the ADOT configuration into the task definition, we inject it securely at runtime using the AOT_CONFIG_CONTENT secret. This environment variable AOT_CONFIG_CONTENT is designed to enable us to configure the ADOT collector. It will override the config file used in the ADOT collector entrypoint command.

The SSM Parameter for the environment variable AOT_CONFIG_CONTENT.
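
For completeness, the value of that parameter is a standard OpenTelemetry Collector configuration. A minimal sketch of what it might contain for our traces pipeline (the Tempo host is a placeholder, and the health_check extension matches the container health check discussed above):

extensions:
  health_check:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlp:
    endpoint: <tempo-ec2-host>:4317
    tls:
      insecure: true

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
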

Wrap-Up

By now, we have completed the journey of setting up Grafana Tempo on EC2, exploring how traces flow through OTLP protocols like gRPC and HTTP, and understanding why ADOT is often the better choice in production-grade observability pipelines.

With everything connected, our Orchard Core app is now able to send traces into Tempo reliably. This will give us end-to-end visibility with OpenTelemetry and AWS-native tooling.

References