Modern Data Warehouse with Azure

This month marks my third year in the port and logistics industry.

In April, I attended a talk organised by NUS Business School on the future-ready supply chain. The talk was delivered by Dr Robert Yap, the Executive Chairman of YCH Group. During the talk, Dr Yap mentioned that they innovated to survive because innovation has always been at the heart of their development and growth. To him and his team, technology is not only an enabler for the growth of their business, but also a competitive advantage of YCH Group.

YCH Group has a vision of integrating the data flows in the supply chain with their unique analytics capabilities so that they can provide total end-to-end supply chain enablement and transformation. Hence, today I’d like to share how, with Microsoft Azure, we can build a data pipeline and a modern data warehouse that help logistics companies gear towards a future-ready supply chain.

Dr Yap shared about the 7PL™ Strategy in YCH Group.

Two months ago, I also had the opportunity to join an online workshop to learn from Michelle Xie, Microsoft Azure Technical Trainer, about Azure Data Fundamentals. The workshop consisted of four modules covering core data concepts, relational and non-relational data offerings in Azure, modern data warehouses, and Power BI. Hence, I will also share in this article what I learned in the workshop.

About Data

Data is a collection of facts, figures, descriptions, and objects. Data can thus be text written on paper, it can be in digital form stored inside electronic devices, or it can be facts in our minds. Data can be classified as follows.

Unstructured data, such as images, is frequently used in combination with Machine Learning or Azure Cognitive Services capabilities to extract data.

ETL Data Pipeline

To build a data analytics system, we normally have the following steps in a data pipeline to perform the ETL procedure. ETL stands for Extract, Transform, and Load. ETL loads data first into a staging storage server and then into the target storage system, as shown below.

ETL procedure in a data processing pipeline.
  • Data Ingestion: Data is moved from one or many data sources to a destination where it can be stored and further analysed;
  • Data Processing: Sometimes the raw data may not be in a format suitable for querying. Hence, we need to transform and clean up the data;
  • Data Storage: Once the raw data has been processed, all the cleaned and transformed data will be stored in different storage systems which serve different purposes;
  • Data Exploration: A way of analysing performance through graphs and charts with business intelligence tools. This is helpful in making informed business decisions.
A map in the Power BI report showing the location of a prime mover within a time period.

In the world of big data, raw data often comes from different endpoints and is stored in different storage systems. Hence, there is a need for a service which can orchestrate the processes to refine these enormous stores of raw data into actionable business insights. This is where Azure Data Factory, a cloud ETL service for scale-out serverless data integration and data transformation, comes into the picture.

There are two ways of capturing the data in the Data Ingestion stage.

The first method is called Batch Processing, where a set of data is first collected over time and then fed into an analytics system to be processed as a group. For example, the daily sales data collected is scheduled to be processed every midnight. This is not just because midnight is the end of the day, but also because business normally ends at night, so midnight is also the time when the servers are most likely to have spare computing capacity.

Another method is the Streaming model, where data is fed into analytics tools as it arrives and is processed in real time. This is suitable for use cases like collecting GPS data sent from trucks, because every piece of new data is generated in a continuous manner and needs to be sent in real time.

Modern Data Warehouse

A modern data warehouse allows us to gather all our data at any scale easily, and to get insights through analytics, dashboards, and reports. The following image shows the data warehouse components on Azure.

Azure modern data warehouse architecture. (Image Source: Azure Docs)

For a big data pipeline, the data is ingested into Azure through Azure Data Factory in batches, or streamed in near real time using Apache Kafka, Event Hub, or IoT Hub. This data then lands in Azure Data Lake Storage for long-term persisted storage.

Azure Data Lake Storage is an enterprise-wide hyper-scale repository for large volumes of raw data. It is a suitable staging storage for our ingested data before the data is converted into a format suitable for data analysis, since it can store any data in its native format, without requiring any prior transformations. Data Lake Storage can also be accessed from Hadoop through WebHDFS-compatible REST APIs.

As part of our data analytics workflow, we can use Azure Databricks as a platform to run SQL queries on the data lake and provide results for the dashboards in, for example, Power BI. In addition, Azure Databricks also integrates with the MLflow machine learning platform API to support the end-to-end machine learning lifecycle from data preparation to deployment.

In the logistics industry, the need to store spatial data is greater than ever.

Let’s say a container trucking company collects data about each container delivery through an IoT device installed on the vehicle. Information such as the location and the speed of the prime mover is constantly sent from the IoT device to Azure Event Hub. We can then use Azure Databricks to correlate the trip data, and also to enrich the correlated data with neighbourhood data stored in the Databricks file system.

In addition, to process large amounts of data efficiently, we can also rely on Azure Synapse Analytics, which is basically an analytics service and a cloud data warehouse that lets us scale compute and storage elastically and independently, with a massively parallel processing architecture.

Finally, we have Azure Analysis Services, an enterprise-grade analytics engine as a service. It is used to combine data, define metrics, and secure the data in a single, trusted tabular semantic data model. As mentioned by Christian Wade, Power BI Principal Program Manager at Microsoft, in March 2021 they brought Azure Analysis Services capabilities to Power BI.

Pricing tiers available for Azure Analysis Services.

Relational Database Deployment Options on Azure and Hosting Cost

On Azure, there are two database deployment options available, i.e. IaaS and PaaS. The IaaS option means that we host our SQL Server on Azure virtual machines. For the PaaS approach, we can either use Azure SQL Database, which is considered a DBaaS, or Azure SQL Managed Instance. Unless there is a need for the team to have OS-level access and control over the SQL servers, the PaaS approach is normally the best choice.

Both PaaS and IaaS options include a base price that covers the underlying infrastructure and licensing. In IaaS, we can reduce the cost by shutting down the resources. However, in PaaS, the resources are always running unless we drop and re-create them whenever they are needed.

Cloud SQL Server options: SQL Server on IaaS, or SaaS SQL Database in the cloud.
The level of administration we have over the infrastructure and the degree of cost efficiency. (Image Source: Azure Docs)

SQL Managed Instance is the latest deployment option, which enables easy migration of most on-premises databases to Azure. It’s a fully fledged SQL instance with nearly complete compatibility with the on-premises version of SQL Server. Also, since SQL Managed Instance is built on the same PaaS service infrastructure, it comes with all the PaaS features. Hence, if you would like to migrate from on-premises to Azure without management overhead, but at the same time you require instance-scoped features such as SQL Server Agent, you can try SQL Managed Instance.

Andreas Wolter, one of only 7 Microsoft Certified Solutions Masters (MCSM) for the Data Platform worldwide, once came to the Singapore .NET Developers Community to talk about SQL Database Managed Instance. If you’re new to SQL Managed Instance, check out the video below.

Spatial Data Types

Visibility plays a crucial role in the logistics industry because it relates to the ability of supply chain partners to access and share operational information with other parties. Tracking asset locations with GPS is one example. However, how should we handle geography data in our database?

Spatial data, also known as geospatial data, is data represented by numerical values in a geographic coordinate system. There are two types of spatial data, i.e. the Geometry Data Type, which supports Euclidean flat-earth data, and the Geography Data Type, which stores round-earth data, such as GPS latitude and longitude coordinates.

In Microsoft SQL Server, native spatial data types are used to represent spatial objects. In addition, SQL Server is able to index spatial data, provide cost-based optimisations, and support operations such as the intersection of two spatial objects. This functionality is also available in Azure SQL Database and Azure SQL Managed Instance.
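As a quick illustration in T-SQL (the coordinates below are arbitrary sample values):

DECLARE @flat GEOMETRY = geometry::STGeomFromText('POINT(3 4)', 0);     -- Euclidean, flat-earth data
DECLARE @round GEOGRAPHY = geography::Point(1.3521, 103.8198, 4326);    -- round-earth data: latitude, longitude, SRID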

The geometry hierarchy upon which the geometry and geography data types are based. (Image Source: SQL Docs)

Let’s say now we want to find the closest containers to a prime mover as shown in the following map.

The locations of 5 containers (marked in red) and the location of the prime mover (marked in blue).

In addition, suppose we have a table of container positions defined with the schema below.

CREATE TABLE ContainerPositions
(
    Id int IDENTITY (1,1),
    ContainerNumber varchar(13) UNIQUE,
    Position GEOGRAPHY  -- round-earth spatial type for GPS coordinates
);

We can then simply use a spatial function like STDistance, which returns the shortest distance between two geography locations, in our query to sort the containers from the shortest to the longest distance from the prime mover, as shown below.
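Below is a minimal sketch of such a query; the prime mover’s coordinates are made-up values for illustration. For geography data with SRID 4326, STDistance returns the distance in metres.

-- Hypothetical current position of the prime mover (latitude, longitude, SRID)
DECLARE @primeMover GEOGRAPHY = geography::Point(1.2644, 103.8222, 4326);

SELECT ContainerNumber,
       Position.STDistance(@primeMover) AS DistanceInMetres
FROM ContainerPositions
ORDER BY DistanceInMetres;  -- nearest container first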

The container “HKXU 200841-9” is the nearest container to the prime mover.

In addition, starting from version 2.2, Entity Framework Core also supports mapping to spatial data types using the NetTopologySuite spatial library. So, if you are using EF Core in your ASP.NET Core project, for example, you can easily get the mapping to spatial data types.

At the .NET Conference Singapore 2018, we announced the launch of the spatial extension in EF Core 2.2.

Non-Relational Databases on Azure

Azure Table Storage is one of the Azure services for storing non-relational structured data. It provides a key/attribute store with a schemaless design. Since it’s a NoSQL datastore, it is suitable for datasets which do not require complex joins and can be denormalised for fast access.

Each table in Azure Table Storage consists of related entities, similar to database rows in an RDBMS. Each entity can then have up to 252 properties to store the data, together with a partition key. Entities with the same partition key will be stored in the same partition and on the same partition server, and thus can be queried more quickly. This also means that batch processing, the mechanism for performing atomic updates across multiple entities, can only operate on entities stored in the same partition.

In Azure Table Storage, using more partitions increases the scalability of our application. At the same time, however, using more partitions might limit the ability of the application to perform atomic transactions and maintain strong consistency for the data. We can make use of this design to store, for example, the data from each IoT device in a warehouse in a different partition in Azure Table Storage.

For a larger-scale system, we can also design a data solution architecture that captures real-time data via Azure IoT Hub and stores it in Cosmos DB, a fast and flexible distributed database that scales seamlessly with guaranteed latency and throughput. If there is existing data in other data sources, we can also import data from sources such as JSON files, CSV files, SQL databases, and Azure Table Storage into Cosmos DB with the Azure Cosmos DB Migration Tool.

Azure Cosmos DB Migration Tool can be downloaded as a pre-compiled binary.

Globally, Industry 4.0 is transforming the supply chain into a smart and effective process that creates new streams of income. The key idea motivating Industry 4.0 is to guide companies in replacing current manual processes with digital technologies.

Hard copies of the container proof of delivery (POD), for example, are still necessary in today’s container trucking industry. Hence, storing images and files for later document generation and printing is still a key feature in a digitalised supply chain workflow.

Proof of Delivery is still mostly recorded on paper and sent via email or instant messaging services like WhatsApp. There is also no accepted standard for what a proof of delivery form should specify. Each company more or less makes up its own rules.

On Azure, we can make use of Blob Storage to store large, discrete binary objects that change infrequently, such as documents like the Proof of Delivery mentioned earlier.

In addition, there is another service called Azure Files available to provide serverless enterprise-grade cloud file shares. Azure Files can thus completely replace or supplement traditional on-premises file servers or NAS devices.

Hence, as shown in the screenshot below, we can upload files from a computer to the Azure File Share directly. The files will then be accessible from another computer that is also connected to the same Azure File Share.

We can mount Azure File Share on macOS, Windows, and even Linux.

The Data Team

Setting up a new data team, especially in a startup, is a challenging problem. We need to explore the roles and responsibilities in the world of data.

There are basically three roles that we need to have in a data team.

  • Database Administrator: In charge of operations such as managing the databases, creating database backups, restoring backups, monitoring database server performance, and implementing data security and access rights policy.
    • Tools: SQL Server Management Studio, Azure Portal, Azure Data Studio, etc.
  • Data Engineer: Works with the data to build up data pipelines and processes, and applies data cleaning routines and transformations. This role is important for turning raw data into useful information for data analysis.
    • Tools: SQL Server Management Studio, Azure Portal, Azure Synapse Studio.
  • Data Analyst: Explores and analyses data by creating data visualisations and reports which transform data into insights to help in business decision making.
    • Tools: Excel, Power BI, Power BI Report Builder

In 2016, Gartner, a global research and advisory firm, shared a Venn diagram on how data science is multi-disciplinary, as shown below. There are some crucial technical skills needed, such as statistics, querying, modelling, R, Python, SQL, and data visualisation. Besides the technical skills, the team also needs to be equipped with business domain knowledge and soft skills.

The data science Venn Diagram. (Image source: Gartner)

In addition, the data team can also be organised in two manners, according to Lisa Cohen, Microsoft Principal Data Science Manager.

  • Embedded: The data science teams are spread throughout the company and each team serves a specific functional team in the company;
  • Centralised: There will be a core data team providing services to all functional teams across the company.

Run an Audio Server on Azure

Recently, with music-streaming services like Spotify and YouTube Music getting popular, one may ask whether it’s possible to set up a personal music-streaming service. The answer is yes.

There is a solution called Subsonic, which was developed by Sindre Mehus. However, Subsonic is no longer open source after 2016. Hence, we will talk about another open-source project inspired by Subsonic, i.e. Airsonic. According to the official website, the goal of Airsonic is to provide a full-featured, stable, self-hosted media server based on the Subsonic codebase that is free, open source, and community driven. So, let’s see how we can get Airsonic up and running on Azure.

Ubuntu on Microsoft Azure

Azure Virtual Machines supports running both Linux and Windows, and Airsonic can be installed on both too. Since Linux is open-source software, it is cheaper to run a Linux server on Azure than a Windows server.

Currently, Azure supports common Linux distributions including Ubuntu, CentOS, Debian, Red Hat, and SUSE. Here, we choose Ubuntu because it certainly has the upper hand when it comes to documentation and online help, which makes finding OS-related solutions easy. In addition, Ubuntu is updated frequently, with an LTS (Long Term Support) version released once every two years. Finally, if you are a user of Debian-style distributions, Ubuntu will be a comfortable pick.

Ubuntu LTS and interim releases timeline. (Source: ubuntu.com)

Azure VM Size, Disk Size, and Cost

We should deploy a VM that provides the necessary performance for the workload at hand.

The B-series VMs are ideal for workloads that do not need the full performance of the CPU continuously. Hence, things like web servers, small databases, and our current Airsonic project are suitable use cases for B-series VMs. We will go for B1s, which has only 1 virtual CPU and 1 GiB of RAM. We don’t choose B1ls, which has the smallest memory and lowest cost among Azure VM instances, because the installation of Airsonic on B1ls was found to be unsuccessful. The lowest we can go is B1s.

Choosing B1s as the VM size to host Airsonic.

For the OS disk type, instead of the default Premium SSD option, we go for Standard SSD because it is not only a lower-cost SSD offering, but also more suitable for our lightly used audio application.

Remove Public Inbound Ports and Public IP Address

It’s not advisable to have the SSH port exposed to the Internet because it will attract SSH attacks. Hence, we will remove the default public inbound ports so that all traffic from the Internet is blocked. Later, we will use a VPN connection instead to connect to the VM.

Remove all public inbound ports.

By default, when we create a VM on the Azure Portal, a public IP address is given. It’s always recommended not to have a public IP bound to the VM directly, even if there is only a single VM. Instead, we should deploy a load balancer in front of the VM and then have the VM bound to the load balancer. This will eventually make our life easier when we want to scale out our VMs.

To not have any public IP address assigned to the VM, as shown in the screenshot below, we need to change the value of Public IP to “None”.

Setting Public IP to “None”.

Setup Virtual Network and VPN Gateway

When we create an Azure VM, we must create a Virtual Network (VNet) or use an existing one. A VNet is a virtual, isolated portion of the Azure public network, and it can be further segmented into one or more subnets.

It is important to plan how our VM is intended to be accessed on the VNet before creating the actual VM.

The VNet configuration that we will be setting up for this project.

Since we have removed all the public inbound ports for the VM, we need to communicate with the VM through a VPN. Hence, we need at least two subnets: one for the VM and another for the VPN Gateway. We will add the subnet for the VPN Gateway later. For now, we just do as follows.

Configuring VNet for our new VM.

Setup Point-to-Site (P2S) VPN Connection

There are already many tutorials available online about how to set up a P2S VPN on Azure, for example the one written by Dishan Francis in the Microsoft Tech Community, so I will not talk about how to set up the VPN Gateway on Azure. Instead, I’d like to highlight that a P2S connection is not configurable on the Azure Portal if you choose the Basic type of the Azure VPN Gateway.

Once the VM deployment is successful, we can head to the VNet where the VM is located. Then, we add the VPN Gateway subnet as shown in the screenshot below. As you can see, unlike the other subnets, the gateway subnet entry always has its name fixed to “GatewaySubnet”, which we cannot modify.

Specifying the subnet address range for the VPN Gateway.

Next, we create a VPN gateway. Since we are using the gateway for P2S, the VPN type needs to be route-based. The gateway SKU chosen here is the lowest-cost one, VpnGw1. Meanwhile, the Subnet field will be chosen automatically once we specify our VNet.

Creating a route-based VPN gateway.

The VPN gateway deployment process takes about 25 minutes. While waiting for it to complete, we can proceed to create self-signed root and client certificates. Only the root certificate will be used in setting up the VPN Gateway here. The client certificate is installed on the other computers which need P2S connections.

Once the VPN gateway is successfully deployed, we then submit the root certificate data to configure P2S, as shown below. In the Address pool field, I simply use 10.4.0.0/24 as the private IP address range that I want to use. VPN clients will dynamically receive an IP address from the range that we specify here.

Configuring Point-to-Site. Saving this takes about 5 minutes.

Now, we can download the corresponding VPN client to our local machine and install it. With this, we will see a new connection, named after our resource group, available as one of the VPN connections on our machine.

A new VPN connection available to connect to our VM.

We can then connect to our VM using its private IP address, as shown in the screenshot below. Now, our VM is at least secured in the sense that its SSH port is not exposed to the public Internet.

We will not be able to connect to our VM through the PuTTY SSH client if the corresponding VPN is disconnected.

Upgrade Ubuntu to 20.04 LTS

Once we have successfully connected to our VM, if we are using the Ubuntu 18.04 image provided on Azure, we will notice a message reminding us that a newer LTS version of Ubuntu, 20.04, is available, as shown in the screenshot below. Simply proceed to upgrade it.

New release of Ubuntu 20.04.2 LTS is available now.

Set VM Operating Hours

In cloud computing, we pay for what we use. Hence, it’s important that our VMs are only running when necessary. If the VM doesn’t need to run 24 hours every day, we can configure its auto start and stop timings. In my case, I don’t listen to music when I am sleeping, so I will turn off the audio server between 12am and 6am.

To start and stop our VM at a scheduled time of the day, we can use the Tasks function, which is still in preview and available under the Automation section of the VM. It will create two Logic Apps; however, in my case they did not automatically start or stop the VM.

Instead, I have to change the Logic Apps to send HTTP POST requests to the start and powerOff endpoints of Azure directly, as suggested by R:\ob.ert in his post “Start/Stop Azure VMs during off-hours — The Logic App Solution”.
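For reference, these are the Azure Resource Manager endpoints involved; the api-version below is just one of the valid versions, and the placeholders need to be filled in with your own subscription, resource group, and VM names:

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{vmName}/start?api-version=2021-03-01
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{vmName}/powerOff?api-version=2021-03-01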

Changed the Logic Apps generated by the auto-power-off-VM template to send a POST request to the powerOff endpoint directly.

Install Airsonic and Run as Standalone Programme

Since our VM will be automatically stopped and started every day, it’s better to integrate the Airsonic programme with systemd so that Airsonic will automatically run on each boot. There is a tutorial on how to set this up in the Airsonic documentation, so I will not describe the steps here. However, please remember to install OpenJDK 8 too, because Airsonic is a Java application.
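On Ubuntu, for example, the OpenJDK 8 runtime can be installed from the standard repositories:

sudo apt-get update
sudo apt-get install openjdk-8-jre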

Checking the airsonic.service status.

By default, Airsonic will be available at port 8080 and it listens on the path /airsonic. If the installation is successful, with our VPN connected, we shall be able to see the following login screen on our first visit. Please immediately change the password as instructed, for security purposes.

Welcome to Airsonic!

Public IP on VM Only via Load Balancer

We need to allow Airsonic music streaming over the public Internet, and thus the VM needs to be reachable via a public IP. However, since we have already configured our VM not to have any public IP address, there needs to be a public load balancer bound to the VM. This setup gives us the flexibility to change the VM in the backend on the fly and secures the VM from direct Internet traffic.

Now, we can create a public load balancer, as shown in the screenshot below. The reason why the Basic SKU, which has no SLA, is used here is that it’s free. An SLA is optional to me here because this VM will be just a personal audio server.

Creating a new load balancer.

A Basic SKU public IP address uses dynamic assignment as the default IP address assignment method. This means that the public IP address will be released from a resource when the resource is stopped (or deleted), and the same resource will receive a different public IP address on the next start-up. If this is not what you expect, you can choose to use a static IP address to ensure it remains the same.

We now need to attach our VM to the backend pool of the load balancer, as shown in the following screenshot.

Attaching VM to the backend pool of the Azure Load Balancer.

After that, in order to allow Airsonic to be accessible from the public Internet, we shall set an inbound NAT (Network Address Translation) rule on the Azure Load Balancer. Since I have only one VM here, I directly set the VM as the target and set up a custom port mapping from port 80 to port 8080 (8080 being the default port used by Airsonic), as shown below.

A new inbound NAT rule has been set for the Airsonic VM.

Also, at the same time, we need to allow port 8080 in the Network Interface of the VM, as highlighted in the screenshot below.

Note: The VM airsonic-main-02 shown in the screenshot is the 2nd VM that I have for the same project. It is the same as the airsonic-main VM.

Allow inbound port 8080 on the Airsonic VM.

Once all of this is done, we can finally access Airsonic through the public IP address of the load balancer.

Enjoy the Music

By default, the media folder used by Airsonic is /var/music. If this music folder does not exist yet on the VM, simply proceed to create one.
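For example, assuming Airsonic runs under a dedicated airsonic user, as in the official systemd tutorial, we can create the folder and hand its ownership to that user:

sudo mkdir -p /var/music
sudo chown -R airsonic:airsonic /var/music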

Airsonic will scan the media folder every day at 3am by default.

By default, the media folder is not accessible by any of the users. We need to explicitly give users access to the media folders, as shown in the screenshot below.

Giving user access to the media folders.

As recommended by Airsonic, the music we add to /var/music and other media folders is better organised in an “artist/album/song” manner. This will help Airsonic to automatically build the albums. In addition, since I have already entered the relevant properties, such as title and artist name, into the music files, Airsonic can read them and display them on the web app, as shown in the screenshot below.

The cover image is automatically picked up from an image file named cover.png in the corresponding album folder.

In addition, both Airsonic and Subsonic provide the same API. Hence, we can access our music on Airsonic through Subsonic mobile apps as well. Currently, I am using the free Subsonic Music Streamer app on my Android phone and it works pretty well.

The music on our Airsonic server can be accessed through Subsonic mobile app too!

Personal OneDrive Music Player on Raspberry Pi with a Web-based Remote Control (Part 2)

There are so many things that we can build with a Raspberry Pi. It has always been a small dream of mine to have a personal music player sitting on my desk. In Part 1, we already successfully set up the music player programme, written in Golang, on the Raspberry Pi.

Now we need a web app that acts as a remote control and sends commands to the music player to play the selected song. In this article, we will talk about the web portal and how we access OneDrive with Microsoft Graph and the go-onedrive client.

[Image Caption: System design of this music player together with its web portal.]

Project GitHub Repository

The complete source code of this web-based music player remote control can be found at https://github.com/goh-chunlin/Lunar.Music.Web.

Gin Web Framework

I had a discussion with Shaun Chong, the Ninja Van CTO, and I was told that they’re using the Gin web framework. Hence, I decided to try the framework out in this project as well.

Gin offers a fast router that’s easy to configure and use. For example, to serve static files, we simply need to have a folder static_files, for example, in the root of the programme together with the following one line.

router.StaticFS("/static", http.Dir("static_files"))

However, since I later need to host this web app on Azure Function, I will not go this route. Hence, the following are currently the main handlers in the main function.

router.LoadHTMLGlob("templates/*.tmpl.html")

router.GET("/api/HttpTrigger/login-url", getLoginURL)
router.GET("/api/HttpTrigger/auth/logout", showLogoutPage)
router.GET("/api/HttpTrigger/auth/callback", showLoginCallbackPage)

router.GET("/api/HttpTrigger/", showMusicListPage)

router.POST("/api/HttpTrigger/send-command-to-raspberrypi", sendCommandToRaspberryPi)

The first line loads all the HTML template files (*.tmpl.html) located in the templates folder. The templates we have include some reusable ones, such as the following footer template in the file footer.tmpl.html.

<!-- Footer -->
<footer id="footer">
    <div class="container">
        ...
    </div>
</footer>

We can then import such reusable templates into other HTML files as shown below.

<!DOCTYPE html>
<html>
    ...
    <body>
        ...
        {{ template "footer.tmpl.html" . }}
        ...
    </body>
</html>

After importing the templates, we have five routes defined. All of the routes start with /api/HttpTrigger because this web app is designed to be hosted on Azure Function with an HTTP-triggered function called HttpTrigger.

The first three routes are for authentication. After that, there is one route for loading the web pages, and one handler for sending a RabbitMQ message to the Raspberry Pi.

The showMusicListPage handler function checks whether the user is logged in to a Microsoft account with access to OneDrive. It displays the homepage if the user is not logged in yet. Otherwise, if the user has logged in, it lists the music items in the user’s OneDrive Music folder.

func showMusicListPage(context *gin.Context) {
    ...
    // Retrieve the default OneDrive drive of the signed-in user.
    defaultDrive, err := client.Drives.Default(context)
    if err == nil && defaultDrive.Id != "" {
        ...
        // The user is signed in, so render the music list page.
        context.HTML(http.StatusOK, "music-list.tmpl.html", gin.H{ ... })

        return
    }

    // Not signed in (or no drive found), so render the homepage.
    context.HTML(http.StatusOK, "index.tmpl.html", gin.H{ ... })
}

Hosting Golang Application on Azure Function

There are many ways to host a Golang web application on Microsoft Azure. The one we will be using in this project is Azure Function, the serverless service from Microsoft Azure.

Currently, Azure Function offers first-class support for only a limited number of programming languages, such as JavaScript, C#, Python, and Java. Golang is not one of them. Fortunately, in March 2020, the Azure Function custom handler was announced officially, even though it’s still in preview now.

A custom handler is a lightweight HTTP server, written in any language, which enables developers to bring applications, such as those written in Golang, into Azure Function.

[Image Caption: The relationship between the Functions host and a web server implemented as a custom handler. (Image Source: Microsoft Docs – Azure)]

What is even more impressive is that, for HTTP-triggered functions with no additional bindings or outputs, we can enable HTTP Request Forwarding. With this configuration, the handlers in our Golang application can work directly with the HTTP requests and responses. This is all configured in the host.json of the Azure Function, as shown below.

{
    "version": "2.0",
    "extensionBundle": {
        "id": "Microsoft.Azure.Functions.ExtensionBundle",
        "version": "[1.*, 2.0.0)"
    },
    "customHandler": {
        "description": {
            "defaultExecutablePath": "lunar-music-webapp.exe"
        },
        "enableForwardingHttpRequest": true
    }
}

The defaultExecutablePath points to our Golang web app binary executable, which is the output of the go build command.
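Since the executable name ends with .exe, the Function App here runs on Windows. Assuming we develop on macOS or Linux, we can cross-compile the Windows binary with the Go toolchain:

GOOS=windows GOARCH=amd64 go build -o lunar-music-webapp.exe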

So, in the wwwroot folder of the Azure Function, we should have the following items, as shown in the screenshot below.

[Image Caption: The wwwroot folder of the Azure Function.]

Since we have already enabled HTTP Request Forwarding, we don’t have to worry about the HttpTrigger directory. Also, we don’t have to upload our web app source code because the executable is already there. What we need to upload are just the static resources, such as our HTML template files in the templates folder.

The reason why I don’t upload other static files, such as CSS, JavaScript, and image files, is that those static files can be structured with multiple directory levels. We would encounter a challenge when trying to define the Route Template of the Function, as shown in the screenshot below. There is currently no way to define the route template without knowing the maximum number of directory levels we can have in our web app.

[Image Caption: Defining the route template in the HTTP Trigger of a function.]

Hence, I move all the CSS, JS, and image files to Azure Storage instead.

From the Route Template in the screenshot above, we can also understand why the routes in our Golang web app need to start with /api/HttpTrigger.

Golang Client Library for OneDrive API

In this project, users will first store their music files online in their personal OneDrive Music folder. I restrict it to the Music folder only to keep the management of the music more organised, not because there is any technical challenge in getting files from other folders in OneDrive.

Referring to the Google project where they built a Golang client library for accessing the GitHub API, I have also come up with go-onedrive, a Golang client library, still a work in progress, for accessing Microsoft OneDrive.

Currently, go-onedrive only supports simple API methods, such as getting Drives and DriveItems. These two are the most important API methods for our web app in this project. I will continue to enhance the go-onedrive library in the future.

[Image Caption: OneDrive can access the metadata of a music file as well, so we can use API to get more info about the music.]

The go-onedrive library does not directly handle authentication. Instead, when creating a new client, we need to pass in an http.Client that can handle authentication. The easiest and recommended way to do this is to use the oauth2 library.
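As a minimal sketch (go-onedrive follows the go-github pattern it is modeled on, and newOneDriveClient here is just a hypothetical helper name), the wiring looks roughly like this, assuming we already hold a valid oauth2.Token (obtaining one is covered next):

import (
    "context"

    "github.com/goh-chunlin/go-onedrive/onedrive"
    "golang.org/x/oauth2"
)

func newOneDriveClient(ctx context.Context, token *oauth2.Token) *onedrive.Client {
    // Wrap the token in an http.Client that attaches it to every request.
    tokenSource := oauth2.StaticTokenSource(token)
    httpClient := oauth2.NewClient(ctx, tokenSource)

    // The authenticated http.Client is all the go-onedrive client needs.
    return onedrive.NewClient(httpClient)
}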

So the next thing we need to do is add a user authentication feature to our web app.

Microsoft Graph and Microsoft Identity Platform

Before our web app can make requests to OneDrive, it needs users to authenticate and authorise the application to access their data. Currently, the officially recommended way of doing so is to use Microsoft Graph, a set of REST APIs which enable us to access data on Microsoft cloud services, such as OneDrive, User, Calendar, and People. For more information, please refer to the YouTube video below.

We can then send HTTP GET requests to its endpoints to retrieve information from OneDrive; for example, /me/drive will return the default drive of the currently authenticated user.
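For example, the full request to get the default drive looks like this:

GET https://graph.microsoft.com/v1.0/me/drive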

Generally, to access the OneDrive API, developers are recommended to use the standard OAuth 2.0 authorisation framework with the Azure AD v2.0 endpoint. This is where the new Microsoft Identity Platform, the recommended way of accessing Microsoft Graph APIs, comes in. The Microsoft Identity Platform allows developers to build applications that sign in users and get tokens to call the APIs, as shown in the diagram below.

[Image Caption: Microsoft Identity Platform experience. (Image Source: Microsoft Docs – Azure)]

By the way, according to Microsoft, the support for ADAL will come to an end in June 2022, so it’s better to do the necessary migration if you are still using v1.0. Currently, the Golang oauth2 package is already using the Microsoft Identity Platform endpoints.

[Image Caption: Microsoft Identity Platform endpoints are used in the Golang OAuth2 package since December 2017.]

Now, the first step is to register an Application with Microsoft on the Azure Portal. From there, we can get both the Client ID and the Client Secret (the secret is now available under the “Certificates & secrets” section of the Application).

After that, we need to find out the authentication scopes to use so that the correct access is granted when the user signs in from our web app.

With that information available, we can define the OAuth2 configuration in our web app as follows.

var oauthConfig = &oauth2.Config{
    RedirectURL:  AzureADCallbackURL,
    ClientID:     AzureADClientID,
    ClientSecret: AzureADClientSecret,
    Scopes:       []string{"files.read offline_access"},
    Endpoint:     microsoft.AzureADEndpoint("common"),
}
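With this configuration in place, a handler like getLoginURL (registered earlier) can generate the Microsoft sign-in URL. The sketch below is a hypothetical minimal version; a real handler should use a securely generated random state value:

func getLoginURL(context *gin.Context) {
    // "random-state" is a placeholder; use a securely generated value
    // and verify it again in /auth/callback to protect against CSRF.
    loginURL := oauthConfig.AuthCodeURL("random-state")

    context.JSON(http.StatusOK, gin.H{"loginUrl": loginURL})
}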

The “files.read” scope grants read-only permission to all OneDrive files of the logged-in user. By the way, to check the Applications that you have given access to so far in your Microsoft Account, you can refer to the consent management page of the Microsoft Account.

Access Token, Refresh Token, and Cookie

The “offline_access” scope is used here because we need a refresh token that can be used to generate additional access tokens as necessary. However, please take note that this “offline_access” scope is not available for the Token Flow. Hence, we can only use the Code Flow, which is described in the following diagram.

[Image Caption: The Code Flow. (Image Source: Microsoft Docs – OneDrive Developer)]

This explains why we have the following code in /auth/callback, which is the Redirect URL of our registered Application. What the code does is exchange the auth code returned from the /authorize endpoint for the access token and refresh token at the /token endpoint.

r := context.Request
code := r.FormValue("code")

// Exchange the auth code for an access token and a refresh token.
response, err := http.PostForm(
    microsoft.AzureADEndpoint("common").TokenURL,
    url.Values{
        "client_id":     {AzureADClientID},
        "redirect_uri":  {AzureADCallbackURL},
        "client_secret": {AzureADClientSecret},
        "code":          {code},
        "grant_type":    {"authorization_code"},
    },
)

Here, we cannot simply decode the response body into oauth2.Token yet. This is because the JSON in the response body from the Azure AD token endpoint only has expires_in but not expiry, so it does not have any field that can map to the Expiry field in oauth2.Token. Without Expiry, the refresh_token will never be used, as highlighted in the following screenshot.

[Image Caption: Even though Expiry is optional, without it the refresh token will not be used.]

Hence, we must define our own struct, tokenJSON, so that we can first decode the response body into tokenJSON and then convert it to an oauth2.Token with a value in the Expiry field before passing the token to the go-onedrive client. By doing so, the access token will be automatically refreshed as necessary.
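A minimal sketch of that conversion is shown below; the field list of tokenJSON is an assumption based on the standard OAuth2 token response (it requires the encoding/json and time packages):

// tokenJSON mirrors the relevant fields of the token endpoint response.
type tokenJSON struct {
    AccessToken  string `json:"access_token"`
    RefreshToken string `json:"refresh_token"`
    ExpiresIn    int64  `json:"expires_in"`
}

var tj tokenJSON
err = json.NewDecoder(response.Body).Decode(&tj)
if err == nil {
    token := &oauth2.Token{
        AccessToken:  tj.AccessToken,
        RefreshToken: tj.RefreshToken,
        // Convert the relative expires_in into an absolute Expiry so that
        // the oauth2 transport knows when to use the refresh token.
        Expiry: time.Now().Add(time.Duration(tj.ExpiresIn) * time.Second),
    }
    ...
}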

Finally, we just need to store the token in a cookie using gorilla/securecookie, which encodes an authenticated and encrypted cookie, as shown below.

encoded, err := s.Encode(ACCESS_AND_REFRESH_TOKENS_COOKIE_NAME, token)
if err == nil {
    cookie := &http.Cookie{
        Name: ACCESS_AND_REFRESH_TOKENS_COOKIE_NAME,
        Value: encoded,
        Path: "/",
        Secure: true,
        HttpOnly: true,
        SameSite: http.SameSiteStrictMode,
    }
    http.SetCookie(context.Writer, cookie)
}

Besides encryption, we also enable both the Secure and HttpOnly attributes so that the cookie is sent securely and cannot be accessed by unintended parties or the JavaScript Document.cookie API. The SameSite attribute also ensures the cookie above is not sent with cross-origin requests, thus providing some protection against Cross-Site Request Forgery (CSRF) attacks.

Microsoft Graph Explorer

For testing purposes, there is an official tool known as the Graph Explorer. It lets us make requests against the Microsoft Graph and see the responses. From the Graph Explorer, we can also retrieve the access token and use it in other tools, such as Postman, to do further testing.

[Image Caption: Checking the response returned from calling the get DriveItems API.]

Azure Front Door

In addition, Azure Front Door is added between the user and the web app to give us a convenient way of managing the global routing of traffic to our web app.

The very first reason why I use Azure Front Door is that I want to hide the /api/HttpTrigger part from the URL. This can be done by setting a custom forwarding path which points to /api/HttpTrigger/ with URL rewrite enabled, as shown in the screenshot below.

[Image Caption: Setting the route details for the Rules Engine in Azure Front Door.]

In the screenshot above, we also notice a field called Backend Pool. A backend pool is a set of equivalent backends to which Front Door load balances client requests. In our project, this is the Azure Function that we’ve created above. Hence, in the future, we can deploy the same web app to multiple Azure Functions so that Azure Front Door can help us do load balancing.

Finally, Azure Front Door also provides a feature called Session Affinity, which directs subsequent traffic from a user session to the same application backend for processing, using Front Door generated cookies. This feature can be useful if we are building a stateful application.

Final Product

Let’s take a look at what it looks like after we’ve deployed the web app above and uploaded some music to OneDrive. The web app is now accessible through the Azure Front Door URL at https://lunar-music.azurefd.net/.

[Image Caption: My playlist based on my personal OneDrive Music folder.]

Yup, that’s all. I finally have a personal music entertainment system on a Raspberry Pi. =)

References

The code of the music player described in this article can be found in my GitHub repository: https://github.com/goh-chunlin/Lunar.Music.Web.

If you would like to find out more about Microsoft Identity Platform, you can also refer to the talk below given by Christos Matskas, Microsoft Senior Program Manager. Enjoy!