Data – cuteprogramming

Data Protection APIs in ASP.NET Core

October 12, 2022 by Chun Lin, posted in ASP.NET, C#, Cloud Computing: Microsoft Azure, Core, Data, Experience, Security

Beginning with Windows 2000, Microsoft Windows operating systems have been shipped with a data protection interface known as DPAPI (Data Protection Application Programming Interface). DPAPI is a simple cryptographic API. It doesn’t store any persistent data for itself; instead, it simply receives plaintext and returns cyphertext.

Windows DPAPI isn’t intended for use in web applications. Fortunately, ASP.NET Core offers data protection APIs which include also key management and rotation. With the APIs, we are able to store security-sensitive data for our ASP .NET Web apps.

Configure Service Container and Register Data Protection Stack

In ASP.NET Core project, we have to first configure a data protection system and then add it to the service container for dependency injection.

public void ConfigureServices(IServiceCollection services)
{
    // …
    
    services.AddDataProtection()
            .PersistKeysToFileSystem(new DirectoryInfo(@"\server\shared\directory\"))
            .SetApplicationName("<sharedApplicationName>");
}

In the code above, instead of storing key at the default location, which is %LOCALAPPDATA%, we store it on a network drive by specifying the path to the UNC Share.

By default, the Data Protection system isolates apps from one another based on their content root paths, even if they share the same physical key repository. This isolation prevents the apps from understanding each other’s protected payloads. Just in case we may need to share protected payloads among apps, we can configure SetApplicationName first so that other apps with the same value later can share the protected payloads.

Key Protection with Azure Key Vault

The code above shows how we can store keys on a UNC share. If we head to the directory \server\shared\directory\, we will be seeing an XML file with content similar as what is shown below.

<?xml version="1.0" encoding="utf-8"?>
<key id="..." version="1">
  <creationDate>2022-08-31T02:50:40.14912Z</creationDate>
  <activationDate>2022-08-31T02:50:40.0801042Z</activationDate>
  <expirationDate>2022-11-29T02:50:40.0801042Z</expirationDate>
  <descriptor deserializerType="Microsoft.AspNetCore.DataProtection.AuthenticatedEncryption.ConfigurationModel.AuthenticatedEncryptorDescriptorDeserializer, Microsoft.AspNetCore.DataProtection, Version=7.0.0.0, Culture=neutral, PublicKeyToken=adb9793829ddae60">
    <descriptor>
      <encryption algorithm="AES_256_CBC" />
      <validation algorithm="HMACSHA256" />
      <masterKey p4:requiresEncryption="true" xmlns:p4="http://schemas.asp.net/2015/03/dataProtection">
        <!-- Warning: the key below is in an unencrypted form. -->
        <value>au962I...kpMYA==</value>
      </masterKey>
    </descriptor>
  </descriptor>
</key>

As we can see, the key <masterKey> itself is in an unencrypted form.

Hence, in order to protect the data protection key ring, we need to make sure that the storage location should be protected as well. Normally, we can use file system permissions to ensure only the identity under which our web app runs has access to the storage directory. Now with Azure, we can also protect our keys using Azure Key Vault, a cloud service for securely storing and accessing secrets.

The approach we will take is to first create an Azure Key Vault called lunar-dpkeyvault with a key named dataprotection, as shown in the screenshot below.

Created a key called dataprotection on Azure Key Vault.

Hence, the key identifier that we will be using to connect to the Azure Key Vault from our application will be new Uri(“https://lunar-dpkeyvault.vault.azure.net/keys/dataprotection/”).

We need to give our app the Get, Unwrap Key, and Wrap Key permissions to the Azure Key Vault in its Access Policies section.

Now, we can use Azure Key Vault to protect our key by updating our codes earlier to be as follows.

services.AddDataProtection()
        .PersistKeysToFileSystem(new DirectoryInfo(@"\server\shared\directory\"))            
        .SetApplicationName("<sharedApplicationName>");
        .ProtectKeysWithAzureKeyVault(new Uri("https://lunar-dpkeyvault.vault.azure.net/keys/dataprotection/"), credential);

The credential can be a ClientSecretCredential object or DefaultAzureCredential object.

Tenant Id, Client Id, and Client Secret can be retrieved from the App Registrations page of the app having the access to the Azure Key Vault above. We can use these three values to create a ClientSecretCredential object.

Now, if we check again the newly generated XML file, we shall see there won’t be <masterKey> element anymore. Instead, it is replaced with the content shown below.

<encryptedKey xmlns="">
    <!-- This key is encrypted with Azure Key Vault. -->
    <kid>https://lunar-dpkeyvault.vault.azure.net/keys/dataprotection/...</kid>
    <key>HSCJsnAtAmf...RHXeeA==</key>
    <iv>...</iv>
    <value>...</value>
</encryptedKey>

Key Lifetime

We shall remember that, by default, the generated key will have a 90-day lifetime. This means that the app will automatically generate a new active key when the current active key expires. However, the retired keys can still be used to decrypt any data protected with them.

Hence, we know that the data protection APIs above are not primarily designed for indefinite persistence of confidential payload.

Create a Protector

To create a protector, we need to specify Purpose Strings. A Purpose String provides isolation between consumers so that a protector cannot decrypt cyphertext encrypted by another protector with different purpose.

_protector = provider.CreateProtector("Lunar.DataProtection.v1");

Encrypt Text AND THEN DECRYPT IT

Once we have the data protector, we can encrypt the text with the Protect method as shown below.

string protectedPayload = _protector.Protect("<text to be encrypted>");

If we would like to turn the protectedPayload back to the original plain text, we can use the Unprotect method.

try 
{
    string originalText = _protector.Unprotect(protectedPayload);
    ...
} 
catch (CryptographicException ex) 
{
    ...
}

Yup, that’s all for quick starting of encrypting and decrypting texts in ASP .NET Core.

Analysing Data from Multiple Data Sources with OctoSQL

March 28, 2021March 28, 2021 by Chun Lin, posted in Data, Experience

Throughout my career, I have been given data-related tasks such as getting the number of products sold in each category, retrieving a list of drivers of a certain trucking company, calculating the gross profit of the week, etc. The data is always spread across varied locations. They can be stored in JSON files, CSV files, relational databases, or even event streams. Of course, I can write a C# program or Python script to solve those tasks. However, could it be that there is no better solution to such problems?

Recently, I came across this tool called OctoSQL. It is a query tool that allows us to join, analyse, and transform data from multiple data sources using SQL. Hence, for those of you who are dealing with data from multiple sources in different ways, OctoSQL can definitely be helpful to you, especially you have SQL background.

Setup and Configuration

To setup OctoSQL, we can simply download the executable file in the Release page of the OctoSQL GitHub project. After that, we can specify the path to the OctoSQL executable in the PATH system environment variable. Once we have done that, then when we run the following command, we shall be able to see the help page of the OctoSQL.

After that, we need to specify a data source configuration path in a system environment variable OCTOSQL_CONFIG so that in our future commands, we don’t need to keep specifying the same path to the configuration file.

Genshin Impact as Sample Data

I will use the data from my favourite game Genshin Impact here so that it’s more fun for me to explain.

In Genshin Impact, there are seven gods, and each takes charge of a different elemental domain. The seven elements are: Pyro (fire), Geo (earth), Dendro (grass), Cryo (ice), Electro (lightning), Anemo (wind), and Hydro (water). Every character makes use of one of these elements in battle.

Each character has different Max HP and other attributes.

Each of the characters can bring a weapon to the battle. Currently, there are 5 weapon types in Genshin Impact such as Swords, Bows, Polearms, Claymores, and Catalysts. Normally the weapon will have a Base ATK and some other attributes, as shown in the screenshot below.

A character can only master in using one single type of weapon.

Now, I will proceed to show a few ideas on how we can use this tool to make our data processing and analysing life easier.

Use Case #1: Single JSON File

Let’s say we have a JSON file called characters.json with game characters and their info, we would like to find out the highest Base HP of each Element. Hence, we will have our config file as follows.

dataSources:
  - name: characters
    type: json
    config:
      path: "C:\\Users\\gclin\\source\\repos\\OctoSQLSamples\\characters.json"
      arrayFormat: true

Currently, OctoSQL accepts JSON in the form of either one record per line without commas or JSON list of records. By default, it is expecting JSON file is having one record per line. Hence, since our JSON file is a file containing a JSON array, we need to specify arrayFormat to be true in the configuration file.

Both Anemo and Geo characters have the highest Base HP.

Next, we can find out who are the characters having the highest Base HP in his/her own element group, as shown below, using nested query and inner join.

The characters who are having highest Base HP in each element.

Use Case #2: When JSON Meets CSV

Data is not always given to us in the same format. Hence, sometimes we will have situations where we need to do data analysis based on data in, for example, JSON and CSV. So, now let’s assume that we have the character’s weapon data stored in a CSV file. We will then have to update the config file as follows.

dataSources:
  - name: characters
    type: json
    config:
      path: "C:\\Users\\gclin\\source\\repos\\OctoSQLSamples\\characters.json"
      arrayFormat: true
  - name: weapons
    type: csv
    config:
      path: "C:\\Users\\gclin\\source\\repos\\OctoSQLSamples\\weapons.csv"

By default, OctoSQL assumes that the CSV file is using comma as delimiter and has a header row.

Now we can find out the maximum Base ATK for each of the elements.

Cyro and Pyro elements have the highest maximum Base ATK.

Use Case #3: JSON with PostgreSQL on Azure

Now, let’s assume that we are not given the CSV file. Instead, the data sits in a PostgreSQL database on the Microsoft cloud.

We can use Azure Data Studio to access the PostgreSQL database on Azure.

Could we easily do a table join between these two data sources, i.e. JSON and PostgreSQL? The answer is yes. We simply need to change the config file as follows, where we input the URL of the PostgreSQL server on Azure together with user info, database name, and the table name.

dataSources:
  - name: characters
    type: json
    config:
      path: "C:\\Users\\gclin\\source\\repos\\OctoSQLSamples\\characters.json"
      arrayFormat: true
  - name: weapons
    type: postgres
    config:
      address: "genshin-impact.postgres.database.azure.com:5432"
      user: "chunlin@genshin-impact"
      password: "<password here>"
      databaseName: "postgres"
      tableName: "weapons"

Merging data from both JSON and the PostgreSQL.

Interestingly, when I try to run a more complicated nested SQL, OctoSQL currently doesn’t return any result.

Query Plan in DOT Language and Graphviz

A directed graph of three nodes A, B, and C.

In OctoSQL, we can describe its query plan in DOT Language, a graph description language. For example, the DOT Language below describes a directed graph G of three nodes A, B, and C.

digraph G {
    A -> B -> C;
    A -> C;
}

Using the Graphviz, a package of open-source tools initiated by AT&T Labs Research, we can visualise the graph described in the DOT Language.

In OctoSQL, DOT Language is also used to describe the query plan. As shown in the following screenshot, it is a query plan for a simple query SELECT * FROM weapons w. We are using WebGraphviz here to display the corresponding graph.

Physical query plan of a simple SELECT query.

References

Analytical Processing on Transaction Data with Google BigQuery and Data Studio

June 7, 2020June 8, 2020 by Chun Lin, posted in Cloud Computing: Google Cloud Platform, Data, Google BigQuery, Google Data Studio, SQL

Data analysing is not about reporting. While reporting gives data, data analytics gives answers to the whys. Data analytics is the practice of using data and information to make informed decisions.

When I was in a startup, I was assigned a task by the CEO to work on analytical processing on transaction data. In the early days of the startup, the number of transactions was low, so simple data processing using stored procedures on MS SQL databases was sufficient. However, based on my past experience in SMEs, without investing in data workflow early, it will be challenging for the team to use the data to make informed decisions later. Imagine five years down the road, the team requires to do analysis on the huge amount of data collected in the past five years with just Excel.

🎨 Simple solution to do big data processing with Google BigQuery. 🎨

Hence, in this article, we will be focusing on how we can do analytical data processing with Google BigQuery and then visualise the data using Google Data Studio.

Analytical Data Processing

There are two main categories when we talk about data processing, i.e. Transactional Processing and Analytical Processing.

For example, in my previous startup team, we had an Order Management Support (OMS) team that focused on tracking and processing the orders on time. What the OMS team does is transactional processing. Then we also had another Data Analyse (DA) team to analyse sales data to find out about monthly revenue, for example. So the DA team is basically performing analytical data processing.

Hence, the DA team needs to analyse large batch of data. As the business grows, the data the team needs to access will be going back months, or even years. Also, when there are more sales channels introduced in the business, the DA team may need to access multiple data sources as well. So, let’s see how we can use data warehouse to help dealing with the big data the DA team has.

Why Google BigQuery?

There are many data warehouse solutions out there.

The reason why we choose Google BigQuery is because it is a data warehouse that is very similar to the RDBMS which we have been very familiar with. Another good news is that Google BigQuery now supports the standard SQL which is ANSI:2011 compliant. Hence, the DA team can move to Google BigQuery seamlessly.

In addition, Google BigQuery can handle complex analytical queries which will be essential to the businesses and the data stored in it can easily scale to petabytes as the businesses grow.

The fast real-time access to our data is also another advantage of Google BigQuery. So within a few seconds, the DA team can retrieve the results from processing the huge amount of data.

Finally, Google BigQuery is serverless. So we don’t have to instantiate compute nodes like we do in AWS Redshift.

🎨 First 1TB per month is free for on-demand querying on Google BigQuery. 🎨

Importing Data to Google BigQuery

The transaction data of the previous month will normally be double checked and verified by the relevant teams monthly. Once it is done, we can then download the transaction report of the month into a CSV file.

The reason why we choose CSV file is because it’s one of the three file types accepted by Google BigQuery. The other two are JSON and AVRO. Yes, AVRO! We talked about it in our previous article about Azure Event Hub.

Here, we will use Google Cloud Storage to store the CSV files because it is one of the accepted data source for Google BigQuery.

🎨 Monthly transaction data is kept in the bucket on Google Cloud Storage. 🎨

So after the monthly transaction data has been uploaded to the storage, we can then proceed to create a new dataset (a concept which can be treated as database in RDBMS). Then in the dataset, we can start to create new table based on the monthly transaction data.

There are a few ways on how to create the table in the dataset. It seems like merging data from all months into one table is easier for maintenance. However, I decide to go for the way where we have one table for every month. This way actually allows me to delete and upload monthly data whenever I need to.

🎨 Creating a new table based on data stored in the Google Cloud Storage. Take note that Google BigQuery is powerful enough to generate schema for us. 🎨

After that, we need to make sure to remove the header row(s) in our CSV file, if any, as shown in the following screenshot. If we don’t do this, the header row may be wrongly included into the dataset. The reason why we don’t need header here is also because the schema has been auto-detected by the Google BigQuery (or defined differently by the person who is uploading the data) in the previous step.

🎨 Remember to remove header in CSV during import stage. 🎨

Sometimes, there might be some data corruption in the CSV file. For example, a column which is expected to contain only number suddenly has a non-numerical value. Then Google BigQuery will complain to us, as shown in the screenshot below..

🎨 The dialog will indicate the row number (29928) and its position of the error. 🎨

Once the data is imported successfully to the table, we can then preview the data in the table.

This preview function is a very user-friendly feature. Do you still remember about the query price we mentioned earlier? If we explore the data in the table with the “SELECT *” statement, we will be charged and our usage quota may be affected. However, the preview function allows us to get a rough idea about the data in the table for free (and quota not affected)!

Views and Queries

Now we have seen both dataset and table, we need to introduce third concept in Google BigQuery called View. The view is actually a virtual table defined by a SQL query. Unlike the table which actually holds the records, the view will display the data in the related tables by executing its view-query.

Just now, I split the monthly transaction data into different tables. Actually I sort of regretting it after doing so because I can’t have an overview of the yearly report. Fortunately, with view, I can come up with a virtual table that holds essential data for the months in a year from all the tables using UNION in the query.

🎨 Combining transaction data of each month in 2018 into one single view. 🎨

The query is easy to write as long as we are familiar with the standard SQL. For example, we can have a view which will show the driver who has the highest cost of jobs in each month with the query which uses LIMIT and UNION ALL shown in the following screenshot.

Visualisation using Data Studio

Dashboard building and reporting are very important in almost all the businesses because they make it easy to translate messages, retain information, and gather insights from the data. In short, that’s a way of data storytelling.

Similar as Power BI, Google Data Studio comes with interactive dashboards and beautiful reports that inspire smarter business decisions. The Data Studio has connectors to Google Cloud Platform (GCP) services, including Google BigQuery, and data stores. So, we will see how we can make use of it to visualise our data.

In the Data Studio, we first need to search for the BigQuery connector first.

🎨 Google BigQuery <> Google Data Studio 🎨

After that, we need to locate the table or view which will provide the necessary data for the data visualisation, as shown in the following screenshot.

🎨 We can use both table and view as the data source in Data Studio. 🎨

We then can draw a table with bars to show the driver with highest cost in their jobs for each month, as shown in the image below.

However, we notice that the month is actually not ordered properly and it makes us hard to analyse the data. Why is it so? It turns out that the data type of the Month field is actually now recognised as “Text” which is indicated as “ABC” in the right menu.

To correct it, we need to click on the “ABC” beside the “Month” field and change its type to “Date”. The format we will use YYYYMM because that is how the year and month are formatted in the data source.

🎨 Changing the type of the “Month” field to be Date & Time. 🎨

We will then see the “Month” column in the table will be shown in date format. The funny thing is even though the format we choose is YYYYMM, when the Data Studio displays the months, they will be displayed as, for example, April 2018 instead of 201804.

What we’re now left to do is just sort the table according to the month in ascending order.

🎨 Sorting the table according to month. 🎨

So, congratulation to Mr Heng Swee Ren for being the only driver having the highest cost for two months!

In the Data Studio, we can also further customise the style of our report, such as changing the colours, to make the report to be more engaging to the readers.

🎨 A simple report for a huge amount of data is done easily without any coding. 🎨

Conclusion

This is not the end of the journey. It’s actually just the beginning. For example, the line chart shown above actually has a limitation of showing only 10 series, as discussed over here. If there are more than 10 series, some of them will be grouped as “Others” in the chart and the visualisation will no longer make sense.

Anyway, this is my proposal of how we should do analytical data processing with data warehouse tool, such as Google BigQuery. Through this article, I also hope that businesses, especially startups and SMEs, can start to look into building common business reports using Power BI or Google Data Studio which will avoid wasting programmers’ time on coming up in-house dashboard and reports which are not highly customisable.