Analytical Processing on Transaction Data with Google BigQuery and Data Studio

Data analytics is not the same as reporting. While reporting presents the data, analytics answers the whys. Data analytics is the practice of using data and information to make informed decisions.

When I was in a startup, the CEO assigned me the task of working on analytical processing of transaction data. In the early days of the startup, the number of transactions was low, so simple data processing using stored procedures on MS SQL databases was sufficient. However, based on my past experience in SMEs, without investing in a data workflow early, it becomes challenging for the team to use the data to make informed decisions later. Imagine, five years down the road, the team having to analyse the huge amount of data collected over those five years with just Excel.

🎨  A simple solution for big data processing with Google BigQuery. 🎨

Hence, in this article, we will focus on how to do analytical data processing with Google BigQuery and then visualise the data using Google Data Studio.

Analytical Data Processing

When we talk about data processing, there are two main categories: Transactional Processing and Analytical Processing.

For example, in my previous startup team, we had an Order Management Support (OMS) team that focused on tracking and processing orders on time. What the OMS team does is transactional processing. We also had a Data Analysis (DA) team to analyse the sales data, for example to find out the monthly revenue. The DA team is thus performing analytical data processing.

Hence, the DA team needs to analyse large batches of data. As the business grows, the data the team needs to access will go back months, or even years. Also, as more sales channels are introduced to the business, the DA team may need to access multiple data sources as well. So, let's see how a data warehouse can help the DA team deal with this big data.

Why Google BigQuery?

There are many data warehouse solutions out there.

We chose Google BigQuery because it is a data warehouse very similar to the RDBMSes we have long been familiar with. Another piece of good news is that Google BigQuery now supports standard SQL, which is SQL:2011 compliant. Hence, the DA team can move to Google BigQuery seamlessly.

In addition, Google BigQuery can handle the complex analytical queries that are essential to the business, and the data stored in it can easily scale to petabytes as the business grows.

Fast, near real-time access to our data is another advantage of Google BigQuery: within a few seconds, the DA team can retrieve the results of processing a huge amount of data.

Finally, Google BigQuery is serverless, so we don't have to provision compute nodes as we do in AWS Redshift.

🎨  The first 1TB of on-demand querying per month is free on Google BigQuery. 🎨

Importing Data to Google BigQuery

Every month, the transaction data of the previous month is normally double-checked and verified by the relevant teams. Once that is done, we can download the transaction report of the month as a CSV file.

We chose the CSV format because it is one of the three file types accepted by Google BigQuery; the other two are JSON and Avro. Yes, Avro! We talked about it in our previous article about Azure Event Hub.

Here, we will use Google Cloud Storage to store the CSV files because it is one of the accepted data sources for Google BigQuery.

🎨  Monthly transaction data is kept in a bucket on Google Cloud Storage. 🎨

So after the monthly transaction data has been uploaded to the storage, we can proceed to create a new dataset (a concept that can be treated like a database in an RDBMS). Then, within the dataset, we can start to create a new table based on the monthly transaction data.

There are a few ways to lay out the tables in the dataset. Merging the data from all months into one table seems easier to maintain. However, I decided to go with one table per month instead, because this allows me to delete and re-upload a month's data whenever I need to.

🎨  Creating a new table based on data stored in Google Cloud Storage. Take note that Google BigQuery is powerful enough to generate the schema for us. 🎨

After that, we need to make sure the header row(s) in our CSV file, if any, are excluded during import, as shown in the following screenshot. If we don't do this, the header row may be wrongly included in the table. The reason we don't need the header here is also that the schema has already been auto-detected by Google BigQuery (or defined manually by the person uploading the data) in the previous step.

🎨  Remember to remove the header rows in the CSV during the import stage. 🎨
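Alternatively, if we load the file with the bq command-line tool instead of the web UI, the --skip_leading_rows flag does the same job. A sketch, with hypothetical bucket, dataset, and table names:

$ bq load --source_format=CSV --autodetect --skip_leading_rows=1 sales.transactions_201801 gs://my-transactions-bucket/201801.csv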

Sometimes, there might be some data corruption in the CSV file, for example when a column that is expected to contain only numbers suddenly has a non-numerical value. Google BigQuery will then complain, as shown in the screenshot below.

🎨  The dialog indicates the row number (29928) and the position of the error. 🎨

Once the data is imported successfully into the table, we can preview the data in it.

This preview function is a very user-friendly feature. Do you still remember the query pricing we mentioned earlier? If we explore the data in the table with a “SELECT *” statement, we will be charged and our usage quota will be affected. However, the preview function allows us to get a rough idea of the data in the table for free (with our quota unaffected)!

🎨  Previewing the data in the table. 🎨 

Views and Queries

Now that we have seen both datasets and tables, we need to introduce a third concept in Google BigQuery, the view. A view is a virtual table defined by a SQL query. Unlike a table, which actually holds the records, a view displays the data in the related tables by executing its defining query.

Earlier, I split the monthly transaction data into separate tables. I actually sort of regretted doing so, because I couldn't get an overview for the yearly report. Fortunately, with a view, I can come up with a virtual table that holds the essential data of every month in a year, combining all the tables with UNION in the query.

🎨  Combining transaction data of each month in 2018 into one single view. 🎨 
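A minimal sketch of such a view query. The dataset name (sales), the monthly table names, and the column names are hypothetical, since the real ones only appear in the screenshot; note that BigQuery standard SQL requires UNION ALL (or UNION DISTINCT) rather than a bare UNION:

SELECT Month, Driver, Cost FROM sales.transactions_201801
UNION ALL
SELECT Month, Driver, Cost FROM sales.transactions_201802
UNION ALL
-- ... one SELECT per remaining monthly table ...
SELECT Month, Driver, Cost FROM sales.transactions_201812;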

The query is easy to write as long as we are familiar with standard SQL. For example, we can have a view that shows the driver with the highest cost of jobs in each month, using a query with LIMIT and UNION ALL as shown in the following screenshot.

🎨  Who is the top driver? =) 🎨 
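A sketch of that query under the same hypothetical names, picking the most costly row of each month with ORDER BY and LIMIT, then stitching the months together with UNION ALL:

(SELECT Month, Driver, Cost
 FROM sales.transactions_201801
 ORDER BY Cost DESC
 LIMIT 1)
UNION ALL
(SELECT Month, Driver, Cost
 FROM sales.transactions_201802
 ORDER BY Cost DESC
 LIMIT 1)
-- ... and so on for the remaining months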

Visualisation using Data Studio

Dashboard building and reporting are very important in almost every business because they make it easy to communicate findings, retain information, and gather insights from the data. In short, it is a form of data storytelling.

Similar to Power BI, Google Data Studio comes with interactive dashboards and beautiful reports that inspire smarter business decisions. Data Studio has connectors to Google Cloud Platform (GCP) services, including Google BigQuery, as well as to other data stores. So, let's see how we can make use of it to visualise our data.

In Data Studio, we first need to search for the BigQuery connector.

🎨  Google BigQuery <> Google Data Studio 🎨 

After that, we need to locate the table or view that will provide the necessary data for the visualisation, as shown in the following screenshot.

🎨  We can use both tables and views as data sources in Data Studio. 🎨

We can then draw a table with bars to show the driver with the highest job cost in each month, as shown in the image below.

🎨  Table with bars. 🎨 
🎨  ABC 🎨 

However, we notice that the months are not ordered properly, which makes it hard to analyse the data. Why is that? It turns out that the Month field is currently recognised as “Text”, which is indicated by “ABC” in the right-hand menu.

To correct this, we need to click on the “ABC” beside the “Month” field and change its type to “Date”. The format we will use is YYYYMM, because that is how the year and month are formatted in the data source.

🎨  Changing the type of the “Month” field to be Date & Time. 🎨 

We will then see the “Month” column in the table shown in date format. Interestingly, even though the format we chose is YYYYMM, when Data Studio displays the months, they appear as, for example, April 2018 instead of 201804.

All that's left to do is to sort the table by month in ascending order.

🎨  Sorting the table according to month. 🎨 

So, congratulations to Mr Heng Swee Ren for being the only driver to have the highest cost in two different months!

In Data Studio, we can also further customise the style of our report, such as changing the colours, to make it more engaging for readers.

🎨  A simple report for a huge amount of data is done easily without any coding. 🎨 

Conclusion

This is not the end of the journey; it's actually just the beginning. For example, the line chart shown above has a limitation of showing only 10 series, as discussed here. If there are more than 10 series, some of them will be grouped as “Others” in the chart and the visualisation will no longer make sense.

Anyway, this is my proposal for how we can do analytical data processing with a data warehouse tool such as Google BigQuery. Through this article, I also hope that businesses, especially startups and SMEs, will start looking into building common business reports with Power BI or Google Data Studio, which avoids wasting programmers' time on building in-house dashboards and reports that end up not being highly customisable.

[KOSD Series] Running MS SQL Server 2019 on macOS

A few days ago, a teammate of mine wanted to learn how to use MS SQL Server. However, he only has a MacBook, and MS SQL Server doesn't run natively on macOS. Hence, I decided to write him a quick setup guide on how to do it with the help of containers.

Microsoft announced SQL Server on Linux back in March 2016, so besides Windows, SQL Server 2019 also runs on Linux. We can thus easily spin up a Linux container and host SQL Server in it.

🎨 Microsoft introduced SQL Server on Linux in 2016. 🎨

Docker

We need Docker running on our Mac machine. Since my teammate is new to Docker, he can simply take the rather straightforward path of using Docker Desktop on Mac. Kindly take note of the system requirements before proceeding to install it.

Once Docker is up and running, we can proceed to pull the image of SQL Server 2019.

SQL Server 2019 Developer Edition

In 2019, continuing with the approach of delivering a consistent and trustworthy acquisition experience for Microsoft container images, the Microsoft Container Registry (MCR) was announced. The SQL Server image is hosted on MCR, while its documentation remains on Docker Hub.
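Pulling the image is a single command; the docker run command below would also pull it automatically on first use:

$ docker pull mcr.microsoft.com/mssql/server:2019-latest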

We can run the following command in a Terminal window to start the database server. Here we use 1501 as the host port. Take note that we need to replace the password with our own password, which must meet the following guidelines:

  • at least 8 characters;
  • includes uppercase letters, lowercase letters, base-10 digits, and/or non-alphanumeric symbols.
$ docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=yourStrong(!)Password' -p 1501:1433 -d mcr.microsoft.com/mssql/server:2019-latest

In the command above, there are two environment variables.

Firstly, there is the environment variable “ACCEPT_EULA”. Setting it to Y means that we accept the End-User Licensing Agreement of the product. So far I still couldn't find the EULA of Microsoft SQL Server 2019; if you know where it is, please drop me a message in the comments section. Thanks!

Secondly, there is “SA_PASSWORD”, which sets the password that we will use later to connect to the SQL Server as the database admin (user ID “sa”).
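To quickly verify that the server is up, we can run sqlcmd inside the container. A sketch: the container ID is a placeholder, and the tools path may differ between image versions:

$ docker exec -it <container-id> /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P 'yourStrong(!)Password'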

Actually, there is another environment variable which is not set here: MSSQL_PID, i.e. the product ID of the SQL Server. By default, it is the Developer edition. If we would like to use the Express or Enterprise edition, we can specify it here, as sketched below.
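For example, to run the Express edition instead, we would pass MSSQL_PID explicitly (same placeholder password and port as above):

$ docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=yourStrong(!)Password' -e 'MSSQL_PID=Express' -p 1501:1433 -d mcr.microsoft.com/mssql/server:2019-latest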

The reason we chose the Developer edition is that it is licensed for use as a development and test system, not as a production server. In addition, despite being the Developer edition, it includes all the functionality of the Enterprise edition. Hence, SQL Server Developer is an ideal choice for developers like us to build and test applications.

🎨 Docker Hub page of Microsoft SQL Server. 🎨

There is more information about the MS SQL Server image on its Docker Hub page, so I will not repeat it here.

Azure Data Studio

To visualise and manage the data in our databases, we need a tool such as SQL Server Management Studio (SSMS). However, SSMS is only for Windows (AMD or Intel). So, on macOS, we have to choose a cross-platform alternative: Azure Data Studio, which is usable on Windows and Linux too.

Interestingly, Azure Data Studio was previously called SQL Operations Studio. Hence, please make sure you use the latest one, which is Azure Data Studio.

Now we can connect to the SQL Server from Azure Data Studio as shown below. Take note that the Server is “localhost,1501”, with a comma, not a dot, between the word localhost and the port number.

🎨 Connecting to the Microsoft SQL Server from Azure Data Studio. 🎨

If the connection is successful, we shall be able to see the Server Dashboard as shown below.

🎨 Server Dashboard in Azure Data Studio. 🎨

That’s all. Now we can have MS SQL Server running on our Mac machine for local testing and development.

KOSD, or Kopi-O Siew Dai, is a type of Singapore coffee that I enjoy. It is basically a cup of coffee with a little bit of sugar. This series is meant to blog about technical knowledge that I gained while having a small cup of Kopi-O Siew Dai.

Getting Started: Making It a Golang Web Application

Continue from the previous topic

Playing YouTube videos via the console is cool, but it's not user-friendly. Hence, I decided to move on and turn it into a web app. The design I chose to follow is the one shared in Sau Sheong Chang's book, Go Web Programming. The following diagram gives an overview of the architecture of the web app.

Overview of the web application. (Source: Go Web Programming)

Hence, I started off with two folders and two Go files in the project folder. The two folders are the public and templates folders. The public folder stores all the CSS and JavaScript files. Then we have an index.go file, which basically holds the handler function for the homepage of our web app, and finally the server.go file.

The server.go

In the server.go file, we have our multiplexer, which is in charge of inspecting the URL being requested and dispatching the request to the correct handler. The main function also sits in the server.go file. That is also where we connect to the database, whose connection is made global. The following code shows the beginning of our main function.

package main

import (
    "database/sql"
    "net/http"
    "os"

    _ "github.com/lib/pq" // assumed PostgreSQL driver; it registers itself with database/sql
)

var db *sql.DB

func main() {

    var err error

    // Initialize connection string from the environment
    // (fmt.Sprintf is unnecessary since os.Getenv already returns a string).
    var connectionString = os.Getenv("CONNECTION_STRING")

    // Initialize connection object.
    db, err = sql.Open("postgres", connectionString)
    checkError(err)

    // Set up multiplexer.
    mux := http.NewServeMux()

    // Handle static files in the public folder.
    staticFiles := http.FileServer(http.Dir("public"))
    mux.Handle("/static/", http.StripPrefix("/static/", staticFiles))

    mux.HandleFunc("/index", index)

    server := &http.Server{
        Addr:    "127.0.0.1:8081",
        Handler: mux,
    }

    server.ListenAndServe()
}

One thing to take note of is that, as described in the GoDoc, “The returned DB (of sql.Open) is safe for concurrent use by multiple goroutines and maintains its own pool of idle connections. Thus, the Open function should be called just once. It is rarely necessary to close a DB.” So we don't close the DB here.

The index.go

In the index.go file, where we have our homepage handler function, we first connect to the database to retrieve the list of videos from the table. In addition, the index handler function is in charge of generating the HTML response with the templates we define in the templates folder, which in this case is index.html.

The following code shows the index handler function, which retrieves all the video records from the table and then stores them in a map, the built-in dictionary data type in Go.

package main

...

func index(writer http.ResponseWriter, request *http.Request) {
    // Named tmpl so it does not shadow the template package,
    // and the parse error is no longer silently discarded.
    tmpl, err := template.ParseFiles("templates/index.html")
    checkError(err)

    // Do not pass the Ping error to checkError here: checkError terminates
    // the process, so the friendly message below would never be rendered.
    err = db.Ping()

    if err != nil {
        tmpl.Execute(writer, "Cannot connect to the database")
    } else {
        // Read data from table.
...

        sqlStatement := "SELECT * FROM videos;"

        rows, err := db.Query(sqlStatement)
        checkError(err)

        defer rows.Close()

        videos := make(map[string]string)

        ...

        tmpl.Execute(writer, videos)
    }
}
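The second elision above is where each row gets scanned into the map. A minimal sketch, assuming the query returns exactly a video ID and a video name (the real schema of the videos table is not shown here):

for rows.Next() {
    var videoID, videoName string
    // Adjust the query or the Scan targets to match the actual columns.
    err := rows.Scan(&videoID, &videoName)
    checkError(err)
    videos[videoID] = videoName
}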

The templates/index.html

Now, let's see what's inside the index.html template file. It basically uses a Bootstrap 4.2.1 theme that I downloaded from Bootswatch.

<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="utf-8">
<title>YouTube RePlayer</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<link rel="stylesheet" href="static/bootstrap.min.css" media="screen">
</head>

<body>
...
{{ range $videoId, $videoName := . }}
...
<span>{{ $videoName }}</span>
...
{{ end }}
...
</body>
</html>

There are two highlights in the code above.

Firstly, even though we put the CSS and JS files in the public folder, we still refer to them with the relative path “static/…”. The reason lies in the following lines in server.go.

// Handle static files in the public folder
staticFiles := http.FileServer(http.Dir("public"))
mux.Handle("/static/", http.StripPrefix("/static/", staticFiles))

Secondly, we iterate through the map using range, which lists the videos we retrieved earlier on the HTML side.

The public Folder

Since we are using Bootstrap, we need the CSS and JS files of the framework locally. Hence, they are all put under the public folder. The files containing my customised CSS and JS code are also put in this folder.

This is the web app so far, with the “Add To List” function that will be covered in another article later.

Form Submission

Now I move on to add a function that allows users to add a new YouTube video to the video list. In the HTML, we have the following hidden fields. They are hidden because their values, i.e. the YouTube video ID and the video title, are retrieved from the URL and the YouTube API, respectively.

<form action="/addVideo" method="POST">
<input id="hidVideoID" name="hidVideoID" type="hidden" />
<input id="hidVideoName" name="hidVideoName" type="hidden" />

<input id="btnAddToList" type="submit" class="btn btn-primary btn-lg" value="Add to List"></input>
</form>

After that, since I want the form to be posted to the relative request URL /addVideo, a new handler is needed. Hence, the following line is added to the server.go file.

mux.HandleFunc("/addVideo", addVideo) 

Then we have our new handler function in a new file, addVideo.go, as shown below.

package main

...

func addVideo(writer http.ResponseWriter, request *http.Request) {
    request.ParseForm()

    // As in index, do not pass the Ping error to checkError: checkError
    // terminates the process, so the redirect below would never happen.
    err := db.Ping()

    if err != nil {

        http.Redirect(writer, request, "/index", http.StatusSeeOther)

    } else {

        // Insert data into the table.
        sqlStatement := "INSERT INTO videos (name, url) VALUES ($1, $2);"

        _, err = db.Exec(sqlStatement, request.PostForm["hidVideoName"][0], "https://www.youtube.com/watch?v="+(request.PostForm["hidVideoID"][0]))
        checkError(err)

        http.Redirect(writer, request, "/index", http.StatusSeeOther)
    }
}

As the code above shows, after the data is successfully saved to the database table, we redirect the user back to the homepage.

References

1. Golang: http.Redirect;
2. Golang SQL Database Open and Close;
3. Golang: How to Redirect to an URL.

Getting Started: Connecting Golang Console App with Azure PostgreSQL Database

Setting up Go in VS Code

Firstly, I need to equip Visual Studio Code with Go language support by installing the Go extension in the IDE. The extension helps us do our day-to-day job, such as code navigation, code editing, code testing, and debugging, in an easy and efficient manner.

Installed and enabled the Go extension from Microsoft.

There are many cool features I like in the extension.

Firstly, there is the convenience of F12 code navigation. With the Go extension, I can easily view the source code of a type definition in Go code with just F12 or Alt+F12 on it. This is similar to my C# coding experience in Visual Studio. To show all references of a type, I can simply use Shift+F12.

Secondly, on every file save, the extension will build, vet, and lint. Build (go build) builds the command or the package. Vet (go vet) examines Go source code and reports suspicious constructs, such as Printf calls whose arguments do not align with the format string. A linter is a tool that gives coding-style feedback and suggestions; by default, this extension uses the official golint as the linter.
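As a quick illustration of what go vet catches, the following snippet compiles fine but uses a mismatched format verb, and vet flags it on save:

package main

import "fmt"

func main() {
    count := 3
    // go vet reports that %s expects a string, but count is an int.
    fmt.Printf("found %s videos\n", count)
}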

We can change go.lintTool to another linter, for example the more advanced Go Meta Linter. Take note of the warnings (and errors) shown under the Problems tab below.

After installing the Go extension, I proceeded to start coding. The app that I am going to build in this learning journey is a YouTube video player app. So, let's get started!

Storing Connection String in Environment

I will start by building a console application that connects to Azure PostgreSQL. To do that, I connect to the database as follows.

package main

const (
    // Initialize connection constants.
    HOST     = "...postgres.database.azure.com"
    DATABASE = "..."
    USER     = "..."
    PASSWORD = "..."
)

...

func main() {
    var connectionString = fmt.Sprintf("host=%s port=5432 user=%s password=%s dbname=%s sslmode=require", HOST, USER, PASSWORD, DATABASE)
    ...
}

The code above is quite straightforward. However, it has a problem: the database connection configuration is hard-coded, which reduces the maintainability of the code. Also, it is extremely bad to have the password appear in the code in plain text. To store such sensitive data, we should use environment variables instead, as recommended by the article Best Practice for Configuration File in Your Code.

In PowerShell, there are two ways of setting an environment variable. The first is a temporary variable that lasts only as long as the PowerShell session. The command is as follows.

> $env:CONNECTION_STRING = '...'

If what you are looking for is a permanent one, you can do the following.

> [environment]::SetEnvironmentVariable("CONNECTION_STRING", "...", "User")

After setting the connection string as an environment variable, we can shorten the earlier code as follows.

package main

...

func main() {
    // os.Getenv already returns a string, so no fmt.Sprintf is needed.
    var connectionString = os.Getenv("CONNECTION_STRING")

    ...
}

Connecting to Database

After that, we can initialize the connection object.

package main

...

func checkError(err error) {
    if err != nil {
        // log.Fatal already exits the process, so the panic that
        // originally followed it here could never be reached.
        log.Fatal(err)
    }
}

func main() {
    var connectionString = ...

    // Initialize connection object.
    db, err := sql.Open("postgres", connectionString)
    checkError(err)

    err = db.Ping()
    checkError(err)
    fmt.Println("Successfully created connection to database")

    ...
}

The Ping() function verifies that the connection to the database is alive, establishing a connection if necessary. After that, we can use the database handle db to do CRUD operations, as demonstrated below.

Remember to allow access to the PostgreSQL server from Azure services, and also from your local IP if you need to access the database from your local machine.

Insert Data into Table

sqlStatement := "INSERT INTO table (column1) VALUES ($1);"
_, err = db.Exec(sqlStatement, "New Value")
checkError(err)

Read Data from Table

var id int
var column1 string

sqlStatement := "SELECT * from table;"
rows, err := db.Query(sqlStatement)
checkError(err)

defer rows.Close()

for rows.Next() {
    // Note: rows.Scan never actually returns sql.ErrNoRows (that comes from
    // QueryRow), but the case is kept here to mirror the Azure quickstart.
    switch err := rows.Scan(&id, &column1); err {
    case sql.ErrNoRows:
        fmt.Println("No rows were returned")
    case nil:
        fmt.Printf("Data row = (%d, %s)\n", id, column1)
    default:
        checkError(err)
    }
}

Update Data in Database

sqlStatement := "UPDATE table SET column1 = $1 WHERE id = $2;"
_, err = db.Exec(sqlStatement, "New Value 2", 1)
checkError(err)

Delete Data from Database

sqlStatement := "DELETE FROM table WHERE id = $1;"
_, err = db.Exec(sqlStatement, 1)
checkError(err)

A YouTube Playlist Console App

With the code above, we can then create a simple table to store our YouTube video URLs as a playlist, as shown in the following screenshot.

The console app allows the user to choose a video from the list and plays it in the default browser.
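A minimal sketch of reading the user's choice from the console; the prompt wording and the surrounding playlist code are assumed:

var choice int
fmt.Print("Enter the number of the video to play: ")
_, err := fmt.Scan(&choice)
checkError(err)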

To make the program play the video in a browser, we can make use of the code written by hyg (黄勇刚), which opens a web page in the default browser of the machine, as shown below.

func openbrowser(url string) {
    var err error

    switch runtime.GOOS {
    case "linux":
        err = exec.Command("xdg-open", url).Start()
    case "windows":
        err = exec.Command("rundll32", "url.dll,FileProtocolHandler", url).Start()
    case "darwin":
        err = exec.Command("open", url).Start()
    default:
        err = fmt.Errorf("unsupported platform")
    }

    // checkError already handles the nil case, so no extra if is needed.
    checkError(err)
}
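For example, after looking up the chosen video's URL from the table, the app can simply call the function like this (videoURL being a hypothetical variable holding that URL):

openbrowser(videoURL)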

Playing YouTube videos via the console is cool, isn't it? Next, we will see how we can turn it into a web application so that we can have a better UI/UX.

To be continued in the next article

References

1. Quickstart: Create an Azure Database for PostgreSQL server in the Azure portal;
2. Azure Database for PostgreSQL: Use Go language to connect and query data;
3. [StackOverflow] List all environment variables from command line;
4. Azure Database for PostgreSQL Server firewall rules;
5. PostgreSQL – CREATE Database;
6. [StackOverflow] How to switch databases in psql?;
7. [StackOverflow] How to read input from console line?;
8. [StackOverflow] Convert string to integer type in Go?;
9. [StackOverflow] Reading an integer from standard input;
10. [StackOverflow] How to exit from PostgreSQL command line utility: psql.