From Zero to Gemini: Building an AI-Powered Game Helper

On a chilly November morning, I attended Google DevFest 2024 in Singapore. Together with my friends, I joined a workshop titled “Gemini Masterclass: How to Unlock Its Power with Prompting, Functions, and Agents”, led by two incredible speakers, Martin Andrews and Sam Witteveen.

Martin holds a PhD in Machine Learning and has been an Open Source advocate since 1999, while Sam is a Google Developer Expert in Machine Learning. Both of them are also organisers of the Machine Learning Singapore Meetup group. Together, they delivered an engaging, hands-on workshop about Gemini, the advanced LLM from Google.

Thanks to their masterclass, I have taken my first steps into the world of LLMs. This blog post captures what I learned along the way and my journey into the fascinating world of Gemini.

Martin Andrews presenting at Google DevFest 2024 in Singapore.

About LLM and Gemini

LLM stands for Large Language Model. To most people, an LLM is like a smart friend who can answer almost all our questions with responses that are often accurate and helpful.

As an LLM, Gemini is trained on a large amount of text data and can perform a wide range of tasks: answering questions, writing stories, summarising long documents, or even helping to debug code. What makes LLMs special is their ability to “understand” and generate language in a way that feels natural to us.

Many of my developer friends have started using Gemini as a coding assistant in their IDEs. While it is good at that, Gemini is much more than just a coding tool.

Gemini is designed to not only respond to prompts but also act as an assistant with an extra set of tools. To make the most of Gemini, it is important to understand how it works and what it can (and cannot) do. With the knowledge gained from the DevFest workshop, I decided to explore how Gemini could assist with optimising relic choices in a game called Honkai: Star Rail.

Honkai: Star Rail and Gemini for Its Relic Recommendations

Honkai: Star Rail (HSR) is a popular RPG that has captured the attention of players worldwide. One of the key features of the game is its relic system, where players equip their characters with relics like hats, gloves, or boots to boost stats and unlock special abilities. Each relic has unique attributes, and selecting the right sets of relics for a character can make a huge difference in gameplay.

An HSR streamer, MurderofBirds, browsing through thousands of relics. (Image Source: MurderofBirds Twitch)

As a casual player, I often found myself overwhelmed by the number of options and the subtle synergies between different relic sets. Finding a good relic combination for each character was time-consuming.

This is where LLMs like Gemini come into play. With the ability to process and analyse complex data, Gemini can help players make smarter decisions.

In this blog post, I will briefly show how a Gemini-powered relic recommendation system can analyse a player’s current characters and suggest the best options for them. The system also explains the logic behind its recommendations, helping us understand why certain relics are ideal.

Setting Up the Project

To make my project code available to everyone, I used Google Colab, a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs and TPUs. You can access my code by clicking on the button below.

Open In Colab

In my project, I used the google-generativeai Python library, which is pre-installed in Colab. This library serves as a user-friendly API for interacting with Google LLMs, including Gemini. It makes it easy for us to integrate Gemini capabilities directly into our code.

Next, we will need to import the necessary libraries.

Importing the libraries and setting up the Gemini client.

The first library to import is, of course, google.generativeai. Without it, we cannot interact with Gemini easily. Then we have google.colab.userdata, which securely retrieves sensitive data, like our API key, directly from the Colab notebook environment.

We will also use IPython.display for displaying results in a readable format, such as Markdown.
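Since the original notebook shows this step as a screenshot, here is a minimal sketch of what that setup cell contains, limited to the three libraries described above:

import google.generativeai as genai             # the Gemini SDK
from google.colab import userdata               # secure access to Colab secrets
from IPython.display import Markdown, display   # render Gemini responses as Markdown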

In the Secrets section, we will have two records:

  • HONKAI_STAR_RAIL_PLAYER_ID: Your HSR player UID. It is used later to personalise relic recommendations.
  • GOOGLE_API_KEY: The API key that we can get from Google AI Studio to authenticate with Gemini.
Creating and retrieving our API keys in Google AI Studio.

Once we have initialised the google.generativeai library with the GOOGLE_API_KEY, we can proceed to specify the Gemini model we will be using.

The choice of model is crucial in LLM projects. Google AI Studio offers several options, each representing a trade-off between accuracy and cost. For my project, I chose models/gemini-1.5-flash-8b-001, which provided a good balance for this experiment. Larger models might offer slightly better accuracy but at a significant cost increase.

Google AI Studio offers a range of models, from smaller, faster models suitable for quick tasks to larger, more powerful models capable of more complex processing.
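Putting these pieces together, the authentication and model selection look roughly like this. This is only a sketch, and the smoke-test prompt is purely illustrative:

# Authenticate with the API key stored as a Colab secret, then pick a model.
genai.configure(api_key=userdata.get('GOOGLE_API_KEY'))

MODEL_NAME = 'models/gemini-1.5-flash-8b-001'   # small, fast, and cheap enough for this experiment
model = genai.GenerativeModel(MODEL_NAME)

# Quick smoke test (illustrative prompt).
response = model.generate_content("In one sentence, what is a relic in Honkai: Star Rail?")
display(Markdown(response.text))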

Hallucination and Knowledge Limitation

We often think of LLMs like Gemini as our smart friends who can answer any question. But just like even our smartest friend can sometimes make mistakes, LLMs have their limits too.

Gemini’s knowledge is based on the data it was trained on, which means it doesn’t actually know everything. Sometimes, it might hallucinate, i.e. the model invents information that sounds plausible but is not actually true.

Kiana is not a character from Honkai: Star Rail but she is from another game called Honkai Impact 3rd.

While Gemini is trained on a massive dataset, its knowledge is not unlimited. As a responsible AI, it acknowledges its limitations. So, when it cannot find the answer, it will tell us that it lacks the necessary information rather than fabricating a response. This is how Google builds safer AI systems, as part of its Secure AI Framework (SAIF).

Knowledge cutoff in action.

To overcome these constraints, we need to employ strategies to augment the capabilities of LLMs. Techniques such as integrating Retrieval-Augmented Generation (RAG) and leveraging external APIs can help bridge the gap between what the model knows and what it needs to know to perform effectively.

System Instructions

Leveraging System Instructions is a way to improve the accuracy and reliability of Gemini’s responses.

System instructions are prompts given before the main query in order to guide Gemini. These instructions provide crucial context and constraints, significantly enhancing the accuracy and reliability of the generated output.

System Instruction with contextual information about HSR characters ensures Gemini has the necessary background knowledge.

The specific design and phrasing of the system instructions provided to Gemini are crucial. Effective system instructions give Gemini the necessary context and constraints to generate accurate and relevant responses. Without carefully crafted system instructions, even the most well-designed prompt can yield poor results.
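As an illustration, system instructions are supplied when the model is created. The snippet below is a heavily shortened, hypothetical excerpt; my actual instruction contains a much longer list of characters and constraints:

# Hypothetical, shortened system instruction; the real one lists many HSR characters.
HSR_SYSTEM_INSTRUCTION = (
    "You are an assistant for the game Honkai: Star Rail (HSR). "
    "Only answer questions about HSR characters and relics. "
    "If a character is not in your context, say you do not know instead of guessing."
)

model = genai.GenerativeModel(
    'models/gemini-1.5-flash-8b-001',
    system_instruction=HSR_SYSTEM_INSTRUCTION,
)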

Context Framing

As we can see from the example above, writing clear and effective system instructions requires careful thought and a lot of testing.

This is just one part of a much bigger picture called Context Framing, which includes preparing data, creating embeddings, and deciding how the system retrieves and uses that data. Each of these steps needs expertise and planning to make sure the solution works well in real-world scenarios.

You might have heard the term “Prompt Engineering,” and it sounds kind of technical, but it is really about figuring out how to ask the right questions in the right way to get the best answers from an LLM.

While context framing and prompt engineering are closely related and often overlap, they emphasise different aspects of the interaction with the LLM.

Stochasticity

While experimenting with Gemini, I noticed that even with the exact same prompt, the output can vary slightly each time. This happens because LLMs like Gemini have a built-in element of randomness, known as Stochasticity.

Lingsha, an HSR character released in 2024. (Image Credit: Game8)

For example, when querying for DPS characters, Lingsha was inconsistently included in the results. While this might seem like a minor variation, it underscores the probabilistic nature of LLM outputs and suggests that running multiple queries might be needed to obtain a more reliable consensus.

Lingsha was inconsistently included in the response to the query about multi-target DPS characters.
According to the official announcement, even though Lingsha is a healer, she can cause significant damage to all enemies too. (Image Source: Honkai: Star Rail YouTube)

Hence, it is important to treat writing effective system instructions and prompts as an iterative process, so that we can experiment with different phrasings to find what works best and yields the most consistent results.

Temperature Tuning

We can also reduce the stochasticity of Gemini’s responses by adjusting parameters like temperature. Lower temperatures typically reduce randomness, leading to more consistent outputs, but may also reduce creativity and diversity.

Temperature is an important parameter for balancing predictability and diversity in the output. In the gemini-1.5-flash model, it is a number in the range 0.0 to 2.0, with a default of 1.0, that controls how the model samples from its probability distribution over the vocabulary when generating text. A lower temperature makes the model more likely to select words with higher probabilities, resulting in more predictable and focused text.

Setting Temperature=0 means that the model will always select the most likely word at each step, making the output highly deterministic and repetitive.
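For example, the temperature can be set per request through the generation config; the value below is only illustrative:

# Lower temperature -> more deterministic output; higher -> more varied output.
response = model.generate_content(
    "List the multi-target DPS characters in Honkai: Star Rail.",
    generation_config=genai.types.GenerationConfig(temperature=0.2),
)
print(response.text)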

Function Calls

A major limitation of using system instructions alone is their static nature.

For example, my initial system instructions included a list of HSR characters, but this list is static. The list does not include newly released characters or characters specific to the player’s account. In order to dynamically access a player’s character database and provide personalised recommendations, I integrated Function Calls to retrieve real-time data.

For fetching the player’s HSR character data, I leveraged the open-source Python library mihomo. This library provides an interface for accessing game data, enabling dynamic retrieval of a player’s characters and their attributes. This dynamic data retrieval is crucial for generating truly personalised relic recommendations.

Using the mihomo library, I retrieve five of my Starfaring Companions.
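The two tool functions exposed to Gemini might look roughly like the stubs below. These are simplified sketches with placeholder return values; the real implementations call the mihomo client and return live data. The function names, type hints, and docstrings are what the SDK turns into the function declarations that Gemini later relies on.

# Simplified sketches of the tool functions (placeholder data for illustration only).
def get_player_name(player_id: int) -> str:
    """Return the in-game name of the player with the given HSR UID."""
    # Real implementation: fetch the player's profile via mihomo and return its nickname.
    return "Trailblazer"

def get_player_starfaring_companions(player_id: int) -> list[str]:
    """Return the names of the characters shown on the player's HSR profile."""
    # Real implementation: fetch the profile via mihomo and list its characters.
    return ["Gepard", "Lingsha", "Fu Xuan", "Himeko", "Welt"]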

Defining the functions in my Python code was only the first step. To use function calls, Gemini needed to know which functions were available. We can provide this information to Gemini as shown below.

model = genai.GenerativeModel('models/gemini-1.5-flash-8b-001', tools=[get_player_name, get_player_starfaring_companions])

After we pass a query to Gemini, the model returns a structured object that includes the names of the relevant functions and their arguments based on the prompt, as shown in the screenshot below.

The correct function call is picked up by Gemini based on the prompt.

Using descriptive function names is essential for successful function calling, because Gemini relies heavily on the function names (and their docstrings) in our Python code to decide which function to call. Inaccurate naming can directly impact the reliability of the entire system.

If our Python function is named misleadingly, for example a function called get_age that actually returns the name of the person, Gemini might wrongly select it when the prompt asks for an age.

As shown in the screenshot above, the prompt requested information about all the characters of the player. Gemini simply determines which function to call and provides the necessary arguments. Gemini does not directly execute the functions. The actual execution of the function needs to be handled by us, as demonstrated in the screenshot below.

After Gemini tells us which function to call, our code needs to call the function to get the result.
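Here is a hedged sketch of that hand-off, using the stub functions from earlier and assuming model is the tools-enabled instance created above. The SDK also offers automatic function calling via a chat session, but the manual flow makes the mechanics clearer:

# Map the function names Gemini may propose to the actual Python callables.
available_functions = {
    "get_player_name": get_player_name,
    "get_player_starfaring_companions": get_player_starfaring_companions,
}

player_id = userdata.get('HONKAI_STAR_RAIL_PLAYER_ID')
response = model.generate_content(f"Which characters does the player with UID {player_id} own?")

# Gemini only *proposes* a call; executing it is our responsibility.
for part in response.parts:
    if fn := part.function_call:
        args = dict(fn.args)                          # proto map -> plain dict
        result = available_functions[fn.name](**args)
        print(f"{fn.name}({args}) -> {result}")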

Grounding with Google Search

Function calls are a powerful way to access external data, but they require pre-defined functions and APIs.

To go beyond these limits and gather information from many online sources, we can use Gemini’s grounding feature with Google Search. This feature allows Gemini to search the web and include what it finds in its answers. This makes it easier to get up-to-date information and handle questions that need real-time data.
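In the google-generativeai SDK, search grounding is exposed as a tool. The sketch below reflects my reading of the docs and is an assumption on my part: the exact tool name ('google_search_retrieval') and the model versions that support grounding depend on the SDK release, and a billed API key is required.

# Grounded request: let Gemini consult Google Search before answering.
# NOTE: assumes an SDK/model version that supports the 'google_search_retrieval' tool.
grounded_model = genai.GenerativeModel('models/gemini-1.5-flash-002')

response = grounded_model.generate_content(
    "When is the next Honkai: Star Rail patch scheduled to be released?",
    tools='google_search_retrieval',
)
print(response.text)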

If you are getting HTTP 429 errors when using the Google Search feature, please make sure you have set up a billing account here with enough quota.

With this feature enabled, we can ask Gemini to fetch real-time data from the Internet, as shown below.

The upcoming v2.7 patch of HSR is indeed scheduled to be released on 4th December.

Building a Semantic Knowledge Base with Pinecone

System instructions and Google search grounding provide valuable context, but a structured knowledge base is needed to handle the extensive data about HSR relics.

We need a way to store and quickly retrieve this information, enabling the system to generate timely and accurate relic recommendations. For this, we will use a vector database, which is ideally suited to managing the relic dataset.

Unlike traditional databases that rely on keyword matching, vector databases store information as vectors, enabling efficient similarity searches. This allows relevant relic sets to be retrieved based on the semantic meaning of a query, rather than relying solely on keywords.

There are many vector database options, but I chose Pinecone. As a managed service, Pinecone offers the scalability needed to handle the HSR relic dataset and a robust API essential for reliable data access. Its free tier is also a significant factor, because it allowed me to keep costs low while developing the project.

API keys in Pinecone dashboard.

Pinecone’s well-documented API and straightforward SDK make integration surprisingly easy. To get started, simply follow the Pinecone documentation to install the SDK in our code and retrieve the API key.

# Import the Pinecone library
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
import time

# Initialize a Pinecone client with your API key
pc = Pinecone(api_key=userdata.get('PINECONE_API_KEY'))

Next, I prepared my Honkai: Star Rail relic data, which I had previously organised into a JSON structure. This data includes information on each relic set’s two-piece and four-piece effects. Here’s a snippet to illustrate the format:

[
  {
    "name": "Sacerdos' Relived Ordeal",
    "two_piece": "Increases SPD by 6%",
    "four_piece": "When using Skill or Ultimate on one ally target, increases the ability-using target's CRIT DMG by 18%, lasting for 2 turn(s). This effect can stack up to 2 time(s)."
  },
  {
    "name": "Scholar Lost in Erudition",
    "two_piece": "Increases CRIT Rate by 8%",
    "four_piece": "Increases DMG dealt by Ultimate and Skill by 20%. After using Ultimate, additionally increases the DMG dealt by the next Skill by 25%."
  },
  ...
]

With the relic data prepared, the next challenge is to enable similarity searches with vector embeddings. Vector embeddings capture the semantic meaning of the text, allowing Pinecone to identify similar relic sets based on their inherent properties and characteristics.

Vector embedding representations (Image Credit: Pinecone)

Now, we can generate vector embeddings for the HSR relic data using Pinecone. The following code snippet illustrates this process, converting the textual descriptions of relic sets into numerical vector embeddings. These embeddings capture the semantic meaning of the relic set descriptions, enabling efficient similarity searches later.

# Load relic set data from the JSON file
import json

with open('/content/hsr-relics.json', 'r') as f:
    relic_data = json.load(f)

# Prepare data for Pinecone
relic_info_data = [
    {"id": relic['name'], "text": relic['two_piece'] + " " + relic['four_piece']}  # Combine relic set descriptions
    for relic in relic_data
]

# Generate embeddings using Pinecone
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[d['text'] for d in relic_info_data],
    parameters={"input_type": "passage", "truncate": "END"}
)

print(embeddings)

As shown in the code above, we use the multilingual-e5-large model, a text embedding model from Microsoft Research, to generate a vector embedding for each relic set. This model works well on messy data and is a good fit for short queries.

Pinecone’s ability to perform fast similarity searches relies on its indexing mechanism. Without an index, searching for similar relic sets would require comparing each relic set’s embedding vector to every other one, which would be extremely slow, especially with a large dataset. I chose a Pinecone serverless index hosted on AWS for its automatic scaling and reduced infrastructure management.

# Create a serverless index
index_name = "hsr-relics-index"

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# Wait for the index to be ready
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

The dimension parameter specifies the dimensionality of the vector embeddings. Higher dimensionality generally allows for capturing more nuanced relationships between data points. For example, two relic sets might both increase ATK, but one might also increase SPD while the other increases Crit DMG. A higher-dimensional embedding allows the system to capture these subtle distinctions, leading to more relevant recommendations.

The metric parameter determines how the similarity between two vectors (each representing a relic set) is measured. We use the cosine metric, which is well suited to comparing vector embeddings generated from text and is crucial for understanding how similar two relic descriptions are.
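As a quick illustration of what the cosine metric measures, here is the computation on two toy vectors (these are not real relic embeddings, just made-up numbers):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

atk_spd_set = np.array([0.9, 0.4, 0.1])    # toy embedding: ATK + SPD flavoured
atk_crit_set = np.array([0.9, 0.1, 0.5])   # toy embedding: ATK + CRIT DMG flavoured

print(cosine_similarity(atk_spd_set, atk_crit_set))  # high, but not 1.0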

With the vector embeddings generated, the next step was to upload them into my Pinecone index. Pinecone uses the upsert function to add or update vectors in the index. The following code snippet shows how we can upsert the generated embeddings into the Pinecone index.

# Target the index where you'll store the vector embeddings
index = pc.Index("hsr-relics-index")

# Prepare the records for upsert
# Each contains an 'id', the embedding 'values', and the original text as 'metadata'
records = []
for r, e in zip(relic_info_data, embeddings):
    records.append({
        "id": r['id'],
        "values": e['values'],
        "metadata": {'text': r['text']}
    })

# Upsert the records into the index
index.upsert(
    vectors=records,
    namespace="hsr-relics-namespace"
)

The code uses the zip function to iterate through both the list of prepared relic data and the list of generated embeddings simultaneously. For each pair, it creates a record for Pinecone with the following attributes.

  • id: Name of the relic set to ensure uniqueness;
  • values: The vector representing the semantic meaning of the relic set effects;
  • metadata: The original description of the relic effects, which will be used later for providing context to the user’s recommendations. 

Implementing Similarity Search in Pinecone

With the relic data stored in Pinecone now, we can proceed to implement the similarity search functionality.

def query_pinecone(query: str) -> dict:

    # Convert the query into a numerical vector that Pinecone can search with
    query_embedding = pc.inference.embed(
        model="multilingual-e5-large",
        inputs=[query],
        parameters={
            "input_type": "query"
        }
    )

    # Search the index for the three most similar vectors
    results = index.query(
        namespace="hsr-relics-namespace",
        vector=query_embedding[0].values,
        top_k=3,
        include_values=False,
        include_metadata=True
    )

    return results

The function above takes a user’s query as input, converts it into a vector embedding using Pinecone’s inference endpoint, and then uses that embedding to search the index, returning the top three most similar relic sets along with their metadata.
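For example, a quick sanity check of the search might look like this (the query string is purely illustrative):

# Which relic sets look relevant for a fast, crit-focused build?
matches = query_pinecone("relic set that increases SPD and CRIT DMG")
for match in matches['matches']:
    print(match['id'], round(match['score'], 3))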

Relic Recommendations with Pinecone and Gemini

With the Pinecone integration in place, we design an initial prompt to pick relevant relic sets from Pinecone. We then take the results from Pinecone and combine them with the initial prompt to create a richer, more informative prompt for Gemini, as shown in the following code.

from google.generativeai.generative_models import GenerativeModel

async def format_pinecone_results_for_prompt(model: GenerativeModel, player_id: int) -> dict:
    character_relics_mapping = await get_player_character_relic_mapping(player_id)

    result = {}

    for character_name, (character_avatar_image_url, character_description) in character_relics_mapping.items():
        print(f"Processing Character: {character_name}")

        additional_character_data = character_profile.get(character_name, "")

        character_query = f"Suggest some good relic sets for this character: {character_description} {additional_character_data}"

        pinecone_response = query_pinecone(character_query)

        prompt = f"User Query: {character_query}\n\nRelevant Relic Sets:\n"
        for match in pinecone_response['matches']:
            prompt += f"* {match['id']}: {match['metadata']['text']}\n"  # Extract relevant data
        prompt += "\nBased on the above information, recommend two best relic sets and explain your reasoning. Each character can only equip with either one 4-piece relic or one 2-piece relic with another 2-piece relic. You cannot recommend a combination of 4-piece and 2-piece together. Consider the user's query and the characteristics of each relic set."

        response = model.generate_content(prompt)

        result[character_avatar_image_url] = response.text

    return result

The code shows that we are doing both prompt engineering (designing the initial query to get relevant relics) and context framing (combining the initial query with the retrieved relic information to get a better overall recommendation from Gemini).

First, the code retrieves data about the player’s characters, including their descriptions, images, and the relics they are currently wearing. It then gathers potentially relevant data about each character from a separate data source, character_profile, which holds additional information, such as gameplay mechanics, taken from the Game8 Character List. With this character data, the query then finds similar relic sets in the Pinecone database.

After Pinecone returns matches, the code constructs a detailed prompt for the Gemini model. This prompt includes the character’s description, relevant relic sets found by Pinecone, and crucial instructions for the model. The instructions emphasise the constraints of choosing relic sets: either a 4-piece set, or two 2-piece sets, not a mix. Importantly, it also tells Gemini to consider the character’s existing profile and to prioritise fitting relic sets.

Finally, the code sends this detailed prompt to Gemini, receiving back the recommended relic sets.

Knight of Purity Palace, is indeed a great option for Gepard!
Enviosity, a popular YouTuber known for his in-depth Honkai: Star Rail strategy guides, introduced Knight of Purity Palace for Gepard too. (Source: Enviosity YouTube)

Langtrace

Using LLMs like Gemini is certainly exciting, but figuring out what is happening “under the hood” can be tricky.

If you are a web developer, you are probably familiar with Grafana dashboards. They show you how your web app is performing, highlighting areas that need improvement.

Langtrace is like Grafana, but specifically for LLMs. It gives us a similar visual overview, tracking our LLM calls, showing us where they are slow or failing, and helping us optimise the performance of our AI app.
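Setting it up only takes a couple of lines. The sketch below is based on my reading of the Langtrace documentation and assumes the API key is stored as an extra Colab secret named LANGTRACE_API_KEY; the package name and init signature may differ between releases:

# Install and initialise Langtrace *before* making any Gemini calls,
# so that it can instrument the google-generativeai client.
# In Colab: !pip install langtrace-python-sdk
from langtrace_python_sdk import langtrace

langtrace.init(api_key=userdata.get('LANGTRACE_API_KEY'))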

Traces for the Gemini calls are displayed individually.

Langtrace is not only useful for tracing our LLM calls, it also offers metrics on token counts and costs, as shown in the following screenshot.

Beyond tracing calls, Langtrace collects metrics too.

Wrap-Up

Building this Honkai: Star Rail (HSR) relic recommendation system has been a rewarding journey into the world of Gemini and LLMs.

I am incredibly grateful to Martin Andrews and Sam Witteveen for their inspiring Gemini Masterclass at Google DevFest in Singapore. Their guidance helped me navigate the complexities of LLM development, and I learned firsthand the importance of careful prompt engineering, the power of system instructions, and the need for dynamic data access through function calls. These lessons underscore the complexities of developing robust LLM apps and will undoubtedly inform my future AI projects.

Working on this project has been an enjoyable process of learning and discovery. I encountered many challenges along the way, but overcoming them deepened my understanding of Gemini. If you’re interested in exploring the code and learning from my experiences, you can access my Colab notebook through the button below. I welcome any feedback you might have!

Open In Colab

References

Building COVID-19 Dashboard in Golang with Google BigQuery

It has been almost half a year since the first COVID-19 case in Singapore, the country I am currently working in, was confirmed. Two days after that first case was confirmed in Singapore, eight travellers entering Malaysia, my home country, from Singapore were confirmed to be infected as well.

Since then, we have been asked to work from home as travel restrictions apply in both countries. While the situation is not getting better, it is quite disappointing to know that there are still people who believe COVID-19 is a hoax.

Fortunately, there are still a lot more people working hard in this tough period. Earlier on, my friend who is doing research in Colorado told me that she’s working hard with a group of scientists to educate the public about the virus.

🎨  People endured hours-long queues to enter Singapore from Malaysia before the travel restrictions to curb the spread of COVID-19 came into force. 🎨 

In addition, to aid researchers and data scientists in the effort to combat the pandemic, Google BigQuery is hosting a repository of public datasets from JHU CSSE (the Johns Hopkins University Center for Systems Science and Engineering). We can normally query up to 1TB for free each month, and queries over the COVID-19 datasets are free (until 15th September 2020).

In my previous article, I talked about how Google BigQuery could work together with Google Data Studio to render beautiful reports without any coding. Thus, in this article, I will show how we can write a simple web client in Golang to fetch data from the BigQuery via its API.

BigQuery Public Datasets Programme

Google hosts a huge number of datasets that we can access and integrate into our applications, while Google pays for their storage. Using the public datasets, we only need to pay for the queries we perform on the data.

🎨  There are a lot of public datasets available in the GCP Marketplace. 🎨 

In order to access the public datasets, we first need to enable them through the Google BigQuery documentation (I find this quite funny because Google hides the enabling link rather well). On the “Using the Web UI” page, as shown in the screenshot below, we can find a URL which lets us open the public datasets project manually in the browser (remember to update &page=project to your project in GCP).

There are also detailed steps written in the documentation of Data Analytics Products (Yes, the same info is spread all over different places).

🎨  The link to enable the public datasets in the web UI. 🎨 

The COVID-19 Dataset

Once we have done the steps above, we shall see the public datasets, including the COVID-19 datasets, available in our Google BigQuery. The dataset that I will be using in this article is the covid19_jhu_csse, a daily updated data repository for COVID-19 from JHU CSSE.

🎨  The covid19_jhu_csse dataset. 🎨 

There are four tables under the dataset; the first three record the number of confirmed cases, reported deaths, and recovered cases, respectively, for each country or region.

The interesting thing about the first three tables is that they record the numbers for each day in a separate column. Hence, every day, one new column is added to each of the three tables. I am not sure why it is done this way, but it requires us to write our own client in order to get the data, because Google Data Studio cannot work well with dynamic column names.

🎨  A column for each of the day. 🎨 

Luckily, there is a fourth table called summary which has just one date column, and each day’s record is one row instead of one column. This is a more SQL-friendly table and can be integrated with Google Data Studio easily.

🎨  The summary table is more SQL-friendly because the date is stored in just one column. 🎨 

In this article, I will demonstrate using the 1st, 2nd, and 4th tables in order to show how we can programmatically get the data through the BigQuery API.

BigQuery Client Library for Golang

There are Google BigQuery client libraries for many programming languages, including C#. In this article, we choose to use Golang.

Before we proceed, we need to make sure that we have already enabled the BigQuery API for our project in GCP. From the GCP Cloud Console, we will get the credential file which allows us to connect to Google BigQuery, so we must keep this credential file in a safe and secret place.

Now we can proceed to build our Golang client.

Firstly, we need to install the client library using go get command.

go get -u cloud.google.com/go/bigquery

Secondly, we need to initialise a Google BigQuery client.

ctx := context.Background()
client, err := bigquery.NewClient(ctx, projectID)

if err != nil {
    log.Fatalf("bigquery.NewClient: %v", err)
}

defer client.Close()

Querying the Tables

Next, we can start to query the data in the BigQuery.

rows, err := queryData1(ctx, client)
if err != nil {
    log.Fatal(err) 
}

queryResult := processQueryResult1(rows)

If we have other queries for different tables or even datasets, we can continue to query in the same way as above.

So what does queryData1 look like? It is basically as simple as follows.

func queryData1(ctx context.Context, client *bigquery.Client) (*bigquery.RowIterator, error) { 
    query := client.Query("<SQL here>")
    
    return query.Read(ctx)
}

For example, if we are fetching the date as well as the numbers of confirmed cases and deaths, we will be using the following SQL.

`SELECT 
    CAST(date as STRING) as date, 
    IFNULL(confirmed, 0) as confirmed_cases, 
    IFNULL(deaths, 0) as deaths 
FROM ` + "`bigquery-public-data.covid19_jhu_csse.summary`" + ` ORDER BY date;`

There are a few things to take note of here. The first is the use of CAST.

It casts the date field to a string; otherwise, we may encounter errors such as “schema field date of type DATE is not assignable to struct field date of type time.Time” when we unmarshal the data returned from BigQuery in Golang later. The reason I chose CAST is that casting from a date type to a string is independent of time zone and produces the form YYYY-MM-DD.

In addition, we also use IFNULL to make sure that the values of confirmed_cases and deaths are always non-negative integers, because in the original tables the numbers can be null.

Now, we just need a struct that we can load each row into with RowIterator.Next(). The struct corresponding to the SQL above is as follows.

type QueryResultDataRow struct { 
    Date           string `bigquery:"date"` 
    ConfirmedCases int64  `bigquery:"confirmed_cases"` 
    Deaths         int64  `bigquery:"deaths"`
}

To iterate, we can use the code below.

func processQueryResult1(iter *bigquery.RowIterator) []QueryResultDataRow { 
    var result []QueryResultDataRow

    for { 
        var row QueryResultDataRow

        err := iter.Next(&row)

        if err == iterator.Done { 
            break 
        }

        if err != nil { 
            log.Print(err) 
            continue 
        }

        result = append(result, row) 
    }
    
    return result
}

Here, I’d like to share a mistake I made when I first wrote the code above: I forgot to end the for loop when the iterator is done, i.e. when err == iterator.Done, so the return statement was never reached. Please take note of this when you are writing this type of iteration.

Challenge: The Tables Having Dates as Columns

If you would like to challenge yourself to retrieve the data from the tables having dates as their columns, it is possible too, just with a few challenges.

The first challenge is that we are not sure when the dataset will be updated, so we can never be sure of the name of the last column. Since the dataset is updated daily, to be safe, we can use the date of two days ago as the last column in our query.

The second challenge is the format of the date. We cannot use the Golang magical reference date (Mon, Jan 2 15:04:05 MST 2006) to format the date because of the underscores found in the column names (underscore-prefixed tokens such as _2 have special meaning in Go time layouts). There is a very interesting discussion about the origin of the magical reference date on Stack Overflow, in case you are interested, but it is not important here. Hence, we will use the following code to format the date instead.

latestDateInQuery := fmt.Sprintf("_%v_%v_%v", int(d.Month()), d.Day(), d.Year() - 2000)

So the following code will help us to get the count from the second latest, if not the latest, column.

latestDate := time.Now().AddDate(0, 0, -2)
latestDateInQuery := fmt.Sprintf("_%v_%v_%v", int(latestDate.Month()), latestDate.Day(), latestDate.Year()-2000)

Once we get the column name, we can then use it in the following query.

`SELECT 
    IFNULL(province_state, "") AS place, 
    country_region, 
    latitude, 
    longitude, 
    (` + latestDateInQuery + `) AS count 
FROM ` + "`bigquery-public-data.covid19_jhu_csse.confirmed_cases`;"

Visualising the Data

With the queries above, we can then easily generate results with Google Charts. Here, I use the Line Chart and the GeoChart.

🎨  The COVID-19 dashboard powered by Golang and Google BigQuery. 🎨 

There is an interesting quirk in GeoChart: by default, when we use latitude and longitude instead of an address to identify places, the text shown in the map tooltip is the latitude and longitude, which is not user friendly. However, we can change this text by putting a description column right after the longitude column, as discussed over here on Google Groups. Interestingly, this is said to be undocumented support for such a column, so we are not sure when it will stop working.

Next, I use a web page built with Material Design to display the charts. Please enjoy the following screenshots.

🎨  Charts showing the situation in both of my beloved countries. 🎨 
🎨  Top 10 countries having the most confirmed cases. 🎨 
🎨  The global situation where we locate the places with latitudes and longitudes. 🎨 

That’s all for the COVID-19 dashboard built with Golang and Google BigQuery. Thanks to JHU CSSE and Google, we are able to access such important data for free.

Finally, I’d like to wish all of you and your loved ones to stay safe and healthy.

🎨  A nurse checks the temperature of a visitor as part of the COVID-19 screening procedure. (Photo Credit: The Straits Times) 🎨 

Analytical Processing on Transaction Data with Google BigQuery and Data Studio

Data analysis is not just about reporting. While reporting gives us the data, data analytics gives answers to the whys: it is the practice of using data and information to make informed decisions.

When I was in a startup, the CEO assigned me a task to work on analytical processing of transaction data. In the early days of the startup, the number of transactions was low, so simple data processing using stored procedures on MS SQL databases was sufficient. However, based on my past experience in SMEs, without investing in a data workflow early, it becomes challenging for the team to use the data to make informed decisions later. Imagine, five years down the road, the team needing to analyse the huge amount of data collected over those five years with just Excel.

🎨  Simple solution to do big data processing with Google BigQuery. 🎨 

Hence, in this article, we will be focusing on how we can do analytical data processing with Google BigQuery and then visualise the data using Google Data Studio.

Analytical Data Processing

There are two main categories when we talk about data processing, i.e. Transactional Processing and Analytical Processing.

For example, in my previous startup team, we had an Order Management Support (OMS) team that focused on tracking and processing orders on time; what the OMS team does is transactional processing. We also had a Data Analysis (DA) team that analysed sales data to find out, for example, the monthly revenue. So the DA team was basically performing analytical data processing.

Hence, the DA team needs to analyse large batches of data. As the business grows, the data the team needs to access will go back months, or even years. Also, as more sales channels are introduced into the business, the DA team may need to access multiple data sources as well. So, let us see how we can use a data warehouse to help deal with the big data the DA team handles.

Why Google BigQuery?

There are many data warehouse solutions out there.

The reason we chose Google BigQuery is that it is a data warehouse very similar to the RDBMSes we are already familiar with. Another piece of good news is that Google BigQuery now supports standard SQL, which is ANSI SQL:2011 compliant. Hence, the DA team can move to Google BigQuery seamlessly.

In addition, Google BigQuery can handle the complex analytical queries that are essential to businesses, and the data stored in it can easily scale to petabytes as the business grows.

Fast, near real-time access to our data is another advantage of Google BigQuery: within a few seconds, the DA team can retrieve the results of processing a huge amount of data.

Finally, Google BigQuery is serverless, so we do not have to provision compute nodes as we do in AWS Redshift.

🎨  First 1TB per month is free for on-demand querying on Google BigQuery. 🎨 

Importing Data to Google BigQuery

The previous month’s transaction data is normally double-checked and verified by the relevant teams each month. Once that is done, we can download the month’s transaction report as a CSV file.

The reason we choose a CSV file is that it is one of the three file types accepted by Google BigQuery. The other two are JSON and AVRO. Yes, AVRO! We talked about it in our previous article about Azure Event Hub.

Here, we will use Google Cloud Storage to store the CSV files because it is one of the accepted data sources for Google BigQuery.

🎨  Monthly transaction data is kept in the bucket on Google Cloud Storage. 🎨 

After the monthly transaction data has been uploaded to the storage, we can proceed to create a new dataset (a concept which can be treated as a database in an RDBMS). Then, in the dataset, we can start to create new tables based on the monthly transaction data.

There are a few ways to create the tables in the dataset. Merging the data from all months into one table may seem easier to maintain. However, I decided to have one table for every month, which allows me to delete and re-upload monthly data whenever I need to.

🎨  Creating a new table based on data stored in the Google Cloud Storage. Take note that Google BigQuery is powerful enough to generate schema for us. 🎨 

After that, we need to make sure to skip the header row(s) in our CSV file, if any, as shown in the following screenshot. If we do not do this, the header row may be wrongly included in the dataset. We do not need the header here because the schema has already been auto-detected by Google BigQuery (or defined manually by the person uploading the data) in the previous step.

🎨  Remember to remove header in CSV during import stage. 🎨 

Sometimes, there might be some data corruption in the CSV file. For example, a column which is expected to contain only numbers suddenly has a non-numerical value. Google BigQuery will then complain to us, as shown in the screenshot below.

🎨  The dialog will indicate the row number (29928) and its position of the error. 🎨 

Once the data is imported successfully to the table, we can then preview the data in the table.

This preview function is a very user-friendly feature. Do you still remember the query pricing we mentioned earlier? If we explore the data in the table with a "SELECT *" statement, we will be charged and our usage quota will be affected. However, the preview function allows us to get a rough idea of the data in the table for free (without affecting the quota)!

🎨  Previewing the data in the table. 🎨 

Views and Queries

Now that we have seen both datasets and tables, we need to introduce a third concept in Google BigQuery called a View. A view is a virtual table defined by a SQL query. Unlike a table, which actually holds the records, a view displays the data in the related tables by executing its defining query.

Earlier, I split the monthly transaction data into different tables. I sort of regretted doing so because I could not get an overview for a yearly report. Fortunately, with a view, I can come up with a virtual table that holds the essential data for all the months in a year from all the tables, using UNION in the query.

🎨  Combining transaction data of each month in 2018 into one single view. 🎨 

The query is easy to write as long as we are familiar with standard SQL. For example, we can have a view which shows the driver with the highest job cost in each month, using a query with LIMIT and UNION ALL, as shown in the following screenshot.

🎨  Who is the top driver? =) 🎨 

Visualisation using Data Studio

Dashboard building and reporting are very important in almost all businesses because they make it easy to communicate messages, retain information, and gather insights from the data. In short, they are a way of data storytelling.

Similar to Power BI, Google Data Studio comes with interactive dashboards and beautiful reports that inspire smarter business decisions. Data Studio has connectors to Google Cloud Platform (GCP) services, including Google BigQuery, and other data stores. So, we will see how we can make use of it to visualise our data.

In Data Studio, we first need to search for the BigQuery connector.

🎨  Google BigQuery <> Google Data Studio 🎨 

After that, we need to locate the table or view which will provide the necessary data for the data visualisation, as shown in the following screenshot.

🎨  We can use both table and view as the data source in Data Studio. 🎨 

We can then draw a table with bars to show the driver with the highest job cost for each month, as shown in the image below.

🎨  Table with bars. 🎨 
🎨  ABC 🎨 

However, we notice that the months are not ordered properly, which makes it hard to analyse the data. Why is that so? It turns out that the data type of the Month field is currently recognised as "Text", which is indicated by "ABC" in the right-hand menu.

To correct it, we need to click on the "ABC" beside the "Month" field and change its type to "Date". The format we will use is YYYYMM because that is how the year and month are formatted in the data source.

🎨  Changing the type of the “Month” field to be Date & Time. 🎨 

We will then see that the "Month" column in the table is shown in date format. The funny thing is that even though the format we chose is YYYYMM, when Data Studio displays the months, they are displayed as, for example, April 2018 instead of 201804.

What we’re now left to do is just sort the table according to the month in ascending order.

🎨  Sorting the table according to month. 🎨 

So, congratulations to Mr Heng Swee Ren for being the only driver with the highest cost for two months!

In Data Studio, we can also further customise the style of our report, such as changing the colours, to make the report more engaging for readers.

🎨  A simple report for a huge amount of data is done easily without any coding. 🎨 

Conclusion

This is not the end of the journey. It’s actually just the beginning. For example, the line chart shown above actually has a limitation of showing only 10 series, as discussed over here. If there are more than 10 series, some of them will be grouped as “Others” in the chart and the visualisation will no longer make sense.

Anyway, this is my proposal for how we should do analytical data processing with a data warehouse tool such as Google BigQuery. Through this article, I also hope that businesses, especially startups and SMEs, will start to look into building common business reports using Power BI or Google Data Studio, which avoids wasting programmers’ time building in-house dashboards and reports that are not highly customisable.