Why We Shouldn’t Ignore Null Values and How to Treat Them

Data cleaning using Google BigQuery SQL

Cindy Hosea
Geek Culture

--

Dealing with data aggregation can often be tricky, especially when it comes to big data that can’t be handled manually and contains null values (yes, in reality, there is no 100% clean data). Moreover, using different tools for data cleaning may require different approaches, because there is no such thing as one size fits all. Directly aggregating the numbers without cleaning the null values may cause us to provide misleading data and make the wrong conclusion.

For example, we want to calculate how much is the net margin comes from a retail store. We have the data stored in a table named ‘transactions_data’ as below:

Raw data from transactions_data table

The retail store currently is having giveaways as part of its promotional program to introduce the new products to the market. Each giveaway is recorded into a unique transaction_id, however since each product is given for free, therefore there are no sales recorded (null).

If we manually calculate the net margin using Excel/Google Spreadsheets, we can determine the net margin as below:

Net margin calculation using Excel/Google Spreadsheets

Note that we can determine the net margin for 1.52%, not quite high because we spend costs that cause a negative margin for giveaway transactions. However, imagine if there are millions of transactions that need to be processed, therefore the number of rows exceeds the limit for Excel/Google Spreadsheets, we need to utilize more advanced tools to calculate the net margin. For this case, I am going to use the example of using Google BigQuery SQL.

THE DON’TS

If we directly calculate the net margin without treating the null values first using this query:

WITH
raw_data AS (
SELECT *,
sales - cost AS margin
FROM transactions_data
)
SELECT SUM(margin)/SUM(cost) AS net_margin
FROM raw_data

The output can be misleading like this:

The margin calculation for transactions C3 and E5 will return as null (not -70,000 and -50,000) because once Google BigQuery encounters a null value in the formula (sales - cost), it will also return the result of the formula as null. When dealing with big data, we don’t print all the output row by row, but we will directly aggregate the data such as using SUM(margin)/SUM(cost), which will print the output of net margin as 19.70% (totally misleading!). Seeing the high net margin, we will be driven to the conclusion that the retail store is profitable enough, while actually we are burning costs for giveaways and driving down the net margin close to zero.

So, how should we treat the null values?

THE DO’S

Option #1: Add additional column sales_clean to replace null values with 0 (using CASE WHEN), then calculate margin_clean. The query should be somewhat like this:

WITH
add_sales_clean AS (
SELECT *,
CASE WHEN sales IS NULL
THEN 0
ELSE sales
END
AS sales_clean

FROM transactions_data
),
add_margin_clean AS (
SELECT *,
sales_clean - cost AS margin_clean
FROM add_sales_clean
)
SELECT SUM(margin_clean)/SUM(cost) AS net_margin
FROM add_margin_clean

In the backend, the query will return two additional columns before aggregation:

  1. sales_clean: similar with sales column, but will replace the null value with 0.
  2. margin_clean: subtracting sales_clean with cost.

Then, the output of the data aggregation will be like this:

After cleaning the data, we can get the correct result of net margin: 1.52%.

Option #2: Directly calculating margin_clean using CASE WHEN syntax (however, do note that this approach will be quite painful if the formula is complex). The query should be somewhat like this:

WITH
add_margin_clean AS (
SELECT *,
CASE WHEN sales IS NULL
THEN 0 - cost
ELSE sales - cost
END
AS margin_clean

FROM transactions_data
)
SELECT SUM(margin_clean)/SUM(cost) AS net_margin
FROM add_margin_clean

The output after aggregation will be like this:

The approach is similar to Option #1 and returns the same aggregation result. The difference is just that we are directly calculating the column margin_clean without creating the column sales_clean first.

Option #3: Using COALESCE or IFNULL function to replace the null values. The query should be somewhat like this:

WITH
add_margin_clean AS (
SELECT *,
COALESCE(sales, 0) - cost AS margin_clean
FROM transactions_data
)
SELECT SUM(margin_clean)/SUM(cost) AS net_margin
FROM add_margin_clean

Or we can also replace the COALESCE function with IFNULL, it will return the same output. The difference between both functions is that the logic for COALESCE can be used for more than two conditions, while IFNULL can only contain two conditions. But since we only have two conditions (return 0 or sales value), either function is okay. The output will be exactly the same as Option #2 and the query is much simpler.

Additional: We can also calculate the gross margin (excluding giveaway transactions), by filtering out transactions with null values using WHERE syntax. The query should be somewhat like this:

WITH
filter_giveaway_transactions AS (
SELECT *,
sales - cost AS margin
FROM transactions_data
WHERE sales IS NOT NULL
)
SELECT SUM(margin)/SUM(cost) AS gross_margin
FROM filter_giveaway_transactions

The query will filter out transactions C3 and E5, because they have null values in sales column. Therefore, the output of the calculation will be like this:

By excluding giveaway transactions, we can gain insights that eventually the gross margin almost makes up to a quarter of the product cost (24.07%). This indicates that maybe we are spending too much cost on the giveaway program, and need to explore ways how to optimize the cost such as:

  • Replacing with less expensive products
  • Reduce the number of giveaway winners
  • Adjusting the ratio of giveaway transactions compared to sales transactions
  • Etc.

Data cleaning is probably not the most exciting job to do, but if ignored, the consequence can be critical to the business — making the wrong conclusion and decision. Therefore, it is worth noted that data analytics is not only about modifying a syntax from Stackoverflow (I know we all need this gem) and running it, but we also need to understand how the query works. Happy analyzing!

--

--

Cindy Hosea
Geek Culture

Data analytics for business, supply chain, and marketing.