Publish AI, ML & data-science insights to a global community of data professionals.

BigQuery SQL Functions For Data Cleaning

Use cases and functions to apply

Image by Rosy - The world is worth thousands of pictures from Pixabay
Image by Rosy – The world is worth thousands of pictures from Pixabay

Data cleaning is an essential part of any data-related position whether you’re a data engineer, data scientist, or data analyst. Today I want to share a few BigQuery SQL functions for data cleaning and a use case I would’ve used them for.


Invisible Special Characters In String Values

String values can contain special characters that don’t display on the screen but are stored in the database. I learned this the hard way when I applied a where clause on a string field that resulted in 0 records found. This was incredibly frustrating and I had to find functions to expose the unicode values to remove them from the string before applying the where clause to return the records I knew existed.

BigQuery has a NORMALIZE function that handles this exact scenario. Below are 3 records with unicode values between Jane and Smith that are invisible in the query results.

Screenshot example created by author
Screenshot example created by author

If I use a where clause for Jane Smith no records are returned.

Screenshot example created by author
Screenshot example created by author

However, if I use the NORMALIZE function on the name field the unicode values are removed and the three Jane Smith records are returned in the query results.

Screenshot example using NORMALIZE function created by author
Screenshot example using NORMALIZE function created by author

Special Mention: BigQuery also has a NORMALIZE_AND_CASEFOLD function if you want string comparison to be case insensitive, i.e. records containing Jane Smith or jane smith, will be returned in the query results.

Pattern Matching

I’ve always used the LIKE operator for pattern matching in a string field. Recently I had to categorize referring URLs of website visitors to match Google analytics channel reporting. Since I didn’t know if the URLs were upper or lowercase, I had to use the LOWER function to convert the field to all lowercase before checking for a pattern match.

BigQuery has a CONTAINS_SUBSTR function for this situation. Not only does CONTAINS_SUBSTR perform case-insensitive pattern checking, it can also check for pattern values in numeric fields, timestamps, and arrays as well.

In the example below, I check if the breakfast field contains the string pancakes in all lowercase. Both rows are returned in the query results although each record has a capitalized letter in pancakes.

Screenshot example using CONTAINS_SUBSTR function created by author
Screenshot example using CONTAINS_SUBSTR function created by author

Special Mention: BigQuery also has an ENDS_WITH function to check if a string ends with a pattern. A common use case I could’ve used this for was checking if an email ended with .edu to confirm a user was a student.

Date Formatting

In the past, I always downloaded query results from SQL into Excel to format dates for reporting purposes because I wasn’t able to format the dates the way I needed with SQL. This was time-consuming when I had a large volume of data to format.

BigQuery has a FORMAT_DATE function to handle date formatting. In the example below Sept. 30, 2022 is formatted in three different ways based on the format string.

Screenshot example using FORMAT_DATE function created by author
Screenshot example using FORMAT_DATE function created by author

Special Mention: Besides FORMAT_DATE you can also use FORMAT_DATETIME to format datetime values. There’s also a FORMAT function to format a field as a string value. One use case for this function is to format large numbers with comma separators. Instead of 1000000 you can use the FORMAT function to display 1,000,000 in the query results.

Dividing With A Zero Denominator

I often had to calculate percentages where the denominator could be 0 which would return a SQL error when dividing by 0. One option was to use a CASE statement to check if the denominator was 0 before dividing to avoid an error but most databases had a function to handle this situation.

In BigQuery’s case, the function is called SAFE_DIVIDE. In the example below I divide 10 by 0 and get a division by zero error.

Screenshot example using division error example created by author
Screenshot example using division error example created by author

After I use SAFE_DIVIDE the result is a null value instead of an error.

Screenshot example using SAFE_DIVIDE example created by author
Screenshot example using SAFE_DIVIDE example created by author

Special Mention: BigQuery also has SAFE_ADD, SAFE_SUBTRACT, SAFE_MULTIPLY, and SAFE_NEGATE functions that will return a null value if an overflow occurs.

Final Thoughts

While we can never get away from data cleaning there are SQL functions that can help. I hope you learned a new function or two that’ll be useful in the future. While the functions I mentioned are in BigQuery, they may be available in your database too.

Note: All queries above were run on BigQuery sandbox that’s free to anyone with a Google account.


You might also like…

6 BigQuery SQL Functions Every User Should Know

4 BigQuery SQL Shortcuts That Can Simplify Your Queries

How Data Scientists Can Reduce Data Wrangling Time with a Data Mart


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles