Data cleaning is an essential part of any data-related position whether you’re a data engineer, data scientist, or data analyst. Today I want to share a few BigQuery SQL functions for data cleaning and a use case I would’ve used them for.
Invisible Special Characters In String Values
String values can contain special characters that don’t display on the screen but are stored in the database. I learned this the hard way when I applied a where clause on a string field that resulted in 0 records found. This was incredibly frustrating and I had to find functions to expose the unicode values to remove them from the string before applying the where clause to return the records I knew existed.
BigQuery has a NORMALIZE function that handles this exact scenario. Below are 3 records with unicode values between Jane and Smith that are invisible in the query results.

If I use a where clause for Jane Smith no records are returned.

However, if I use the NORMALIZE function on the name field the unicode values are removed and the three Jane Smith records are returned in the query results.

Special Mention: BigQuery also has a NORMALIZE_AND_CASEFOLD function if you want string comparison to be case insensitive, i.e. records containing Jane Smith or jane smith, will be returned in the query results.
Pattern Matching
I’ve always used the LIKE operator for pattern matching in a string field. Recently I had to categorize referring URLs of website visitors to match Google analytics channel reporting. Since I didn’t know if the URLs were upper or lowercase, I had to use the LOWER function to convert the field to all lowercase before checking for a pattern match.
BigQuery has a CONTAINS_SUBSTR function for this situation. Not only does CONTAINS_SUBSTR perform case-insensitive pattern checking, it can also check for pattern values in numeric fields, timestamps, and arrays as well.
In the example below, I check if the breakfast field contains the string pancakes in all lowercase. Both rows are returned in the query results although each record has a capitalized letter in pancakes.

Special Mention: BigQuery also has an ENDS_WITH function to check if a string ends with a pattern. A common use case I could’ve used this for was checking if an email ended with .edu to confirm a user was a student.
Date Formatting
In the past, I always downloaded query results from SQL into Excel to format dates for reporting purposes because I wasn’t able to format the dates the way I needed with SQL. This was time-consuming when I had a large volume of data to format.
BigQuery has a FORMAT_DATE function to handle date formatting. In the example below Sept. 30, 2022 is formatted in three different ways based on the format string.

Special Mention: Besides FORMAT_DATE you can also use FORMAT_DATETIME to format datetime values. There’s also a FORMAT function to format a field as a string value. One use case for this function is to format large numbers with comma separators. Instead of 1000000 you can use the FORMAT function to display 1,000,000 in the query results.
Dividing With A Zero Denominator
I often had to calculate percentages where the denominator could be 0 which would return a SQL error when dividing by 0. One option was to use a CASE statement to check if the denominator was 0 before dividing to avoid an error but most databases had a function to handle this situation.
In BigQuery’s case, the function is called SAFE_DIVIDE. In the example below I divide 10 by 0 and get a division by zero error.

After I use SAFE_DIVIDE the result is a null value instead of an error.

Special Mention: BigQuery also has SAFE_ADD, SAFE_SUBTRACT, SAFE_MULTIPLY, and SAFE_NEGATE functions that will return a null value if an overflow occurs.
Final Thoughts
While we can never get away from data cleaning there are SQL functions that can help. I hope you learned a new function or two that’ll be useful in the future. While the functions I mentioned are in BigQuery, they may be available in your database too.
Note: All queries above were run on BigQuery sandbox that’s free to anyone with a Google account.
You might also like…
6 BigQuery SQL Functions Every User Should Know
4 BigQuery SQL Shortcuts That Can Simplify Your Queries
How Data Scientists Can Reduce Data Wrangling Time with a Data Mart







