Data Migration to BigQuery: Handling Data Transformation Challenges

Samet Karadag
Google Cloud - Community
2 min read · Oct 11, 2023


Migrating data from legacy systems to BigQuery is a common need in today’s data-driven world. This process involves several challenges, such as dealing with char columns containing empty spaces, removing leading zeros, handling precision in floating-point data, and managing time zone issues. In this blog post, we will explore these challenges and provide sample SQL queries to overcome them.

Case 1: Trimming Empty Spaces from Char Columns

In some databases, such as Oracle and Teradata, fixed-length character columns (e.g., CHAR(8)) are blank-padded, so values carry trailing spaces. When migrating to BigQuery, it’s important to remove this padding. You can achieve this using the SQL query below:

SELECT TRIM(your_column, ' ') AS trimmed_column
FROM your_table;

This query removes spaces from both ends of the specified column. If you want to strip only the trailing padding, use RTRIM(your_column) instead.
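As a quick sanity check, you can compare the two functions on a padded literal (a minimal sketch; the value is hypothetical):

SELECT
  TRIM('  ABC   ') AS trimmed_both,  -- returns 'ABC'
  RTRIM('  ABC   ') AS trimmed_trailing;  -- returns '  ABC'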

Case 2: Removing Leading Zeros

Data from various sources may contain leading zeros, which can affect calculations and analysis. To remove these leading zeros, you can use the TRIM or LTRIM function as shown below:

SELECT LTRIM('000189823', '0') AS result;

This query removes leading zeros from the input string, resulting in ‘189823’ as the output.
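Note that LTRIM returns an empty string when the value is all zeros. If the column is genuinely numeric, a cast round-trip is an alternative (a sketch, assuming a string column your_column in your_table):

SELECT CAST(SAFE_CAST(your_column AS INT64) AS STRING) AS no_leading_zeros
FROM your_table;

SAFE_CAST returns NULL instead of failing on non-numeric values, and an all-zero string comes back as '0' rather than ''.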

Case 3: Handling Precision in Floating-Point Data

During data migration, you may encounter floating-point data that is represented as strings and requires specific precision formatting. For example, when migrating from Redshift, you may want trailing zeros preserved in the output (e.g., 12.5 rendered as 12.50). You can use the following SQL query:

SELECT FORMAT('%0.2f', CAST(your_numeric_column AS FLOAT64)) AS your_varchar_column
FROM your_table;

This query casts a numeric column to FLOAT64 and formats it with two decimal places, ensuring consistent precision.
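For instance, formatting literal values shows both the zero-padding and the rounding behavior (the inputs are hypothetical):

SELECT
  FORMAT('%0.2f', CAST('12.5' AS FLOAT64)) AS padded,  -- returns '12.50'
  FORMAT('%0.2f', CAST('3.14159' AS FLOAT64)) AS rounded;  -- returns '3.14'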

Case 4: Managing Time Zone Issues

Time zone issues often arise during data migration. Redshift provides a built-in convert_timezone function; BigQuery has no direct equivalent, but we can create a User-Defined Function (UDF) that behaves similarly.

Example conversion:

SELECT DATETIME(TIMESTAMP('2023-10-15 15:00:00', 'Turkey'), 'Europe/Amsterdam');

Consider the following UDF example:

CREATE FUNCTION convert_timezone(source_tz STRING, target_tz STRING, dt DATETIME)
RETURNS DATETIME
AS (
  DATETIME(TIMESTAMP(dt, source_tz), target_tz)
);

This UDF converts a datetime from the source time zone (source_tz) to the target time zone (target_tz). You can then use this function in queries like this:

SELECT convert_timezone('Israel', 'Europe/Amsterdam', customer_datetime) AS europe_time
FROM your_table;
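To verify the UDF, you can call it with a literal (a minimal sketch; the timestamp is hypothetical):

SELECT convert_timezone('Israel', 'Europe/Amsterdam', DATETIME '2023-10-15 15:00:00') AS europe_time;
-- returns 2023-10-15T14:00:00: 15:00 in Israel (UTC+3 on that date) is 14:00 in Amsterdam (UTC+2)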

BigQuery uses IANA time zone names as the standard, as documented here: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones.

See the full list of tz database time zones here: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

Case 5: Converting String or Numeric Values to Currency Format with Two Decimals

To add thousands separators and fix two decimal places, use the %' grouping flag of FORMAT:

SELECT FORMAT("%'.2f", float64_field)

or, if the value is stored as a string, cast it first:

SELECT FORMAT("%'.2f", CAST(string_field AS FLOAT64))

FORMAT already returns a STRING, so no additional cast is needed.
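For example (the amount is hypothetical):

SELECT FORMAT("%'.2f", 1234567.891) AS currency;  -- returns '1,234,567.89'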

Conclusion

By using the sample SQL queries and UDFs provided in this blog post, you can overcome these challenges and ensure a smooth transition of your data to BigQuery while maintaining data integrity and accuracy.

I plan to update this post as we encounter new formatting issues. If you run into one, let me know in the comments and we can solve it together.
