When working with JSON data stored as strings in a Databricks table, you might need to extract specific fields and perform computations on them. In this guide, we’ll explore how to extract the total_cost field from a JSON string column named metadata and calculate its average using SQL in Databricks.
The Challenge
Suppose you have a table my_table with a column metadata containing JSON strings like:
{"total_cost": "150.75", "other_data": "..."} |
Your goal is to:
- Access the
total_costkey within the JSON string. - Convert the extracted value to a numeric type.
- Calculate the average of all
total_costvalues.
Solution Overview
We’ll use the following functions:
get_json_object: Extracts a JSON value as a string.TRY_CAST: Safely converts a string to a numeric type, returningNULLif the conversion fails.AVG: Calculates the average of numeric values, ignoringNULLs.
SQL Query
Here is the SQL query to achieve this:
SELECT |
Explanation
get_json_object(metadata, '$.total_cost'):- Extracts the value associated with the
total_costkey from themetadataJSON string. - The
'$.total_cost'argument specifies the path to thetotal_costfield in the JSON object.
- Extracts the value associated with the
TRY_CAST(... AS DOUBLE):- Attempts to convert the extracted
total_coststring to aDOUBLE(a floating-point number). - If the conversion fails (e.g., if the value is not a valid number), it returns
NULLinstead of throwing an error.
- Attempts to convert the extracted
AVG(...):- Computes the average of all the
DOUBLEvalues. - Automatically ignores
NULLvalues, so any failed conversions won’t affect the result.
- Computes the average of all the
Steps Breakdown
Extract the
total_costValue:- Use
get_json_objectto parse the JSON string and retrieve thetotal_costvalue. - Example:
get_json_object('{"total_cost": "150.75"}', '$.total_cost') -- Returns "150.75"
- Use
Convert to a Numeric Type:
- Use
TRY_CASTto safely convert the extracted string to aDOUBLE. - Example:
TRY_CAST("150.75" AS DOUBLE) -- Returns 150.75 (as a DOUBLE)
- If the string is not a valid number,
TRY_CASTreturnsNULL.
- Use
Calculate the Average:
- Use
AVGto compute the average of allDOUBLEvalues. AVGignoresNULLvalues, so any non-numeric or missingtotal_costentries won’t affect the calculation.
- Use
Example
Assume my_table contains the following metadata values:
| metadata |
|---|
{"total_cost": "100.50", "other_data": "..."} |
{"total_cost": "200.25", "other_data": "..."} |
{"total_cost": "invalid_number", "other_data": "..."} |
{"other_data": "..."} |
Applying the query:
Extract and Convert:
Extracted total_costConverted to DOUBLE"100.50"100.50"200.25"200.25"invalid_number"NULLNULL(missing key)NULLCompute Average:
- Average of
100.50and200.25(ignoringNULLs):(100.50 + 200.25) / 2 = 150.375
- Average of
Conclusion
By combining get_json_object, TRY_CAST, and AVG, you can efficiently extract numeric values from JSON strings and perform aggregate calculations in Databricks SQL. This method handles invalid or missing data gracefully, ensuring accurate and reliable results.