Comparing values in huge data sets
In the big data world, from time to time we need to check that the data generated
by our application has changed exactly as expected. Besides functional testing, it is good practice
to verify that, before and after a change, the data is either identical or differs only in the
expected way. With big data, even a small data sample can have thousands
of rows and hundreds of columns in each. In this article I will explain how we can deal
with this very task.
Let's assume that our data set is located in BigQuery and, to keep the explanation simple, contains only six columns and three rows, as shown in the table below:
| uuid | timestamp | name | latitude | longitude | status |
|---|---|---|---|---|---|
| 44f3874b-bc7c-4730-9494-8f971f5e3ef7 | 1642416172 | WBP1734 | 40.75360 | -114.39207 | 32 |
| ddacbd0d-1b8c-4ced-bb15-7092e24fbc84 | 1642417135 | LC100 | -4.19202 | -65.67966 | 17 |
| dfc3eabe-2646-436c-a524-6a437cce2d68 | 1642418142 | AAWF-17 | 26.70534 | 117.27973 | 84 |
The uuid column is the unique identifier of the device that sent the message. Timestamp
represents the time when the message was sent. Name is the name of the device. Latitude
and longitude represent the location of the device. Status is a calculated status
of the device. This table is generated by our application from another table with input
data. That input table is our entry point for verification and contains much more data.
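To make the later queries easier to follow, a minimal sketch of the schema for the "old" table (the data before the change) could look like the one below; the column types are my assumption, not something prescribed by the original setup:
CREATE TABLE `iot-status-1sec-old` (
  uuid STRING,
  timestamp INT64,
  name STRING,
  latitude FLOAT64,
  longitude FLOAT64,
  status INT64
);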
Let's assume that the change we expect in our application is increased precision in latitude
and longitude; the rest of the behavior should stay the same. The same table as above, calculated
with the updated functionality, can look like this:
| uuid | timestamp | name | latitude | longitude | status |
|---|---|---|---|---|---|
| 44f3874b-bc7c-4730-9494-8f971f5e3ef7 | 1642416172 | WBP1734 | 40.7536023 | -114.3920701 | 32 |
| ddacbd0d-1b8c-4ced-bb15-7092e24fbc84 | 1642417135 | LC100 | -4.1920256 | -65.6796666 | 17 |
| dfc3eabe-2646-436c-a524-6a437cce2d68 | 1642418142 | AAWF-17 | 26.7053471 | 117.2797332 | 84 |
Now try to imagine that we have thousands of rows with hundreds of columns each, and we expect changes in only a few of those columns. Comparing those data sets sounds a little complicated, doesn't it? Based on this simple example, I will try to explain how to solve the problem. First, for every row in the old table you need to generate a fingerprint that represents the data from all columns. You can do it with a query like this:
UPDATE `iot-status-1sec-old`
SET DIM_FINGERPRINT = FARM_FINGERPRINT(ARRAY_TO_STRING([
    CAST(uuid AS STRING),
    CAST(timestamp AS STRING),
    CAST(name AS STRING),
    CAST(latitude AS STRING),
    CAST(longitude AS STRING),
    CAST(status AS STRING)], ''))
WHERE 1 = 1;
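This assumes that a DIM_FINGERPRINT column already exists in the table. If it does not, you can add it first (FARM_FINGERPRINT returns an INT64), and do the same for the new table, for example:
ALTER TABLE `iot-status-1sec-old` ADD COLUMN DIM_FINGERPRINT INT64;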
We have to do the same for the table with the new data (after changes). Note that we have to
truncate latitude and longitude, as we expect them to have higher precision. The SQL query has to
cover all expected changes so that it produces the same fingerprint as for the
old data set; rows with unexpected changes will then end up with fingerprints different from those in the old data.
UPDATE `iot-status-1sec-new`
SET DIM_FINGERPRINT = FARM_FINGERPRINT(ARRAY_TO_STRING([
    CAST(uuid AS STRING),
    CAST(timestamp AS STRING),
    CAST(name AS STRING),
    CAST(TRUNC(latitude, 5) AS STRING),
    CAST(TRUNC(longitude, 5) AS STRING),
    CAST(status AS STRING)], ''))
WHERE 1 = 1;
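Before comparing the two tables, a quick sanity check you can run is whether the fingerprints identify rows as uniquely as you expect; if the two counts below differ, there are duplicate rows or fingerprint collisions, and a comparison based on fingerprints alone will not be fully reliable:
SELECT COUNT(*) AS row_count, COUNT(DISTINCT DIM_FINGERPRINT) AS fingerprint_count FROM `iot-status-1sec-old`;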
The last step in this verification is to find the rows whose fingerprints have no matching
value in the other table:
SELECT `iot-status-1sec-new`.DIM_FINGERPRINT, `iot-status-1sec-old`.DIM_FINGERPRINT, `iot-status-1sec-old`.uuid
FROM `iot-status-1sec-new`
FULL JOIN `iot-status-1sec-old`
  ON `iot-status-1sec-old`.DIM_FINGERPRINT = `iot-status-1sec-new`.DIM_FINGERPRINT
WHERE `iot-status-1sec-old`.DIM_FINGERPRINT IS NULL
   OR `iot-status-1sec-new`.DIM_FINGERPRINT IS NULL;
Executing this query returns the uuid of every row with unexpected changes, and we have
to check manually whether each of them is a real bug. The method I have described above shows you how to validate
that a big data set contains no unexpected changes, but it does not validate that the expected
changes themselves are correct. That has to be covered by a separate test.
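As a sketch of what such a separate test could include: one simple, if imperfect, check is to confirm that the new coordinates really carry more than five decimal digits. The query below lists rows where neither coordinate appears to have gained precision; note that legitimate values with trailing zeros beyond the fifth decimal would also be flagged, so any hits need a manual look rather than being treated as automatic failures:
SELECT uuid FROM `iot-status-1sec-new` WHERE latitude = TRUNC(latitude, 5) AND longitude = TRUNC(longitude, 5);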