
Comparing values in huge data sets

In a big data world there is, from time to time, a need to check that the data generated by our application changed as expected. Besides functional testing, it is good practice to verify that the data before and after a change is either identical or changed exactly as intended. With big data, even a small sample can have thousands of rows and hundreds of columns each. In this article I will explain how we can deal with this very task.
Let's assume that our data set is located in BigQuery and, to keep the explanation simple, contains only six columns and three rows, as shown in the table below:

uuid timestamp name latitude longitude status
44f3874b-bc7c-4730-9494-8f971f5e3ef7 1642416172 WBP1734 40.75360 -114.39207 32
ddacbd0d-1b8c-4ced-bb15-7092e24fbc84 1642417135 LC100 -4.19202 -65.67966 17
dfc3eabe-2646-436c-a524-6a437cce2d68 1642418142 AAWF-17 26.70534 117.27973 84


The uuid column holds the unique identifier of the device that sent the message. Timestamp represents the time when the message was sent. Name is the name of the device. Latitude and longitude represent the location of the device. Status is a calculated status of the device. This table is generated by our application from another table with input data; that input table is our entry point for verification and contains much more data. Let's assume that the change we expect from our application is increased precision of latitude and longitude. The rest of the behavior should stay the same. The same table, calculated with the updated functionality, can look like the one below:

uuid timestamp name latitude longitude status
44f3874b-bc7c-4730-9494-8f971f5e3ef7 1642416172 WBP1734 40.7536023 -114.3920701 32
ddacbd0d-1b8c-4ced-bb15-7092e24fbc84 1642417135 LC100 -4.1920256 -65.6796666 17
dfc3eabe-2646-436c-a524-6a437cce2d68 1642418142 AAWF-17 26.7053471 117.2797332 84

Now try to imagine that we have thousands of rows and hundreds of columns each, and that we expect changes in a few columns. Comparing those sets of data sounds a little complicated, doesn't it? Based on this simple example, I will explain how to solve the problem. First, for every row in the old table, you need to generate a unique fingerprint that represents the data from all columns. You can do it with a query like this:

UPDATE
  `iot-status-1sec-old`
SET
  DIM_FINGERPRINT = FARM_FINGERPRINT(ARRAY_TO_STRING([CAST(uuid AS STRING), CAST(timestamp AS STRING),
  CAST(name AS STRING), CAST(latitude AS STRING), CAST(longitude AS STRING), CAST(status AS STRING)], ''))
WHERE
  1 = 1;  -- BigQuery requires a WHERE clause on UPDATE; 1 = 1 matches every row
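
To make the fingerprinting idea concrete outside of BigQuery, here is a minimal Python sketch of the same concatenate-and-hash scheme. It uses sha256 as a stand-in for FARM_FINGERPRINT (the exact hash function does not matter for this comparison, only that it is deterministic); the row values are taken from the example table.

```python
import hashlib

def row_fingerprint(row):
    # Mirror FARM_FINGERPRINT(ARRAY_TO_STRING([...], '')): cast every
    # column to a string, concatenate, and hash the result.
    joined = "".join(str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

old_row = ("44f3874b-bc7c-4730-9494-8f971f5e3ef7", 1642416172,
           "WBP1734", 40.75360, -114.39207, 32)

# The same row always yields the same fingerprint; any changed column
# yields a different one.
print(row_fingerprint(old_row))
```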


We have to do the same for the table with the new data (after the changes). Note that we have to truncate latitude and longitude, because we expect them to have higher precision. The SQL query has to cover all expected changes so that it produces the same fingerprint as for the old data set. Unexpected changes will then produce fingerprints that differ from the old ones.

UPDATE
  `iot-status-1sec-new`
SET
  DIM_FINGERPRINT = FARM_FINGERPRINT(ARRAY_TO_STRING([CAST(uuid AS STRING), CAST(timestamp AS STRING),
  CAST(name AS STRING), CAST(TRUNC(latitude, 5) AS STRING), CAST(TRUNC(longitude, 5) AS STRING), CAST(status AS STRING)], ''))
WHERE
  1 = 1;
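
The effect of the truncation can also be sketched in Python: truncating the new, higher-precision coordinates back to five decimal places makes the normalized row hash to the same fingerprint as the old row (again with sha256 standing in for FARM_FINGERPRINT).

```python
import hashlib
import math

def trunc(x, digits):
    # Truncate toward zero to `digits` decimal places, like BigQuery's TRUNC.
    factor = 10 ** digits
    return math.trunc(x * factor) / factor

def row_fingerprint(row):
    # Same concatenate-and-hash scheme as for the old table.
    joined = "".join(str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

uuid = "44f3874b-bc7c-4730-9494-8f971f5e3ef7"
old_row = (uuid, 1642416172, "WBP1734", 40.75360, -114.39207, 32)
new_row = (uuid, 1642416172, "WBP1734", 40.7536023, -114.3920701, 32)

# Normalize only the columns where we expect a change, leave the rest alone.
normalized = new_row[:3] + (trunc(new_row[3], 5), trunc(new_row[4], 5)) + new_row[5:]
assert row_fingerprint(normalized) == row_fingerprint(old_row)
```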


The last step of the verification is to find the rows whose fingerprints have no match in the other table.

SELECT
  `iot-status-1sec-new`.DIM_FINGERPRINT AS new_fingerprint,
  `iot-status-1sec-old`.DIM_FINGERPRINT AS old_fingerprint,
  COALESCE(`iot-status-1sec-old`.uuid, `iot-status-1sec-new`.uuid) AS uuid
FROM `iot-status-1sec-new`
FULL OUTER JOIN `iot-status-1sec-old`
  ON `iot-status-1sec-old`.DIM_FINGERPRINT = `iot-status-1sec-new`.DIM_FINGERPRINT
WHERE `iot-status-1sec-old`.DIM_FINGERPRINT IS NULL
   OR `iot-status-1sec-new`.DIM_FINGERPRINT IS NULL;
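
The logic of this full join can be sketched with plain Python sets: collect the fingerprints of each table, and any fingerprint that appears on only one side flags a row for manual inspection. The fingerprint and uuid values below are made up for illustration, not real FARM_FINGERPRINT output.

```python
# fingerprint -> uuid for each table (illustrative values only)
old_fps = {"fp-1": "uuid-1", "fp-2": "uuid-2", "fp-3": "uuid-3"}
new_fps = {"fp-1": "uuid-1", "fp-2": "uuid-2", "fp-9": "uuid-3"}

# A fingerprint with no partner on the other side corresponds to a row
# the FULL JOIN query reports with a NULL fingerprint on one side.
only_in_old = set(old_fps) - set(new_fps)
only_in_new = set(new_fps) - set(old_fps)

suspect_uuids = {old_fps[fp] for fp in only_in_old} | {new_fps[fp] for fp in only_in_new}
print(sorted(suspect_uuids))  # → ['uuid-3']
```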


Executing this query yields the uuids of the rows with unexpected changes, which we then have to inspect manually to see whether they are real bugs. The method described above shows how to validate that a big data set contains no unexpected changes, but it does not validate that the expected changes themselves are correct; that has to be covered by a separate set of tests.