Tim's blah blah blah

Fixing InfluxDB data in place using INTEGRAL()

TL;DR: you can use INTEGRAL() (influxdata.com) to sum the area under a curve. However, it seems to do right-sided integration (wikipedia.org), so be careful when using it if your data is sparse and only tracks changes. I solved this by re-sampling the data using GROUP BY time(60s), losing some accuracy but gaining convenience.

I store my home automation (sensing, rather) data in an InfluxDB database. Usually this works OK, except sometimes I accidentally ingest erroneous data. Mostly this is a measurement error which I can delete, but sometimes I’ve written the wrong metric or unit to the database. Since deleting data is a shame, I usually try to recover it.

In this case I accidentally stored power data (in Watts) in my database instead of energy usage (in Joules). Initially I thought of deleting it, because converting power to energy is prone to rounding errors if the sampling rate is low. In this case, however, the sampling interval was a few seconds, so I tried and succeeded in recovering the data. Below I document this one-off fix for my own reference and perhaps for a random stranger.
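Converting power to energy is just integrating power over time. A minimal Python sketch of that conversion using the trapezoidal rule, with made-up readings (not the actual dryer data):

```python
# Convert sampled power (W) to energy (J) with the trapezoidal rule.
# The readings below are hypothetical, not the actual dryer measurements.
def power_to_energy(samples):
    """samples: list of (unix_seconds, watts); returns energy in Joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2 * (t1 - t0)  # trapezoid area for this interval
    return total

readings = [(0, 0), (2, 100), (4, 100), (6, 0)]  # a short 100 W burst
print(power_to_energy(readings))  # 400.0 Joules
```

With dense sampling (a few seconds, as here) the conversion error is negligible; with sparse change-only data it is not, which is the whole problem below.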

Inspecting data

First, check out the data. It looks like we are indeed measuring power (which varies) instead of energy (which increases monotonically). The /1000/3600 converts the values back to the original unit of Watts.

select value/1000/3600 from _temp_dryerorig where source='dryer' order by time desc limit 10

time                 value
----                 -----
2022-12-22T16:36:54Z 0
2022-12-22T16:36:52Z 4
2022-12-22T16:36:42Z 0
2022-12-22T16:36:39Z 157
2022-12-22T16:36:38Z 298
2022-12-22T16:36:37Z 14
2022-12-22T16:35:18Z 0
2022-12-22T16:35:16Z 13
2022-12-22T16:35:14Z 157

Applying INTEGRAL()

We can use INTEGRAL() (influxdata.com) to solve our problem: it calculates the area under the curve, exactly what we need! Let's apply it to the last appliance cycle.

select integral(value)/1000/3600 from _temp_dryerorig where source='dryer' and time > '2022-12-22T16:35:12Z' limit 10

time                 integral
----                 --------
1970-01-01T00:00:00Z 1537

This seems to work, except I'm not sure if this number is correct. Let's use the 1h unit and convert to kWh to sanity-check it:

select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z'

time                 integral
----                 --------
1970-01-01T00:00:00Z 1.2539470833333333

1.25 kWh looks like a reasonable number for a cycle of this appliance (the 1h unit makes INTEGRAL() return value-hours, the /1000/3600 undoes the storage scaling to get the original data in Watts back, and the final /1000 converts Wh to kWh).
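The unit chain is easy to get wrong, so here is a quick sanity check of the arithmetic in Python, assuming a constant hypothetical load:

```python
# Sanity-check the /1000/3600/1000 unit chain, assuming a constant
# hypothetical 500 W load stored in the scaled database unit.
stored = 500 * 1000 * 3600        # one stored value: Watts scaled up by 1000*3600
value_hours = stored * 2          # what INTEGRAL(value, 1h) returns for a 2 h run
watt_hours = value_hours / 1000 / 3600  # undo the storage scaling
kwh = watt_hours / 1000
print(kwh)  # 1.0 -> 500 W for 2 hours is indeed 1 kWh
```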

Using GROUP BY we can simulate an energy meter instead, this time in Joules:

select integral(value,1s)/1000/3600 from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(5m) limit 5

time                 integral
----                 --------
2022-12-22T13:00:00Z 26076.75
2022-12-22T13:05:00Z 130326.75
2022-12-22T13:10:00Z 137082.5
2022-12-22T13:15:00Z 143446.875
2022-12-22T13:20:00Z 146848.625

Nice, we get a monotonically increasing value back.

Next we inspect the last two appliance cycles and check whether this approach scales:

select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z'

time                 integral
----                 --------
1970-01-01T00:00:00Z 3.649258888888889

3.65 kWh is a bit steep for two cycles. Let's check the second-to-last cycle independently:

select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and  time < '2022-12-22T00:00:00Z'

time                 integral
----                 --------
1970-01-01T00:00:00Z 1.44998125

OK, so the last cycle was 1.25 kWh and the second-to-last was 1.45 kWh, yet both combined give me 3.65 kWh; something is off. What happens if we extend the INTEGRAL() to capture just the first bit of the second cycle?

select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T13:00:00Z'

time                 integral
----                 --------
1970-01-01T00:00:00Z 1.44998125

select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T14:00:00Z'

time                 integral
----                 --------
1970-01-01T00:00:00Z 2.87397125

The problem seems to be: if there's a big gap with no data between two cycles, INTEGRAL() generates spurious counts. I suspect this is because INTEGRAL() does not take the left edge but the right edge (wikipedia.org) of each interval to do the integration/Riemann sum (stackexchange.com). The last datapoint of a cycle is always 0 because I only stored changed datapoints, and the first datapoint of a new cycle is therefore always >0. If INTEGRAL() uses the right-most datapoint of each interval, the entire idle gap gets counted at that nonzero value, which inflates the result.
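This effect is easy to reproduce outside InfluxDB. With change-only data that ends a cycle at 0 and starts the next at a high value, a right-sided Riemann sum attributes the whole idle gap to the first nonzero point of the new cycle, while a left-sided sum does not. A small Python sketch with hypothetical points (not necessarily InfluxDB's actual implementation):

```python
# Change-only samples (seconds, watts): a cycle ends at 0, a long idle gap
# follows, then the next cycle starts at 200 W. Hypothetical data.
points = [(0, 100), (10, 0), (3600, 200), (3610, 0)]

def riemann(points, side):
    """Rectangle sum between consecutive points, using the left or right edge."""
    total = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        height = v0 if side == "left" else v1
        total += height * (t1 - t0)
    return total

print(riemann(points, "left"))   # 3000.0 J: the idle gap counts as 0 W
print(riemann(points, "right"))  # 718000.0 J: the gap is counted at 200 W
```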

Fixing spurious integrations

We can fix this idle-period integration problem with a subquery that re-groups the data and uses fill(0) to insert explicit zeroes where there is no data:

select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T14:00:00Z' group by time(10s) fill(0) limit 10

time                 fillval_value
----                 -------------
2022-12-18T00:00:00Z 0
2022-12-18T00:00:10Z 0
2022-12-18T00:00:20Z 0
2022-12-18T00:00:30Z 0
2022-12-18T00:00:40Z 0
2022-12-18T00:00:50Z 0
2022-12-18T00:01:00Z 0
2022-12-18T00:01:10Z 0
2022-12-18T00:01:20Z 0
2022-12-18T00:01:30Z 0

OK lots of zeroes, makes sense.
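What the re-grouping does can be sketched outside InfluxDB as well: bucket the sparse points into fixed windows, average each window, and emit 0 for empty windows. A minimal Python version with hypothetical data (roughly what GROUP BY time(10s) fill(0) does above):

```python
# Bucket sparse (seconds, watts) points into fixed windows, average each
# window, and emit 0 for empty windows -- roughly GROUP BY time(...) fill(0).
def resample_fill_zero(points, start, end, window):
    buckets = {}
    for t, v in points:
        key = start + ((t - start) // window) * window
        buckets.setdefault(key, []).append(v)
    out = []
    for t in range(start, end, window):
        vals = buckets.get(t)
        out.append((t, sum(vals) / len(vals) if vals else 0.0))
    return out

pts = [(3, 100), (7, 200), (25, 50)]  # hypothetical sparse samples
print(resample_fill_zero(pts, 0, 40, 10))
# [(0, 150.0), (10, 0.0), (20, 50.0), (30, 0.0)]
```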

Now we do the integral over the re-grouped data, in which gaps are filled with 0:

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T13:00:00Z' group by time(10s) fill(0))

time                 integral
----                 --------
1970-01-01T00:00:00Z 1.3851428670634927

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T14:00:00Z' group by time(10s) fill(0))

time                 integral
----                 --------
1970-01-01T00:00:00Z 1.8589239296737232

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' group by time(10s) fill(0))

time                 integral
----                 --------
1970-01-01T00:00:00Z 2.594881050485013

OK, this makes more sense: the second-to-last cycle was 1.39 kWh, and both cycles combined are 2.6 kWh, which matches the individual measurements from before.

Check fix accuracy

Now we check how accurate the subquery is compared to the simple measurement:

select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z'
# Gives 1.2539470833333333

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(5s) fill(0))
# Gives 1.1214777083333336 (-11%)

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(10s) fill(0))
# Gives 1.2097381834215177 (-3%)

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(25s) fill(0))
# Gives 1.244173185249133 (-0.8%)

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(60s) fill(0))
# Gives 1.2622966387799266 (+0.7%)

Somehow a longer averaging window (60s vs 10s) works better? Anyway, all errors are within my measurement accuracy goal, so no time to dive deeper or understand why. I picked 60s because it's faster.

Check total energy of wrongly logged data

select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s) fill(0))

time                 integral
----                 --------
1970-01-01T00:00:00Z 16.188240516812662

16.188 kWh sounds about right. Could be 15 or 17 but I don’t care.

Now we group this data into 5-minute buckets; note the extra /1000 means the values come out in kJ per bucket:

select integral(fillval_value,1s)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s) fill(0)) group by time(5m)

time                 integral
----                 --------
[...]
2022-12-19T12:05:00Z 0
2022-12-19T12:15:00Z 0
2022-12-19T12:20:00Z 23.35575
2022-12-19T12:25:00Z 132.00171128113408
2022-12-19T12:30:00Z 143.92108901794174
[...]

Seems to work, great! Now we integrate using CUMULATIVE_SUM() (influxdata.com) and add a constant offset so the series ends at the expected meter reading for today (129.05 kWh = 464580000 Joules).

# As Joule
select cumulative_sum(integral(fillval_value,1s))/1000/3600 - (16.188*1000*3600) + (129.05*1000*3600) from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s) fill(0)) group by time(5m)

time                 cumulative_sum
----                 --------------
[...]
2022-12-19T12:05:00Z 454645787.3819719
2022-12-19T12:10:00Z 454645787.3819719
2022-12-19T12:15:00Z 454645787.3819719
2022-12-19T12:20:00Z 454669143.1319719
2022-12-19T12:25:00Z 454801144.843253
2022-12-19T12:30:00Z 454945065.932271
[...]
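The offset arithmetic (subtract the reconstructed total, add the real meter reading, all in Joules) can be checked with toy numbers in Python:

```python
# Per-bucket energy integrals (J, toy numbers), cumulatively summed and
# shifted so the series ends exactly at the real meter reading.
bucket_joules = [0, 1000, 5000, 2000]  # hypothetical per-bucket integrals
computed_total = sum(bucket_joules)    # 8000 J from our reconstruction
meter_reading = 20000                  # what the physical meter says, in J

running, series = 0, []
for e in bucket_joules:
    running += e
    series.append(running - computed_total + meter_reading)

print(series)  # [12000, 13000, 18000, 20000] -- last value matches the meter
```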

# Keep tags alive, round, and name as 'value'
select round(cumulative_sum(integral(fillval_value,1s))/1000/3600 - (16.188*1000*3600) + (129.05*1000*3600)) as value from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s),* fill(0)) group by time(5m),* limit 5

tags: quantity=myquantity, source=dryer, type=mytype, uniqueid=myid
time                 value
----                 -----
2022-09-25T00:00:00Z 406303200
2022-09-25T00:05:00Z 406303200
2022-09-25T00:10:00Z 406303200
2022-09-25T00:15:00Z 406303200
2022-09-25T00:20:00Z 406303200

Hooray, this also works.

Replace data in database

Now we begin the actual editing in the database. We keep the tags alive and store the result in a temporary measurement:

select round(cumulative_sum(integral(fillval_value,1s))/1000/3600 - (16.188*1000*3600) + (129.05*1000*3600)) as value into _temp_dryerfix from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s),* fill(0)) group by time(5m),*

Next we store the original measurement as a backup (this should really be done first; it's where _temp_dryerorig in the queries above came from):

select * into _temp_dryerorig from energyv3 where source='dryer' and time > '2022-09-25T00:00:00Z' group by *

Finally, we delete the wrong data from the original measurement, then port the fixed data back. We apply an extra WHERE clause on uniqueid just to reduce the chance of messing things up, count what we're about to delete first, and test the copy with limit 1 before porting everything:

SELECT count(*) FROM energyv3 where source='dryer' and time > '2022-09-25T00:00:00Z' and uniqueid='myid'
DELETE FROM energyv3 where source='dryer' and time > '2022-09-25T00:00:00Z' and uniqueid='myid'

select * from _temp_dryerfix group by * limit 1
select * into energyv3 from _temp_dryerfix group by * limit 1
select * into energyv3 from _temp_dryerfix group by *

And hooray, it works! We can clean up our temp measurements now and be happy.

drop measurement _temp_dryerfix
drop measurement _temp_dryerorig

#Server #Smarthome