# Fixing InfluxDB data in place using INTEGRAL()

*
**
(Updated: )
*

TL;DR: you can use `INTEGRAL()`

to sum data under a curve, however it seems to do right-sided integration so be careful when using it if your data is sparse and only tracks changes. I solved this by re-sampling the data using `GROUP BY(60s)`

, losing some accuracy but gaining convenience.

I store my home automation (sensing, rather) data in an InfluxDB database. Usually this works OK, except sometimes I accidentally ingest erroneous data. Mostly this is a measurement error which I can delete, but sometimes I’ve written the wrong metric or unit to the database. Since deleting data is a shame, I usually try to recover it.

In this case I accidentally stored power data (in Watts) in my database instead of energy usage (in Joule). Initially I thought of deleting it because converting power to energy is prone to rounding errors if sampling is low, however in this case the sampling was a few seconds so I tried and succeeded in recovering the data. Below I document this one-off fix for my own reference and perhaps for a random stranger.

# Inspecting data

First check out the data. It looks like indeed we’re measuring power (which varies) instead of energy (which increases monotonically). The `/1000/3600`

is used to convert the data back to the original unit of Watts.

```
select value/1000/3600 from _temp_dryerorig where source='dryer' order by desc limit 10
time value
---- -----
2022-12-22T16:36:54Z 0
2022-12-22T16:36:52Z 4
2022-12-22T16:36:42Z 0
2022-12-22T16:36:39Z 157
2022-12-22T16:36:38Z 298
2022-12-22T16:36:37Z 14
2022-12-22T16:35:18Z 0
2022-12-22T16:35:16Z 13
2022-12-22T16:35:14Z 157
```

# Applying INTEGRAL()

We can use `INTEGRAL()`

to solve our problem, it calculates the area under the curve, excatly what we need! Let’s apply this to the last appliance cycle.

```
select integral(value)/1000/3600 from _temp_dryerorig where source='dryer' and time > '2022-12-22T16:35:12Z' limit 10
time integral
---- --------
1970-01-01T00:00:00Z 1537
```

This seems to work, except I’m not sure if this number is correct.

```
select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z'
time integral
---- --------
1970-01-01T00:00:00Z 1.2539470833333333
```

1.25 kWh (`1h`

to get Wh from `INTEGRAL()`

, `/1000`

to get kWh, the `/1000/3600`

is a fix to get the original data in Watts back) looks like a reasonable number for a cycle of this appliance.

Using group by we can simulate an energy meter instead, this time as Joule:

```
select integral(value,1s)/1000/3600 from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(5m) limit 5
time integral
---- --------
2022-12-22T13:00:00Z 26076.75
2022-12-22T13:05:00Z 130326.75
2022-12-22T13:10:00Z 137082.5
2022-12-22T13:15:00Z 143446.875
2022-12-22T13:20:00Z 146848.625
```

Nice, we get a monotonically increasing value back.

Next we inspect the last two appliance cycles and check if we can scale this:

```
select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z'
time integral
---- --------
1970-01-01T00:00:00Z 3.649258888888889
```

3.65 kWh is a bit steep for two cycles. Let’s check the one-but last cycle indepdenently:

```
select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T00:00:00Z'
time integral
---- --------
1970-01-01T00:00:00Z 1.44998125
```

Ok so the last cycle was 1.25 kWh, the second to last was 1.45 kWh, however both combined gets me 3.65 kWh, something is off. What happens if we extend the `INTEGRAL()`

to just capture the first bit of the second cycle:

```
select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T13:00:00Z'
time integral
---- --------
1970-01-01T00:00:00Z 1.44998125
select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T14:00:00Z'
time integral
---- --------
1970-01-01T00:00:00Z 2.87397125
```

The problem seems to be: if there’s a big gap of no data between two cycles, this generates spurious counts in `INTEGRAL()`

. I suspect this is because `INTEGRAL()`

does not take the left edge but the right edge to do the integration/Riemann sum. The last datapoint of a cycle is always 0 because I only stored the changed datapoints. The first datapoint of a new cycle is therefore always >0. If `INTEGRAL()`

uses the right-most data point to integrate, this fails.

# Fixing spurious integrations

We can apply a fix the appliance idling integration problem by using a subclause where we re-group the data and `fill(0)`

to prevent this problem:

```
select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T14:00:00Z' group by time(10s) fill(0) limit 10
time fillval_value
---- -------------
2022-12-18T00:00:00Z 0
2022-12-18T00:00:10Z 0
2022-12-18T00:00:20Z 0
2022-12-18T00:00:30Z 0
2022-12-18T00:00:40Z 0
2022-12-18T00:00:50Z 0
2022-12-18T00:01:00Z 0
2022-12-18T00:01:10Z 0
2022-12-18T00:01:20Z 0
2022-12-18T00:01:30Z 0
```

OK lots of zeroes, makes sense.

Now we do the integral over re-grouped data which is now filled with 0 for no data:

```
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T13:00:00Z' group by time(10s) fill(0))
time integral
---- --------
1970-01-01T00:00:00Z 1.3851428670634927
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' and time < '2022-12-22T14:00:00Z' group by time(10s) fill(0))
time integral
---- --------
1970-01-01T00:00:00Z 1.8589239296737232
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-18T00:00:00Z' group by time(10s) fill(0))
time integral
---- --------
1970-01-01T00:00:00Z 2.594881050485013
```

Ok this makes more sense, the one-but last cycle was 1.39 kWh, and the cycles combined are 2.6 kWh, which matches the individual measurements of before.

# Check fix accuracy

Now we check how accurate the subclause is compared to simple measurement

```
select integral(value,1h)/1000/3600/1000 from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z'
# Gives 1.2539470833333333
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(5s) fill(0))
# Gives 1.1214777083333336 (-11%)
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(10s) fill(0))
# Gives 1.2097381834215177 (-3%)
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(25s) fill(0))
# Gives 1.244173185249133 (-0.8%)
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-12-22T08:00:00Z' group by time(60s) fill(0))
# Gives 1.2622966387799266 (+0.7%)
```

Somehow a longer averaging window (60s vs 10s) works better? Anyway, all errors are beyond my measurement accuracy goal, no time to dive deeper or understand why. I picked 60s because it’s faster.

Check total energy of wrongly logged data

```
select integral(fillval_value,1h)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s) fill(0))
time integral
---- --------
1970-01-01T00:00:00Z 16.188240516812662
```

16.188 kWh sounds about right. Could be 15 or 17 but I don’t care.

Now we group this data by 5 mins, in kWh

```
select integral(fillval_value,1s)/1000/3600/1000 from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s) fill(0)) group by time(5m)
time integral
---- --------
[...]
2022-12-19T12:05:00Z 0
2022-12-19T12:15:00Z 0
2022-12-19T12:20:00Z 23.35575
2022-12-19T12:25:00Z 132.00171128113408
2022-12-19T12:30:00Z 143.92108901794174
[...]
```

Seems to work, great! Now we integrate using CUMULATIVE_SUM() and add a constant offset to match the expected meter reading for today (129.05 kWh = 464580000 Joule).

```
# As Joule
select cumulative_sum(integral(fillval_value,1s))/1000/3600 - (16.188*1000*3600) + (129.05*1000*3600) from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s) fill(0)) group by time(5m)
time cumulative_sum
---- --------------
[...]
2022-12-19T12:05:00Z 454645787.3819719
2022-12-19T12:10:00Z 454645787.3819719
2022-12-19T12:15:00Z 454645787.3819719
2022-12-19T12:20:00Z 454669143.1319719
2022-12-19T12:25:00Z 454801144.843253
2022-12-19T12:30:00Z 454945065.932271
[...]
# Keep tags alive, round, and name as 'value'
select round(cumulative_sum(integral(fillval_value,1s))/1000/3600 - (16.188*1000*3600) + (129.05*1000*3600)) as value from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s),* fill(0)) group by time(5m),* limit 5
tags: quantity=myquantity, source=dryer, type=mytype, uniqueid=myid
time value
---- -----
2022-09-25T00:00:00Z 406303200
2022-09-25T00:05:00Z 406303200
2022-09-25T00:10:00Z 406303200
2022-09-25T00:15:00Z 406303200
2022-09-25T00:20:00Z 406303200
```

Hooray, this also works.

# Replace data in database

Now we being the actual editing in the database. We keep tags alive, store in temp measurement.

```
select round(cumulative_sum(integral(fillval_value,1s))/1000/3600 - (16.188*1000*3600) + (129.05*1000*3600)) as value into _temp_dryerfix from (select mean(*) as fillval from _temp_dryerorig where source='dryer' and time > '2022-09-25T00:00:00Z' group by time(60s),* fill(0)) group by time(5m),*
```

Next we store the original measurement as backup (maybe should do this first):

```
select * into _temp_dryerorig from energyv3 where source='dryer' and time > '2022-09-25T00:00:00Z' group by *
```

Finally, we delete the wrong data in the original measurement, then port fixed data back. We apply an extra WHERE clause here just to reduce the chance to mess up stuff.

```
SELECT count(*) FROM energyv3 where source='dryer' and time > '2022-09-25T00:00:00Z' and uniqueid='myid'
DELETE FROM energyv3 where source='dryer' and time > '2022-09-25T00:00:00Z' and uniqueid='myid'
select * from _temp_dryerfix group by * limit 1
select * into energyv3 from _temp_dryerfix group by * limit 1
select * into energyv3 from _temp_dryerfix group by *
```

And hooray, it works! We can clean up our temp measurements now and be happy.

```
drop measurement _temp_dryerfix
drop measurement _temp_dryerorig
```