This package and vignette make extensive use of data.table. If you're unfamiliar with data.table syntax, a brief review of that package's introductory vignette may be useful.
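The examples below assume both packages are attached and that a random seed has been set so the rnorm() draws are reproducible; a minimal setup sketch (the specific seed value is an assumption, not taken from the original vignette) might look like this:
library(data.table)       #interval data are built and manipulated as data.tables
library(intervalaverage)  #provides intervalaverage(), intervalintersect(), isolateoverlaps(), CJ.dt(), is.overlapping()
set.seed(1)               #assumed seed; any fixed seed makes the simulated exposure values reproducible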
Consider the following dataset which represents average (predicted) pm2.5 exposure and no2 exposure at some location over four sequential 7-day periods at the beginning of the year 2000:
exposure_dataset <- data.table(location_id=1, start=seq(as.IDate("2000-01-01"),by=7,length=4),
end=seq(as.IDate("2000-01-07"),by=7,length=4),pm25=rnorm(4,mean=15), no2=rnorm(4,mean=25))
exposure_dataset
#> location_id start end pm25 no2
#> <num> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-07 14.37355 25.32951
#> 2: 1 2000-01-08 2000-01-14 15.18364 24.17953
#> 3: 1 2000-01-15 2000-01-21 14.16437 25.48743
#> 4: 1 2000-01-22 2000-01-28 16.59528 25.73832
Note that the above intervals are stored as a column for the start of the interval and a column for the end of the interval. For the purpose of this package, intervals are ALWAYS treated as closed (i.e. inclusive of start and end values) and the variables storing interval starts and ends must be discrete (e.g., class integer or IDate).
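For example, if your interval columns were read in as class Date (or character), converting them to IDate is straightforward; a small illustrative sketch (the table here is hypothetical, not one of the vignette's datasets):
dt <- data.table(start=as.Date("2000-01-01"), end=as.Date("2000-01-07"))
dt[, start:=as.IDate(start)] #IDate stores dates as integers, so the intervals are discrete
dt[, end:=as.IDate(end)]
sapply(dt, function(x) class(x)[1]) #both columns are now IDate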
If we wanted to calculate the average of the first two weeks of pm25 data, this would simply be the unweighted average of the two pm25 values for those weeks.
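For example, a sketch of that simple unweighted average using the exposure_dataset created above (the first two weekly intervals jointly cover Jan 1 through Jan 14, so no weighting is needed):
exposure_dataset[start %in% as.IDate(c("2000-01-01","2000-01-08")), mean(pm25)]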
But if we wanted the average of the first 10 days of that pm25 data, we would need to take a weighted average, since the period from Jan 1 to Jan 10 doesn't align perfectly with the intervals over which the pm2.5 data are recorded:
exposure_dataset[start %in% as.IDate(c("2000-01-01","2000-01-08")),weighted.mean(pm25,w=c(7/10,3/10))]
#> [1] 14.61658
The intervalaverage package, and specifically the intervalaverage function, was written to facilitate this sort of averaging operation. In order to use the function, we'll need a dataset containing values stored over intervals (such as exposure_dataset) as well as a dataset containing the periods we'd like to average over.
Let’s create a dataset containing some periods we’d like averages over:
averaging_periods <- data.table(start=seq(as.IDate("2000-01-01"),by=10,length=3),
end=seq(as.IDate("2000-01-10"),by=10,length=3))
averaging_periods
#> start end
#> <IDat> <IDat>
#> 1: 2000-01-01 2000-01-10
#> 2: 2000-01-11 2000-01-20
#> 3: 2000-01-21 2000-01-30
Now that we have defined intervals to average over, let's use the intervalaverage function to calculate the averages.
Note that in order for the intervalaverage function to work, the start and end columns need to have the same column names in x and in y. These column names are specified via the interval_vars argument, and the variables in x that you want averages calculated for are specified via value_vars.
averaged_exposures <- intervalaverage(x=exposure_dataset,y=averaging_periods,
interval_vars=c("start","end"),value_vars=c("pm25","no2")
)
averaged_exposures[, list(start,end,pm25,no2)]
#> Key: <start, end>
#> start end pm25 no2
#> <IDat> <IDat> <num> <num>
#> 1: 2000-01-01 2000-01-10 14.61658 24.98451
#> 2: 2000-01-11 2000-01-20 14.57208 24.96427
#> 3: 2000-01-21 2000-01-30 NA NA
The return value of the intervalaverage function is a data.table. Just the first four columns of that return data.table are printed above. The return data.table always contains the exact intervals specified in y (and, as such, the number of rows of the return is always the number of rows in y). The return also contains a column for each value_var specified from x. These columns contain the values of the variables from x averaged over the periods in y. Note that the value of pm25 in the first row is what we calculated manually above.
Note that the third entry for both the pm25 and the no2 columns is NA, or missing. This makes sense because x (exposure_dataset) didn't have measurements for every day of the interval in y (averaging_periods) from Jan 21, 2000 to Jan 30, 2000.
Displaying the full data.table returned by the function gives us some more information:
averaged_exposures
#> Key: <start, end>
#> start end pm25 no2 yduration xduration nobs_pm25
#> <IDat> <IDat> <num> <num> <num> <int> <int>
#> 1: 2000-01-01 2000-01-10 14.61658 24.98451 10 10 10
#> 2: 2000-01-11 2000-01-20 14.57208 24.96427 10 10 10
#> 3: 2000-01-21 2000-01-30 NA NA 10 8 8
#> nobs_no2 xminstart xmaxend
#> <int> <IDat> <IDat>
#> 1: 10 2000-01-01 2000-01-10
#> 2: 10 2000-01-11 2000-01-20
#> 3: 8 2000-01-21 2000-01-28
The xduration column tells us the number of days that were present in x for each interval specified in y. The first two averaging_periods intervals were fully represented in x, whereas x only contained data for 8 of the 10 days in the third y interval. The xmaxend column shows us that the last day of the interval from Jan 21, 2000 to Jan 30, 2000 that was present in x was Jan 28, 2000.
These supplementary columns are useful for diagnosing incomplete data in exposure_dataset.
If we’re ok with calculating an average based on incomplete data, we can set the the tolerance for missingness lower. Let’s say we’re ok with calculating a non-missing average if 75% or more of the period is observed:
intervalaverage(x=exposure_dataset,
y=averaging_periods,
interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
required_percentage = 75)
#> Key: <start, end>
#> start end pm25 no2 yduration xduration nobs_pm25
#> <IDat> <IDat> <num> <num> <num> <int> <int>
#> 1: 2000-01-01 2000-01-10 14.61658 24.98451 10 10 10
#> 2: 2000-01-11 2000-01-20 14.57208 24.96427 10 10 10
#> 3: 2000-01-21 2000-01-30 16.29142 25.70696 10 8 8
#> nobs_no2 xminstart xmaxend
#> <int> <IDat> <IDat>
#> 1: 10 2000-01-01 2000-01-10
#> 2: 10 2000-01-11 2000-01-20
#> 3: 8 2000-01-21 2000-01-28
The results are the same for the first two rows, but now we have non-missing values in the third row, calculated based on the available data in exposure_dataset. If a period had less than 75% of its days present, the function would still return NA for those value variables.
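For example, keeping everything else the same but requiring 90% coverage (a threshold chosen here purely for illustration) would again return NA for the third interval, since only 8 of its 10 days (80%) are present in x:
intervalaverage(x=exposure_dataset,
                y=averaging_periods,
                interval_vars=c("start","end"),
                value_vars=c("pm25","no2"),
                required_percentage = 90) #third row's pm25/no2 would again be NA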
Often, we might have interval data at more than one location or identifier at a time. Let's create a data.table similar to exposure_dataset but with several (three) locations:
exposure_dataset2 <- rbindlist(lapply(1:3, function(z){
data.table(location_id=z,
start=seq(as.IDate("2000-01-01"),by=7,length=4),
end=seq(as.IDate("2000-01-07"),by=7,length=4),pm25=rnorm(4,mean=15),
no2=rnorm(4,mean=25))} ))
exposure_dataset2
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-07 15.57578 24.37876
#> 2: 1 2000-01-08 2000-01-14 14.69461 22.78530
#> 3: 1 2000-01-15 2000-01-21 16.51178 26.12493
#> 4: 1 2000-01-22 2000-01-28 15.38984 24.95507
#> 5: 2 2000-01-01 2000-01-07 14.98381 25.91898
#> 6: 2 2000-01-08 2000-01-14 15.94384 25.78214
#> 7: 2 2000-01-15 2000-01-21 15.82122 25.07456
#> 8: 2 2000-01-22 2000-01-28 15.59390 23.01065
#> 9: 3 2000-01-01 2000-01-07 15.61983 24.52185
#> 10: 3 2000-01-08 2000-01-14 14.94387 25.41794
#> 11: 3 2000-01-15 2000-01-21 14.84420 26.35868
#> 12: 3 2000-01-22 2000-01-28 13.52925 24.89721
If you want to use the intervalaverage function to calculate averages of values in x over a set of averaging periods separately for each level of an identifier variable, that identifier variable needs to be crossed with every averaging period in y. It takes an extra step to cross the identifier with the averaging periods to create y, but in creating y this way you explicitly define the form of the return value of intervalaverage, since intervalaverage always returns one row for each row in y.
Let’s cross the previous averaging periods table with every unique value of the identifier in the new exposure dataset:
#unexpanded:
averaging_periods
#> start end
#> <IDat> <IDat>
#> 1: 2000-01-01 2000-01-10
#> 2: 2000-01-11 2000-01-20
#> 3: 2000-01-21 2000-01-30
#expanded to every location_id:
rbindlist(lapply(1:3, function(z)copy(averaging_periods)[,location_id:=z][]))
#> start end location_id
#> <IDat> <IDat> <int>
#> 1: 2000-01-01 2000-01-10 1
#> 2: 2000-01-11 2000-01-20 1
#> 3: 2000-01-21 2000-01-30 1
#> 4: 2000-01-01 2000-01-10 2
#> 5: 2000-01-11 2000-01-20 2
#> 6: 2000-01-21 2000-01-30 2
#> 7: 2000-01-01 2000-01-10 3
#> 8: 2000-01-11 2000-01-20 3
#> 9: 2000-01-21 2000-01-30 3
The above code is a bit esoteric, so the intervalaverage package contains a function to simplify and generalize this process of repeating/expanding a set of intervals (or, more generally, a set of rows in a table) for every location_id (or, more generally, for every row in another table). To use this CJ.dt function, just create a data.table with a column containing unique ids, then call CJ.dt on the two tables:
exposure_dataset2_unique_locs <- data.table(location_id=unique(exposure_dataset2$location_id))
averaging_periods2 <- CJ.dt(averaging_periods, exposure_dataset2_unique_locs)
averaging_periods2
#> start end location_id
#> <IDat> <IDat> <int>
#> 1: 2000-01-01 2000-01-10 1
#> 2: 2000-01-11 2000-01-20 1
#> 3: 2000-01-21 2000-01-30 1
#> 4: 2000-01-01 2000-01-10 2
#> 5: 2000-01-11 2000-01-20 2
#> 6: 2000-01-21 2000-01-30 2
#> 7: 2000-01-01 2000-01-10 3
#> 8: 2000-01-11 2000-01-20 3
#> 9: 2000-01-21 2000-01-30 3
#or, more concisely:
averaging_periods2 <- CJ.dt(averaging_periods, unique(exposure_dataset2[,list(location_id)]))
Now, to take averages of x values over intervals in y within groups, all we have to do is use the same call to intervalaverage as in the first example while specifying one more argument: group_vars="location_id".
intervalaverage(x=exposure_dataset2,
y=averaging_periods2,
interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
group_vars="location_id",
required_percentage = 75)[, list(location_id, start,end, pm25,no2)]
#> Key: <location_id, start, end>
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-10 15.31143 23.90072
#> 2: 1 2000-01-11 2000-01-20 15.78491 24.78908
#> 3: 1 2000-01-21 2000-01-30 15.53009 25.10130
#> 4: 2 2000-01-01 2000-01-10 15.27182 25.87793
#> 5: 2 2000-01-11 2000-01-20 15.87027 25.35759
#> 6: 2 2000-01-21 2000-01-30 15.62232 23.26864
#> 7: 3 2000-01-01 2000-01-10 15.41704 24.79068
#> 8: 3 2000-01-11 2000-01-20 14.88407 25.98238
#> 9: 3 2000-01-21 2000-01-30 13.69362 25.07990
The group_vars argument tells the intervalaverage function to calculate averages separately within each group.
Of course, we could have completed the above by calling intervalaverage repeatedly for each value of location_id in x and y using a for loop. The reason to prefer the group_vars approach is that the intervalaverage function is written to be faster than looping when dealing with grouping. It also saves you the trouble of writing a loop and combining the results. Additionally, group_vars accepts a vector of character column names, meaning that you can calculate averages within combinations of groups without writing nested for loops (see the sketch below).
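As a sketch of that, suppose both tables also had a column distinguishing two exposure model scenarios. The scenario column below is invented purely for illustration (it is not part of the datasets above); passing both column names to group_vars averages within every location_id-by-scenario combination:
exposure_dataset2_scen <- rbindlist(lapply(c("modelA","modelB"), function(s)
  copy(exposure_dataset2)[, scenario:=s])) #hypothetical: duplicate the exposures under two scenario labels
averaging_periods2_scen <- rbindlist(lapply(c("modelA","modelB"), function(s)
  copy(averaging_periods2)[, scenario:=s])) #cross the averaging periods with the same scenario labels
intervalaverage(x=exposure_dataset2_scen,
                y=averaging_periods2_scen,
                interval_vars=c("start","end"),
                value_vars=c("pm25","no2"),
                group_vars=c("location_id","scenario"),
                required_percentage = 75)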
Note that all the intervals used in this package are treated as inclusive. So far, we've dealt with data whose intervals do not overlap. However, consider the following dataset where the end day of one interval is the start day of the next interval:
exposure_dataset_overlapping <- rbindlist(lapply(1:3, function(z){
data.table(location_id=z,
start=seq(as.IDate("2000-01-01"),by=7,length=4),
end=seq(as.IDate("2000-01-08"),by=7,length=4),
pm25=rnorm(4,mean=15),
no2=rnorm(4,mean=25) )
} ))
exposure_dataset_overlapping
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-08 15.38767 24.60571
#> 2: 1 2000-01-08 2000-01-15 14.94619 24.94069
#> 3: 1 2000-01-15 2000-01-22 13.62294 26.10003
#> 4: 1 2000-01-22 2000-01-29 14.58501 25.76318
#> 5: 2 2000-01-01 2000-01-08 14.83548 24.31124
#> 6: 2 2000-01-08 2000-01-15 14.74664 24.29250
#> 7: 2 2000-01-15 2000-01-22 15.69696 25.36458
#> 8: 2 2000-01-22 2000-01-29 15.55666 25.76853
#> 9: 3 2000-01-01 2000-01-08 14.88765 25.34112
#> 10: 3 2000-01-08 2000-01-15 15.88111 23.87064
#> 11: 3 2000-01-15 2000-01-22 15.39811 26.43302
#> 12: 3 2000-01-22 2000-01-29 14.38797 26.98040
If we try to average this exposure dataset, we get an error:
intervalaverage(exposure_dataset_overlapping,averaging_periods2,
interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
group_vars="location_id",
required_percentage = 75)
#> Error in intervalaverage(exposure_dataset_overlapping, averaging_periods2, : nrow(data.table::foverlaps(x, x)) == nrow(x) is not TRUE
That’s because the intervalaverage
function is written
to throw an error if there are overlaps in within groups. This is to
encourage the user to explicitly and consciously deal with overlaps
prior to averaging.
Note that we can also check whether there are overlapping intervals (within specified groups) using is.overlapping:
is.overlapping(exposure_dataset_overlapping, interval_vars=c('start','end'),group_vars="location_id")
#> [1] TRUE
In order to deal with partially overlapping intervals, we need to split intervals into areas of exact overlap and non-overlap with the isolateoverlaps function:
exposure_dataset_isolated <- isolateoverlaps(exposure_dataset_overlapping,
interval_vars=c("start","end"),
group_vars="location_id",
interval_vars_out=c("start2","end2"))
exposure_dataset_isolated[1:15] #only show the first 15 rows
#> location_id start2 end2 start end pm25 no2
#> <int> <IDat> <IDat> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-07 2000-01-01 2000-01-08 15.38767 24.60571
#> 2: 1 2000-01-08 2000-01-08 2000-01-01 2000-01-08 15.38767 24.60571
#> 3: 1 2000-01-08 2000-01-08 2000-01-08 2000-01-15 14.94619 24.94069
#> 4: 1 2000-01-09 2000-01-14 2000-01-08 2000-01-15 14.94619 24.94069
#> 5: 1 2000-01-15 2000-01-15 2000-01-08 2000-01-15 14.94619 24.94069
#> 6: 1 2000-01-15 2000-01-15 2000-01-15 2000-01-22 13.62294 26.10003
#> 7: 1 2000-01-16 2000-01-21 2000-01-15 2000-01-22 13.62294 26.10003
#> 8: 1 2000-01-22 2000-01-22 2000-01-15 2000-01-22 13.62294 26.10003
#> 9: 1 2000-01-22 2000-01-22 2000-01-22 2000-01-29 14.58501 25.76318
#> 10: 1 2000-01-23 2000-01-29 2000-01-22 2000-01-29 14.58501 25.76318
#> 11: 2 2000-01-01 2000-01-07 2000-01-01 2000-01-08 14.83548 24.31124
#> 12: 2 2000-01-08 2000-01-08 2000-01-01 2000-01-08 14.83548 24.31124
#> 13: 2 2000-01-08 2000-01-08 2000-01-08 2000-01-15 14.74664 24.29250
#> 14: 2 2000-01-09 2000-01-14 2000-01-08 2000-01-15 14.74664 24.29250
#> 15: 2 2000-01-15 2000-01-15 2000-01-08 2000-01-15 14.74664 24.29250
Inspect the above table and compare it to exposure_dataset_overlapping. start2 and end2 are the new intervals, and start and end are the original intervals. Note how there are two rows for every overlapping period (i.e., in the start2 and end2 columns), but the pm25 and no2 values differ within these rows, since one value comes from the first overlapping original interval and the other comes from the second.
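A quick check (not required for the workflow) makes this concrete: counting rows per isolated interval shows that each single-day overlap appears twice, once for each contributing original interval:
exposure_dataset_isolated[, .N, by=c("location_id","start2","end2")][N>1]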
We can then average exposure values within periods of exact overlap:
exposure_dataset_overlaps_averaged <-
exposure_dataset_isolated[, list(pm25=mean(pm25),no2=mean(no2)),by=c("location_id","start2","end2")]
setnames(exposure_dataset_overlaps_averaged, c("start2","end2"),c("start","end"))
exposure_dataset_overlaps_averaged
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-07 15.38767 24.60571
#> 2: 1 2000-01-08 2000-01-08 15.16693 24.77320
#> 3: 1 2000-01-09 2000-01-14 14.94619 24.94069
#> 4: 1 2000-01-15 2000-01-15 14.28457 25.52036
#> 5: 1 2000-01-16 2000-01-21 13.62294 26.10003
#> 6: 1 2000-01-22 2000-01-22 14.10397 25.93160
#> 7: 1 2000-01-23 2000-01-29 14.58501 25.76318
#> 8: 2 2000-01-01 2000-01-07 14.83548 24.31124
#> 9: 2 2000-01-08 2000-01-08 14.79106 24.30187
#> 10: 2 2000-01-09 2000-01-14 14.74664 24.29250
#> 11: 2 2000-01-15 2000-01-15 15.22180 24.82854
#> 12: 2 2000-01-16 2000-01-21 15.69696 25.36458
#> 13: 2 2000-01-22 2000-01-22 15.62681 25.56656
#> 14: 2 2000-01-23 2000-01-29 15.55666 25.76853
#> 15: 3 2000-01-01 2000-01-07 14.88765 25.34112
#> 16: 3 2000-01-08 2000-01-08 15.38438 24.60588
#> 17: 3 2000-01-09 2000-01-14 15.88111 23.87064
#> 18: 3 2000-01-15 2000-01-15 15.63961 25.15183
#> 19: 3 2000-01-16 2000-01-21 15.39811 26.43302
#> 20: 3 2000-01-22 2000-01-22 14.89304 26.70671
#> 21: 3 2000-01-23 2000-01-29 14.38797 26.98040
#> location_id start end pm25 no2
This version of the dataset, in which values in overlapping periods have already been averaged, can now be averaged to the intervals in averaging_periods2 using the intervalaverage function:
intervalaverage(exposure_dataset_overlaps_averaged,
averaging_periods2,
interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
group_vars="location_id",
required_percentage = 75)[,list(location_id, start,end,pm25,no2)]
#> Key: <location_id, start, end>
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-10 15.27730 24.68945
#> 2: 1 2000-01-11 2000-01-20 14.21840 25.57832
#> 3: 1 2000-01-21 2000-01-30 14.42466 25.81932
#> 4: 2 2000-01-01 2000-01-10 14.81327 24.30656
#> 5: 2 2000-01-11 2000-01-20 15.26932 24.88215
#> 6: 2 2000-01-21 2000-01-30 15.58005 25.70121
#> 7: 3 2000-01-01 2000-01-10 15.13602 24.97350
#> 8: 3 2000-01-11 2000-01-20 15.61546 25.27995
#> 9: 3 2000-01-21 2000-01-30 14.55633 26.88917
While overlapping periods in x are not allowed, there's nothing wrong with averaging to multiple partially overlapping periods in y at the same time:
overlapping_averaging_periods <- data.table(start=as.IDate(c("2000-01-01","2000-01-01")),
end=as.IDate(c("2000-01-10","2000-01-20"))
)
overlapping_averaging_periods
#> start end
#> <IDat> <IDat>
#> 1: 2000-01-01 2000-01-10
#> 2: 2000-01-01 2000-01-20
overlapping_averaging_periods_expanded <-
CJ.dt(overlapping_averaging_periods,unique(exposure_dataset2[,list(location_id)]))
overlapping_averaging_periods_expanded
#> start end location_id
#> <IDat> <IDat> <int>
#> 1: 2000-01-01 2000-01-10 1
#> 2: 2000-01-01 2000-01-20 1
#> 3: 2000-01-01 2000-01-10 2
#> 4: 2000-01-01 2000-01-20 2
#> 5: 2000-01-01 2000-01-10 3
#> 6: 2000-01-01 2000-01-20 3
intervalaverage(exposure_dataset_overlaps_averaged,
overlapping_averaging_periods_expanded,
interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
group_vars="location_id",
required_percentage = 75)[,list(location_id, start,end,pm25,no2)]
#> Key: <location_id, start, end>
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-10 15.27730 24.68945
#> 2: 1 2000-01-01 2000-01-20 14.74785 25.13389
#> 3: 2 2000-01-01 2000-01-10 14.81327 24.30656
#> 4: 2 2000-01-01 2000-01-20 15.04129 24.59435
#> 5: 3 2000-01-01 2000-01-10 15.13602 24.97350
#> 6: 3 2000-01-01 2000-01-20 15.37574 25.12672
Note that if you specify identical intervals in y (within groups defined by group_vars), the duplicate intervals will be dropped with a warning, resulting in a return data.table with fewer rows than in y.
Often we’re interested in calculating averages over different periods for different locations. First to make this more realistic, let’s generate ~20 years of data at 2000 locations:
n_locs <- 2000
n_weeks <- 1000
exposure_dataset3 <- rbindlist(
lapply(1:n_locs, function(id) {
data.table(
location_id = id,
start = seq(as.IDate("2000-01-01"), by = 7, length = n_weeks),
end = seq(as.IDate("2000-01-07"), by = 7, length = n_weeks),
pm25 = rnorm(n_weeks, mean = 15),
no2 = rnorm(n_weeks, mean = 25)
)
})
)
exposure_dataset3
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-07 14.63278 26.21289
#> 2: 1 2000-01-08 2000-01-14 13.95587 24.37274
#> 3: 1 2000-01-15 2000-01-21 15.56972 26.71116
#> 4: 1 2000-01-22 2000-01-28 14.86495 24.60563
#> 5: 1 2000-01-29 2000-02-04 17.40162 22.67851
#> ---
#> 1999996: 2000 2019-01-26 2019-02-01 15.00646 25.27885
#> 1999997: 2000 2019-02-02 2019-02-08 16.41525 26.14573
#> 1999998: 2000 2019-02-09 2019-02-15 15.29644 24.84448
#> 1999999: 2000 2019-02-16 2019-02-22 15.05604 24.01209
#> 2000000: 2000 2019-02-23 2019-03-01 17.37829 26.74711
Now let’s pick a different random start and end date for each location’s averaging period. We’ll pick start dates at random and define the end the date as 3 years after each start date, thus creating different three-year intervals for every location.
averaging_periods3 <- data.table(location_id=1:n_locs,
start=sample(
x=seq(as.IDate("2000-01-01"),as.IDate("2019-12-31"),by=1),
size=n_locs
)
)
averaging_periods3[,end:=start+round(3*365.25)]
averaging_periods3
#> location_id start end
#> <int> <IDat> <IDat>
#> 1: 1 2006-12-11 2009-12-11
#> 2: 2 2018-02-17 2021-02-17
#> 3: 3 2019-12-23 2022-12-23
#> 4: 4 2019-01-08 2022-01-08
#> 5: 5 2017-03-16 2020-03-16
#> ---
#> 1996: 1996 2012-09-19 2015-09-20
#> 1997: 1997 2003-04-09 2006-04-09
#> 1998: 1998 2001-03-31 2004-03-31
#> 1999: 1999 2016-12-11 2019-12-12
#> 2000: 2000 2012-06-11 2015-06-12
Because we’ve already generated y
(averaging_period3
) to contain the desired averaging
interval for each value of the grouping variable
location_id
, it’s ready to be used as an argument to
intervalaverag
:
intervalaverage(exposure_dataset3,
averaging_periods3,
interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
group_vars="location_id")[,list(location_id,start,end,pm25,no2)]
#> Key: <location_id, start, end>
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2006-12-11 2009-12-11 14.92398 25.08483
#> 2: 2 2018-02-17 2021-02-17 NA NA
#> 3: 3 2019-12-23 2022-12-23 NA NA
#> 4: 4 2019-01-08 2022-01-08 NA NA
#> 5: 5 2017-03-16 2020-03-16 NA NA
#> ---
#> 1996: 1996 2012-09-19 2015-09-20 15.02879 25.05804
#> 1997: 1997 2003-04-09 2006-04-09 14.88153 24.99050
#> 1998: 1998 2001-03-31 2004-03-31 15.05534 24.90188
#> 1999: 1999 2016-12-11 2019-12-12 NA NA
#> 2000: 2000 2012-06-11 2015-06-12 15.09675 25.04789
You shouldn’t be surprised to see some missingness since the earliest
possible latest possible averaging period start date is Dec 31, 2019 but
the exposures stop in early 2019. The required_percentage
argument could be set here to something less than the default of 100 to
compute partial averages and get fewer missing values.
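For example (95 is an arbitrary threshold chosen just for illustration), the same call with partial averages allowed looks like this:
intervalaverage(exposure_dataset3,
                averaging_periods3,
                interval_vars=c("start","end"),
                value_vars=c("pm25","no2"),
                group_vars="location_id",
                required_percentage = 95)[,list(location_id,start,end,pm25,no2)]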
Finally, a quick trick if you’d like to calculate 1-year, 2-year, and 3-year averages all at once, starting with a fixed set of end dates:
averaging_periods3[, avg3yr:=end-round(3*365.25)]
averaging_periods3[, avg2yr:=end-round(2*365.25)]
averaging_periods3[, avg1yr:=end-round(1*365.25)]
#reshape the data.table:
averaging_periods4 <- melt(averaging_periods3,id.vars=c("location_id","end"),
measure.vars = c("avg3yr","avg2yr","avg1yr"))
setnames(averaging_periods4, "value","start")
setnames(averaging_periods4, "variable","averaging_period")
averaging_periods4
#> location_id end averaging_period start
#> <int> <IDat> <fctr> <IDat>
#> 1: 1 2009-12-11 avg3yr 2006-12-11
#> 2: 2 2021-02-17 avg3yr 2018-02-17
#> 3: 3 2022-12-23 avg3yr 2019-12-23
#> 4: 4 2022-01-08 avg3yr 2019-01-08
#> 5: 5 2020-03-16 avg3yr 2017-03-16
#> ---
#> 5996: 1996 2015-09-20 avg1yr 2014-09-20
#> 5997: 1997 2006-04-09 avg1yr 2005-04-09
#> 5998: 1998 2004-03-31 avg1yr 2003-04-01
#> 5999: 1999 2019-12-12 avg1yr 2018-12-12
#> 6000: 2000 2015-06-12 avg1yr 2014-06-12
intervalaverage(exposure_dataset3,averaging_periods4,interval_vars=c("start","end"),
value_vars=c("pm25","no2"),
group_vars=c("location_id"),
required_percentage = 75)[,list(location_id,start,end,pm25,no2)]
#> Key: <location_id, start, end>
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2006-12-11 2009-12-11 14.92398 25.08483
#> 2: 1 2007-12-12 2009-12-11 14.96129 25.01816
#> 3: 1 2008-12-11 2009-12-11 14.99041 25.02960
#> 4: 2 2018-02-17 2021-02-17 NA NA
#> 5: 2 2019-02-18 2021-02-17 NA NA
#> ---
#> 5996: 1999 2017-12-12 2019-12-12 NA NA
#> 5997: 1999 2018-12-12 2019-12-12 NA NA
#> 5998: 2000 2012-06-11 2015-06-12 15.09675 25.04789
#> 5999: 2000 2013-06-12 2015-06-12 15.07803 25.03959
#> 6000: 2000 2014-06-12 2015-06-12 14.95297 25.14248
So far we’ve fully covered the functionality of the
intervalaverage
function and how to use it when we want to
average over time at specific locations.
The above examples also cover the approach we’d use if we wanted to average over a cohort of study participants for whom we only have a single address (and we are ok assuming that participants never move).
However, in cohort studies, each participants shares their past locations/addresses and indicated the time periods over which they lived at each of those addresses. Typically this information is represented through a table we refer to as an “address history.”
We’ll start with a very simple example to demonstrate what an address history looks like and how we might use this in exposure averaging. Consider the following address history and exposure datasets:
exposure_dataset5
#> addr_id exp_start exp_end exp_value
#> <int> <int> <int> <num>
#> 1: 1 1 7 1.3345175
#> 2: 1 8 14 1.0233578
#> 3: 1 15 21 1.0334900
#> 4: 2 1 7 -1.3967607
#> 5: 2 8 14 1.2922388
#> 6: 2 15 21 -1.3653823
#> 7: 3 1 7 -0.9738918
#> 8: 3 8 14 0.9785869
#> 9: 3 15 21 -0.2132947
#> 10: 4 1 7 0.2635925
#> 11: 4 8 14 0.2779942
#> 12: 4 15 21 -0.7184920
Note that exposure_dataset5 has three regular (length-7) intervals for each address and corresponding measurements for those periods.
Here is a sample address history table:
address_history0
#> addr_id ppt_id addr_start addr_end
#> <num> <num> <int> <int>
#> 1: 1 1 1 9
#> 2: 2 1 10 11
#> 3: 2 1 12 14
#> 4: 3 2 1 12
#> 5: 5 2 13 15
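If you'd like to follow along, this small address history table can be recreated directly from the printed values (exposure_dataset5 could be rebuilt the same way from its printout above):
address_history0 <- data.table(addr_id=c(1,2,2,3,5),
                               ppt_id=c(1,1,1,2,2),
                               addr_start=c(1L,10L,12L,1L,13L),
                               addr_end=c(9L,11L,14L,12L,15L))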
The first thing to note is that the address intervals (addr_start and addr_end) are non-overlapping, which is good because, as we saw previously, intervalaverage requires non-overlapping intervals. (If the addresses were overlapping, we might consider using the isolateoverlaps function to identify overlapping periods in the address history and make decisions about which address to use in each overlapping period.)
The address_history0 table has one participant with three rows and two addresses (addr_ids 1 and 2). In practice, the data would be the same if the two intervals where ppt_id==1 & addr_id==2 were stored as a single row covering the interval [10,14]; the dataset has been created like this to demonstrate that a single address represented over several non-overlapping intervals doesn't cause problems. The second participant also has two addresses (addr_ids 3 and 5).
The goal here is to get exposures merged and clipped to the address intervals, but the problem is that the address intervals don't line up nicely with the exposure intervals. Participant 1 lived at address 1 over [1,9], but exposure is measured over [1,7] and [8,14]. The solution is to create two rows for that participant: one row for [1,7] and a second row for [8,9]. This can be accomplished using the intervalintersect function:
exposure_addresss_table <- intervalintersect(exposure_dataset5,
address_history0,
interval_vars=c(exp_start="addr_start",
exp_end="addr_end"),
"addr_id")
exposure_addresss_table
#> Key: <addr_id, start, end>
#> addr_id start end exp_value ppt_id
#> <int> <int> <int> <num> <num>
#> 1: 1 1 7 1.3345175 1
#> 2: 1 8 9 1.0233578 1
#> 3: 2 10 11 1.2922388 1
#> 4: 2 12 14 1.2922388 1
#> 5: 3 1 7 -0.9738918 2
#> 6: 3 8 12 0.9785869 2
intervalintersect takes every possible combination of overlapping intervals within group_vars (in this sense it is a cartesian join; more on this below). intervalintersect is also an inner join, because rows from either table that are not joined are not included in the output. For example, rows where addr_id==4 in exposure_dataset5 are not included, since there are no rows in address_history0 where addr_id==4.
Additionally, none of the exposure periods from exposure_dataset5 measured over [15,21] are included in the result, because none of the address history intervals overlap with those periods. Finally, there is an address (addr_id==5 from ppt_id==2) in the address_history0 table that isn't in the exposure_dataset5 table. This address is also excluded from the result, since exposure estimates do not exist for that address.
It’s worth doing some checks after completion of the intersection to identify what information has been dropped by the inner join:
setdiff(address_history0$addr_id,exposure_addresss_table$addr_id)
#> [1] 5
setdiff(exposure_dataset5$addr_id,exposure_addresss_table$addr_id)
#> [1] 4
Finally, note that the syntax of interval_vars allows those columns to be named differently in x and y via a named vector: interval_vars=c(exp_start="addr_start", exp_end="addr_end"). This is useful for keeping track of interval names, since the return data.table has three sets of intervals: those from x, those from y, and their intersections.
Starting with the locations in exposure_dataset3, let's generate 300 participants and a random number of locations at which each participant lived:
n_ppt <- 300
addr_history <- data.table(ppt_id=paste0("ppt",sprintf("%03d", 1:n_ppt)))
addr_history[, n_addr := rbinom(.N,size=length(unique(exposure_dataset3$location_id)),prob=.001)]
addr_history[n_addr <1L, n_addr := 1L]
#replace=TRUE because it's possible for the same address to be lived at during multiple different time intervals:
addr_history <- addr_history[,
list(location_id=sample(exposure_dataset3$location_id,n_addr,replace=TRUE)),
by="ppt_id"
]
addr_history
#> ppt_id location_id
#> <char> <int>
#> 1: ppt001 1793
#> 2: ppt001 337
#> 3: ppt002 1492
#> 4: ppt002 1943
#> 5: ppt002 1853
#> ---
#> 616: ppt299 1047
#> 617: ppt299 878
#> 618: ppt300 1228
#> 619: ppt300 1194
#> 620: ppt300 241
#note that not all of these 2000 locations in exposure_dataset3 were "lived at" in this cohort:
length(unique(addr_history$location_id))
#> [1] 535
#also note that it's possible for different participants to live at the same address.
addr_history[,list(loc_with_more_than_one_ppt=length(unique(ppt_id))>1),
by=location_id][,
sum(loc_with_more_than_one_ppt)]
#> [1] 77
#Because of the way I generated this data, it's way more common than you'd expect in a real cohort
#but it does happen especially in cohorts with familial recruitment or people living in nursing home complexes.
I’ve generated intervals which are non-overlapping representing the participant address history (the code to achieve this is hidden because it’s complicated and not the point of this vignette):
addr_history
#> Key: <ppt_id, addr_start>
#> ppt_id location_id addr_start addr_end addr_id
#> <char> <int> <IDat> <IDat> <char>
#> 1: ppt001 1793 1961-04-01 1986-07-03 ppt001_2
#> 2: ppt001 337 1989-06-14 9999-01-01 ppt001_1
#> 3: ppt002 1492 1970-07-01 1978-04-12 ppt002_2
#> 4: ppt002 1943 1980-07-02 1981-12-07 ppt002_4
#> 5: ppt002 1853 1984-07-10 1990-08-19 ppt002_3
#> ---
#> 616: ppt299 1047 1974-06-28 1983-01-04 ppt299_2
#> 617: ppt299 878 1988-08-05 9999-01-01 ppt299_1
#> 618: ppt300 1228 1972-09-01 1975-04-17 ppt300_3
#> 619: ppt300 1194 1975-08-27 1989-11-11 ppt300_2
#> 620: ppt300 241 2004-06-20 9999-01-01 ppt300_1
Oftentimes participant addresses are given their own keys that are distinct from location_id, and that's represented in the above table. This means that a single location_id may map to multiple addr_ids.
It's important for these address tables to be non-overlapping within participant. As shown previously, the intervalaverage package provides a function for that check:
is.overlapping(addr_history,interval_vars=c("addr_start","addr_end"),
group_vars="ppt_id") #FALSE is good--it means there's no overlap of dates within ppt.
#> [1] FALSE
This table passes that check because I've generated the data to be non-overlapping, but oftentimes people report overlapping address histories and analytic decisions need to be made to de-overlap them (again, the isolateoverlaps function would be useful for isolating sections of overlapping address intervals).
So far we’ve seen exposure datasets stored by
location_id
but it’s possible also to store exposures
stored by addr_id
(such that the series of exposure
estimates for a single locations may be repeated multiple time if that
location_id
maps to multiple addr_id
s )
exposure_dataset3_addr <- unique(addr_history[, list(location_id,addr_id)])[exposure_dataset3,
on=c("location_id"),
allow.cartesian=TRUE,
nomatch=NULL]
exposure_dataset3
#> location_id start end pm25 no2
#> <int> <IDat> <IDat> <num> <num>
#> 1: 1 2000-01-01 2000-01-07 14.63278 26.21289
#> 2: 1 2000-01-08 2000-01-14 13.95587 24.37274
#> 3: 1 2000-01-15 2000-01-21 15.56972 26.71116
#> 4: 1 2000-01-22 2000-01-28 14.86495 24.60563
#> 5: 1 2000-01-29 2000-02-04 17.40162 22.67851
#> ---
#> 1999996: 2000 2019-01-26 2019-02-01 15.00646 25.27885
#> 1999997: 2000 2019-02-02 2019-02-08 16.41525 26.14573
#> 1999998: 2000 2019-02-09 2019-02-15 15.29644 24.84448
#> 1999999: 2000 2019-02-16 2019-02-22 15.05604 24.01209
#> 2000000: 2000 2019-02-23 2019-03-01 17.37829 26.74711
exposure_dataset3_addr
#> location_id addr_id start end pm25 no2
#> <int> <char> <IDat> <IDat> <num> <num>
#> 1: 1 ppt105_1 2000-01-01 2000-01-07 14.63278 26.21289
#> 2: 1 ppt105_1 2000-01-08 2000-01-14 13.95587 24.37274
#> 3: 1 ppt105_1 2000-01-15 2000-01-21 15.56972 26.71116
#> 4: 1 ppt105_1 2000-01-22 2000-01-28 14.86495 24.60563
#> 5: 1 ppt105_1 2000-01-29 2000-02-04 17.40162 22.67851
#> ---
#> 619996: 1997 ppt150_1 2019-02-09 2019-02-15 14.45723 25.47479
#> 619997: 1997 ppt054_2 2019-02-16 2019-02-22 14.05648 26.26435
#> 619998: 1997 ppt150_1 2019-02-16 2019-02-22 14.05648 26.26435
#> 619999: 1997 ppt054_2 2019-02-23 2019-03-01 15.54294 22.57325
#> 620000: 1997 ppt150_1 2019-02-23 2019-03-01 15.54294 22.57325
Note that exposure_dataset3_addr contains repeat locations, whereas exposure_dataset3 contains exactly one location per time point:
exposure_dataset3[, sum(duplicated(location_id)),by=c("start")][,max(V1)] #no duplicate locations at any date
#> [1] 0
exposure_dataset3_addr[, sum(duplicated(location_id)),by=c("start")][,max(V1)]
#> [1] 85
exposure_dataset3_addr has duplicate locations since multiple participants may live at the same location, or because a single participant lives at the same location during multiple time intervals.
(Storing exposure data by addr_id rather than location_id takes up more space, but may be beneficial for constraining exposure model revisions to be the same within cohorts.)
Indeed, location_id may not even be present in the address table at all if exposures and addresses are keyed entirely by addr_id.
Even if this distinction between how exposures are stored doesn't seem relevant, this section demonstrates that intervalintersect is actually a cartesian join (in addition to being an inner interval join).
Whether the exposure dataset is stored by address or by location, intervalintersect will merge values from the exposure dataset to every address. In the case of the exposure table being stored by addr_id, this is a simple one-to-one merge (since for every address in the address history there's a set of exposures in the exposure table). But in the case of the exposure table being stored by location_id, this becomes a one-to-many merge (since a single location_id may merge to multiple addresses in the address history table).
This works because the function that intervalintersect relies on (data.table::foverlaps) performs an inner cartesian merge: that is, it only takes rows which match on the keying variables, but it also does a cartesian expansion if there are multiple matches in both tables.
z <- intervalintersect(x=exposure_dataset3,
y=addr_history,
interval_vars=c(
start="addr_start",
end="addr_end"
),
group_vars=c("location_id"),
interval_vars_out=c("start2","end2")
)
z_addr <- intervalintersect(x=exposure_dataset3_addr,
y=addr_history,
interval_vars=c(
start="addr_start",
end="addr_end"
),
group_vars=c("addr_id"),
interval_vars_out=c("start2","end2")
)
setkey(z,ppt_id,start2,end2)
setkey(z_addr,ppt_id,start2,end2)
all.equal(z,z_addr)
#> [1] "Different column order"
z
#> Key: <ppt_id, start2, end2>
#> location_id start2 end2 pm25 no2 ppt_id addr_id
#> <int> <IDat> <IDat> <num> <num> <char> <char>
#> 1: 337 2000-01-01 2000-01-07 15.15344 26.26791 ppt001 ppt001_1
#> 2: 337 2000-01-08 2000-01-14 13.68185 23.93172 ppt001 ppt001_1
#> 3: 337 2000-01-15 2000-01-21 13.06989 24.23879 ppt001 ppt001_1
#> 4: 337 2000-01-22 2000-01-28 15.90837 26.32763 ppt001 ppt001_1
#> 5: 337 2000-01-29 2000-02-04 15.50739 24.37651 ppt001 ppt001_1
#> ---
#> 264965: 241 2019-01-26 2019-02-01 14.57620 25.32593 ppt300 ppt300_1
#> 264966: 241 2019-02-02 2019-02-08 14.78497 26.70319 ppt300 ppt300_1
#> 264967: 241 2019-02-09 2019-02-15 14.55443 24.04428 ppt300 ppt300_1
#> 264968: 241 2019-02-16 2019-02-22 14.07411 26.19916 ppt300 ppt300_1
#> 264969: 241 2019-02-23 2019-03-01 13.29841 25.35809 ppt300 ppt300_1
(Note that this example of a one-to-many join isn't strictly a true "cartesian" join, but intervalintersect is capable of doing a true many-to-many cartesian expansion if provided the right datasets, although it's not clear that would actually make sense in this context.)
In any case, the moral here is that whether the exposures are stored by location or by address, using intervalintersect results in a dataset containing the relevant exposures clipped to each address period.
This dataset can then be used to calculate averages over the addresses where a participant lived during a given period:
final_averaging_periods <- data.table(ppt_id=sort(unique(addr_history$ppt_id)))
final_averaging_periods[, end2:=sample(seq(as.IDate("2003-01-01"),as.IDate("2015-01-01"),by=1),.N)]
final_averaging_periods[,start2:=as.IDate(floor(as.numeric(end2-3*365.25)))]
final_averaging_periods
#> ppt_id end2 start2
#> <char> <IDat> <IDat>
#> 1: ppt001 2011-12-22 2008-12-21
#> 2: ppt002 2004-12-28 2001-12-28
#> 3: ppt003 2008-02-24 2005-02-23
#> 4: ppt004 2010-09-17 2007-09-17
#> 5: ppt005 2010-03-18 2007-03-18
#> ---
#> 296: ppt296 2013-01-14 2010-01-14
#> 297: ppt297 2004-05-14 2001-05-14
#> 298: ppt298 2009-10-27 2006-10-27
#> 299: ppt299 2004-08-07 2001-08-07
#> 300: ppt300 2006-05-14 2003-05-14
intervalaverage(z,final_averaging_periods, interval_vars=c("start2","end2"),
value_vars=c("pm25","no2"),group_vars="ppt_id",required_percentage = 95
)
#> Key: <ppt_id, start2, end2>
#> ppt_id start2 end2 pm25 no2 yduration xduration
#> <char> <IDat> <IDat> <num> <num> <num> <int>
#> 1: ppt001 2008-12-21 2011-12-22 15.10230 24.95967 1097 1097
#> 2: ppt002 2001-12-28 2004-12-28 14.95920 24.96800 1097 1097
#> 3: ppt003 2005-02-23 2008-02-24 14.87310 25.03825 1097 1097
#> 4: ppt004 2007-09-17 2010-09-17 14.92221 24.96349 1097 1097
#> 5: ppt005 2007-03-18 2010-03-18 14.97623 24.93725 1097 1097
#> ---
#> 296: ppt296 2010-01-14 2013-01-14 NA NA 1097 0
#> 297: ppt297 2001-05-14 2004-05-14 15.09852 24.93369 1097 1097
#> 298: ppt298 2006-10-27 2009-10-27 NA NA 1097 552
#> 299: ppt299 2001-08-07 2004-08-07 15.11682 25.05562 1097 1097
#> 300: ppt300 2003-05-14 2006-05-14 NA NA 1097 694
#> nobs_pm25 nobs_no2 xminstart xmaxend
#> <int> <int> <IDat> <IDat>
#> 1: 1097 1097 2008-12-21 2011-12-22
#> 2: 1097 1097 2001-12-28 2004-12-28
#> 3: 1097 1097 2005-02-23 2008-02-24
#> 4: 1097 1097 2007-09-17 2010-09-17
#> 5: 1097 1097 2007-03-18 2010-03-18
#> ---
#> 296: 0 0 <NA> <NA>
#> 297: 1097 1097 2001-05-14 2004-05-14
#> 298: 552 552 2008-04-24 2009-10-27
#> 299: 1097 1097 2001-08-07 2004-08-07
#> 300: 694 694 2004-06-20 2006-05-14
There’s lots of missingness because the address history I generated is nowhere near complete, but this demonstrates how important it is to have a good address history!