Unconditional Bernoulli Tree-Based Scan Statistics for R
Usage
TreeMineR(
  data,
  tree,
  p = NULL,
  n_exposed = NULL,
  n_unexposed = NULL,
  dictionary = NULL,
  delimiter = "/",
  n_monte_carlo_sim = 9999,
  random_seed = FALSE,
  return_test_dist = FALSE,
  future_control = list(strategy = "sequential")
)
Arguments
- data
The dataset used for the computation. The dataset needs to include the following columns:
id
An integer that is unique to every individual.
leaf
A string identifying the unique diagnoses or leaves for each individual.
exposed
A 0/1 indicator of the individual's exposure status.
See below for the first and last rows included in the example dataset.
- tree
A dataset with one variable, pathString, defining the tree structure that you would like to use. This dataset can, e.g., be created using create_tree.
- p
The proportion of exposed individuals in the dataset. Will be calculated based on n_exposed and n_unexposed if both are supplied.
- n_exposed
Number of exposed individuals (Optional).
- n_unexposed
Number of unexposed individuals (Optional).
- dictionary
A data.frame that includes one node column and a title column, which are used for labeling the cuts in the output of TreeMineR.
- delimiter
A character defining the delimiter of different tree levels within your pathString. The default is "/".
- n_monte_carlo_sim
The number of Monte-Carlo simulations to be used for calculating P-values.
- random_seed
Random seed used for the Monte-Carlo simulations.
- return_test_dist
If TRUE, a data.frame of the maximum log-likelihood ratios in each Monte Carlo simulation will be returned. This distribution of the maximum log-likelihood ratios is used for estimating the P-value reported in the result table.
- future_control
A list of arguments passed to future::plan. This is useful if one would like to parallelise the Monte-Carlo simulations to decrease the computation time. The default is a sequential run of the Monte-Carlo simulations.
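The inputs described above can be sketched with a toy example. Everything in it is hypothetical and only illustrates the documented shapes: the diagnosis codes, the object names toy_data and toy_tree, and the assumption that p is derived as n_exposed / (n_exposed + n_unexposed). Whether intermediate tree nodes need their own pathString rows is not stated here, so the sketch lists leaf paths only.

```r
# Toy input in the documented long format: one row per individual and leaf.
# The codes below are made up for illustration.
toy_data <- data.frame(
  id      = c(1L, 1L, 2L, 3L, 3L, 4L),
  leaf    = c("K251", "I10", "K251", "I10", "J45", "J45"),
  exposed = c(1L, 1L, 0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Toy tree: one pathString per leaf, levels separated by the default "/".
toy_tree <- data.frame(
  pathString = c("ICD/K/K25/K251", "ICD/I/I10", "ICD/J/J45"),
  stringsAsFactors = FALSE
)

# Splitting on the delimiter recovers the levels of each path.
tree_levels <- strsplit(toy_tree$pathString, "/", fixed = TRUE)

# Count individuals (not rows) by exposure status.
exposure_by_id <- unique(toy_data[, c("id", "exposed")])
n_exposed   <- sum(exposure_by_id$exposed == 1L)
n_unexposed <- sum(exposure_by_id$exposed == 0L)

# Assumed derivation of p from the two counts.
p <- n_exposed / (n_exposed + n_unexposed)

# A future_control value requesting a parallel plan; "multisession" and
# "workers" are arguments of the future package, not of TreeMineR itself.
parallel_control <- list(strategy = "multisession", workers = 2)
```

Counting unique id values rather than rows matters here because an individual with several diagnoses contributes several rows but only one exposure status.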
Value
A data.frame with the following columns:
cut
The name of the cut G.
n1
The number of exposed events belonging to cut G.
n0
The number of unexposed events belonging to cut G.
risk1
The absolute risk of getting an event belonging to cut G among the exposed.
risk0
The absolute risk of getting an event belonging to cut G among the unexposed.
RR
The risk ratio, i.e., the absolute risk among the exposed divided by the absolute risk among the unexposed.
llr
The log-likelihood ratio comparing the observed and expected number of exposed events belonging to cut G.
p
The P-value for cut G being a cluster of events, estimated from the Monte Carlo simulations.
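As an illustration of the llr column, the following sketch computes the log-likelihood ratio of the unconditional Bernoulli tree-based scan statistic for a single cut. It is a minimal sketch, not the package's internal code: the helper name bernoulli_llr is hypothetical, and it assumes that only cuts with an excess of exposed events (n1 / (n1 + n0) > p) receive a positive value.

```r
# Unconditional Bernoulli log-likelihood ratio for one cut (hypothetical helper).
# n1: exposed events in the cut
# n0: unexposed events in the cut
# p:  known probability that an event belongs to an exposed individual
bernoulli_llr <- function(n1, n0, p) {
  n <- n1 + n0
  # No events, or no excess of exposed events: no evidence for a cluster.
  if (n == 0 || n1 / n <= p) {
    return(0)
  }
  # k * log(k / expected), with the convention 0 * log(0) = 0.
  term <- function(k, expected) if (k == 0) 0 else k * log(k / expected)
  term(n1, n * p) + term(n0, n * (1 - p))
}

# Counts taken from the first two rows of the example output, with p = 1/11.
llr_cut_12 <- bernoulli_llr(n1 = 122, n0 = 669, p = 1 / 11)
llr_cut_11 <- bernoulli_llr(n1 = 132, n0 = 782, p = 1 / 11)
```

With these counts the sketch gives roughly 16.19 and 13.66, in line with the llr values printed for cuts 12 and 11 in the Examples section.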
If return_test_dist is TRUE, the function returns a list of two data.frames:
result_table
A data.frame including the results as described above.
test_dist
A data.frame with two columns:
iteration
The number of the Monte Carlo iteration. Note that the calculation based on the original data is not included in this data.frame.
max_llr
The highest observed log-likelihood ratio for each Monte Carlo simulation.
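The relation between test_dist and the reported P-value can be sketched with the usual Monte Carlo rank formula, in which the observed statistic counts as one additional replicate. This is an assumption about how the P-value is formed, not a statement of the package's implementation; mc_p_value and toy_test_dist are hypothetical names, and the simulated values are random placeholders.

```r
# Monte Carlo P-value: share of replicates (simulations plus the observed
# data set) whose maximum LLR is at least as large as the observed LLR.
mc_p_value <- function(llr_obs, max_llr_sim) {
  (1 + sum(max_llr_sim >= llr_obs)) / (1 + length(max_llr_sim))
}

# Placeholder for a test_dist data.frame with 99 simulated maxima.
set.seed(1234)
toy_test_dist <- data.frame(
  iteration = seq_len(99),
  max_llr   = rexp(99, rate = 0.5)
)

# P-value of a hypothetical observed LLR against the placeholder distribution.
p_obs <- mc_p_value(llr_obs = 5, max_llr_sim = toy_test_dist$max_llr)

# An observed LLR above every simulated maximum gives the smallest
# attainable value, 1 / (99 + 1).
p_min <- mc_p_value(llr_obs = 100, max_llr_sim = rep(1, 99))
```

Under this formula the smallest attainable P-value with n_monte_carlo_sim = 99 is 0.01, which matches the p column in the example output.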
References
Kulldorff et al. (2003) A tree-based scan statistic for database disease surveillance. Biometrics 59(2): 323-331. DOI: 10.1111/1541-0420.00039.
Examples
TreeMineR(data = diagnoses,
          tree = icd_10_se,
          p = 1/11,
          n_monte_carlo_sim = 99,
          random_seed = 1234) |>
  head()
#>       cut  n1   n0      llr    p
#> 1      12 122  669 16.18714 0.01
#> 2      11 132  782 13.65786 0.01
#> 3 V01-X59 241 1687 12.26816 0.01
#> 4 V01-V99 210 1438 11.95732 0.01
#> 5      15 133  822 11.79775 0.01
#> 6      19 306 2281 10.80614 0.01
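A variant of the call above that also requests the simulated test distribution might look as follows. It is a sketch that uses only the arguments and list element names documented on this page; the object names extra_args and out are hypothetical, and the call is skipped when the TreeMineR package is not installed.

```r
# Extra arguments on top of data and tree; all are documented above.
extra_args <- list(
  p = 1 / 11,
  n_monte_carlo_sim = 99,
  random_seed = 1234,
  return_test_dist = TRUE
)

if (requireNamespace("TreeMineR", quietly = TRUE)) {
  library(TreeMineR)

  # diagnoses and icd_10_se are the example objects used in the call above.
  out <- do.call(
    TreeMineR,
    c(list(data = diagnoses, tree = icd_10_se), extra_args)
  )

  # With return_test_dist = TRUE the result is a list of two data.frames.
  print(head(out$result_table))
  print(head(out$test_dist))
}
```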