# Nested Dirichlet Process for Inhomogenous Poisson Processes

This vignette serves to introduce the bendr R package. bendr is an R package that fits the Nested Dirichlet Process to Built Environment data, modeled as realizations of an Inhomogenous Poisson Process. We begin with substantive motivation and theoretical notation to elucidate some key model components. The data that we generate below is available in the package as school_data.

## Motivation and Notation

The motivation for this model comes from a scientific interest in classifying food environments around school or residential environments. If we pick, say, Fast Food Restaurants (FFRs), we could model the rate at which these restaurants occur around schools by modeling their distances from schools as an inhomogenous poisson process.

Denote $$r_{ij}$$ as the distance between the $$i$$th FFR and $$j$$th school, where all $$r_{ij} <R$$. This $$R$$ is something like a 1 or 2 mile radius chosen based on data available at hand and the scientific question of interest. If we model these distances as realizations from an inhomogenous poisson process, then the likelihood will decompoase as follows:

$p(\{r_{ij}\}|\gamma,f_j(\cdot)) \propto \prod_{j=1}^{J}\gamma_j^{n_j}\exp\{-\gamma_j\}\prod_{i=1}^{n_j}f_j(r_{ij})$ In the above, $$f_j(\cdot)$$ is the spatial FFR incidence density (normalized intensity function) of the $$j$$th school, $$\gamma_j$$ is the expected number of FFRs within radius $$R$$ and $$n_j$$ is the number of FFRs observed around the $$j$$th school within the $$R$$ distance boundary.

Since the $$\gamma_j$$ and the $$f_j$$ are multiplied together, they’re independent and can be modeled separately. The rndpp package offers solutions for both, but the modeling of $$\gamma_j$$ in rndpp is akin to a generalized linear model with a gamma distribution and log link function : $$\gamma_j = \exp(x_j^{T}\beta)$$. Since this kind of model is fairly common, we won’t spend much time writing about it here, beyond showing what function call you can use to estimate the coefficients, $$\beta$$ in rndpp.

In contrast, modeling the $$f_j(\cdot)$$ will be our main focus, as it offers us a lot more opportunities to investigate how the rate of FFRs around schools change as a function of distance. Let’s start loading in some libraries and simulating some data before elaborating further.

library(bendr)
library(dplyr)
library(ggplot2)
library(tidyr)

## Simulation Set-up

We’ll simulate distances from the following 3 intensity functions, each composed themselves of a mixture of simpler densities: $f_1(d) = \frac{1}{2}\text{beta}(d|1,8) + \frac{1}{2}\text{beta}(d|6,1), \\ f_2(d) = \frac{1}{5}\text{beta}(d|3,2) + \frac{2}{3}\text{beta}(d|3,1) + \frac{2}{15}\text{beta}(d|1,1)\\ f_3(d) = \frac{1}{2}\text{beta}(d|8,2) + \frac{1}{2}\text{dbeta}(d,30,50).$ We’ll plot these below, choosing a 5 mile distance boundary, but any boundary could be used in principal.

Each intensity function corresponds to a different kind of FFR exposure around schools. Schools around the first intensity function will have a lot of FFRs within the first mile or so from their school, but then not many until getting about 4 or more miles from school. Similarly, school’s that have FFRs simulated from cluster 3 will have a lot of FFrs at around 2 miles away from them, but almost none within 1 mile.

These different kinds of patterns may be of scientific interest because they could help explain why certain kids are more likely to go to FFRs and others aren’t.

Let’s assume a constant $$\gamma$$ across all the schools, simulating 50 schools from each intensity function.

We can see that the empirical estimates (when we know which cluster each school belongs to), correspond well to the real densities. Now how are we going to model these densities and cluster each school to it’s appropriate cluster?

## The Nested Dirichlet Process

The Nested Dirichlet Process (NDP) is an extension of the Dirichlet Process, that uses a Dirichlet Process as the base measure for another Dirichlet Process - hence the nested name. The mathematic formulation is abstract, but can be seen below. If you don’t have familiarity with the Dirichlet Process or Random Measures (especially random stick-breaking measures), the $$\delta(\cdot)$$ function is a point mass indicator that that only exists at the subscript value and the $$*$$ on the $$\pi_k^*$$ and other values reflects the idea that there may be multiple duplicate $$\pi_k$$’s and the $$\pi_k^*$$ are the unique values. For further reading see Chapter 23 of Bayesian Data Analysis (3rd edition), for now let’s see the math:

$G_j \sim Q = DP(\alpha,DP(\rho,G_0))\\ Q = \sum_{k=1}^{\infty} \pi^*_k\delta_{G^*_j(\cdot)}(\cdot) \\ G_j = \sum_{l=1}^{\infty} w^*_{lk}\delta_{\theta_{lk}^*}(\cdot)$

Essentially we’re going to use the Nested Dirichlet process to draw several different clusters of mixing measures to combine several simpler densities (like the Normal) that will ultimately be used to estimate the inhomogenous poisson process intensity function.

The math is below. For computational purposes we’re going to use a Normal mixing kernel and truncate the infinite mixture above with some finite number of components $$K$$ and $$L$$, for the two different indices, respectively. If we wanted to use a different mixing measure, the rndpp package also offers the use of a Beta mixing measure with either a global or cluster specific variance parameter. Much of the code we’re using here will be the same, but you can see rndpp::beta_nd_nhpp for more information.

$f_j(r) = \int \mathcal{N}(r|\mu,\sigma^2)dG_j((\mu,\sigma^2))\\ G_j \sim Q = \sum_{k=1}^{K} \pi_k^*\delta_{G^*_j(\cdot)}(\cdot)\\ G_j = \sum_{l=1}^{L} w_{lk}^*\delta_{(\mu,\sigma^2)^*}(\cdot)$

We can use a smaller number of $$L$$ and $$K$$ stick components for computational convenience and since we know the true number of clusters - in a real data analysis you would want to use more. We’ll set $$\alpha=\rho=1$$ to encourage a small number of clusters, set non-informative conjugate priors and fit the model as follows.

fit <- nd_nhpp_fixed(distances_col = "distances",
id_col = "school_id",
data = school_data,
L = 5, K = 5,
alpha = 1, rho = 1,
mu_0 = 0, kappa_0 = 1, ## Normal Mean Base Measure Hyperparameters
sigma_0 = 1, nu_0 = 1, ## Inverse Chi square prior Hyperparameters
iter_max = 5E4, ## To get good resolution, could possibly use more or less depending on convergence
warm_up = 4E4,
thin = 1,
seed = 34143)
#> Beginning Sampling
#> ----------------------------------------------------------------------
#> [Chain 1] Beginning of iteration: 1 / 50000 (0%)  [Warmup]
#> [Chain 1] Beginning of iteration: 5000 / 50000 (10%)  [Warmup]
#> [Chain 1] Beginning of iteration: 10000 / 50000 (20%)     [Warmup]
#> [Chain 1] Beginning of iteration: 15000 / 50000 (30%)     [Warmup]
#> [Chain 1] Beginning of iteration: 20000 / 50000 (40%)     [Warmup]
#> [Chain 1] Beginning of iteration: 25000 / 50000 (50%)     [Warmup]
#> [Chain 1] Beginning of iteration: 30000 / 50000 (60%)     [Warmup]
#> [Chain 1] Beginning of iteration: 35000 / 50000 (70%)     [Warmup]
#> [Chain 1] Beginning of iteration: 40000 / 50000 (80%)     [Warmup]
#> [Chain 1] Beginning of iteration: 40001 / 50000 (80%)     [Sampling]
#> [Chain 1] Beginning of iteration: 45000 / 50000 (90%)     [Sampling]
#> [Chain 1] Beginning of iteration: 50000 / 50000 (100%)    [Sampling]

The ndp model object returned by nd_nhpp contains the parameters of interest as coda::mcmc objects, so that the coda functions can be called on them easily. For example, we can look at the $$\pi_k$$ parameters that denote the probability of a school being assigned to the $$k$$th cluster.

summary(fit$pi) #> #> Iterations = 1:10000 #> Thinning interval = 1 #> Number of chains = 1 #> Sample size per chain = 10000 #> #> 1. Empirical mean and standard deviation for each variable, #> plus standard error of the mean: #> #> Mean SD Naive SE Time-series SE #> pi k: 1 0.35792 0.04725 0.0004725 0.0008328 #> pi k: 2 0.34085 0.07665 0.0007665 0.0082088 #> pi k: 3 0.17393 0.11691 0.0011691 0.0257276 #> pi k: 4 0.08646 0.11359 0.0011359 0.0224056 #> pi k: 5 0.04085 0.07303 0.0007303 0.0099824 #> #> 2. Quantiles for each variable: #> #> 2.5% 25% 50% 75% 97.5% #> pi k: 1 0.2682407 0.325072 0.35698 0.38972 0.4530 #> pi k: 2 0.0881584 0.317389 0.35155 0.38437 0.4473 #> pi k: 3 0.0017461 0.039976 0.21255 0.27000 0.3482 #> pi k: 4 0.0001307 0.003152 0.01643 0.19049 0.3357 #> pi k: 5 0.0001101 0.002627 0.01076 0.03857 0.2794 We can also call rndpp specific functions that will help us check whether our model is correctly estimating the densities. To begin with, we can look at an estimate of the global density estimate (averaging over all clusters’ densities): plot_global_density(fit,r = school_data$distances)

The above plot looks similar to the empirical estimate. This is a good sign! While the height might not be the same, we have to remember that since we’re using a normal mixing measure, we’re likely to see less accuracy in the density estimation near the boundaries. This might be a good reason to try the beta mixing measure instead, though it should be noted that since the beta mixing measure is not conjugate, its sampling is typically less efficient.

Let’s now see what our estimates of the different plots look like. As you can see from the summary of the pi’s there are nonzero means for all five probabilities, but the medians of the 4th and 5th cluster are pretty small. We only want to look at the clusters that have a meaningful $$\pi_k$$. We can use the plot_cluster_densities, which has the argument pi_threshold to allow users to only look at clusters with a certain probability of assignment. The default is 0.1, which should be sufficient in this case.

plot_cluster_densities(fit,switch = 'color')

This looks good! Both the u shaped intensity functions are well defined and we can tell that even though the monotonicaly increasing intensity function has a decay at the end of the bounary, likely because of the edge effect, the model is still able to capture the general pattern.

It is important to note that since we transformed the distances, the previous plot of the densities does not show the “true” densities, but only densities that are proportional on the domain. The true densities can be plotted using the plot_cluster_densities function with the arguments transform = TRUE as listed below.

plot_cluster_densities(fit,switch='facet',transform = TRUE)

In order to determine how well the model is able to discriminate between the differing clusters, we can plot the pairwise probability of each school being assigned to the same cluster as another school. Since we simulated fifty schools from each intensity function we would expect to see fifty by fifty blocks of high probability along the diagnoal.

plot_pairs(fit,sort = F)

Looking at the plot, we can see that the model is able to discriminate between the differing intensity functiosn fairly well, though it has a hard time discriminating between the second and third cluster. We can also sort the pairwise probability plot to try and identify groups - since in real world data we won’t know where the “true clusters” really are.

plot_pairs(fit,sort = T)

Turning our attention now to estimationg of $$\gamma_j$$, the mean number of FFRs, we can easily fit this model via the nhpp_hmc function. As said, this isn’t anything particularly novel, so we provide the syntax below for a simple intercept model, but a more complicated regression could be fit as well.

df <- data.frame(y=school_data %>% group_by(school_id) %>% count() %>% pull(n))
reg_fit <- nhpp_hmc(y~1,data=df,iter_max=2E3,warm_up = 1E3)
#> [Chain 1] Beginning of iteration: 1 / 2000 (0%)   [Warmup]
#> [Chain 1] Beginning of iteration: 200 / 2000 (10%)    [Warmup]
#> [Chain 1] Beginning of iteration: 400 / 2000 (20%)    [Warmup]
#> [Chain 1] Beginning of iteration: 600 / 2000 (30%)    [Warmup]
#> [Chain 1] Beginning of iteration: 800 / 2000 (40%)    [Warmup]
#> [Chain 1] Beginning of iteration: 1000 / 2000 (50%)   [Warmup]
#> [Chain 1] Beginning of iteration: 1001 / 2000 (50%)   [Sampling]
#> [Chain 1] Beginning of iteration: 1200 / 2000 (60%)   [Sampling]
#> [Chain 1] Beginning of iteration: 1400 / 2000 (70%)   [Sampling]
#> [Chain 1] Beginning of iteration: 1600 / 2000 (80%)   [Sampling]
#> [Chain 1] Beginning of iteration: 1800 / 2000 (90%)   [Sampling]
#> [Chain 1] Beginning of iteration: 2000 / 2000 (100%)  [Sampling]
reg_fit
#>
#>  groups:
#> ------
#> Regression Statistics
#> ------
#>             Median SD
#> (Intercept) 2.28   0.03
#> ---
#>
#> Sample avg. posterior predictive distribution of y:
#>          Median SD
#> mean_PPD 9.82   0.28
#>
#> ------

To provide a point of reference, this model can be replicated in standard libraries, e.g. stats::glm.

reg_fit2 <- glm(y~1,data=df,family=poisson(link='log'))
reg_fit2
#>
#> Call:  glm(formula = y ~ 1, family = poisson(link = "log"), data = df)
#>
#> Coefficients:
#> (Intercept)
#>       2.286
#>
#> Degrees of Freedom: 149 Total (i.e. Null);  149 Residual
#> Null Deviance:       142.8
#> Residual Deviance: 142.8     AIC: 758.4