Methodology Note

Shannon Diversity Index for Food Consumption in India

Construction, GAM-Based Modeling, and Prediction Data Generation

Dr Shamika Ravi (Member, EAC to PM) & Dr Mudit Kapoor (CECFEE, EPU, ISI-Delhi Center)
February 2026

1. Introduction

1.1 Motivation

The Shannon Diversity Index provides a single summary of dietary quality that captures both variety and adequacy. Unlike simple adequacy ratios, it distinguishes between a balanced diet and one concentrated in a few food groups. Individual food item models tell us what people eat, but not how balanced their diet is. The Shannon diversity index (Shannon, 1948)^[2] addresses this by summarizing the evenness of consumption across food groups, combined with a requirement-adequacy adjustment.

This methodology document outlines the construction, modeling, and prediction of the Shannon Diversity Index for food consumption in India, using household-level data from the Household Consumer Expenditure Survey (HCES) 2011–12^[6] and 2023–24^[7], benchmarked against ICMR-NIN Recommended Dietary Allowances^[1].

1.2 Food Categories Analyzed

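The analysis covers ten food categories based on the ICMR-NIN recommended balanced diet for an adult woman (55 kg, moderately active, non-lactating). Because Pulses & Beans and Flesh Foods share one combined requirement, the ten categories form nine food groups (\(K = 9\)) in the index: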

Food Category	Daily Requirement (grams)
Cereals & Millets	280
Green Leafy Vegetables	100
Other Vegetables	200
Roots & Tubers (excl. potatoes)	100
Fruits	100
Milk & Milk Products	300
Fats & Oils	25
Oilseeds & Nuts	40
Pulses & Beans	95 (combined with Flesh)
Flesh Foods	95 (combined with Pulses)

Note: Pulses & Beans and Flesh Foods share a combined requirement of 95 grams per day. In the gram-based construction, we allocate this flexibly based on household consumption patterns.

2. Gram-Based Shannon Diversity Index

2.1 Construction (Step by Step)

The gram-based Shannon Diversity Index is constructed through the following sequential steps:

Per-AFE Intake: Divide household-level food quantities by Adult Female Equivalent (AFE) household size to obtain grams per day per AFE for each of the 9 food groups.
Cap at Requirement: Set \(a_i = \min(\text{intake}_i, \text{requirement}_i)\) for each group \(i\). Eating more than required does not inflate the diversity score.
Compute Shares: Calculate \(p_i = \frac{a_i}{\sum_j a_j}\) — the proportion of capped intake from each food group.
Shannon Entropy: Compute \(H = -\sum_i p_i \log(p_i)\), where higher values indicate more even distribution across groups.
Adequacy Adjustment: For each food group, calculate \(\min(\text{intake}_i / \text{requirement}_i, 1)\), then average across all groups to obtain \(A\). This penalizes diets that are diverse but nutritionally insufficient.
Cereal Adjustment: Indian diets are typically dominated by cereals. Even after capping at the requirement (Step 2), cereals still command a large share because their requirement (280 g/day) is much larger than most other food groups (e.g., Fats & Oils at 25 g, Oilseeds & Nuts at 40 g). This means a household could score reasonably well on Shannon entropy simply because it eats a lot of rice or wheat — even if it barely consumes any vegetables, fruits, or protein sources. The cereal adjustment corrects for this by penalizing households whose cereal share exceeds what would be expected under a balanced, requirement-proportional diet.
First, compute the expected cereal share under the ICMR-NIN requirements: \(s^{*}_{\mathrm{cereal}} = \dfrac{r_{\mathrm{cereal}}}{\sum_j r_j} = \dfrac{280}{1240} \approx 0.226\). This is the share cereals would have if every food group were consumed at exactly its required level.

Then, compute how much the household's actual cereal share (from Step 3) exceeds this benchmark: \(\delta_{\mathrm{cereal}} = \max\!\left(p_{\mathrm{cereal}} - s^{*}_{\mathrm{cereal}},\; 0\right)\). If the household's cereal share is at or below the expected share, \(\delta_{\mathrm{cereal}} = 0\) and no penalty is applied.

Finally, convert the excess share into an exponentially decaying penalty: \(C = \exp\!\left(-3 \times \delta_{\mathrm{cereal}}\right)\). The multiplier of 3 controls the severity: a cereal share 10 percentage points above the benchmark yields \(C = e^{-0.3} \approx 0.74\) (a 26% penalty), while a share 20 points above yields \(C = e^{-0.6} \approx 0.55\) (a 45% penalty). The exponential form ensures \(C \in (0, 1]\), with \(C = 1\) when cereal consumption is proportionate and \(C \to 0\) as cereal dominance becomes extreme.

In practice, this adjustment matters most for low-income households that spend the bulk of their food budget on rice or wheat and consume negligible amounts of fruits, vegetables, dairy, and protein-rich foods. Without this correction, such households would receive misleadingly high diversity scores.
Final Index: Calculate \(H_{\text{adj}} = H \times A \times C\).

2.2 Mathematical Formulation

Let \(q_i\) denote the intake (in grams per day per AFE) of food group \(i\), and \(r_i\) denote the ICMR-NIN requirement for that group.

Shannon Entropy:

\[H = -\sum_{i=1}^{K} p_i \log(p_i) \quad \text{where} \quad p_i = \frac{\min(q_i, r_i)}{\sum_j \min(q_j, r_j)}\]

Adequacy Score:

\[A = \frac{1}{K} \sum_{i=1}^K \min\!\left(\frac{q_i}{r_i}, 1\right)\]

Cereal Penalty:

\[C = \exp\!\left(-3 \cdot \max\!\left(p_{\text{cereal}} - \frac{r_{\text{cereal}}}{\sum_j r_j}, 0\right)\right)\]

Combined Diversity Index:

\[H_{\text{adj}} = H \times A \times C\]
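
The formulas above can be sketched in a few lines of R for a single household. This is a minimal illustration, not the authors' implementation: the helper name shannon_gram(), the short group labels, and the intake vector q are assumptions made for the example. The requirement vector uses the nine groups of Section 1.2, with Pulses & Beans and Flesh Foods merged into one 95 g group.

```r
# Requirements (g/day per AFE) for the nine groups; pulses and flesh foods
# are merged into a single 95 g group. Labels are illustrative.
r <- c(cereals = 280, glv = 100, other_veg = 200, roots = 100, fruits = 100,
       milk = 300, fats = 25, oilseeds = 40, pulses_flesh = 95)

# Hypothetical helper, not part of the authors' code base.
shannon_gram <- function(q, r, cereal = "cereals", lambda = 3) {
  a <- pmin(q, r)                           # cap intake at the requirement
  p <- a / sum(a)                           # shares of capped intake
  H <- -sum(ifelse(p > 0, p * log(p), 0))   # entropy, with 0 * log(0) = 0
  A <- mean(pmin(q / r, 1))                 # mean capped adequacy ratio
  delta <- max(p[[cereal]] - r[[cereal]] / sum(r), 0)  # excess cereal share
  C <- exp(-lambda * delta)                 # cereal penalty
  c(H = H, A = A, C = C, H_adj = H * A * C)
}

# Illustrative (made-up) intake: cereal-heavy diet with little fruit or dairy.
q <- c(cereals = 400, glv = 20, other_veg = 80, roots = 30, fruits = 15,
       milk = 90, fats = 25, oilseeds = 5, pulses_flesh = 40)
round(shannon_gram(q, r), 3)
```

Calling shannon_gram(r, r) gives the benchmark case in which every requirement is met exactly, so that \(A = 1\) and \(C = 1\).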

2.3 Interpreting the Components

The Shannon Diversity Index combines three complementary dimensions of diet quality:

Component	What It Captures	Range
\(H\) (Shannon)	Evenness of consumption across food groups	\([0, H^*]\)
\(A\) (Adequacy)	Whether each food group meets its requirement	\([0, 1]\)
\(C\) (Cereal Penalty)	Penalizes cereal over-dependence	\((0, 1]\)
\(H_{\text{adj}}\)	Overall dietary quality	\([0, H^*]\)

The theoretical maximum of the Shannon entropy for \(K = 9\) food groups is \(\ln(9) \approx 2.20\), attained when all shares are equal (\(p_i = 1/9\)). Because intakes are capped at requirements (Step 2) and requirements differ across food groups, a household meeting all requirements exactly will have shares \(p_i = r_i / \sum_j r_j\), which are not uniform. Therefore the effective maximum under this construction is \(H^* = -\sum (r_i / \sum r_j)\,\ln(r_i / \sum r_j) \approx 1.97 < \ln(9)\). A household consuming only rice has \(H \approx 0\) (no diversity); a household consuming from all groups in balanced proportions has \(H \approx H^*\).
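
The two maxima quoted above can be checked directly from the requirement table. The sketch below assumes only the nine requirement values from Section 1.2; the variable names are illustrative.

```r
r <- c(280, 100, 200, 100, 100, 300, 25, 40, 95)  # g/day, nine groups
s <- r / sum(r)                 # requirement-proportional shares
H_star <- -sum(s * log(s))      # effective maximum under capping
H_unif <- log(length(r))        # uniform-share maximum, ln(9)
round(c(H_star = H_star, H_unif = H_unif), 2)
# H_star = 1.97, H_unif = 2.20
```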

3. Ratio-Based Shannon Diversity Index (Alternative)

3.1 Motivation

In the gram-based construction, food groups with large requirements (e.g., cereals at 280 g) dominate the shares \(p_i\), making it harder for smaller-requirement groups (e.g., oilseeds at 40 g) to influence diversity. The ratio-based alternative normalizes each intake by its requirement first, placing all food groups on a common 0–1 scale regardless of requirement magnitude.

3.2 Construction (Step by Step)

The ratio-based construction follows a similar sequence, with one key modification:

Per-AFE Intake: Same as gram-based.
Ratio and Cap: Set \(a_i = \min(q_i / r_i, 1)\) for each group. This is the adequacy ratio, capped at 1.
Compute Shares: Calculate \(p_i = \frac{a_i}{\sum_j a_j}\) — the proportion of capped adequacy ratios.
Shannon Entropy: Compute \(H = -\sum_i p_i \log(p_i)\).
Adequacy Adjustment: Same as gram-based: \(A = \frac{1}{K} \sum_i \min(q_i/r_i, 1)\).
Final Index: Calculate \(H_{\text{adj}}^{\text{ratio}} = H \times A\) (no cereal adjustment).

3.3 Mathematical Formulation

\[H = -\sum_{i=1}^{K} p_i \log(p_i) \quad \text{where} \quad p_i = \frac{\min(q_i/r_i, 1)}{\sum_j \min(q_j/r_j, 1)}\]

\[A = \frac{1}{K} \sum_{i=1}^K \min\!\left(\frac{q_i}{r_i}, 1\right)\]

\[H_{\text{adj}}^{\text{ratio}} = H \times A\]
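
A minimal R sketch of the ratio-based formulas follows. The helper name shannon_ratio() and the group labels are assumptions for illustration; the requirements are those of Section 1.2 with the combined 95 g pulses and flesh group.

```r
r <- c(cereals = 280, glv = 100, other_veg = 200, roots = 100, fruits = 100,
       milk = 300, fats = 25, oilseeds = 40, pulses_flesh = 95)

# Hypothetical helper, not part of the authors' code base.
shannon_ratio <- function(q, r) {
  a <- pmin(q / r, 1)                       # adequacy ratios, capped at 1
  p <- a / sum(a)                           # shares of capped ratios
  H <- -sum(ifelse(p > 0, p * log(p), 0))   # entropy, with 0 * log(0) = 0
  A <- mean(a)                              # adequacy score
  c(H = H, A = A, H_adj_ratio = H * A)
}

met  <- shannon_ratio(r, r)        # every requirement met exactly
half <- shannon_ratio(r / 2, r)    # every group at half its requirement
rbind(met = met, half = half)
```

In both cases the shares are uniform, so \(H = \ln(9)\); only the adequacy score \(A\) separates the two diets.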

3.4 Key Differences from Gram-Based

Feature	Gram-Based	Ratio-Based
Capping Rule	\(a_i = \min(q_i, r_i)\)	\(a_i = \min(q_i/r_i, 1)\)
Scale	Grams per day	Proportional (0–1)
Share Interpretation	Gram proportion of capped intake	Proportion of capped adequacy ratios
Cereal Dominance	High (large requirement)	Moderate (equal max of 1.0)
Cereal Adjustment	Yes, essential	No, unnecessary
Max Shannon	\(H^* \approx 1.97\) (requirement-weighted)	\(\ln(9) \approx 2.20\) (uniform when all groups fully met)

3.5 Why Cereal Adjustment Is Unnecessary

In the ratio-based construction, each food group contributes at most 1.0 to the share denominator, regardless of gram requirement. Cereal's share can only become large if other groups have genuinely insufficient intakes. Because the construction already equalizes the maximum contribution of each group, no additional cereal penalty is needed.

4. Why Shannon Over Adequacy Ratio?

4.1 MAR Cannot Distinguish Balance

The simple Mean Adequacy Ratio (MAR) is defined as \(\text{MAR} = \frac{1}{K} \sum_i \min(q_i/r_i, 1)\) — it captures only the first moment of the adequacy distribution. Consider three households:

Household	Cereal Adequacy	Veg Adequacy	Protein Adequacy	MAR	Shannon \(H_{\text{adj}}\)
A	1.0	0.5	0.2	0.57	0.42
B	0.8	0.6	0.5	0.63	0.58
C	0.6	0.6	0.6	0.60	0.55

Households A and C have similar MAR values (0.57 and 0.60), yet their diets are distributed quite differently: A has one adequate group and two under-consumed groups, while C has all three groups equally under-consumed. The Shannon index captures this distributional difference through the entropy component \(H\).

4.2 Shannon Captures Distribution

The Shannon entropy \(H\) measures the distributional shape — how evenly consumption is spread. Two diets with identical adequacy scores but different shapes (one concentrated in one group, one spread across groups) will have different Shannon values. The combined index \(H_{\text{adj}} = H \times A \times C\) thus captures both the level of intake (via \(A\)) and its distribution (via \(H\)).
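
The point can be illustrated with two made-up diets over four hypothetical food groups that share the same adequacy score but differ in shape. The numbers and function names below are assumptions chosen for the example, not survey values.

```r
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

# Capped adequacy ratios for two hypothetical four-group diets.
concentrated <- c(1.00, 0.00, 0.00, 0.00)   # one group fully met, rest absent
even         <- c(0.25, 0.25, 0.25, 0.25)   # all groups equally under-consumed

summarise_diet <- function(a) c(A = mean(a), H = entropy(a / sum(a)))

res <- rbind(concentrated = summarise_diet(concentrated),
             even         = summarise_diet(even))
res
```

Both rows have \(A = 0.25\), but the concentrated diet has \(H = 0\) while the even diet has \(H = \ln(4)\).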

4.3 Policy Relevance

Interventions to improve diet quality depend on the diagnostic:

Low \(A\), High \(H\): Diet is well-balanced but insufficient. Increase overall food intake through income/food subsidies.
Moderate \(A\), Low \(H\): Diet relies heavily on few foods. Diversification campaigns needed.
Low \(A\), Low \(H\): Both levels and diversity are inadequate. Comprehensive nutrition intervention needed.

5. Modeling the Diversity Index

5.1 Data Preparation

The analysis uses data from HCES 2011–12 and 2023–24. Key covariates include:

Covariate	Description	Role
`log_mpce_real_afe`	Log of real monthly per capita consumption expenditure (MPCE) per AFE (standardized to base year)	Global Engel curve
`state_code`	State identifier	Geographic variation
`sector`	Rural / Urban	Urbanization effect
`nss_region`	NSS statistical region	Regional variation
`reg_sector`	Regional × Sector interaction	Regional urbanization patterns
`social`	Social group (SC/ST/OBC/Others)	Social equity
`rel`	Religion	Dietary preferences
`child`	Binary: presence of children	Household composition
`female_headed`	Binary: female household head	Gender dimension
`seasonal`	Season of survey (coded by round)	Seasonal variation

5.2 Transformation to the Real Line

The Shannon index \(H_{\text{adj}}\) is bounded: \(H_{\text{adj}} \in [0, H^*]\). To model it with a Gaussian GAM, we apply a logit transformation:

Normalize: \(\tilde{H}_{\text{adj}} = \frac{H_{\text{adj}}}{H^*} \in [0, 1]\)
Clip: Clip to \([\varepsilon, 1-\varepsilon]\) where \(\varepsilon = 10^{-6}\) (prevents infinities)
Logit transform: \(y = \log\left(\frac{\tilde{H}_{\text{adj}}}{1 - \tilde{H}_{\text{adj}}}\right) \in \mathbb{R}\)

After modeling on the logit scale, predictions are back-transformed to the original scale via the inverse logit and scaling.
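
The forward and inverse transformations can be sketched as follows. The function names are illustrative assumptions, and H_star is set to the approximate value 1.97 quoted in Section 2.3.

```r
H_star <- 1.97   # approximate effective maximum from Section 2.3

# Hypothetical helpers for the transformation and its inverse.
to_logit <- function(H_adj, H_star, eps = 1e-6) {
  h <- pmin(pmax(H_adj / H_star, eps), 1 - eps)  # normalize, then clip
  qlogis(h)                                      # log(h / (1 - h))
}
from_logit <- function(y, H_star) H_star * plogis(y)  # inverse logit, rescale

H_adj <- c(0, 0.4, 1.1, 1.97)      # includes both boundary values
y     <- to_logit(H_adj, H_star)
back  <- from_logit(y, H_star)
cbind(H_adj, y, back)
```

Interior values round-trip exactly up to floating-point error; boundary values come back shifted by the clipping tolerance.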

5.3 GAM Specification

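The GAM uses a semi-parametric specification with three types of terms: a global Engel curve (a smooth in log MPCE); factor-smooth interactions that let the Engel curve vary by state, season, child presence, female headship, social group, and religion, with the levels of each factor sharing one smoothness penalty; and random intercepts for sector, NSS region, region × sector, social group, religion, child presence, and female headship.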

The full specification (from model_food_AFE.R) is:

quant <- z ~ 1 +
    s(log_mpce_real_afe) +
    s(log_mpce_real_afe, state_code, bs = "fs", m = 1) +
    s(log_mpce_real_afe, seasonal, bs = "fs", m = 1) +
    s(log_mpce_real_afe, child, bs = "fs", m = 1) +
    s(log_mpce_real_afe, female_headed, bs = "fs", m = 1) +
    s(log_mpce_real_afe, social, bs = "fs", m = 1) +
    s(log_mpce_real_afe, rel, bs = "fs", m = 1) +
    s(sector, bs = "re") +
    s(nss_region, bs = "re") +
    s(reg_sector, bs = "re") +
    s(social, bs = "re") +
    s(rel, bs = "re") +
    s(child, bs = "re") +
    s(female_headed, bs = "re")

This specification includes:

Global Engel curve: s(log_mpce_real_afe) — a thin-plate regression spline capturing the overall relationship between income and dietary diversity.
Factor-smooth interactions: s(log_mpce_real_afe, state_code, bs = "fs", m = 1) and analogous terms for seasonal, child, female_headed, social, and rel — these allow the Engel curve to vary by group while shrinking toward the global curve via a first-order penalty.
Random intercepts: s(sector, bs = "re"), s(nss_region, bs = "re"), s(reg_sector, bs = "re"), s(social, bs = "re"), s(rel, bs = "re"), s(child, bs = "re"), s(female_headed, bs = "re") — Gaussian random effects that allow group-level intercept shifts.

5.4 Estimation Details

The model is estimated using mgcv::bam() with the following settings:

Parameter	Value	Rationale
Smoothing Method	fREML (fast REML)	Efficient for large datasets
`discrete=TRUE`	TRUE	Reduces memory for large \(n\)
`gamma`	1.4	Slight over-smoothing for stability
`select=TRUE`	TRUE	Allows shrinkage of individual terms toward zero
Family	Gaussian	Appropriate for logit-transformed \(y\)
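
To show how these settings fit together, the sketch below fits a much smaller model of the same form to simulated data. Everything about the data is an assumption made for illustration: x stands in for log MPCE, grp for one grouping factor, and region for one random-intercept factor. Only the bam() arguments mirror the table above.

```r
library(mgcv)

set.seed(1)
n <- 2000
d <- data.frame(
  x      = runif(n, 5, 10),                                    # stand-in for log MPCE
  grp    = factor(sample(letters[1:4], n, replace = TRUE)),    # stand-in grouping factor
  region = factor(sample(paste0("r", 1:6), n, replace = TRUE)) # stand-in region
)
# Simulated response on the logit scale.
d$z <- sin(d$x) + 0.3 * as.integer(d$grp) + rnorm(n, sd = 0.5)

fit <- bam(
  z ~ s(x) +                          # global curve
    s(x, grp, bs = "fs", m = 1) +     # group-specific deviations
    s(region, bs = "re"),             # random intercepts
  data = d, family = gaussian(),
  method = "fREML", discrete = TRUE, gamma = 1.4, select = TRUE
)
summary(fit)$s.table
```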

5.5 Models Estimated

Two variants are estimated for each round of data (2011–12 and 2023–24):

Gram-based Shannon Diversity Index (2 models)
Ratio-based Shannon Diversity Index (2 models)

This provides a total of 4 models, permitting comparison of both construction approaches across survey rounds.

6. Prediction Grid Construction

6.1 The Standardization Problem

To compare dietary diversity across demographic groups while accounting for differences in expenditure distribution, we standardize the geographic composition. For instance, rural and urban sectors differ not only in consumption patterns but also in their geographic concentration — rural populations are concentrated in poorer states, which would confound any rural–urban comparison without standardization.

The standardization formula for a demographic group \(g\) is:

\[\mathbb{E}[H_{\text{adj}} | g] = \sum_{r,s} \mathbb{E}[H_{\text{adj}} | g, r, s] \times P(r, s | \text{standard distribution})\]

where \(r\) is region and \(s\) is sector, and the "standard distribution" is the aggregate geographic distribution across both sectors.
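
As a small worked example of the standardization formula, the sketch below averages cell-level predictions for one demographic group using aggregate cell weights. The regions, predictions, and population counts are made-up numbers.

```r
# Hypothetical predictions for one demographic group g in each (region, sector) cell.
cells <- data.frame(
  region = c("North", "North", "South", "South"),
  sector = c("Rural", "Urban", "Rural", "Urban"),
  H_pred = c(1.10, 1.30, 1.25, 1.45),   # E[H_adj | g, r, s]
  pop    = c(400, 100, 300, 200)        # aggregate population in each cell
)

cells$w <- cells$pop / sum(cells$pop)   # P(r, s | standard distribution)
H_std   <- sum(cells$H_pred * cells$w)  # standardized E[H_adj | g]
H_std
# 1.235
```

Because every group is averaged over the same weights w, differences between groups are not driven by where those groups happen to live.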

6.2 Grid Construction Logic

Prediction grids are constructed by specifying combinations of:

Log MPCE: A fine sequence from \(\log(200)\) to \(\log(30000)\) in real per-AFE rupees
Demographic factors: Each level of social group, religion, child presence, female headship, sector, and season
Geographic cells: All combinations of region and sector in the "standardized" set

For each cell, we predict \(H_{\text{adj}}\) holding geography fixed at the standardized distribution, yielding group-level predictions that are geographically comparable.

6.3 Expenditure Binning

To avoid over-smoothing or under-smoothing in the expenditure dimension, deciles are calculated and predictions are made at the median of each decile:

Decile	Percentile Range	Prediction Point (Median of Range)
D1	0–10%	5th percentile
D2	10–20%	15th percentile
...	...	...
D10	90–100%	95th percentile
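
The binning can be sketched with base R quantiles. The expenditure values below are simulated, and the quantiles are unweighted for brevity; a survey-weighted quantile would be the natural choice with real HCES data, which is an assumption on our part.

```r
set.seed(42)
mpce <- rlnorm(5000, meanlog = log(3000), sdlog = 0.6)  # simulated real MPCE per AFE

cutoffs  <- quantile(mpce, probs = seq(0, 1, by = 0.1))        # decile boundaries
midpoint <- quantile(mpce, probs = seq(0.05, 0.95, by = 0.1))  # 5th, 15th, ..., 95th percentiles

# Assign each household to its decile bin.
bin <- cut(mpce, breaks = cutoffs, include.lowest = TRUE,
           labels = paste0("D", 1:10))

# Prediction points on the log scale used by the model.
data.frame(decile = paste0("D", 1:10), log_mpce_pred = log(unname(midpoint)))
```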

6.4 Grid Construction Steps

Standardize geography: Compute the aggregate distribution of (region, sector) pairs as weights.
Create base grid: Expand combinations of demographic factors (social, religion, child, female-headed, season) and expenditure deciles.
Replicate for geography: For each base grid row, replicate across all (region, sector) pairs to be standardized over.
Compute linear predictor matrix (lpmatrix): Use predict(model, newdata = grid, type = "lpmatrix") to obtain the design matrix \(\mathbf{X}\).
Store for posterior simulation: The lpmatrix is used to compute predictions from posterior coefficient draws.

7. Posterior Simulation and Uncertainty Propagation

7.1 Covariance Matrix Validation

After model fitting, the posterior covariance matrix \(\hat{\mathbf{V}}_p\) is extracted. To ensure numerical stability, we check:

Strictly positive diagonal elements (positive variances)
Symmetry
Condition number (indicator of numerical stability)

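If checks fail, the matrix is regularized using its eigenvalue decomposition \(\hat{\mathbf{V}}_p = \mathbf{Q} \Lambda \mathbf{Q}^T\), flooring each eigenvalue at a small multiple \(\text{tol}\) of the largest eigenvalue \(\lambda_{\max}\):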

\[\mathbf{V}_p^{\text{reg}} = \mathbf{Q} \Lambda_{\text{reg}} \mathbf{Q}^T \quad \text{where} \quad \Lambda_{\text{reg}} = \max(\Lambda, \text{tol} \times \lambda_{\max})\]
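
A minimal sketch of this regularization in base R is shown below. The function name regularize_cov() and the tolerance 1e-8 are assumptions for illustration; the source does not state the tolerance used.

```r
# Hypothetical helper: floor the eigenvalues of a covariance matrix.
regularize_cov <- function(V, tol = 1e-8) {
  V   <- (V + t(V)) / 2                        # enforce exact symmetry
  e   <- eigen(V, symmetric = TRUE)
  lam <- pmax(e$values, tol * max(e$values))   # floor small or negative eigenvalues
  e$vectors %*% diag(lam, nrow = length(lam)) %*% t(e$vectors)
}

# Example: a rank-deficient 3 x 3 matrix.
x     <- c(1, 2, 3)
V     <- tcrossprod(x)          # rank 1, so two eigenvalues are numerically zero
V_reg <- regularize_cov(V)
min(eigen(V_reg, symmetric = TRUE)$values) > 0
```

After flooring, the matrix is positive definite, so a Cholesky factor exists and multivariate normal draws can be generated from it.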

7.2 Coefficient Draws

From the validated covariance matrix, we draw \(M\) posterior coefficient vectors:

\[\boldsymbol{\beta}^{(m)} \sim \mathcal{N}(\hat{\boldsymbol{\beta}}, \hat{\mathbf{V}}_p) \quad \text{for } m = 1, \ldots, M\]

Typically, \(M = 1000\) to 5000 draws are used, balancing accuracy and computational cost.
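
The draws can be generated with a Cholesky factor in base R, as in the sketch below. The function name draw_coefs() and the two-parameter example are assumptions; for a fitted mgcv model the inputs would typically be coef(model) and vcov(model).

```r
# Hypothetical helper: M draws from N(beta_hat, Vp), returned as an M x p matrix.
draw_coefs <- function(beta_hat, Vp, M = 1000, seed = 1234) {
  set.seed(seed)
  p <- length(beta_hat)
  L <- t(chol(Vp))                       # lower-triangular factor, Vp = L %*% t(L)
  Z <- matrix(rnorm(p * M), nrow = p)    # p x M standard normal draws
  t(beta_hat + L %*% Z)                  # add the mean to every column, then transpose
}

# Toy example with two coefficients.
beta_hat <- c(0.5, -1.0)
Vp <- matrix(c(0.04, 0.01,
               0.01, 0.09), nrow = 2, byrow = TRUE)
B <- draw_coefs(beta_hat, Vp, M = 5000)
colMeans(B)
cov(B)
```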

7.3 Unified Chunked Computation

Predictions are computed via matrix multiplication on the logit scale:

\[\hat{y}^{(m)} = \mathbf{X} \boldsymbol{\beta}^{(m)}\]

To avoid memory overload, the computation is chunked: the \(n_g \times n_p\) lpmatrix (where \(n_g\) is grid size and \(n_p\) is number of parameters) is processed in blocks of rows, with results accumulated.

After logit scale predictions, back-transformation is applied:

\[\tilde{H}^{(m)} = \frac{\exp(\hat{y}^{(m)})}{1 + \exp(\hat{y}^{(m)})} \quad \text{and} \quad H_{\text{adj}}^{(m)} = H^* \times \tilde{H}^{(m)}\]
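
The chunked computation and back-transformation can be sketched as below. The function name predict_chunked(), the chunk size, and the random inputs are assumptions for illustration. For simplicity the sketch stores every draw; a production pipeline could instead aggregate within each block before moving on.

```r
# Hypothetical helper: X is n_g x n_p (lpmatrix), B is M x n_p (coefficient draws).
predict_chunked <- function(X, B, H_star, chunk_size = 1000) {
  n_g <- nrow(X)
  out <- matrix(NA_real_, nrow = n_g, ncol = nrow(B))
  for (s in seq(1, n_g, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1, n_g)     # rows in this block
    eta <- X[idx, , drop = FALSE] %*% t(B)    # logit-scale predictions
    out[idx, ] <- H_star * plogis(eta)        # inverse logit, then rescale
  }
  out
}

set.seed(3)
X <- matrix(rnorm(25 * 4), nrow = 25)   # toy lpmatrix: 25 grid rows, 4 parameters
B <- matrix(rnorm(10 * 4), nrow = 10)   # toy draws: 10 coefficient vectors
H_draws <- predict_chunked(X, B, H_star = 1.97, chunk_size = 7)
dim(H_draws)
```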

7.4 Survey Standard Errors

The original survey data have sampling structure (multi-stage design, stratification). Survey standard errors are computed for each group using the survey design weights via svyby():

\[\text{SE}_{\text{survey}} = \sqrt{\frac{\text{Var}_{\text{design}}}{n_{\text{group}}}}\]

These standard errors reflect the clustering and stratification of the sample.

7.5 Injecting Sampling Uncertainty

To combine model uncertainty (via posterior draws) with sampling uncertainty (from the survey design), we inject noise proportional to the survey standard error. Using the mean-preserving noise approach:

\[\sigma_{\log}^2 = \log\left(1 + \text{CV}^2\right) \quad \text{where} \quad \text{CV} = \frac{\text{SE}_{\text{survey}}}{\text{Group Mean}}\]

For each posterior draw \(m\), an additional noise term is applied:

\[H_{\text{adj}}^{(m), \text{noisy}} = H_{\text{adj}}^{(m)} \times \exp\left(\varepsilon_m - \tfrac{1}{2}\sigma_{\log}^2\right) \quad \text{where} \quad \varepsilon_m \sim \mathcal{N}(0, \sigma_{\log}^2)\]

The bias correction term \(-\tfrac{1}{2}\sigma_{\log}^2\) ensures the mean is preserved under the log-normal transformation.
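
The multiplicative form described above can be sketched as follows. The function name add_lognormal_noise() and the numeric inputs are assumptions for illustration.

```r
# Hypothetical helper: mean-preserving log-normal noise with a target CV.
add_lognormal_noise <- function(draws, se, group_mean) {
  if (is.na(se) || !is.finite(se) || se <= 0) return(draws)  # nothing to inject
  cv     <- se / group_mean
  sigma2 <- log(1 + cv^2)                      # log-scale variance matching the CV
  eps    <- rnorm(length(draws), mean = 0, sd = sqrt(sigma2))
  draws * exp(eps - sigma2 / 2)                # bias-corrected multiplicative noise
}

set.seed(1)
draws <- rep(1.2, 1e5)                         # constant draws isolate the noise term
noisy <- add_lognormal_noise(draws, se = 0.06, group_mean = 1.2)
c(mean = mean(noisy), cv = sd(noisy) / mean(noisy))
```

With these inputs the target CV is 0.05; the simulated mean stays close to 1.2 and the simulated CV close to 0.05.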

7.6 Final Summary Statistics

From the \(M\) posterior draws (with injected noise), we compute:

Point estimates: Posterior mean \(\mathbb{E}[H_{\text{adj}}] \approx \frac{1}{M} \sum_m H_{\text{adj}}^{(m)}\)
Credible intervals: 2.5th and 97.5th percentiles of the draws for 95% credible interval
Standard deviation: Posterior SD across draws
Grouped aggregates: Weighted averages across grid cells for each demographic group
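
These summaries can be sketched on a matrix of draws with one row per grid cell and one column per draw. The simulated draws, the weights, and the function name summarise_draws() are assumptions for illustration.

```r
set.seed(7)
# Simulated draws: 3 grid cells (rows) x 2000 draws (columns).
draws <- matrix(rnorm(3 * 2000, mean = c(1.0, 1.2, 1.4), sd = 0.05), nrow = 3)

# Hypothetical helper: per-cell posterior summaries.
summarise_draws <- function(draws) {
  data.frame(
    mean  = rowMeans(draws),
    sd    = apply(draws, 1, sd),
    lower = apply(draws, 1, quantile, probs = 0.025),  # 2.5th percentile
    upper = apply(draws, 1, quantile, probs = 0.975)   # 97.5th percentile
  )
}
cell_summary <- summarise_draws(draws)

# Grouped aggregate: weight the cells within each draw, then summarize.
w <- c(0.5, 0.3, 0.2)
group_draws <- colSums(draws * w)   # one weighted average per draw
c(mean = mean(group_draws), quantile(group_draws, c(0.025, 0.975)))
```

Aggregating within each draw before taking quantiles keeps the interval for the group consistent with the joint uncertainty across cells.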

8. R Code Reference

8.1 Index Construction

The Shannon Diversity Index is constructed in R using tidyverse and data.table operations. The key function analysis_shannon_food() is sourced from model_food_AFE.R:

# Sketch of index construction (gram-based).
# Assumes these objects exist in the calling environment:
#   food_cols        character vector naming the 9 food-group columns in `data`
#   requirements_df  data frame with columns `food` and `requirement` (g/day)
# and that `data` has the columns household_id and afe_size.
library(tidyr)
library(data.table)

analysis_shannon_food <- function(data) {
  # Pivot from wide (food categories) to long format
  data_long <- pivot_longer(data, cols = all_of(food_cols),
                            names_to = "food", values_to = "intake_gm")
  data_long <- as.data.table(data_long)  # data.table syntax is used below

  # Merge with requirements
  data_long <- merge(data_long, as.data.table(requirements_df), by = "food")

  # Compute per-AFE intake
  data_long[, intake_afe := intake_gm / afe_size]

  # Cap at requirement
  data_long[, capped := pmin(intake_afe, requirement)]

  # Compute shares within each household (all zero if nothing is consumed)
  data_long[, share := if (sum(capped) > 0) capped / sum(capped) else 0,
            by = household_id]

  # Shannon entropy terms, using the convention 0 * log(0) = 0
  data_long[, h_component := ifelse(share > 0, -share * log(share), 0)]

  # Adequacy adjustment
  data_long[, adequacy := pmin(intake_afe / requirement, 1)]

  # Expected cereal share under the requirements
  cereal_req_share <- with(requirements_df,
                           requirement[food == "cereals"] / sum(requirement))

  # Household-level components; the cereal penalty is computed inside the
  # grouped summary so it stays aligned with H and A for each household
  result <- data_long[, .(
    H = sum(h_component),
    A = mean(adequacy),
    cereal_penalty = exp(-3 * max(share[food == "cereals"] - cereal_req_share, 0))
  ), by = household_id]

  # Combine
  result[, H_adj := H * A * cereal_penalty]
  result[]
}

8.2 Model Estimation

The GAM is estimated using mgcv::bam() for efficiency:

library(dplyr)
library(mgcv)

# Normalize and logit-transform the Shannon index
K <- length(food_vars)  # number of food groups (9)
eps <- 1e-6

data <- data1 %>%
  mutate(
    Hnorm   = shannon_req_A / log(K),            # scale by the uniform-share maximum log(K)
    Hnorm01 = pmin(pmax(Hnorm, eps), 1 - eps),   # clip away from 0 and 1
    z       = qlogis(Hnorm01)                    # logit transform
  )

# GAM formula
quant <- z ~ 1 +
  s(log_mpce_real_afe) +
  s(log_mpce_real_afe, state_code, bs = "fs", m = 1) +
  s(log_mpce_real_afe, seasonal, bs = "fs", m = 1) +
  s(log_mpce_real_afe, child, bs = "fs", m = 1) +
  s(log_mpce_real_afe, female_headed, bs = "fs", m = 1) +
  s(log_mpce_real_afe, social, bs = "fs", m = 1) +
  s(log_mpce_real_afe, rel, bs = "fs", m = 1) +
  s(sector, bs = "re") +
  s(nss_region, bs = "re") +
  s(reg_sector, bs = "re") +
  s(social, bs = "re") +
  s(rel, bs = "re") +
  s(child, bs = "re") +
  s(female_headed, bs = "re")

# Fit using bam() for large-dataset efficiency
model_shannon_food <- mgcv::bam(
  quant,
  data = data,
  weights = data$w_pc,
  method = "fREML",
  discrete = TRUE,
  gamma = 1.4,
  gc.level = 2,
  select = TRUE
)

# Estimate for both survey rounds and save
models <- list(
  HCES2011_model = analysis_shannon_food("HCES2011"),
  HCES2023_model = analysis_shannon_food("HCES2023")
)
save(models, file = file.path(folder, "Shannon_Diversity_food_model.RData"))

8.3 Prediction Data Generation

The prediction pipeline is handled by data_for_fig_shannon_food_AFE.R, which calls compute_grp_draws_unified() to generate posterior draws and svyby() for survey standard errors:

data_analysis_shannon_food <- function(
    n = "HCES2023",
    group_vars = c("nss", "rel"),
    n_sims = 1000,
    seed = 1234,
    jitter_eps = 1e-6
) {
  # Step 1-2: Load MPCE data and pre-fitted Shannon models
  obj <- models[[paste0(n, "_model")]]
  model_main <- obj$model_shannon_food
  model_adj  <- obj$model_shannon_food_cereal_adj
  K <- length(obj$food_vars$category_balanced_diet)

  # Step 3: Assign households to MPCE decile bins
  data <- obj$data %>%
    left_join(data_mpce_cutoff, by = group_vars) %>%
    mutate(bin = factor(findInterval(mpce_real_2011_afe, ...)))

  # Step 4: Build standardized prediction grid
  data_gr <- grid_f(n = n, group_vars = group_vars, season = "TRUE")
  nd <- data_gr[["nd"]]

  # Step 5: Draw coefficients from posterior (MVN)
  B_main <- draw_beta(model_main, n_sims, jitter_eps)
  B_adj  <- draw_beta(model_adj, n_sims, jitter_eps)

  # Step 6: Unified chunked computation (both models in single pass)
  grp_draws <- compute_grp_draws_unified(
    nd = nd,
    model_list = list(main = model_main, adj = model_adj),
    B_list = list(main = B_main, adj = B_adj),
    n_sims = n_sims,
    by_cols = c(group_vars, "bin", "log_mpce_real_afe", "grid_id"),
    K = K,
    compute_ratio = TRUE
  )

  # Step 7: Survey standard errors
  des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wts,
                   data = data_svy, nest = TRUE)
  svy_results <- svyby(~shannon_req_A_hat, by = by_f,
                        design = des, FUN = svymean, vartype = "se")

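  # Step 8: Inject sampling uncertainty
  # add_noise() adds zero-mean Gaussian noise with sd equal to the survey SE;
  # draws are returned unchanged when the SE is missing, non-finite, or <= 0
  add_noise <- function(draws, se) {
    if (is.na(se) || !is.finite(se) || se <= 0) return(draws)
    draws + rnorm(length(draws), mean = 0, sd = se)
  }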
  out <- out %>%
    mutate(shannon_req_A_g_svy = Map(add_noise, shannon_req_A_g, shannon_se_svy))

  return(out_final)
}

9. Summary of Model Assumptions

The methodology rests on several key assumptions:

Smooth Engel Curves: We assume the relationship between log expenditure and dietary diversity is smooth and continuous (penalized spline assumption).
Normal Posterior Approximation: The posterior distribution of model coefficients is assumed multivariate normal, justified by large-sample Bayesian asymptotics.
Additivity of Uncertainty: Model and sampling uncertainty are combined additively (in log space), assuming they are approximately independent.
Logit Transformation Adequacy: The logit transformation correctly maps the bounded Shannon index to the real line for Gaussian regression.
Exchangeability: Within survey rounds, households are treated as exchangeable conditional on covariates, justifying the random effects specification.

Limitations: The methodology assumes ICMR-NIN requirements are appropriate for all household types. Individual-level requirements (e.g., for pregnant women, children, or the elderly) are not modeled. The method also assumes measurement error in the survey's food quantity data is negligible.

References

[1] ICMR-NIN (2024). Recommended Dietary Allowances and Estimated Average Requirements for Indians. Indian Council of Medical Research – National Institute of Nutrition, Hyderabad.

[2] Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.

[3] Wood, S.N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman and Hall/CRC.

[4] Cragg, J.G. (1971). Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica, 39(5), 829–844.

[5] Mullahy, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics, 17(3), 247–281.

[6] National Sample Survey Office (2013). Household Consumer Expenditure Survey 2011–12 (68th Round). Ministry of Statistics and Programme Implementation, Government of India.

[7] Ministry of Statistics and Programme Implementation (2024). Household Consumer Expenditure Survey 2023–24. Government of India.

[8] Gelman, A. (2006). Multilevel (hierarchical) modeling: What it can and can't do. Technometrics, 48(3), 432–435.

[9] Wood, S.N., Goude, Y. & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society: Series C, 64(1), 139–155.

[10] Li, R. & Shively, T.S. (2008). Variable selection in semiparametric regression modeling. Annals of Statistics, 36(1), 261–286.

Last updated: February 2026