Shannon Diversity Index for Food Consumption in India
1. Introduction
1.1 Motivation
The Shannon Diversity Index provides a single summary of dietary quality that captures both variety and adequacy. Unlike simple adequacy ratios, it distinguishes between a balanced diet and one concentrated in a few food groups. Individual food item models tell us what people eat, but not how balanced their diet is. The Shannon diversity index (Shannon, 1948)[1] addresses this by summarizing the evenness of consumption across food groups, combined with a requirement-adequacy adjustment.
This methodology document outlines the construction, modeling, and prediction of the Shannon Diversity Index for food consumption in India, using household-level data from the Household Consumer Expenditure Survey (HCES) 2011–12 and 2023–24, benchmarked against ICMR-NIN Recommended Dietary Allowances.
1.2 Food Categories Analyzed
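The analysis covers ten food categories based on the ICMR-NIN recommended balanced diet for an adult woman (55 kg, moderately active, non-lactating). Because Pulses & Beans and Flesh Foods share a single requirement, the ten categories form nine requirement groups (\(K = 9\)) in the index: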
| Food Category | Daily Requirement (grams) |
|---|---|
| Cereals & Millets | 280 |
| Green Leafy Vegetables | 100 |
| Other Vegetables | 200 |
| Roots & Tubers (excl. potatoes) | 100 |
| Fruits | 100 |
| Milk & Milk Products | 300 |
| Fats & Oils | 25 |
| Oilseeds & Nuts | 40 |
| Pulses & Beans | 95 (combined with Flesh) |
| Flesh Foods | 95 (combined with Pulses) |
Note: Pulses & Beans and Flesh Foods share a combined requirement of 95 grams per day. In the gram-based construction, we allocate this flexibly based on household consumption patterns.
2. Gram-Based Shannon Diversity Index
2.1 Construction (Step by Step)
The gram-based Shannon Diversity Index is constructed through the following sequential steps:
- Per-AFE Intake: Divide household-level food quantities by Adult Female Equivalent (AFE) household size to obtain grams per day per AFE for each of the 9 food groups.
- Cap at Requirement: Set \(a_i = \min(\text{intake}_i, \text{requirement}_i)\) for each group \(i\). Eating more than required does not inflate the diversity score.
- Compute Shares: Calculate \(p_i = \frac{a_i}{\sum_j a_j}\) — the proportion of capped intake from each food group.
- Shannon Entropy: Compute \(H = -\sum_i p_i \log(p_i)\), where \(\log\) is the natural logarithm and terms with \(p_i = 0\) contribute zero. Higher values indicate a more even distribution across groups.
- Adequacy Adjustment: For each food group, calculate \(\min(\text{intake}_i / \text{requirement}_i, 1)\), then average across all groups to obtain \(A\). This penalizes diets that are diverse but nutritionally insufficient.
- Cereal Adjustment: Indian diets are typically dominated by cereals. Even after capping at the requirement (Step 2), cereals still command a large share because their requirement (280 g/day) is much larger than most other food groups (e.g., Fats & Oils at 25 g, Oilseeds & Nuts at 40 g). This means a household could score reasonably well on Shannon entropy simply because it eats a lot of rice or wheat — even if it barely consumes any vegetables, fruits, or protein sources. The cereal adjustment corrects for this by penalizing households whose cereal share exceeds what would be expected under a balanced, requirement-proportional diet.
First, compute the expected cereal share under the ICMR-NIN requirements: \(s^{*}_{\mathrm{cereal}} = \dfrac{r_{\mathrm{cereal}}}{\sum_j r_j} = \dfrac{280}{1240} \approx 0.226\). This is the share cereals would have if every food group were consumed at exactly its required level.
Then, compute how much the household's actual cereal share (from Step 3) exceeds this benchmark: \(\delta_{\mathrm{cereal}} = \max\!\left(p_{\mathrm{cereal}} - s^{*}_{\mathrm{cereal}},\; 0\right)\). If the household's cereal share is at or below the expected share, \(\delta_{\mathrm{cereal}} = 0\) and no penalty is applied.
Finally, convert the excess share into an exponentially decaying penalty: \(C = \exp\!\left(-3 \times \delta_{\mathrm{cereal}}\right)\). The multiplier of 3 controls the severity: a cereal share 10 percentage points above the benchmark yields \(C = e^{-0.3} \approx 0.74\) (a 26% penalty), while a share 20 points above yields \(C = e^{-0.6} \approx 0.55\) (a 45% penalty). The exponential form ensures \(C \in (0, 1]\), with \(C = 1\) when cereal consumption is proportionate and \(C \to 0\) as cereal dominance becomes extreme.
In practice, this adjustment matters most for low-income households that spend the bulk of their food budget on rice or wheat and consume negligible amounts of fruits, vegetables, dairy, and protein-rich foods. Without this correction, such households would receive misleadingly high diversity scores.
- Final Index: Calculate \(H_{\text{adj}} = H \times A \times C\).
2.2 Mathematical Formulation
Let \(q_i\) denote the intake (in grams per day per AFE) of food group \(i\), and \(r_i\) denote the ICMR-NIN requirement for that group.
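Shannon Entropy:
\[ p_i = \frac{\min(q_i, r_i)}{\sum_{j} \min(q_j, r_j)}, \qquad H = -\sum_{i=1}^{K} p_i \log(p_i) \]
Adequacy Score:
\[ A = \frac{1}{K} \sum_{i=1}^{K} \min\!\left(\frac{q_i}{r_i},\; 1\right) \]
Cereal Penalty:
\[ C = \exp\!\left(-3 \times \max\!\left(p_{\mathrm{cereal}} - \frac{r_{\mathrm{cereal}}}{\sum_j r_j},\; 0\right)\right) \]
Combined Diversity Index:
\[ H_{\text{adj}} = H \times A \times C \]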
2.3 Interpreting the Components
The Shannon Diversity Index combines three complementary dimensions of diet quality:
| Component | What It Captures | Range |
|---|---|---|
| \(H\) (Shannon) | Evenness of consumption across food groups | \([0, H^*]\) |
| \(A\) (Adequacy) | Whether each food group meets its requirement | \([0, 1]\) |
| \(C\) (Cereal Penalty) | Penalizes cereal over-dependence | \((0, 1]\) |
| \(H_{\text{adj}}\) | Overall dietary quality | \([0, H^*]\) |
The theoretical maximum of the Shannon entropy for \(K = 9\) food groups is \(\ln(9) \approx 2.20\), attained when all shares are equal (\(p_i = 1/9\)). Because intakes are capped at requirements (Step 2) and requirements differ across food groups, a household meeting all requirements exactly will have shares \(p_i = r_i / \sum_j r_j\), which are not uniform. Therefore the effective maximum under this construction is \(H^* = -\sum (r_i / \sum r_j)\,\ln(r_i / \sum r_j) \approx 1.97 < \ln(9)\). A household consuming only rice has \(H \approx 0\) (no diversity); a household consuming from all groups in balanced proportions has \(H \approx H^*\).
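As a quick check on the figures above, the benchmark cereal share and the effective maximum \(H^*\) can be recomputed from the requirement table in Section 1.2. This is a minimal sketch; the vector `req` and its labels are illustrative names, not taken from the project code.

```r
# Requirements (g/day per AFE) from Section 1.2.
# Pulses & Beans and Flesh Foods enter as one combined 95 g group.
req <- c(cereals = 280, glv = 100, other_veg = 200, roots = 100, fruits = 100,
         milk = 300, fats = 25, oilseeds = 40, pulses_flesh = 95)

p_star    <- req / sum(req)               # shares when every requirement is met exactly
s_cereal  <- p_star[["cereals"]]          # benchmark cereal share (280 / 1240)
H_star    <- -sum(p_star * log(p_star))   # effective maximum of H under capping
H_uniform <- log(length(req))             # theoretical maximum ln(9)

print(round(c(s_cereal = s_cereal, H_star = H_star, H_uniform = H_uniform), 3))
```

The gap between `H_star` and `H_uniform` is purely a consequence of unequal requirements, which is why Section 5.2 normalizes by \(H^*\) rather than \(\ln(9)\).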
3. Ratio-Based Shannon Diversity Index (Alternative)
3.1 Motivation
In the gram-based construction, food groups with large requirements (e.g., cereals at 280g) dominate the shares \(p_i\), making it harder for smaller-requirement groups (e.g., oilseeds at 40g) to influence diversity. The ratio-based alternative normalizes each intake by its requirement first, placing all food groups on a common 0–1 scale regardless of requirement magnitude.
3.2 Construction (Step by Step)
The ratio-based construction follows a similar sequence, with one key modification:
- Per-AFE Intake: Same as gram-based.
- Ratio and Cap: Set \(a_i = \min(q_i / r_i, 1)\) for each group. This is the adequacy ratio, capped at 1.
- Compute Shares: Calculate \(p_i = \frac{a_i}{\sum_j a_j}\) — the proportion of capped adequacy ratios.
- Shannon Entropy: Compute \(H = -\sum_i p_i \log(p_i)\).
- Adequacy Adjustment: Same as gram-based: \(A = \frac{1}{K} \sum_i \min(q_i/r_i, 1)\).
- Final Index: Calculate \(H_{\text{adj}}^{\text{ratio}} = H \times A\) (no cereal adjustment).
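The ratio-based steps above can be sketched for a single household as follows. The helper name `shannon_ratio` and the example intakes are hypothetical and only illustrate the arithmetic; the project code may organize this differently.

```r
# Hypothetical helper: ratio-based index for one household.
# q: per-AFE intakes (g/day); r: requirements (g/day), in the same order.
shannon_ratio <- function(q, r) {
  a <- pmin(q / r, 1)                       # adequacy ratios, capped at 1
  if (sum(a) == 0) return(c(H = 0, A = 0, H_adj = 0))
  p <- a / sum(a)                           # shares of capped ratios
  H <- -sum(ifelse(p > 0, p * log(p), 0))   # entropy, treating 0 * log(0) as 0
  A <- mean(a)                              # adequacy adjustment
  c(H = H, A = A, H_adj = H * A)            # no cereal penalty in this variant
}

r <- c(280, 100, 200, 100, 100, 300, 25, 40, 95)
full      <- shannon_ratio(q = r, r = r)                 # every requirement met
rice_only <- shannon_ratio(q = c(400, rep(0, 8)), r = r) # cereals only
```

When every requirement is met, all nine ratios equal 1, so the shares are uniform and `H` reaches \(\ln(9)\). A cereal-only diet has a single nonzero share and therefore zero entropy.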
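3.3 Mathematical Formulation
With \(q_i\), \(r_i\), and \(K\) defined as in Section 2.2:
\[ a_i = \min\!\left(\frac{q_i}{r_i},\; 1\right), \qquad p_i = \frac{a_i}{\sum_j a_j}, \qquad H = -\sum_{i=1}^{K} p_i \log(p_i), \qquad A = \frac{1}{K} \sum_{i=1}^{K} a_i \]
\[ H_{\text{adj}}^{\text{ratio}} = H \times A \]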
3.4 Key Differences from Gram-Based
| Feature | Gram-Based | Ratio-Based |
|---|---|---|
| Capping Rule | \(a_i = \min(q_i, r_i)\) | \(a_i = \min(q_i/r_i, 1)\) |
| Scale | Grams per day | Proportional (0–1) |
| Share Interpretation | Gram proportion of capped intake | Proportion of capped adequacy ratios |
| Cereal Dominance | High (large requirement) | Moderate (equal max of 1.0) |
| Cereal Adjustment | Yes, essential | No, unnecessary |
| Max Shannon | \(H^* \approx 1.97\) (requirement-weighted) | \(\ln(9) \approx 2.20\) (uniform when all groups fully met) |
3.5 Why Cereal Adjustment Is Unnecessary
In the ratio-based construction, each food group contributes at most 1.0 to the share denominator, regardless of gram requirement. Cereal's share can only become large if other groups have genuinely insufficient intakes. Because the construction already equalizes the maximum contribution of each group, no additional cereal penalty is needed.
4. Why Shannon Over Adequacy Ratio?
4.1 MAR Cannot Distinguish Balance
The simple Mean Adequacy Ratio (MAR) is defined as \(\text{MAR} = \frac{1}{K} \sum_i \min(q_i/r_i, 1)\) — it captures only the first moment of the adequacy distribution. Consider three households:
| Household | Cereal Adequacy | Veg Adequacy | Protein Adequacy | MAR | Shannon \(H_{\text{adj}}\) |
|---|---|---|---|---|---|
| A | 1.0 | 0.5 | 0.2 | 0.57 | 0.42 |
| B | 0.8 | 0.6 | 0.5 | 0.63 | 0.58 |
| C | 0.6 | 0.6 | 0.6 | 0.60 | 0.55 |
Households A and C have similar MAR values (0.57 and 0.60), yet their diet distributions are quite different: A has one adequate group and two under-consumed; C has all three groups equally under-consumed. The Shannon index captures this distributional difference through the entropy component \(H\).
4.2 Shannon Captures Distribution
The Shannon entropy \(H\) measures the distributional shape — how evenly consumption is spread. Two diets with identical adequacy scores but different shapes (one concentrated in one group, one spread across groups) will have different Shannon values. The combined index \(H_{\text{adj}} = H \times A \times C\) thus captures both the level of intake (via \(A\)) and its distribution (via \(H\)).
4.3 Policy Relevance
Interventions to improve diet quality depend on the diagnostic:
- Low \(A\), High \(H\): Diet is well-balanced but insufficient. Increase overall food intake through income/food subsidies.
- Moderate \(A\), Low \(H\): Diet relies heavily on few foods. Diversification campaigns needed.
- Low \(A\), Low \(H\): Both levels and diversity are inadequate. Comprehensive nutrition intervention needed.
5. Modeling the Diversity Index
5.1 Data Preparation
The analysis uses data from HCES 2011–12 and 2023–24. Key covariates include:
| Covariate | Description | Role |
|---|---|---|
| log_mpce_real_afe | Log of real MPCE per AFE (standardized to base year) | Global Engel curve |
| state_code | State identifier | Geographic variation |
| sector | Rural / Urban | Urbanization effect |
| nss_region | NSS statistical region | Regional variation |
| reg_sector | Regional × Sector interaction | Regional urbanization patterns |
| social | Social group (SC/ST/OBC/Others) | Social equity |
| rel | Religion | Dietary preferences |
| child | Binary: presence of children | Household composition |
| female_headed | Binary: female household head | Gender dimension |
| seasonal | Season of survey (coded by round) | Seasonal variation |
5.2 Transformation to the Real Line
The Shannon index \(H_{\text{adj}}\) is bounded: \(H_{\text{adj}} \in [0, H^*]\). To model it with a Gaussian GAM, we apply a logit transformation:
- Normalize: \(\tilde{H}_{\text{adj}} = \frac{H_{\text{adj}}}{H^*} \in [0, 1]\)
- Clip: Clip to \([\varepsilon, 1-\varepsilon]\) where \(\varepsilon = 10^{-6}\) (prevents infinities)
- Logit transform: \(y = \log\left(\frac{\tilde{H}_{\text{adj}}}{1 - \tilde{H}_{\text{adj}}}\right) \in \mathbb{R}\)
After modeling on the logit scale, predictions are back-transformed to the original scale via the inverse logit and scaling.
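The transformation and its inverse can be sketched in base R as below. The function names are illustrative, and the normalizing constant is passed as an argument `h_max` because the text normalizes by \(H^*\) while the estimation code in Section 8.2 divides by `log(K)`; either choice keeps the normalized index in \([0, 1]\).

```r
h_star <- 1.97   # effective gram-based maximum quoted in Section 2.3

# Normalize, clip away from 0 and 1, then apply the logit
to_logit <- function(h_adj, h_max = h_star, eps = 1e-6) {
  h_norm <- pmin(pmax(h_adj / h_max, eps), 1 - eps)
  qlogis(h_norm)
}

# Inverse logit followed by rescaling to the original index scale
from_logit <- function(y, h_max = h_star) h_max * plogis(y)

h      <- c(0, 0.4, 1.2, 1.97)   # includes both boundary values
y      <- to_logit(h)
h_back <- from_logit(y)
```

Interior values round-trip exactly up to floating-point error; the two boundary values come back shifted by at most `eps * h_max` because of the clipping step.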
5.3 GAM Specification
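The GAM uses a semi-parametric specification with three types of terms: a global smooth in log expenditure, factor-smooth interactions, and random intercepts.
The full specification (from model_food_AFE.R) is given below; the response z is the logit-transformed index, written \(y\) in Section 5.2: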
quant <- z ~ 1 +
s(log_mpce_real_afe) +
s(log_mpce_real_afe, state_code, bs = "fs", m = 1) +
s(log_mpce_real_afe, seasonal, bs = "fs", m = 1) +
s(log_mpce_real_afe, child, bs = "fs", m = 1) +
s(log_mpce_real_afe, female_headed, bs = "fs", m = 1) +
s(log_mpce_real_afe, social, bs = "fs", m = 1) +
s(log_mpce_real_afe, rel, bs = "fs", m = 1) +
s(sector, bs = "re") +
s(nss_region, bs = "re") +
s(reg_sector, bs = "re") +
s(social, bs = "re") +
s(rel, bs = "re") +
s(child, bs = "re") +
s(female_headed, bs = "re")
This specification includes:
- Global Engel curve: s(log_mpce_real_afe), a thin-plate regression spline capturing the overall relationship between income and dietary diversity.
- Factor-smooth interactions: s(log_mpce_real_afe, state_code, bs = "fs", m = 1) and analogous terms for seasonal, child, female_headed, social, and rel. These allow the Engel curve to vary by group while shrinking toward the global curve via a first-order penalty.
- Random intercepts: s(sector, bs = "re"), s(nss_region, bs = "re"), s(reg_sector, bs = "re"), s(social, bs = "re"), s(rel, bs = "re"), s(child, bs = "re"), s(female_headed, bs = "re"). These are Gaussian random effects that allow group-level intercept shifts.
5.4 Estimation Details
The model is estimated using mgcv::bam() with the following settings:
| Parameter | Value | Rationale |
|---|---|---|
| Smoothing Method | fREML (fast REML) | Efficient for large datasets |
| discrete | TRUE | Reduces memory for large \(n\) |
| gamma | 1.4 | Slight over-smoothing for stability |
| select | TRUE | Allows shrinkage of individual terms toward zero |
| Family | Gaussian | Appropriate for logit-transformed \(y\) |
5.5 Models Estimated
Two variants are estimated for each round of data (2011–12 and 2023–24):
- Gram-based Shannon Diversity Index (2 models)
- Ratio-based Shannon Diversity Index (2 models)
This provides a total of 4 models, permitting comparison of both construction approaches across survey rounds.
6. Prediction Grid Construction
6.1 The Standardization Problem
To compare dietary diversity across demographic groups while accounting for differences in expenditure distribution, we standardize the geographic composition. For instance, rural and urban sectors differ not only in consumption patterns but also in their geographic concentration — rural populations are concentrated in poorer states, which would confound any rural–urban comparison without standardization.
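The standardized prediction for a demographic group \(g\) is:
\[ \bar{H}_g = \sum_{r,s} w_{rs}\, \hat{H}_{\text{adj}}(g, r, s), \qquad \sum_{r,s} w_{rs} = 1, \]
where \(r\) is region, \(s\) is sector, \(\hat{H}_{\text{adj}}(g, r, s)\) is the model prediction for group \(g\) in cell \((r, s)\), and the weights \(w_{rs}\) are the "standard distribution", that is, the aggregate geographic distribution across both sectors.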
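The standardization described here amounts to a weighted average of cell-level predictions using one common set of geographic weights for every group. The sketch below uses made-up predictions and weights purely for illustration; all object and column names are hypothetical.

```r
# Hypothetical predictions for two groups over four (region, sector) cells
pred <- data.frame(
  group  = rep(c("g1", "g2"), each = 4),
  region = rep(c("R1", "R1", "R2", "R2"), times = 2),
  sector = rep(c("rural", "urban"), times = 4),
  h_hat  = c(0.50, 0.70, 0.40, 0.60, 0.55, 0.75, 0.45, 0.65)
)

# Standard distribution: one weight per cell, shared by all groups
std <- data.frame(
  region = c("R1", "R1", "R2", "R2"),
  sector = c("rural", "urban", "rural", "urban"),
  w      = c(0.4, 0.2, 0.3, 0.1)
)

# Attach the weights and average within each group
m <- merge(pred, std, by = c("region", "sector"))
std_mean <- sapply(split(m, m$group), function(d) sum(d$w * d$h_hat))
```

Because both groups are averaged over the same weights, any remaining difference between `std_mean["g1"]` and `std_mean["g2"]` cannot be attributed to where the groups happen to live.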
6.2 Grid Construction Logic
Prediction grids are constructed by specifying combinations of:
- Log MPCE: A fine sequence from \(\log(200)\) to \(\log(30000)\) in real per-AFE rupees
- Demographic factors: Each level of social group, religion, child presence, female headship, sector, and season
- Geographic cells: All combinations of region and sector in the "standardized" set
For each cell, we predict \(H_{\text{adj}}\) holding geography fixed at the standardized distribution, yielding group-level predictions that are geographically comparable.
6.3 Expenditure Binning
To avoid over-smoothing or under-smoothing in the expenditure dimension, deciles are calculated and predictions are made at the median of each decile:
| Decile | Percentile Range | Prediction Point (Median of Range) |
|---|---|---|
| D1 | 0–10% | 5th percentile |
| D2 | 10–20% | 15th percentile |
| ... | ... | ... |
| D10 | 90–100% | 95th percentile |
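A minimal sketch of the binning, using simulated expenditure because the survey data are not reproduced here. The variable names and the log-normal simulation are assumptions made for illustration only.

```r
set.seed(1)
# Simulated real MPCE per AFE (illustrative, not survey data)
mpce <- rlnorm(5000, meanlog = log(2000), sdlog = 0.6)

# Decile boundaries and the within-decile medians (5th, 15th, ..., 95th percentiles)
cuts        <- quantile(mpce, probs = seq(0, 1, by = 0.1), names = FALSE)
pred_points <- quantile(mpce, probs = seq(0.05, 0.95, by = 0.1), names = FALSE)

# Assign each household to its decile bin
bin <- cut(mpce, breaks = cuts, include.lowest = TRUE, labels = paste0("D", 1:10))

# Prediction points on the log scale used by the model
log_pred_points <- log(pred_points)
```

Each prediction point lies strictly inside its own decile, so the model is evaluated at a representative expenditure level for every bin rather than at a bin edge.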
6.4 Grid Construction Steps
- Standardize geography: Compute the aggregate distribution of (region, sector) pairs as weights.
- Create base grid: Expand combinations of demographic factors (social, religion, child, female-headed, season) and expenditure deciles.
- Replicate for geography: For each base grid row, replicate across all (region, sector) pairs to be standardized over.
- Compute linear predictor matrix (lpmatrix): Use predict(model, grid, type = "lpmatrix") to obtain the design matrix \(\mathbf{X}\).
- Store for posterior simulation: The lpmatrix is used to compute predictions from posterior coefficient draws.
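The role of the lpmatrix can be illustrated on a small simulated model. The data, formula, and object names below are toy assumptions and far simpler than the specification in Section 5.3; the point is only that multiplying the lpmatrix by a coefficient vector reproduces the linear predictor.

```r
library(mgcv)

set.seed(42)
# Toy data: one smooth covariate and one two-level factor
d <- data.frame(x = runif(300, 5, 10),
                g = factor(sample(c("a", "b"), 300, replace = TRUE)))
d$y <- sin(d$x) + 0.3 * (d$g == "b") + rnorm(300, sd = 0.2)

fit <- gam(y ~ s(x) + g, data = d, method = "REML")

# Prediction grid: every combination of x values and factor levels
grid <- expand.grid(x = seq(5, 10, length.out = 20), g = levels(d$g))

# Design matrix mapping coefficients to linear-predictor values on the grid
X   <- predict(fit, newdata = grid, type = "lpmatrix")
eta <- drop(X %*% coef(fit))
```

Since `eta` is linear in the coefficients, replacing `coef(fit)` with a posterior draw gives that draw's prediction on the whole grid without calling `predict()` again.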
7. Posterior Simulation and Uncertainty Propagation
7.1 Covariance Matrix Validation
After model fitting, the posterior covariance matrix \(\hat{\mathbf{V}}_p\) is extracted. To ensure numerical stability, we check:
- Non-zero diagonal elements (positive variance)
- Symmetry
- Condition number (indicator of numerical stability)
If any check fails, the matrix is regularized through its eigenvalue decomposition.
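One common way to carry out such a regularization is to floor small or negative eigenvalues and rebuild the matrix. The sketch below is an assumption about the form of the repair, not the project's exact rule; the function name and the relative tolerance `tol` are hypothetical.

```r
# Hypothetical helper: make a covariance matrix symmetric positive definite
regularize_cov <- function(V, tol = 1e-8) {
  V <- (V + t(V)) / 2                        # enforce symmetry
  e <- eigen(V, symmetric = TRUE)
  floor_val <- tol * max(e$values)           # relative eigenvalue floor (assumed rule)
  lam <- pmax(e$values, floor_val)           # lift small or negative eigenvalues
  V_reg <- e$vectors %*% diag(lam, nrow = length(lam)) %*% t(e$vectors)
  (V_reg + t(V_reg)) / 2                     # remove rounding asymmetry
}

V_bad <- matrix(c(1, 1, 1, 1), nrow = 2)     # singular: eigenvalues 2 and 0
V_ok  <- regularize_cov(V_bad)
```

After the repair, `chol(V_ok)` succeeds, which is what the coefficient draws in Section 7.2 require.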
7.2 Coefficient Draws
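From the validated covariance matrix, we draw \(M\) posterior coefficient vectors:
\[ \boldsymbol{\beta}^{(m)} \sim \mathcal{N}\!\left(\hat{\boldsymbol{\beta}},\; \hat{\mathbf{V}}_p\right), \qquad m = 1, \dots, M \]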
Typically, \(M = 1000\) to 5000 draws are used, balancing accuracy and computational cost.
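A base-R sketch of the multivariate normal draws using a Cholesky factor is shown below. The function name `draw_coefs` and the two-parameter example are illustrative; the pipeline's own helper (`draw_beta()` in Section 8.3) may differ in its details.

```r
# Hypothetical helper: M draws from N(beta_hat, V), returned as a p x M matrix
draw_coefs <- function(beta_hat, V, M) {
  p <- length(beta_hat)
  L <- t(chol(V))                             # lower-triangular factor, V = L %*% t(L)
  Z <- matrix(rnorm(p * M), nrow = p, ncol = M)
  beta_hat + L %*% Z                          # beta_hat is added to every column
}

set.seed(1234)
beta_hat <- c(1, -0.5)
V <- matrix(c(0.04, 0.01, 0.01, 0.09), nrow = 2)
B <- draw_coefs(beta_hat, V, M = 5000)        # column m is draw m
```

Storing draws as columns means the prediction step is a single matrix product of the lpmatrix with `B`.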
7.3 Unified Chunked Computation
Predictions are computed via matrix multiplication on the logit scale:
\[ \boldsymbol{\eta}^{(m)} = \mathbf{X}\,\boldsymbol{\beta}^{(m)}, \qquad m = 1, \dots, M \]
To avoid memory overload, the computation is chunked: the \(n_g \times n_p\) lpmatrix (where \(n_g\) is grid size and \(n_p\) is number of parameters) is processed in blocks of rows, with results accumulated.
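After the logit-scale predictions are computed, the back-transformation from Section 5.2 is applied:
\[ H_{\text{adj}}^{(m)} = H^* \cdot \frac{1}{1 + \exp\!\left(-\eta^{(m)}\right)} \]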
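The chunked product and back-transformation can be sketched as follows. The function name, the chunk size, and the random inputs are assumptions for illustration; this version stores all results, whereas a memory-constrained pipeline would reduce each block to group summaries before moving on.

```r
# Hypothetical helper: row-chunked X %*% B followed by inverse logit and rescaling
predict_chunked <- function(X, B, h_max, chunk_size = 1000) {
  n_g <- nrow(X)
  out <- matrix(NA_real_, nrow = n_g, ncol = ncol(B))
  for (s in seq(1, n_g, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1, n_g)
    eta <- X[idx, , drop = FALSE] %*% B      # logit-scale predictions for this block
    out[idx, ] <- h_max * plogis(eta)        # back to the index scale
  }
  out
}

set.seed(1)
X <- cbind(1, matrix(rnorm(2500 * 3), nrow = 2500))   # stand-in for an lpmatrix
B <- matrix(rnorm(4 * 50, sd = 0.3), nrow = 4)        # stand-in for 50 coefficient draws
H_draws <- predict_chunked(X, B, h_max = 1.97, chunk_size = 1000)
```

Only one block of `eta` exists at a time, so peak memory for the intermediate product scales with `chunk_size` rather than with the full grid.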
7.4 Survey Standard Errors
The original survey data have a complex sampling structure (multi-stage design, stratification). Survey standard errors are computed for each group from the survey design object via svyby(), as shown in Section 8.3.
These standard errors reflect the clustering and stratification of the sample.
7.5 Injecting Sampling Uncertainty
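To combine model uncertainty (via posterior draws) with sampling uncertainty (from the survey design), we inject noise proportional to the survey standard error, using a mean-preserving log-normal form.
For each posterior draw \(m\), a multiplicative noise term is applied:
\[ \tilde{H}_{\text{adj}}^{(m)} = H_{\text{adj}}^{(m)} \exp\!\left(\varepsilon^{(m)} - \tfrac{1}{2}\sigma_{\log}^2\right), \qquad \varepsilon^{(m)} \sim \mathcal{N}\!\left(0,\; \sigma_{\log}^2\right), \]
where \(\sigma_{\log}\) is the survey standard error expressed on the log scale.
The bias correction term \(-\tfrac{1}{2}\sigma_{\log}^2\) ensures the mean is preserved under the log-normal transformation, because \(\mathbb{E}[\exp(\varepsilon)] = \exp(\tfrac{1}{2}\sigma_{\log}^2)\).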
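A sketch of mean-preserving log-normal noise is given below. The mapping from the survey standard error to a log-scale standard deviation (`sigma_log`) is an assumption, chosen so that the multiplier has relative standard deviation `se / mean(draws)`; the function name is hypothetical, and the snippet in Section 8.3 adds Gaussian noise directly on the index scale instead.

```r
# Hypothetical helper: multiply draws by log-normal noise with expectation 1
add_lognormal_noise <- function(draws, se) {
  m <- mean(draws)
  if (!is.finite(se) || se <= 0 || !is.finite(m) || m <= 0) return(draws)
  sigma_log <- sqrt(log1p((se / m)^2))       # assumed SE-to-log-scale mapping
  eps <- rnorm(length(draws), mean = 0, sd = sigma_log)
  draws * exp(eps - 0.5 * sigma_log^2)       # bias term keeps E[multiplier] = 1
}

set.seed(99)
draws <- rnorm(20000, mean = 1.2, sd = 0.02)  # stand-in posterior draws for one cell
noisy <- add_lognormal_noise(draws, se = 0.05)
```

The noisy draws keep essentially the same mean as the originals while their spread widens to reflect sampling error.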
7.6 Final Summary Statistics
From the \(M\) posterior draws (with injected noise), we compute:
- Point estimates: Posterior mean \(\mathbb{E}[H_{\text{adj}}] \approx \frac{1}{M} \sum_m H_{\text{adj}}^{(m)}\)
- Credible intervals: 2.5th and 97.5th percentiles of the draws for 95% credible interval
- Standard deviation: Posterior SD across draws
- Grouped aggregates: Weighted averages across grid cells for each demographic group
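These summaries can be sketched on a matrix of draws with one row per grid cell and one column per draw. The simulated matrix, the weights, and the function name are illustrative assumptions.

```r
# Hypothetical helper: per-cell summaries from an n_cells x M matrix of draws
summarise_draws <- function(H) {
  data.frame(
    mean  = rowMeans(H),
    sd    = apply(H, 1, sd),
    lower = apply(H, 1, quantile, probs = 0.025, names = FALSE),
    upper = apply(H, 1, quantile, probs = 0.975, names = FALSE)
  )
}

set.seed(7)
# Three cells with true means 0.8, 1.0, 1.2 and 4000 draws each
H <- matrix(rnorm(3 * 4000, mean = rep(c(0.8, 1.0, 1.2), times = 4000), sd = 0.05),
            nrow = 3)
s <- summarise_draws(H)

# Grouped aggregate: weight cells within each draw, then summarize across draws
w   <- c(0.5, 0.3, 0.2)
grp <- colSums(w * H)
grp_summary <- c(mean = mean(grp), quantile(grp, c(0.025, 0.975), names = FALSE))
```

Aggregating within each draw before taking quantiles keeps the interval for the group consistent with the joint uncertainty across cells, which averaging per-cell intervals would not.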
8. R Code Reference
8.1 Index Construction
The Shannon Diversity Index is constructed in R using tidyverse and data.table operations. The key function analysis_shannon_food() is sourced from model_food_AFE.R:
# Illustrative sketch of the gram-based index construction (Section 2.1)
library(data.table)

analysis_shannon_food <- function(data, requirements_df, food_cols) {
  # data: one row per household with household_id, afe_size, and one
  #       column of daily grams per food group (names given in food_cols)
  # requirements_df: columns food and requirement (g/day per AFE)

  # Reshape from wide (one column per food group) to long format
  data_long <- melt(as.data.table(data),
                    id.vars = c("household_id", "afe_size"),
                    measure.vars = food_cols,
                    variable.name = "food", value.name = "intake_gm",
                    variable.factor = FALSE)
  data_long <- merge(data_long, as.data.table(requirements_df), by = "food")

  # Per-AFE intake, capped at the requirement
  data_long[, intake_afe := intake_gm / afe_size]
  data_long[, capped := pmin(intake_afe, requirement)]

  # Shares of capped intake within each household
  data_long[, share := capped / sum(capped), by = household_id]

  # Entropy terms, treating 0 * log(0) as 0
  data_long[, h_component := fifelse(share > 0, -share * log(share), 0)]

  # Adequacy ratio, capped at 1
  data_long[, adequacy := pmin(intake_afe / requirement, 1)]

  # Expected cereal share under the requirements (280 / 1240 in Section 2.1)
  cereal_req_share <-
    requirements_df$requirement[requirements_df$food == "cereals"] /
    sum(requirements_df$requirement)

  # Household-level components, keyed by household_id so rows stay aligned
  result <- data_long[, .(
    H = sum(h_component),
    A = mean(adequacy),
    cereal_share = share[food == "cereals"]
  ), by = household_id]

  # Cereal penalty and final index
  result[, cereal_penalty := exp(-3 * pmax(cereal_share - cereal_req_share, 0))]
  result[, H_adj := H * A * cereal_penalty]
  result[]
}
8.2 Model Estimation
The GAM is estimated using mgcv::bam() for efficiency:
library(dplyr)
library(mgcv)

# Normalize and logit-transform the Shannon index
K <- length(food_vars)  # number of food groups (9)
eps <- 1e-6
data <- data1 %>%
  mutate(
    Hnorm = shannon_req_A / log(K),              # scale by the uniform-share maximum
    Hnorm01 = pmin(pmax(Hnorm, eps), 1 - eps),   # clip away from 0 and 1
    z = qlogis(Hnorm01)                          # logit transform
  )

# GAM formula
quant <- z ~ 1 +
  s(log_mpce_real_afe) +
  s(log_mpce_real_afe, state_code, bs = "fs", m = 1) +
  s(log_mpce_real_afe, seasonal, bs = "fs", m = 1) +
  s(log_mpce_real_afe, child, bs = "fs", m = 1) +
  s(log_mpce_real_afe, female_headed, bs = "fs", m = 1) +
  s(log_mpce_real_afe, social, bs = "fs", m = 1) +
  s(log_mpce_real_afe, rel, bs = "fs", m = 1) +
  s(sector, bs = "re") +
  s(nss_region, bs = "re") +
  s(reg_sector, bs = "re") +
  s(social, bs = "re") +
  s(rel, bs = "re") +
  s(child, bs = "re") +
  s(female_headed, bs = "re")

# Fit using bam() for large-dataset efficiency
model_shannon_food <- mgcv::bam(
  quant,
  data = data,
  weights = data$w_pc,
  method = "fREML",
  discrete = TRUE,
  gamma = 1.4,
  gc.level = 2,
  select = TRUE
)

# Estimate for both survey rounds and save
models <- list(
  HCES2011_model = analysis_shannon_food("HCES2011"),
  HCES2023_model = analysis_shannon_food("HCES2023")
)
save(models, file = file.path(folder, "Shannon_Diversity_food_model.RData"))
8.3 Prediction Data Generation
The prediction pipeline is handled by data_for_fig_shannon_food_AFE.R, which calls compute_grp_draws_unified() to generate posterior draws and svyby() for survey standard errors:
data_analysis_shannon_food <- function(
    n = "HCES2023",
    group_vars = c("nss", "rel"),
    n_sims = 1000,
    seed = 1234,
    jitter_eps = 1e-6
) {
  # Step 1-2: Load MPCE data and pre-fitted Shannon models
  obj <- models[[paste0(n, "_model")]]
  model_main <- obj$model_shannon_food
  model_adj <- obj$model_shannon_food_cereal_adj
  K <- length(obj$food_vars$category_balanced_diet)

  # Step 3: Assign households to MPCE decile bins
  data <- obj$data %>%
    left_join(data_mpce_cutoff, by = group_vars) %>%
    mutate(bin = factor(findInterval(mpce_real_2011_afe, ...)))

  # Step 4: Build standardized prediction grid
  data_gr <- grid_f(n = n, group_vars = group_vars, season = "TRUE")
  nd <- data_gr[["nd"]]

  # Step 5: Draw coefficients from posterior (MVN)
  B_main <- draw_beta(model_main, n_sims, jitter_eps)
  B_adj <- draw_beta(model_adj, n_sims, jitter_eps)

  # Step 6: Unified chunked computation (both models in single pass)
  grp_draws <- compute_grp_draws_unified(
    nd = nd,
    model_list = list(main = model_main, adj = model_adj),
    B_list = list(main = B_main, adj = B_adj),
    n_sims = n_sims,
    by_cols = c(group_vars, "bin", "log_mpce_real_afe", "grid_id"),
    K = K,
    compute_ratio = TRUE
  )

  # Step 7: Survey standard errors
  des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wts,
                   data = data_svy, nest = TRUE)
  svy_results <- svyby(~shannon_req_A_hat, by = by_f,
                       design = des, FUN = svymean, vartype = "se")

  # Step 8: Inject sampling uncertainty
  add_noise <- function(draws, se) {
    if (is.na(se) || !is.finite(se) || se <= 0) return(draws)
    draws + rnorm(length(draws), mean = 0, sd = se)
  }
  out <- out %>%
    mutate(shannon_req_A_g_svy = Map(add_noise, shannon_req_A_g, shannon_se_svy))

  return(out_final)
}
9. Summary of Model Assumptions
The methodology rests on several key assumptions:
- Smooth Engel Curves: We assume the relationship between log expenditure and dietary diversity is smooth and continuous (penalized spline assumption).
- Normal Posterior Approximation: The posterior distribution of model coefficients is assumed multivariate normal, justified by large-sample Bayesian asymptotics.
- Additivity of Uncertainty: Model and sampling uncertainty are combined additively (in log space), assuming they are approximately independent.
- Logit Transformation Adequacy: The logit transformation correctly maps the bounded Shannon index to the real line for Gaussian regression.
- Exchangeability: Within survey rounds, households are treated as exchangeable conditional on covariates, justifying the random effects specification.
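References
[1] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.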
Last updated: February 2026