Methodology Note

Shannon Diversity Index for Food Consumption in India

Construction, GAM-Based Modeling, and Prediction Data Generation
Dr Shamika Ravi (Member, EAC to PM) & Dr Mudit Kapoor (CECFEE, EPU, ISI-Delhi Center)
February 2026

1. Introduction

1.1 Motivation

The Shannon Diversity Index provides a single summary of dietary quality that captures both variety and adequacy. Unlike simple adequacy ratios, it distinguishes between a balanced diet and one concentrated in a few food groups. Individual food item models tell us what people eat, but not how balanced their diet is. The Shannon diversity index (Shannon, 1948)[2] addresses this by summarizing the evenness of consumption across food groups, combined with a requirement-adequacy adjustment.

This methodology document outlines the construction, modeling, and prediction of the Shannon Diversity Index for food consumption in India, using household-level data from the Household Consumer Expenditure Survey (HCES) 2011–12 and 2023–24, benchmarked against ICMR-NIN Recommended Dietary Allowances.

1.2 Food Categories Analyzed

The analysis covers ten food categories based on the ICMR-NIN recommended balanced diet for an adult woman (55 kg, moderately active, non-lactating):

Food Category | Daily Requirement (grams)
Cereals & Millets | 280
Green Leafy Vegetables | 100
Other Vegetables | 200
Roots & Tubers (excl. potatoes) | 100
Fruits | 100
Milk & Milk Products | 300
Fats & Oils | 25
Oilseeds & Nuts | 40
Pulses & Beans | 95 (combined with Flesh Foods)
Flesh Foods | 95 (combined with Pulses & Beans)

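Note: Pulses & Beans and Flesh Foods share a combined requirement of 95 grams per day, so the ten listed categories form \(K = 9\) food groups whose requirements sum to 1,240 grams per day. In the gram-based construction, we allocate the combined requirement flexibly based on household consumption patterns.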

2. Gram-Based Shannon Diversity Index

2.1 Construction (Step by Step)

The gram-based Shannon Diversity Index is constructed through the following sequential steps:

  1. Per-AFE Intake: Divide household-level food quantities by Adult Female Equivalent (AFE) household size to obtain grams per day per AFE for each of the 9 food groups.
  2. Cap at Requirement: Set \(a_i = \min(\text{intake}_i, \text{requirement}_i)\) for each group \(i\). Eating more than required does not inflate the diversity score.
  3. Compute Shares: Calculate \(p_i = \frac{a_i}{\sum_j a_j}\) — the proportion of capped intake from each food group.
  4. Shannon Entropy: Compute \(H = -\sum_i p_i \log(p_i)\) using the natural logarithm, with \(0 \log 0\) taken as 0; higher values indicate a more even distribution across groups.
  5. Adequacy Adjustment: For each food group, calculate \(\min(\text{intake}_i / \text{requirement}_i, 1)\), then average across all groups to obtain \(A\). This penalizes diets that are diverse but nutritionally insufficient.
  6. Cereal Adjustment: Indian diets are typically dominated by cereals. Even after capping at the requirement (Step 2), cereals still command a large share because their requirement (280 g/day) is much larger than most other food groups (e.g., Fats & Oils at 25 g, Oilseeds & Nuts at 40 g). This means a household could score reasonably well on Shannon entropy simply because it eats a lot of rice or wheat — even if it barely consumes any vegetables, fruits, or protein sources. The cereal adjustment corrects for this by penalizing households whose cereal share exceeds what would be expected under a balanced, requirement-proportional diet.

    First, compute the expected cereal share under the ICMR-NIN requirements: \(s^{*}_{\mathrm{cereal}} = \dfrac{r_{\mathrm{cereal}}}{\sum_j r_j} = \dfrac{280}{1240} \approx 0.226\). This is the share cereals would have if every food group were consumed at exactly its required level.

    Then, compute how much the household's actual cereal share (from Step 3) exceeds this benchmark: \(\delta_{\mathrm{cereal}} = \max\!\left(p_{\mathrm{cereal}} - s^{*}_{\mathrm{cereal}},\; 0\right)\). If the household's cereal share is at or below the expected share, \(\delta_{\mathrm{cereal}} = 0\) and no penalty is applied.

    Finally, convert the excess share into an exponentially decaying penalty: \(C = \exp\!\left(-3 \times \delta_{\mathrm{cereal}}\right)\). The multiplier of 3 controls the severity: a cereal share 10 percentage points above the benchmark yields \(C = e^{-0.3} \approx 0.74\) (a 26% penalty), while a share 20 points above yields \(C = e^{-0.6} \approx 0.55\) (a 45% penalty). The exponential form ensures \(C \in (0, 1]\), with \(C = 1\) when cereal consumption is proportionate and \(C \to 0\) as cereal dominance becomes extreme.

    In practice, this adjustment matters most for low-income households that spend the bulk of their food budget on rice or wheat and consume negligible amounts of fruits, vegetables, dairy, and protein-rich foods. Without this correction, such households would receive misleadingly high diversity scores.

  7. Final Index: Calculate \(H_{\text{adj}} = H \times A \times C\).
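
The seven steps above can be sketched for a single household as follows. This is a minimal illustration rather than the production code: the helper name shannon_gram(), its severity argument, the short group labels, and the example intake vector are all hypothetical, and Pulses & Beans and Flesh Foods are treated as one combined group with the 95 g requirement.

```r
# ICMR-NIN requirements (g/day) for the K = 9 groups; pulses and flesh combined
req <- c(cereals = 280, glv = 100, other_veg = 200, roots = 100, fruits = 100,
         milk = 300, fats = 25, oilseeds = 40, pulses_flesh = 95)

# Hypothetical helper: gram-based index from per-AFE intakes (same order as req)
shannon_gram <- function(intake, req, severity = 3) {
  a <- pmin(intake, req)                        # Step 2: cap at requirement
  p <- a / sum(a)                               # Step 3: shares of capped intake
  H <- -sum(ifelse(p > 0, p * log(p), 0))       # Step 4: entropy, 0 log 0 := 0
  A <- mean(pmin(intake / req, 1))              # Step 5: mean capped adequacy
  s_star <- req[["cereals"]] / sum(req)         # Step 6: benchmark share, 280/1240
  delta <- max(p[["cereals"]] - s_star, 0)      #         excess cereal share
  C <- exp(-severity * delta)                   #         exponential penalty
  c(H = H, A = A, C = C, H_adj = H * A * C)     # Step 7: final index
}

# Made-up cereal-heavy household (g/day per AFE), for illustration only
intake <- c(cereals = 350, glv = 20, other_veg = 80, roots = 30, fruits = 15,
            milk = 90, fats = 20, oilseeds = 5, pulses_flesh = 40)
res <- shannon_gram(intake, req)
print(round(res, 3))

# A household meeting every requirement exactly incurs no cereal penalty
full <- shannon_gram(req, req)
print(round(full, 3))
```

The sketch does not handle a household with zero capped intake in every group, for which the shares are undefined.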

2.2 Mathematical Formulation

Let \(q_i\) denote the intake (in grams per day per AFE) of food group \(i\), and \(r_i\) denote the ICMR-NIN requirement for that group.

Shannon Entropy:

\[H = -\sum_{i=1}^{K} p_i \log(p_i) \quad \text{where} \quad p_i = \frac{\min(q_i, r_i)}{\sum_j \min(q_j, r_j)}\]

Adequacy Score:

\[A = \frac{1}{K} \sum_{i=1}^K \min\!\left(\frac{q_i}{r_i}, 1\right)\]

Cereal Penalty:

\[C = \exp\!\left(-3 \cdot \max\!\left(p_{\text{cereal}} - \frac{r_{\text{cereal}}}{\sum_j r_j}, 0\right)\right)\]

Combined Diversity Index:

\[H_{\text{adj}} = H \times A \times C\]

2.3 Interpreting the Components

The Shannon Diversity Index combines three complementary dimensions of diet quality:

Component | What It Captures | Range
\(H\) (Shannon) | Evenness of consumption across food groups | \([0, H^*]\)
\(A\) (Adequacy) | Whether each food group meets its requirement | \([0, 1]\)
\(C\) (Cereal Penalty) | Penalizes cereal over-dependence | \((0, 1]\)
\(H_{\text{adj}}\) | Overall dietary quality | \([0, H^*]\)

The theoretical maximum of the Shannon entropy for \(K = 9\) food groups is \(\ln(9) \approx 2.20\), attained when all shares are equal (\(p_i = 1/9\)). Because intakes are capped at requirements (Step 2) and requirements differ across food groups, a household meeting all requirements exactly will have shares \(p_i = r_i / \sum_j r_j\), which are not uniform. Therefore the effective maximum under this construction is \(H^* = -\sum (r_i / \sum r_j)\,\ln(r_i / \sum r_j) \approx 1.97 < \ln(9)\). A household consuming only rice has \(H \approx 0\) (no diversity); a household consuming from all groups in balanced proportions has \(H \approx H^*\).
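
The benchmark quantities quoted above can be checked with a few lines of R. This is only a verification sketch: the variable names are illustrative, and the requirement vector assumes the nine-group list from Section 1.2 with pulses and flesh combined.

```r
# Requirements in g/day (Section 1.2), pulses and flesh foods combined
req <- c(280, 100, 200, 100, 100, 300, 25, 40, 95)

s <- req / sum(req)            # requirement-proportional shares
s_cereal <- s[1]               # benchmark cereal share, 280 / 1240
H_star <- -sum(s * log(s))     # entropy when every requirement is met exactly
H_unif <- log(length(req))     # ln(9), entropy of uniform shares

print(round(c(s_cereal = s_cereal, H_star = H_star, H_unif = H_unif), 3))
```

Rounded to two decimals, the last two values correspond to the 1.97 and 2.20 cited in the text.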

3. Ratio-Based Shannon Diversity Index (Alternative)

3.1 Motivation

In the gram-based construction, food groups with large requirements (e.g., cereals at 280g) dominate the shares \(p_i\), making it harder for smaller-requirement groups (e.g., oilseeds at 40g) to influence diversity. The ratio-based alternative normalizes each intake by its requirement first, placing all food groups on a common 0–1 scale regardless of requirement magnitude.

3.2 Construction (Step by Step)

The ratio-based construction follows a similar sequence, with one key modification:

  1. Per-AFE Intake: Same as gram-based.
  2. Ratio and Cap: Set \(a_i = \min(q_i / r_i, 1)\) for each group. This is the adequacy ratio, capped at 1.
  3. Compute Shares: Calculate \(p_i = \frac{a_i}{\sum_j a_j}\) — the proportion of capped adequacy ratios.
  4. Shannon Entropy: Compute \(H = -\sum_i p_i \log(p_i)\).
  5. Adequacy Adjustment: Same as gram-based: \(A = \frac{1}{K} \sum_i \min(q_i/r_i, 1)\).
  6. Final Index: Calculate \(H_{\text{adj}}^{\text{ratio}} = H \times A\) (no cereal adjustment).

3.3 Mathematical Formulation

\[H = -\sum_{i=1}^{K} p_i \log(p_i) \quad \text{where} \quad p_i = \frac{\min(q_i/r_i, 1)}{\sum_j \min(q_j/r_j, 1)}\]
\[A = \frac{1}{K} \sum_{i=1}^K \min\!\left(\frac{q_i}{r_i}, 1\right)\]
\[H_{\text{adj}}^{\text{ratio}} = H \times A\]
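
A minimal sketch of the ratio-based construction, under the same assumptions as the gram-based case (nine groups, pulses and flesh combined). The helper name shannon_ratio() and the group labels are hypothetical.

```r
# Requirements in g/day for the K = 9 groups
req <- c(cereals = 280, glv = 100, other_veg = 200, roots = 100, fruits = 100,
         milk = 300, fats = 25, oilseeds = 40, pulses_flesh = 95)

# Hypothetical helper: ratio-based index from per-AFE intakes (same order as req)
shannon_ratio <- function(intake, req) {
  a <- pmin(intake / req, 1)                  # adequacy ratio, capped at 1
  p <- a / sum(a)                             # shares of capped adequacy ratios
  H <- -sum(ifelse(p > 0, p * log(p), 0))     # entropy, 0 log 0 := 0
  A <- mean(a)                                # adequacy score
  c(H = H, A = A, H_adj_ratio = H * A)        # no cereal adjustment
}

# Full adequacy gives uniform shares, so H reaches log(9)
full_ratio <- shannon_ratio(req, req)
print(round(full_ratio, 3))

# Doubling cereal intake changes nothing once the ratio is capped at 1
extra_cereal <- req
extra_cereal[["cereals"]] <- 2 * req[["cereals"]]
print(round(shannon_ratio(extra_cereal, req), 3))
```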

3.4 Key Differences from Gram-Based

Feature | Gram-Based | Ratio-Based
Capping Rule | \(a_i = \min(q_i, r_i)\) | \(a_i = \min(q_i/r_i, 1)\)
Scale | Grams per day | Proportional (0–1)
Share Interpretation | Gram proportion of capped intake | Proportion of capped adequacy ratios
Cereal Dominance | High (large requirement) | Moderate (equal max of 1.0)
Cereal Adjustment | Yes, essential | No, unnecessary
Max Shannon | \(H^* \approx 1.97\) (requirement-weighted) | \(\ln(9) \approx 2.20\) (uniform when all groups fully met)

3.5 Why Cereal Adjustment Is Unnecessary

In the ratio-based construction, each food group contributes at most 1.0 to the share denominator, regardless of gram requirement. Cereal's share can only become large if other groups have genuinely insufficient intakes. Because the construction already equalizes the maximum contribution of each group, no additional cereal penalty is needed.

4. Why Shannon Over Adequacy Ratio?

4.1 MAR Cannot Distinguish Balance

The simple Mean Adequacy Ratio (MAR) is defined as \(\text{MAR} = \frac{1}{K} \sum_i \min(q_i/r_i, 1)\) — it captures only the first moment of the adequacy distribution. Consider three households:

Household | Cereal Adequacy | Veg Adequacy | Protein Adequacy | MAR | Shannon \(H_{\text{adj}}\)
A | 1.0 | 0.5 | 0.2 | 0.57 | 0.42
B | 0.8 | 0.6 | 0.5 | 0.63 | 0.58
C | 0.6 | 0.6 | 0.6 | 0.60 | 0.55

Households A and C have similar MAR values (0.57 and 0.60), yet their diet distributions are quite different: A has one adequate group and two under-consumed; C has all three groups equally under-consumed. The Shannon index captures this distributional difference through the entropy component \(H\).

4.2 Shannon Captures Distribution

The Shannon entropy \(H\) measures the distributional shape — how evenly consumption is spread. Two diets with identical adequacy scores but different shapes (one concentrated in one group, one spread across groups) will have different Shannon values. The combined index \(H_{\text{adj}} = H \times A \times C\) thus captures both the level of intake (via \(A\)) and its distribution (via \(H\)).
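
A small numerical illustration of this point, using the ratio-based shares from Section 3. The two diets are invented for the example: both have the same mean adequacy, but one is concentrated in three groups and the other is spread across all nine.

```r
# Entropy with the convention 0 log 0 := 0
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

# Hypothetical adequacy ratios q_i / r_i for K = 9 groups
concentrated <- c(1, 1, 1, 0, 0, 0, 0, 0, 0)   # three groups fully met, six absent
spread       <- rep(1 / 3, 9)                  # every group one-third met

mar <- c(concentrated = mean(concentrated), spread = mean(spread))
H   <- c(concentrated = entropy(concentrated / sum(concentrated)),
         spread       = entropy(spread / sum(spread)))

print(round(rbind(MAR = mar, H = H, H_times_A = H * mar), 3))
```

The adequacy scores coincide, while the entropy is log(3) for the concentrated diet and log(9) for the spread diet, so the product \(H \times A\) separates the two.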

4.3 Policy Relevance

Interventions to improve diet quality depend on the diagnostic:

5. Modeling the Diversity Index

5.1 Data Preparation

The analysis uses data from HCES 2011–12 and 2023–24. Key covariates include:

Covariate | Description | Role
log_mpce_real_afe | Log of real MPCE per AFE (standardized to base year) | Global Engel curve
state_code | State identifier | Geographic variation
sector | Rural / Urban | Urbanization effect
nss_region | NSS statistical region | Regional variation
reg_sector | Regional × Sector interaction | Regional urbanization patterns
social | Social group (SC/ST/OBC/Others) | Social equity
rel | Religion | Dietary preferences
child | Binary: presence of children | Household composition
female_headed | Binary: female household head | Gender dimension
seasonal | Season of survey (coded by round) | Seasonal variation

5.2 Transformation to the Real Line

The Shannon index \(H_{\text{adj}}\) is bounded: \(H_{\text{adj}} \in [0, H^*]\). To model it with a Gaussian GAM, we apply a logit transformation:

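  1. Normalize: \(\tilde{H}_{\text{adj}} = \frac{H_{\text{adj}}}{H^*} \in [0, 1]\). The code in Section 8.2 divides by \(\log K\) rather than \(H^*\); because \(H^* < \log K\), this also keeps the normalized index in \([0, 1]\). Whichever constant is used must also be used in the back-transformation.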
  2. Clip: Clip to \([\varepsilon, 1-\varepsilon]\) where \(\varepsilon = 10^{-6}\) (prevents infinities)
  3. Logit transform: \(y = \log\left(\frac{\tilde{H}_{\text{adj}}}{1 - \tilde{H}_{\text{adj}}}\right) \in \mathbb{R}\)

After modeling on the logit scale, predictions are back-transformed to the original scale via the inverse logit and scaling.
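
The forward and inverse transformations can be sketched with base R's qlogis() and plogis(). The function names to_logit() and from_logit() are hypothetical, and \(H^*\) is set to the rounded value 1.97 for illustration.

```r
H_star <- 1.97   # normalizing constant (rounded value from Section 2.3)
eps    <- 1e-6   # clipping tolerance

# Normalize, clip to [eps, 1 - eps], then apply the logit
to_logit <- function(H_adj) {
  qlogis(pmin(pmax(H_adj / H_star, eps), 1 - eps))
}

# Inverse logit, then rescale to the original index scale
from_logit <- function(y) H_star * plogis(y)

H_adj  <- c(0, 0.4, 1.2, 1.97)   # includes both boundary values
y      <- to_logit(H_adj)
H_back <- from_logit(y)

print(cbind(H_adj, y, H_back))
```

Boundary values are moved by at most \(\varepsilon \times H^*\) by the clipping step; interior values are recovered up to floating-point error.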

5.3 GAM Specification

The GAM uses a semi-parametric specification with three types of terms:

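Note: The GAM formula combines a global Engel curve (smooth in log MPCE) with factor-smooth interactions (group-specific deviations from that curve by state, season, presence of children, female headship, social group, and religion, each with a shared smoothness penalty) and random intercepts for sector, NSS region, region × sector, social group, religion, presence of children, and female headship.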

The full specification (from model_food_AFE.R) is:

quant <- z ~ 1 +
    s(log_mpce_real_afe) +
    s(log_mpce_real_afe, state_code, bs = "fs", m = 1) +
    s(log_mpce_real_afe, seasonal, bs = "fs", m = 1) +
    s(log_mpce_real_afe, child, bs = "fs", m = 1) +
    s(log_mpce_real_afe, female_headed, bs = "fs", m = 1) +
    s(log_mpce_real_afe, social, bs = "fs", m = 1) +
    s(log_mpce_real_afe, rel, bs = "fs", m = 1) +
    s(sector, bs = "re") +
    s(nss_region, bs = "re") +
    s(reg_sector, bs = "re") +
    s(social, bs = "re") +
    s(rel, bs = "re") +
    s(child, bs = "re") +
    s(female_headed, bs = "re")

This specification includes:

5.4 Estimation Details

The model is estimated using mgcv::bam() with the following settings:

Parameter | Value | Rationale
method | "fREML" (fast REML) | Efficient smoothing-parameter estimation for large datasets
discrete | TRUE | Discretizes covariates to reduce memory and computation for large \(n\)
gamma | 1.4 | Slight over-smoothing for stability
select | TRUE | Allows shrinkage of individual terms toward zero
family | Gaussian (the bam() default) | Appropriate for logit-transformed \(y\)

5.5 Models Estimated

Two variants are estimated for each round of data (2011–12 and 2023–24):

This provides a total of 4 models, permitting comparison of both construction approaches across survey rounds.

6. Prediction Grid Construction

6.1 The Standardization Problem

To compare dietary diversity across demographic groups while accounting for differences in expenditure distribution, we standardize the geographic composition. For instance, rural and urban sectors differ not only in consumption patterns but also in their geographic concentration — rural populations are concentrated in poorer states, which would confound any rural–urban comparison without standardization.

The standardization formula for a demographic group \(g\) is:

\[\mathbb{E}[H_{\text{adj}} | g] = \sum_{r,s} \mathbb{E}[H_{\text{adj}} | g, r, s] \times P(r, s | \text{standard distribution})\]

where \(r\) is region and \(s\) is sector, and the "standard distribution" is the aggregate geographic distribution across both sectors.
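
As a toy illustration of the formula, the standardized mean for one group is a weighted sum of cell-level predictions, with weights taken from the standard (region, sector) distribution. All numbers and labels below are invented.

```r
# Hypothetical cell-level predictions E[H_adj | g, r, s] for one group g,
# with made-up population counts defining the standard distribution
cells <- data.frame(
  region = c("R1", "R1", "R2", "R2"),
  sector = c("rural", "urban", "rural", "urban"),
  pred   = c(0.80, 1.10, 0.95, 1.25),
  pop    = c(400, 100, 300, 200)
)

# P(r, s | standard distribution): aggregate shares across both sectors
cells$w <- cells$pop / sum(cells$pop)

# Standardized group mean: sum over (r, s) of prediction times weight
std_mean <- sum(cells$pred * cells$w)
print(std_mean)
```

Because every group is averaged over the same weights, differences in std_mean across groups are not driven by where each group happens to live.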

6.2 Grid Construction Logic

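Prediction grids are constructed by specifying combinations of the demographic factors (social group, religion, presence of children, female headship, season) and the expenditure deciles, replicated over the (region, sector) pairs used for standardization (see Section 6.4).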

For each cell, we predict \(H_{\text{adj}}\) holding geography fixed at the standardized distribution, yielding group-level predictions that are geographically comparable.

6.3 Expenditure Binning

To avoid over-smoothing or under-smoothing in the expenditure dimension, deciles are calculated and predictions are made at the median of each decile:

Decile | Percentile Range | Prediction Point (Median of Range)
D1 | 0–10% | 5th percentile
D2 | 10–20% | 15th percentile
... | ... | ...
D10 | 90–100% | 95th percentile
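
A minimal sketch of the binning, using simulated log expenditure in place of the survey variable. For simplicity it uses unweighted quantiles; the actual pipeline may apply survey weights, which this example does not attempt to reproduce.

```r
set.seed(1)
log_mpce <- rnorm(1000, mean = 7.5, sd = 0.6)   # simulated stand-in data

# Prediction points: the 5th, 15th, ..., 95th percentiles (decile medians)
pred_points <- quantile(log_mpce, probs = seq(0.05, 0.95, by = 0.10))

# Decile cutoffs and bin membership (1 = D1, ..., 10 = D10)
cutoffs <- quantile(log_mpce, probs = seq(0.1, 0.9, by = 0.1))
bin <- findInterval(log_mpce, cutoffs) + 1

print(round(pred_points, 2))
print(table(bin))
```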

6.4 Grid Construction Steps

  1. Standardize geography: Compute the aggregate distribution of (region, sector) pairs as weights.
  2. Create base grid: Expand combinations of demographic factors (social, religion, child, female-headed, season) and expenditure deciles.
  3. Replicate for geography: For each base grid row, replicate across all (region, sector) pairs to be standardized over.
  4. Compute linear predictor matrix (lpmatrix): Use predict(model, newdata = grid, type = "lpmatrix") to obtain the design matrix \(\mathbf{X}\).
  5. Store for posterior simulation: The lpmatrix is used to compute predictions from posterior coefficient draws.
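
Steps 1 to 3 can be sketched in base R as below. The factor levels, population counts, and column names are placeholders rather than the actual HCES coding, and the grid is reduced to three factors to keep the example short. Step 4 is shown only as a comment because it needs a fitted mgcv model.

```r
# Step 1: geographic weights from hypothetical (region, sector) counts
geo <- data.frame(
  region = c("R1", "R1", "R2", "R2"),
  sector = c("rural", "urban", "rural", "urban"),
  pop    = c(400, 100, 300, 200)
)
geo$w <- geo$pop / sum(geo$pop)

# Step 2: base grid of demographic factors and expenditure deciles
base <- expand.grid(
  social = c("SC", "ST", "OBC", "Others"),
  child  = c(0, 1),
  decile = 1:10,
  stringsAsFactors = FALSE
)
base$grid_id <- seq_len(nrow(base))

# Step 3: replicate each base row across all (region, sector) pairs
grid <- merge(base, geo, by = NULL)   # by = NULL gives the Cartesian product

# Step 4 (not run): X <- predict(model, newdata = grid, type = "lpmatrix")

print(dim(grid))
```

Keeping grid_id and the weight w on every row makes it possible to collapse the replicated rows back to one standardized prediction per base cell.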

7. Posterior Simulation and Uncertainty Propagation

7.1 Covariance Matrix Validation

After model fitting, the posterior covariance matrix \(\hat{\mathbf{V}}_p\) is extracted. To ensure numerical stability, we check:

If checks fail, the matrix is regularized using eigenvalue decomposition:

\[\mathbf{V}_p^{\text{reg}} = \mathbf{Q} \Lambda_{\text{reg}} \mathbf{Q}^T \quad \text{where} \quad \Lambda_{\text{reg}} = \max(\Lambda, \text{tol} \times \lambda_{\max})\]
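
The eigenvalue floor can be sketched in base R as follows. The function name regularize_cov() and the default tolerance are assumptions made for illustration; the source does not state the value of tol.

```r
# Hypothetical helper: floor eigenvalues at tol * (largest eigenvalue)
regularize_cov <- function(V, tol = 1e-8) {
  V   <- (V + t(V)) / 2                        # enforce exact symmetry
  e   <- eigen(V, symmetric = TRUE)            # V = Q Lambda Q'
  lam <- pmax(e$values, tol * max(e$values))   # Lambda_reg
  e$vectors %*% diag(lam, nrow = length(lam)) %*% t(e$vectors)
}

# A singular 2 x 2 example: eigenvalues are 2 and 0
V     <- matrix(c(1, 1, 1, 1), nrow = 2)
V_reg <- regularize_cov(V)

print(eigen(V_reg, symmetric = TRUE)$values)
R <- chol(V_reg)   # succeeds once the matrix is positive definite
```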

7.2 Coefficient Draws

From the validated covariance matrix, we draw \(M\) posterior coefficient vectors:

\[\boldsymbol{\beta}^{(m)} \sim \mathcal{N}(\hat{\boldsymbol{\beta}}, \hat{\mathbf{V}}_p) \quad \text{for } m = 1, \ldots, M\]

Typically, \(M = 1000\) to 5000 draws are used, balancing accuracy and computational cost.
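
One way to generate such draws in base R is through a Cholesky factor of the covariance matrix. The function below is a hypothetical stand-in and is not the draw_beta() helper used in the pipeline, whose internals are not shown in this note.

```r
# Hypothetical helper: M draws from N(beta_hat, Vp), returned as an M x p matrix
draw_beta_sketch <- function(beta_hat, Vp, M, seed = 1234) {
  set.seed(seed)
  R <- chol(Vp)                                   # upper triangular, Vp = R'R
  Z <- matrix(rnorm(M * length(beta_hat)), nrow = M)
  sweep(Z %*% R, 2, beta_hat, "+")                # each row: beta_hat + z R
}

# Toy example with two coefficients
beta_hat <- c(1, -2)
Vp <- matrix(c(1, 0.5, 0.5, 2), nrow = 2)
B  <- draw_beta_sketch(beta_hat, Vp, M = 5000)

print(round(colMeans(B), 2))
print(round(cov(B), 2))
```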

7.3 Unified Chunked Computation

Predictions are computed via matrix multiplication on the logit scale:

\[\hat{y}^{(m)} = \mathbf{X} \boldsymbol{\beta}^{(m)}\]

To avoid memory overload, the computation is chunked: the \(n_g \times n_p\) lpmatrix (where \(n_g\) is grid size and \(n_p\) is number of parameters) is processed in blocks of rows, with results accumulated.

After logit scale predictions, back-transformation is applied:

\[\tilde{H}^{(m)} = \frac{\exp(\hat{y}^{(m)})}{1 + \exp(\hat{y}^{(m)})} \quad \text{and} \quad H_{\text{adj}}^{(m)} = H^* \times \tilde{H}^{(m)}\]
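
A simplified sketch of the chunked computation and back-transformation. It stores the full grid-by-draw result, whereas the pipeline accumulates group-level summaries; the function name predict_chunked(), the chunk size, and the simulated inputs are all assumptions.

```r
# X: n_g x n_p lpmatrix; B: M x n_p matrix of coefficient draws
predict_chunked <- function(X, B, H_star, chunk_size = 1000) {
  n_g <- nrow(X)
  out <- matrix(NA_real_, nrow = n_g, ncol = nrow(B))
  for (start in seq(1, n_g, by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, n_g)
    eta <- X[idx, , drop = FALSE] %*% t(B)   # logit-scale predictions
    out[idx, ] <- H_star * plogis(eta)       # inverse logit, then rescale
  }
  out
}

# Simulated inputs, for illustration only
set.seed(42)
X <- matrix(rnorm(2500 * 3), ncol = 3)
B <- matrix(rnorm(10 * 3), ncol = 3)
H_draws <- predict_chunked(X, B, H_star = 1.97)

print(dim(H_draws))
```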

7.4 Survey Standard Errors

The original survey data have sampling structure (multi-stage design, stratification). Survey standard errors are computed for each group using the survey design weights via svyby():

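\[\text{SE}_{\text{survey}} = \sqrt{\widehat{\text{Var}}_{\text{design}}\!\left(\bar{H}_{g}\right)}\]

where \(\bar{H}_{g}\) is the design-weighted mean for group \(g\) and \(\widehat{\text{Var}}_{\text{design}}\) is its design-based variance estimate, which already reflects the group's sample size. These standard errors reflect the clustering and stratification of the sample.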

7.5 Injecting Sampling Uncertainty

To combine model uncertainty (via posterior draws) with sampling uncertainty (from the survey design), we inject noise proportional to the survey standard error. Using the mean-preserving noise approach:

\[\sigma_{\log}^2 = \log\left(1 + \text{CV}^2\right) \quad \text{where} \quad \text{CV} = \frac{\text{SE}_{\text{survey}}}{\text{Group Mean}}\]

For each posterior draw \(m\), an additional noise term is applied:

\[H_{\text{adj}}^{(m), \text{noisy}} = H_{\text{adj}}^{(m)} \times \exp\left(\varepsilon_m - \tfrac{1}{2}\sigma_{\log}^2\right) \quad \text{where} \quad \varepsilon_m \sim \mathcal{N}(0, \sigma_{\log}^2)\]

The bias correction term \(-\tfrac{1}{2}\sigma_{\log}^2\) ensures the mean is preserved under the log-normal transformation.
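
The multiplicative noise described here can be sketched as below. The function name inject_noise() and the example values are hypothetical; the guard on the standard error mirrors the check used in the pipeline's add_noise() helper.

```r
# Hypothetical helper: mean-preserving log-normal noise with the survey CV
inject_noise <- function(draws, se, group_mean) {
  if (is.na(se) || !is.finite(se) || se <= 0) return(draws)
  cv     <- se / group_mean
  sigma2 <- log(1 + cv^2)                          # log-scale variance
  eps    <- rnorm(length(draws), mean = 0, sd = sqrt(sigma2))
  draws * exp(eps - sigma2 / 2)                    # bias-corrected multiplier
}

# Constant draws make the effect of the noise easy to see
set.seed(1234)
draws <- rep(1.2, 1e5)
noisy <- inject_noise(draws, se = 0.06, group_mean = 1.2)

print(c(mean = mean(noisy), cv = sd(noisy) / mean(noisy)))
```

With this choice of \(\sigma_{\log}^2\), the multiplier has mean 1 and coefficient of variation equal to CV, so the noisy draws keep their mean while gaining the survey's relative uncertainty.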

7.6 Final Summary Statistics

From the \(M\) posterior draws (with injected noise), we compute:

8. R Code Reference

8.1 Index Construction

The Shannon Diversity Index is constructed in R using tidyverse and data.table operations. The key function analysis_shannon_food() is sourced from model_food_AFE.R:

# Pseudocode for index construction (gram-based variant).
# Assumes these objects exist in scope: food_cols (character vector of the
# food-group column names), requirements_df (columns: food, requirement) and
# cereal_req_share (benchmark cereal share, 280 / 1240).
library(tidyr)
library(data.table)

analysis_shannon_food <- function(data) {
  # Pivot from wide (food categories) to long format
  data_long <- pivot_longer(data, cols = all_of(food_cols),
                            names_to = "food", values_to = "intake_gm")

  # Merge with requirements; convert to data.table for grouped operations
  data_long <- as.data.table(merge(data_long, requirements_df, by = "food"))

  # Compute per-AFE intake
  data_long[, intake_afe := intake_gm / afe_size]

  # Cap at requirement
  data_long[, capped := pmin(intake_afe, requirement)]

  # Compute shares of capped intake within each household
  data_long[, share := capped / sum(capped), by = household_id]

  # Shannon entropy (0 * log(0) is treated as 0)
  data_long[, h_component := ifelse(share > 0, -share * log(share), 0)]
  shannon_h <- data_long[, .(H = sum(h_component)), by = household_id]

  # Adequacy adjustment
  data_long[, adequacy := pmin(intake_afe / requirement, 1)]
  adequacy_a <- data_long[, .(A = mean(adequacy)), by = household_id]

  # Cereal adjustment (gram-based), keyed by household so rows stay aligned
  cereal_adj <- data_long[food == "cereals",
                          .(household_id,
                            cereal_penalty = exp(-3 * pmax(share - cereal_req_share, 0)))]

  # Combine by household_id rather than by row position
  result <- merge(shannon_h, adequacy_a, by = "household_id")
  result <- merge(result, cereal_adj, by = "household_id")
  result[, H_adj := H * A * cereal_penalty]

  return(result)
}

8.2 Model Estimation

The GAM is estimated using mgcv::bam() for efficiency:

library(dplyr)  # provides %>% and mutate()

# Normalize and logit-transform the Shannon index
K <- length(food_vars)  # number of food groups (9)
eps <- 1e-6

data <- data1 %>%
  mutate(
    Hnorm   = shannon_req_A / log(K),
    Hnorm01 = pmin(pmax(Hnorm, eps), 1 - eps),
    z       = qlogis(Hnorm01)
  )

# GAM formula
quant <- z ~ 1 +
  s(log_mpce_real_afe) +
  s(log_mpce_real_afe, state_code, bs = "fs", m = 1) +
  s(log_mpce_real_afe, seasonal, bs = "fs", m = 1) +
  s(log_mpce_real_afe, child, bs = "fs", m = 1) +
  s(log_mpce_real_afe, female_headed, bs = "fs", m = 1) +
  s(log_mpce_real_afe, social, bs = "fs", m = 1) +
  s(log_mpce_real_afe, rel, bs = "fs", m = 1) +
  s(sector, bs = "re") +
  s(nss_region, bs = "re") +
  s(reg_sector, bs = "re") +
  s(social, bs = "re") +
  s(rel, bs = "re") +
  s(child, bs = "re") +
  s(female_headed, bs = "re")

# Fit using bam() for large-dataset efficiency
model_shannon_food <- mgcv::bam(
  quant,
  data = data,
  weights = data$w_pc,
  method = "fREML",
  discrete = TRUE,
  gamma = 1.4,
  gc.level = 2,
  select = TRUE
)

# Estimate for both survey rounds and save
models <- list(
  HCES2011_model = analysis_shannon_food("HCES2011"),
  HCES2023_model = analysis_shannon_food("HCES2023")
)
save(models, file = file.path(folder, "Shannon_Diversity_food_model.RData"))

8.3 Prediction Data Generation

The prediction pipeline is handled by data_for_fig_shannon_food_AFE.R, which calls compute_grp_draws_unified() to generate posterior draws and svyby() for survey standard errors:

data_analysis_shannon_food <- function(
    n = "HCES2023",
    group_vars = c("nss", "rel"),
    n_sims = 1000,
    seed = 1234,
    jitter_eps = 1e-6
) {
  # Step 1-2: Load MPCE data and pre-fitted Shannon models
  obj <- models[[paste0(n, "_model")]]
  model_main <- obj$model_shannon_food
  model_adj  <- obj$model_shannon_food_cereal_adj
  K <- length(obj$food_vars$category_balanced_diet)

  # Step 3: Assign households to MPCE decile bins
  data <- obj$data %>%
    left_join(data_mpce_cutoff, by = group_vars) %>%
    mutate(bin = factor(findInterval(mpce_real_2011_afe, ...)))

  # Step 4: Build standardized prediction grid
  data_gr <- grid_f(n = n, group_vars = group_vars, season = "TRUE")
  nd <- data_gr[["nd"]]

  # Step 5: Draw coefficients from posterior (MVN)
  B_main <- draw_beta(model_main, n_sims, jitter_eps)
  B_adj  <- draw_beta(model_adj, n_sims, jitter_eps)

  # Step 6: Unified chunked computation (both models in single pass)
  grp_draws <- compute_grp_draws_unified(
    nd = nd,
    model_list = list(main = model_main, adj = model_adj),
    B_list = list(main = B_main, adj = B_adj),
    n_sims = n_sims,
    by_cols = c(group_vars, "bin", "log_mpce_real_afe", "grid_id"),
    K = K,
    compute_ratio = TRUE
  )

  # Step 7: Survey standard errors
  des <- svydesign(ids = ~psu, strata = ~strata, weights = ~wts,
                   data = data_svy, nest = TRUE)
  svy_results <- svyby(~shannon_req_A_hat, by = by_f,
                        design = des, FUN = svymean, vartype = "se")

  # Step 8: Inject sampling uncertainty
  add_noise <- function(draws, se) {
    if (is.na(se) || !is.finite(se) || se <= 0) return(draws)
    draws + rnorm(length(draws), mean = 0, sd = se)
  }
  out <- out %>%
    mutate(shannon_req_A_g_svy = Map(add_noise, shannon_req_A_g, shannon_se_svy))

  return(out_final)
}

9. Summary of Model Assumptions

The methodology rests on several key assumptions:

Limitations: The methodology assumes ICMR-NIN requirements are appropriate for all household types. Individual-level requirements (e.g., for pregnant women, children, or the elderly) are not modeled. The method also assumes measurement error in the survey's food quantity data is negligible.

References

[1] ICMR-NIN (2024). Recommended Dietary Allowances and Estimated Average Requirements for Indians. Indian Council of Medical Research – National Institute of Nutrition, Hyderabad.
[2] Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
[3] Wood, S.N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman and Hall/CRC.
[4] Cragg, J.G. (1971). Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica, 39(5), 829–844.
[5] Mullahy, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics, 17(3), 247–281.
[6] National Sample Survey Office (2013). Household Consumer Expenditure Survey 2011–12 (68th Round). Ministry of Statistics and Programme Implementation, Government of India.
[7] Ministry of Statistics and Programme Implementation (2024). Household Consumer Expenditure Survey 2023–24. Government of India.
[8] Gelman, A. (2006). Multilevel (hierarchical) modeling: What it can and can't do. Technometrics, 48(3), 432–435.
[9] Wood, S.N., Goude, Y. & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society: Series C, 64(1), 139–155.
[10] Li, R. & Shively, T.S. (2008). Variable selection in semiparametric regression modeling. Annals of Statistics, 36(1), 261–286.

Last updated: February 2026