Methodology Note

Modeling Excess Food Consumption in India

A Two-Stage Hurdle Approach Using Semi-Parametric Generalized Additive Models
Dr Shamika Ravi (Member, EAC to PM) & Dr Mudit Kapoor (CECFEE, EPU, ISI-Delhi Center)
February 2026

Data and Variable Definitions

Survey Data

The analysis uses unit-level records from two rounds of the HCES:

  • HCES 2011–12 (68th round of the National Sample Survey)
  • HCES 2023–24 (the most recent HCES round)

Exclusion criterion. Households reporting only cooking fuel expenditure (cooking code 12 for 2023–24 or code 10 for 2011–12) are excluded, as they do not represent regular food consumption patterns.


Outcome Variables

For each food item \(m\), three household-level outcomes are constructed:

  • Per-AFE consumption: \(q_i = \text{Quantity}_{i}^{\,\text{HH}} / \text{HH Size (AFE)}_i\) — total household consumption (grams/day) divided by the number of adult female equivalents.

  • Excess quantity:

\[\text{Excess}_i = \max(q_i - \text{Requirement}_m,\; 0)\]

  • Excess indicator: \(\text{Excess}_i = \mathbf{1}[q_i > \text{Requirement}_m]\) — whether the household consumes in excess.

Why Adult Female Equivalents?

The AFE scaling accounts for the fact that household members differ in their caloric and nutritional requirements. Using AFE rather than simple per capita measures avoids systematically overstating consumption adequacy in households with many children or elderly members.


Key Covariates

The models include a continuous income measure and multiple categorical covariates:

Variable Description Role in Model
log_mpce_real_afe Log real MPCE per AFE (2011–12 prices) Continuous — Engel curve driver
state_code Indian state (~36 levels) Factor-smooth interaction + random effect
sector Rural (1) or Urban (2) Random effect
nss_region NSS sub-state region (~80+ levels) Random effect
reg_sector Interaction: NSS region × sector Random effect
social Social group: SC (1), ST (2), OBC (3), Others (9), residual (0) Factor-smooth interaction + random effect
rel Religion: Hindu (1), Islam (2), Christian (3), residual (0) Factor-smooth interaction + random effect
child Presence of children in household Factor-smooth interaction + random effect
female_headed Female-headed household indicator Factor-smooth interaction + random effect

Survey Weights

Weights are rescaled by household AFE size and then normalized:

\[w_i^{\,\text{pc}} = \frac{w_i \times \text{HH Size (AFE)}_i}{\,\overline{w \times \text{HH Size (AFE)}}\,}\]

This ensures that each household’s weight reflects the number of adult female equivalents it represents.


The Hurdle Model for Excess Consumption

Why a Hurdle Model?

Many households do not consume any given food item in excess of its requirement. For example, a large share of households consume fats & oils at or below 25 grams/day. A single regression model would conflate the probability of exceeding the requirement with the magnitude of excess. The hurdle model separates these:

Household decision process:
   Step 1: Does the household consume item m in excess of the requirement?
           → Modeled by logit (participation model)
   
   Step 2: If yes, how much excess?
           → Modeled by Gamma or Log-Normal (conditional quantity model)

Part 1: Participation Model (Logit)

The probability of consuming food item \(m\) in excess of its daily requirement is modeled as:

\[\text{logit}\big(P(\text{Excess}_i = 1)\big) = \alpha_0 + f(\text{log}\_\text{mpce}_i) + \sum_k f_k(\text{log}\_\text{mpce}_i, \; g_{k,i}) + \sum_j u_j\]

The model has three types of terms:

1. Global Engel curve\(f(\text{log\_mpce})\): A smooth function capturing the overall relationship between income and the probability of excess consumption. Estimated as a thin-plate regression spline.

2. Factor-smooth interactions\(f_k(\text{log\_mpce}, \; g_k)\): These allow the Engel curve to vary by group. For example, \(f(\text{log\_mpce}, \text{state\_code})\) lets each state have its own income–excess consumption relationship, while shrinking toward the global curve. Specified as bs = "fs" with first-order penalty (m = 1).

3. Random intercepts\(u_j\): Gaussian random effects for sector, NSS region, region × sector interaction, social group, religion, child presence, and female headship. Specified as s(variable, bs = "re").


Part 2: Conditional Quantity Model

For households with positive excess (\(\text{Excess}_i > 0\)), the amount of excess is modeled with the same covariate structure as the logit model, but using a continuous response family.

Model Selection: Gamma vs. Log-Normal

Two candidate models are estimated for each food item:

Model Response Family Link
Gamma \(\text{Excess}_i\) (raw grams) \(\text{Gamma}\) \(\log\)
Log-Normal \(\log(\text{Excess}_i)\) \(\text{Gaussian}\) identity

Selection criterion: The model with the lower AIC is retained. This data-driven choice accommodates the fact that different food items have different distributional properties — cereals (commonly consumed in large excess) may suit a different family than fats & oils (typically consumed in smaller excess amounts).

Estimation details (common to both candidates):

Setting Value Rationale
Method Fast REML (fREML) Smoothing parameter selection
Discrete method Enabled Computational efficiency for large datasets
Gamma penalty \(\gamma = 1.4\) Guards against overfitting (Wood, 2017)
Selection select = TRUE Allows unnecessary random effects to shrink to zero
Weights Normalized survey weights \(w_i^{\text{pc}} = w_i / \bar{w}\)

Why Factor-Smooth Interactions?

The bs = "fs" specification is central to the model. It allows each level of a grouping factor (e.g., each state) to have its own smooth Engel curve, while penalizing deviations from the global curve:

\[f_k(x, g) = f(x) + f_g(x)\]

where \(f_g(x)\) is the group-specific deviation, subject to a first-order penalty that shrinks it toward zero.

This captures important economic realities:

  1. State-level variation: The relationship between income and cereal overconsumption is fundamentally different in rice-dominated states versus wheat-dominated states. A single national Engel curve would miss this.

  2. Demographic heterogeneity: Households in different social groups may have distinct expenditure gradients for excess consumption, reflecting cultural food norms and dietary preferences.

  3. Automatic shrinkage: States or groups with few observations are shrunk toward the national pattern, preventing noisy estimates.


Parallel Model Estimation

Models are fitted for each combination of survey round (HCES 2011, HCES 2023) and food item (cereals, fats & oils), yielding 4 model pairs (binary + quantity). These are estimated in parallel:

Architecture:
   → doParallel/foreach with auto-detected workers
   → Each worker: fits logit + quantity models for one item × round
   → Progress logging and automatic error handling
   → Models saved as combined .RData files

Directory structure:

~/data/bam_models/excess_food/
   ├── cereal_millets/
   │   └── cereal_millets_model.RData
   ├── fats_oils/
   │   └── fats_oils_model.RData
   └── all_excess_food_models.RData

Each model .RData file contains a list with:

Object Contents
HCES2011_model Models and data for the 2011–12 round
HCES2023_model Models and data for the 2023–24 round
model_l Logit (excess participation) model
model_q Quantity model (Gamma or Log-Normal, AIC-selected)
model_info String: “Gamma (log link)” or “log normal”
data Analysis dataset used for fitting

Interpreting the Components

Component What it captures Range
\(H\) (Shannon) Evenness of consumption across food groups \([0, \log K]\)
\(A\) (Adequacy) Whether each food group meets its requirement \([0, 1]\)
\(C\) (Cereal penalty) Penalizes over-dependence on cereals \((0, 1]\)
\(H_{\text{adj}}\) Overall dietary quality \([0, \log K]\)

Example: A household that eats only rice (high quantity, one food group) would have: \(H \approx 0\) (no diversity), \(A\) low (most groups unmet), \(C < 1\) (cereal-heavy). The adjusted index \(H_{\text{adj}}\) would be very low despite adequate caloric intake.


Prediction Grid Construction

Before generating predictions, the analysis constructs standardized prediction grids that control for confounding when comparing groups. This is a critical step: raw comparisons between, say, rural and urban excess consumption would conflate the true sector effect with differences in the geographic and demographic composition of each sector.

The Standardization Problem

Consider comparing cereal overconsumption between rural and urban India. Rural households are disproportionately located in rice-dominant states and have different demographic profiles (more children, different social group composition). A naive rural–urban comparison would mix three effects: the actual sector effect on excess consumption, geographic composition differences, and demographic composition differences.

The prediction grid solves this by constructing a counterfactual population where all non-focal variables are held at the same national distribution.


Grid Construction Logic

The grid is constructed differently depending on the type of comparison. In all cases, the approach follows the same principle: hold the focal variable at its actual values, and standardize everything else to a common national distribution.

Grid Construction by Comparison Type:

STATE comparison [c("nss", "state_code")]
   → Keep: actual geographic structure (regions, rural/urban) within each state
   → Standardize: demographics (social, religion, child, female_headed)
   → Result: isolates state-level excess consumption differences net of demographics

SECTOR comparison [c("nss", "sector")]
   → Standardize: state-region distribution across rural/urban
   → Standardize: demographics nationally
   → Result: isolates PURE rural-urban effect

NSS REGION comparison [c("nss", "nss_region")]
   → Standardize: sector distribution across regions
   → Standardize: demographics nationally
   → Result: isolates regional effects net of urban/rural mix and demographics

DEMOGRAPHIC comparison [c("nss", <focal_var>)]
   → Standardize: full geography (state, region, sector)
   → Standardize: other demographics nationally
   → Result: isolates effect of focal demographic variable

Geographic standardization. When comparing sectors or regions, the grid standardizes the geographic distribution so that, for example, rural and urban get the same state–region composition. This isolates the “pure” sector effect from geographic confounding. Without standardization, sector differences would partly reflect the fact that rural populations are concentrated in poorer states.


Grid Construction Steps (Detailed)

For each comparison type, the grid is built through the following process:

Step 1: Define the core structure. The focal variable × expenditure bin combinations define the “backbone” of the grid. For each cell, the mean MPCE within the bin provides the income value at which predictions are evaluated.

Step 2: Compute geographic weights. For state comparisons, each state keeps its actual nss_region × sector structure. For demographic comparisons, a national geographic distribution is computed:

\[w_{\text{geo}}(s, r, u) = \frac{\sum_{i \in (s,r,u)} w_i}{\sum_i w_i}\]

where \(s\) is state, \(r\) is NSS region, and \(u\) is sector.

Step 3: Compute demographic weights. The national demographic distribution is computed marginally:

\[w_{\text{demo}}(\mathbf{d}) = \frac{\sum_{i: \mathbf{d}_i = \mathbf{d}} w_i}{\sum_i w_i}\]

where \(\mathbf{d}\) is the vector of non-focal demographic variables (social group, religion, child presence, female headship).

Step 4: Cross and combine. The geographic and demographic grids are crossed (Cartesian product) and their weights multiplied:

\[w_{\text{grid}} = w_{\text{geo}} \times w_{\text{demo}}\]

This produces the full prediction grid: every combination of the focal variable, expenditure bin, geographic cell, and demographic cell, with a weight reflecting its representation in the standardized population.

Step 5: Assign grid IDs. Each unique combination of the grouping variable × expenditure bin receives a grid_id, used to aggregate predictions.

Step 6: Build “Overall” grid. A separate grid is constructed using the overall (population-weighted) mean MPCE for each group, rather than decile-specific means. This provides the marginal group-level summary.


Expenditure Binning

Households are assigned to expenditure deciles using group-specific cutpoints from the MPCE distribution:

Bin Label Definition
1 0–10 Below 10th percentile of group-specific MPCE distribution
2 10–20 10th to 20th percentile
10 90–100 Above 90th percentile
Overall Overall Full population (weighted mean MPCE)

The cutpoints (\(q_{10}, q_{20}, \ldots, q_{90}\)) are computed within each group (e.g., separately for rural and urban) to ensure that each bin contains approximately 10% of the group’s weighted population. This means the same income level (say ₹2,000/month) may fall in different deciles for different groups.


Posterior Simulation and Uncertainty Propagation

The prediction pipeline propagates two sources of uncertainty through to the final estimates: model parameter uncertainty (from the GAM coefficients) and survey sampling uncertainty (from the complex survey design).

Overview of the Prediction Pipeline

Step 1: Load pre-fitted models (logit + quantity)
Step 2: Construct standardized prediction grids (from previous section)
Step 3: Draw coefficient vectors from posterior (MVN)
Step 4: Draw dispersion parameters (chi-squared)
Step 5: Compute household-level predictions (chunked for memory)
Step 6: Aggregate to group × decile level (weighted)
Step 7: Compute survey standard errors (svydesign)
Step 8: Inject sampling uncertainty into model draws
Step 9: Summarize (mean, median, 95% credible interval)

Step 1: Covariance Matrix Validation

Before drawing coefficients, the posterior covariance matrix \(\mathbf{V}_p\) must be positive definite. Numerical issues in large GAM fits can occasionally produce matrices with near-zero or negative eigenvalues. Two repair strategies are available:

Method Approach When used
nearPD Finds nearest positive definite matrix (Frobenius norm) Default; more accurate
jitter Adds \(\varepsilon \cdot \mathbf{I}\) to diagonal Fallback; faster

The validation checks all eigenvalues: \(\lambda_{\min}(\mathbf{V}_p) > 10^{-10}\).


Step 2: Draw Coefficient Vectors

Let \(\hat{\boldsymbol{\beta}}\) denote the estimated coefficients and \(\mathbf{V}_p\) the Bayesian posterior covariance matrix1. For each simulation \(m = 1, \ldots, M\):

\[\boldsymbol{\beta}^{(m)} \sim \mathcal{N}\!\left(\hat{\boldsymbol{\beta}},\; \mathbf{V}_p\right)\]

drawn via MASS::mvrnorm(). Separate draws are made for the logit model (\(\boldsymbol{\alpha}^{(m)}\)) and the quantity model (\(\boldsymbol{\beta}^{(m)}\)).


Step 3: Draw Dispersion Parameters

For the Log-Normal model, the residual variance is drawn from its posterior:

\[\sigma^{(m)} = \sqrt{\frac{\hat{\sigma}^2 \cdot \nu}{\chi^2_\nu}}\]

where \(\nu\) is the residual degrees of freedom and \(\chi^2_\nu\) is a chi-squared draw with \(\nu\) degrees of freedom. A bias-correction term is precomputed: \(\delta^{(m)} = \tfrac{1}{2}(\sigma^{(m)})^2\).

For the Gamma model, the dispersion parameter \(\phi\) (inverse of shape) is drawn analogously:

\[\phi^{(m)} = \frac{\hat{\phi} \cdot \nu}{\chi^2_\nu}\]

yielding shape \(= 1 / \phi^{(m)}\) for each draw.


Step 4: Compute Household-Level Predictions (Chunked)

Predictions are computed in memory-efficient blocks (default: 2,000 rows per block) to handle the large prediction grids. For each block of rows in the grid:

4a. Linear predictors via the lpmatrix:

\[\boldsymbol{\eta}_L = \mathbf{X}_L \cdot \mathbf{B}_L^\top \qquad \boldsymbol{\eta}_Q = \mathbf{X}_Q \cdot \mathbf{B}_Q^\top\]

where \(\mathbf{X}_L\) and \(\mathbf{X}_Q\) are the design matrices from predict(..., type = "lpmatrix"), and \(\mathbf{B}_L\), \(\mathbf{B}_Q\) are the \(M \times p\) matrices of coefficient draws. The matrix cross-product tcrossprod() yields an \(n_{\text{block}} \times M\) matrix of predictions in a single operation.

4b. Transform to response scale:

Probability: \(\hat{p}_i^{(m)} = \text{logit}^{-1}(\eta_{L,i}^{(m)})\)

Conditional mean (log-normal): \(\hat{\mu}_i^{(m)} = \exp(\eta_{Q,i}^{(m)} + \delta^{(m)})\)

Conditional mean (Gamma): \(\hat{\mu}_i^{(m)} = \exp(\eta_{Q,i}^{(m)})\)

4c. Unconditional excess quantity:

\[\widehat{EY}_i^{(m)} = \hat{p}_i^{(m)} \times \hat{\mu}_i^{(m)}\]

4d. Posterior predictive draws (full distributional draws, not just means):

The unconditional posterior predictive combines a Bernoulli draw with the conditional draw:

\[Z_i^{(m)} \sim \text{Bernoulli}(\hat{p}_i^{(m)}), \qquad Y_{\text{uncond},i}^{(m)} = Z_i^{(m)} \times Y_{\text{pos},i}^{(m)}\]

where \(Y_{\text{pos},i}^{(m)}\) is drawn from the appropriate conditional distribution (log-normal or Gamma) using the draw-specific parameters.


Step 5: Weighted Aggregation to Group × Decile Level

Within each block, predictions are accumulated into group-level weighted sums. For each group \(g\) (defined by the grouping variable × expenditure bin):

\[\bar{p}_{g}^{(m)} = \sum_{i \in g} \tilde{w}_i \cdot \hat{p}_i^{(m)}\]

where \(\tilde{w}_i = w_i / \sum_{j \in g} w_j\) are the normalized within-group weights. The same weighted averaging is applied to \(\hat{\mu}_i^{(m)}\), \(\widehat{EY}_i^{(m)}\), and \(Y_{\text{uncond},i}^{(m)}\).

The result is a set of \(M\) draws for each group × bin combination, stored as list-columns:

Draw variable Description
p_pos_g Group-level probability of excess draws (M values)
mu_pos_g Group-level conditional excess mean draws
EY_uncond_g Group-level unconditional excess mean draws (\(p \times \mu\))
Yuncond_ppc_g Group-level unconditional posterior predictive draws

Step 6: Survey Standard Errors

To quantify sampling uncertainty, plug-in predictions (\(\hat{p}_i\), \(\hat{\mu}_i\), \(\widehat{EY}_i\) at coefficient point estimates) are computed for every household in the microdata. These are then summarized using the survey package with the full complex survey design (PSU clustering, stratification, survey weights):

\[\text{SE}(\bar{p}_g) = \sqrt{\text{Var}_{\text{svy}}\!\left(\frac{\sum_{i \in g} w_i \hat{p}_i}{\sum_{i \in g} w_i}\right)}\]

computed via survey::svyby() with survey::svymean().

Lonely PSU adjustment. The option survey.lonely.psu = "adjust" is set to handle strata with a single PSU, ensuring that variance estimation does not fail.


Step 7: Injecting Sampling Uncertainty

The survey standard errors are injected into the model draws using scale-appropriate transformations. This combines both sources of uncertainty into a single set of draws.

For probability draws (logit scale):

The delta method gives \(\text{SE}(\text{logit}(p)) \approx \text{SE}(p) / [p(1-p)]\). Noise is added on the logit scale:

\[\tilde{p}^{(m)} = \text{logit}^{-1}\!\left[\text{logit}(\hat{p}^{(m)}) + \varepsilon^{(m)}\right], \qquad \varepsilon^{(m)} \sim \mathcal{N}\!\left(0,\; \left[\frac{\text{SE}(\bar{p})}{\bar{p}(1-\bar{p})}\right]^2\right)\]

This respects the \((0, 1)\) bounds on probabilities.

For quantity draws (log scale, mean-preserving):

Multiplicative log-normal noise is applied with a bias correction that preserves the expected value:

\[\tilde{q}^{(m)} = \hat{q}^{(m)} \times \exp\!\left(\varepsilon^{(m)} - \tfrac{1}{2}\sigma_{\log}^2\right), \qquad \varepsilon^{(m)} \sim \mathcal{N}(0, \sigma_{\log}^2)\]

where \(\sigma_{\log} = \text{SE}(\bar{q}) / \bar{q}\) (the coefficient of variation). The term \(-\tfrac{1}{2}\sigma_{\log}^2\) ensures:

\[E\!\left[\exp(\varepsilon - \tfrac{1}{2}\sigma_{\log}^2)\right] = 1\]

so the mean of the draws is preserved.

Why two uncertainty sources? Model uncertainty captures the imprecision in estimating smooth Engel curves and random effects. Sampling uncertainty captures the fact that we observe a sample, not the full population. For well-identified parameters (e.g., national-level cereal excess), model uncertainty dominates. For thin cells (e.g., tribal households in small northeastern states), sampling uncertainty can be substantial.


Step 8: Final Summary Statistics

The \(M\) survey-adjusted draws for each group × bin are summarized into:

Statistic Description
mean Posterior mean of unconditional excess consumption (grams/day per AFE)
median Posterior median
q2.5 2.5th percentile (lower bound of 95% credible interval)
q97.5 97.5th percentile (upper bound of 95% credible interval)
mean_p Posterior mean of probability of excess consumption
q2.5_p Lower bound of 95% CI for probability
q97.5_p Upper bound of 95% CI for probability

Output Variables

The prediction pipeline produces the following variables for each food item × grouping variable × group level × decile bin:

Variable Description
nss Survey round (HCES2011 or HCES2023)
var_n Grouping variable name (e.g., “sector”, “religion”)
group_n Group level name (e.g., “Rural”, “Hindu”)
item_n Food item category (cereal_millets or fats_oils)
bin MPCE decile bin (0–10, 10–20, …, 90–100, Overall)
log_mpce_real_afe Mean log real MPCE per AFE in the bin
mpce_real_afe_2011 Mean real MPCE per AFE (₹, 2011–12 prices)
requirement Daily requirement for the food item (grams per AFE)
mean Posterior mean of unconditional excess consumption (grams/day)
q2.5 2.5th percentile (lower bound of 95% credible interval)
q97.5 97.5th percentile (upper bound of 95% credible interval)
mean_p Posterior mean of probability of excess consumption
q2.5_p Lower bound of 95% CI for probability
q97.5_p Upper bound of 95% CI for probability

Visualization

Engel Curve Plots

The primary visualization is a pair of Engel curves — the relationship between household expenditure and excess food consumption — plotted for each food item across demographic groups:

Top panel: Proportion consuming in excess (%)
   → Y-axis: Estimated share of people consuming above the daily requirement
   → X-axis: Monthly per capita expenditure (₹, 2011-12 prices, log scale)
   → Faceted by group level within grouping variable

Bottom panel: Average excess burden (grams)
   → Y-axis: Estimated daily excess consumption per AFE
   → X-axis: Same as above
   → Annotation: % of daily requirement at Overall mark

Key visual elements:

  • Points and lines: Posterior means at each decile bin, connected by smoothed curves
  • Ribbons: 95% credible intervals (shaded bands)
  • Color: Survey round (2011–12 vs. 2023–24) — enables visual comparison of changes over time
  • Annotation: Mean values (overall average) marked with dashed crosshairs and labels showing the exact excess quantity and its percentage relative to the daily requirement

Cross-Sectional Plots (Single Survey Round)

For state-level comparisons within HCES 2023–24, an alternative visualization displays:

  • All states within a geographic region (Northern, Central, Eastern, Northeast, Western, Southern) on a single panel
  • Smoothed Engel curves with labeled mean values
  • Choice of excess quantity or proportion in excess on the y-axis

These are combined into multi-panel grids (3 × 2 for six regions) using the magick package for image composition, with an annotated title bar.


Suite of Analyses

The batch processing generates figures across all combinations of:

Dimension Levels
Food items Cereals & millets, Fats & oils
Grouping variables sector, social, religion, child, female_headed, state
Metrics proportion in excess, excess quantity
Survey rounds 2011–12 + 2023–24 (combined) or 2023–24 only

For state-level analyses, figures are further organized by geographic region.


Computational Implementation

Parallel Model Estimation

Each of the 2 food items requires two models (logit + quantity) for each of two survey rounds, yielding 8 model fits. These are estimated in parallel:

Architecture:
   → doParallel/foreach with 4 worker processes
   → Each worker: fits models for one food item × round
   → Models saved to item-specific directories

Prediction Data Generation Pipeline

The figure data generation is the most computationally intensive step. It is orchestrated by a batch processing function that runs across all item × grouping variable combinations in parallel.

Batch Processing Architecture:

   Input: 2 food items × 6 grouping variables = 12 jobs
          (each job = 2 survey rounds)

   For each job (item × grouping_var):
      1. Load pre-fitted models for the item
      2. Load MPCE decile cutpoints for the grouping variable
      3. Construct standardized prediction grid
      4. Draw M=1000 coefficient vectors (MVN) and dispersion parameters
      5. Compute predictions in blocks of 2000 rows
      6. Compute survey standard errors (svydesign)
      7. Inject sampling uncertainty
      8. Summarize to mean + 95% credible interval
      9. Save to item-specific .RData file

   Parallelization:
      → future_lapply() with auto-detected workers
      → Worker count: min(cores-1, memory/2GB per job, n_jobs)
      → Each job: ~2GB peak memory, ~3-5 min per item × group × round

System Resource Management

The pipeline includes adaptive resource management to prevent memory exhaustion:

Constraint Calculation Purpose
Core limit detectCores() - 1 Leave one core for system
Memory limit (total RAM - 4 GB) / 2 GB per job Prevent swapping
Job limit Number of remaining jobs Don’t oversubscribe
Optimal workers min(core, memory, job) Balance throughput and stability

Output File Organization

Each food item × grouping variable combination produces a prediction file:

~/data/bam_models/excess_food/
   ├── cereal_millets/
   │   ├── cereal_millets_model.RData
   │   ├── data_prop_quantity_cereal_millets_sector.RData
   │   ├── data_prop_quantity_cereal_millets_rel.RData
   │   ├── data_prop_quantity_cereal_millets_social.RData
   │   ├── data_prop_quantity_cereal_millets_child.RData
   │   ├── data_prop_quantity_cereal_millets_female_headed.RData
   │   └── data_prop_quantity_state_cereal_millets_sector.RData
   ├── fats_oils/
   │   └── ... (same structure)
   └── all_excess_food_models.RData

All results are also compiled into a single Excel workbook (data_prop_quantity_excess.xlsx) with two sheets: the full dataset and a variable descriptions sheet.


Summary of Model Assumptions

The key assumptions underlying this framework:

  1. Hurdle structure: The decision to consume in excess and the amount of excess are modeled as separate processes. This is appropriate because many households do not exceed the daily requirement for a given food item.

  2. Smooth Engel curves: The relationship between income and excess consumption is assumed to be a smooth (but potentially nonlinear) function. The factor-smooth specification allows this curve to vary by group while borrowing strength across groups.

  3. Distributional assumption for quantity: The conditional excess quantity follows either a Gamma or Log-Normal distribution (selected by AIC). Both accommodate positive, right-skewed data typical of excess consumption amounts.

  4. Gaussian random effects: All categorical covariates enter as Gaussian random effects, implying partial pooling (shrinkage) toward the grand mean.

  5. Posterior normality of coefficients: Coefficients are drawn from a multivariate normal centered at the penalized MLE — the standard Laplace approximation, accurate for large samples.

  6. Additive separability of the Shannon index components: The requirement-adjusted index \(H_{\text{adj}} = H \times A \times C\) assumes that diversity, adequacy, and cereal dependence contribute multiplicatively.


References

  1. Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman & Hall/CRC.

  2. Indian Council of Medical Research - National Institute of Nutrition (2020, updated 2024). Nutrient Requirements for Indians: A Report of the Expert Group. ICMR-NIN, Hyderabad. [Page 36]

  3. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.

  4. National Sample Survey Office (2013). Household Consumer Expenditure Survey 2011-12. Ministry of Statistics and Programme Implementation, Government of India.


Appendix: Glossary of Terms

AFE (Adult Female Equivalent): A standardization method to account for different caloric needs based on age and gender. The reference unit is an adult woman weighing 55kg.

AIC (Akaike Information Criterion): A measure of model fit that penalizes complexity. Lower AIC indicates a better trade-off between fit and parsimony. Used here to select between Gamma and Log-Normal quantity models.

bam(): The mgcv package function for fitting generalized additive models to large datasets, using discretized covariates and fast REML for efficiency.

Credible Interval: The Bayesian analogue of a confidence interval. A 95% credible interval contains the true parameter value with 95% posterior probability, given the model and data.

Coefficient of Variation (CV): The ratio of the standard error to the mean, \(\text{CV} = \text{SE}/\bar{x}\). Used here to convert survey standard errors to log-scale noise for the mean-preserving multiplicative uncertainty injection.

Decile: Division of population into 10 equal groups based on expenditure levels (Decile 1 = poorest 10%).

Delta Method: A statistical technique for approximating the variance of a transformed variable. If \(g(\hat{\theta})\) is a function of an estimator, then \(\text{Var}(g(\hat{\theta})) \approx [g'(\hat{\theta})]^2 \cdot \text{Var}(\hat{\theta})\). Used here to convert standard errors from the probability scale to the logit scale.

Engel Curve: The relationship between household income (or expenditure) and consumption of a particular good. Named after the 19th-century statistician Ernst Engel.

Excess Consumption: The amount by which a household’s per-AFE daily intake exceeds the ICMR-NIN recommended daily requirement. Defined as \(\max(\text{consumption} - \text{requirement}, 0)\).

Factor-Smooth Interaction (bs = "fs"): A penalized smooth term in mgcv that allows each level of a factor to have its own smooth curve, with deviations from the global curve penalized. Achieves group-specific nonlinear effects with automatic shrinkage.

fREML (Fast REML): Fast restricted maximum likelihood — the method used for smoothing parameter estimation in mgcv::bam().

GAM (Generalized Additive Model): A flexible regression framework that allows non-linear relationships and random effects through penalized smooth terms.

Geographic Standardization: A technique for making group comparisons fair by holding the geographic distribution (state, region, sector) constant across groups. Without standardization, rural–urban differences would partly reflect the geographic concentration of rural populations in poorer states.

HCES: Household Consumer Expenditure Survey conducted by the National Sample Survey Office.

Hurdle Model: A two-part model that separately models (a) whether an event occurs (here: whether excess consumption occurs) and (b) the magnitude conditional on occurrence. Appropriate for data with excess zeros.

ICMR-NIN: Indian Council of Medical Research - National Institute of Nutrition. The body that establishes dietary guidelines for India.

lpmatrix (Linear Predictor Matrix): The design matrix \(\mathbf{X}\) such that \(\mathbf{X}\boldsymbol{\beta}\) gives the linear predictor. Obtained from predict(..., type = "lpmatrix") in mgcv. Essential for posterior simulation because it allows vectorized computation of predictions across all coefficient draws simultaneously via matrix multiplication.

Mean-Preserving Noise: A multiplicative noise injection technique where \(\tilde{q} = q \cdot \exp(\varepsilon - \tfrac{1}{2}\sigma^2)\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\). The bias correction term \(-\tfrac{1}{2}\sigma^2\) ensures \(E[\tilde{q}] = E[q]\).

MPCE: Monthly Per Capita Expenditure — total household expenditure divided by household size (or by AFE units).

Posterior Predictive Draw: A simulated observation from the model’s predictive distribution, incorporating both parameter uncertainty (from coefficient draws) and observation-level randomness (from the assumed distribution).

Shannon Diversity Index: An information-theoretic measure of diversity, originally from ecology. Higher values indicate more even distribution across categories. Maximum value is \(\log K\) where \(K\) is the number of categories.

Survey Weights: Statistical weights assigned to each sampled household to make the sample representative of the population.


Last updated: February 2026


  1. The Bayesian posterior covariance from the GAM, stored as model$Vp.↩︎