Modeling Food Consumption Patterns in India
1. Introduction
This document explains how we model and predict food consumption patterns at the household level — benchmarked against the ICMR-NIN Recommended Dietary Allowances[1] for a reference adult woman (55 kg, moderately active, non-lactating) — using data from the Household Consumer Expenditure Survey (HCES) 2011–12 & 2023–24. The framework uses semi-parametric Generalized Additive Models (GAMs)[2] within a two-stage hurdle structure[3][4] to cover consumption of individual food categories as well as an aggregate measure of dietary diversity.
1.1 What Do We Estimate?
The analysis produces three families of estimates for each of the ten food categories in the ICMR-NIN recommended diet:
Probability of consumption: What share of the population consumes a given food item, and how does this vary with income, geography, and demographics?
Conditional quantity consumed: Among those who consume the item, how many grams per day per Adult Female Equivalent (AFE) do they consume?
Unconditional quantity consumed: Combining the two above — the expected consumption (including zero consumers) across expenditure deciles and demographic groups.
In addition, a Requirement-Adjusted Shannon Diversity Index summarizes overall dietary quality at the household level.
1.2 Food Categories Analyzed
Based on the ICMR-NIN recommended balanced diet[1] for adult woman (55 kg, moderately active, non-lactating), the analysis covers ten food categories:
| Food Category | Daily Requirement (grams) |
|---|---|
| Cereals & Millets | 280 |
| Green leafy vegetables | 100 |
| Other vegetables | 200 |
| Roots & Tubers (excl. potatoes) | 100 |
| Fruits | 100 |
| Milk & Milk products | 300 |
| Fats & Oils | 25 |
| Oilseeds & Nuts | 40 |
| Pulses & Beans | 95 (combined with Flesh Foods) |
| Flesh foods | 95 (combined with Pulses & Beans) |
Note on Pulses & Flesh foods. The ICMR-NIN[1] guidelines specify a combined requirement of 95 grams for “Pulses, Beans & Flesh foods.” In the analysis, Pulses & Beans and Flesh foods are modeled separately to capture their distinct consumption patterns, but share the combined requirement when computing dietary adequacy.
1.3 Overview of Analytical Pipeline
Stage 1: Per-Item Hurdle Models
→ For each food category × survey round:
(a) Fit logit model for P(consumption > 0)
(b) Fit conditional quantity model (Gamma or Log-Normal, AIC-selected)
Stage 2: Shannon Diversity Model
→ Compute requirement-adjusted diversity index per household
→ Model the logit-transformed index as a function of income and demographics
Stage 3: Posterior Simulation and Prediction
→ Draw from parameter uncertainty
→ Predict proportion and quantity by group × expenditure decile
→ Summarize with credible intervals
Stage 4: Visualization
→ Engel curves for proportion and quantity
→ Faceted by demographic group, across survey rounds
2. Data and Variable Definitions
2.1 Survey Data
The analysis uses unit-level records from two rounds of the HCES:
- HCES 2011–12[5] (68th round of the National Sample Survey)
- HCES 2023–24[6] (the most recent HCES round)
Exclusion criterion. Households reporting only cooking fuel expenditure (identified by survey-specific cooking codes) are excluded, as they do not represent regular food consumption patterns.
2.2 Outcome Variables
For each food category \(m\), two household-level outcomes are constructed:
Consumption indicator: \(\text{cons}_i = \mathbf{1}[\text{quantity}_i > 0]\) — whether household \(i\) consumed any quantity of food item \(m\).
Quantity per AFE:
Total household consumption of item \(m\) (grams/day) divided by the number of adult female equivalents.
Why Adult Female Equivalents?
The AFE scaling accounts for the fact that household members differ in their caloric and nutritional requirements. Using AFE rather than simple per capita measures avoids systematically overstating consumption adequacy in households with many children or elderly members.
The energy requirements and corresponding AFE scale factors used in this analysis are shown below. The reference category is the adult woman (moderate activity), whose requirement of 2,130 kcal/day defines one AFE unit.
| Demographic Profile | Activity Level | Energy Requirement (kcal/day) | AFE Scale Factor |
|---|---|---|---|
| Child 0 to 12 mo | — | 595 | 0.28 |
| Child 1 to 3 yrs | — | 1,110 | 0.52 |
| Child 4 to 6 yrs | — | 1,360 | 0.64 |
| Child 7 to 9 yrs | — | 1,700 | 0.80 |
| Girls 10 to 12 yrs | — | 2,060 | 0.97 |
| Boys 10 to 12 yrs | — | 2,220 | 1.04 |
| Girls 13 to 15 yrs | — | 2,400 | 1.13 |
| Boys 13 to 15 yrs | — | 2,860 | 1.34 |
| Girls 16 to 18 yrs | — | 2,500 | 1.17 |
| Boys 16 to 18 yrs | — | 3,320 | 1.56 |
| Adult women (reference) | Moderate | 2,130 | 1.00 |
| Adult women (lactating) | Moderate | 2,690 | 1.26 |
| Adult men | Moderate | 2,710 | 1.27 |
For example, a household comprising one adult man, one adult woman, and one child aged 4–6 would have an AFE household size of 1.27 + 1.00 + 0.64 = 2.91, rather than a simple headcount of 3. Dividing total household food consumption by this AFE size yields a per-AFE measure that is comparable across households of different demographic compositions.
Implementation Details
The AFE scale is constructed at the individual level from the HCES person-level roster, which records each household member's age and gender. The assignment proceeds in three steps:
Step 1 — Lactation status imputation. The HCES does not directly record lactation status. We proxy it using the presence of a child under 12 months in the household. If at least one child aged <1 is present, the youngest woman aged 19–49 in the household is classified as lactating and assigned the higher energy requirement of 2,690 kcal/day (= 2,130 base + 560 lactation increment, where 560 is the average of the ICMR-NIN increments of 600 kcal for 0–6 months and 520 kcal for 6–12 months postpartum). If there are n children under 1, then up to n women (ordered youngest-first) are assigned lactating status. Remaining adult women, as well as all women aged 50 and above, receive the non-pregnant, non-lactating requirement of 2,130 kcal/day.
Step 2 — Energy requirement assignment. Each household member is mapped to an age-sex-specific energy requirement from the table above using the following rules: children are classified solely by age (the infant category uses the average of the ICMR 0–6 month and 6–12 month values); adolescents aged 10–18 are further stratified by gender; and all adults are assigned the moderate activity level. Adult men (and transgender individuals coded as gender = 3) receive the adult male requirement of 2,710 kcal/day regardless of age.
Step 3 — AFE conversion and household aggregation. Each member's AFE scale factor is computed as AFEk = Ereqk / 2,130. The household AFE size is then the sum across all members: HH Size(AFE)i = ∑k AFEk. An individual's within-household energy share is sharek = AFEk / HH Size(AFE)i, which is used in subsequent food quantity allocation steps.
The R implementation is shown below:
HCES2023_AFE_energy <- HCES2023_level02 %>%
dplyr::select(hhid, person_sno, gender, age) %>%
arrange(hhid, gender, age) %>%
group_by(hhid) %>%
mutate(under_1_a = ifelse(age < 1, 1, 0)) %>%
ungroup() %>%
group_by(hhid) %>%
mutate(
child_under_1 = max(under_1_a),
n_child_under_1 = sum(under_1_a)
) %>%
ungroup() %>%
mutate(
women_18_50 = dplyr::case_when(
(age >= 19 & age < 50) & gender == "2" ~ 1,
.default = 0
)
) %>%
group_by(hhid, women_18_50) %>%
mutate(seq = row_number()) %>%
ungroup() %>%
mutate(
energy_requirement = dplyr::case_when(
age < 1 ~ 595,
age >= 1 & age <= 3 ~ 1110,
age >= 4 & age <= 6 ~ 1360,
age >= 7 & age <= 9 ~ 1700,
(age >= 10 & age <= 12) & (gender %in% c("1","3")) ~ 2220,
(age >= 10 & age <= 12) & (gender == "2") ~ 2060,
(age >= 13 & age <= 15) & (gender %in% c("1","3")) ~ 2860,
(age >= 13 & age <= 15) & (gender == "2") ~ 2400,
(age >= 16 & age <= 18) & (gender %in% c("1","3")) ~ 3320,
(age >= 16 & age <= 18) & (gender == "2") ~ 2500,
(age >= 19) & (gender %in% c("1","3")) ~ 2710,
(age >= 19) & (gender == "2") & (child_under_1 == 0) ~ 2130,
(age >= 19 & age < 50) & (gender == "2") &
(child_under_1 == 1) & (seq <= n_child_under_1) ~ 2690,
(age >= 19 & age < 50) & (gender == "2") &
(child_under_1 == 1) & (seq > n_child_under_1) ~ 2130,
(age >= 50) & (gender == "2") ~ 2130
)
) %>%
mutate(
AFE_energy = energy_requirement / 2130,
share_energy = AFE_energy / sum(AFE_energy),
.by = hhid
)
2.3 Key Covariates
The models include a continuous income measure and multiple categorical covariates:
| Variable | Description | Role in Model |
|---|---|---|
log_mpce_real_afe |
Log real MPCE per AFE (2011–12 prices) | Continuous — Engel curve driver |
state_code |
Indian state (~36 levels) | Factor-smooth interaction + random effect |
sector |
Rural (1) or Urban (2) | Random effect |
nss_region |
NSS sub-state region (~80+ levels) | Random effect |
reg_sector |
Interaction: NSS region × sector | Random effect |
social |
Social group: ST (1), SC (2), OBC (3), Others (9), residual (0) | Factor-smooth interaction + random effect |
rel |
Religion: Hindu (1), Islam (2), Christian (3), residual (0) | Factor-smooth interaction + random effect |
child |
Presence of children in household | Factor-smooth interaction + random effect |
female_headed |
Female-headed household indicator | Factor-smooth interaction + random effect |
seasonal |
Seasonal/temporary employment indicator | Factor-smooth interaction + random effect (select items) |
2.4 Survey Weights
Weights are rescaled by household AFE size and then normalized:
This ensures that each household’s weight reflects the number of adult female equivalents it represents.
3. The Hurdle Model for Food Consumption
3.1 Why a Hurdle Model?
Many food items are not consumed by all households. For example, flesh foods may be consumed by only 30–40% of households in certain states. A single regression model would conflate the decision to consume with the quantity decision. The hurdle model[3][4] separates these:
Household decision process:
Step 1: Does the household consume item m at all?
→ Modeled by logit (participation model)
Step 2: If yes, how much?
→ Modeled by Gamma or Log-Normal (conditional quantity model)
3.2 Part 1: Participation Model (Logit)
The probability of consuming food item \(m\) is modeled as:
The model has three types of terms:
1. Global Engel curve — \(f(\text{log}\_\text{mpce})\): A smooth function capturing the overall relationship between income and the probability of consuming the item. Estimated as a thin-plate regression spline.
2. Factor-smooth interactions — \(f_k(\text{log}\_\text{mpce}, \; g_k)\): These
allow the Engel curve to vary by group.[7] For example,
\(f(\text{log}\_\text{mpce},
\text{state}\_\text{code})\) lets each state have its own
income–consumption relationship, while shrinking toward the global
curve. Specified as bs = "fs" with first-order penalty
(m = 1).
3. Random intercepts — \(u_j\): Gaussian random effects for sector,
NSS region, region × sector interaction, social group, religion, child
presence, and female headship. Specified as
s(variable, bs = "re").
Item-specific model variation. For food items with
strong seasonal consumption patterns (fruits, other vegetables, green
leafy vegetables, roots & tubers), the model additionally includes a
factor-smooth interaction with seasonal, i.e., \(f(\text{log}\_\text{mpce}, \text{seasonal})\). For
other items, the seasonal effect enters only as a random intercept.
3.3 Part 2: Conditional Quantity Model
For households with positive consumption (\(q_i > 0\)), the quantity is modeled with the same covariate structure as the logit model, but using a continuous response family.
3.3.1 Model Selection: Gamma vs. Log-Normal
Two candidate models are estimated for each food item:
| Model | Response | Family | Link |
|---|---|---|---|
| Gamma | \(q_i\) (raw grams) | \(\text{Gamma}\) | \(\log\) |
| Log-Normal | \(\log(q_i)\) | \(\text{Gaussian}\) | identity |
Selection criterion: The model with the lower AIC is retained, after applying a Jacobian correction to ensure comparability across families. Because the lognormal is parameterised on the log-scale while the Gamma is parameterised on the original scale, their raw log-likelihoods are not directly comparable; the Jacobian of the log-transformation (−Σ log yᵢ) is added to the lognormal log-likelihood before computing its AIC. This corrected, data-driven choice accommodates the fact that different food items have different distributional properties — cereals (consumed daily in large amounts) may suit a different family than oilseeds & nuts (consumed irregularly in small amounts).
Estimation details (common to both candidates):
| Setting | Value | Rationale |
|---|---|---|
| Method | Fast REML (fREML)[8] |
Smoothing parameter selection |
| Discrete method | Enabled | Computational efficiency for large datasets |
| Gamma penalty | \(\gamma = 1.4\) | Guards against overfitting[2] |
| Selection | select = TRUE (logit only)[9] |
Allows unnecessary random effects to shrink to zero |
| Weights | Normalized survey weights | \(w_i^{\text{pc}} = w_i / \bar{w}\) |
3.4 Why Factor-Smooth Interactions?
The bs = "fs" specification is central to the model. It
allows each level of a grouping factor (e.g., each state) to have its
own smooth Engel curve, while penalizing deviations
from the global curve:
where \(f_g(x)\) is the group-specific deviation, subject to a first-order penalty that shrinks it toward zero.
This captures important economic realities:
State-level variation: The relationship between income and milk consumption is fundamentally different in Punjab (strong dairy culture) versus Meghalaya (limited dairy tradition). A single national Engel curve would miss this.
Demographic heterogeneity: Households with children may have a steeper income gradient for fruits and vegetables (as income rises, parents invest more in child nutrition) compared to child-free households.
Automatic shrinkage: States or groups with few observations are shrunk toward the national pattern, preventing noisy estimates.
4. Requirement-Adjusted Shannon Diversity Index
4.1 Motivation
Individual food item models tell us what people eat, but not how balanced their overall diet is. The Shannon diversity index[10] provides a single summary of dietary quality that captures both the variety and the adequacy of food intake.
4.2 Construction (Step by Step)
The index is constructed at the household level through the following steps:
Step 1: Per-AFE intake
→ Divide household-level food quantities by AFE household size
→ Result: grams/day per AFE for each of 9 food groups
Step 2: Cap at requirement
→ a_i = min(intake_i, requirement_i)
→ Eating more than required doesn't inflate "diversity"
Step 3: Compute shares
→ p_i = a_i / Σ(a_i)
→ Shares of capped intake across food groups
Step 4: Shannon entropy
→ H = -Σ(p_i × log(p_i))
→ Higher H = more evenly distributed consumption
Step 5: Adequacy adjustment (A)
→ For each food group: min(intake_i / requirement_i, 1)
→ A = mean across food groups
→ Penalizes diets that are diverse but insufficient
Step 6: Cereal adjustment
→ cereal_adj = max(p_cereal - requirement_share_cereal, 0)
→ Cereal penalty = exp(-3 × cereal_adj)
→ Penalizes diets over-reliant on cereals
Step 7: Final index
→ H_adj = H × A × Cereal_penalty
4.2.1 Mathematical Formulation
Shannon entropy (on capped shares):
and \(q_i\) is intake per AFE, \(r_i\) is the ICMR-NIN requirement[1] for food group \(i\), and \(K = 9\) food groups.
Adequacy score:
Cereal penalty:
Requirement-adjusted index:
4.2.2 Interpreting the Components
| Component | What it captures | Range |
|---|---|---|
| \(H\) (Shannon) | Evenness of consumption across food groups (on capped shares) | \([0,\; H^*]\) |
| \(A\) (Adequacy) | Whether each food group meets its requirement | \([0, 1]\) |
| \(C\) (Cereal penalty) | Penalizes over-dependence on cereals | \((0, 1]\) |
| \(H_{\text{adj}}\) | Overall dietary quality | \([0,\; H^*]\) |
where \(H^* = -\sum_{i=1}^{K} \frac{r_i}{\sum_j r_j} \log \frac{r_i}{\sum_j r_j}\) is the entropy of the requirement-proportional distribution. Based on the ICMR-NIN benchmark[1] for an adult woman (55 kg, moderately active, non-lactating), \(H^* \approx 1.97\). Because intakes are capped at requirements (Step 2), a household meeting all requirements will have shares \(p_i = r_i / \sum_j r_j\), which are not uniform — so \(H\) is bounded above by \(H^* < \log K\). The maximum of \(H_{\text{adj}}\) is attained when all food groups are consumed at exactly their required levels (\(A = 1\)) and the cereal share equals its requirement share (\(C = 1\)).
Example: A household that eats only rice (high quantity, one food group) would have: \(H \approx 0\) (no diversity), \(A\) low (most groups unmet), \(C < 1\) (cereal-heavy). The adjusted index \(H_{\text{adj}}\) would be very low despite adequate caloric intake.
4.3 Modeling the Diversity Index
The adjusted Shannon index \(H_{\text{adj}}\) is bounded in \([0, H^*]\) where \(H^* < \log K\) (see above). To model it with a Gaussian GAM, we apply a sequence of transformations:
Step 1: Normalize to \([0, 1]\):
Step 2: Clip to \((\varepsilon, 1 - \varepsilon)\) with \(\varepsilon = 10^{-6}\):
Step 3: Logit-transform to the real line:
The transformed outcome \(z\) is then modeled with the same GAM structure as the food item models (global Engel curve + factor-smooth interactions + random effects). Two versions are estimated:
- Without cereal adjustment: \(z\) based on \(H \times A\)
- With cereal adjustment: \(z\) based on \(H \times A \times C\)
4.4 Ratio-Based Shannon Diversity Index (Alternative Construction)
The gram-based construction in Section 4.2 caps intake at the requirement level in grams: \(a_i = \min(q_i, r_i)\). This means food groups with large gram requirements (e.g., cereals at ~280 g/day) mechanically dominate the share calculation, while food groups with small requirements (e.g., nuts/seeds at ~40 g/day) contribute proportionally tiny amounts even when fully met. The ratio-based alternative corrects this by normalizing each food group's intake by its appropriate daily requirement, ensuring consistency across food groups with different scales.
The ratio-based alternative addresses this by normalizing each food group's intake to its requirement before computing shares, placing all food groups on a common unitless 0–1 scale.
4.4.1 Construction (Step by Step)
Step 1: Per-AFE intake
→ Same as gram-based (Section 4.2, Step 1)
Step 2: Cap at requirement (ratio-based)
→ a_i = min(q_i / r_i, 1)
→ Each food group's adequacy ratio is capped at 1
→ All food groups now on the same 0–1 scale
Step 3: Compute shares
→ p_i = a_i / Σ(a_i)
→ Shares of adequacy ratios across food groups
Step 4: Shannon entropy
→ H = -Σ(p_i × log(p_i))
→ Higher H = more evenly met requirements
Step 5: Adequacy adjustment (A)
→ A = (1/K) × Σ a_i
→ Since a_i is already the capped adequacy ratio, A is the
mean adequacy across food groups
Step 6: Final index
→ H_adj = H × A
4.4.2 Mathematical Formulation
Shannon entropy (on ratio-based shares):
where \(q_i\) is intake per AFE, \(r_i\) is the ICMR-NIN requirement for food group \(i\), and \(K = 9\) food groups.
Adequacy score:
Requirement-adjusted index:
4.4.3 Key Differences from the Gram-Based Index
| Feature | Gram-based (Section 4.2) | Ratio-based (Section 4.4) |
|---|---|---|
| Capping rule | \(\min(q_i, r_i)\) (grams) | \(\min(q_i/r_i, 1)\) (unitless ratio) |
| Scale of \(a_i\) | Varies by food group (25–300 g) | 0–1 for all food groups |
| Share interpretation | Proportion of capped grams | Proportion of adequacy across groups |
| Cereal dominance | Cereals mechanically dominate shares due to large gram requirement | Fully met cereal contributes 1.0, same as any other fully met group |
| Cereal adjustment | Needed (cereal penalty \(C\)) | Not needed — cereal share only becomes large when other groups are poorly met |
| Maximum Shannon | \(H^* \approx 1.97\) (requirement-weighted) | \(\log K \approx 2.20\) (uniform when all groups fully met) |
4.4.4 Why Cereal Adjustment Is Unnecessary
In the gram-based formulation, cereals dominate shares for nearly every household because the cereal requirement (~280 g/day) vastly exceeds other food groups. The cereal penalty was introduced to correct for this mechanical artefact. In the ratio-based formulation, each food group contributes at most 1.0 to the sum regardless of its gram requirement. The cereal share only becomes disproportionately large when cereals are well-met and other food groups are genuinely under-consumed — precisely the scenario of a nutritionally imbalanced diet. Since the Shannon entropy already penalizes such concentration by construction, the cereal adjustment is redundant and is therefore not applied in the ratio-based version.
4.4.5 Modeling
The ratio-based index \(H_{\text{adj}}^{\text{ratio}}\) is normalized, clipped, and logit-transformed following the same procedure as Section 4.3:
where normalization uses \(\log K\) (the maximum Shannon entropy under uniform shares) rather than \(H^*\), since the ratio-based formulation achieves \(\log K\) when all food groups are fully met. The logit-transformed outcome is then modeled with the same GAM specification as the gram-based version. Only the \(H \times A\) version (without cereal adjustment) is estimated.
4.5 Why Shannon Diversity Rather Than a Simple Adequacy Ratio?
A natural alternative to the Shannon-based index is a simple mean adequacy ratio (MAR):
which is identical to the adequacy component \(A\) described in Sections 4.2 and 4.4. The MAR is widely used in nutrition research as a summary of dietary adequacy. However, it has important limitations that motivate the Shannon-based approach.
4.5.1 The MAR Cannot Distinguish Balance from Concentration
Consider two households, each with \(K = 9\) food groups:
| Household | Diet pattern | MAR | Shannon \(H\) |
|---|---|---|---|
| A | Meets 50% of requirement for every food group | 0.50 | \(\log 9 \approx 2.20\) (maximum) |
| B | Fully meets 4 groups, zero for the other 5 | 0.44 | \(\log 4 \approx 1.39\) |
| C | Fully meets 5 groups, zero for the other 4 | 0.56 | \(\log 5 \approx 1.61\) |
Households A and C have similar MAR values (~0.50 vs. ~0.56), yet their diets are fundamentally different: A spreads intake evenly across all food groups, while C concentrates entirely on five groups and completely neglects four others. The Shannon index captures this distinction — A achieves maximum entropy while C scores substantially lower. From a nutritional standpoint, A's balanced but insufficient diet may require a different policy intervention (increasing quantity across the board) than C's concentrated diet (diversifying into neglected food groups). The MAR alone cannot inform this distinction.
4.5.2 Shannon Captures the Distributional Shape
The MAR is a first moment — it summarizes the average level of adequacy. The Shannon index captures the distributional shape — how evenly that adequacy is spread. These are complementary dimensions:
| Dimension | What it answers | Captured by |
|---|---|---|
| Level | How much of the diet's requirements are met on average? | MAR (\(A\)) |
| Distribution | Are requirements met evenly or concentrated in a few groups? | Shannon (\(H\)) |
| Combined | Is the diet both adequate and balanced? | \(H_{\text{adj}} = H \times A\) |
The adjusted index \(H_{\text{adj}} = H \times A\) is high only when both dimensions are satisfied: the household meets its requirements (high \(A\)) and does so across a broad range of food groups (high \(H\)). Neither component alone is sufficient for policy — a household can have high \(A\) from consuming large quantities of a few food groups, or high \(H\) while consuming very little overall.
4.5.3 Policy Relevance
For food policy, the distinction matters because interventions differ depending on the underlying problem. A household with low \(A\) but high \(H\) (balanced but insufficient) benefits from income support or food subsidies that raise consumption levels across the board. A household with moderate \(A\) but low \(H\) (adequate but concentrated) benefits from diversification interventions — nutrition education, market access for non-staple foods, or targeted subsidies for underconsumed food groups. The Shannon-based index, by separately identifying these two dimensions and combining them multiplicatively, provides richer information for targeting than the MAR alone.
In short, the MAR answers "how much?" while the Shannon index answers "how well distributed?" — and the adjusted index \(H_{\text{adj}}\) requires both questions to be answered favourably.
5. Prediction Grid Construction
Before generating predictions, the analysis constructs standardized prediction grids that control for confounding when comparing groups. This is a critical step: raw comparisons between, say, rural and urban consumption would conflate the true sector effect with differences in the geographic and demographic composition of each sector.
5.1 The Standardization Problem
Consider comparing fruit consumption between rural and urban India. Urban households are disproportionately located in richer states (Maharashtra, Karnataka) and have different demographic profiles (fewer children, different religious composition). A naive rural–urban comparison would mix three effects: the actual sector effect on consumption, geographic composition differences, and demographic composition differences.
The prediction grid solves this by constructing a counterfactual population where all non-focal variables are held at the same national distribution.
5.2 Grid Construction Logic
The grid is constructed differently depending on the type of comparison. In all cases, the approach follows the same principle: hold the focal variable at its actual values, and standardize everything else to a common national distribution.
Grid Construction by Comparison Type:
STATE comparison [c("nss", "state_code")]
→ Keep: actual geographic structure (regions, rural/urban) within each state
→ Standardize: demographics (social, religion, child, female_headed, seasonal)
→ Result: isolates state-level consumption differences net of demographics
NSS REGION comparison [c("nss", "nss_region")]
→ Standardize (optional): sector distribution across regions
→ Standardize: demographics nationally
→ Result: isolates regional effects net of urban/rural mix and demographics
SECTOR comparison [c("nss", "sector")]
→ Standardize: state-region distribution across rural/urban
→ Standardize: demographics nationally
→ Result: isolates PURE rural-urban effect
DEMOGRAPHIC comparison [c("nss", <focal_var>)]
→ Standardize: full geography (state, region, sector)
→ Standardize: other demographics nationally
→ Result: isolates effect of focal demographic variable
Geographic standardization. When comparing sectors or regions, the grid optionally standardizes the geographic distribution so that, for example, rural and urban get the same state–region composition. This isolates the “pure” sector effect from geographic confounding. Without standardization, sector differences would partly reflect the fact that rural populations are concentrated in poorer states.
5.3 Grid Construction Steps (Detailed)
For each comparison type, the grid is built through the following process:
Step 1: Define the core structure. The focal variable × expenditure bin combinations define the “backbone” of the grid. For each cell, the mean MPCE within the bin (from the MPCE models) provides the income value at which predictions are evaluated.
Step 2: Compute geographic weights. For state comparisons, each state keeps its actual nss_region × sector structure. For demographic comparisons, a national geographic distribution is computed:
where \(s\) is state, \(r\) is NSS region, and \(u\) is sector.
Step 3: Compute demographic weights. The national demographic distribution is computed marginally:
where \(\mathbf{d}\) is the vector of non-focal demographic variables (social group, religion, child presence, female headship, and — for seasonal items — seasonal employment).
Step 4: Cross and combine. The geographic and demographic grids are crossed (Cartesian product) and their weights multiplied:
This produces the full prediction grid: every combination of the focal variable, expenditure bin, geographic cell, and demographic cell, with a weight reflecting its representation in the standardized population.
Step 5: Assign grid IDs. Each unique combination of
the grouping variable × expenditure bin receives a grid_id,
used to aggregate predictions.
Step 6: Build “Overall” grid. A separate grid is constructed using the overall (population-weighted) mean MPCE for each group, rather than decile-specific means. This provides the marginal group-level summary.
5.4 Expenditure Binning
Households are assigned to expenditure deciles using group-specific cutpoints from the MPCE models:
| Bin | Label | Definition |
|---|---|---|
| 1 | 0–10 | Below 10th percentile of group-specific MPCE distribution |
| 2 | 10–20 | 10th to 20th percentile |
| … | … | … |
| 10 | 90–100 | Above 90th percentile |
| Overall | Overall | Full population (weighted mean MPCE) |
The cutpoints (\(q_{10}, q_{20}, \ldots, q_{90}\)) are computed within each group (e.g., separately for rural and urban) to ensure that each bin contains approximately 10% of the group’s weighted population. This means the same income level (say ₹2,000/month) may fall in different deciles for different groups.
5.5 Seasonal Model Variation
Food items with strong seasonal consumption patterns (fruits, other
vegetables, green leafy vegetables, roots & tubers) include
seasonal as a covariate. The grid construction respects
this:
- For seasonal items: the demographic expansion
includes the
seasonalvariable - For non-seasonal items: the
seasonalvariable is excluded from the demographic expansion, reducing grid size
6. Posterior Simulation and Uncertainty Propagation
The prediction pipeline propagates two sources of uncertainty through to the final estimates: model parameter uncertainty (from the GAM coefficients) and survey sampling uncertainty (from the complex survey design).
6.1 Overview of the Prediction Pipeline
Step 1: Load pre-fitted models (logit + quantity)
Step 2: Construct standardized prediction grids (from previous section)
Step 3: Draw coefficient vectors from posterior (MVN)
Step 4: Draw dispersion parameters (chi-squared)
Step 5: Compute household-level predictions (chunked for memory)
Step 6: Aggregate to group × decile level (weighted)
Step 7: Compute survey standard errors (svydesign)
Step 8: Inject sampling uncertainty into model draws
Step 9: Summarize (mean, median, 95% credible interval)
6.2 Step 1: Covariance Matrix Validation
Before drawing coefficients, the posterior covariance matrix \(\mathbf{V}_p\) must be positive definite.[2] Numerical issues in large GAM fits can occasionally produce matrices with near-zero or negative eigenvalues. Two repair strategies are available:
| Method | Approach | When used |
|---|---|---|
nearPD |
Finds nearest positive definite matrix (Frobenius norm) | Default; more accurate |
jitter |
Adds \(\varepsilon \cdot \mathbf{I}\) to diagonal | Fallback; faster |
The validation checks all eigenvalues: \(\lambda_{\min}(\mathbf{V}_p) > 10^{-10}\).
6.3 Step 2: Draw Coefficient Vectors
Let \(\hat{\boldsymbol{\beta}}\) denote the estimated coefficients and \(\mathbf{V}_p\) the Bayesian posterior covariance matrix.[2] For each simulation \(m = 1, \ldots, M\):
drawn via MASS::mvrnorm(). Separate draws are made for
the logit model (\(\boldsymbol{\alpha}^{(m)}\)) and the
quantity model (\(\boldsymbol{\beta}^{(m)}\)).
6.4 Step 3: Draw Dispersion Parameters
For the Log-Normal model, the residual variance is drawn from its posterior:
where \(\nu\) is the residual degrees of freedom and \(\chi^2_\nu\) is a chi-squared draw with \(\nu\) degrees of freedom. A bias-correction term is precomputed: \(\delta^{(m)} = \tfrac{1}{2}(\sigma^{(m)})^2\).
For the Gamma model, the dispersion parameter \(\phi\) (inverse of shape) is drawn analogously:
yielding shape \(= 1 / \phi^{(m)}\) for each draw.
6.5 Step 4: Compute Household-Level Predictions (Chunked)
Predictions are computed in memory-efficient blocks (default: 2,000 rows per block) to handle the large prediction grids. For each block of rows in the grid:
4a. Linear predictors via the lpmatrix:
where \(\mathbf{X}_L\) and \(\mathbf{X}_Q\) are the design matrices from
predict(..., type = "lpmatrix"), and \(\mathbf{B}_L\), \(\mathbf{B}_Q\) are the \(M \times p\) matrices of coefficient draws.
The matrix cross-product tcrossprod() yields an \(n_{\text{block}} \times M\) matrix of
predictions in a single operation.
4b. Transform to response scale:
Probability: \(\hat{p}_i^{(m)} = \text{logit}^{-1}(\eta_{L,i}^{(m)})\)
Conditional mean (log-normal): \(\hat{\mu}_i^{(m)} = \exp(\eta_{Q,i}^{(m)} + \delta^{(m)})\)
Conditional mean (Gamma): \(\hat{\mu}_i^{(m)} = \exp(\eta_{Q,i}^{(m)})\)
4c. Expected unconditional quantity:
4d. Posterior predictive draws (full distributional draws, not just means):
For the log-normal case:
For the Gamma case:
The unconditional posterior predictive combines a Bernoulli draw with the conditional draw:
6.6 Step 5: Weighted Aggregation to Group × Decile Level
Within each block, predictions are accumulated into group-level weighted sums. For each group \(g\) (defined by the grouping variable × expenditure bin):
where \(\tilde{w}_i = w_i / \sum_{j \in g} w_j\) are the normalized within-group weights. The same weighted averaging is applied to \(\hat{\mu}_i^{(m)}\), \(\widehat{EY}_i^{(m)}\), \(Y_{\text{pos},i}^{(m)}\), and \(Y_{\text{uncond},i}^{(m)}\).
The result is a set of \(M\) draws for each group × bin combination, stored as list-columns:
| Draw variable | Description |
|---|---|
p_pos_g |
Group-level probability draws (M values) |
mu_pos_g |
Group-level conditional mean draws |
EY_uncond_g |
Group-level unconditional mean draws (\(p \times \mu\)) |
Ypos_ppc_g |
Group-level conditional posterior predictive draws |
Yuncond_ppc_g |
Group-level unconditional posterior predictive draws |
6.7 Step 6: Survey Standard Errors
To quantify sampling uncertainty, plug-in
predictions (\(\hat{p}_i\), \(\hat{\mu}_i\), \(\widehat{EY}_i\) at coefficient point
estimates) are computed for every household in the microdata. These are
then summarized using the survey package with the full
complex survey design (PSU clustering, stratification, survey
weights):
computed via survey::svyby() with
survey::svymean().
Lonely PSU adjustment. The option
survey.lonely.psu = "adjust" is set to handle strata with a
single PSU, ensuring that variance estimation does not fail.
6.8 Step 7: Injecting Sampling Uncertainty
The survey standard errors are injected into the model draws using scale-appropriate transformations. This combines both sources of uncertainty into a single set of draws.
6.8.1 For probability draws (logit scale):
The delta method gives \(\text{SE}(\text{logit}(p)) \approx \text{SE}(p) / [p(1-p)]\). Noise is added on the logit scale:
This respects the \((0, 1)\) bounds on probabilities.
6.8.2 For quantity draws (log scale, mean-preserving):
Multiplicative log-normal noise is applied with a bias correction that preserves the expected value:
where \(\sigma_{\log} = \text{SE}(\bar{q}) / \bar{q}\) (the coefficient of variation). The term \(-\tfrac{1}{2}\sigma_{\log}^2\) ensures:
so the mean of the draws is preserved.
Why two uncertainty sources? Model uncertainty captures the imprecision in estimating smooth Engel curves and random effects. Sampling uncertainty captures the fact that we observe a sample, not the full population. For well-identified parameters (e.g., national-level cereal consumption), model uncertainty dominates. For thin cells (e.g., tribal households in small northeastern states), sampling uncertainty can be substantial.
6.9 Step 8: Final Summary Statistics
The \(M\) survey-adjusted draws for each group × bin are summarized into:
| Statistic | Description |
|---|---|
mean |
Posterior mean of unconditional consumption (grams/day) |
median |
Posterior median |
q2.5 |
2.5th percentile (lower bound of 95% credible interval) |
q97.5 |
97.5th percentile (upper bound of 95% credible interval) |
mean_p |
Posterior mean of probability of consuming |
q2.5_p |
Lower bound of 95% CI for probability |
q97.5_p |
Upper bound of 95% CI for probability |
7. Output Variables
The prediction pipeline produces the following variables for each food item × grouping variable × group level × decile bin:
| Variable | Description |
|---|---|
nss |
Survey round (HCES2011 or HCES2023) |
var_n |
Grouping variable name (e.g., “sector”, “religion”) |
group_n |
Group level name (e.g., “Rural”, “Hindu”) |
item_n |
Food item category |
bin |
MPCE decile bin (0–10, 10–20, …, 90–100, Overall) |
log_mpce_real_afe |
Mean log real MPCE per AFE in the bin |
mpce_real_afe_2011 |
Mean real MPCE per AFE (₹, 2011–12 prices) |
mean |
Posterior mean of unconditional consumption (grams/day) |
q2.5 |
2.5th percentile (lower bound of 95% credible interval) |
q97.5 |
97.5th percentile (upper bound of 95% credible interval) |
mean_p |
Posterior mean of probability of consuming |
q2.5_p |
Lower bound of 95% CI for probability |
q97.5_p |
Upper bound of 95% CI for probability |
mean_ess |
Effective sample size for the group × bin cell |
8. Visualization
8.1 Engel Curve Plots
The primary visualization is a pair of Engel curves — the relationship between household expenditure and food consumption — plotted for each food item across demographic groups:
Top panel: Proportion consuming (%)
→ Y-axis: Estimated share of people consuming the food item
→ X-axis: Monthly per capita expenditure (₹, 2011-12 prices, log scale)
→ Faceted by group level within grouping variable
Bottom panel: Average quantity consumed (grams)
→ Y-axis: Estimated daily consumption per AFE
→ X-axis: Same as above
→ Reference lines at ICMR-NIN daily requirements
Key visual elements:
- Points and lines: Posterior means at each decile bin, connected by smoothed curves
- Ribbons: 95% credible intervals (shaded bands)
- Color: Survey round (2011–12 vs. 2023–24) — enables visual comparison of changes over time
- Annotation: Mean values (overall average) marked with dashed crosshairs and labels showing the exact proportion or quantity
8.2 Cross-Sectional Plots (Single Survey Round)
For state-level comparisons within HCES 2023–24, an alternative visualization displays:
- All states within a geographic region (Northern, Central, Eastern, Northeast, Western, Southern) on a single panel
- Smoothed Engel curves with labeled mean values
- Choice of quantity or proportion on the y-axis
- Choice of log MPCE or decile rank on the x-axis
These are combined into multi-panel grids (3 × 2 for six regions, or
3 × \(n\) for demographic variables)
using the magick package for image composition.
8.3 Suite of Analyses
The batch processing generates figures across all combinations of:
| Dimension | Levels |
|---|---|
| Food items | 10 categories |
| Grouping variables | sector, social, religion, child, female_headed, seasonal, state |
| Metrics | proportion, quantity |
| Survey rounds | 2011–12, 2023–24 (combined) or 2023–24 only |
For state-level analyses, figures are further organized by geographic region.
9. Computational Implementation
9.1 Parallel Model Estimation
Each of the 10 food items requires two models (logit + quantity) for each of two survey rounds, yielding 40 model fits plus 4 Shannon diversity models. These are estimated in parallel:
Architecture:
→ future_lapply() with 4 worker processes
→ Each worker: fits models for one food item × both rounds
→ Models saved to item-specific directories
→ Shannon diversity models estimated separately
Directory structure:
~/data/bam_models/CoAHD/
├── cereal_millets/
│ ├── cereal_millets_model.RData
│ ├── data_prop_quantity_cereal_millets_sector.RData
│ ├── data_prop_quantity_cereal_millets_rel.RData
│ └── ...
├── green_leafy_vegetables/
│ └── ...
├── ...
└── Shannon_Diversity_food/
└── Shannon_Diversity_food_model.RData
Each model .RData file contains a list with:
| Object | Contents |
|---|---|
HCES2011_model |
Models and data for the 2011–12 round |
HCES2023_model |
Models and data for the 2023–24 round |
model_l |
Logit (participation) model |
model_q |
Quantity model (Gamma or Log-Normal, AIC-selected) |
model_info |
String: “Gamma (log link)” or “log normal” |
data |
Analysis dataset used for fitting |
9.2 Prediction Data Generation Pipeline
The figure data generation is the most computationally intensive step. It is orchestrated by a batch processing function that runs across all item × grouping variable combinations in parallel.
Batch Processing Architecture:
Input: 10 food items × 7 grouping variables = 70 jobs
(each job = 2 survey rounds)
For each job (item × grouping_var):
1. Load pre-fitted models for the item
2. Load MPCE decile cutpoints for the grouping variable
3. Construct standardized prediction grid
4. Draw M=1000 coefficient vectors (MVN) and dispersion parameters
5. Compute predictions in blocks of 2000 rows
6. Compute survey standard errors (svydesign)
7. Inject sampling uncertainty
8. Summarize to mean + 95% credible interval
9. Save to item-specific .RData file
Parallelization:
→ future_lapply() with auto-detected workers
→ Worker count: min(cores-1, memory/2GB per job, n_jobs)
→ Each job: ~2GB peak memory, ~3-5 min per item × group × round
9.3 System Resource Management
The pipeline includes adaptive resource management to prevent memory exhaustion:
| Constraint | Calculation | Purpose |
|---|---|---|
| Core limit | detectCores() - 1 |
Leave one core for system |
| Memory limit | (total RAM - 4 GB) / 2 GB per job |
Prevent swapping |
| Job limit | Number of remaining jobs | Don’t oversubscribe |
| Optimal workers | min(core, memory, job) |
Balance throughput and stability |
9.4 Output File Organization
Each food item × grouping variable combination produces a prediction file:
~/data/bam_models/CoAHD/<item>/
├── <item>_model.RData (fitted models)
├── data_prop_quantity_<item>_sector.RData (sector predictions)
├── data_prop_quantity_<item>_rel.RData (religion predictions)
├── data_prop_quantity_<item>_social.RData (social group predictions)
├── data_prop_quantity_<item>_child.RData (child presence predictions)
├── data_prop_quantity_<item>_female_headed.RData (female headship predictions)
├── data_prop_quantity_<item>_seasonal.RData (seasonal predictions)
└── data_prop_quantity_state_<item>_sector.RData (state-level predictions)
All results are also compiled into a single Excel workbook
(data_prop_quantity.xlsx) with two sheets: the full dataset
and a variable descriptions sheet.
10. Summary of Model Assumptions
The key assumptions underlying this framework:
Hurdle structure: The decision to consume and the quantity consumed are modeled as separate processes. This is appropriate when many households report zero consumption for a given food item.
Smooth Engel curves: The relationship between income and consumption is assumed to be a smooth (but potentially nonlinear) function. The factor-smooth specification allows this curve to vary by group while borrowing strength across groups.
Distributional assumption for quantity: The conditional quantity follows either a Gamma or Log-Normal distribution (selected by AIC). Both accommodate positive, right-skewed data typical of consumption quantities.
Gaussian random effects: All categorical covariates enter as Gaussian random effects, implying partial pooling (shrinkage) toward the grand mean.
Posterior normality of coefficients: Coefficients are drawn from a multivariate normal centered at the penalized MLE — the standard Laplace approximation, accurate for large samples.
Additive separability of the Shannon index components: The requirement-adjusted index \(H_{\text{adj}} = H \times A \times C\) assumes that diversity, adequacy, and cereal dependence contribute multiplicatively.
Logit-normality of the Shannon index: After normalization and logit transformation, the diversity index is modeled as conditionally Gaussian — this is an approximation, especially near the boundaries (very low or very high diversity).
References
Appendix: Glossary of Terms
AFE (Adult Female Equivalent): A standardization method to account for different caloric needs based on age and gender. The reference unit is an adult woman weighing 55kg.
AIC (Akaike Information Criterion): A measure of model fit that penalizes complexity. Lower AIC indicates a better trade-off between fit and parsimony. Used here to select between Gamma and Log-Normal quantity models.
bam(): The mgcv package
function for fitting generalized additive models to large datasets,
using discretized covariates and fast REML for efficiency.
Credible Interval: The Bayesian analogue of a confidence interval. A 95% credible interval contains the true parameter value with 95% posterior probability, given the model and data.
Coefficient of Variation (CV): The ratio of the standard error to the mean, \(\text{CV} = \text{SE}/\bar{x}\). Used here to convert survey standard errors to log-scale noise for the mean-preserving multiplicative uncertainty injection.
Decile: Division of population into 10 equal groups based on expenditure levels (Decile 1 = poorest 10%).
Delta Method: A statistical technique for approximating the variance of a transformed variable. If \(g(\hat{\theta})\) is a function of an estimator, then \(\text{Var}(g(\hat{\theta})) \approx [g'(\hat{\theta})]^2 \cdot \text{Var}(\hat{\theta})\). Used here to convert standard errors from the probability scale to the logit scale.
Engel Curve: The relationship between household income (or expenditure) and consumption of a particular good. Named after the 19th-century statistician Ernst Engel.
Factor-Smooth Interaction (bs = "fs"):
A penalized smooth term in mgcv that allows each level of a
factor to have its own smooth curve, with deviations from the global
curve penalized. Achieves group-specific nonlinear effects with
automatic shrinkage.
fREML (Fast REML): Fast restricted maximum
likelihood — the method used for smoothing parameter estimation in
mgcv::bam().
GAM (Generalized Additive Model): A flexible regression framework that allows non-linear relationships and random effects through penalized smooth terms.
Geographic Standardization: A technique for making group comparisons fair by holding the geographic distribution (state, region, sector) constant across groups. Without standardization, rural–urban differences would partly reflect the geographic concentration of rural populations in poorer states.
HCES: Household Consumer Expenditure Survey conducted by the National Sample Survey Office.
Hurdle Model: A two-part model that separately models (a) whether an event occurs and (b) the magnitude conditional on occurrence. Appropriate for data with excess zeros.
ICMR-NIN: Indian Council of Medical Research – National Institute of Nutrition. The body that establishes dietary guidelines for India (see [2]).
lpmatrix (Linear Predictor Matrix): The design
matrix \(\mathbf{X}\) such that \(\mathbf{X}\boldsymbol{\beta}\) gives the
linear predictor. Obtained from
predict(..., type = "lpmatrix") in mgcv.
Essential for posterior simulation because it allows vectorized
computation of predictions across all coefficient draws simultaneously
via matrix multiplication.
Mean-Preserving Noise: A multiplicative noise injection technique where \(\tilde{q} = q \cdot \exp(\varepsilon - \tfrac{1}{2}\sigma^2)\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\). The bias correction term \(-\tfrac{1}{2}\sigma^2\) ensures \(E[\tilde{q}] = E[q]\).
MPCE: Monthly Per Capita Expenditure — total household expenditure divided by household size (or by AFE units).
Posterior Predictive Draw: A simulated observation from the model’s predictive distribution, incorporating both parameter uncertainty (from coefficient draws) and observation-level randomness (from the assumed distribution). Distinguished from “posterior mean draws” which only reflect parameter uncertainty.
Shannon Diversity Index: An information-theoretic measure of diversity, originally from ecology. Higher values indicate more even distribution across categories. Because intakes are capped at ICMR-NIN requirements before computing shares, the effective maximum is \(H^* = -\sum (r_i / \sum r_j) \log(r_i / \sum r_j)\), the entropy of the requirement-proportional distribution, rather than \(\log K\).
Survey Weights: Statistical weights assigned to each sampled household to make the sample representative of the population.
Last updated: February 2026