Methodology Note

Estimating Nutrient Intake and Dietary Composition

A Two-Stage Hurdle Approach Using Semi-Parametric Generalized Additive Models
Dr Shamika Ravi (Member, EAC to PM) & Dr Mudit Kapoor (CECFEE, EPU, ISI-Delhi Center)
February 2026

1. Introduction

This document describes the econometric methodology employed to estimate nutrient intake levels and dietary composition shares from the Indian Household Consumer Expenditure Surveys (HCES) of 2011-12 and 2023-24. The analysis addresses two interconnected questions: (i) how much of a given nutrient does a household consume, and (ii) what share of that nutrient comes from specific food categories such as cereals, beverages, served processed food, and packaged processed food.

Both questions present the challenge of zero-valued observations: some households report no consumption of particular food categories during the recall period. We adopt a two-stage hurdle model framework [1] that separates the participation decision (extensive margin) from the conditional outcome (intensive margin), estimated via semi-parametric Generalized Additive Models (GAMs) using the mgcv package in R [2].

The methodology is applied uniformly to three nutrients (calories in kcal, protein in g, and fat in g) and to the compositional shares of food categories within each nutrient. All models employ a common hierarchical structure that accounts for geographic, demographic, and socioeconomic heterogeneity through a combination of smooth functions, factor-smooth interactions, and random effects [3].

2. Data and Sample Construction

2.1 Data Sources

The analysis draws on two rounds of the Household Consumer Expenditure Survey (HCES) conducted by the Ministry of Statistics and Programme Implementation (MoSPI), Government of India: the 2011-12 round and the 2023-24 round.

Both surveys employ a stratified multi-stage sampling design with households as the ultimate sampling unit. The surveys record detailed food consumption over a 30-day recall period, including quantities and expenditures for a comprehensive list of food items.

2.2 Sample Restrictions

The analysis sample is restricted to households that report having a cooking arrangement (cooking_code ≠ 12 in the 2023-24 round; cooking_code ≠ 10 in the 2011-12 round). This filter ensures that the food consumption data reflect actual household dietary choices rather than consumption obtained only from external sources.

2.3 Nutrient Conversion

Reported food quantities are converted to nutrient equivalents (calories, protein, fat) using MoSPI's nutritive value tables. Total household nutrient intake is expressed per adult female equivalent (AFE) unit to account for household composition effects. The AFE scaling factor is based on energy requirements by age and sex.

2.4 Key Variables

| Variable | Description |
| --- | --- |
| log_mpce_real_afe | Log real monthly per capita expenditure (AFE-adjusted), deflated to 2011 prices using a general price index |
| state_code | State identifier (36 states/UTs, with Telangana mapped to Andhra Pradesh for cross-round comparability) |
| nss_region | NSS region identifier, providing sub-state geographic stratification |
| sector | Rural (1) or Urban (2) |
| child | Indicator for presence of children in the household |
| female_headed | Indicator for female-headed household |
| social | Social group: Scheduled Tribe (1), Scheduled Caste (2), OBC (3), Others (9), Residual (0) |
| rel | Religion: Hindu (1), Islam (2), Christian (3), Others (0) |
| w_pc | Normalised survey weights (weights × household size / sample mean) |
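One possible reading of the w_pc definition is sketched below: each household's design weight is multiplied by household size, and the product is divided by its sample mean so that the normalised weights average to one. The function name and the input values are hypothetical, and the exact normalisation used in the analysis may differ.

```python
def normalise_weights(weights, hh_sizes):
    """Person-level weights scaled to have sample mean 1.

    Assumed reading of 'weights x household size / sample mean':
    w_pc_i = (w_i * n_i) / mean_j(w_j * n_j).
    """
    person_weights = [w * n for w, n in zip(weights, hh_sizes)]
    mean_pw = sum(person_weights) / len(person_weights)
    return [pw / mean_pw for pw in person_weights]


# Hypothetical design weights and household sizes for three households.
w_pc = normalise_weights([100.0, 250.0, 400.0], [4, 2, 5])
```

Scaling the weights to mean one keeps the effective sample size on the order of the number of households, which matters when weights enter a likelihood-based fit.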

3. Econometric Framework

3.1 The Two-Stage Hurdle Model

The presence of zero observations in both quantity and share data motivates a two-stage hurdle framework [1]. Unlike Tobit models, the hurdle approach allows the participation decision and the conditional outcome to be governed by different processes with different parameter vectors:

Stage 1 (Extensive Margin): A logistic GAM estimates the probability of positive consumption:

P(y > 0 | x) = logit−1[ f(x) ]

Stage 2 (Intensive Margin): Conditional on positive consumption, a second GAM estimates the outcome—either nutrient quantity or compositional share—using the appropriate family and link function.

The unconditional expected outcome is obtained as the product of the two stages:

E[y] = P(y > 0) × E[y | y > 0]

This decomposition provides a natural framework for analysing food demand: changes in a covariate (such as income) can shift the probability of consuming a food category (extensive margin) and/or the amount consumed conditional on participation (intensive margin) [4].
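The decomposition can be illustrated with a small numerical sketch. The function names and all input values below are hypothetical; the point is only that the same unconditional mean can move through either margin.

```python
import math


def inv_logit(eta):
    """Inverse logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def unconditional_mean(eta_participation, conditional_mean):
    """E[y] = P(y > 0) * E[y | y > 0]."""
    return inv_logit(eta_participation) * conditional_mean


# Hypothetical Stage 1 linear predictors and Stage 2 conditional means.
baseline = unconditional_mean(0.0, 200.0)         # P(y > 0) = 0.5
extensive_shift = unconditional_mean(1.0, 200.0)  # participation rises, intensity fixed
intensive_shift = unconditional_mean(0.0, 220.0)  # intensity rises, participation fixed
```

Comparing extensive_shift and intensive_shift against baseline separates how much of a change in expected consumption comes from new consumers and how much from existing consumers eating more.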

3.2 GAM Specification

All models share a common semi-parametric structure that combines thin-plate regression splines, factor-smooth interactions, and Gaussian random effects. The general form is:

g(μ) = α + f(log MPCE) + Σ f(log MPCE, z) + Σ u  (random effects)

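where α is an intercept, f(log MPCE) is a thin-plate regression spline in log real MPCE, each f(log MPCE, z) is a factor-smooth interaction that lets the income curve vary across the levels of a categorical covariate z, and each u is a Gaussian random effect for a grouping factor.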

3.3 Estimation

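All models are estimated using mgcv::bam(), which is optimised for large datasets (N > 100,000) [2]. Settings stated elsewhere in this note include the normalised survey weights w_pc (Section 2.4) and internal threading with nthreads = 2 (Section 6).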

4. Model A: Nutrient Quantity

The first model class estimates total household nutrient intake per AFE unit for calories, protein, and fat.

4.1 Stage 1: Participation

A quasi-binomial GAM with logit link estimates the probability of positive nutrient consumption from a given food group:

logit[ P(quantity > 0) ] = f(x)

where f(x) follows the general GAM specification described in Section 3.2. The quasi-binomial family is used to accommodate potential overdispersion in the binary response under survey weights.

4.2 Stage 2: Conditional Quantity

Conditional on positive consumption, two candidate models are estimated:

(a) Gamma with log link: Models the conditional mean directly on the original scale via E[y | y > 0] = exp(f(x)). This naturally constrains predictions to be positive and accommodates right-skewed data.

(b) Log-normal: Models log(y | y > 0) = f(x) + ε, where ε ~ N(0, σ²). Retransformation to the original scale requires a smearing adjustment.
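The note does not say which smearing adjustment is used. A minimal sketch, assuming a Duan-type nonparametric smearing factor (the mean of the exponentiated log-scale residuals), is given below; the residuals and the log-scale prediction are hypothetical values.

```python
import math


def smearing_factor(log_residuals):
    """Duan-type smearing factor: the mean of exp(residual) on the log scale."""
    return sum(math.exp(r) for r in log_residuals) / len(log_residuals)


def retransform(log_prediction, log_residuals):
    """Predicted mean on the original scale: exp(f(x)) times the smearing factor."""
    return math.exp(log_prediction) * smearing_factor(log_residuals)


# Hypothetical log-scale residuals (mean zero) from a fitted Stage 2 model.
residuals = [-0.4, -0.1, 0.0, 0.2, 0.3]
naive = math.exp(7.6)                   # ignores the retransformation bias
adjusted = retransform(7.6, residuals)  # scaled up by the smearing factor
```

Because exp() is convex, the factor exceeds one whenever the residuals vary, so the naive back-transform understates the conditional mean. With survey weights, a weighted mean of exp(residual) would be the natural analogue.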

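Model selection between these two candidates is performed via AIC comparison. The two AIC values are not directly comparable as reported: the Gamma likelihood is defined for y, whereas the log-normal model is fitted as a Gaussian model for log(y), so its AIC must first be shifted by the Jacobian term 2 Σ log(y) to place it on the original scale. With that caveat, AIC serves as a practical guide to relative fit, since both models describe the conditional outcome multiplicatively (a log link in the Gamma case, a log-transformed response in the log-normal case).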
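Comparing a Gamma model for y with a Gaussian model for log(y) requires both likelihoods to be expressed for the same response. The sketch below checks the change-of-variables identity numerically: the AIC of the log-scale Gaussian fit plus 2 Σ log(y) equals the AIC computed from the log-normal density of y. All values are hypothetical, and the sketch does not imply that this correction was or was not applied in the analysis.

```python
import math


def normal_logpdf(z, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at z."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)


def lognormal_logpdf(y, mu, sigma):
    """Log density of y when log(y) ~ N(mu, sigma^2)."""
    return normal_logpdf(math.log(y), mu, sigma) - math.log(y)


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


# Hypothetical positive outcomes and fitted log-scale parameters.
y = [1200.0, 1850.0, 2100.0, 2600.0]
mu, sigma, k = 7.5, 0.3, 2

loglik_log_scale = sum(normal_logpdf(math.log(v), mu, sigma) for v in y)
loglik_y_scale = sum(lognormal_logpdf(v, mu, sigma) for v in y)

aic_log_scale = aic(loglik_log_scale, k)
# Jacobian correction: add 2 * sum(log y) before comparing with a Gamma AIC on y.
aic_y_scale = aic_log_scale + 2 * sum(math.log(v) for v in y)
```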

5. Model B: Nutrient Composition Shares

The second model class estimates the share of total nutrient intake attributable to specific food categories (e.g., the share of total calories from packaged processed food). These shares are bounded in [0, 1] and subject to a unit-sum constraint across all categories, making them compositional data [6].

5.1 The Compositional Data Challenge

Standard regression on raw shares is problematic for three reasons: shares are bounded, the unit-sum constraint induces spurious correlations, and the sample space is a simplex rather than Euclidean space. The Isometric Log-Ratio (ILR) transformation [7] maps the D-part composition to D−1 unconstrained real-valued coordinates that live in standard Euclidean space, making them suitable for regression.

5.2 Two-Part Composition with ILR

Given the high prevalence of zeros in some food categories (e.g., served processed food with ~30% zeros), we estimate each food category separately against an "everything else" residual. For each food category j, the composition is:

c = (s, 1 − s)

where s is the share of category j. The ILR transform of this 2-part composition yields a single coordinate:

ilr = (1/√2) × ln( (1 − s) / s)

This is equivalent, up to a scaling constant, to the additive log-ratio (ALR) transformation. With this sign convention, positive values indicate that the food category contributes a minority share (s < 0.5); negative values indicate that it dominates (s > 0.5).
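A short sketch of the two-part ILR coordinate, using the formula exactly as written above, ln((1 − s) / s) / √2. Under that formula the coordinate is positive when s < 0.5, zero at s = 0.5, and negative when s > 0.5. The function name and example shares are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def ilr_two_part(s):
    """ILR coordinate of (s, 1 - s) in the note's convention: ln((1 - s) / s) / sqrt(2)."""
    if not 0.0 < s < 1.0:
        raise ValueError("share must lie strictly between 0 and 1")
    return math.log((1.0 - s) / s) / SQRT2


minority = ilr_two_part(0.2)  # category supplies 20% of the nutrient
even = ilr_two_part(0.5)      # equal split between category and everything else
majority = ilr_two_part(0.8)  # category supplies 80% of the nutrient
```

The transform is antisymmetric around s = 0.5, so shares of 0.2 and 0.8 map to coordinates of equal magnitude and opposite sign. Shares of exactly 0 or 1 are undefined here, which is why the zero handling in Section 5.3 is needed first.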

5.3 Zero Handling

Zeros arise in two distinct ways, each requiring different treatment:

Structural zeros in the share (share = 0): Households that do not consume the food category at all. These are handled by Stage 1 of the hurdle model, which separates participants from non-participants.

Boundary values (share = 1, hence share_others = 0): Households that derive all of a nutrient from a single category, making the residual share zero. Among positive-share observations, these create zeros in the composition matrix. We address these using count zero multiplicative replacement (CZM method) from the zCompositions package [8], which replaces zeros with small positive values while preserving the ratios among non-zero components.
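The idea behind multiplicative zero replacement can be sketched as follows. This is simple multiplicative replacement with a fixed, hypothetical delta, shown as a stand-in: it is not the CZM routine from zCompositions, which chooses the replacement values differently.

```python
def multiplicative_replacement(composition, delta=1e-4):
    """Replace zeros by delta and shrink non-zero parts so the total stays 1.

    Simple multiplicative replacement, shown as a stand-in for CZM;
    delta is a hypothetical value, not one taken from the note.
    Assumes the input composition already sums to 1.
    """
    n_zeros = sum(1 for part in composition if part == 0.0)
    shrink = 1.0 - n_zeros * delta
    return [delta if part == 0.0 else part * shrink for part in composition]


# A household that gets all of a nutrient from one category: (share, others) = (1, 0).
replaced = multiplicative_replacement([1.0, 0.0])

# With more parts, the ratios among non-zero components are left unchanged.
three_part = multiplicative_replacement([0.6, 0.3, 0.0, 0.1])
```

After replacement every part is strictly positive, so the log-ratio in the ILR coordinate is finite, and the composition still sums to one.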

5.4 Estimation Procedure

For each food category and nutrient combination, the procedure is:

  1. Stage 1—Participation: A quasi-binomial GAM estimates P(share > 0) on the full sample using the standard specification from Section 3.2.
  2. Positive subsample: Observations with share > 0 are selected. The 2-part composition [share, 1 − share] is constructed.
  3. Zero replacement: If any boundary values exist (share = 1), CZM replacement is applied to the composition matrix.
  4. ILR transform: The composition is mapped to a single ILR coordinate.
  5. Stage 2—Conditional share: A Gaussian GAM on the ILR coordinate estimates E[ilr | share > 0] using the standard specification.

5.5 Back-Transformation and Prediction

Predictions are generated in ILR space and back-transformed to the share scale using the inverse ILR:

ŝ = 1 / (1 + exp(ilr̂ × √2))

The unconditional expected share is then:

E[share] = P(share > 0) × ŝ
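The back-transformation and the hurdle product can be combined in a few lines. The sketch below uses the inverse of ilr = ln((1 − s) / s) / √2 and multiplies the resulting conditional share by the Stage 1 probability P(share > 0). Function names and input values are hypothetical.

```python
import math


def inverse_ilr_two_part(ilr_value):
    """Inverse of ilr = ln((1 - s) / s) / sqrt(2): s = 1 / (1 + exp(sqrt(2) * ilr))."""
    return 1.0 / (1.0 + math.exp(math.sqrt(2.0) * ilr_value))


def unconditional_share(p_positive, ilr_prediction):
    """E[share] = P(share > 0) * back-transformed conditional share."""
    return p_positive * inverse_ilr_two_part(ilr_prediction)


# Hypothetical Stage 1 probability and Stage 2 ILR prediction.
ilr_hat = math.log(4.0) / math.sqrt(2.0)  # corresponds to a conditional share of 0.2
expected = unconditional_share(0.7, ilr_hat)
```

In this example a 70% participation probability and a conditional share of 0.2 give an unconditional expected share of about 0.14.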

6. Computational Implementation

The full analysis involves estimating a large number of models:

| Model Class | Grid Size | BAMs/Job | Total BAMs |
| --- | --- | --- | --- |
| Quantity (Model A) | 3 × 2 = 6 | 3 | 18 |
| Share (Model B) | 3 × 6 × 2 = 36 | 2 | 72 |
| Total | | | 90 |

The grid dimensions for Model B are: 3 nutrients (calories, protein, fat) × 6 food categories (cereals & millets, milk, flesh foods, beverages, served processed food, packaged processed food) × 2 survey rounds = 36 jobs, each fitting 2 BAM models (logit + ILR).
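The job counts can be reproduced by enumerating the grids directly. The labels below are taken from this note; the reading that each Model A job fits three models (one Stage 1 logit plus the two Stage 2 candidates from Section 4.2) is an assumption inferred from the table.

```python
from itertools import product

nutrients = ["calories", "protein", "fat"]
rounds = ["2011-12", "2023-24"]
categories = [
    "cereals & millets", "milk", "flesh foods",
    "beverages", "served processed food", "packaged processed food",
]

# Model A: one job per nutrient x round. Assumed 3 models per job:
# Stage 1 logit plus the Gamma and log-normal Stage 2 candidates.
quantity_jobs = list(product(nutrients, rounds))

# Model B: one job per nutrient x category x round, 2 models per job (logit + ILR).
share_jobs = list(product(nutrients, categories, rounds))

total_models = len(quantity_jobs) * 3 + len(share_jobs) * 2
```

Each tuple in quantity_jobs or share_jobs identifies one independent unit of work, which is what makes job-level parallelisation straightforward.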

Estimation is parallelised at the finest grain—each survey × nutrient (or survey × nutrient × category) combination is dispatched as an independent job using the future and future.apply packages. Within each job, bam() exploits internal OpenMP threading (nthreads = 2) for basis construction and matrix operations. Data preparation uses data.table for efficient in-memory operations on large survey files.

Memory management is handled by slimming fitted model objects (removing the stored copy of training data via model$model <- NULL) and triggering garbage collection between jobs.

7. Food Categories Analysed

The share models (Model B) are estimated for the following food categories, chosen to capture key dimensions of India's dietary transition:

| Food Category | Rationale |
| --- | --- |
| Cereals & millets | Staple foods; traditionally dominant calorie source in Indian diets. Declining share is a hallmark of dietary diversification. |
| Milk & dairy | Key source of high-quality protein and fat, especially for vegetarian households. Consumption patterns vary sharply by income and region. |
| Flesh foods | Includes meat, poultry, fish, and eggs. Rising share signals the nutrition transition and has implications for environmental sustainability. |
| Beverages | Includes tea, coffee, and other non-alcoholic drinks. Often the first category of processed food adopted by low-income households. |
| Served processed food | Restaurant and street food meals. Reflects the growing role of food away from home (FAFH) in Indian diets. |
| Packaged processed food | Factory-produced packaged items. Captures the penetration of ultra-processed foods into the dietary pattern. |

8. Summary

The methodology combines three key elements: (i) a hurdle framework that cleanly separates the participation and intensity decisions, (ii) compositional data analysis via ILR to respect the bounded, sum-constrained nature of dietary shares, and (iii) semi-parametric GAMs that flexibly capture non-linear income effects and geographic heterogeneity without imposing restrictive functional forms.

The two-stage structure is particularly well-suited to Indian dietary data, where zero consumption of specific food categories is common and carries distinct economic meaning—reflecting both affordability constraints and cultural food preferences.

The resulting model objects serve as inputs to counterfactual simulations of dietary change under income growth and policy scenarios, described separately in the simulation methodology note.

References

[1] Cragg, J.G. (1971). Some statistical models for limited dependent variables with application to the demand for durable goods. Econometrica, 39(5), 829–844.
[2] Wood, S.N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). Chapman and Hall/CRC.
[3] Gelman, A. (2006). Multilevel (hierarchical) modeling: What it can and can't do. Technometrics, 48(3), 432–435.
[4] Mullahy, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics, 17(3), 247–281.
[5] Li, R. & Liang, H. (2008). Variable selection in semiparametric regression modeling. Annals of Statistics, 36(1), 261–286.
[6] Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall.
[7] Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G. & Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279–300.
[8] Palarea-Albaladejo, J. & Martín-Fernández, J.A. (2015). zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligent Laboratory Systems, 143, 85–96.