Methodology Note

Food Expenditure Composition in India

Sankey Diagram Data Pipeline — HCES 2011-12 & 2023-24
Dr Shamika Ravi (Member, EAC to PM) & Dr Mudit Kapoor (CECFEE, EPU, ISI-Delhi Center)
February 2026

1. Introduction

This document describes the data creation pipeline behind the Food Expenditure Composition dashboard, which visualises how Indian households allocate their food budget across broad food categories and individual items using a Sankey diagram. The pipeline processes unit-level household microdata from the Household Consumer Expenditure Survey (HCES)[1] across both the 2011-12 and 2023-24 rounds, producing survey-weighted expenditure share estimates at multiple levels of disaggregation.

1.1 Purpose of the Sankey Dashboard

The Sankey dashboard provides a hierarchical view of food expenditure: total food spending flows first into broad food categories (e.g., Cereals & Millets, Milk & Milk Products, Vegetables) and then into individual food items within each category. Each flow is proportional to the expenditure share, allowing immediate visual comparison of dietary spending patterns across states, sectors, expenditure classes, and survey rounds.

1.2 What the Data Captures

For every combination of geographic unit, sector, and expenditure class, the pipeline produces:

  1. Broad-category shares: the percentage of total food expenditure allocated to each food group
  2. Item-level shares: the percentage allocated to each individual food item within its category
  3. Mean MPCE: the average Monthly Per-Capita Expenditure (in AFE or per-capita terms) for the relevant population subgroup

2. Data Sources & Inputs

2.1 Household Consumption Data

The pipeline draws on two rounds of the NSSO/MoSPI Household Consumer Expenditure Survey:

  1. HCES 2011-12 (68th Round)[1]: approximately 101,000 households
  2. HCES 2023-24[2]: approximately 250,000 households

For each survey round, two pre-processed datasets are loaded:

  1. Composition data (HCES20XX_comp): household-level records containing demographic, geographic, and economic variables — including state code, sector (rural/urban), household size (both head-count and AFE-adjusted), MPCE, expenditure decile classifications (state-level and all-India), survey design variables (PSU, strata, weights), and a general price index for real-value adjustments.
  2. Summary data (HCES20XX_summary): item-level expenditure records with household ID, item code, and expenditure value.

2.2 Item Reference Tables

Each survey round has an associated item reference table (item_code_ref_period.dta) that maps numeric item codes to:

  1. item_desc — human-readable item name (e.g., "Rice — PDS", "Toned milk")
  2. broad_category_f — the broad food group to which the item belongs
  3. ref_period — the recall/reference period over which consumption was recorded (7 days or 30 days), used to standardise expenditure to a monthly basis

2.3 Pre-Computed Analytical Data

The pipeline also loads data_for_models.RData, which contains the state_code lookup table mapping numeric state codes to state names, regions, and state-type classifications (state vs. union territory).


3. Methodology

The pipeline follows a five-step process for each combination of parameters. The overall flow is:

Raw HCES microdata
  ↓ Filter & partition by state × sector
Household-level slices
  ↓ Assign quintile class, AFE/PC, real/nominal
Standardised household records
  ↓ Join item-level expenditure & compute monthly per-capita values
Item × household expenditure matrix
  ↓ Survey-weighted estimation (svyby)
Weighted item-level totals
  ↓ Proportional decomposition → broad category + item shares
Final Sankey dataset

3.1 Step 1: Pre-Process & Partition Data

3.1.1 Column Selection & Filtering

The composition dataset is first filtered to exclude households with no cooking arrangements (cooking code "12" for HCES 2023-24 and "10" for HCES 2011-12). This ensures that data used in the analysis is consistent with data used in other dashboards. A slim subset of columns is retained: household ID, state code, sector, decile classifications (state-level and all-India, in both AFE and per-capita terms), household size measures, MPCE, price index, and survey design variables.

3.1.2 Spatial Partitioning

The filtered data is split into partitions keyed by state code × sector (e.g., "09_1" for Uttar Pradesh Rural). A special all-India partition ("00_1", "00_2") is created by retaining the full sector slice without state filtering. This pre-splitting serves two purposes:

  1. Each parallel worker receives only its own slice, avoiding the overhead of shipping the entire dataset
  2. Empty or invalid state-sector combinations are naturally excluded
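
The partitioning logic can be illustrated with a minimal sketch. This is Python written for exposition, not the pipeline's own R code; the record fields (`hhid`, `state`, `sector`) and the function name `partition_by_state_sector` are hypothetical.

```python
from collections import defaultdict

def partition_by_state_sector(households):
    """Split household records into slices keyed by "<state>_<sector>".

    Every record is also added to an all-India slice ("00_<sector>"),
    which keeps the full sector without any state filtering.
    """
    partitions = defaultdict(list)
    for hh in households:
        partitions[f"{hh['state']}_{hh['sector']}"].append(hh)
        partitions[f"00_{hh['sector']}"].append(hh)
    return dict(partitions)

# Toy records (hypothetical values); sector 1 = rural, 2 = urban.
households = [
    {"hhid": "A", "state": "09", "sector": 1},
    {"hhid": "B", "state": "09", "sector": 2},
    {"hhid": "C", "state": "10", "sector": 1},
]
partitions = partition_by_state_sector(households)
```

Because keys are created only when a record exists, a state-sector combination with no households (here "10_2") never appears in the result, which is how empty combinations drop out naturally.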

3.2 Step 2: Household-Level Standardisation

3.2.1 AFE vs Per-Capita Measurement

The pipeline supports two standardisation modes:

  1. AFE (Adult Female Equivalent): household size and MPCE are computed using energy-based adult female equivalence scales[3], which account for the differing caloric requirements of household members by age and sex. This is the preferred mode for nutritional analysis.
  2. Per-Capita (Non-AFE): household size is taken as the raw head count, and MPCE is total household expenditure divided by household size. This mode provides comparability with official MPCE statistics.

The pipeline selects the appropriate household size, MPCE, and decile classification variables based on the chosen mode.

3.2.2 Expenditure Quintile Classification

Households are grouped into five expenditure quintile bands based on their decile classification. For state-level analysis, state-specific decile cutoffs are used; for all-India analysis, national decile cutoffs apply.

Quintile Label | Decile Range | Population Share
Bottom 20%     | Deciles 1–2  | Poorest 20%
20–40%         | Deciles 3–4  | Second quintile
40–60%         | Deciles 5–6  | Middle quintile
60–80%         | Deciles 7–8  | Fourth quintile
Top 20%        | Deciles 9–10 | Richest 20%

When the class parameter is set to "Overall", all households are pooled regardless of quintile.
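
The decile-to-quintile mapping and the "Overall" pooling rule can be sketched as follows. This is an illustrative Python sketch rather than the pipeline's code; the names `quintile_label` and `in_class` are hypothetical.

```python
QUINTILE_LABELS = ["Bottom 20%", "20–40%", "40–60%", "60–80%", "Top 20%"]

def quintile_label(decile):
    """Map a decile (1-10) to its quintile band: deciles 1-2 to the first band, and so on."""
    if not 1 <= decile <= 10:
        raise ValueError(f"decile must be in 1..10, got {decile}")
    return QUINTILE_LABELS[(decile - 1) // 2]

def in_class(decile, cl):
    """True if a household in this decile belongs to class `cl` ("Overall" pools everyone)."""
    return cl == "Overall" or quintile_label(decile) == cl

# Toy decile values (hypothetical).
deciles = [1, 2, 5, 6, 9, 10]
middle = [d for d in deciles if in_class(d, "40–60%")]
```

Whether the decile passed in is the state-level or the all-India one is decided upstream, so the same mapping serves both kinds of analysis.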

3.2.3 Real vs Nominal Adjustment

When analysis = "real", MPCE is deflated by a general price index to express expenditure in constant (real) terms, enabling meaningful cross-region and cross-time comparisons. When analysis = "no" (nominal), MPCE is used as reported.
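
A minimal sketch of this switch is given below. It assumes that deflation means dividing MPCE by a price index normalised to 1.0 in the base; the actual normalisation of the index is not stated in this note, and the function name `adjust_mpce` is hypothetical.

```python
def adjust_mpce(mpce, price_index, analysis):
    """Return real MPCE (deflated) or nominal MPCE (as reported).

    Assumption: price_index equals 1.0 in the base, so real = nominal / index.
    """
    if analysis == "real":
        if price_index <= 0:
            raise ValueError("price_index must be positive")
        return mpce / price_index
    if analysis == "no":
        return mpce
    raise ValueError(f"unknown analysis mode: {analysis!r}")

# Toy values (hypothetical): prices 20% above the base level.
real_value = adjust_mpce(3000.0, 1.2, "real")
nominal_value = adjust_mpce(3000.0, 1.2, "no")
```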


3.3 Step 3: Compute Per-Capita Expenditure

The household composition data is merged with the item-level summary data via household ID. For each household-item pair, monthly per-capita expenditure is calculated as:

$$\text{Per-capita expenditure}_{h,i} = \frac{\text{Value}_{h,i}}{\text{HH\_Size}_{h}} \times \frac{1}{\text{Ref\_Period}_{i}} \times 30$$

where:

  1. $\text{Value}_{h,i}$ is the reported expenditure of household $h$ on item $i$ during the reference period
  2. $\text{HH\_Size}_{h}$ is the household size (AFE-adjusted or head-count)
  3. $\text{Ref\_Period}_{i}$ is the recall reference period in days (7 or 30) for item $i$
  4. Multiplying by 30 converts the daily rate to a monthly figure

Note: Item code "539" (total food expenditure summary row) is excluded to avoid double-counting. Rows with missing values are dropped before estimation.

Survey weights are adjusted to the population level by multiplying the household sampling weight by household size: w_pc = weights × fdq_hh_size.
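
As a worked illustration of the two calculations in this step, the sketch below computes the monthly per-capita value and the person-level weight for one household-item pair. It is written in Python for exposition (the pipeline itself runs in R); the function names and the toy figures are hypothetical.

```python
def monthly_per_capita(value, hh_size, ref_period_days):
    """(value / hh_size) * (1 / ref_period_days) * 30, as in the formula above."""
    if hh_size <= 0 or ref_period_days <= 0:
        raise ValueError("hh_size and ref_period_days must be positive")
    return value / hh_size / ref_period_days * 30

def person_weight(hh_weight, hh_size):
    """w_pc: household sampling weight scaled to the number of persons represented."""
    return hh_weight * hh_size

# Toy household (hypothetical): 140 rupees spent on an item over a 7-day recall,
# household size 4, household sampling weight 250.
pc_exp = monthly_per_capita(140.0, 4, 7)
w_pc = person_weight(250.0, 4)
```

In this example the household spends 35 per person over 7 days, i.e. 5 per person per day, which scales to 150 per person per month.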


3.4 Step 4: Survey-Weighted Estimation

All estimates are computed within a complex survey design framework using the survey and srvyr R packages[4], accounting for stratified multi-stage sampling with PSU-level clustering. The option survey.lonely.psu = "adjust" is set to handle strata with a single PSU via centring adjustments[5].

3.4.1 Item-Level Expenditure Totals

Using svyby(), survey-weighted population totals of per-capita expenditure are estimated for each unique combination of survey round, state, sector, expenditure class, and item code. Because per-capita values are combined with person-level weights, these totals represent the aggregate monthly food expenditure attributable to each item in the target population.
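
The point estimate behind a weighted total is simply the sum of weight times value within each group. The Python sketch below shows only that point estimate for a single cell; it does not reproduce the design-based standard errors that svyby() provides, and it assumes the person-level weight w_pc is the weight applied. The field names and item codes are hypothetical.

```python
from collections import defaultdict

def weighted_item_totals(rows):
    """Point estimate of the population total per item: sum of weight * value."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["item_code"]] += row["w_pc"] * row["pc_exp"]
    return dict(totals)

# Toy rows for one round x state x sector x class cell (all values hypothetical).
rows = [
    {"hhid": "A", "item_code": "101", "w_pc": 100.0, "pc_exp": 150.0},
    {"hhid": "B", "item_code": "101", "w_pc": 200.0, "pc_exp": 50.0},
    {"hhid": "A", "item_code": "160", "w_pc": 100.0, "pc_exp": 20.0},
]
item_totals = weighted_item_totals(rows)
```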

3.4.2 Mean MPCE

Separately, the survey-weighted mean MPCE is computed at the household level. The merged data is first deduplicated to one row per household, so that households reporting more items do not count more than once. This provides the average standard of living for the relevant subgroup and is attached to the final output for contextual reference.
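
The effect of deduplication can be seen in a small Python sketch. It is illustrative only: the field names (`hhid`, `weight`, `mpce`) are hypothetical, and which weight variable is used for this estimate is an assumption.

```python
def weighted_mean_mpce(rows):
    """Weighted mean of MPCE using one (weight, mpce) pair per household."""
    per_household = {}
    for row in rows:
        # Keep the first row seen for each household; later item rows are ignored.
        per_household.setdefault(row["hhid"], (row["weight"], row["mpce"]))
    total_weight = sum(w for w, _ in per_household.values())
    return sum(w * m for w, m in per_household.values()) / total_weight

# Toy merged data (hypothetical): household A has three item rows, B has one.
rows = [
    {"hhid": "A", "weight": 100.0, "mpce": 2000.0},
    {"hhid": "A", "weight": 100.0, "mpce": 2000.0},
    {"hhid": "A", "weight": 100.0, "mpce": 2000.0},
    {"hhid": "B", "weight": 300.0, "mpce": 4000.0},
]
mean_mpce = weighted_mean_mpce(rows)
```

With deduplication the mean is 3500; averaging over all four rows would instead give 3000, because household A would be counted three times.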


3.5 Step 5: Proportional Decomposition

3.5.1 Two-Level Hierarchy

The final step converts absolute expenditure totals into the proportional shares that drive the Sankey visualisation:

  1. Item-level share: each item's expenditure total is divided by the sum of all item expenditures within the same group (survey round × state × sector × class):
    $\text{prop}_{i} = \dfrac{\text{expenditure}_{i}}{\sum_{j} \text{expenditure}_{j}}$
  2. Broad-category share: item-level shares are aggregated to their parent food group via the broad_category_f mapping from the item reference table:
    $\text{broad\_prop}_{c} = \sum_{i \in c} \text{prop}_{i}$

Both shares are formatted as percentage strings (e.g., "23.5%", "3.14%") and combined with human-readable labels to produce the final node names used by the Sankey chart — for example, "Cereals & Millets (32.1%)" at the broad level and "Rice — PDS (8.47%)" at the item level.
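
The two-level decomposition and the label formatting can be sketched in Python as below. This is an illustration, not the pipeline's code: the function names are hypothetical, and the number of decimals (one for broad categories, two for items) is an assumption inferred from the examples above.

```python
from collections import defaultdict

def decompose(totals, item_to_category):
    """Return (item shares, broad-category shares) from item-level totals."""
    grand_total = sum(totals.values())
    prop = {item: total / grand_total for item, total in totals.items()}
    broad_prop = defaultdict(float)
    for item, share in prop.items():
        broad_prop[item_to_category[item]] += share
    return prop, dict(broad_prop)

def node_label(name, share, decimals):
    """Format a Sankey node name, e.g. 'Vegetables (12.5%)'."""
    return f"{name} ({share * 100:.{decimals}f}%)"

# Toy totals for one group (hypothetical values).
totals = {"Rice": 30.0, "Wheat": 20.0, "Toned milk": 50.0}
item_to_category = {
    "Rice": "Cereals & Millets",
    "Wheat": "Cereals & Millets",
    "Toned milk": "Milk & Milk Products",
}
prop, broad_prop = decompose(totals, item_to_category)
broad_label = node_label("Cereals & Millets", broad_prop["Cereals & Millets"], 1)
item_label = node_label("Rice", prop["Rice"], 2)
```

Item shares sum to 1 within a group, so the broad-category shares, being sums of item shares, also sum to 1.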


4. Run Grid & Parallelisation

4.1 Full Factorial Grid

The pipeline constructs a full factorial crossing of all parameter dimensions:

Dimension                   | Values
Survey round (n)            | HCES2011, HCES2023
State (st)                  | "00" (All-India) + all state codes present in each round
Expenditure class (cl)      | Overall, Bottom 20%, 40–60%, Top 20%
Standardisation (level)     | AFE, Non-AFE
Price adjustment (analysis) | Real, Nominal
Sector (sect)               | 1 (Rural), 2 (Urban)

4.2 Validity Filtering

Not every state-sector combination exists in every survey round (e.g., newly created states or union territories). The grid is inner-joined against the set of actually observed (state, sector) pairs in each survey, dropping invalid combinations before execution. This avoids wasted computation on empty slices.
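
Grid construction and validity filtering can be sketched together in Python. The sketch is illustrative: `build_grid` is a hypothetical name, the state codes and observed pairs are toy values, and treating the all-India code "00" as valid in every sector is an assumption.

```python
from itertools import product

def build_grid(states_by_round, observed, classes, levels, analyses, sectors=(1, 2)):
    """Cross all dimensions, keeping only (state, sector) pairs seen in each round."""
    grid = []
    for n, states in states_by_round.items():
        for st, sect, cl, level, analysis in product(
            ["00"] + states, sectors, classes, levels, analyses
        ):
            # Assumption: the all-India code "00" is valid in every sector.
            if st == "00" or (st, sect) in observed[n]:
                grid.append({"n": n, "st": st, "sect": sect, "cl": cl,
                             "level": level, "analysis": analysis})
    return grid

# Toy inputs (hypothetical): state "36" appears only in the later round, rural only.
states_by_round = {"HCES2011": ["09"], "HCES2023": ["09", "36"]}
observed = {
    "HCES2011": {("09", 1), ("09", 2)},
    "HCES2023": {("09", 1), ("09", 2), ("36", 1)},
}
grid = build_grid(
    states_by_round, observed,
    classes=["Overall", "Bottom 20%", "40–60%", "Top 20%"],
    levels=["AFE", "Non-AFE"],
    analyses=["real", "no"],
)
```

Filtering while the grid is built has the same effect as an inner join against the observed pairs: combinations that would hit an empty slice are never scheduled.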

4.3 Execution Strategy

To manage memory and maximise throughput, the pipeline processes one survey round at a time:

  1. Prepare: load and partition data for the current survey round
  2. Parallelise: distribute the grid subset across workers using future_pmap() from the furrr package, with explicit globals to avoid shipping the entire environment
  3. Collect: bind non-null results and shut down workers
  4. Repeat for the next survey round

This batch-by-survey design lowers peak memory relative to processing both rounds simultaneously, since only one survey's data is in scope at a time.
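
The round-by-round execution pattern can be sketched as follows. This Python sketch uses a thread pool from the standard library as a stand-in for the furrr workers; it is not the pipeline's code, and `run_round`, `run_all`, `load_partitions`, and `worker` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def run_round(grid_rows, partitions, worker, max_workers=2):
    """Run one survey round; each task receives only its own partition slice."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, row, partitions[row["key"]]) for row in grid_rows]
        results = [f.result() for f in futures]
    # Collect: keep non-null results only; the pool is shut down on exit.
    return [r for r in results if r is not None]

def run_all(rounds, load_partitions, grid, worker):
    """Process one round at a time so only one survey's data is held at once."""
    output = []
    for n in rounds:
        partitions = load_partitions(n)
        rows = [g for g in grid if g["n"] == n]
        output.extend(run_round(rows, partitions, worker))
        del partitions  # release this round's data before loading the next
    return output

# Toy stand-ins (hypothetical) for data loading and the per-cell computation.
def load_partitions(n):
    return {"00_1": [1, 2, 3]} if n == "HCES2011" else {"00_1": [4, 5]}

def worker(row, rows_in_slice):
    if row["cl"] == "skip":
        return None
    return {"n": row["n"], "count": len(rows_in_slice)}

grid = [
    {"n": "HCES2011", "key": "00_1", "cl": "Overall"},
    {"n": "HCES2023", "key": "00_1", "cl": "Overall"},
    {"n": "HCES2023", "key": "00_1", "cl": "skip"},
]
results = run_all(["HCES2011", "HCES2023"], load_partitions, grid, worker)
```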


5. Output Structure

The final dataset contains one row per item × subgroup combination, with the following fields:

Column     | Description
nss        | Survey round identifier (HCES2011 or HCES2023)
class      | Expenditure quintile label or "Overall"
sector     | "Rural" or "Urban"
broad_new  | Broad food category with its expenditure share, e.g., "Cereals & Millets (32.1%)"
item_code  | Numeric item code from the HCES schedule
item_new   | Item name with its expenditure share, e.g., "Rice — PDS (8.47%)"
prop       | Item-level expenditure proportion (numeric, 0–1)
mpce       | Survey-weighted mean MPCE for the subgroup
level      | "AFE" or "Non AFE"
analysis   | "real" or "no" (nominal)
state_name | State/UT name
region     | Geographic region (Northern, Eastern, etc.)
state_type | State or Union Territory classification

The output is saved as both CSV and RDS formats for downstream consumption by the Sankey visualisation dashboard.


6. Limitations

  1. Recall bias: HCES relies on household recall over 7-day or 30-day reference periods, which may under- or over-estimate consumption of infrequently purchased items.
  2. Away-from-home food: Expenditure on cooked meals and beverages consumed outside the home (restaurants, street food, canteens) is explicitly excluded. This means the Sankey diagram reflects only home-consumption food spending, not total food expenditure.
  3. Intra-household distribution: The per-capita (or per-AFE) figures assume equal distribution of food within the household, which may not hold in practice.
  4. Price variation within items: The proportional shares reflect expenditure, not quantity. Two households spending the same share on "milk" may consume very different quantities if prices differ by region or quality grade.
  5. Quintile composition effects: When comparing across survey rounds, the households constituting each quintile may differ in composition due to overall income growth, urbanisation, and demographic shifts.

Appendix: Glossary of Terms

Adult Female Equivalent (AFE): A standardisation method that converts household members into equivalent adult females based on age- and sex-specific energy requirements, enabling like-for-like comparisons of per-person food consumption across households of different compositions.

Broad Category: A high-level food group (e.g., Cereals & Millets, Pulses & Beans, Milk & Milk Products) that aggregates multiple individual food items. These form the first-level nodes in the Sankey diagram.

Decile: Division of the population into 10 equal groups based on MPCE. Decile 1 = poorest 10%, Decile 10 = richest 10%.

HCES: Household Consumer Expenditure Survey (titled Household Consumption Expenditure Survey in the 2023-24 round), conducted by the National Sample Survey Office (NSSO) under the Ministry of Statistics and Programme Implementation (MoSPI).

MPCE: Monthly Per-Capita Expenditure — total household expenditure divided by household size (head-count or AFE-adjusted), expressed per month.

PSU: Primary Sampling Unit — the first-stage unit in the survey's stratified multi-stage sampling design (typically a village in rural areas or an urban block in urban areas).

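Reference Period: The recall window over which households report consumption. For food, HCES uses a 7-day recall for items such as edible oil, eggs, fish and meat, vegetables, fruits, spices, beverages, and processed foods, and a 30-day recall for the remaining food items such as cereals, pulses, milk and milk products, sugar, and salt. The period applied to each item in this pipeline is taken from the ref_period field of the item reference table.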

Sankey Diagram: A flow visualisation where the width of each link is proportional to the quantity it represents. In this context, total food expenditure flows into broad categories and then into individual items.

Sector: Rural or Urban classification as defined by the Census of India.

Survey Weights: Statistical multipliers assigned to each sampled household to make the sample representative of the target population, accounting for differential selection probabilities and non-response adjustments.


Last updated: February 2026

[1] National Sample Survey Office (2013). Household Consumer Expenditure Survey 2011-12 (68th Round). Ministry of Statistics and Programme Implementation, Government of India.
[2] Ministry of Statistics and Programme Implementation (2024). Household Consumption Expenditure Survey 2023-24. Government of India.
[3] Indian Council of Medical Research – National Institute of Nutrition (2020, updated 2024). Nutrient Requirements for Indians: A Report of the Expert Group. ICMR-NIN, Hyderabad.
[4] Lumley, T. (2023). survey: Analysis of Complex Survey Samples. R package version 4.2. https://CRAN.R-project.org/package=survey
[5] The "adjust" option centres the single-PSU stratum contribution at the grand mean, following the approach recommended in Lumley (2004), Analysis of Complex Survey Samples, Journal of Statistical Software 9(8).