New Features of Population Synthesis: PopSyn III of CT-RAMP

Peter Vovsha, Jim Hicks, Binny Paul, PB
Vladimir Livshits, Kyunghwi Jeon, Petya Maneva, MAG

1. MOTIVATION & STATEMENT OF INNOVATIONS

Previous Generation of Population Synthesizers
Problem formulation:
- Create a list of HHs in each TAZ from a sample (PUMS)
- Match the given controls
3 steps:
- (Balance) Multidimensional HH distribution in each TAZ
- (Discretize) List of HHs with controlled variables in each TAZ
- (Draw) Randomly join HHs from PUMS by controlled variables
Limitations:
- No single theoretical framework / no guarantee of a unique solution / no method to compare different solutions
- Difficult to handle both HH and person characteristics
- Each HH characteristic has to be presented as a distribution; no convenient way to introduce general tendencies

State of the Art
- Analytical methods that balance a list (or sample) of HHs to meet the controls imposed at some level of geography (TAZ) [Bar-Gera et al., 2012; Ye et al., 2009]:
  - PopSyn III belongs to this family
- Combinatorial methods based on random swapping of HHs between TAZs if the fit measures can be improved [Abraham et al., 2012; Harland et al., 2012]

Our Contribution
- General formulation of convergence of the balancing procedure with imperfect (i.e. not fully consistent) controls:
  - Guarantees a unique, repeatable solution
  - Screens inconsistencies and addresses a differential degree of confidence in different controls
- Optimized discretizing of the fractional outcomes of the balancing procedure to form a list of discrete households:
  - Enhanced spatial resolution and a growing number of controls make rounding errors substantial w/simple bucket rounding
  - Linear Programming (LP) approach to optimize the discretized weights and preserve the best possible match to the controls
  - Eliminates Monte-Carlo simulation error; all procedures are analytical and repeatable
- Multiple levels of geography where the controls can be set:
  - Important demographic and socio-economic trends can only be translated into more aggregate controls than a TAZ-level control
  - In the new generation of ABMs all location choices are modeled at the level of Micro-Analysis Zones (MAZs) nested within TAZs

2. CORE LIST BALANCING

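List of Individual HHs
Columns: HH ID (n); HH size 1, 2, 3, 4+ (controls i = 1 to 4); person age 0-15, 16-35, 36-64, 65+ (controls i = 5 to 8); HH initial weight

HH ID    Nonzero entries, in column order    Initial weight
n = 1    1, 1                                20
n = 2    1, 1, 1                             20
n = 3    1, 1, 2                             20
n = 4    1, 2, 2                             20
n = 5    1, 1, 3, 2                          20

Control i    1      2      3      4      5      6      7      8
Category     size 1 size 2 size 3 size 4+ 0-15  16-35  36-64  65+
Control      100    200    250    300    400    400    650    250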

Basic Formulation of List Balancing w/fixed Controls
- Preserve initial weights as much as possible
- Meet all controls
- Convex mathematical program with linear constraints
- Solution can be found by forming the Lagrangian and equating partial derivatives to zero (necessary conditions)
- Conventional matrix balancing or table balancing are particular cases
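The formulas for this slide did not survive in the text. A hedged sketch of one formulation consistent with the bullets, assuming the entropy-type objective that the `ln` terms on the discretizing slide suggest: x_n is the balanced weight of household n, w_n its initial weight, a_{ni} its contribution to control i, A_i the control total, and \alpha_i the Lagrange multiplier of control i. The symbols x_n, w_n and \alpha_i are assumed names.

```latex
\min_{x_n \ge 0} \; \sum_n x_n \ln\frac{x_n}{w_n}
\quad \text{s.t.} \quad \sum_n a_{ni}\, x_n = A_i \quad (\alpha_i)

% Lagrangian and first-order (necessary) conditions:
L = \sum_n x_n \ln\frac{x_n}{w_n} - \sum_i \alpha_i \Big( \sum_n a_{ni}\, x_n - A_i \Big),
\qquad
\frac{\partial L}{\partial x_n} = \ln\frac{x_n}{w_n} + 1 - \sum_i \alpha_i\, a_{ni} = 0

% Multiplicative form of the solution, with scale k = e^{-1}:
x_n = k \, w_n \prod_i \big( e^{\alpha_i} \big)^{a_{ni}}
```

Under this reading each control contributes one balancing factor e^{\alpha_i}, raised to the household's incidence a_{ni}. With 0/1 incidences and two sets of categories this reduces to conventional table balancing.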

Solution
- Does not guarantee existence of the solution (feasibility of constraints)
- Scale k is incorporated in balancing factors only if the total weight (number of HHs) is predefined by controls
- If constraints are feasible and total weight (number of HHs) is predefined, a solution exists, it is unique, and it is independent of the scale of initial weights
- Can be found by Newton-Raphson, but a simple balancing method also works well
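A minimal sketch of the "simple balancing method", assuming 0/1 incidences so that each control is matched by proportionally scaling the weights of the households it covers. The toy household list, the control values and the function name `simple_balance` are hypothetical. The two runs illustrate that, when the controls fix the total number of HHs, the result does not depend on the scale of the initial weights.

```python
def simple_balance(incidence, controls, init_weights, n_iter=200):
    """Cycle through the controls, rescaling the weights of contributing HHs."""
    weights = list(init_weights)
    for _ in range(n_iter):
        for i, target in enumerate(controls):
            current = sum(w * row[i] for w, row in zip(weights, incidence))
            if current > 0:
                factor = target / current
                # 0/1 incidence: only HHs that fall in category i are rescaled
                weights = [w * factor if row[i] else w
                           for w, row in zip(weights, incidence)]
    return weights


# Hypothetical toy list; columns: size 1, size 2+, low income, high income
incidence = [(1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1)]
controls = [40, 60, 30, 70]  # both pairs sum to 100 HHs, so they are consistent

w_a = simple_balance(incidence, controls, [1, 2, 3, 4])
w_b = simple_balance(incidence, controls, [10, 20, 30, 40])  # same list, scaled x10

totals = [sum(w * row[i] for w, row in zip(w_a, incidence))
          for i in range(len(controls))]
print([round(w, 3) for w in w_a])
print([round(t, 3) for t in totals])
```

With non-binary incidences (for example person counts per HH) simple proportional scaling is no longer exact, which is one reason to prefer the Newton-Raphson variant.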

Relaxation of Controls
- Objective function:
- Match relaxed controls:
- HH weights and relaxation factors:
- Importance factors for controls:
  - Large value of 1,000 to ensure match if feasible
  - 1,000,000 for total number of HHs
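The formulas behind these labels are missing from the text. A hedged sketch of one form they could take, assuming a relaxation factor z_i per control (z_i = 1 means the control is matched exactly) and an importance factor \mu_i that penalizes departures of z_i from 1. The symbol names and the exact penalty term are assumptions, not recovered from the slide.

```latex
% Objective: stay close to the initial weights, penalize relaxation
\min_{x_n \ge 0,\; z_i \ge 0} \; \sum_n x_n \ln\frac{x_n}{w_n} + \sum_i \mu_i \, z_i \ln z_i

% Match relaxed controls
\text{s.t.} \quad \sum_n a_{ni}\, x_n = A_i \, z_i
```

The larger \mu_i is, the more costly it is to move z_i away from 1, which is consistent with using 1,000 for ordinary controls and 1,000,000 for the total number of HHs.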

Solution w/relaxation
- Guarantees existence of the solution (regardless of feasibility of constraints)
- If constraints are feasible and importance factors are large, the solution is equivalent to the solution of the problem w/o relaxation
- Newton-Raphson method to calculate balancing factors efficiently
- Relaxation of constraints included in the loop with adjustment of weights

Iterative Application of Newton-Raphson Method
For each iteration
    For each control
        // Step 1: Calculate balancing factors
        // Step 2: Update HH weights
        // Step 3: Update relaxation factors
    End of loop over controls
    // Step 4: Check for convergence
End of loop over iterations
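The update formulas for Steps 1 to 3 are not in the text, so the following is only a sketch of the loop structure under assumed formulas: a one-step Newton-type balancing factor `gamma` per control, weights multiplied by `gamma ** incidence`, and the relaxation factor multiplied by `(1 / gamma) ** (1 / importance)`. The function name, the toy data and the convergence test on weight changes are hypothetical.

```python
def balance_with_relaxation(incidence, controls, init_weights, importance,
                            max_iter=5000, tol=1e-10):
    """List balancing with relaxed controls (assumed update formulas)."""
    weights = list(init_weights)
    relax = [1.0] * len(controls)          # relaxation factors, start at 1
    for _ in range(max_iter):
        previous = list(weights)
        for i, (target, mu) in enumerate(zip(controls, importance)):
            xx = sum(w * row[i] for w, row in zip(weights, incidence))
            yy = sum(w * row[i] ** 2 for w, row in zip(weights, incidence))
            if xx <= 0:
                continue                   # no household contributes to control i
            relaxed = target * relax[i]
            # Step 1: balancing factor from one Newton-type step (assumed form)
            gamma = 1.0 - (xx - relaxed) / (yy + relaxed / mu)
            # Step 2: update HH weights
            weights = [w * gamma ** row[i] for w, row in zip(weights, incidence)]
            # Step 3: update relaxation factor (assumed form)
            relax[i] *= (1.0 / gamma) ** (1.0 / mu)
        # Step 4: check for convergence on the largest weight change
        if max(abs(w - p) for w, p in zip(weights, previous)) < tol:
            break
    return weights, relax


# Hypothetical toy list: column 0 counts the HH itself, column 1 its persons
incidence = [(1, 1), (1, 2), (1, 3), (1, 4)]
controls = [50, 140]                       # 50 HHs, 140 persons (feasible)
importance = [1_000_000, 1_000]            # values quoted on the relaxation slide

weights, relax = balance_with_relaxation(incidence, controls, [10.0] * 4, importance)
hh_total = sum(w * row[0] for w, row in zip(weights, incidence))
person_total = sum(w * row[1] for w, row in zip(weights, incidence))
print(round(hh_total, 3), round(person_total, 3), [round(z, 5) for z in relax])
```

With consistent controls and large importance factors the relaxation factors stay very close to 1, so the weights nearly reproduce the fixed-control solution.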

How Relaxation Works
- If the controls are consistent:
  - Algorithm performs exactly as the balancing algorithm w/fixed constraints and yields the same solution
- If the controls are internally inconsistent:
  - Balancing w/o relaxations does not converge at all
  - Balancing w/relaxations produces a unique convergent solution w/controls satisfied to the extent possible
  - Degree of the necessary relaxation of each control is inversely related to the importance weight

3. DISCRETIZING

Discretizing is Not Trivial
- Discretizing is not a trivial problem:
  - Population is synthesized at a fine level of spatial resolution (30,000-40,000 MAZs)
  - Balancing results in many small fractional numbers
- Simple rounding may cause substantial deviations from the controls:
  - Deviations accumulate across multiple MAZs
  - If rounding is forced to match controls exactly, it may cause significant deviation from the distribution of initial weights
- The discretizing problem can be formulated as replacing the fractional household weights with integer weights that:
  - Preserve controls as well as possible
  - Achieve uniformity of HH weights to the maximum extent

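Discretizing as LP Problem
Objective function: discrete weights as close as possible to the original fractional weights:

$$\min \sum_n \begin{cases} -\ln x_n, & \text{if } y_n = 1 \\ 0, & \text{if } y_n = 0 \end{cases}$$

Subject to constraints matching residual controls:

$$\sum_n a_{ni}\, y_n = A_i, \qquad y_n \in \{0, 1\}$$

Equivalent form:

$$\max \sum_n y_n \ln x_n$$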
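A toy sketch of the discretizing step, assuming that the integer part of each balanced weight is kept and only the residual fraction is rounded to 0 or 1. Exhaustive enumeration is swapped in for an LP solver here purely to keep the example dependency-free; it is only workable for a handful of households. The function name and the data are hypothetical.

```python
import math
from itertools import product


def discretize(weights, incidence, controls):
    """Keep integer parts; choose y_n in {0, 1} for residuals by maximizing sum(y_n * ln x_n)."""
    int_part = [math.floor(w) for w in weights]
    resid = [w - k for w, k in zip(weights, int_part)]
    n_controls = len(controls)
    # Residual controls: what is left after the integer parts are counted
    resid_controls = [
        controls[i] - sum(k * row[i] for k, row in zip(int_part, incidence))
        for i in range(n_controls)
    ]
    free = [n for n, r in enumerate(resid) if r > 0]  # HHs with a fractional part
    best_score, best_y = -math.inf, None
    for bits in product((0, 1), repeat=len(free)):
        y = [0] * len(weights)
        for n, bit in zip(free, bits):
            y[n] = bit
        matched = all(
            sum(y[n] * incidence[n][i] for n in free) == resid_controls[i]
            for i in range(n_controls)
        )
        if not matched:
            continue
        score = sum(math.log(resid[n]) for n in free if y[n])
        if score > best_score:
            best_score, best_y = score, y
    if best_y is None:
        raise ValueError("residual controls cannot be matched exactly")
    return [k + y for k, y in zip(int_part, best_y)]


# Hypothetical balanced weights; columns: category A, category B
balanced = [2.6, 1.4, 0.7, 0.3]
incidence = [(1, 0), (1, 0), (0, 1), (0, 1)]
controls = [4, 1]

integer_weights = discretize(balanced, incidence, controls)
print(integer_weights)
```

In each category the household with the larger residual fraction is rounded up, because it contributes the larger (less negative) `ln x_n` term.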

4. MULTIPLE LEVELS OF GEOGRAPHY

Multiple Levels of Geography Needed for Setting Controls
- Important demographic & socio-economic trends:
  - Can only be translated into more aggregate controls than TAZ-level
  - Handled by upward meta-balancing
- New generation of CT-RAMP ABMs operates with an enhanced level of spatial resolution:
  - Location choices are modeled at the level of Micro-Analysis Zones (MAZs) nested within TAZs
  - Handled by downward allocation

Workers-Jobs Balance
- Generated workers by industry should correspond to job segmentation by industry:
  - Regional level
- Discrepancies eliminated:
  - (Standard way) regional normalization of #jobs by industry to match #workers
  - (Suggested way) adding workers-by-industry meta-control to PopSyn

5. UPWARD META-BALANCING

Decomposition for Meta-Balancing
- Meta-controls can be written rigorously as an extension of the core List Balancing procedure:
  - HH weights optimized simultaneously for all TAZs in the region, accounting for controls at TAZ level as well as upper levels of geography
  - Useful for theoretical analysis but impractical due to huge dimensionality
- Thus, the problem has to be decomposed

Meta-Balancing
[Diagram. Boxes:]
- HH distributions by size, income, #workers for each TAZ
- Worker distribution by industry for each TAZ
- Balance individual HHs in each TAZ
- Workers by industry for each TAZ
- Worker distribution by industry for each county
- Balance workers by industry for each TAZ in county

5. DOWNWARD ALLOCATION FROM PUMA TO TAZ & FROM TAZ TO MAZ

Allocation Procedure
- Allocate HH weights generated from any upper level of geography to the lower level of geography:
  - PUMA to TAZs
  - TAZ to MAZs
- Balancing-and-discretizing procedure applied sequentially to each MAZ in the TAZ:
  - MAZs from smallest to biggest in terms of #HHs
  - TAZ-level HH weights as initial weights
  - Residual weights are adjusted w/o replacement
  - Total weight summed across the MAZs is matched to the original TAZ-level weight for each HH
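A simplified sketch of sequential allocation without replacement, under strong assumptions: each MAZ has a single control (its total #HHs), balancing is reduced to proportional scaling of the remaining TAZ weights, and largest-remainder rounding stands in for the LP discretizing step. The function name and data are hypothetical. The point is only the bookkeeping: MAZs are processed from smallest to biggest and each household's allocations add up to its TAZ-level weight.

```python
def allocate_to_mazs(taz_weights, maz_totals):
    """taz_weights: integer TAZ-level weight per HH.
    maz_totals: {maz_id: #HHs}; the totals must sum to sum(taz_weights)."""
    remaining = list(taz_weights)
    allocation = {}
    # Process MAZs from smallest to biggest in terms of #HHs
    for maz, total in sorted(maz_totals.items(), key=lambda item: item[1]):
        pool = sum(remaining)
        if pool == 0:
            allocation[maz] = [0] * len(remaining)
            continue
        # "Balance": scale the remaining weights to the MAZ total
        shares = [r * total / pool for r in remaining]
        ints = [int(s) for s in shares]
        # "Discretize": round up the largest fractional remainders
        shortfall = total - sum(ints)
        order = sorted(range(len(shares)),
                       key=lambda n: shares[n] - ints[n], reverse=True)
        for n in order[:shortfall]:
            ints[n] += 1
        allocation[maz] = ints
        # Without replacement: subtract what this MAZ took
        remaining = [r - k for r, k in zip(remaining, ints)]
    return allocation


taz_weights = [5, 3, 2]                      # three HHs, 10 HHs in the TAZ
maz_totals = {"A": 2, "B": 3, "C": 5}        # hypothetical MAZ controls

allocation = allocate_to_mazs(taz_weights, maz_totals)
per_hh = [sum(allocation[m][n] for m in allocation) for n in range(len(taz_weights))]
print(allocation, per_hh)
```

Because the last (largest) MAZ receives whatever is left, the per-household totals match the TAZ-level weights by construction.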

Balancing & Discretizing (PUMA-Meta)
Controls by geography:
- PUMA: HH size (1, 2, 3, 4+); HH income (5 quintiles); housing type (1, 2); #university students
- County (MAG, PAG): workers by industry
Sample of HHs:
- HHs from PUMA w/replacement
List balancing:
- HHs balanced for PUMA with fractional weights
- # workers by industry for each PUMA
Meta balancing:
- Balanced # workers by industry for each PUMA
- HHs balanced for PUMA with fractional weights
Discretizing:
- HHs discretized for PUMA with weights ≥1
- HHs discretized for PUMA with residual weights <1 by LP

Balancing & Discretizing (PUMA-TAZ)
- TAZ controls: HH size (1, 2, 3, 4+); HH income (5 quintiles); housing type (1, 2); #university students
- HHs from PUMA w/o replacement
- HHs balanced for TAZ with fractional weights
- HHs discretized for TAZ with weights ≥1
- HHs discretized for TAZ with residual weights <1 by LP
- TAZs within PUMA are processed from smallest to biggest

Balancing & Discretizing (TAZ-MAZ)
- MAZ controls: HH size (1, 2, 3, 4+); HH income (5 quintiles); #university students
- HHs from TAZ w/o replacement
- HHs balanced for MAZ with fractional weights
- HHs discretized for MAZ with weights ≥1
- HHs discretized for MAZ with residual weights <1 by LP
- MAZs within TAZ are processed from smallest to biggest

PopSyn III: MAG Input Highlights
- Region: MAG (4 counties)
- No. of PUMAs in modeling region: 24
- No. of TAZs: 3,009
- No. of MAZs: 26,231
- Seed sample: PUMS (ACS 2006-10, 5% sample)
- Max expansion factor: 5
- Controls:
  - Total no. of HHs, very high importance, MAZ
  - HH size categories (1, 2, 3, 4+), medium importance, MAZ
  - Income quintiles, medium importance, MAZ
  - Housing type (single/multi-family), medium importance, MAZ
  - Person age categories (0-18, 19-35, 36-65, 66+), medium importance, MAZ
  - #Workers by industry type, medium importance, META (district)

Uniformity of HH Expansion Factors
- Ensured by initial balancing of HH weights at PUMA level:
  - Subsequent allocation to TAZ and MAZ w/o replacement preserves expansion
- Cannot be achieved by independent balancing of HH weights for each MAZ:
  - Results in very lumpy weights

6. POPSYN VALIDATION

Dimensions for PopSyn Validation
- PopSyn input:
  - Controls vs. sample (PUMA): substantial discrepancies can start here and are not necessarily wrong
- PopSyn output:
  - Matching controls (MAZ, TAZ, PUMA, Meta)
  - Uniformity of HH expansion factors (PUMA)
  - Uncontrolled variables vs. PUMA/Census

7. CONSISTENCY OF POPSYN INPUT

Control vs. PUMS, PUMA 103
[Figure: bar chart of percentage by HH size, PUMA (weighted) vs. Controls]

Meta-Controls vs. PUMS, MAG Region
[Figure: bar chart of percent by industry, MAG Region PUMS sample vs. Controls]

Reasons for Discrepancy
- Different years:
  - PUMS is a multi-year sample
  - Controls are set for a single base year
- Sampling error:
  - PUMS is a 5% sample
  - Controls reflect independent sources and/or LU model forecasts for the entire population
- Implications:
  - From the very beginning we cannot expect a full match; moreover, we may intentionally skew the synthetic population

8. MATCHING CONTROLS

Matching Controls at Regional Level

Variable         Output       Control      Difference
Population       4,126,093    4,127,042    -0.023%
Total HH         1,540,148    1,540,148     0.000%
HHsize1            398,856      398,970    -0.029%
HHsize2            491,707      491,661     0.009%
HHsize3            242,039      242,037     0.001%
HHsize4+           407,546      407,480     0.016%
Income1            267,742      267,719     0.009%
Income2            286,940      286,879     0.021%
Income3            305,579      305,551     0.009%
Income4            328,931      328,961    -0.009%
Income5            350,956      351,038    -0.023%
Age0018          1,169,137    1,169,202    -0.006%
Age1935          1,004,237    1,004,570    -0.033%
Age3665          1,515,752    1,516,213    -0.030%
Age66+             436,967      437,057    -0.021%
Single Family    1,278,184    1,278,227    -0.003%
Multi Family       261,964      261,921     0.016%
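The Difference column appears to be the output's percentage deviation from the control, (Output - Control) / Control. A short sketch that recomputes it for three rows of the regional table; the function name `pct_difference` is hypothetical.

```python
def pct_difference(output, control):
    """Percentage deviation of synthesized output from its control."""
    return 100.0 * (output - control) / control


# (output, control) pairs taken from the regional table
rows = {
    "Population": (4_126_093, 4_127_042),
    "Total HH": (1_540_148, 1_540_148),
    "HHsize4+": (407_546, 407_480),
}
for name, (output, control) in rows.items():
    print(f"{name:12s} {pct_difference(output, control):+.3f}%")
```

The same function applies unchanged to the PUMA-level and TAZ-level tables.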

Meta-Controls at Regional Level (NAICS Industry Codes)

Variable                   Output     Control    Difference
Agriculture                  7,553      7,552     0.013%
Extraction                   2,455      2,455     0.000%
Utility                     13,095     13,091     0.031%
Construction               135,067    135,002     0.048%
Manufacturing I              8,775      8,774     0.011%
Manufacturing II            26,010     26,003     0.027%
Manufacturing III           89,961     89,913     0.053%
Wholesale                   83,227     83,233    -0.007%
Retail I                   140,999    140,967     0.023%
Retail II                   74,694     74,678     0.021%
Transportation              35,508     35,485     0.065%
Transportation - Postal     15,382     15,377     0.033%
Information                 30,599     30,588     0.036%
Finance I                   95,247     95,194     0.056%
Finance II                  49,209     49,191     0.037%
Professional I             152,171    152,157     0.009%
Professional II              4,085      4,087    -0.049%
Professional III           145,610    145,619    -0.006%
Education                   91,847     91,778     0.075%
Medical                    122,123    122,032     0.075%
SCA                         36,814     36,805     0.024%
Entertainment              141,768    141,749     0.013%
Services                    71,053     71,018     0.049%
Administrative/Military    220,967    221,089    -0.055%

PUMA Level Matching Example (PUMA 101)

Variable         Output    Control    Difference
Total HH         72,776    72,776      0.000%
HHsize1          25,547    25,558     -0.043%
HHsize2          36,926    36,919      0.019%
HHsize3           4,672     4,671      0.021%
HHsize4+          5,631     5,628      0.053%
Income1          14,681    14,680      0.007%
Income2          16,522    16,518      0.024%
Income3          16,721    16,720      0.006%
Income4          15,103    15,106     -0.020%
Income5           9,749     9,752     -0.031%
Age0018          15,259    15,259      0.000%
Age1935          11,892    11,899     -0.059%
Age3665          45,908    45,933     -0.054%
Age66+           67,308    67,334     -0.039%
Single Family    69,225    69,227     -0.003%
Multi Family      3,551     3,549      0.056%

TAZ Level Matching Example (TAZ 108)

Variable         Output    Control    Difference
Total HH            821       821      0.000%
HHsize1             229       229      0.000%
HHsize2             279       279      0.000%
HHsize3             176       176      0.000%
HHsize4+            137       137      0.000%
Income1             252       252      0.000%
Income2             164       164      0.000%
Income3             195       195      0.000%
Income4             138       138      0.000%
Income5              72        72      0.000%
Age0018             394       395     -0.253%
Age1935             126       126      0.000%
Age3665             922       922      0.000%
Age66+              531       531      0.000%
Single Family       734       734      0.000%
Multi Family         87        87      0.000%

TAZ Level Matching Example (TAZ 2191)

Variable         Output    Control    Difference
Total HH          4,910     4,910      0.000%
HHsize1             797       843     -5.457%
HHsize2           3,602     3,669     -1.826%
HHsize3             248       248      0.000%
HHsize4+            263       150     75.333%
Income1             571       539      5.937%
Income2             686       643      6.687%
Income3           1,109     1,076      3.067%
Income4           1,182     1,161      1.809%
Income5           1,362     1,491     -8.652%
Age0018             725       368     97.011%
Age1935             701       398     76.131%
Age3665           3,645     3,887     -6.226%
Age66+            4,939     4,983     -0.883%
Single Family     4,659     4,657      0.043%
Multi Family        251       253     -0.791%

Why Can the Controls Not Be Matched for TAZ 2191?
- The reason is always a structural inconsistency between the controls themselves, as well as between the controls and the sample proportions:
  - Controls want few large HHs and few children
  - Controls want more retired HHs
  - Controls want more high-income HHs at the same time
- These controls cannot be fully reconciled, and PopSyn finds the best compromise solution

TAZ Level Matching Scatter Plot: [Output-Control], Total # HHs
[Figure: scatter plot; x-axis: difference between TAZ control and model output; y-axis: control]

TAZ Level Matching Scatter Plot: [Output-Control], Age: 19-35 years
[Figure: scatter plot; x-axis: difference between TAZ control and model output; y-axis: control]

9. UNIFORMITY OF HOUSEHOLD EXPANSION

Expansion Factor Distribution, PUMA 102
[Figure: histogram; x-axis: expansion factor range, 0 to 12 in bins of 0.5; y-axis: percentage]

10. UNCONTROLLED VARIABLES

Output vs. PUMS (Uncontrolled HH Distribution by #Workers), PUMA 102
[Figure: bar chart, PUMS vs. SynPop output; x-axis: 0, 1, 2, >=3 HH workers; y-axis: percentage]

Why Can We Not Match #Workers Exactly?
- #workers is correlated with HH size and income, and these controls do not match PUMS exactly
- Meta-controls by worker industry were intentionally set differently from PUMS

Output vs. Census, Uncontrolled Joint Distribution by HH size and #Workers, Regional Level

Variable               Output       Census       Difference
Total HH               1,540,010    1,535,588      0.288%
HH Workers = 0           396,440      381,978      3.786%
HH Workers = 1           620,011      623,439     -0.550%
HH Workers = 2           423,135      435,286     -2.791%
HH Workers >= 3           99,941       94,885      5.329%
HH Size = 1
  HH Workers = 0         177,942      172,196      3.337%
  HH Workers = 1         220,626      229,828     -4.004%
HH Size = 2
  HH Workers = 0         164,172      164,468     -0.180%
  HH Workers = 1         156,122      171,034     -8.719%
  HH Workers = 2         170,992      192,442    -11.146%
HH Size = 3
  HH Workers = 0          24,693       20,395     21.074%
  HH Workers = 1          88,894       83,316      6.695%
  HH Workers = 2          98,303       91,591      7.328%
  HH Workers >= 3         29,892       29,249      2.198%
HH Size >= 4
  HH Workers = 0          29,497       24,919     18.372%
  HH Workers = 1         154,041      139,261     10.613%
  HH Workers = 2         153,626      151,253      1.569%
  HH Workers >= 3         70,025       65,636      6.687%

Detailed Analysis at Census Tract Level
- Full output is available in spreadsheet format
- We intentionally contrast the best and worst cases of match:
  - Best cases are not necessarily right
  - Worst cases are not necessarily wrong; the explanation is a discrepancy between the controls and the sample itself

Output vs. Census, Uncontrolled Joint Distribution by HH size and #Workers, Census Tract: 4021000307

Variable               Output    Census    Difference
Total HH                1,030     1,036     -0.579%
HH Workers = 0            246       159     54.717%
HH Workers = 1            414       411      0.730%
HH Workers = 2            300       345    -13.043%
HH Workers >= 3            70       121    -42.149%
HH Size = 1
  HH Workers = 0           94        83     13.253%
  HH Workers = 1          131       147    -10.884%
HH Size = 2
  HH Workers = 0          106        76     39.474%
  HH Workers = 1          147       140      5.000%
  HH Workers = 2          137       179    -23.464%
HH Size = 3
  HH Workers = 0           37         -       -
  HH Workers = 1           86        71     21.127%
  HH Workers = 2          110        58     89.655%
  HH Workers >= 3          35        21     66.667%
HH Size >= 4
  HH Workers = 0            7         -       -
  HH Workers = 1           49        53     -7.547%
  HH Workers = 2           53       108    -50.926%
  HH Workers >= 3          35       100    -65.000%

Conclusions
- It is more debugging and/or analysis of controls and sample than PopSyn validation itself
- The procedure is analytical; there is no mystery or random outcome in it
- A very good match is comforting but not necessarily right; sometimes you intentionally want to skew the distribution
- If the match is not good it is not necessarily wrong, but the subsequent analysis is very important:
  - Are controls set inconsistently? (most frequent)
  - Is there a structural discrepancy between controls and sample, and was it intentional or derived? (also frequent)

Most Useful Next Step
- PopSyn is a mandatory component of any ABM
- It is also useful for supporting 4-step models
- Many MPOs (ARC, BMC, MAG, NMPO, Ottawa Trans) singled out the Population Synthesizer as a first step before full ABM development