Tree-based and GA tools for optimal sampling design
The R User Conference 2008, August 12-14, Technische Universität Dortmund, Germany
Marco Ballin, Giulio Barcaroli, Istituto Nazionale di Statistica (ISTAT)
Definition of the problem (1)
In a survey, the optimality of a stratified sample can be defined in terms of both of the following elements:
- total cost (unit cost per interview multiplied by the sample size);
- planned accuracy (expected sampling variance of the target estimates).
A sample design is acceptable if the expected sampling errors are below pre-defined limits and the costs are sustainable.
Definition of the problem (2)
Bethel (1985) proposed an algorithm that determines the total sample size and the allocation of units to strata so as to minimise costs under constraints on the precision of the estimates, in the multivariate case (more than one estimate).
Under this approach, the population stratification, i.e. the partition of the sampling frame obtained by cross-classifying units by means of the stratification variables, is given.
But stratification has a great impact on sampling variance and, in general, it should not be taken as given, but determined on the basis of the survey requirements.
Definition of the problem (3)
Our proposal is: given a population frame with p auxiliary variables X, and a sample survey with specific constraints on the accuracy of G target variables Y, jointly determine:
1. the best stratification (partition by means of the auxiliary variables) of this frame, and
2. the minimum sample size and the allocation of units to strata required to satisfy the constraints on the accuracy of the estimates.
This can be done by using search techniques (tree or genetic algorithm) to explore the possible solutions, i.e. the different possible stratifications, each of which is evaluated by means of the Bethel algorithm.
Bethel algorithm
The optimal multivariate allocation problem can be defined as the search for the minimum (with respect to $n_h$) of the linear function $C$ under the convex constraints

$$V(\hat{Y}_g) \le U_g, \qquad g = 1, \ldots, G.$$

Bethel suggested introducing the variable

$$x_h = \begin{cases} 1/n_h & \text{if } n_h \ge 1 \\ \infty & \text{otherwise} \end{cases}$$

so that the problem is equivalent to searching for the minimum of the convex function $C(x_1, \ldots, x_H)$ under the set of linear constraints

$$\sum_{h=1}^{H} N_h^2 S_{h,g}^2 \, x_h - \sum_{h=1}^{H} N_h S_{h,g}^2 \le U_g, \qquad g = 1, \ldots, G.$$

An algorithm that is proved to converge to the solution (if it exists) is provided by Bethel (and Chromy), by applying the Lagrange multipliers method to this problem.
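As an illustration of the allocation step, the following Python sketch solves the constrained problem above with a Chromy-style multiplicative update of the normalised Lagrange multipliers. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`bethel_chromy`, `variances`), the unit-cost default, and the exact form of the update are assumptions, and the sketch neither caps $n_h$ at $N_h$ nor rounds the result.

```python
import math

def variances(N, S2, n):
    """Sampling variance of each target estimate for allocation n.
    N[h]: stratum sizes, S2[g][h]: variance of Y_g in stratum h."""
    return [sum(N[h] ** 2 * S2[g][h] * (1.0 / n[h] - 1.0 / N[h])
                for h in range(len(N))) for g in range(len(S2))]

def bethel_chromy(N, S2, U, cost=None, iters=200, tol=1e-9):
    """Hypothetical sketch: minimise sum_h cost[h] * n[h] subject to
    V(Y_g) <= U[g] for every g. Returns real-valued n[h]."""
    H, G = len(N), len(U)
    c = cost or [1.0] * H            # assumption: unit cost per interview
    # Normalised coefficients: each constraint reads sum_h a[g][h]*x[h] <= 1
    a = [[N[h] ** 2 * S2[g][h] / (U[g] + sum(N[k] * S2[g][k] for k in range(H)))
          for h in range(H)] for g in range(G)]
    alpha = [1.0 / G] * G            # normalised multipliers, sum to 1
    for _ in range(iters):
        w = [sum(alpha[g] * a[g][h] for g in range(G)) for h in range(H)]
        root_l = sum(math.sqrt(c[h] * w[h]) for h in range(H))
        # Stationary point of the Lagrangian for the current multipliers
        x = [math.sqrt(c[h]) / (math.sqrt(w[h]) * root_l) for h in range(H)]
        # r[g] > 1 means constraint g is violated, r[g] < 1 means slack
        r = [sum(a[g][h] * x[h] for h in range(H)) for g in range(G)]
        denom = sum(alpha[g] * r[g] ** 2 for g in range(G))
        new_alpha = [alpha[g] * r[g] ** 2 / denom for g in range(G)]
        done = max(abs(p - q) for p, q in zip(alpha, new_alpha)) < tol
        alpha = new_alpha
        if done:
            break
    return [1.0 / xh for xh in x]

# Symmetric toy case: two strata of 100 units, two target variables.
N = [100, 100]
S2 = [[4.0, 1.0], [1.0, 4.0]]
U = [500.0, 500.0]
n = bethel_chromy(N, S2, U)          # both strata receive n_h = 50
```

In the toy case both constraints are active at the optimum, so `variances(N, S2, n)` returns the two upper bounds. With a single target variable the same function reduces to a Neyman-type allocation.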
Optimal stratification: the tree-based approach (1)
The tree-based approach was devised by Benedetti, Espa, Lafratta: "A tree-based approach to form strata in multi-purpose business surveys", Discussion Paper n. 5/2005, Università degli Studi di Trento.
The proposed procedure searches for the best stratification by generating a tree with a splitting rule such that, at any given level, the generating node is the one that maximises the decrease of the overall sample size from one level to the next.
Optimal stratification: the tree-based approach (2)
Given p auxiliary variables $X_1, \ldots, X_p$ in the frame, with domain sets $D_i = \{1, \ldots, m_i\}$ $(i = 1, \ldots, p)$, we can represent a solution by means of a vector

$$v = [v_1, \ldots, v_M]$$

of cardinality $M = \sum_{k=1}^{p} m_k$, whose elements $v_j$ can assume the values 1 or 0. If we set

$$j = \sum_{k=1}^{i-1} m_k + q$$

then we have

$$v_j = \begin{cases} 1 & \text{if the } q\text{-th value of the } i\text{-th variable is activated} \\ 0 & \text{otherwise} \end{cases}$$
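The index rule can be made concrete with a small Python sketch. The helper names (`position`, `encode`) and the domain sizes in the example are hypothetical, chosen only for illustration; indices are 1-based, as on the slide.

```python
def position(i, q, m):
    """1-based position j of the q-th value of the i-th variable,
    where m[k-1] is the number of values of variable X_k."""
    return sum(m[: i - 1]) + q

def encode(activated, m):
    """Build the 0/1 vector v from a set of (i, q) pairs (1-based)."""
    v = [0] * sum(m)
    for i, q in activated:
        v[position(i, q, m) - 1] = 1   # shift to Python's 0-based lists
    return v

# Three hypothetical variables with 3, 2 and 4 values, so M = 9.
m = [3, 2, 4]
v = encode({(1, 2), (3, 4)}, m)
# v == [0, 1, 0, 0, 0, 0, 0, 0, 1]
```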
Optimal stratification: the tree-based approach (3)
The tree-based algorithm is a sequence of four different steps.
Step 0 (initialisation): the node associated with the stratification characterised by a unique stratum, coinciding with the whole population, is the root of the tree (level k = 0), and is set as the generating node.
Step 1: from the generating node at level k, the child nodes of level (k+1) are generated by activating, in turn, a single value of the vector v = [v_1, ..., v_M] among those not yet activated.
Optimal stratification: the tree-based approach (4)
Step 2: at level (k+1), the overall sample size n is calculated with the Bethel-Chromy algorithm for each node in the level. The node with the minimum n is set as the generating node.
Step 3 (stopping rule): steps 1 and 2 are repeated until either
(a) the maximum acceptable number of strata has been reached (the activation of new values in the domains of the X variables increases the number of resulting strata), or
(b) the gain in terms of reduction of the overall sample size becomes negligible.
The best solution is then the one associated with the generating node of the previous level.
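Steps 0 to 3 amount to a greedy forward search over the vector v, which can be sketched in Python as follows. This is only a sketch: `tree_search`, `sample_size`, `n_strata`, `max_strata` and `min_gain` are hypothetical names, and the two callbacks are toy stand-ins for the Bethel-Chromy evaluation and for the count of strata implied by a given v.

```python
def tree_search(M, sample_size, n_strata, max_strata, min_gain):
    """Greedy search: at each level activate the single value of v
    that yields the smallest overall sample size."""
    current = (0,) * M                      # Step 0: root, one stratum
    current_n = sample_size(current)
    while True:
        # Step 1: children obtained by activating one more value
        children = [current[:j] + (1,) + current[j + 1:]
                    for j in range(M) if current[j] == 0]
        # Stopping rule (a): discard children with too many strata
        children = [c for c in children if n_strata(c) <= max_strata]
        if not children:
            break
        # Step 2: the child with minimum n is the candidate generating node
        best = min(children, key=sample_size)
        best_n = sample_size(best)
        # Stopping rule (b): negligible gain, keep the previous level's node
        if current_n - best_n < min_gain:
            break
        current, current_n = best, best_n
    return current, current_n

# Toy evaluation (assumption): each activated value lowers n by a fixed gain
# and doubles the number of strata.
gains = [30, 1, 20, 0.5]
toy_size = lambda v: 100 - sum(g * b for g, b in zip(gains, v))
toy_strata = lambda v: 2 ** sum(v)

best_v, best_n = tree_search(4, toy_size, toy_strata, max_strata=8, min_gain=5)
# best_v == (1, 0, 1, 0), best_n == 50.0
```

In the toy run the search activates the first and third values (gains of 30 and 20) and stops at the third level, where the best remaining gain is only 1.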
Optimal stratification: the tree-based approach (5)
[Figure: search tree over the vectors v. The root [0, ..., 0] is at Level 0; at Level 1, Level 2, ..., Level q one more value of v is activated, and at each level the node with the minimum n ("min n") becomes the generating node.]
Optimal stratification: the tree-based approach (6)
[Diagram of the procedure. Inputs: basic strata, precision constraints on estimates, parameters of execution. Processing: Tree, calling Bethel. Output: solution (output strata).]
Optimal stratification: the evolutionary approach (1)
The tree-based algorithm introduced previously yields a (relatively) fast solution.
This approach, however, may be subject to local minima.
It is therefore convenient to verify (and possibly improve) the resulting solution by subsequently applying a different algorithm of the evolutionary type, i.e. based on the genetic algorithm.
Optimal stratification: the evolutionary approach (2)
To be applied, a genetic algorithm requires two basic elements to be defined:
- a genetic representation of the solution domain;
- a fitness function to evaluate each solution.
In our problem, each solution can be represented by the vector v = [v_1, ..., v_M] already introduced in the tree-based approach, which identifies a particular stratification (partition) of the population frame.
The fitness of any given solution is evaluated by means of the Bethel algorithm, and is given by the minimum sample size required to satisfy the precision constraints on the sampling estimates.
Optimal stratification: the evolutionary approach (3)
The implemented genetic algorithm makes use of the genalg package (Willighagen 2005), and is based on the following steps.
Step 0 (initialisation): an initial set of t individuals (possible solutions) is randomly generated, possibly containing (as a "suggestion") the solution found by the tree-based approach; the fitness of each individual is evaluated.
Step 1: the next generation of individuals is generated by selecting the fittest ones of the current generation, and by applying the genetic operators crossover and mutation.
Step 2 (stopping rule): step 1 is iterated k times, then the best solution (the fittest, i.e. the one with the minimum sample size) is output.
Optimal stratification: the evolutionary approach (4)
- crossover: given two parents, a subset of chromosomes is exchanged between them;
- mutation: given the probability that an arbitrary chromosome may change from its original state to another (mutation chance), for each chromosome in an individual a random value is drawn in order to decide whether to change it or not.
The mutation chance strongly affects the speed of convergence: if convergence is too rapid, there is a risk of ending in a local minimum.
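The steps of the evolutionary approach can be sketched in plain Python as follows. This does not reproduce the genalg package: `genetic_search` and all its parameters are hypothetical, truncation selection with elitism and one-point crossover are assumptions made for brevity, and `toy_fitness` is a toy stand-in for the Bethel-based fitness.

```python
import random

def genetic_search(M, sample_size, suggestion=None, pop_size=20,
                   generations=50, mutation_chance=0.05, elite=2, seed=0):
    """Minimal binary GA; the fittest individual has the smallest sample size."""
    rng = random.Random(seed)
    # Step 0: random initial population, optionally holding a suggestion
    pop = [tuple(rng.randint(0, 1) for _ in range(M)) for _ in range(pop_size)]
    if suggestion is not None:
        pop[0] = tuple(suggestion)
    # Step 2: iterate step 1 a fixed number of times
    for _ in range(generations):
        pop.sort(key=sample_size)
        parents = pop[: pop_size // 2]      # selection: keep the fitter half
        nxt = pop[:elite]                   # elitism: best individuals survive
        while len(nxt) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randint(1, M - 1)     # one-point crossover
            child = p1[:cut] + p2[cut:]
            # mutation: flip each chromosome with probability mutation_chance
            child = tuple(1 - b if rng.random() < mutation_chance else b
                          for b in child)
            nxt.append(child)
        pop = nxt
    best = min(pop, key=sample_size)
    return best, sample_size(best)

# Toy fitness (assumption): each activated value lowers n by a fixed gain.
gains = [30, 1, 20, 0.5, 3, 7]
toy_fitness = lambda v: 100 - sum(g * b for g, b in zip(gains, v))

suggestion = (1, 0, 1, 0, 0, 0)             # e.g. a tree-based solution
best_v, best_n = genetic_search(6, toy_fitness, suggestion=suggestion)
```

Because the best individuals are carried over unchanged, the returned solution is never worse than the suggestion placed in the initial population, which mirrors the role of the tree-based solution as a starting point.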
Optimal stratification: the evolutionary approach (5)
[Figure: from generation j to generation j+1. The t individuals of generation j, each a binary vector with its associated sample size, undergo selection with probability proportional to fitness; mutation and crossover applied to the selected individuals produce the t individuals of generation j+1.]
Optimal stratification: the evolutionary approach (6)
[Diagram of the procedure. Inputs: tree-based solution, basic strata information, precision constraints on estimates, parameters of execution. Processing: Genalg (genalg package), calling Bethel. Output: solution (output strata information).]
An application: the Italian Farm Structure Survey
The sampling frame used for the selection of the FSS sample contains 2,53,70 farms, each one characterised by the following X variables:
- province (103 different values);
- legal status (2 values);
- sector of economic activity (9 values);
- dimension in terms of production (3 values);
- dimension in terms of agricultural surface (3 values);
- dimension in terms of owned cattle (3 values);
- altimetry class (5 values).
4 different Y variables have been considered as the main targets of the FSS, for which the required precision (in terms of maximum coefficient of variation) has been fixed at regional level (domains of interest).
Region         (1) Current sample size   (2) Tree-based solution   % diff.
Italia                          52,713                    29,726    -43.61
Piemonte                         3,560                     1,546    -56.57
Valle d'A.                         409                       384     -6.11
Lombardia                        5,125                     2,237    -56.35
Bolzano                            687                       540    -19.94
Trento                             667                       638     -4.35
Veneto                           3,873                     2,299    -40.64
Friuli V.G.                      1,262                       619    -50.95
Liguria                          1,327                       777    -41.45
Emilia R.                        3,117                     1,966    -36.93
Toscana                          2,833                     1,341    -52.67
Umbria                           1,363                       858    -37.05
Marche                           1,188                       508    -57.24
Lazio                            3,710                     2,620    -29.38
Abruzzo                          1,222                       950    -22.26
Molise                           1,183                       867    -26.71
Campania                         3,163                     2,154    -31.90
Puglia                           6,595                     2,326    -64.73
Basilicata                         965                       684    -29.12
Calabria                         2,846                     2,080    -26.91
Sicilia                          5,011                     3,182    -36.50
Sardegna                         2,607                     1,140    -36.50
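The "% diff." column is the relative change from the first sample size to the second. A short Python check on the Veneto and Puglia rows illustrates the arithmetic; the function name `pct_diff` is hypothetical.

```python
def pct_diff(old, new):
    """Relative change from old to new, in percent, rounded to two decimals."""
    return round(100.0 * (new - old) / old, 2)

veneto = pct_diff(3873, 2299)   # -40.64
puglia = pct_diff(6595, 2326)   # -64.73
```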
Region         (2) Tree-based solution   (3) Evolutionary solution   % diff.
Italia                          29,726                      28,955     -2.59
Piemonte                         1,546                       1,546      0.00
Valle d'A.                         384                         376     -2.08
Lombardia                        2,237                       2,237      0.00
Bolzano                            540                         540      0.00
Trento                             638                         638      0.00
Veneto                           2,299                       2,138     -7.00
Friuli V.G.                        619                         619      0.00
Liguria                            777                         657    -15.44
Emilia R.                        1,966                       1,933     -1.68
Toscana                          1,341                       1,310     -2.31
Umbria                             858                         858      0.00
Marche                             508                         498     -1.97
Lazio                            2,620                       2,620      0.00
Abruzzo                            950                         876     -7.79
Molise                             867                         719    -17.07
Campania                         2,154                       2,040     -5.29
Puglia                           2,326                       2,272     -2.32
Basilicata                         684                         684      0.00
Calabria                         2,080                       2,072     -0.38
Sicilia                          3,182                       3,182      0.00
Sardegna                         1,140                       1,140      0.00
Conclusions
In a sample survey design, the joint adoption of a consolidated algorithm for determining the best sample size and allocation of units, together with search techniques such as the tree-based and genetic algorithms to explore different possible stratifications, can be very convenient in situations where many different stratifications of a sampling frame are possible.
A limitation of this approach is the constraint on the nature of the auxiliary variables X, which must be categorical.
An open problem is the treatment of continuous X variables.