Third General Conference of the International Microsimulation Association
Stockholm, June 8-10, 2011

The Dynamic Cross-sectional Microsimulation Model MOSART

Dennis Fredriksen, Pål Knudsen and Nils Martin Stølen
Statistics Norway

ABSTRACT: MOSART is an acronym for Model for microsimulation of Education, Labour supply and Social security. The model uses either the entire population or a representative sample of the population in a base year and simulates the further life course of each person. Besides research projects at Statistics Norway, the main users of the model are the Ministry of Finance and the Ministry of Labour. MOSART has been used extensively in the recent process of reforming the Norwegian public pension system. This paper provides a brief overview of the model, with emphasis on technical aspects and the base population.

Address: Research Department, Unit for Public Economics, Statistics Norway, P.O. Box 8131 Dep., N-0033 Oslo, Norway
Email: dennis.fredriksen@ssb.no, pal.knudsen@ssb.no, nils.martin.stolen@ssb.no

OBJECTIVE OF THE MODEL

MOSART is a dynamic microsimulation model with a cross-section of the Norwegian population and a comprehensive set of characteristics. The model starts with either the entire population or a representative sample of the population in a base year (currently 2005) and simulates the further life course of each individual in this initial population. Transition probabilities depending on individual characteristics are estimated from observed transitions in a recent period. Events included in the simulation are migration, deaths, births, household formation, educational activities, retirement, labour force participation, income and wealth. Public pension benefits are calculated from the simulated labour market earnings and other characteristics included in the simulation, according to an accurate description of the public pension system (the National Insurance Scheme, Folketrygden). The pensions covered by the model include old age pensions, disability pensions, survivors' pensions and early retirement benefits. Changes in the pension system may be analysed by calculating several pension systems in parallel while keeping the stochastic events constant.

TARGET AUDIENCE

MOSART is operated at Statistics Norway for three reasons: technical obstacles, restrictions from the Data Authorities regarding the merged administrative registers, and the fact that understanding the full meaning of changing a parameter requires detailed knowledge of the model. In addition to analyses requested by internal research projects, the main users are the Ministry of Finance and the Ministry of Labour. Users in these ministries are either former model developers themselves or economists with a realistic sense of how economic models work. They are therefore critical and capable users of the simulation results, and this has proved beneficial to the development and validation of the model.
For this reason we can also hand over the results with little preparation, often as simple tables supported by some verbal explanation. Other public institutions, private organisations and the media use results from the MOSART model occasionally. In these cases the results are handed over with a higher degree of preparation.

BASE POPULATION

The base population has recently been updated and now includes the entire Norwegian population. The base year is currently 2005. To be able to compute benefits for surviving
spouses and inheritance, deceased and emigrated persons are included. The total number of people in the base population is 7.16 million. For convenience we have generated random samples of 0.1, 1 and 10 per cent of this population. These samples, especially the two smallest, are mostly used for debugging and testing. All samples are stratified by gender, age, birth histories and household status. The samples include both spouses from all married couples and from cohabiting couples with children.

The data are collected from various administrative registers in the Directorate of Taxes, the National Insurance Administration and Statistics Norway. The underlying demographic assumptions of the model are based on public population projections from Statistics Norway. The information is represented as annual data going back as far as possible. Table 1 lists the data sources, the variables gathered from each, and the earliest start year of each time series.

Table 1: Data sources for the base population.

Source                             Variables                                     Start
---------------------------------  --------------------------------------------  -----
Directorate of Taxes               Gender, year of birth, spouse, mother and     1964
                                   father, marital status, country of birth,
                                   year of migration (if any), home address
National Insurance Administration  Degree of disability                          1991
National Insurance Administration  Pension status, time of disability            1967
Directorate of Taxes               Labour income, wealth                         1967
Statistics Norway                  Educational activities, completed education   1974

In addition to being the starting point of the simulation, the initial population is also used to estimate the transition probabilities. These probabilities may be adjusted so that the expected number of simulated events matches external constraints, for example the historical number of events in the same year. The underlying assumptions are generally kept up to date by using adjustment factors from the last year with historical data at an aggregate level.
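
The adjustment of transition probabilities to an external constraint can be sketched as follows. This is a minimal Python illustration and not MOSART's own code: it assumes a simple proportional adjustment factor, and the function names `adjustment_factor` and `calibrate` as well as all numbers are hypothetical.

```python
def adjustment_factor(probabilities, target_events):
    """Return the factor that scales the expected number of events to the target.

    Assumption: a single proportional factor per event type and year.
    """
    expected = 0.0
    for p in probabilities:
        expected += p  # expected number of events = sum of individual probabilities
    if expected <= 0.0:
        raise ValueError("expected number of events must be positive")
    return target_events / expected


def calibrate(probabilities, target_events):
    """Scale every individual probability by the common adjustment factor."""
    factor = adjustment_factor(probabilities, target_events)
    # Clip at 1.0 so that the adjusted values remain valid probabilities.
    return [min(1.0, p * factor) for p in probabilities]


# Hypothetical estimated probabilities for four persons and an external target.
estimated = [0.01, 0.02, 0.03, 0.04]
adjusted = calibrate(estimated, target_events=0.2)
print(adjusted)
```

In this sketch the individual differences estimated from the base population are preserved in relative terms, while the aggregate level follows the external constraint. If the clipping at 1.0 becomes active, the target is no longer matched exactly and a more careful scheme would be needed.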
The adjustment factors cover aggregate observations regarding migration, period life expectancy at birth by gender, number of births, number of pupils and students by gender and age group, number of early retirees, retirement age, number of persons in the labour force and man-years by gender,
total labour market earnings by gender, the basic pension unit, and rules for calculating pension entitlements and benefits. At present the model is calibrated to annual data from 2009. When calibrating to new annual data we assume that the effects of the explanatory variables (gender, age, education etc.) on the transition probabilities are the same as estimated from the initial population, and that the adjustment factors capture the relevant part of the variation over time. The model is extensively documented in Fredriksen (1998).

METHOD AND PLATFORM

Because the model is programmed in C#, it is multi-platform: compilers for C# exist for virtually every operating system. This makes it possible to run the model on any available hardware. We run the model on both Linux and Microsoft Windows. On Linux we use the compiler provided by the Mono project. Mono is an open source project providing software to develop and run .NET applications. Our experience with this compiler has been excellent. On Microsoft Windows we use the free compiler and development tool Visual C# Express. This tool includes access to the MSDN library, which is very helpful when programming large applications. Both compilers support version 4.0 of the .NET Framework.

Because the base population is large, a powerful computer is required. Once loaded from disk into memory, the base population occupies approximately 30 GB of RAM in the base year, growing to 60 GB¹ in year 2200. We are currently using a Linux-based server (conveniently named Amadeus) with 16 processors and 256 GB RAM. This lets us exploit the benefits of multithreading, discussed later.

As illustrated in Figure 1, the three main stages of the application running the model are:

1. Read data files and transition probabilities. Set up tables and data structures.
2. Perform calculations based on transition and event probabilities.
3. Print results for the year simulated.
   Advance to next year and re-sort lists.

¹ This depends on the assumptions, especially regarding population growth and the number of pension systems.
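
The three stages listed above can be sketched as a small driver loop. This is a minimal Python illustration under stated assumptions, not the MOSART implementation: the input format, the single event (death), the sort key and all names are hypothetical.

```python
import random

# Hypothetical space-delimited input: person id, birth year, annual death probability.
INPUT_TEXT = """\
1 1950 0.020
2 1980 0.001
3 1940 0.050
"""


def read_population(text):
    """Stage 1: parse the input once and set up the data structures."""
    people = []
    for line in text.splitlines():
        person_id, birth_year, p_death = line.split()
        people.append({"id": int(person_id), "birth_year": int(birth_year),
                       "p_death": float(p_death), "alive": True})
    return people


def simulate_year(people, rng):
    """Stage 2: draw this year's events from the transition probabilities."""
    for person in people:
        if person["alive"] and rng.random() < person["p_death"]:
            person["alive"] = False


def run(text, base_year, end_year, seed=1):
    rng = random.Random(seed)        # fixed seed so that a run can be repeated
    people = read_population(text)   # Stage 1 is performed only once
    results = []
    for year in range(base_year + 1, end_year + 1):
        simulate_year(people, rng)   # Stage 2, repeated every simulated year
        survivors = sum(1 for person in people if person["alive"])
        results.append((year, survivors))  # Stage 3: report the simulated year
        # Advance to the next year: living persons first, oldest first (assumed order).
        people.sort(key=lambda person: (not person["alive"], person["birth_year"]))
    return results


for year, survivors in run(INPUT_TEXT, 2005, 2010):
    print(year, survivors)
```

Only the first stage touches the input; the yearly loop works entirely on the in-memory structures.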

Figure 1: Logical data flow.

During a simulation, Step 1 above is performed only once, while Steps 2 and 3 are repeated for every simulated year. The input data and the transition probabilities are provided as space-delimited ASCII files. This makes it straightforward for the user to verify the contents of the files. In addition to the input files there are a few parameter files where the user can set global variables for the simulation, such as the end year of the simulation, mortality and fertility rates, and pension rules. These files are also space-delimited ASCII files.

The output from a simulation consists of three parts: extensive self-documentation (enabling the user to trace errors in the results afterwards), a set of standard tables produced by the simulation programme with aggregated figures covering the most frequently asked questions, and an optional model population in the form of an ASCII file with one record per selected person per selected year, containing selected variables. To produce special tables from this file one has to use a suitable table production programme such as SAS.

RECENT TECHNICAL ADVANCEMENTS

New computers have multiple cores, i.e. the ability to perform several calculations simultaneously. MOSART was originally programmed in a traditional style where only one event or calculation was handled at a time. With this approach, multiple cores did not reduce runtime.
The new base population includes 10-100 times as many persons as the former one², and this made runtimes matter. For this reason we shifted MOSART towards multithreading. Each step of the simulation is now split into a fixed set of 'jobs', e.g. the simulation of disability by groups of gender and birth year. It is mandatory that each such 'job' has no interactions whatsoever with any of the other 'jobs'³. This requires a tidy programming style. A special problem is that each 'job' must have its own random seed, independent of the chosen number of cores. Otherwise multithreading would influence the simulation result and make it impossible to reproduce a simulation by using an identical random seed. If the splitting into 'jobs' can be done at a high level, the effect on the source code is moderate. The no-interaction requirement also implies that simulation steps with repeated interactions across the entire population (e.g. household formation) gain nothing from multithreading.

The simulation is carried out by specifying the number of threads (i.e. the number of cores in the computer, if this simulation is the only task at the moment). At each simulation step, each thread (core) picks up the next 'job' in line and repeats this until no more 'jobs' are available at the present step. With several more 'jobs' than threads, this engages all threads fairly efficiently. With a large population, the runtime of most multithreaded simulation steps is reduced by a factor close to the number of threads. For example, tax calculations respond efficiently to 12 threads: this step is easily split into separate jobs (no interactions between tax units, i.e. households), allocates little new memory and consists of many trivial calculations. Some simulation steps, however, do not respond to multithreading at all, or respond only to 2-3 threads and thereby gain very little from a large number of threads.
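
The combination of independent 'jobs' and one random seed per 'job' can be sketched as follows. This is a minimal Python illustration under stated assumptions, not MOSART's C# code: the seed derivation, the event probability and all names are hypothetical. It demonstrates reproducibility only, since Python threads do not speed up CPU-bound work of this kind.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_job(job_index, person_ids, base_seed, probability):
    """Simulate one independent 'job' using its own random stream."""
    # The seed depends only on the base seed and the job, never on the thread,
    # so the draws are identical for any number of threads.
    rng = random.Random(base_seed * 1_000_003 + job_index)
    return [pid for pid in person_ids if rng.random() < probability]


def run_step(jobs, n_threads, base_seed=2011, probability=0.1):
    """Let n_threads workers pick up the jobs of one simulation step."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(simulate_job, index, person_ids, base_seed, probability)
                   for index, person_ids in enumerate(jobs)]
        # Collect results in job order, not in completion order.
        return [future.result() for future in futures]


# Eight hypothetical jobs (e.g. groups by gender and birth year), 200 persons each.
jobs = [list(range(group * 1000, group * 1000 + 200)) for group in range(8)]

# The simulated events are the same with one thread and with four threads.
print(run_step(jobs, n_threads=1) == run_step(jobs, n_threads=4))
```

Because no job reads or writes data belonging to another job, no locks are needed, and the order in which threads happen to finish cannot influence the outcome.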
Simulation steps involved in household formation are one clear category that responds poorly to multithreading: they are cumbersome to multithread due to often subtle interactions between individuals, and multithreading has little or no effect on their runtime. Another category is simulation steps with heavy allocation of memory, especially those which trigger memory management. A major problem is semi-permanent arrays, lists and objects. We are still working on these aspects, searching for an understanding of what constitutes efficient programming in a multicore environment.

² Prior to multithreading MOSART, the standard simulation included 1 per cent of the Norwegian population, although for special purposes we used 12 per cent.
³ Interaction may be handled through special synchronization primitives, e.g. locks, but the general effect is very often that the reductions in runtime are lost.

Another approach is tasks, which handle all the administration of creating threads and assigning 'jobs' (each iteration is a 'job'). The effect on the source code is minimal, so tasks are easy to implement. We are currently experimenting with this, either as an alternative to traditional multithreading or as a supplement. Our major problem so far is keeping the random generator unaffected by the number of threads. One example: tasks are efficient at adding up individual variables, but the resulting sum unfortunately depends on the number of threads. This happens when the sum and each item in it have the same precision, because the order of addition affects the rounding. The effect is small, but still sufficient to affect the random generator at some stage. We solved this by rounding all individual variables before adding, and then rounding the sum itself afterwards.

Multithreading has reduced the runtime by a factor of 4-5 when 6 threads are employed. Increasing the number of threads further has shown far less effect. Some parts of the simulation are not multithreaded, and their relative importance increases with the number of threads. Another problem, as mentioned, is memory management.

We have also experimented with memory-mapped files. Because the base population is large, it is inconvenient to load it into memory every time a simulation is run. By keeping it permanently stored in memory we avoid reading from disk, which is very slow compared to a memory-to-memory transfer. This approach significantly reduces the time needed to start a simulation. A memory-mapped file can easily be shared among different simulations, and the base population rarely changes.
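
The summation problem described in this section, and the rounding remedy, can be illustrated with a small sketch. This is a minimal Python example with hypothetical values and names, not MOSART's code: splitting a sum across workers changes the order of addition, which can change the last bits of a floating-point result.

```python
def naive_sum(values):
    """Add floats strictly left to right, without compensated summation."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_in_chunks(chunks):
    """Sum each chunk separately, then add the partial sums, as parallel workers would."""
    return naive_sum([naive_sum(chunk) for chunk in chunks])


def rounded_sum_in_chunks(chunks, decimals):
    """Round every item first, sum in chunks, then round the total itself."""
    rounded = [[round(value, decimals) for value in chunk] for chunk in chunks]
    return round(sum_in_chunks(rounded), decimals)


# The same three hypothetical values, split for one worker and for two workers.
one_worker = [[0.1, 0.2, 0.3]]
two_workers = [[0.1], [0.2, 0.3]]

# The plain sums differ in the last bits because the order of addition differs.
print(sum_in_chunks(one_worker) == sum_in_chunks(two_workers))

# After rounding the items and the total, both splits agree.
print(rounded_sum_in_chunks(one_worker, 1) == rounded_sum_in_chunks(two_workers, 1))
```

Rounding removes the order dependence in this example. As a general design note, and as an assumption beyond the source, accumulating in integer units (for example whole kroner) makes the sum exactly independent of the order of addition.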

POLICY ENVIRONMENT

In the last couple of years the MOSART model has been used intensively to analyse the effects of reforms of the Norwegian National Insurance Scheme. As in many other countries, the pension system in Norway is rather complicated, including non-linearities in the accumulation of pension entitlements. A microsimulation model including demographic characteristics, labour supply and an accurate description of the pension system therefore seems to be the most appropriate tool for obtaining precise estimates of the direct effects on
individual benefits, government expenditures and the future pension burden. Some of the experiences from using the model to analyse these effects can be found in Fredriksen and Stølen (2007), where it is shown that results from the MOSART model have had a direct impact on the design of the new pension system.

REFERENCES

Fredriksen D (1998) Projections of Population, Education, Labour Supply and Public Pension Benefits. Social and Economic Studies 101, Oslo: Statistics Norway.

Fredriksen D and Stølen N M (2007) Effects of Demographic Developments, Labour Supply and Pension Reforms on the Future Pension Burden in Norway, in Harding A and Gupta A (Eds.), Modelling our future: Population ageing, social security and taxation, Oxford: Elsevier, 81-106.