HEALTH ACTUARIES AND BIG DATA What is Big Data? The term Big Data does not refer only to very large datasets. It is typically understood to refer to high volumes of data, requiring high velocity of ingestion and processing and involving high degrees of variability in data structures (Gartner, 2016). Given this, Big Data involves large volumes of both structured and unstructured data (the latter referring to free text, images, sound files, and so on), from many different sources, requiring advanced hardware and software tools to link, process and analyse such data at great speed. [Ian's comment: The term "big data" is misleading and I don't like using it. After all, in healthcare we are used to using very large datasets; my sense is bigger than those used by life, pension or GI actuaries. Some other practice areas refer to "big" data sets that are relatively small by healthcare standards. For this reason I prefer structured vs. unstructured. The second issue that arises is the tension between those who (like me) are practitioners of traditional statistical approaches and models, and those who practise machine learning. I suspect more actuaries fall into the first camp than the second. A visitor from Stanford who gave a seminar here last year described the statistical approach as: propose a hypothesis; search for data; test the hypothesis; depending on the results, refine the hypothesis. The machine learning approach: find data; hook up the machine; spin through the data; develop a hypothesis to explain the findings. The problem with the second approach is its replicability. Emile's comment: it seems as if definitions are getting clearer in the literature; I'd be reluctant to confine it only to unstructured data. My understanding: it is the combination of a much higher volume of structured and unstructured data from multiple sources, linked in a database that can handle queries at high speed, to a much greater extent than what was possible even a few years ago.
The definition of machine learning is actually a lot less clear, in my experience. Some people classify normal linear regression as a form of machine learning, although that is probably not what most people understand by it! I agree that there are some machine learning applications that are opaque, and where results are hard to explain (leading to the temptation to develop a hypothesis to explain the findings), but this is by no means, in my experience, the only way to apply the technology. Some Big Data techniques involve traditional statistical approaches, but just with much more unstructured data incorporated, and with rapid and multiple iterations of model fitting with complete transparency on why you get the answers you're getting.] Data is rapidly increasing in volume and velocity because of developments in technology, involving many more sensors, which constantly generate streams of data. Examples include fitness or wellness tracking devices, car tracking devices and medical equipment. People also contribute to the rapid expansion of data, primarily in the form of social media and online interactions. Furthermore, IBM estimates that at least 80% of the world's data is unstructured (Watson, 2016), in the form of text, images, videos and audio. This data may contain valuable unique insights for an organisation, enabling it to meet customers' needs more effectively and answer queries in real time, among other applications. However, such a data environment is very different to the sets of structured data in tables that actuaries and other analysts are used to analysing, and requires investments in hardware, software and skills.
[Ian's comment: The volume of sensor-based data creates major problems for modelers in terms of distinguishing signal from noise. I suppose that, as actuaries, we have training (and our experience) to rely on that helps us to do this with traditional data (we are able to distinguish, for example, when conditions warrant changing pricing or reserving assumptions). The problem with the streams of data is that we lack the types of algorithms to organize granular data into something that is reportable or understandable. (A good example: in the 1990s modelers developed grouper algorithms to group the 15,000 diagnosis codes into more manageable and useful condition categories. But we lack this type of grouping for a lot of the clinical data that are generated, and models to determine when a trend in a clinical observation is becoming critical, as opposed to the achievement of a particular value.) Emile's comment: this is rapidly changing; I've seen what appear to be very powerful tools to interpret, analyse and understand unstructured clinical data (e.g. doctors' notes, as well as pathology and radiology reports). This may have been an issue a while ago, but it is no longer the case, in my view.] The increase in volume, velocity and variability has increased the demand for processing power, and Big Data typically cannot be stored and analysed in traditional systems. To handle Big Data, organisations typically have to introduce large-scale parallel processing systems. These allow organisations to store vast amounts of data of all types on low-cost commodity hardware, and to query and analyse the data in near real time by parallelising operations that were previously done on a single processor. In addition, the software required for this is often open source and freely available, for example Apache Hadoop, Apache Spark and Cloudera. This reduces the cost of storing large volumes of data and lowers the barriers to entry from a direct cost perspective.
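To make the paradigm behind tools such as Hadoop concrete, the sketch below implements the classic MapReduce word count in plain Python. The function names and the tiny document set are illustrative, not part of any Hadoop API: mappers emit (key, 1) pairs, and a reducer sums the pairs per key.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word, as a Hadoop mapper would.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reducer: sum the counts per key. In a cluster, all pairs with the
    # same key are routed to the same reducer (the "shuffle" step).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["acute asthma", "asthma follow up", "acute bronchitis"]
word_counts = reduce_phase(map_phase(docs))
```

In a real cluster the map and reduce phases run on many machines over partitions of the data, which is what makes the approach scale; the structure of the computation is the same.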
Distributed parallel computing allows a single task to be performed on multiple computers, which reduces the processing time, as shown in the following diagram.
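The split-and-combine pattern can be sketched in a few lines: the data is partitioned, each worker processes its own partition independently, and the partial results are merged at the end. The worker count and data here are purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker handles one partition of the data independently.
    return sum(chunk)

data = list(range(1_000_000))
n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)  # final merge of the partial results
```

In a genuinely distributed system the workers are separate machines rather than threads, and the merge step must move partial results across the network, but the decomposition of the task is the same.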
(Code Project, 2009) If an organisation implements systems that enable it to access and store large quantities of data, that is, however, only the first step. According to Gary King, "Big Data is not about the data" (King, 2016): while the data may be plentiful, the real value and complexity emerge from the analysis of this data and, beyond that, from a responsive operational environment that allows for the application of analytical insights. At the same time, an analyst cannot ignore the complexities of the data: how it was generated, how it is coded, what types of coding errors and missing values are included and how to address any data problems. Understanding the data itself, and its sources and limitations, remains critical in understanding the outputs of any modelling exercise. Actuaries' role in Big Data The ability to analyse and interpret unstructured data requires advanced analytical and programming skills. The term data scientist refers to an individual possessing specific skills in analysing and delivering
actionable insights from Big Data. In particular, Drew Conway defines data scientists as people with skills in statistics, machine learning algorithms and programming, who also have domain knowledge in the field (Drew, 2013). Machine learning automates analytical model building by using algorithms that iteratively learn from data. This allows computers to find hidden insights without being explicitly programmed where to look (SAS, 2016). Actuaries have a rich grounding in traditional statistics and its correct application in the evaluation of insurance and other financial risks. Actuaries also have deep knowledge of the insurance and financial services environment. These skills, coupled with the ability to solve a variety of problems, have earned actuaries a niche role in modelling and analysing data in insurance. However, for actuaries to enter and compete in the world of Big Data, they require new programming and non-traditional analytical skills and techniques, beyond the traditional areas of survival models, regression, GLMs and data mining. Actuaries will therefore be required either to develop these skills themselves, or to become familiar with the tools and their applications and work in multidisciplinary teams where their domain knowledge can be applied alongside the most advanced data science tools. Either way, some familiarity with the power of new data handling technologies (particularly in respect of unstructured data) will help actuaries to understand and identify the opportunities that Big Data provides. Why is Big Data particularly relevant to healthcare actuaries? Actuaries within the healthcare industry have access to many potential sources of data which could provide insight into risks and opportunities, much of which was not available before. These new sources of data, in addition to claims and demographic data, include data generated by fitness devices, wellness devices, medical equipment (including diagnostic devices), as well as social media.
This data may be generated by policyholders, patients, health providers (e.g. doctors' notes written in an Electronic Health Record), or by diagnostic or other medical equipment (e.g. x-rays, MRIs, blood test results). Some sources of data did not exist before, such as the mapped genomes of patients, in the context of personalised medicine. This data can have a variety of applications in health insurance, but of course also raises many questions about the way in which insights flowing from such data are applied, and the risks posed by its mere existence. (Feldman, Martin, & Skotnes, 2012)
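As a toy illustration of how an unstructured source might augment a structured record, the snippet below flags a hypothetical free-text doctor's note for a keyword and attaches the flag to a claim record. The keyword list, note text and field names are all invented for illustration; real clinical text mining uses far more sophisticated natural language processing.

```python
# Hypothetical example: enrich a structured claim record with a flag
# derived from an unstructured doctor's note. Keywords, note text and
# field names are illustrative only.
SMOKER_TERMS = {"smoker", "smoking", "tobacco"}

def flags_from_note(note: str) -> dict:
    # Crude tokenisation: lowercase, strip commas and full stops, split on spaces.
    words = set(note.lower().replace(",", " ").replace(".", " ").split())
    return {"smoker_flag": bool(words & SMOKER_TERMS)}

claim = {"claim_id": "C123", "icd10": "J44.9"}  # structured fields
note = "Patient is a long-term smoker, presents with chronic cough."
claim.update(flags_from_note(note))
```

The point of the sketch is the pattern: a signal recovered from free text becomes one more column alongside the structured claim fields, available to any downstream model.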
Healthcare actuaries are closely involved in the management of healthcare risks. Historically, healthcare actuaries have managed this risk through a combination of underwriting, pricing, benefit design and contracting with providers. However, through the use of Big Data, actuaries are starting to develop unique insights into how behavioural factors affect healthcare outcomes. For example, the success rate of a particular treatment may depend on the genetic profile of a patient and their level of fitness. The personalisation of medicine requires new data to enter electronic health records, with the aim of choosing far more appropriate treatments for individual patients, and hence potentially significantly improving health outcomes and therefore mortality and morbidity. (insert reference to our Personalised Medicine paper when available) For instance, knowledge of an individual's genome allows doctors to better match the most effective cancer drug to the individual patient (Garman, Nevins, & Potti, 2007). This may lead to considerable savings in the healthcare industry and reduce wastage on incorrect treatment. In some environments, health insurers are the custodians of electronic health records. To the extent that the information mentioned above enters the health record, it would, in theory, be available to health insurers. If this is the case, it could be applied in very effective ways to make relevant information available to treating doctors, and hence improve health outcomes. On the other hand, such data is of course very sensitive, and privacy considerations are very important. However, to the extent that new sources of medical data are not available to insurers, either because they are legally prevented from requesting them, or because potential policyholders withhold them when asked, there are clear risks of adverse selection in the purchase of health or life insurance.
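A stylised calculation, with entirely hypothetical figures, shows why withheld risk information creates adverse selection: if high-risk individuals know their status and buy cover at a higher rate, the average claim in the insured pool exceeds the community-rated premium set on the population mix.

```python
# Hypothetical figures for illustration only.
cost = {"high": 50_000.0, "low": 5_000.0}           # expected annual claim per person
population = {"high": 0.10, "low": 0.90}            # population mix
premium = sum(population[g] * cost[g] for g in cost)  # community-rated premium

# Assume high-risk individuals (who know their hidden risk status) buy
# cover at twice the rate of low-risk individuals:
take_up = {"high": 0.8, "low": 0.4}
weights = {g: population[g] * take_up[g] for g in cost}
insured_total = sum(weights.values())
avg_claim = sum(weights[g] * cost[g] for g in cost) / insured_total
shortfall = avg_claim - premium  # expected loss per policy at the old premium
```

Under these assumed take-up rates the average claim per insured (about 13,180) exceeds the 9,500 community premium, so the insurer makes an expected loss unless it can observe, or price for, the hidden risk information.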
In some jurisdictions, it is not clear that insurers would have any right to access genetic information, or other health record information that may be relevant to underwriting, and this may create significant risk. It is also relevant that much of this data can be used to drive behaviour change in the interest of better health outcomes. For instance, capturing more data on clinical outcomes and augmenting it with geolocation data on the insured and the provider allows high-quality provider networks to be created, and insured patients may be incentivised or directed to use healthcare providers who provide higher-quality treatment. At a member level, any data on wellness activities (whether in the form of preventative screenings, exercise or nutrition) may be used to incentivise and reward wellness engagement, which in turn reduces healthcare costs for those who respond to such incentives. Determining the optimum level of rewards and wellness activity is an actuarial problem which can be solved if multiple sources of wellness and health data are shared with an insurer. Text mining doctors' notes on claims or health records can also provide additional information, over and above the procedure and ICD codes that would typically be obtained from the claim. This provides additional information on the complexity of the procedure and the stage of the disease, which assists in analysing the success rate of the treatment provided. It may also be used to determine the case mix of patients visiting a provider, which can support provider profiling and, in turn, give insights into the quality and efficiency of treatments provided. Big Data can also be used to provide insight into the incidence and spread of disease within a population, perhaps even before individuals access healthcare facilities.
For example, Google has used the number and type of searches to produce current estimates of flu and dengue fever incidence in a particular area (Google Flu Trends, 2016), although with varying rates of success. The initial model built by Google failed to account for shifts in people's search behaviour and therefore became a poor predictor over
time. Further work by Samuel Kou allows the model to self-correct for changes in how people search, and this has led to more accurate results (Mole, 2015). This data can provide an understanding of the spread of disease within a population, which can potentially serve as an early warning of an increase in claims and demand for healthcare resources before it occurs. Healthcare actuaries have unique domain knowledge, which means that they are in a position to apply these non-traditional data sources practically to solve problems and seek opportunities. Big Data has the potential to enhance the healthcare industry by enabling wellness programmes to operate effectively, personalising treatments, and improving the allocation of healthcare resources to reduce wastage in the system. Actuaries also tend to have a better understanding of financial risk than other professionals, and hence their understanding of risk is critical to finding the correct application of Big Data tools in insurance. There are many concerns about privacy, data security and the ways in which data is used that must be addressed before data is applied in practice. Patient and doctor permission, depersonalisation of data for analytical purposes, failsafe access control to sensitive data, and an ethics and governance framework for evaluating the application of insights to practical problems must all be in place. Health actuaries need to evaluate the regulatory requirements and the ethics of Big Data applications. At the same time, actuaries should also consider the risk implications of their organisations not having access to data that exists, and how these risks can be managed. So what should healthcare actuaries do? Healthcare actuaries need to identify the importance and value of Big Data within their organisations and invest in the appropriate technology infrastructure, analytical tools and skills.
Investing in the data may include purchasing data from external providers, systems development to extract and collect the data that an organisation currently has access to, as well as classifying the data within the system so that it can be used in analysis. The technology required to process and analyse this data includes both a parallel processing hardware system and the software required to operate it. Most of the software required is open source and thus freely available; however, the organisation will likely not have the necessary skills to set up the system and will therefore require the services of an external provider. The organisation will also need to invest in the skills required to interpret this data, either by encouraging actuaries to develop the skills, or by employing multi-disciplinary teams involving data scientists. With improvements in technology and techniques to store, process and extract value from Big Data, it is clear that Big Data is very relevant to healthcare actuaries, whether such data is available to their organisations or not. The many ethical and legal questions that this environment gives rise to will also have major implications for actuarial risks, and actuaries should therefore be active participants in debates and in finding solutions to the complex issues arising from them.
References
Code Project. (2009, April 19). Retrieved from http://www.codeproject.com/articles/35671/distributed-and-parallel-processing-using-wcf
Drew, C. (2013, March 26). Retrieved from http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Feldman, B., Martin, E., & Skotnes, T. (2012, October). Retrieved from https://www.scribd.com/document/107279699/big-data-in-healthcare-hype-and-hope
Garman, K., Nevins, J., & Potti, A. (2007). Genomic strategies for personalized cancer therapy. Human Molecular Genetics, 226-232.
Gartner. (2016, October 18). Retrieved from http://www.gartner.com/it-glossary/big-data/
Google Flu Trends. (2016, October 18). Retrieved from https://www.google.org/flutrends/about/
King, G. (2016). Retrieved from http://gking.harvard.edu/files/gking/files/prefaceorbigdataisnotaboutthedata_1.pdf
Mole, B. (2015, September 11). Retrieved from http://arstechnica.com/science/2015/11/new-flu-tracker-uses-google-search-data-better-than-google/
SAS. (2016, October 25). Retrieved from http://www.sas.com/en_us/insights/analytics/machine-learning.html
Watson. (2016, May 25). Retrieved from https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/