Fundamentals of Machine Learning for Predictive Data Analytics


Chapter 2: Data to Insights to Decisions
John Kelleher, Brian Mac Namee, and Aoife D'Arcy
john.d.kelleher@dit.ie brian.macnamee@ucd.ie aoife@theanalyticsstore.com

1 Converting Business Problems into Analytics Solutions
    Case Study: Motor Insurance Fraud
2 Assessing Feasibility
    Case Study: Motor Insurance Fraud
3 Designing the Analytics Base Table
    Case Study: Motor Insurance Fraud
4 Designing & Implementing Features
    Different Types of Data
    Different Types of Features
    Handling Time
    Legal Issues
    Implementing Features
    Case Study: Motor Insurance Fraud
5 Summary

Converting Business Problems into Analytics Solutions

Converting a business problem into an analytics solution involves answering the following key questions:
1 What is the business problem?
2 What are the goals that the business wants to achieve?
3 How does the business currently work?
4 In what ways could a predictive analytics model help to address the business problem?

Case Study: Motor Insurance Fraud

In spite of having a fraud investigation team that investigates up to 30% of all claims made, a motor insurance company is still losing too much money due to fraudulent claims. What predictive analytics solutions could be proposed to help address this business problem?

Potential analytics solutions include:
    Claim prediction
    Member prediction
    Application prediction
    Payment prediction

Assessing Feasibility

Evaluating the feasibility of a proposed analytics solution involves considering the following questions:
1 Is the data required by the solution available, or could it be made available?
2 What is the capacity of the business to utilize the insights that the analytics solution will provide?

Case Study: Motor Insurance Fraud

What are the data and capacity requirements for the proposed Claim Prediction analytics solution for the motor insurance fraud scenario?

[Claim prediction]
Data Requirements: A large collection of historical claims marked as fraudulent and non-fraudulent. Also, the details of each claim, the related policy, and the related claimant would need to be available.
Capacity Requirements: The main requirement is that a mechanism could be put in place to inform claims investigators that some claims were prioritized above others. This would also require that information about claims become available in a suitably timely manner so that the claims investigation process would not be delayed by the model.

Designing the Analytics Base Table

The basic structure in which we capture historical datasets is the analytics base table (ABT).

Figure: The general structure of an analytics base table: descriptive features and a target feature.

Figure: The different data sources typically combined to create an analytics base table.

The prediction subject defines the basic level at which predictions are made, and each row in the ABT will represent one instance of the prediction subject; the phrase one-row-per-subject is often used to describe this structure. Each row in an ABT is composed of a set of descriptive features and a target feature. Defining features can be difficult!

A good way to define features is to identify the key domain concepts and then to base the features on these concepts.

Figure: The hierarchical relationship between an analytics solution, domain concepts, and descriptive features.

There are a number of general domain concepts that are often useful:
    Prediction Subject Details
    Demographics
    Usage
    Changes in Usage
    Special Usage
    Lifecycle Phase
    Network Links

Case Study: Motor Insurance Fraud

Motor Insurance Claim Fraud Prediction
    Policy Details
    Claim Details
    Claimant History
        Claim Types
        Claim Frequency
    Claimant Links
        Links with Other Claims
        Links with Current Claim
    Claimant Demographics
    Fraud Outcome

Figure: Example domain concepts for a motor insurance fraud claim prediction analytics solution.

Designing & Implementing Features

Three key data considerations are particularly important when we are designing features:
    Data availability
    Timing
    Longevity

Different Types of Data

ID    NAME    DATE OF BIRTH  GENDER  CREDIT RATING  COUNTRY    SALARY
0034  Brian   22/05/78       male    aa             ireland    67,000
0175  Mary    04/06/45       female  c              france     65,000
0456  Sinead  29/02/82       female  b              ireland    112,000
0687  Paul    11/11/67       male    a              usa        34,000
0982  Donald  01/12/75       male    b              australia  88,000
1103  Agnes   17/09/76       female  aa             sweden     154,000

Figure: Sample descriptive feature data illustrating numeric, binary, ordinal, interval, categorical, and textual types.

Different Types of Features

The features in an ABT can be of two types:
    raw features
    derived features

There are a number of common derived feature types:
    Aggregates
    Flags
    Ratios
    Mappings
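As a minimal sketch of the four derived feature types listed above, the following Python computes one of each for a single claimant. The record layout, the field names, and the county-to-region lookup are all hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Hypothetical raw data for one claimant; every field name and value here is
# an illustrative assumption, not taken from a real dataset.
past_claims = [
    {"amount": 1200.0, "successful": True},
    {"amount": 2050.0, "successful": False},
]
current_claim = {"amount": 1625.0, "premium_paid": 50.0, "county": "Mayo"}

# Hypothetical lookup table used for the mapping feature.
COUNTY_TO_REGION = {"Mayo": "West", "Dublin": "East"}


def derive_features(past_claims, current_claim):
    """Return one example of each common derived feature type."""
    return {
        # Aggregate: a summary over a group of related records.
        "total_claimed": sum(c["amount"] for c in past_claims),
        # Flag: a binary indicator for the presence of some characteristic.
        "unsuccessful_claim_made": any(not c["successful"] for c in past_claims),
        # Ratio: the relationship between two raw values.
        "claim_to_premium": current_claim["amount"] / current_claim["premium_paid"],
        # Mapping: a raw value converted to a coarser category.
        "region": COUNTY_TO_REGION.get(current_claim["county"], "Unknown"),
    }


features = derive_features(past_claims, current_claim)
```

The ratio assumes a non-zero premium; a real implementation would need to decide how to represent a zero or missing denominator.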

Handling Time

Many of the predictive models that we build are propensity models, which inherently have a temporal element. For propensity modeling, there are two key periods:
    the observation period
    the outcome period

In some cases the observation and outcome period are measured over the same time for all prediction subjects.

Figure: Modeling points in time, shown on a timeline from June 2012 to May 2013. (a) Observation period and outcome period. (b) Observation and outcome periods for multiple customers (each line represents a customer).

Often the observation period and outcome period will be measured over different dates for each prediction subject.

Figure: Observation and outcome periods defined by an event rather than by a fixed point in time (each line represents a prediction subject and stars signify events). (a) Actual. (b) Aligned.
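The alignment step can be sketched as follows: each subject's activity is re-expressed as a month offset relative to that subject's own event, so that subjects with different calendar dates share one time axis. The function names, the default window lengths (six observation months, three outcome months), and the choice to ignore the event month itself are assumptions of this sketch.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from start to end (negative if end is earlier)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def align_to_event(event_date, activity_dates, obs_months=6, out_months=3):
    """Bucket activity dates by month offset relative to the subject's event.

    Offsets -obs_months..-1 fall in the observation period and offsets
    1..out_months fall in the outcome period. The event month (offset 0)
    and anything outside both windows is ignored (an assumption).
    """
    observation, outcome = [], []
    for d in activity_dates:
        offset = months_between(event_date, d)
        if -obs_months <= offset <= -1:
            observation.append(offset)
        elif 1 <= offset <= out_months:
            outcome.append(offset)
    return sorted(observation), sorted(outcome)


# Two hypothetical subjects whose events fall on different calendar dates.
subject_a = align_to_event(
    date(2012, 12, 10),
    [date(2012, 7, 1), date(2012, 11, 20), date(2013, 2, 5), date(2013, 6, 1)],
)
subject_b = align_to_event(
    date(2013, 3, 1),
    [date(2012, 10, 15), date(2013, 4, 2)],
)
```

After alignment both subjects are described on the same relative axis, even though their underlying calendar dates differ.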

In some cases only the descriptive features have a time component to them, and the target feature is time independent.

Figure: Modeling points in time for a scenario with no real outcome period (each line represents a customer, and stars signify events). (a) Actual. (b) Aligned.

Conversely, the target feature may have a time component and the descriptive features may not.

Figure: Modeling points in time for a scenario with no real observation period (each line represents a customer, and stars signify events). (a) Actual. (b) Aligned.

Legal Issues

Data analytics practitioners can often be frustrated by legislation that stops them from including features that appear to be particularly well suited to an analytics solution in an ABT. There are significant differences in legislation in different jurisdictions, but a couple of key relevant principles almost always apply:
1 Anti-discrimination legislation
2 Data protection legislation

Although data protection legislation changes significantly across different jurisdictions, there are some common tenets on which there is broad agreement and which affect the design of ABTs:
    The collection limitation principle
    The purpose specification principle
    The use limitation principle

Implementing Features

Implementing a derived feature requires data from multiple sources to be combined into a set of single feature values. A few key data manipulation operations are frequently used to calculate derived feature values:
    joining data sources
    filtering rows in a data source
    filtering fields in a data source
    deriving new features by combining or transforming existing features
    aggregating data sources
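The operations above can be sketched in plain Python. The two data sources, their field names, and the choice of the policy as prediction subject are hypothetical; in practice these steps are more often carried out in SQL or a data manipulation tool.

```python
from collections import defaultdict

# Two hypothetical data sources; names and values are illustrative only.
policies = [
    {"policy_id": "P1", "premium": 50.0},
    {"policy_id": "P2", "premium": 263.0},
]
claims = [
    {"claim_id": 1, "policy_id": "P1", "amount": 1625.0, "status": "open"},
    {"claim_id": 2, "policy_id": "P1", "amount": 900.0, "status": "closed"},
    {"claim_id": 3, "policy_id": "P2", "amount": 15028.0, "status": "closed"},
]


def build_rows(policies, claims):
    """Build one row per policy from the two sources."""
    # Filter rows: keep only closed claims.
    closed = [c for c in claims if c["status"] == "closed"]
    # Filter fields: keep only the fields needed downstream.
    slim = [{"policy_id": c["policy_id"], "amount": c["amount"]} for c in closed]
    # Aggregate: total closed-claim amount per policy.
    totals = defaultdict(float)
    for c in slim:
        totals[c["policy_id"]] += c["amount"]
    # Join: attach the aggregate to each policy on policy_id.
    rows = []
    for p in policies:
        total = totals.get(p["policy_id"], 0.0)
        rows.append({
            "policy_id": p["policy_id"],
            "total_closed_claims": total,
            # Derive: combine existing values into a new feature.
            "claims_to_premium": total / p["premium"],
        })
    return rows


abt_rows = build_rows(policies, claims)
```

Each resulting row corresponds to one prediction subject, which is the one-row-per-subject structure an ABT requires.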


Case Study: Motor Insurance Fraud

What are the observation period and outcome period for the motor insurance claim prediction scenario?

The observation period and outcome period are measured over different dates for each insurance claim, defined relative to the specific date of that claim.

The observation period is the time prior to the claim event, over which the descriptive features capturing the claimant's behavior are calculated.

The outcome period is the time immediately after the claim event, during which it will emerge whether the claim is fraudulent or genuine.


What features could you use to capture the Claim Frequency domain concept?

Motor Insurance Claim Fraud Prediction > Claimant History > Claim Frequency
    Number of Claims in Claimant Lifetime (Derived Aggregate)
    Number of Claims by Claimant in Last 3 Months (Derived Aggregate)
    Average Claims Per Year by Claimant (Derived Aggregate)
    Ratio of Avg. Claims Per Year to Number of Claims in Last 12 Months (Derived Ratio)

Figure: A subset of the domain concepts and related features for a motor insurance fraud prediction analytics solution.
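One way these four Claim Frequency features might be computed from a claimant's claim dates is sketched below. The function name and feature keys are hypothetical, and the sketch assumes that a year is 365 days, that "last 3 months" means the previous 90 days, that a claimant is treated as active for at least one year, and that the ratio is 0.0 when there were no claims in the last 12 months.

```python
from datetime import date, timedelta


def claim_frequency_features(claim_dates, first_policy_date, as_of):
    """Compute claim-frequency features for one claimant as of a given date.

    Only claims strictly before as_of are counted, so the features describe
    the observation period that precedes the current claim.
    """
    prior = [d for d in claim_dates if d < as_of]
    # Assumption: floor the active period at one year so that very new
    # claimants do not get an inflated yearly average.
    years_active = max((as_of - first_policy_date).days / 365.0, 1.0)
    last_90 = sum(1 for d in prior if d >= as_of - timedelta(days=90))
    last_365 = sum(1 for d in prior if d >= as_of - timedelta(days=365))
    avg_per_year = len(prior) / years_active
    return {
        "num_claims_lifetime": len(prior),
        "num_claims_3_months": last_90,
        "avg_claims_per_year": avg_per_year,
        # Assumption: report 0.0 when the denominator is zero.
        "avg_to_last_12_months_ratio": avg_per_year / last_365 if last_365 else 0.0,
    }


# Hypothetical claimant with four earlier claims.
freq = claim_frequency_features(
    claim_dates=[date(2011, 6, 1), date(2012, 3, 1),
                 date(2012, 11, 15), date(2012, 12, 1)],
    first_policy_date=date(2011, 1, 1),
    as_of=date(2012, 12, 31),
)
```

Passing the claim date as `as_of` keeps the features inside the observation period, so no information from the outcome period leaks into the descriptive features.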


What features could you use to capture the Claim Types domain concept?

Motor Insurance Claim Fraud Prediction > Claimant History > Claim Types
    Number of Soft Tissue Claims (Derived Aggregate)
    Ratio of Soft Tissue Claims to Other Claims (Derived Ratio)
    Unsuccessful Claim Made (Derived Flag)
    Diversity of Claim Types, measured using entropy (Derived Other)

Figure: A subset of the domain concepts and related features for a motor insurance fraud prediction analytics solution.
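The slides say only that claim type diversity is "measured using entropy". A minimal sketch, assuming Shannon entropy in bits over the relative frequency of each claim type (the function name and the base-2 logarithm are assumptions), could look like this:

```python
import math
from collections import Counter


def claim_type_diversity(claim_types):
    """Shannon entropy (in bits) of a claimant's claim types.

    Returns 0.0 for an empty history or when every claim has the same type;
    higher values mean the claims are spread across more types.
    """
    total = len(claim_types)
    if total == 0:
        return 0.0
    counts = Counter(claim_types)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


same_type = claim_type_diversity(["soft tissue", "soft tissue", "soft tissue"])
two_types = claim_type_diversity(["soft tissue", "back"])
four_types = claim_type_diversity(["soft tissue", "back", "broken limb", "serious"])
```

A claimant whose claims are all of one type scores zero, while a claimant whose claims are spread evenly over many types scores higher.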


What features could you use to capture the Claim Details domain concept?

Motor Insurance Claim Fraud Prediction > Claim Details
    Injury Type (Raw)
    Claim Amount (Raw)
    Claim to Premium Paid Ratio (Derived Ratio)
    Accident Region (Derived Mapping)

Figure: A subset of the domain concepts and related features for a motor insurance fraud prediction analytics solution.

Case Study: Motor Insurance Fraud

The following table illustrates the structure of the final ABT that was designed for the motor insurance claims fraud detection solution. The table contains more descriptive features than the ones we have discussed. The table also shows the first four instances. If we examine the table closely, we see a number of strange values (for example, -9 999) and a number of missing values; we will return to these in Chapter 3.

Table: The ABT for the motor insurance claims fraud detection solution.

ID  TYPE  INC.    MARITAL STATUS  NUM. CLMNTS.  INJURY TYPE  HOSPITAL STAY  CLAIM AMT.
1   CI    0                       2             Soft Tissue  No             1 625
2   CI    0                       2             Back         Yes            15 028
3   CI    54 613  Married         1             Broken Limb  No             -9 999
4   CI    0                       3             Serious      Yes            270 200
..

ID  TOTAL CLAIMED  NUM. CLAIMS  NUM. CLAIMS 3 MONTHS  AVG. CLAIMS PER YEAR  AVG. CLAIMS RATIO  NUM. SOFT TISSUE  % SOFT TISSUE
1   3 250          2            0                     1                     1                  2                 1
2   60 112         1            0                     1                     1                  0                 0
3   0              0            0                     0                     0                  0                 0
4   0              0            0                     0                     0                  0                 0
..

ID  UNSUCC. CLAIMS  CLAIM AMT. REC.  CLAIM DIV.  CLAIM TO PREM.  REGION  FRAUD FLAG
1   2               0                0           32.5            MN      1
2   0               15 028           0           57.14           DL      0
3   0               572              0           -89.27          WAT     0
4   0               270 200          0           30.186          DL      0
..

Summary

Predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization, not an end in themselves. It is important to fully understand the business problem that a model is being constructed to address; this is the goal behind converting business problems into analytics solutions.

Predictive data analytics models are reliant on the data that is used to build them: the analytics base table (ABT). The first step in designing an ABT is to decide on the prediction subject. An effective way in which to design ABTs is to start by defining a set of domain concepts in collaboration with the business, and then designing features that express these concepts in order to form the actual ABT.

Features (both descriptive and target) are concrete numeric or symbolic representations of domain concepts. It is useful to distinguish between raw features that come directly from existing data sources and derived features that are constructed by manipulating values from existing data sources. Common manipulations used in this process include aggregates, flags, ratios, and mappings, although any manipulation is valid.

The techniques described here cover the Business Understanding, Data Understanding, and (partially) Data Preparation phases of the CRISP-DM process.

Figure: A diagram of the CRISP-DM process (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment).

Business Understanding
    Understand Business Problem
    Propose Analytics Solutions
    Explore Data (1)
    Assess Analytics Solutions
    Choose Analytics Solution
    Agree on Analytics Goals
Data Understanding
    Design Domain Concepts
    Brainstorm Domain Concepts
    Review Domain Concepts
    Explore Data (2)
    Design Features
    Review Features
Data Preparation
    Build ABT
    Clean & Prepare Data

Figure: A summary of the tasks in the Business Understanding, Data Understanding, and Data Preparation phases of the CRISP-DM process.
