Variable selection for heavy-duty vehicle battery failure prognostics using random survival forests

Sergii Voronov, Daniel Jung, and Erik Frisk

Department of Electrical Engineering, Linköping University, Linköping, 581 83, SWEDEN
sergii.voronov@liu.se
daniel.jung@liu.se
erik.frisk@liu.se

ABSTRACT

Prognostics and health management is a useful tool for more flexible maintenance planning and increased system reliability. The application in this study is lead-acid battery failure prognosis for heavy-duty trucks, which is important to avoid unplanned stops by the road. There are large amounts of data available, logged from trucks in operation. However, the data is not closely related to battery health, which makes battery prognostics challenging. When developing a data-driven prognostics model and the number of available variables is large, variable selection is an important task, since including non-informative variables in the model has a negative impact on prognosis performance. Two features of the dataset have been identified: 1) few informative variables, and 2) highly correlated variables in the dataset. The main contribution is a novel method for identifying important variables, taking these two properties into account, using Random Survival Forests to estimate prognostics models. The result of the proposed method is compared to existing variable selection methods, and applied to a real-world automotive dataset. Prognostic models with all variables and with a reduced set of variables are generated, differences between the model predictions are discussed, and favorable properties of the proposed approach are highlighted.

1. INTRODUCTION

Prognostics and health management is important for preventing unexpected failures through more flexible maintenance planning. The purpose is to replace a failing component before it fails, but avoid changing it too often. Coarsely, there are two main approaches in prognostics, data-driven and model-based techniques, but hybrid approaches that combine the two are also possible.
Model-based prognostics uses a model of the monitored system and of the fault to predict the degradation rate and Remaining Useful Life (RUL), see for example (Daigle & Goebel, ). Statistical data-driven methods (Si, Wang, Hu, & Zhou, ) generate a prediction model based on training data to predict RUL.

Sergii Voronov et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

One relevant application is lead-acid starter battery prognosis for heavy-duty trucks. Heavy-duty trucks are important for transporting goods and for working at mines or construction sites, and it is vital that vehicles have a high degree of availability. Unplanned stops by the road can result in increased cost for the company due to the delay in delivery, but can also lead to damaged cargo. One cause of unplanned stops is a failure in the electrical power system, and in particular the lead-acid starter battery. The main purpose of the battery is to power the starter motor to get the diesel engine running, but it is also used to, for example, power auxiliary units such as heating and kitchen equipment.

The main contribution in this work is a data-driven method for variable selection when estimating a battery failure prognostics model for automotive lead-acid batteries based on Random Survival Forests (Ishwaran, Kogalur, Blackstone, & Lauer, 2008). In particular, two key properties of the application data set are addressed: 1) the number of informative variables is assumed to be small, and 2) the data contains highly correlated variables. Both aspects make building a prognostics model more difficult and are the main motivating factors for the proposed approach. Further, variable selection is also important to better understand which factors are correlated with battery failure rate and what is causing it.

This work is a continuation of (Voronov, Jung, & Frisk, 2016), where the main focus was to analyze the automotive application case study.
Here, the main contribution is an extended analysis of the variable selection problem that results in an augmentation of the decision space with an extra dimension. Further, characteristics of existing variable selection methods for Random Survival Forests are analyzed and compared to the proposed method, in particular for the case where there are many correlated variables in the data set. In addition, a basic variable selection methodology is proposed.

2. PROBLEM FORMULATION

The main objective in this work is to use Random Survival Forests (RSF) (Ishwaran et al., 2008) to identify, from data, which variables are relevant for building RSF models for survival analysis. The problem of identifying important variables is usually referred to as variable selection and is a relevant topic in data-driven prognostics and machine learning in general (Guyon & Elisseeff, 2003).

The prognostic problem studied here is to estimate the battery lifetime prediction function based on recorded vehicle data. The lifetime prediction function is defined as

$$B^{V}(t; t_0) = P(T > t + t_0 \mid T > t_0, V) \qquad (1)$$

where $T$ is the random variable failure time of the battery and $V$ the vehicle data at time $t = t_0$ when data is submitted into the model, in our case when a vehicle comes to the workshop. The function $B^{V}(t; t_0)$ is a function of $t$ and gives the probability that the battery will function at least $t$ time units after $t_0$. The data $V$ is recorded operational data for a specific vehicle.

2.1. Operational data

In this work a vehicle fleet database is provided by an industrial partner, where one snapshot of data is available from each vehicle, including information regarding how the truck has been used and the configuration of the specific truck. There is also information on whether the battery has failed or not. The database contains lots of information from the truck, not always related to battery degradation, meaning that it is not known what available information is relevant for this specific task. Therefore, it is important to identify which variables are relevant for battery lifetime prediction. Previous works considering this vehicle data set are presented in (Frisk & Krysander, ) and (Frisk, Krysander, & Larsson, 2014).
The main characteristics of the database can be summarized as follows:

- 3363 vehicles from EU markets
- A single snapshot per vehicle
- 84 variables stored for each vehicle snapshot
- Heterogeneous data, i.e., it is a mixture of categorical and numerical data
- Availability of histogram variables
- Censoring rate more than 9 percent
- Significant missing data rate

A main characteristic of the database is that there are no time series available for a vehicle. It means that there is only one snapshot $V$ of the variables in the database from each vehicle. Information describing how the vehicle has been used is stored as histogram data representing how often specific sensor data is measured within different intervals. As an example, there is a histogram describing how much time the vehicle has been subjected to different ambient temperatures.

Due to the non-specific purpose of the database, it is probable that only a small number of variables from the set $V$ influence prediction of the battery failure rate. Thus, identifying the important variables in order to remove irrelevant variables should improve the performance of a battery prognosis model.

2.2. Motivation for variable selection

There are several reasons why variable selection is important when working with data-driven models. First, it is possible to improve prediction performance by reducing the number of variables. The second motivation is better interpretability of the results by clearly understanding which factors are important for battery failure. The third motivation is to reduce model generation and prediction time by reducing the number of variables used for generating the RSF.

An example of why the quality of the predictor may degrade when the number of noisy (non-important) variables is large is given below. Synthetic data is created with the following properties. Let $h_0$ be a constant nominal hazard rate (Cox & Oakes, 1984) for battery failures. The hazard rate

$$h(t) = \lim_{dt \to 0} \frac{P(t \le T < t + dt \mid T \ge t)}{dt} \qquad (2)$$

represents the instantaneous rate of battery failure at a particular time $t$, given that the battery has survived until $t$.
In this example, the hazard rate does not change with time and the nominal hazard rate corresponds to an expected years of battery life. It is assumed that there is one variable $v$ with an impact on the battery hazard rate $h$ as

$$h = \begin{cases} h_0, & \text{if } v = 1 \\ 2 h_0, & \text{if } v = 2 \\ 3 h_0, & \text{if } v = 3 \end{cases} \qquad (3)$$

where $h_0$ is the nominal hazard rate. Data for 3 vehicles is generated with a censoring rate about 8 percent. Different numbers of noisy variables are included in the synthetic data to observe how they change the RSF output. First, only two noisy variables are added in addition to $v$. In the second case, noisy variables are added. All noisy variables are sampled from a normal distribution with zero mean and unit variance. After generating two RSF models, one for each set of variables, the reliability functions (Cox & Oakes, 1984)

$$R(t) = P(T > t) \qquad (4)$$

computed by the two RSF models are compared with the theoretical values of the reliability as shown in Figure 1. One vehicle from each of the three classes was chosen and submitted to the forest to receive the predictions. Figure 1(a) shows that the predictions from the RSF for the case of two noisy variables, dashed blue curves, follow the theoretical reliability functions, red solid curves, better than in the second case with the larger number of noisy variables, see Figure 1(b). However, the error rate, which is a common performance measure for the RSF, is similar for both cases. This means that the error rate is not a good measure in prognostic terms. It is worth noting that in the simulation environment, information about the true reliability curves is available. However, this is not the case for the vehicle fleet database.

Figure 1. Predictions from RSF with different number of noisy variables. Panels (a) and (b) show $R(t)$ for the two cases.

The example motivates the relevance of finding the important variables and at the same time removing noisy ones, especially if the number of important variables is small, as expected in the vehicle fleet database. The quality of the estimated reliability function from the RSF is significantly improved when the noisy variables are removed.

3. RANDOM SURVIVAL FORESTS

A brief description of Random Survival Forests and two standard methods for evaluating variable importance are presented. For a more detailed description, the interested reader is referred to, for example, (Ishwaran et al., 2008) and (Ishwaran, Kogalur, Chen, & Minn, ).

The difference between an ordinary decision tree classifier and a random forest (RF) is that randomness of two kinds is injected into the process of estimating the model. The first source is the usage of a bootstrap procedure. Each tree is grown using its own bag of cases which are sampled from the training set. Second, for each node in a tree, splitting variables are selected from a randomly sampled subset. RSF extends the RF approach to right-censored survival data, i.e., objects in the study without experienced failure.

RSF is a data-driven method that can be used for computing maximum-likelihood estimates of the reliability function (4).
The reliability function can be used to rewrite the lifetime prediction function (1) as

$$B^{V}(t; t_0) = P(T > t + t_0 \mid T > t_0, V) = \frac{R^{V}(t + t_0)}{R^{V}(t_0)} \qquad (5)$$

The output from each tree $\mathcal{T}$ in the RSF is the Nelson-Aalen estimate of the cumulative hazard rate, see (Cox & Oakes, 1984). Let $T_1 < T_2 < \ldots < T_N$ be $N$ distinct event times when failures of objects under study occur. Then, the Nelson-Aalen estimate for tree $\mathcal{T}$ and vehicle (data) $V$ is

$$\hat{H}_{\mathcal{T}}(t \mid V) = \sum_{T_j \le t} \frac{f_{j,n_i}}{s_{j,n_i}} \qquad (6)$$

where $f_{j,n_i}$ and $s_{j,n_i}$ are the number of failures and the number of survived objects, respectively, in terminal node $n_i$ of tree $\mathcal{T}$ at event time $T_j$. Terminal node $n_i$ is determined by dropping vehicle $V$ down through the tree. The cumulative hazard estimate $\hat{H}(t \mid V)$ for the whole forest is obtained by averaging over all $\hat{H}_{\mathcal{T}}(t \mid V)$. Finally, the reliability function $R^{V}(t)$ in (5) is obtained from the fact (Cox & Oakes, 1984)

$$R^{V}(t) = e^{-\hat{H}(t \mid V)} \qquad (7)$$

and then $B^{V}(t; t_0)$ can be computed from (5).

One measure of prediction error of RSF models proposed in (Ishwaran et al., 2008) is based on pair-wise evaluation of non-censored data, called concordance index (Harrell, Califf, Pryor, Lee, & Rosati, 1982). In short, the measure takes into consideration whether the RSF model correctly predicts which of two samples will fail first. However, note that it does not take into consideration how accurate the prediction is with respect to the actual failure time. Therefore, the error rates of the two models in Figure 1 turn out to be more or less equal even though the model with fewer variables is visibly more accurate.

3.1. Variable selection using VIMP

One intuitive measure of variable importance is the increase in prediction error when a variable is ignored in the RSF. This is done by randomizing the sample variable value when it is used as a splitting variable in the forest (Ishwaran et al., 2008). The idea is that a large increase in prediction error indicates that a variable is important, while a low increase (or a decrease) indicates that the variable is not important.
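The error measure and the importance idea described in this section can be illustrated with a small sketch. It is not the RSF implementation: the concordance error below is a simplified Harrell-style index that ignores ties in time, and `permutation_vimp` shuffles a whole input column of a generic predictor instead of randomizing node assignments inside the forest. The function names and the `predict_risk` callable are hypothetical.

```python
import random


def concordance_error(times, events, risk):
    """Return 1 - C, where C is a simplified Harrell-style concordance index.

    times  : observed time per sample (failure time or censoring time)
    events : 1 if a failure was observed, 0 if the sample is censored
    risk   : predicted risk score, a higher score means earlier expected failure
    """
    concordant, usable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # a censored sample cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:  # sample i is known to fail before sample j
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return 1.0 - concordant / usable


def permutation_vimp(predict_risk, rows, times, events, column, rng):
    """Increase in concordance error after shuffling one column of `rows`.

    Assumption: `rows` is a list of lists and `predict_risk` maps one row
    to a risk score. This is a stand-in for the in-forest randomization.
    """
    baseline = concordance_error(times, events, [predict_risk(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [r[:column] + [value] + r[column + 1:]
                     for r, value in zip(rows, shuffled)]
    permuted = concordance_error(times, events,
                                 [predict_risk(r) for r in permuted_rows])
    return permuted - baseline
```

Because only the ordering of the risk scores enters `concordance_error`, any strictly increasing transformation of the predictions gives the same error, which is one way to see why two models with visibly different reliability curves can share an error rate.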
The variable importance measure described above is called VIMP and is a candidate tool for variable selection by selecting a subset of the variables with the highest VIMP values. However, previous works, for example (Ishwaran et al., ), have shown that VIMP can have problems when there are many correlated variables. If several important variables are correlated they will share importance, and VIMP will be low even if the variables are important. Thus, there is a risk that important variables will be lost, resulting in degraded prediction performance.

3.2. Variable selection using Minimal depth

As an alternative to VIMP, a candidate measure called minimal depth for variable selection in RSF has been proposed, see (Ishwaran et al., ) or (Ishwaran, Kogalur, Gorodeski, Minn, & Lauer, ). The minimal depth for variable $v$ is defined as the average distance from the root to the closest node where it appears in the RSF. Important variables should have a higher probability of being selected as splitting variables, compared to noisy variables, at low levels close to the root when the trees are generated. Thus, the minimal depth for important variables in the forest should be lower compared to noisy variables.

To identify important variables using minimal depth, a threshold that distinguishes important variables from noisy variables is derived in (Ishwaran et al., ) based on the distribution for minimal depth $D_v$ of noisy variables as

$$P(D_v = d \mid v \text{ is noisy variable}) = \left(1 - \frac{1}{p}\right)^{L_d} \left[ 1 - \left(1 - \frac{1}{p}\right)^{l_d} \right], \quad 0 \le d \le D(T) - 1 \qquad (8)$$

where $D(T)$ is the tree depth, $l_d$ is the number of nodes at depth $d$, $L_d = l_0 + l_1 + \ldots + l_{d-1}$, and $p$ is the number of candidate variables chosen from when generating the splitting rule in a node. The threshold can be selected as the mean value for the variable distribution (8). If the minimal depth measure of a variable is less than the threshold, it is treated as important, otherwise as noise. The minimal depth measure is evaluated in (Ishwaran et al., ) and (Ishwaran et al., ) where it is shown to be successful for finding important variables in problems with few important variables and a large number of noisy ones, even when the data samples are relatively small.

4. VARIABLE DEPTH DISTRIBUTION METHOD

VIMP and minimal depth are the standard methods for variable selection in RSF models. However, there are problems connected with them. If many correlated variables are present in the database, as expected in our case, variables share VIMP between each other and it could happen that important variables will be lost if a VIMP-based variable selection procedure is applied. The second reason why VIMP can have problems is that it is associated with error rate. As illustrated in Section 2, low error rate does not always correspond to good prognostic performance.
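To make the minimal depth threshold from Section 3.2 concrete, the following sketch evaluates distribution (8) for a given tree shape and takes its mean. It assumes the form of (8) with $L_d$ counting the nodes above level $d$, and it assigns the remaining probability mass to depth $D(T)$ for a variable that is never split on; that tail handling, like the function names, is an assumption made for this illustration.

```python
def minimal_depth_null(nodes_per_level, p):
    """Minimal depth distribution of a noisy variable, following (8).

    nodes_per_level : [l_0, l_1, ..., l_{D-1}], number of nodes at each depth
    p               : number of candidate variables per split
    Returns probabilities for d = 0..D-1 plus the leftover mass at depth D.
    """
    q = 1.0 - 1.0 / p
    probs, nodes_above = [], 0  # nodes_above plays the role of L_d
    for l_d in nodes_per_level:
        probs.append(q ** nodes_above * (1.0 - q ** l_d))
        nodes_above += l_d
    probs.append(1.0 - sum(probs))  # assumption: never selected -> depth D
    return probs


def mean_threshold(probs):
    """Mean of the distribution, used as the importance threshold."""
    return sum(d * pr for d, pr in enumerate(probs))
```

For a tree with 1, 2, and 4 nodes on its first three levels and `p = 10`, the first entries are 0.1 and 0.171, so most of the mass of a noisy variable sits at deeper levels, which is the behaviour the threshold relies on.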
Minimal depth does not depend on error rate and the variable selection approach has shown good results when applied to different databases in medical applications, see (Ishwaran et al., ) and (Ishwaran et al., ). However, it will be shown later that it did not work well when applied to the vehicle database. Taking into account the aforementioned reasons, a new method for variable selection called Variable Depth Distribution (VDD) is proposed.

The VIMP and minimal depth measures are applied to the vehicle database and the results are shown in Figure 2 and Figure 3, respectively. As a reference, three variables, only containing Gaussian noise, are included in the data set. The computed VIMP is positive for half of the variables, but the VIMP curve starts to flatten out after the first 3 variables with highest VIMP, indicating that approximately % of the variables are expected to be relevant for battery lifetime prediction.

Figure 2. VIMP of variables in vehicle database sorted in ascending order. (Axes: VIMP versus variable.)

Figure 3. Minimal depth analysis of vehicle data. Black crosses correspond to 3 variables with highest VIMP and red triangles to added noise variables. Red dashed line is a threshold. Noisy variables should be located to the right of the red dashed line. (Axes: 2nd order minimal depth versus mean minimal depth.)

The result of the minimal depth measure is presented in Figure 3, where the x axis is the mean minimal depth, i.e., the depth of the first appearance of the variable in the forest, the y axis is the second-order minimal depth, i.e., the depth of the second appearance of the variable in the forest, and the red dashed line is the threshold computed based on (8). The figure shows that most variables are identified as important, including the added noisy variables. Since the noisy variables are identified as important, it is an indication that minimal depth is not a suitable method for the vehicle database.

Due to the limitations of VIMP, as discussed above, and the evaluation of the minimal depth measure in Figure 3, a new measure of variable importance is proposed.
The principle of the proposed measure is similar to minimal depth, but considers the probability of a splitting variable being used at different levels of a tree. An important variable should be used more often as a splitting variable at lower tree levels, close to the root, and less at higher tree levels, as illustrated in Figure 4. If noisy variables are selected as splitting variables the probability should not change as much between different tree levels, and may increase slightly for higher levels.

Figure 4. Illustrative example of the probability that a given splitting variable is used in a node at different tree levels. (Vertical axis: P(v is used as splitting variable in node at level l); horizontal axis: tree level; curves: important variable, noisy variable.)

Figure 5 plot labels: vertical axis $P_v(d)$, horizontal axis tree level; legend: Kitchen equipment, BatVolt, StartMotorTime, Road slope, Noise.

Let $d = 1, 2, \ldots, \max(D(T))$, where $D(T)$ is the tree depth, be all possible tree levels in an RSF and let $v \in \nu$ be a splitting variable. Consider two random events, namely, choosing a random level $d$ in a tree and picking a variable $v$ as splitting variable in a tree. The first event is similar to the problem of drawing one ball from a box of numbered balls. First, define $P(v, d)$ which describes the joint probability that $v$ is selected as a splitting variable in a node at tree level $d$. Then, according to Bayes' rule

$$P(d \mid v) = \frac{P(v \mid d) P(d)}{P(v)} \qquad (9)$$

where $P(v \mid d)$ denotes the conditional probability that $v$ is selected as a splitting variable in a node given tree level $d$. The probability $P(d)$ is a prior probability of selecting a specific level in a tree, independent of splitting variable, and $P(v)$ is the probability of selecting $v$ as a splitting variable for the whole tree. It is assumed that there is no prior knowledge of $P(d)$; therefore, the probability is set equal for all levels, i.e., $P(d) = 1 / \max(D(T))$, $\forall d$. The conditional probability $P(d \mid v)$ can be interpreted as the posterior probability of selecting a tree level given that $v$ is used as a splitting variable.

The posterior distribution (9) is here considered a relevant measure of the importance of the splitting variable $v$ in the RSF. The measure avoids the problem that, for example, VIMP has, where the importance will be shared between the correlated variables. This is because (9) considers the probability of selecting different tree levels conditioned on a splitting variable being selected, and does not depend on the probability of selecting $v$, which is reduced if variables are correlated. The conditional probability (9) will be used as a variable importance measure.
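Under the uniform prior assumed above, the normalization in (9) can be written out explicitly. The short derivation below is a sketch; $D_{\max}$ is shorthand introduced here for $\max(D(T))$ and is not notation from the text.

$$P(v) = \sum_{k=1}^{D_{\max}} P(v \mid k) P(k) = \frac{1}{D_{\max}} \sum_{k=1}^{D_{\max}} P(v \mid k)$$

$$P(d \mid v) = \frac{P(v \mid d) \cdot \frac{1}{D_{\max}}}{\frac{1}{D_{\max}} \sum_{k=1}^{D_{\max}} P(v \mid k)} = \frac{P(v \mid d)}{\sum_{k=1}^{D_{\max}} P(v \mid k)}$$

The prior cancels, so the posterior over levels is the conditional probability $P(v \mid d)$ rescaled to sum to one over $d$.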
However, the true probability $P(d \mid v)$ is not known because it depends on many different factors, for example, the parameters used when generating the RSF. It can be noticed from (9) that $P(d \mid v) \propto P(v \mid d)$, so if $P(v \mid d)$ is known the value $P(d \mid v)$ can be found as well. After growing the forest, $P(v \mid d)$ can be estimated by first computing

$$\varphi_v(d) = \frac{\sum_{\mathcal{T}} \frac{l_{d,v}}{l_d}}{\text{\# of trees in RSF}} \qquad (10)$$

where $l_{d,v}$ is the number of nodes at level $d$ where $v$ is the splitting variable. Equation (10) is then used to compute the estimate

$$P_v(d) = \frac{\varphi_v(d)}{\sum_k \varphi_v(k)} \qquad (11)$$

which will be used when analyzing the RSF.

Figure 5. Examples of the estimated $P_v(d)$ for five different variables including one known noisy variable.

Examples of different distributions $P_v(d)$ are shown in Figure 5. Four variables from the vehicle data and one added noise variable are analyzed with respect to how they are used in an RSF generated from the vehicle fleet database. The distribution $P_v(d)$ of the noise variable is almost evenly distributed between levels 3 to 3, while variables related to battery usage, such as whether there is kitchen equipment in the truck and information about the battery voltage, are significantly skewed to the left, indicating that these variables are important for prognostics of the battery health. The starter motor time variable has a higher probability mass at lower tree levels compared to the noisy variable, but not as much as the kitchen equipment and battery voltage variables. The real data in Figure 5 resembles Figure 4 and the level of importance appears to increase with increased probability mass at lower tree levels.

Instead of comparing the whole distribution $P_v(d)$ for each variable $v$, two representative features are considered, mean and skewness,

$$\mu_d = E_{P_v}[d]$$

$$\gamma_d = E_{P_v}\left[ \left( \frac{d - \mu_d}{\sigma_d} \right)^3 \right]$$

where $\sigma_d$ is the standard deviation of $d$ under $P_v$. According to Figure 4 and Figure 5, an important variable should have a high positive value of skewness and a low value of mean. These two features can be used alone to identify which variables are important. There is one drawback with this approach, namely, it is possible that a noisy variable will be selected at random at a low level of a tree. It means that it will have values of mean and skewness like an important variable. However, for a noisy variable, this is likely to be a rare event. Therefore, introducing information about how often a variable is used as a third dimension can help to filter out noisy variables in the area where important ones should reside. Two possible candidates to express this information are:

- The probability that $v$ is used as a splitting variable in each node, $P(v)$.
- The probability that $v$ is used as a splitting variable in a tree.

The first candidate can be estimated by counting the fraction of nodes a variable is used in a tree and taking the average over the whole forest. The second feature only considers whether a variable is used at all in a tree and can be estimated by counting the number of trees in the forest where a variable is used. As shown in the remainder of the paper, the third dimension, which takes into account how often a variable is selected, can help identify important variables more efficiently than if only mean and skewness are used as in (Voronov et al., 2016).

4.1. Real data case study

The result of applying the first candidate to the vehicle data as the third dimension together with mean and skewness of (11) is shown in Figure 6 where each dot represents one variable. For comparison in the analysis, the 3 most important variables according to VIMP are highlighted as black crosses and variables rejected by minimal depth are highlighted as green triangles pointing up. Also for the analysis, the three added noisy variables are highlighted as red triangles pointing right.

Figure 6. Skewness and mean of (11) of vehicle data combined with fraction of nodes. Black crosses correspond to 3 variables with highest VIMP, green triangles are variables rejected by minimal depth, and red triangles are added noise variables. (Vertical axis: fraction of nodes used.)

Note that the 3 variables with highest VIMP have similar properties in Figure 6. They have low mean, high skewness, and are used in a relatively large fraction of the nodes. This can be interpreted as variables with high VIMP being used as splitting variables in many nodes close to the root of each tree. The noisy variables are also used in many of the nodes, but are located further away from the root node, thus having high mean and low skewness. There is also a number of variables with low mean and high skewness that are used in a smaller fraction of the nodes. Some of these variables are binary, meaning that they cannot be used as splitting variables more than once in a branch. Thus, they can be relevant for the problem but will not be used in many nodes. Note that the variables that are only used in a low fraction of nodes are variables rejected by minimal depth in Figure 3.

Compared with the results using minimal depth in Figure 3, the results in Figure 6 look promising because it is possible to find a threshold that separates most of the important variables given by VIMP from noisy ones. Here, it is assumed that there are important variables among the 3 best given by VIMP, but it does not mean that all are important. The minimal depth method maps most of the variables below the threshold, including the known noisy variables, which indicates that it has difficulties with this data set. Note that Figure 6 clearly illustrates what properties are important in this case study according to VIMP and Minimal depth.

5. ANALYSIS

Before continuing the analysis of the vehicle fleet data using the new variable selection method in the prognostic algorithm, the properties of the proposed measure in Section 4 are further analyzed. As mentioned in Section 4, there is no knowledge of which variables are important in the vehicle database. There is an intuition that some of them could be informative, but it is not clear how many they are and what their influence is on the battery hazard rate.

In this section, two case studies are performed, namely, understanding the properties of the VDD method in a simulated environment and how to select important variables using an ad-hoc threshold based on simulations. First, a simple model is considered where only one important variable influences the life of the battery. Then, another example with a large number of correlated variables is considered. A third example using the simulated environment shows when the VDD method can be more advantageous than VIMP.
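The estimates (10) and (11) and the features built on them in Section 4 can be sketched in a few lines. The forest representation is an assumption made for this illustration: each tree is a list of `(level, variable)` pairs, one per splitting node, with levels counted from 1 at the root, and $l_d$ is taken as the number of splitting nodes at level $d$. All function names are hypothetical.

```python
from collections import Counter


def depth_distribution(forest, variable, max_level):
    """Estimate P_v(d) for d = 1..max_level following (10) and (11)."""
    phi = [0.0] * max_level
    for tree in forest:
        nodes_at = Counter(level for level, _ in tree)
        splits_at = Counter(level for level, var in tree if var == variable)
        for d in range(1, max_level + 1):
            if nodes_at[d]:
                phi[d - 1] += splits_at[d] / nodes_at[d]  # l_{d,v} / l_d
    phi = [x / len(forest) for x in phi]  # average over trees, as in (10)
    total = sum(phi)
    if total == 0:
        raise ValueError("variable is never used as a splitting variable")
    return [x / total for x in phi]  # normalization, as in (11)


def mean_and_skewness(p_v):
    """Mean and skewness of the tree level under the distribution p_v."""
    levels = range(1, len(p_v) + 1)
    mu = sum(d * p for d, p in zip(levels, p_v))
    var = sum((d - mu) ** 2 * p for d, p in zip(levels, p_v))
    if var == 0:
        return mu, 0.0  # degenerate distribution: skewness set to zero
    sigma = var ** 0.5
    gamma = sum(((d - mu) / sigma) ** 3 * p for d, p in zip(levels, p_v))
    return mu, gamma


def trees_using(forest, variable):
    """Number of trees in which the variable is used at least once."""
    return sum(any(var == variable for _, var in tree) for tree in forest)
```

On a toy forest in which one variable splits mostly at the root and another only at levels 2 and 3, the first gets the lower mean and the higher skewness, which matches the qualitative picture in Figure 4.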
Finally, the VDD method is applied to the vehicle fleet database where a set of important variables is selected using the proposed methodology.

5.1. Case study in simulated environment

To analyze the properties of the measure discussed in Section 4, simulated battery failure data is generated which should resemble the general characteristics of the real vehicle database. Similar to the example from Section 2, it is assumed that the average battery lives for years, which is defined by a constant hazard rate $h_0$. One important variable $v$ changes the hazard rate $h_0$ by a factor $\bar{h}$ defined as

$$\bar{h} = \begin{cases} \;\;, & \text{if } v = 1 \\ \;.\;, & \text{if } v = 2 \\ \;.\;, & \text{if } v = 3 \\ \;.9, & \text{if } v = 4 \\ 3.4, & \text{if } v = 5 \end{cases} \qquad (12)$$

Thus, the hazard rate for a randomly generated vehicle would be $h_0 \bar{h}$. After generating hazard rates for all vehicles, simulated battery lifetimes are generated by sampling from an exponential distribution with $\mu = 1 / (h_0 \bar{h})$. Censoring is done by sampling censoring times from a gamma distribution, with shape parameter k = 7 h and scale parameter θ = , and comparing the obtained time values with the failure times. If the battery lifetime is less than the censoring time the battery experienced failure, otherwise it is censored. The selected gamma distribution gives a censoring rate of approximately 8 percent, which is similar to the vehicle database.

In the first example, data from vehicles is generated and one hundred noisy variables are added to simulate non-important variables. Half of them are normally distributed with zero mean and unit variance and the other half are discrete uniformly distributed numbers from to. The result of applying the proposed method is shown in Figure 7. The known important feature is highlighted as a black cross and noisy variables are shown as blue dots. Figures 8 and 9 show the results for the same problem, but using VIMP and minimal depth respectively. Using VIMP, it is easy to identify the important variable; therefore, VIMP and the method proposed in the paper give similar results in this case.

Figure 7. Simulated data from generated vehicles with censoring rate 8 percent. The important variable is marked with a cross. (Vertical axis: number of trees used.)

Figure 8. Computed VIMP of simulated data from generated vehicles with censoring rate 8 percent. The important variable is marked with a cross. (Axes: VIMP versus variables.)

In Figure 9, the red dashed line is the threshold that separates important and non-important variables according to (Ishwaran et al., ). Variables to the left of the threshold should be important and variables to the right should not. Figure 9 shows that the specific threshold is not able to distinguish important variables in this case.
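The data generation described in this section can be sketched as follows. The structure follows the text (exponential lifetimes with rate $h_0 \bar{h}$, gamma-distributed censoring times, and a mix of Gaussian and discrete uniform noise variables), but every numeric value, the range of the discrete noise, and all names are hypothetical placeholders rather than the settings used in the study.

```python
import random


def simulate_fleet(n_vehicles, h0, factors, n_noise, shape, scale, seed=0):
    """Generate censored battery lifetimes with one informative variable v.

    factors : dict mapping each value of v to its hazard factor (placeholder values)
    shape, scale : gamma parameters of the censoring time (placeholder values)
    """
    rng = random.Random(seed)
    fleet = []
    for _ in range(n_vehicles):
        v = rng.choice(sorted(factors))
        lifetime = rng.expovariate(h0 * factors[v])  # mean 1 / (h0 * factor)
        censor_time = rng.gammavariate(shape, scale)
        failed = lifetime < censor_time  # otherwise the vehicle is censored
        # Half Gaussian noise, half discrete uniform noise (range 1..5 is assumed).
        noise = [rng.gauss(0.0, 1.0) if k % 2 == 0 else rng.randint(1, 5)
                 for k in range(n_noise)]
        fleet.append({"v": v, "time": min(lifetime, censor_time),
                      "failed": failed, "noise": noise})
    return fleet


def censoring_rate(fleet):
    """Fraction of vehicles without an observed failure."""
    return sum(not row["failed"] for row in fleet) / len(fleet)
```

Tuning `shape` and `scale` while watching `censoring_rate` is one simple way to match a target censoring level before growing a forest on the simulated data.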
However, Figure 9 indicates that it is possible to manually select a threshold that could do that.

Figure 9. Minimal depth of simulated data from generated vehicles with censoring rate 8 percent. The important variable is marked with a cross. (Axes: 2nd order minimal depth versus mean minimal depth.)

When using VIMP, correlated variables will share importance. Therefore, there is a risk that they will be missed when choosing a set of important variables since their individual importance will be low. In the proposed method, mean and skewness of strongly correlated variables should be similar to each other, because they should be chosen in a tree at the same levels. To illustrate this, correlated variables to the important one from the previous example are added to the simulated database. The number of vehicles in the simulated database is kept unchanged, as well as the number of noisy variables and the censoring rate. Results are presented in Figures 10-12. Note that the gap between important and non-important variables using VIMP has almost vanished compared to the previous example in Figure 8. At the same time, mean and skewness of the important variables in Figure 10 are similar to the single variable case in Figure 7. The main difference is that the number of trees where each variable is chosen has decreased. The minimal depth approach fails in this case and treats all important variables as non-important, see Figure 11, which is consistent with the observation in Figure 6.

Figure 10. Simulated data from generated vehicles with censoring rate 8 percent. The important strongly correlated variables are marked as crosses. (Vertical axis: number of trees used.)

Figure 11. Minimal depth of simulated data from generated vehicles with censoring rate 8 percent. The important strongly correlated variables are marked as crosses. (Axes: 2nd order minimal depth versus mean minimal depth.)

Figure 12. Computed VIMP for simulated data from generated vehicles with censoring rate 8 percent. The important strongly correlated variables are marked as crosses. (Axes: VIMP versus variables.)

The proposed VDD method, as can be seen above, does not suffer from the problems with correlated variables that VIMP does. An example showing why the VDD method could be more advantageous than VIMP in some situations is presented below. The case of one important variable and correlated is considered. The number of vehicles was reduced to but keeping the censoring rate unchanged. The number of noisy variables is also increased to 4, equally split between discrete and continuous noise. Results are shown in Figure 13 - Figure 15. Note that the added third dimension helps to separate important variables from noise in Figure 13. VIMP performs worse than VDD, see Figure 14, where the level of importance for some noisy variables is higher than for important ones. Minimal depth still has problems identifying the important variables as shown in Figure 15.

Figure 13. Simulated data from generated vehicles with strongly correlated important variables, 4 noisy variables, and censoring rate 8 percent. Important variables are highlighted with crosses. (Vertical axis: number of trees used.)

5.2. Strategy for variable selection

As shown above, it is possible to set up a threshold that separates important variables from noisy ones; however, it is not straightforward. Further studies are required to understand how information contained in the three dimensions could be used to build a consistent and automatic algorithm for variable selection. However, it is possible, using the results in the paper and experience from simulated data, to suggest an ad-hoc strategy.
Variables from the vehicle database are plotted in Figure 16, where the number of trees in which a variable is used serves as the third dimension. Important variables should be used in most of the trees. Therefore, selecting a threshold that sorts out variables that are not used in many trees, for example 8, should give a first set of candidates of important variables. It could be the case that important variables are used less if there are many correlated variables; however, in that case it is expected that mean and skewness would be similar for those variables, see Section 5.1. Then, there should be variables that are grouped in the mean-skewness plane, which is not observed for the variables with values of number of trees less than 8. Therefore, it is assumed that there are no important variables in that area.

It was shown in Section 5.1 that for some difficult cases, noisy variables are used as splitting variables more often than important variables, see Figure 13. Setting up a threshold with the aforementioned strategy will not work for that case. This situation is not considered in this case study, but a more general strategy for selecting the threshold is required for a final version of the variable selection algorithm.
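The ad-hoc strategy of this section, a cut on the number of trees followed by a cut in the mean-skewness plane, can be sketched as a two-stage filter. The rectangular cut in the plane, the function name, and every threshold value in the usage are assumptions for illustration; in the study the thresholds are chosen manually from the plots.

```python
def select_variables(features, min_trees, max_mean, min_skewness):
    """Two-stage ad-hoc selection of important variables.

    features : dict mapping variable name -> (mean, skewness, n_trees)
    Stage 1 keeps variables used in at least `min_trees` trees.
    Stage 2 keeps candidates with low mean and high skewness
    (a simple rectangular cut, which is an assumption).
    """
    candidates = {name: f for name, f in features.items() if f[2] >= min_trees}
    return sorted(name for name, (mu, gamma, _) in candidates.items()
                  if mu <= max_mean and gamma >= min_skewness)
```

For example, with hypothetical features `{"battery_voltage": (6.0, 1.8, 950), "noise_1": (14.0, 0.1, 980), "rare_flag": (5.0, 2.0, 120)}`, calling `select_variables(features, min_trees=800, max_mean=10.0, min_skewness=1.0)` returns `["battery_voltage"]`: the rarely used variable is dropped in the first stage and the noise variable in the second.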

Figure 4. Computed VIMP of simulated data from generated vehicles with strongly correlated important variables, 4 noisy variables, and censoring rate 8 percent. Important variables are highlighted with crosses. (Axes: variables and VIMP.)

Figure. Minimal depth of simulated data from generated vehicles with strongly correlated important variables, 4 noisy variables, and censoring rate 8 percent. Important variables are highlighted with crosses. (Axes: mean minimal depth and 2nd order minimal depth.)

Figure 6. Setting up the threshold for the vehicle database. The x and y axes are the two dimensions defined earlier, and the z axis is the number of trees in the forest in which the variable was chosen. Red points are candidates for important variables, blue dots are noisy variables, and the gray plane is the threshold value.

Figure 7. Setting up the threshold for the vehicle database. The x and y axes are the two dimensions defined earlier. Red points correspond to important variables, blue dots to noisy ones.

The second step is to project the candidate important variables into the plane of the remaining two dimensions and to set up a new threshold to remove noisy variables. This step is illustrated in Figure 7. Important variables should have a high positive value in one of these dimensions and a low value in the other. The threshold is manually selected to reject the cloud of variables which are treated as noisy. This step is similar to the approach in (Voronov et al., 2016). However, the number of variables considered important is smaller than in the previous paper because the two-dimensional space is augmented with the extra dimension. The methodology for variable selection can be summarized in the following steps:

1. Set up a threshold in the number-of-trees dimension to filter out noisy variables which are seldom used.
2. Project the remaining variables into the plane of the other two dimensions and set up a threshold that distinguishes important variables as the subset of variables with a high positive value in one dimension and a low value in the other.

6. CASE STUDY: BATTERY FAILURE PROGNOSTICS

Using the manually chosen thresholds as described in the previous section and demonstrated in Figures 6-7, 34 of the variables are selected and treated as important. The performance of the RSF using the reduced set of variables is compared to that of the RSF using all variables. The performance of the generated RSF models is evaluated using the error rate. However, as discussed in an earlier section, the error rate is not an optimal measure, since the two models shown there achieve similar error rates while their prediction quality is significantly different. For both variable sets, the 34 selected variables and all variables, an RSF is generated with the same number of trees and the same minimal terminal node size. The error rate for the reduced set, .77, is comparable in magnitude to that for the case with all variables. It is worth emphasizing that one node size is used here for growing the forest for predictive purposes and another for variable selection.
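To make the error rate used above concrete, the sketch below assumes that it is defined as one minus Harrell's concordance index (Harrell, Califf, Pryor, Lee, & Rosati, 1982), which is the usual choice for random survival forests. The function names, the input layout, and the use of a generic `risk` score (higher meaning a shorter predicted lifetime) are assumptions made for illustration, and pairs with tied times are skipped for simplicity.

```python
from itertools import combinations


def concordance_index(times, events, risk):
    """Harrell's concordance index for right-censored data.

    times  : observed time for each vehicle (failure or censoring time)
    events : 1 if a failure was observed, 0 if the vehicle is censored
    risk   : predicted risk score, higher means shorter expected lifetime
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        short, long_ = (i, j) if times[i] < times[j] else (j, i)
        if not events[short]:
            continue  # the shorter time is censored: ordering is unknown
        comparable += 1
        if risk[short] > risk[long_]:
            concordant += 1.0
        elif risk[short] == risk[long_]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def error_rate(times, events, risk):
    """Error rate taken as 1 - C (assumed definition)."""
    return 1.0 - concordance_index(times, events, risk)


# Toy data: four vehicles, the third one is censored.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
print(error_rate(times, events, risk=[4.0, 3.0, 2.0, 1.0]))
```

Because the index only checks whether pairs of vehicles are ranked in the right order, two models can reach similar error rates while producing quite different lifetime prediction functions, which is consistent with the caveat stated above.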

For the analysis, vehicles with battery failures and vehicles without are selected randomly as validation data. These vehicles are then used as inputs to the RSF to compute the lifetime prediction functions B^V(t; t_0), and the results are shown in Figures 8 and 9 for vehicles with battery problems and healthy ones, respectively.

Figure 8. Lifetime prediction function B^V(t; t_0) for vehicles with battery failures. (a) RSF using all variables. (b) RSF using selected subset of variables.

Figure 9. Lifetime prediction function B^V(t; t_0) for censored vehicles. (a) RSF using all variables. (b) RSF using selected subset of variables.

In Figure 8 (b), vehicles are clearly more grouped compared to Figure 8 (a), where most vehicles have a faster decaying lifetime prediction. The result seems reasonable, since the battery lifetimes of the grouped vehicles with fast decaying lifetime prediction functions B^V(t; t_0) in Figure 8 (b) lie in a range up to 3 time units, which is quite a long life for batteries. Therefore, fast decaying lifetime prediction functions should be expected for those vehicles. The battery lifetime of the vehicle corresponding to the purple curve in Figure 8 is about .4 time units. This vehicle failed early, and the value of its lifetime function would not allow the failure to be predicted, but it is possible that the cause of the battery problem is not common among the vehicles in the database. In general, vehicles that lived longer are well separated from the ones that lived shorter. Of course, this cannot be used as an evaluation of the method, only as a positive sign. Note that the lifetime function of the vehicle corresponding to the green curve in Figure 8 has changed significantly between the two figures. This vehicle has not yet failed but, given how long it has operated, should be likely to fail soon. That is why its lifetime prediction function decays faster than for the other vehicles.
It should be noted that a measure for assessing the predictive performance of an RSF is needed and, when it is available, more can be said about the influence of variable selection on the prognostic capabilities of the model.

7. CONCLUSIONS

A method for variable selection and variable importance analysis using random survival forests is proposed and analyzed. The main motivating factors for the approach are 1) a small number of informative variables, and 2) highly correlated variables in the data set. Analyzing the feature space in Figure 6 indicates that it is possible to distinguish how VIMP and minimal depth determine which variables are considered important, and this should be analyzed further. The proposed method is applied to the industrially relevant problem of heavy-duty vehicle battery failure prognostics and evaluated using real vehicle fleet data and simulated data. Simulated data shows that important variables can be distinguished from noisy variables even in difficult cases. The case study using real data shows that a prognosis model using the selected subset of the available variables achieves an error rate comparable to that obtained using all variables.

ACKNOWLEDGMENT

The authors acknowledge Scania and VINNOVA (Swedish Governmental Agency for Innovation Systems) for sponsorship of this work.

REFERENCES

Cox, D. R., & Oakes, D. (1984). Analysis of survival data (Vol. 21). CRC Press.
Daigle, M., & Goebel, K. (2011). A model-based prognostics approach applied to pneumatic valves. International Journal of Prognostics and Health Management, 2(color), 84.

Frisk, E., & Krysander, M. (2015). Treatment of accumulative variables in data-driven prognostics of lead-acid batteries. In Proceedings of IFAC Safeprocess. Paris, France.
Frisk, E., Krysander, M., & Larsson, E. (2014). Data-driven lead-acid battery prognostics using random survival forests. In Proceedings of the Annual Conference of the Prognostics and Health Management Society. Fort Worth, Texas, USA.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157-1182.
Harrell, F., Califf, R., Pryor, D., Lee, K., & Rosati, R. (1982). Evaluating the yield of medical tests. JAMA, 247(18), 2543-2546.
Ishwaran, H., Kogalur, U., Blackstone, E., & Lauer, M. (2008). Random survival forests. The Annals of Applied Statistics, 841-860.
Ishwaran, H., Kogalur, U., Chen, X., & Minn, A. (2011). Random survival forests for high-dimensional data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(1), 115-132.
Ishwaran, H., Kogalur, U., Gorodeski, E., Minn, A., & Lauer, M. (2010). High-dimensional variable selection for survival data. Journal of the American Statistical Association, 105(489), 205-217.
Si, X., Wang, W., Hu, C., & Zhou, D. (2011). Remaining useful life estimation - a review on the statistical data driven approaches. European Journal of Operational Research, 213(1), 1-14.
Voronov, S., Jung, D., & Frisk, E. (2016). Heavy-duty truck battery failure prognostics using random survival forests. In Proceedings of Advances in Automotive Control (Accepted for publication). Norrköping, Sweden.