Software reliability modeling for test stopping decisions - binomial approaches


Lisa Gustafsson
Department of Computer Science
Lund University, Faculty of Engineering
September 11, 2010

Contact information

Author: Lisa Gustafsson

Supervisor: Per Runeson, Lund University, Department of Computer Science

Examiner: Martin Höst, Lund University, Department of Computer Science

Abstract

When developing software, part of the process is testing the software, i.e. evaluating a piece of software by executing test cases. The purpose of this is to reveal defects and to evaluate, for instance, reliability. Reliability is how well the software works under stated conditions for a specified period of time. The decision on when the software is good enough to stop testing is based on the quality of the software, which can be expressed in terms of reliability. This master's thesis has been done in collaboration with a company that produces embedded software devices and needed help with analyzing its test results. In this company there are several levels of testing, and at every level a decision about when to stop the testing needs to be taken. The company requested a more statistically correct way of analyzing the test results than the way used today. Five methods that investigate software reliability by analyzing test results have therefore been developed and analyzed in this master's thesis. The first method parameterizes a binomial distribution to the data, and probabilities of interest are then calculated from this estimated distribution. Two other methods are based on the reliability growth models Goel-Okumoto and Jelinski-Moranda; these two methods make it possible to calculate useful reliability measurements. A fourth method builds on the method used at the company. As a fifth and last method, the results from two different levels of testing are compared. The aim of this is to find connections between them that make it possible to say something about the result at the second level by analyzing the result from the first level. After analyzing the results, the recommendation to the company is to use the method that is based on their own analysis.

Acknowledgements

I would like to thank my supervisor, Per Runeson, for all the help he has provided me throughout this master's thesis. He has been a great source of inspiration and a very good support the whole time. I would also like to thank my examiner Martin Höst for sharing his ideas concerning the modeling. To the people at the company the data for this master's thesis comes from: thank you for being generous with the data and for your ideas concerning the modeling. My family and my closest friends, thank you for your never-ending support.

Contents

1 Introduction
  1.1 Background
  1.2 Outline
2 Background and related work
  2.1 Introduction to software testing
  2.2 Operational profile
  2.3 Software reliability
  2.4 Reliability growth models
  2.5 Reliability measurements
  2.6 Parallelization of reliability testing
  2.7 Applications of SRE
3 The situation at the company today
4 Methodology
  4.1 Data collection
  4.2 Modeling
  4.3 Threats against validity
5 Results
  5.1 Parameterizing a Binomial distribution
  5.2 Goel-Okumoto estimation
  5.3 Jelinski-Moranda estimation
  5.4 Combinations of duts from different racks
  5.5 Comparison between product- and live testing
  5.6 Validation
6 Discussion
7 Conclusions
8 Future work

A Histograms
  A.1 The histograms
B Matlab code
  B.1 The Matlab code

Chapter 1

Introduction

1.1 Background

The main idea behind this master's thesis was to monitor the many feedback cycles that are involved in software project management, especially in the testing part. This master's thesis has been done in collaboration with a company that produces embedded software devices and needed help to analyze its test results. In this company there are several levels of software testing, spread out over the lifetime of the project. At every level of testing a decision needs to be made whether the software needs further testing or is good enough to stop the testing. The company requested a more statistically correct way of analyzing the test results than the way used now.

1.2 Outline

In chapter 2 a theoretical background is given, to provide readers who are not familiar with this area with a greater understanding. It also contains work that is related to this master's thesis. In chapter 3 the situation at the company today is presented. This includes a short presentation of the company together with information about the test environment and test analysis at the company. In chapter 4 the modeling for this master's thesis is presented. It includes what has been done to create the different methods and a description of all of them. In chapter 5 the results from the methods in chapter 4 are presented. In chapter 6 the result is discussed. In chapter 7 conclusions are drawn from the result and the discussion. In chapter 8 future work is recommended.

Chapter 2

Theoretical background and related work

2.1 Introduction to software testing

The software development process includes testing, which evaluates a piece of software by executing test cases. The purpose of this is to reveal defects and evaluate reliability, security, usability and correctness of the software [1]. A test case contains information about test inputs, execution conditions and expected outputs. A group of test cases that are executed together is called a test suite [2, pg. 22]. Software testing is usually executed at four levels: unit test, integration test, system test and acceptance test [2, pg. 133]. In this master's thesis the focus is on system and acceptance test. System testing takes place when all the components have been successfully integrated and the focus is on evaluating the quality [2, pg. 134]. The goal of acceptance testing is to show the clients that their requirements are met [2, pg. 135]. The company under study is testing the product to assure quality, and the management wants to make sure that the testing resembles the real usage of the product as closely as possible; therefore an operational profile is used.

2.2 Operational profile

When applying quality control to a product, operational profiles are used to sample the input space according to usage patterns. A definition of an operational profile is: "An operational profile is a quantitative characterization of how a software system will be used in its intended environment" [2, pg. 399]. The company has been given a defined test suite from a head customer; this is the operational profile for the company.

2.3 Software reliability

Software reliability is "the capability of the software product to maintain a specified level of performance when used under specified conditions" [1]. More simply put, software reliability is how well the software works under stated conditions for a specified period of time. For testers and test managers a key issue is the determination of when the software is reliable enough to release, i.e. when testing can be stopped. The decision of when to stop the testing is critical: stopping too soon could allow high-severity defects to remain in shipped software, lead to customer dissatisfaction, high costs of repair to operational software, etc. Stopping too late can be a waste of resources, delay time-to-market and increase costs [2, pg. 415]. The decision is based on the quality of the software, which can be expressed in terms of reliability.

2.4 Reliability growth models

The decision whether to stop or continue testing can be based on results from reliability growth models. Reliability growth models are based on the relation between successive failure rates and describe the behavior between two failures [3, pg. 59]. Common to these models is that they assume that the reliability increases with the versions of a software product. There are two different classes of reliability growth models: concave and s-shaped. The concave models assume a pattern of decreasing failure rate, and the s-shaped models assume that the failure detection rate is smaller in the beginning and at the end of the process than in the middle of the process [4, pg. 164]. Two concave reliability growth models are Goel-Okumoto and Jelinski-Moranda, and since they are the most common they are used in this master's thesis.

The Goel-Okumoto model: In this model Goel and Okumoto assume that a software system is subject to failures(1) at random times caused by faults(2) present in the system. Letting M(t) be the cumulative number of failures observed at time t, they proposed that M(t) can be modeled as a nonhomogeneous Poisson process, i.e., as a Poisson process with a time-dependent failure rate. A special feature of this model is that the number of faults to be detected is treated as a random variable; for the other reliability growth models the number of faults is a fixed unknown constant. The observed values of the number of faults depend on the test and other environmental factors [5, pg. 1415]. For this model the following assumptions are made:

(1) A failure is the inability of a software system or component to perform its required functions within specified performance requirements [2, pg. 20].
(2) A fault is introduced into the software as a result of an error. It is an anomaly in the software that may cause it to behave incorrectly, and not according to its specification. An error is a mistake, misconception, or misunderstanding on the part of a software developer [2, pg. 20].

- Every fault has the same chance of being encountered within a severity class as any other fault in that class.
- The failures, when the faults are detected, are independent.
- The faults are corrected at once when they are detected, and no new faults are introduced.
- The cumulative number of failures at time t, M(t), follows a Poisson process with mean value function µ(t). The mean value function is such that the expected number of fault occurrences for any time t to t + Δt is proportional to the expected number of undetected faults at time t.
- The number of faults (f_1, f_2, ..., f_n) detected in each of the respective intervals [(t_0 = 0, t_1), (t_1, t_2), ..., (t_{i-1}, t_i), ..., (t_{n-1}, t_n)] is independent for any finite collection of times t_1 < t_2 < ... < t_n.

The assumptions that the faults are corrected at once when they are detected and that no new faults are introduced mean that one fault corresponds to one failure. The data required for this model are:

- The fault counts in each of the testing intervals, i.e. the f_i's.
- The completion time of each period that the software is under observation, i.e. the t_i's.

Lyu [3, pg. 81] says that the mean value function for the cumulative number of failures is $\mu(t) = N(1 - e^{-bt})$, where N is the expected total number of faults to be detected and b is a constant. Both N and b are greater than zero. The failure intensity function is $\lambda(t) = Nbe^{-bt}$. The estimate of the accumulated number of faults detected at time t equals the mean value function µ(t). The maximum likelihood estimates of the model parameters N and b are obtained by solving the following equations:

$$\hat{N} = \frac{\sum_{i=1}^{n} f_i}{1 - e^{-\hat{b}t_n}}$$

and

$$\frac{t_n e^{-\hat{b}t_n}}{1 - e^{-\hat{b}t_n}} \sum_{i=1}^{n} f_i = \sum_{i=1}^{n} \frac{f_i \left(t_i e^{-\hat{b}t_i} - t_{i-1} e^{-\hat{b}t_{i-1}}\right)}{e^{-\hat{b}t_{i-1}} - e^{-\hat{b}t_i}}$$

The model parameters N and b can be used to predict reliability.

The Jelinski-Moranda model: The following section about the Jelinski-Moranda model is based on a book edited by Lyu [3, pg. ]. It is said that this is one of the earliest reliability growth models. In this model the elapsed time between failures is assumed to follow an exponential distribution with a parameter that is proportional to the number of remaining faults in the software. That means that the mean time between failures at time t is $\frac{1}{\phi(N - (i-1))}$, where t is a point of time between the occurrence of the (i-1):th and the i:th fault. The variable φ is a proportionality constant which stands for the intensity of each fault, and N is the total number of faults in the software from the time when the software is first observed. The total failure intensity after the (i-1):th and before the i:th failure has occurred is $\lambda_i = \phi(N - (i-1))$. As each fault is discovered and corrected, the hazard rate, i.e. λ_i, is reduced by the proportionality constant φ. This indicates that the impact of each fault removal is the same. A number of assumptions are made:

- The rate of fault detection is proportional to the current fault content of the software.
- The fault detection rate remains constant over the intervals between fault occurrences.
- A fault is corrected instantaneously without introducing new faults into the software.
- Every fault has the same chance of being encountered within a severity class as any other fault in that class.
- The failures, when the faults are detected, are independent.

Since it is assumed that a fault is corrected at once when it is detected, one fault corresponds to one failure. The required data for this model are either the times between failures or the actual times when the software failed. The times between failure occurrences are $X_i = T_i - T_{i-1}$, i = 1, ..., n. The X_i's are independent exponentially distributed random variables with mean $\frac{1}{\phi(N - (i-1))}$. The mean value function for the cumulative number of failures is $\mu(t) = N(1 - e^{-\phi t})$ and can be interpreted as the estimate of the accumulated number of faults detected at time t. The maximum likelihood estimates of φ and N are obtained by solving the following equations:

$$\hat{\phi} = \frac{n}{\hat{N}\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}(i-1)X_i}$$

and

$$\sum_{i=1}^{n}\frac{1}{\hat{N} - (i-1)} = \frac{n}{\hat{N} - \frac{1}{\sum_{i=1}^{n} X_i}\sum_{i=1}^{n}(i-1)X_i}$$

The model parameters N and φ can be used to predict reliability.

2.5 Reliability measurements

To predict reliability, there are a number of appropriate measurements. These are described below.

Time To Failure, TTF

Time to failure is a measurement used at the company where this master's thesis is done. TTF is the time that elapses between two consecutive failures.

Mean Time To Failure, MTTF

Assume that i - 1 failures have been observed; then there are i - 1 interfailure times, i.e. the times that elapsed between the failures. These times are referred to as t_1, t_2, ..., t_{i-1}, where t_1 is the elapsed time until the first failure occurs. MTTF is the average of these times [2, pg. 413]:

$$MTTF = \frac{\sum_{k=1}^{i-1} t_k}{i-1}$$

Mean Time Between Failures, MTBF

MTBF = MTTF + MTTR, where MTTR is the mean time to repair [2, pg. 414]. The situations that are treated in this master's thesis all have MTTR = 0, i.e. MTBF = MTTF. This is because error correction for software systems is assumed to take place directly after a crash, unlike error correction for hardware systems where the product often needs to be replaced on site, far from repairers and spare parts.

Detected faults so far

Detected faults so far is a measurement that is used when a model gives an estimate of the total number of faults in the product. If the ratio r = (detected faults so far) / (total amount of faults) is above a specified threshold the software is accepted and testing stops. The total amount of faults is unknown and hence needs to be predicted.

Time to next failure, TTNF

Time to next failure is an estimate that is used when a number of failures have occurred and the time for the next failure is sought. If the time to next failure is larger than a specified threshold the software is accepted and testing stops. The threshold is a logical value that differs between different types of software.

2.6 Parallelization of reliability testing

When testing is highly time demanding, it is desirable to shorten the runtime by using more devices in parallel and running a shorter time on each of them, which together corresponds to the total runtime that would have been spent on one single device.

Musa et al. [7, pg. 163] state that if copies of a software system are used with the same operational profile, the runs for these copies may be different but the probability of a given run for any copy is identical. Hence, it does not matter if execution time is experienced from one copy or many; the time periods may be grouped together [7, pg. 162]. The way the execution time is calculated is shown in Figure 2.1. The times produced by these calculations are the times between two consecutive crashes.

Figure 2.1: Calculation of execution time with parallel devices.

2.7 Applications of Software Reliability Engineering (SRE)

Software reliability has been investigated for example at AT&T and Nokia. A brief explanation of how it was done and what it was used for in these cases is given below.

AT&T

Donnelly et al. [3, pg. 219] discuss the practice of SRE at AT&T, which provides means to predict, estimate and measure the rate of failure in software. By using SRE in the context of software engineering it is for example possible to:

- Analyze, manage and improve the reliability of the software products.
- Determine when a software product is good enough to release to customers.
- Avoid unnecessarily long time to market due to over-testing.

The practice of SRE can be summarized in six steps:

1. Quantify product usage, i.e. make an operational profile.
2. Define quality requirements quantitatively with customers.

3. Employ product usage data and quality objectives.
4. Measure the reliability of reused software and acquired software components delivered by suppliers, as an acceptance requirement.
5. Track reliability during test and use this information to guide product release.
6. Monitor reliability in field operation and use the results to guide new feature introduction.

Donnelly et al. [3, pg. ] also give an example of an SRE success story at AT&T. The story is about when AT&T's International DEFINITY(R) PBX started a new quality program that included doing SRE along with other proven quality methodologies. The SRE part of the program included:

- Defining an operational profile based on customer modeling.
- Generating test cases automatically based on the frequency of use reflected in the operational profile.
- Delivering software in increments to system test with quality factor assessments.
- Employing clean-room development techniques together with feature testing based on the operational profile.
- Testing to reliability objectives.

The quality improved a lot from the previous release:

- A factor-of-10 reduction in customer-reported problems.
- A factor-of-10 reduction in program maintenance costs.
- A factor-of-2 reduction in the system test interval.
- A 30 percent reduction in the new product introduction interval.

There was a significant improvement in customer satisfaction. The reliability improvement and the sales plan pushed sales to 10 times those for the previous version. The items contributing to these successes were using reliability as a release criterion and using operational-profile-driven testing.

Nokia

Daya Perera [6, pg. ] describes the development at Nokia of a new measure, the Reliability Index, to measure and monitor product maturity and to estimate the product field failure rate during the research and development phase. As mobile phones became more complex the deviation between the predicted and actual field failure rates increased. This increased the demand for a method to estimate the field failure rate of a product before it is launched to the field. The estimate of the field failure rate is used to compare with field failure rate targets, make estimates of the budget, identify warranty support requirements, improve product competitiveness and to get an understanding of whether customer expectations are likely to be met.

When developing a product, several samples are made before the product launch. These samples are aimed at improving testability, manufacturability, performance, process capability, quality and reliability. In product testing the sample is subjected to several tests, for example reliability tests. Each test is given a weight based on the effectiveness of the test in simulating field-typical failures. The ratio of the achieved test outcome to the maximum possible test outcome, as a percentage, is named the Reliability Index, RI%. It is assumed that products with higher RI% values are more reliable than those with lower RI% values. This implies that products with higher RI% values have a lower field failure rate than those with lower RI% values, and that (100-RI)% would be proportional to the field failure rate. At Nokia, the correlation of (100-RI)% with the field failure rate was studied with regression models and the result was found encouraging. The tool was effective in predicting the reliability of products at launch, and a very good correlation between predicted and actual field failure rates was observed. It is, however, stated that the method is a way of measuring hardware reliability and that it is hard to model software.

Chapter 3

The situation at the company today

The company, which produces devices with embedded software, has a specified test suite. This test suite originates from the test suite given by a head customer, i.e. the operational profile for the product. The company has a number of different test levels, e.g. live testing, platform testing and product testing. There is a generic platform from which several products originate. Platform testing is system testing (mentioned in section 2.1) done on this generic platform, and product testing is acceptance testing (mentioned in section 2.1) done on the different products, using an operational profile. Live testing is system testing performed on pilot products with actual use, i.e. not according to an operational profile. The devices being tested at all the test levels can communicate with each other, and at the live testing level the devices can also communicate with central systems. Since the test suite is highly time demanding, parallel execution is adopted. The testbench for platform testing and product testing can be seen in Figure 3.1.

Figure 3.1: A realization of the testbench at the company.

A rack contains j Devices Under Test, duts. There is a total of i racks and i·j duts. The same test suite is executed on all of the duts over and over again. When a failure has occurred on a dut, i.e. when a device crashes, that device is not restarted.

For platform testing and product testing the reliability of the product is assessed with the following calculation:

1. For each rack, calculate the total runtime until failures have occurred on two duts.
2. Delete the lowest and highest value of the total runtimes.
3. Calculate the average of the remaining total runtimes.

If the average is above a specified threshold the company stops the testing at the current level. (A small sketch of this calculation is given at the end of this chapter.) There are two stages of acceptance testing, with corresponding criteria that need to be fulfilled before the client accepts the delivery at each stage. The testbench is similar in the two stages except that there is a larger number of duts per rack in the later stage. In the first stage j = 5 and the threshold is 400 hours. In the second stage j = 7 and the threshold is 800 hours.

The live testing is not performed in the same way as platform testing and product testing. Instead there are several duts that each run until they crash and then are restarted, independently of the other duts.

At the company, the software is continuously updated, which leads to new versions of the software. Differences between two consecutive versions can for example be corrections or new functionality. The discussion in this master's thesis will be built up around these versions. The company believes that their current method is not correct enough and therefore asks for a more statistically correct method to use. They also want to find a way to lower the runtimes, i.e. to shorten the time it takes to test the software. If this could be done in a good way, the company would get lower costs and shorter time-to-market.
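To make the stopping rule above concrete, a minimal Matlab sketch is given below. The rack runtimes and variable names are made up for illustration only; the company's actual spreadsheet calculation may differ in details such as how still-running duts are handled.

```matlab
% Sketch of the current stopping rule (steps 1-3 above).
% rackRuntimes(k) = total runtime in hours accumulated in rack k until two of
% its duts have failed, taken from the test logs (example values below).
rackRuntimes = [420 510 380 450 600];
threshold    = 400;                         % 400 hours in stage 1, 800 hours in stage 2

sortedRuntimes = sort(rackRuntimes);
trimmed        = sortedRuntimes(2:end-1);   % delete the lowest and highest value
avgRuntime     = mean(trimmed);             % average of the remaining racks

if avgRuntime > threshold
    disp('Stop testing at this level: average runtime is above the threshold.');
else
    disp('Continue testing: average runtime is below the threshold.');
end
```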

Chapter 4

Methodology

The methodology for this master's thesis consists of the following steps: data collection, building the model and validating the model [8].

4.1 Data collection

Data have been provided by the company in the form of Excel sheets. The data were sorted in racks and versions and included a lot of information. All the Excel sheets were checked, and the information that was useful and therefore saved was: the total runtime, the time when the dut crashed and, in those cases where the dut did not crash, the current runtime at the time the data was gathered. If a dut did not crash, it was either still ongoing or manually stopped.

The test level that was examined first was the platform test. Since the data from this level was very poor it was not examined further. The next level to be examined was the product testing. Here the data was informative enough to start modeling. The third test level to be a part of this master's thesis was the live testing; since there was a massive amount of data for this level, not all versions were examined. Six versions were chosen to be investigated for the live testing level: 1.0.A.0.9, 1.0.A.0.11, 1.0.A.0.12, 1.0.A.1.5, 1.0.A.1.12 and 1.0.A.

4.2 Modeling

To improve the analysis of the test result for the company, a number of methods are developed. To lower the risk of ending up with a method that does not work, it was desired to develop more than one method. In total, five methods are developed and analyzed, since five interesting and applicable areas were found. The methods that are developed and analyzed are: Parameterizing a Binomial distribution, Goel-Okumoto estimation, Jelinski-Moranda estimation, Combination of duts from different racks, and Comparison between product testing and live testing. The methods are described in the sections below.

It is assumed that the times between failures are independent, i.e. whenever the observation starts, the expected time until a failure occurs is the same. This also means that there is no dependence between the failures, i.e. the fact that one failure occurs does not impact whether the same or another failure occurs. Also, parallelization of the testing, see section 2.6, is adopted. The modeling is done in Matlab and the code can be seen in Appendix B.

4.2.1 Parameterizing a Binomial distribution

It is assumed that the number of failures in each run belongs to a binomial distribution, X, with parameters n and p, i.e. every test case fails with a probability p and there are n test cases that are run. A simplification concerning the parameter n is made since the exact information about execution time for each test case is unknown. It is assumed that each time step (hour) corresponds to one test case, i.e. the number of test cases in a run is simplified to the runtime, in hours. Since it is assumed that the number of failures in each run belongs to a binomial distribution, it is also assumed that the parameter p is constant. The parameter p is unknown and can be estimated by $\hat{p} = x/n$, where x is an observation of X. Note that the estimate of p equals 1/MTTF, as mentioned in section 2.5.

Some of the duts did not crash; instead they were stopped or still running. These duts need to be given an estimated crash time, which is done by using the estimated parameters n_previous and p_previous from the previous version. With this it is assumed that version i is at least not worse than version i-1. The number of crashes occurring in n_previous hours is obtained by drawing a random number from a binomial distribution with the parameters n_previous and p_previous. The estimated crash time is then n_previous divided by the drawn random number of crashes, plus the time until the dut was stopped.

Since parallel execution, as mentioned in section 2.6, is adopted, the final time to crash is calculated in the following way:

1. When all duts have been given a crash time (t_c), either as an actual crash time or as an estimated crash time, sort them in ascending order.
2. Calculate the time between the crashes, i.e. calculate t_i = t_c(i) - t_c(i-1).
3. Get the final time to crash by multiplying each t_i with the corresponding number of duts running; that number is the total number of duts - i + 1.

How the final time to crash should be calculated can be seen in Figure 4.1.

Figure 4.1: Calculation of final time to crash.

The total runtime, i.e. the value of the parameter n, is then the sum of the final times to crash. The estimated value of p is the number of occurred failures divided by the value of n above. When the parameters are estimated, the following calculations are made for each version as a measure of the reliability, based on the company's existing practice:

- Probability that two or fewer failures occur in 200 hours.
- Probability that two or fewer failures occur in 400 hours.
- Probability that two or fewer failures occur in 800 hours.

4.2.2 Goel-Okumoto estimation

By using the reliability growth model Goel-Okumoto, presented in section 2.4, the following is presented for each version:

- The rate of detected faults, r_{G-O} = actual number of detected faults / estimated total number of faults.
- The estimated time to when the next failure occurs.
- A plot that includes the actual number of occurred failures and the estimated total number of failures.

Since parallel execution, as mentioned in section 2.6, is adopted, the final time to crash is calculated in the following way:

1. Sort the times (t_c) given from the company in ascending order.
2. Calculate the time between the crashes, i.e. calculate t_i = t_c(i) - t_c(i-1).
3. Get the final time to crash by multiplying each t_i with the corresponding number of duts running; that number is the total number of duts - i + 1.

How the final time to crash should be calculated can also be seen in Figure 4.1. Since this model assumes that faults are corrected without introducing any new faults, the corrected faults from older versions should be taken into account when estimating parameters for a newer version. This is done by calculating the time stamp for each crash as the time stamp for the previous crash plus the final time to crash for the current crash. For each version, the time stamp for the first crash is calculated as the time stamp for the last crash in the previous version plus the final time to crash for the first crash in the current version.
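The final-time-to-crash calculation in steps 1-3 (and Figure 4.1) can be written in a few lines of Matlab. The sketch below is illustrative only: the crash times are made up, and the actual implementation used in this thesis is the code in Appendix B.

```matlab
% Sketch of the final-time-to-crash calculation for parallel duts.
crashTimes = [120 95 300 210 180 260];   % crash time t_c per dut, in hours (example)

tc = sort(crashTimes);                   % step 1: sort in ascending order
dt = diff([0 tc]);                       % step 2: t_i = t_c(i) - t_c(i-1), with t_c(0) = 0
nDuts   = numel(tc);
running = nDuts:-1:1;                    % number of duts still running in each interval
finalTimeToCrash = dt .* running;        % step 3: scale by the number of running duts

totalRuntime = sum(finalTimeToCrash);    % total execution time over all duts
                                         % (the parameter n in section 4.2.1)
```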

The rate of detected faults

For the Goel-Okumoto model the ratio r_{G-O}, describing how large a part of the faults has been detected, is calculated by equation (4.1):

$$r_{G-O} = \frac{n}{N} \qquad (4.1)$$

where n is the actual number of detected faults and N is the estimated total number of faults.

The estimated time to when the next failure occurs, TTNF

To find the time when the next failure occurs with the Goel-Okumoto model, the following steps are performed:

1. Set µ(t) = n + 1.
2. Insert the estimated values of N and b.
3. Solve for t.

The time when the next failure occurs is given by:

$$t_{n+1} = -\frac{1}{b}\ln\left(1 - \frac{n+1}{N}\right)$$

The time to next failure is then TTNF = t_{n+1} - t_n, where t_n is the time when the most recent failure occurred.

4.2.3 Jelinski-Moranda estimation

By using the reliability growth model Jelinski-Moranda, presented in section 2.4, the following is presented for each version:

- The rate of detected faults, r_{J-M} = number of detected faults / estimated total number of faults.
- The estimated time to when the next failure occurs, TTNF.
- A plot that includes the actual number of occurred failures and the total number of failures.

Since parallel execution, as mentioned in section 2.6, is adopted, the final time to crash is calculated in the following way:

1. Sort the times (t_c) given from the company in ascending order.
2. Calculate the time between the crashes, i.e. calculate t_i = t_c(i) - t_c(i-1).
3. Get the final time to crash by multiplying each t_i with the corresponding number of duts running; that number is the total number of duts - i + 1.

The final times to crash are the input to this model, i.e. the times between two consecutive crashes. How the final time to crash should be calculated can also be seen in Figure 4.1. Since the following assumption is made: "A fault is corrected instantaneously without introducing new faults into the software", corrected faults from older versions should be taken into account when estimating parameters for a newer version. This is done by using all the previous versions' final times to crash together with the current version's final times to crash as input to the model. The final times to crash are grouped in versions: the first version's final times to crash are given first in the input, the second version's final times to crash are given second in the input, and so on.

The rate of detected faults

For the Jelinski-Moranda model the ratio r_{J-M}, describing how large a part of the faults has been detected, is calculated by equation (4.2):

$$r_{J-M} = \frac{n}{N} \qquad (4.2)$$

where n is the actual number of detected faults and N is the estimated total number of faults.

The estimated time to when the next failure occurs

To find the estimated time to when the next failure occurs with the Jelinski-Moranda model, compute:

$$TTNF = \frac{1}{\lambda_n}$$

4.2.4 Combination of duts from different racks

This method consists of the following steps, performed for each version:

1. Combine all possible duts used in the version, to find all the possible racks.
2. Observe the distribution of crash times for these combinations.

The idea behind this method is that it should not matter which duts are combined to form a rack. Therefore all possible combinations of racks are examined. Since there are two stages of product testing (as mentioned in chapter 3) there are two different numbers of duts per rack. For the versions that are a part of the earlier stage, the number of duts per rack is 5, and for the versions in the later stage, the number of duts per rack is 7. This means that when the versions in the earlier stage are examined, all possible racks containing 5 duts are examined, and for the versions in the later stage all possible racks containing 7 duts are examined. Since there are also different thresholds for the two stages, this is also taken into account in this method.

The following versions are a part of the first stage of testing, i.e. when each rack contains 5 duts: 1.0.A.0.9, 1.0.A.0.11, 1.0.A.0.12, 1.0.A.0.13, 1.0.A.1.5, 1.0.A.1.10, 1.0.A.1.12, 1.0.A.1.17, 1.0.A.1.20, 1.0.A. The following versions are a part of the second stage of testing, i.e. when each rack contains 7 duts: 1.0.A.1.36, 1.1.A.0.1, 1.1.A.0.5, 1.1.A.0.8, 1.2.A.0.6, 1.2.A. All the different adjustments of the method are used on all the versions. It is of course most interesting to look at the result from the variant best fitted to each stage, but all the numbers are presented, for possible future use.

Some of the duts did not crash; instead they were stopped or still running. In this method two different ways of estimating these duts' crash times are used. The first alternative means that the crash time is equal to the time when the dut was stopped. The second alternative builds on the assumption from the method in section 4.2.1, i.e. that the number of failures belongs to a binomial distribution. The stopped duts are given an estimated crash time by using the estimated parameters n_previous and p_previous from the previous version. The number of crashes occurring in n_previous hours is estimated by drawing a random number from a binomial distribution with the parameters n_previous and p_previous. The estimated crash time is then n_previous divided by the drawn random number of crashes, plus the time when the dut stopped running. Since the purpose of this method is to give a result that can be compared to the one that comes from the analysis at the company today, the crash time for each dut is the one used at the company, i.e. the crash times given in the data from the company.

It is desired to avoid having two events occurring in the same time interval (the length of the interval is arbitrary), since it could lead to an estimated value of p that is larger than 1. Since p is a probability, the following demand is set: 0 ≤ p ≤ 1. Therefore it is assumed that each test case takes 1/10 hour to run, which means that the estimate of the parameter n is multiplied by a factor 10 and the estimate of p is divided by a factor 10.

For each version the following should be done:

1. Find all the possible combinations of 5 or 7 duts out of all the duts in the version.
2. For every combination, calculate the total runtime until two crashes have occurred.
3. Make a bar chart out of the total runtime until two failures have occurred for all the combinations.
4. Calculate how large a part of the combinations has a total runtime above the threshold for the current stage.

The result that is presented for each version contains:

- Histograms over the total runtime until two failures have occurred, i.e. a plot of the number of combinations for each total runtime until two failures have occurred.
- A pass rate, r_pass, that is the number of combinations that had a total runtime until two failures occurred above the threshold, divided by the total number of combinations. This is interpreted as the probability for the software to pass.
- A failure rate, r_fail, that is the number of combinations that had a total runtime until two failures occurred below the threshold, divided by the total number of combinations. This is interpreted as the probability for the software to fail.

4.2.5 Comparison between product testing and live testing

The data from the live testing is in such a form that the following methods can be applied to it: Parameterizing a Binomial distribution, Jelinski-Moranda estimation and Goel-Okumoto estimation. The results from these methods applied to live testing and product testing are compared. With this method it is desired to find a connection between the results from live testing and product testing, for example that a good result in live testing gives a good result in product testing. If there is a connection, the comparison can be used to predict the result from product testing, based on how the software performs in the live testing. In that case it can be used to shorten the runtimes and thus the time to release for the product.

4.3 Threats against validity

Concerning the modeling for this master's thesis, some things are threats against the validity:

The methods might only work on the data they were developed from

The methods in section 4.2 are developed by using data from one product, A. The same methods are also used on data from another product, here called B. This is done because the methods should work on any data of the same form as the data from product A. If all the methods work on the data from product B, and both data sets show the same pattern when it comes to the results given by the methods, the methods are considered to work on all data of the same form as the data from product A. In section 5.6 this threat is examined.

The assumption that the data belongs to a binomial distribution might be wrong

When it is assumed that the data belongs to a binomial distribution, it is also assumed that the parameter p is constant for each version; how likely is that assumption? This is investigated by doing the following for each version:

1. Split the time during which the failures have occurred into a number of intervals.
2. Estimate the parameter p of a binomial distribution for each of these intervals.
3. Check whether the different values of p are alike or not.

The chosen number of intervals to split the total time into is 10. In section 5.6 this threat is examined.

The data is hard to interpret

In some cases the data has been hard to interpret; there have been ambiguities. Some figures seem wrong and some results seem to have been mixed up between versions. This is discussed in section 5.6.

The assumption that version i is at least not worse than version i-1 might be wrong

When estimating the crash time for a dut that did not crash, this is done by using the binomial parameters from the previous version. This builds on the assumption that the versions get better and better; otherwise the estimated crash times will be too large. This is discussed in section 5.6.

The assumption that the times between failures are independent might be wrong

This is a threat that is not examined in this master's thesis and therefore has to be kept in mind as a possible source of error.

The faults are not removed at once

It is assumed by some of the methods in section 4.2 that the faults are corrected at once. This is however not done in reality and must be seen as a source of error.
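To make the probability calculations in section 4.2.1 concrete, a minimal Matlab sketch is given below. The input values are made up, binocdf is assumed to be available (Statistics Toolbox), and the actual implementation is the one in Appendix B.

```matlab
% Sketch of the pass-probability calculation in section 4.2.1.
totalRuntime = 2600;                 % total runtime n in hours (one "test case" per hour)
numFailures  = 4;                    % number of failures observed during that runtime

pHat = numFailures / totalRuntime;   % estimate of the failure probability p (= 1/MTTF)

% Probability that two or fewer failures occur in 200, 400 and 800 hours.
passProb = binocdf(2, [200 400 800], pHat);
```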

Chapter 5

Results

In this chapter the results from the methods in chapter 4 are presented and the threats to validity are examined.

5.1 Parameterizing a Binomial distribution

The result from the method in section 4.2.1 is shown in Table 5.1.

Table 5.1: Probabilities calculated from the binomial distribution. Columns: Version, p, P(X(t) ≤ 2, t = 200), P(X(t) ≤ 2, t = 400), P(X(t) ≤ 2, t = 800), with one row per version of product A.

The plot of the values of the probability to pass the test, according to Table 5.1, can be seen in Figure 5.1. The versions to the left of the vertical line are a part of the first stage, where the threshold is 400 hours and the probability is P(X(t) ≤ 2, t = 400). The versions to the right of the vertical line are a part of the second stage, where the threshold is 800 hours and the probability is P(X(t) ≤ 2, t = 800).

Figure 5.1: The estimated probability to pass the test, for data from product A. The vertical line represents a shift from stage 1 to 2.

It can be seen that the later versions have lower values of the parameter p and higher probabilities in the last three columns than the earlier versions, which can also be seen in Figure 5.1. This means that the software gets more and more reliable.

5.2 Goel-Okumoto estimation

In Table 5.2 the rate, r_{G-O}, and the time to next failure, TTNF, as defined in section 4.2.2, are presented for each version. Figure 5.2 shows the plots of the actual number of occurred failures and the estimated total number of failures for the software at the end of each version. In Table 5.2 it can be seen that the model says that all the faults have been found already at version 1.0.A. In the plot it can be seen that the number of faults is increasing from version to version, which means that all faults are not found already at version 1.0.A. This indicates that the model is not good for this data. The values of TTNF are not given since no good measure of them could be obtained. This is another thing that indicates that the model does not fit the data.

Table 5.2: The values from the Goel-Okumoto model. Columns: Version, r_{G-O}, TTNF, with one row per version of product A.

Figure 5.2: The plot from the Goel-Okumoto model.
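The values in Table 5.2 require the maximum likelihood estimates of N and b, which are obtained by solving the equations in section 2.4 numerically. A minimal sketch of one way to do this in Matlab is given below; the fault counts and interval times are made up, and the code in Appendix B may solve the equations differently.

```matlab
% Sketch: numerical solution of the Goel-Okumoto ML equations (section 2.4).
f = [5 3 2 2 1];               % fault counts f_i per test interval (example data)
t = [100 220 380 560 800];     % completion times t_i of the intervals, in hours

tn = t(end);
t0 = [0 t(1:end-1)];           % interval start times t_{i-1}

% Second ML equation, rewritten as g(b) = 0 and solved for b with fzero.
lhs  = @(b) tn*exp(-b*tn) / (1 - exp(-b*tn)) * sum(f);
rhs  = @(b) sum(f .* (t.*exp(-b*t) - t0.*exp(-b*t0)) ./ (exp(-b*t0) - exp(-b*t)));
bHat = fzero(@(b) lhs(b) - rhs(b), 1e-3);   % 1e-3 is just a starting guess

NHat = sum(f) / (1 - exp(-bHat*tn));        % first ML equation

% Estimated time of the next failure, from mu(t) = n + 1 (section 4.2.2).
nFound = sum(f);
if nFound + 1 < NHat
    tNext = -log(1 - (nFound + 1)/NHat) / bHat;
    TTNF  = tNext - tn;                     % time to next failure
else
    TTNF = Inf;                             % all estimated faults already found
end
```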

5.3 Jelinski-Moranda estimation

In Table 5.3 the rate, r_{J-M}, and the time to next failure, TTNF, as defined in section 4.2.3, are presented for each version. Figure 5.3 shows the plots of the actual number of occurred failures and the estimated total number of failures for the software at the end of each version.

Table 5.3: The values from the Jelinski-Moranda model. Columns: Version, r_{J-M}, TTNF, with one row per version of product A.

The values of r_{J-M} vary between 35 and 95 percent. Notable is that the value is very high for the early versions and the later versions, and lower for the versions in the middle. This indicates that there might be too little data for the early versions, and that later in the process the model gets a bit more stable. However, it is not stable enough to base decisions on. In chapter 6 this method is further discussed under the heading "Results from the Jelinski-Moranda model vs results from the company".

5.4 Combinations of duts from different racks

The histograms can be seen in section A.1. Below, the values of r_pass and r_fail are shown for each version, for both stages of the process and for both ways of estimating the crash time for stopped duts.

Figure 5.3: The plot from the Jelinski-Moranda model.

Combination of 5 duts, binomial estimation

The rates from this method can be seen in Table 5.4. The rate is the number of combinations that had a total runtime above 400 hours divided by the total number of combinations.

Combination of 5 duts, no binomial estimation

The rates from this method can be seen in Table 5.5. The rate is the number of combinations that had a total runtime above 400 hours divided by the total number of combinations.

Note that there are big differences between the rates depending on whether binomial estimation is included or not. For example, version 1.2.A.0.13 has 100 percent probability to pass with binomial estimation but zero chance to pass without binomial estimation. Apart from version 1.0.A.0.13, the value of r_pass is clearly higher for later versions than for earlier versions. This means that the reliability grows with the versions.

Combination of 7 duts, binomial estimation

The rates from this method can be seen in Table 5.6. The rate is the number of combinations that had a total runtime above 800 hours divided by the total number of combinations.

Table 5.4: Pass rate and fail rate for each version, with binomial estimation. Combinations of 5 duts. Columns: Version, r_pass, r_fail.

Table 5.5: Pass rate and fail rate for each version, without binomial estimation. Combinations of 5 duts. Columns: Version, r_pass, r_fail.

Table 5.6: Pass rate and fail rate for each version, with binomial estimation. Combinations of 7 duts. Columns: Version, r_pass, r_fail.

Combination of 7 duts, no binomial estimation

The rates from this method can be seen in Table 5.7. The rate is the number of combinations that had a total runtime above 800 hours divided by the total number of combinations.

Also here there are big differences between the rates depending on whether binomial estimation is included or not. Version 1.2.A.0.13 has about 50 percent probability to pass when crash times are estimated with a binomial distribution but zero chance to pass without binomial estimation. This is further discussed in chapter 6 under the heading "The fact that crash times are modeled with the previous version's parameters". Apart from version 1.0.A.0.13, the value of r_pass is clearly higher for later versions than for earlier versions.

The values of r_pass are plotted in Figure 5.4. The versions to the left of the vertical line are a part of the first stage, where the threshold is 400 hours and the values of r_pass are taken from Table 5.5. The versions to the right of the vertical line are a part of the second stage, where the threshold is 800 hours and the values of r_pass are taken from Table 5.7. It is clear that the probabilities are higher for the combinations of 5 duts than for the ones with 7 duts. The threshold is 400 hours for 5 duts and 800 hours for 7 duts, which means that the demand is 400/5 = 80 or 800/7 ≈ 114.3 hours per dut. Therefore it is not strange that the probabilities are higher for combinations of 5 duts than for the combinations of 7 duts.

Table 5.7: Pass rate and fail rate for each version, without binomial estimation. Combinations of 7 duts. Columns: Version, r_pass, r_fail.

Figure 5.4: The values of r_pass, for data from product A. The vertical line represents a shift from stage 1 to 2.
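As an illustration of how pass rates like those in Tables 5.4-5.7 can be computed, a minimal Matlab sketch is given below. The crash times are made up, and the runtime until two crashes is here computed in the parallel fashion of Figure 4.1, which is an assumption about the exact bookkeeping; the actual implementation is the one in Appendix B.

```matlab
% Sketch of the pass-rate calculation in section 4.2.4.
crashTimes = [120 95 300 210 180 260 340 150];   % one crash time per dut, in hours
rackSize   = 5;                                   % 5 duts per rack in stage 1, 7 in stage 2
threshold  = 400;                                 % 400 hours in stage 1, 800 in stage 2

combos = nchoosek(1:numel(crashTimes), rackSize); % all possible racks
runtimeUntilTwo = zeros(size(combos, 1), 1);
for k = 1:size(combos, 1)
    c = sort(crashTimes(combos(k, :)));
    % Total runtime accumulated over the rack when the second crash occurs.
    runtimeUntilTwo(k) = rackSize*c(1) + (rackSize - 1)*(c(2) - c(1));
end

rPass = mean(runtimeUntilTwo > threshold);        % share of combinations above the threshold
rFail = 1 - rPass;
```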

5.5 Comparison between product testing and live testing

Parameterizing a Binomial distribution to live data

In Table 5.8 the values from this method are shown.

Table 5.8: Probabilities calculated from the binomial distribution, for data from live testing. Columns: Version, p, P(X(t) ≤ 2, t = 200), P(X(t) ≤ 2, t = 400), P(X(t) ≤ 2, t = 800).

The values of the probability to pass the test, i.e. P(X(t) ≤ 2, t = 400), in Table 5.8 are shown in Figure 5.5.

Figure 5.5: The values of P(X(t) ≤ 2, t = 400), for data from the live testing.

The values of p and the probabilities show that the product gets more and more reliable.

Goel-Okumoto for live data

In Table 5.9 the rate, r_{G-O}, and the time to next failure, TTNF, as defined in section 4.2.2, are presented for each version. Figure 5.6 shows the plots of the actual number of occurred failures and the estimated total number of failures for the software at the end of each version.

Table 5.9: The values from the Goel-Okumoto model, for the data from the live testing. Columns: Version, r_{G-O}, TTNF.

Figure 5.6: The plot from the Goel-Okumoto model, for data from live testing.

The values of r_{G-O} are equal to one, except for the first version. This means that the model says, already from the second version, that all the faults in the software have been found. This indicates that the model is not good for this data. The values of TTNF are not given since no good measure of them could be obtained. Also the plot indicates that the model is not appropriate for the data.

Jelinski-Moranda for live data

In Table 5.10 the rate, r_{J-M}, and the time to next failure, TTNF, as defined in section 4.2.3, are presented for each version. Figure 5.7 shows the plots of the actual number of occurred failures and the estimated total number of failures for the software at the end of each version.

Table 5.10: The values from the Jelinski-Moranda model, for the data from live testing. Columns: Version, r_{J-M}, TTNF.

Figure 5.7: The plot from the Jelinski-Moranda model, for the data from the live testing.

The values of r_{J-M} vary between 25 and 75 percent. How the values change indicates that the model gets more and more stable. This model is, however, not stable enough to base decisions on. This statement is also supported by the plot, since the estimated total number of faults does not converge.

Since the reliability growth models did not give a satisfying result, they are not taken into account in the comparison between product testing and live testing. In Table 5.11 the estimated values from the binomial distribution are presented for both product testing and live testing.

Table 5.11: Probabilities calculated from the binomial distribution. Columns: Version, then p and P(X(t) ≤ 2, t = 400) for product testing, then p and P(X(t) ≤ 2, t = 400) for live testing.

By comparing the values in Table 5.11 it can be seen that the values from live testing and product testing differ a lot. For every version but 1.0.A.1.5 the live testing shows a greater probability to pass than the product testing. For version 1.0.A.1.5 they show almost the same probability; it differs by 2.9 percent. This means that product testing is a more conservative testing method than live testing. The estimated values of the parameter p are also different for the two levels of testing. For live testing it is smaller than for product testing, except for version 1.0.A.1.5. In product testing an operational profile is used, but in live testing it is not. This shows that the way a software product is used has a big impact on the result of the analysis.

Important to notice is that three of the versions that are a part of this comparison are the three earliest versions given in the data. These versions have been shown to be unstable and therefore the results from them cannot be seen as being as important as the results from the other versions. This means that the comparison relies on results from the three other versions, which is a bit too little data to draw conclusions from. Because of this, and since no actual connection can be found, no further investigations are made for this method.

5.6 Validation

Some of the threats mentioned in section 4.3 are investigated below.

The methods might only work on the data they were developed from

For data set B, the following versions are a part of the first stage, where each rack contains 5 duts: 1.0.A.1.3, 1.0.A.1.8, 1.0.B.0.0, 1.0.B.0.5, 1.0.B.0.9 and 1.0.A. The following are a part of the second stage, where each rack contains 7 duts: 1.1.A.0.1, 1.1.A.0.5 and 1.1.A.0.8.

Parameterizing a Binomial distribution to data from product B

When the method in section 4.2.1 is applied to data set B, the values in Table 5.12 are produced.

Table 5.12: Probabilities calculated from the binomial distribution for product B. Columns: Version, p, P(X(t) ≤ 2, t = 200), P(X(t) ≤ 2, t = 400), P(X(t) ≤ 2, t = 800).

The plot of the values of the probability to pass the test, according to Table 5.12, can be seen in Figure 5.8. The versions to the left of the vertical line are a part of the first stage, where the threshold is 400 hours and the probability is P(X(t) ≤ 2, t = 400). The versions to the right of the vertical line are a part of the second stage, where the threshold is 800 hours and the probability is P(X(t) ≤ 2, t = 800).

Figure 5.8: The estimated probability to pass the test, for data from product B. The vertical line represents a shift from stage 1 to 2.

It can be seen that the later versions have lower values of p and higher probabilities in the last three columns than the earlier versions, except for versions 1.0.B.0.9 and 1.0.A.1.8. This means that the software gets more and more reliable.

Goel-Okumoto estimation for data from product B

The values from the Goel-Okumoto model applied to data from product B are presented in Table 5.13. Figure 5.9 shows the plots of the actual number of occurred failures at the end of each version and the estimated total number of failures for the software.

Table 5.13: The values from the Goel-Okumoto model, for data from product B. Columns: Version, r_{G-O}, TTNF.

In Table 5.13 it can be seen that the model says that all the faults have been found already at version 1.0.A.1.8. In the plot it can be seen that the number of faults is increasing from version to version, which means that all faults are not found. This indicates that the model is not good for this data.

Jelinski-Moranda estimation for data from product B

The values from the Jelinski-Moranda model applied to data from product B are presented in Table 5.14. Figure 5.10 shows the plots of the actual number of occurred failures at the end of each version and the estimated total number of failures for the software. The rate, r_{J-M}, in Table 5.14 is increasing from version to version, which is a sign that the model fits the data. The values of TTNF are higher for the later versions than for the earlier versions. The results point towards a model that gets more and more stable, but not stable enough to base decisions on.

Figure 5.9: The plot from the Goel-Okumoto model, for data from product B.

Table 5.14: The values from the Jelinski-Moranda model, for data from product B. Columns: Version, r_{J-M}, TTNF.

Combinations of duts from different racks for data from product B

The results from this method are shown below, divided into four groups:

1. Combination of 5 duts, binomial estimation
2. Combination of 5 duts, no binomial estimation
3. Combination of 7 duts, binomial estimation
4. Combination of 7 duts, no binomial estimation

Figure 5.10: The plot from the Jelinski-Moranda model, for data from product B.

Combination of 5 duts, binomial estimation

The rates can be seen in Table 5.15.

Table 5.15: Pass rate and fail rate for each version, for product B with binomial estimation. Combinations of 5 duts. Columns: Version, r_pass, r_fail.

Combination of 5 duts, no binomial estimation

The rates can be seen in Table 5.16.

Table 5.16: Pass rate and fail rate for each version, for product B without binomial estimation. Combinations of 5 duts. Columns: Version, r_pass, r_fail.

There is a big difference in the pass rates between when the binomial estimation is included and when it is not. Without the binomial estimation, 5 out of 9 versions have a pass rate that is equal to zero and one that is very close to zero. With the binomial estimation all the versions have pass rates bigger than zero. Note that version 1.0.B.0.5 has a zero pass rate without binomial estimation and a pass rate of almost 67 percent with the binomial estimation. This is because, compared to the other versions with a pass rate equal to zero in Table 5.16, version 1.0.B.0.5 had more combinations close to the threshold of 400 hours. In Figure 5.11 it can be seen that this is the case.

Combination of 7 duts, binomial estimation

The rates can be seen in Table 5.17.

Table 5.17: Pass rate and fail rate for each version, for product B with binomial estimation. Combinations of 7 duts. Columns: Version, r_pass, r_fail.

Combination of 7 duts, no binomial estimation

The rates can be seen in Table 5.18. By comparing Table 5.17 with Table 5.18 it can be seen that the method with binomial estimation has larger pass rates than the method without binomial estimation.

Figure 5.11: Histograms for 4 combinations of 5 duts with low pass rates, without binomial estimation (versions 1.0.A.1.3, 1.0.B.0.0, 1.0.B.0.5 and 1.0.B.0.9).

The values of r_pass can be seen in Figure 5.12. The versions to the left of the vertical line are part of the first stage, where the threshold is 400 hours and the values of r_pass are taken from the tables for combinations of 5 duts. The versions to the right of the vertical line are part of the second stage, where the threshold is 800 hours and the values of r_pass are taken from the tables for combinations of 7 duts.

The assumption that the data belongs to a binomial distribution might be wrong

In Table 5.19 the estimated value of p for each interval and each version can be seen. A plot of the estimated values of p can be seen in Figure 5.13.

Table 5.18: Pass rate and fail rate for each version, for product B without binomial estimation. Combinations of 7 duts.

Figure 5.12: The values of r_pass, for data from product B. The vertical line represents a shift from stage 1 to 2.

Table 5.19: The estimated values of p for the intervals and versions.

Figure 5.13: The estimated values of p for different intervals and versions.

It can be seen that the estimated values of p are close to constant for each version, except for the first interval. The high values in the first interval are probably due to the software containing some failing functions that would normally be removed at once or in an earlier phase of the testing, but in this case remain. It can also be seen that the first two versions do not show a constant value of the estimated parameter. Since they are the first versions, this is probably caused by the software still being unstable. The conclusion is that it is a credible assumption that p is constant over each version.
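A minimal sketch of the per-interval check is given below: the parameter p is estimated in each interval as the number of crashes per dut-hour, and roughly constant estimates across intervals support the binomial assumption. The input format (crash counts and dut-hours per interval) and the toy numbers are assumptions for illustration.

    import numpy as np

    def p_per_interval(crashes, dut_hours):
        """Estimated per-hour crash probability in each interval."""
        return np.asarray(crashes, dtype=float) / np.asarray(dut_hours, dtype=float)

    # Toy usage: ten intervals for one version; the first interval is elevated,
    # mirroring the early crashes caused by failing functions mentioned above.
    crashes = [9, 2, 1, 2, 1, 2, 1, 1, 2, 1]
    dut_hours = [400] * 10
    print(p_per_interval(crashes, dut_hours))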

The data that is hard to interpret

When validating the raw data, corrections have been made in the calculations.

The assumption that version i is at least not worse than version i-1 might be wrong

This assumption is made for the method that parameterizes a binomial distribution and for the combination method in section 4.2.4 when binomial estimation is used. For the latter method an alternative is given that does not assume that version i is at least not worse than version i-1. This threat is further discussed in Chapter 6.

Chapter 6
Discussion

The results in chapter 5 are discussed below, divided into six sections.

Discussion about the results from product A

In Table 6.1 the result from the company's own analysis of product A is shown. The column Average contains the values of the average mentioned in Chapter 3. Six versions are chosen for which the result from the company is compared to the results in chapter 5. The chosen versions are 1.0.A.0.13, 1.0.A.1.12, 1.0.A.1.36, 1.1.A.0.5, 1.2.A.0.6 and 1.2.A.0.13. The result from the company is compared to:
- The probability of getting a value larger than the threshold, calculated from the estimated binomial distribution. This probability is called probability to pass. The values can be seen in Table 5.1.
- The pass rates that are estimated with the combination method. The rates can be seen in Table 5.4, Table 5.5, Table 5.6 and Table 5.7.
Since the reliability growth models did not give a satisfying result, that result is not taken into account in the discussion below.

In Table 6.2 the results that are compared in this section are gathered for all the versions. The pass rates are from the method that does not estimate the crash times with a binomial distribution. A plot of these results can be seen in Figure 6.1. The vertical lines stand for versions that got the test result pass from the company. Figure 6.1 shows that the binomial-distribution method and the combination method show the same trends, but that the latter is a bit more conservative. It can also be seen that for most of the versions that the company gave the test result pass, the two other methods show high values. Note that this is not the case for all of them.

Table 6.1: Result of the testing at the company, for product A (columns: Version, Average, Test result, Threshold).

Table 6.2: The results from different methods, for product A (columns: Version, Threshold, Average, Test result, Probability to pass, r_pass).

Figure 6.1: The plot of the different results, for product A.

Version 1.0.A.0.13
This version is part of the first stage of the process, where the racks contain 5 duts and the threshold is 400 hours. The value of the average from the company is 571 hours, which means that the version passed. The probability to pass calculated from the estimated binomial distribution is 63 percent. The pass rates estimated by the combination methods are 95 percent without binomial estimation and 97 percent with the binomial estimation, which matches the company's result well. In the histograms for this version in appendix A it can be seen that a large part of all possible combinations have total runtimes of 500 to 700 hours, i.e. clearly higher than the threshold of 400 hours. For this version the method that calculates probabilities from an estimated binomial distribution is more conservative than the other methods.

Version 1.0.A.1.12
This version is part of the first stage of the process, where the racks contain 5 duts and the threshold is 400 hours. The value of the average from the company is 426 hours, which means that the version passed. The probability to pass calculated from the estimated binomial distribution is 58 percent. The pass rates estimated by the combination methods are 29 percent without binomial estimation and 61 percent with the binomial estimation, i.e. there is a big difference between the values. The company passed the version, but not with a large margin. The other methods give probabilities that are a bit too low to say that the version passed the test. This means that according to the other two methods the company takes a large risk when giving this version the test result pass.

Version 1.0.A.1.36
This version is part of the second stage of the process, where the racks contain 7 duts and the threshold is 800 hours. The value of the average from the company is 734 hours, which means that the version failed. The estimated binomial distribution shows a 40 percent chance of passing the test. Both pass rates are also 40 percent. This means that all the methods say that this version fails.

Version 1.1.A.0.5
This version is part of the second stage of the process, where the racks contain 7 duts and the threshold is 800 hours. The value of the average from the company is 772 hours, which means that the version failed the test. The probability calculated from the estimated binomial distribution is 82 percent. The pass rate calculated without binomial estimation is 53 percent and with binomial estimation it is 86 percent. For this version the company's result and the pass rate without binomial estimation match, but the methods that use binomial estimation for crash times show much larger values and are therefore less conservative.

Version 1.2.A.0.6
This version is part of the second stage of the process, where the racks contain 7 duts and the threshold is 800 hours. The value of the average from the company is 1738 hours, which means that the version passed the test. The probability calculated from the estimated binomial distribution is 94 percent. Both pass rates are 91 percent. The histograms in appendix A for this version show that almost all combinations have a total runtime well over 1500 hours, which is clearly over the threshold of 800 hours. This means that all the methods give this version the result pass, with a large margin.

Version 1.2.A.0.13
This version is part of the second stage of the process, where the racks contain 7 duts and the threshold is 800 hours. The value of the average from the company is 284 hours, which means that the version failed the test. The probability calculated from the estimated binomial distribution is 95 percent. The pass rates are zero percent without binomial estimation and 50 percent with binomial estimation. The histograms in appendix A are very different for this version. For the method without binomial estimation the histogram shows that many of the combinations have total runtimes lower than 350 hours, which is very low. This means that the methods that involve binomial estimation of the crash times estimate the probability to pass too high and are less conservative.

From the discussion above, it can be said that the methods that use binomial estimation for crash times show results that do not have the same trends as the other methods. This is because they introduce a dependence between the versions, since the crash times are estimated with parameters from the previous version.

Discussion about the results from product B

In Table 6.3 the result from the company's own analysis of product B is shown. The column Average contains the values of the average mentioned in Chapter 3. Four versions are chosen for which the result from the company is compared to the results in chapter 5. The chosen versions are 1.0.A.1.8, 1.0.B.0.9, 1.0.A.1.26 and 1.1.A.0.8. The result from the company is compared to:
- The probability of getting a value larger than the threshold, calculated from the estimated binomial distribution. This probability is called probability to pass. The values can be seen in Table 5.12.
- The pass rates that are estimated with the combination method. The rates can be seen in Table 5.15, Table 5.16, Table 5.17 and Table 5.18.
Since the reliability growth models did not give a satisfying result, that result is not taken into account in the discussion below.

Table 6.3: Result of the testing at the company, for product B (columns: Version, Average, Test result, Threshold).

In Table 6.4 the results that are compared in this section are gathered. The pass rates are from the method that does not estimate the crash times with a binomial distribution. A plot of these results can be seen in Figure 6.2. The vertical lines stand for versions that got the test result pass from the company. Figure 6.2 shows that both the binomial-distribution method and the combination method produce higher values for the later versions. The combination method has pass rates close to zero for the first six versions, which is not the case for the method that parameterizes a binomial distribution. The result from the company shows that only the last four versions pass. This matches well with the pass rates, except for version 1.0.A.1.26. The probabilities from the binomial distributions match badly with the results from the other methods. Only the last version has a probability high enough to say that the version passes the test.

Table 6.4: The results from different methods, for product B (columns: Version, Threshold, Average, Test result, Probability to pass, r_pass).

Figure 6.2: The plot of the different results, for product B.

Version 1.0.A.1.8
This version is part of the first stage of the process, where the racks contain 5 duts and the threshold is 400 hours. The value of the average from the company is 382 hours. This means that it failed at the company. The binomial estimation gave this version a 40 percent chance of passing the test, and the rates from the combinations of 5 duts are 0 percent without binomial estimation and 23 percent with the binomial estimation, i.e. very low.

Version 1.0.B.0.9
This version is part of the first stage of the process, where the racks contain 5 duts and the threshold is 400 hours. The value of the average from the company is 70 hours. The binomial estimation gives the version a 3 percent chance of getting a pass. The pass rates from the combinations of 5 duts are both 0 percent. This means that all methods say that this version is far from passing the test.

Version 1.0.A.1.26
This version is part of the first stage of the process, where the racks contain 5 duts and the threshold is 400 hours. The value of the average from the company is 525 hours. The calculations from the binomial distribution gave the version a 53 percent chance of passing the test. The pass rates from the combinations of 5 duts are 0 percent without binomial estimation and 72 percent with the binomial estimation. This means that according to the other methods, the company takes a risk when giving this version the test result pass. Note that there is a big difference between the methods that use binomial estimation and the one that does not.

Version 1.1.A.0.8
This version is part of the second stage of the process, where the racks contain 7 duts and the threshold is 800 hours. The value of the average from the company is 1751 hours. The calculations from the binomial distribution gave the version a 91 percent chance of passing the test. The pass rates from the combinations of 7 duts are 99 percent without binomial estimation and 100 percent with the binomial estimation. This means that all the methods say that this is a stable version that should pass the test.

Results from product A vs results from product B

The interpretations of the results from the two products are similar. They both show a growing reliability, and they both show that the methods that estimate the crash times with a binomial distribution give results that cannot be trusted. Since all the Matlab functions work on both data sets and the results from both data sets show the same pattern, it can be stated that the methods work on data of the same form as the data from product A.

Results from the Jelinski-Moranda model vs results from the company

To try to understand the shifts in the estimated total number of faults in Figure 5.3 and Figure 5.10, they are compared to the result from the company. In Figure 6.3 and Figure 6.4 the Jelinski-Moranda plots are shown for products A and B. In the plots the versions that passed the company's test are marked with vertical lines.

Figure 6.3: The plot from the Jelinski-Moranda model, for data from product A. Versions that passed are marked with a vertical line.

It can be seen that for versions that passed the company's test, the estimated total number of faults is closer to the actual number of detected faults than for the versions that failed. This means that when the times between failures are short, the estimated total number of faults increases, and when the times between failures are longer, the estimated total number of faults decreases. The shifts are reasonable, but this does not mean that the model is appropriate for the data. The model assumes that the software gets better and better and that faults are corrected at once without introducing any new faults. This is not true for the data in this master's thesis, and that is why the model is not appropriate.

Figure 6.4: The plot from the Jelinski-Moranda model, for data from product B. Versions that passed are marked with a vertical line.

The fact that crash times are modeled with the previous version's parameters

The methods that estimate crash times with a binomial distribution show unreasonably high values of the probability to pass for some versions. This is because the crash times are estimated with the parameters from the previous version, which showed a good result. This is done because of the assumption that version i is at least not worse than version i-1. Clearly, in some cases this assumption is wrong.

Discussion about when a dut can be stopped, without affecting the test result

As mentioned in chapter 3, when two duts have crashed in a rack the remaining duts do not need to be kept running. It has been stated that for the methods developed in this master's thesis, this means that the results might be underestimated. If all the duts ran until they crashed, no crash times would need to be estimated and the problem would be solved. For the combination method it is, however, enough to say that the duts should not be stopped before they have run T/(n-1) hours, where T is the threshold (400 or 800 hours) and n is the number of duts per rack (5 or 7). This is because only the two lowest runtimes affect whether a rack is considered to pass or fail. The worst case scenario is if one of the duts in a rack has a runtime of zero hours, i.e. it crashes as soon as it is started. In this worst case scenario, the total runtime for that rack would be t_2*(n-1) hours, where t_2 is the second lowest runtime. For the rack to pass, the total runtime must be at least equal to the threshold T. This means that for the rack to pass the following must hold: t_2 >= T/(n-1).
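A small worked check of the rule above, with the stage parameters taken from the text (threshold 400 hours with 5 duts per rack, 800 hours with 7 duts per rack); the Python snippet is illustrative only:

    # Stage parameters taken from the text: (threshold T in hours, duts per rack n).
    stages = {"stage 1": (400, 5), "stage 2": (800, 7)}
    for name, (T, n) in stages.items():
        print(name, "minimum runtime per dut before it may be stopped:",
              T / (n - 1), "hours")

This gives 100 hours for the first stage and roughly 133 hours for the second stage.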

Chapter 7
Conclusions

The recommendation to the company is to use the method that is based on the method they use today.

It has been stated in the previous chapters that the two reliability growth models that were examined are not appropriate for the purpose of this master's thesis. The Jelinski-Moranda model fits the data better than the Goel-Okumoto model, but it is still not good enough to be recommended to the company as a model to use.

The method that was created to find a connection between the results from product testing and live testing did not give a satisfying result. Three of the six versions for which this method was used were the first and unstable versions. Therefore only three versions remained to base the conclusions on. Since there was no clear connection for these three versions, this method is not recommended.

When estimating the crash times with a binomial distribution, the parameters from the previous version are used. This can be done because the assumption is made that a version is at least as good as the previous version. Observations that have been made make it clear that this is not always the case. This means that crash times can be given a value that is unreasonably large because the previous version was better than the current one. It can also be the other way around, that the previous version got a bad result, which leads to a possible underestimation of the crash times for the current version. Investigations made in this master's thesis show that it is reasonable to use a binomial distribution for the data. However, in what way a binomial distribution should be used must be investigated further, so that the problems discussed above disappear. As a result of these problems, the method in section 4.2.1, which parameterizes a binomial distribution, is not recommended to the company.

Because of the uncertainties about estimating the crash times with a binomial distribution discussed above, the method that combines all the possible duts and uses this estimation is not recommended either.

The method that combines all the possible duts without estimating the crash times with a binomial distribution is a method that the company could use. For this method, a dut that is stopped or still running gets a crash time equal to the time when it was stopped/observed. This means that the lowest possible crash times are used for these duts, which leads to an underestimation of the runtimes and therefore an underestimation of the result. For some versions this method gives a very low probability to pass at the same time as the test result from the company is pass. This means that the company takes a big risk, from a statistical point of view, when giving these versions the test result pass.

Based on the previous chapters in this master's thesis, the recommendation to the company is to use the method that combines all the possible duts, without binomial estimation of the crash times, to get a measure of how reliable their software is. It is also recommended to let all the duts run at least T/(n-1) hours before they are stopped, where T is the threshold and n is the number of duts per rack. This means that no estimation of the crash times will be necessary.

Chapter 8
Future work

While working on this master's thesis I have discovered several interesting areas. Unfortunately it has not been possible to investigate them all. Work that has not been done and that is connected to the work in this master's thesis is recommended below.

Recommended future work

Besides a more statistically correct analysis, the company also wants to reach lower runtimes. This has not been examined in this master's thesis and is therefore left to do. It would be of interest to investigate how the results would be affected if the testing was divided over a larger number of duts, since increasing the runtime is more expensive than increasing the number of duts. One thing to keep in mind is that it has to be researched how many duts the testing can be divided over. If the number of duts is too large, the result will not say anything about the situation that the company is interested in. It is not the same thing to run 1 hour on 1000 duts as it is to run 1000 hours on 1 dut. The comparison between live testing and product testing may also be helpful when trying to lower the runtimes, so further studies on this would be interesting.

In this master's thesis the reliability growth models that have been used are Goel-Okumoto and Jelinski-Moranda. These models have been used to get an estimate at one point at a time, not a continuous estimate. It would be interesting to use the models continuously to predict future results. Since these models did not give a satisfying result, it would be interesting to use s-shaped reliability growth models, mentioned in section 2.4.

The methods that estimate the crash times with a binomial distribution in some cases gave results that were unreasonable, because of the dependence between adjacent versions. Therefore this dependence needs to be investigated. Even though it seems appropriate to estimate the crash times with a binomial distribution, it would be interesting to test other possible ways of modeling the crash times for stopped duts. For example, it can be assumed that the probability of failure is decreasing in some way.

The failures that are of interest in a context like this are failures that depend on memory leakage and other problems related to long executions. Therefore it could be a good idea to remove the crashes that occur because of failing functions that normally would be corrected at once or should have been corrected in earlier phases. The detection of these faults does not depend on how long the runtime is, but on whether the failing function is tested or not. Typically, the large number of crashes that occur early in the runs are caused by these failing functions, and they can therefore be removed.

How good the methods in this master's thesis are can only be decided by the company after actually trying them and seeing how it goes when they base their decisions on these methods. The thresholds in the methods are also to be set by the company.

There is a contradiction between the method that parameterizes a binomial distribution and the reliability growth models concerning the assumptions on the failure probability. In a binomial distribution the parameter p is constant, while the reliability growth models assume that the failure probability is decreasing. The assumption that showed the best result in this master's thesis was that p is constant. Further studies on which assumption actually is more appropriate would be interesting.
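As a pointer for the s-shaped direction, the sketch below fits one common s-shaped mean value function, m(t) = a(1 - (1 + bt)e^(-bt)), to cumulative failure counts by least squares. The model choice, the data points and the starting values are assumptions made for illustration; it is not part of the thesis's analysis.

    import numpy as np
    from scipy.optimize import curve_fit

    def m(t, a, b):
        """Delayed s-shaped mean value function: expected cumulative failures."""
        return a * (1.0 - (1.0 + b * t) * np.exp(-b * t))

    # Made-up cumulative failure counts at cumulative test hours (not thesis data).
    t = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
    cum_failures = np.array([1, 3, 6, 10, 13, 14], dtype=float)

    (a_hat, b_hat), _ = curve_fit(m, t, cum_failures, p0=[20.0, 0.001])
    print("estimated total number of faults a:", a_hat, "shape parameter b:", b_hat)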


Appendix A
Histograms

A.1 The histograms

The histograms produced by the combination method for product A are presented below. There are four groups of histograms:
- Histograms produced for combinations of 5 duts, with binomial estimation of crash times.
- Histograms produced for combinations of 5 duts, without binomial estimation of crash times.
- Histograms produced for combinations of 7 duts, with binomial estimation of crash times.
- Histograms produced for combinations of 7 duts, without binomial estimation of crash times.
Which group the histograms belong to is stated in the captions.

Figure A.1: Histograms for combinations of 5 duts, with binomial estimation (versions 1.0.A.0.9, 1.0.A.0.11, 1.0.A.0.12, 1.0.A.0.13).

Figure A.2: Histograms for combinations of 5 duts, with binomial estimation (versions 1.0.A.1.5, 1.0.A.1.10, 1.0.A.1.12, 1.0.A.1.17).

Figure A.3: Histograms for combinations of 5 duts, with binomial estimation (versions 1.0.A.1.20, 1.0.A.1.24, 1.0.A.1.36, 1.1.A.0.1).

Figure A.4: Histograms for combinations of 5 duts, with binomial estimation (versions 1.1.A.0.5, 1.1.A.0.8, 1.2.A.0.6, 1.2.A.0.13).

Figure A.5: Histograms for combinations of 5 duts, without binomial estimation (versions 1.0.A.0.9, 1.0.A.0.11, 1.0.A.0.12, 1.0.A.0.13).

Figure A.6: Histograms for combinations of 5 duts, without binomial estimation (versions 1.0.A.1.5, 1.0.A.1.10, 1.0.A.1.12, 1.0.A.1.17).

Figure A.7: Histograms for combinations of 5 duts, without binomial estimation (versions 1.0.A.1.20, 1.0.A.1.24, 1.0.A.1.36, 1.1.A.0.1).

Figure A.8: Histograms for combinations of 5 duts, without binomial estimation (versions 1.1.A.0.5, 1.1.A.0.8, 1.2.A.0.6, 1.2.A.0.13).


More information

UPDATED IAA EDUCATION SYLLABUS

UPDATED IAA EDUCATION SYLLABUS II. UPDATED IAA EDUCATION SYLLABUS A. Supporting Learning Areas 1. STATISTICS Aim: To enable students to apply core statistical techniques to actuarial applications in insurance, pensions and emerging

More information

EMPIRICAL STUDY ON STOCK'S CAPITAL RETURNS DISTRIBUTION AND FUTURE PERFORMANCE

EMPIRICAL STUDY ON STOCK'S CAPITAL RETURNS DISTRIBUTION AND FUTURE PERFORMANCE Clemson University TigerPrints All Theses Theses 5-2013 EMPIRICAL STUDY ON STOCK'S CAPITAL RETURNS DISTRIBUTION AND FUTURE PERFORMANCE Han Liu Clemson University, hliu2@clemson.edu Follow this and additional

More information

Math 140 Introductory Statistics

Math 140 Introductory Statistics Math 140 Introductory Statistics Let s make our own sampling! If we use a random sample (a survey) or if we randomly assign treatments to subjects (an experiment) we can come up with proper, unbiased conclusions

More information

Exit Strategies for Stocks and Futures

Exit Strategies for Stocks and Futures Exit Strategies for Stocks and Futures Presented by Charles LeBeau E-mail clebeau2@cox.net or visit the LeBeau web site at www.traderclub.com Disclaimer Each speaker at the TradeStationWorld Conference

More information

Lecture 16: Estimating Parameters (Confidence Interval Estimates of the Mean)

Lecture 16: Estimating Parameters (Confidence Interval Estimates of the Mean) Statistics 16_est_parameters.pdf Michael Hallstone, Ph.D. hallston@hawaii.edu Lecture 16: Estimating Parameters (Confidence Interval Estimates of the Mean) Some Common Sense Assumptions for Interval Estimates

More information