A replicated controlled study confirms that developers’ perceptions, preferences, and opinions about software testing techniques do not reliably predict actual testing effectiveness.

Why the Testing Method Developers Prefer Is Rarely Ever the One That Finds the Most Bugs

Abstract

1 Introduction

2 Original Study: Research Questions and Methodology

3 Original Study: Validity Threats

4 Original Study: Results

5 Replicated Study: Research Questions and Methodology

6 Replicated Study: Validity Threats

7 Replicated Study: Results

8 Discussion

9 Related Work

10 Conclusions And References


7 Replicated Study: Results

Of the 46 students participating in the experiment, seven did not complete the questionnaire and were removed from the analysis. Table 19 shows the changes in the experimental groups due to students not participating in the study. Balance is not seriously affected by mortality, although it would have been desirable for Group 5 to have at least one more participant.

Additionally, another four participants did not answer all the questions and were removed from the analysis of the respective questions.

7.1 RQ1: Participants’ Perceptions as Predictors

7.1.1 RQ1.1-RQ1.5: Comparison with Original Study Results

Appendix C shows the analysis of the experiment. Program is the only statistically significant variable (group, technique and the program by technique interaction are not significant). In this replication, fewer defects are found in cmdline than in nametbl and ntree, where the same number of defects is found. Some results are in line with those obtained in the original study:

– There is no interaction-with-selection effect: Group is not significant.

– Mortality does not affect the experimental results. The analysis technique used (linear mixed-effects models) is robust to lack of balance (see the sketch after this list).

– Results cannot be generalized to other subject types.
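To make the analysis concrete, the following is a minimal sketch (ours, not the authors’ script) of fitting a linear mixed-effects model of the kind reported in Appendix C with statsmodels. All data are synthetic: the effectiveness values, participant count, and rotation of programs are invented for illustration.

```python
# Sketch: linear mixed-effects model with technique and program as fixed
# effects and a per-participant random intercept, fit on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
techniques = ["CR", "BT", "EP"]
programs = ["cmdline", "nametbl", "ntree"]

rows = []
for s in range(30):                      # 30 hypothetical participants
    skill = rng.normal(0, 5)             # per-subject random intercept
    for k, tech in enumerate(techniques):
        prog = programs[(s + k) % 3]     # rotate programs, Latin-square style
        eff = 60 + skill + rng.normal(0, 8)
        rows.append((f"s{s}", tech, prog, eff))
df = pd.DataFrame(rows, columns=["subject", "technique", "program",
                                 "effectiveness"])

# REML estimation copes with unbalanced data, which is why participant
# mortality does not invalidate this kind of analysis.
model = smf.mixedlm("effectiveness ~ technique * program",
                    df, groups=df["subject"])
print(model.fit().summary())
```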

Other results, however, contradict those obtained in the original study and therefore need further investigation:

– A maturation effect cannot be ruled out. The program with the lowest effectiveness is the one used on the first day.

– The order of training does not seem to affect results. All techniques show the same effectiveness.

Table 20 shows the results of participants’ perceptions for techniques. The results are the same as in the original study (χ²(2, N=37)=3.622, p=0.164). Our data do not support the conclusion that some techniques are more frequently perceived as being the most effective than others.
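As an illustration of how a uniformity check of this kind can be run, here is a minimal sketch using SciPy’s goodness-of-fit test. The counts are hypothetical stand-ins, not the Table 20 data, so the statistic will not match the reported χ²(2, N=37)=3.622.

```python
# Sketch: chi-squared goodness-of-fit test of whether the three techniques
# are perceived as "most effective" with equal frequency (uniform null).
from scipy.stats import chisquare

# Hypothetical counts of participants naming each technique (N = 37 assumed).
perceived_most_effective = {"CR": 9, "BT": 12, "EP": 16}

counts = list(perceived_most_effective.values())
result = chisquare(counts)  # expected frequencies default to uniform
print(f"chi2({len(counts) - 1}, N={sum(counts)}) = {result.statistic:.3f}, "
      f"p = {result.pvalue:.3f}")
```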

Our data do not support the conclusion that participants correctly perceive the most effective technique for them. The overall and per-technique kappa values and 95% CIs reported in Table 21 are in line with those in the original study. This suggests that the hypothesis we formulated in the original experiment is not correct. For some reason, perceptions are more accurate with the CR technique.
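For readers who want to reproduce this type of agreement analysis, the sketch below computes Cohen’s kappa between the perceived and actual most effective technique, with a bootstrap 95% CI. The paper does not specify its CI method, and the participant labels here are invented.

```python
# Sketch: Cohen's kappa between the technique each participant perceived as
# most effective and the technique that actually was, plus a bootstrap CI.
import numpy as np
from sklearn.metrics import cohen_kappa_score

labels = ["CR", "BT", "EP"]
perceived = np.array(["EP", "CR", "BT", "EP", "CR", "EP", "BT", "CR", "EP", "EP"])
actual    = np.array(["EP", "BT", "BT", "CR", "CR", "EP", "EP", "CR", "BT", "EP"])

kappa = cohen_kappa_score(perceived, actual, labels=labels)

rng = np.random.default_rng(0)
n = len(perceived)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)  # resample participants with replacement
    boot.append(cohen_kappa_score(perceived[idx], actual[idx], labels=labels))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")  # kappa<0.4: poor
```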

Again as in the original study, we have not been able to observe bias in perceptions (Stuart-Maxwell outputs χ²(2, N=37)=3.103, p=0.212, and McNemar-Bowker outputs χ²(3, N=37)=3.143, p=0.370). Table 22 shows the value of Krippendorff’s α and 95% CI, overall and for each pair of techniques, for all participants and for every design group (participants who applied the same technique on the same program) separately, and Table 23 shows the value of Krippendorff’s α and 95% CI, overall and for each program/session. Again, the results obtained are the same as in the original study. Participants do not obtain effectiveness values so similar across techniques (or across programs) that it would be difficult to discriminate among them.
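A minimal sketch of computing Krippendorff’s α for this kind of data, using the third-party krippendorff package, is shown below. The effectiveness matrix is invented, and treating the scale as interval is our assumption.

```python
# Sketch: Krippendorff's alpha over per-participant effectiveness values,
# treating the three techniques as "raters" (one row per technique, one
# column per participant). np.nan marks a missing measurement.
import numpy as np
import krippendorff  # pip install krippendorff

effectiveness = np.array([
    [80.0, 60.0, 70.0, 55.0, np.nan],  # CR
    [75.0, 65.0, 72.0, 50.0, 60.0],    # BT
    [78.0, 58.0, 69.0, 52.0, 61.0],    # EP
])
alpha = krippendorff.alpha(reliability_data=effectiveness,
                           level_of_measurement="interval")
# Low alpha: effectiveness differs across techniques, so they are easy to
# discriminate; high alpha would mean near-identical values.
print(f"Krippendorff's alpha = {alpha:.3f}")
```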

Table 24 and Figure 2 show the cost of mismatch. As in the original study, the mismatch cost is not related to the technique perceived as being the most effective (Kruskal-Wallis H(2)=2.979, p=0.226). Also, the proportion of mismatches is about the same as in the original study (48% of mismatches in the original study versus 51% in the replicated study).
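The following sketch shows one way to compute the per-participant mismatch cost and test whether it depends on the perceived technique, assuming mismatch cost is defined as the best achievable effectiveness minus the effectiveness of the technique perceived as most effective. All values are hypothetical, not the Table 24 data.

```python
# Sketch: mismatch cost per participant, then a Kruskal-Wallis test of
# whether the cost differs by the technique perceived as most effective.
from scipy.stats import kruskal

# (perceived technique, {technique: effectiveness in %}) per participant
participants = [
    ("EP", {"CR": 80, "BT": 70, "EP": 60}),
    ("CR", {"CR": 75, "BT": 55, "EP": 65}),
    ("BT", {"CR": 50, "BT": 68, "EP": 72}),
    ("EP", {"CR": 66, "BT": 71, "EP": 71}),
]

cost_by_perceived = {"CR": [], "BT": [], "EP": []}
for perceived, eff in participants:
    cost = max(eff.values()) - eff[perceived]  # 0 when perception is correct
    cost_by_perceived[perceived].append(cost)

groups = [g for g in cost_by_perceived.values() if g]  # drop empty groups
h, p = kruskal(*groups)
print(f"Kruskal-Wallis H = {h:.3f}, p = {p:.3f}")
```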

However, there are some differences with respect to the original study:

– While CR had the greatest number of mismatches in the original study, now it has the smallest. The number of mismatches for BT and EP has increased with respect to the original study.

– In the replicated study, the mismatch cost is slightly lower (25pp compared with 31pp in the original study). The mismatch cost is smaller when CR is involved.

This could be due to the change in the seeded faults or just to natural variation; it should be checked further. However, it is a fact that the effectiveness of EP and BT has decreased in the replicated study, while CR shows a similar effectiveness as in the original study. This suggests that the mismatch cost could be related to the faults that the program contains. However, this issue needs to be investigated further, as we have few data points. Note that, as in the original study, the small number of data points could be affecting these results.

Table 25 shows the average loss of effectiveness that should be expected in a project due to mismatch. The expected loss of effectiveness in a project is similar to the one observed in the original study (13pp), but this time it is related to the technique perceived as most effective (Kruskal-Wallis H(2)=9.691, p=0.008). This means that some mismatches are more costly than others. The misperception of CR as being the most effective technique has a lower associated cost (4pp) than for BT or EP (18pp).

This suggests that participants who think CR is the most effective technique might be allowed to apply it, as, even if they are wrong, the loss of effectiveness would be negligible. However, participants should not rely on their perceptions even in this case, since fault type could have an impact on this result and they will never know beforehand what faults the program contains. Note again that the small number of data points could be affecting these results. Therefore, this issue needs further research.

The findings of the replicated study are:

– They confirm the results of the original study.

– A possible relationship between fault type and mismatch cost should be further investigated.

Since the results of both studies are similar, we have pooled the data and performed joint analyses for all research questions to overcome the lack of statistical power due to small sample size. They are reported in Appendix D. The results confirm those obtained by each study individually, which allows us to gain confidence in the results obtained.

7.1.2 RQ1.6: Perceptions and Number of Defects Reported

One of the conclusions of the original study was that perceived technique effectiveness could match the technique with the highest number of defects reported. Table 26 shows the value of kappa and its 95% CI, overall and for each technique separately. We find that all kappa values for agreement between the perceived most effective technique and the technique with the greatest number of defects reported are consistent with lack of agreement (κ<0.4, poor). However, the upper bounds of all 95% CIs show agreement, and the lower bounds of all 95% CIs except BT’s are greater than zero. This means that, although our data do not support the conclusion that participants correctly perceive the most effective technique for them, such agreement should not be ruled out: participants’ perceptions of technique effectiveness could be related to reporting a greater number of defects with that technique.

As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ²(2, N=37)=2.458, p=0.293). This means that we cannot conclude that perceptions and reported defects are differently distributed. Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ²(3, N=37)=2.867, p=0.413). This means that we cannot conclude that there is directionality when participants’ perceptions do not match the technique with the highest number of defects reported. The lack of clear agreement could be due to participants not remembering exactly the number of defects found with each technique.
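Both tests are available in statsmodels; the sketch below runs them on a hypothetical 3×3 contingency table of perceived most effective technique (rows) versus technique with most defects reported (columns). The table entries are invented.

```python
# Sketch: Stuart-Maxwell (marginal homogeneity) and McNemar-Bowker (symmetry)
# tests on a square contingency table, via statsmodels.
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

table = np.array([
    [6, 2, 3],   # perceived CR vs reported CR / BT / EP
    [1, 7, 4],   # perceived BT
    [3, 5, 6],   # perceived EP
])
st = SquareTable(table, shift_zeros=False)

hom = st.homogeneity(method="stuart_maxwell")
sym = st.symmetry(method="bowker")
print(f"Stuart-Maxwell: chi2({hom.df}) = {hom.statistic:.3f}, p = {hom.pvalue:.3f}")
print(f"McNemar-Bowker: chi2({sym.df}) = {sym.statistic:.3f}, p = {sym.pvalue:.3f}")
```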

7.1.3 RQ1.1-RQ1.2: Program Perceptions

Table 27 shows the results of participants’ perceptions of the program in which they detected most defects. We found that the same phenomenon applies to programs as to techniques. The three programs cannot be considered differently frequently perceived as being the ones where most defects were found, as we cannot reject the null hypothesis that the frequency distribution of the responses is uniform (χ²(2, N=37)=2.649, p=0.266). Our data do not support the conclusion that some programs are more frequently perceived than others as having a higher percentage of defects found. This contrasts with the fact that cmdline has a slightly higher complexity and number of LOC, and that ntree shows the highest Halstead metrics. We expected cmdline and/or ntree to be perceived less frequently as having a higher detection rate.

However, the kappa values in Table 28 show that there seems to be agreement overall and for cmdline and ntree (κ>0.4, fair to good, and agreement by chance can be ruled out, since 0 does not belong to the 95% CI), but not for the nametbl program (κ=0.292, poor, and agreement by chance cannot be ruled out, as 0 belongs to the 95% CI). This means that participants do tend to correctly perceive the program in which they detected most defects. This is striking, as it contrasts with the disagreement observed for techniques. It suggests that participants’ perceptions of the percentage of defects found may be reliable. This is interesting, as cmdline has a higher complexity. Since there is agreement, we do not study the mismatch cost. Misperceptions do not seem to affect participants’ perception of how well they have tested a program.

7.2 RQ2: Participants’ Opinions as Predictors

7.2.1 RQ2.1: Participants’ Opinions

Table 29 shows the results for participants’ opinions with respect to techniques.

With regard to the technique participants think they applied best (OT1), we can reject the null hypothesis that they perceive each of the three techniques as the one they applied best equally often (χ²(2, N=38)=10.947, p=0.004). More people think they applied EP best, followed by BT and CR (which merit the same opinion).

In the case of the technique participants liked best (OT2), the results are similar. We can reject the null hypothesis that participants regard all three techniques as their favourite equally often (χ²(2, N=38)=22.474, p<0.001). Most people like EP best, followed by BT and CR (which merit the same opinion).

Finally, as regards the technique that participants found easiest to apply (OT3), the results are exactly the same as for the preferred technique (χ²(2, N=38)=22.474, p<0.001). Most people regard EP as the easiest technique to apply, followed by BT and CR (which merit the same opinion).

Table 30 shows the results for participants’ opinions about the programs. We cannot reject the null hypothesis that all programs are equally frequently viewed as the simplest (χ²(2, N=38)=1.474, p=0.479). Therefore, our data do not support the conclusion that the three programs are differently frequently perceived as being the simplest. This result suggests that both the differences in complexity and size of cmdline and the higher Halstead metrics of ntree are small. It could also be that participants interpret this question differently, or that the question used to operationalize the corresponding construct is vague and participants are not interpreting it correctly.

7.2.2 RQ2.2: Comparing Opinions with Reality

The technique that participants think they applied best (OT1) is not a good predictor of technique effectiveness. The overall and per-technique kappa values in the fourth column of Table 31 are consistent with lack of agreement (κ<0.4, poor, in all cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to most 95% CIs, meaning that agreement by chance cannot be ruled out). However, we find that there is a bias, as the Stuart-Maxwell and McNemar-Bowker tests can reject the null hypotheses of marginal homogeneity (χ²(2, N=38)=10.815, p=0.004) and symmetry (χ²(3, N=38)=12.067, p=0.007), respectively.

Looking at the light and dark grey cells in the corresponding contingency table in Table 32, we find that the cells below the diagonal have higher values than those above it. In other words, there are rather more participants who consider that they applied EP best despite achieving better effectiveness with CR or BT (9 and 5, respectively) than participants who consider that they applied CR or BT best despite being more effective with EP (1 in both cases). This suggests that there is a bias towards EP, and the bias is much more pronounced with respect to CR. These results are consistent with the ones found in the previous section.
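The diagonal reading described above can be made explicit in code. In the sketch below, the grey-cell counts (9, 5, 1, 1) follow the values cited in the text, while the diagonal and remaining cells are invented.

```python
# Sketch: quantifying directional bias in a square contingency table of
# "technique applied best" (rows) vs "actually most effective" (columns)
# by comparing the mass below and above the diagonal.
import numpy as np

labels = ["CR", "BT", "EP"]
table = np.array([
    [5, 2, 1],   # said they applied CR best
    [3, 6, 1],   # said they applied BT best
    [9, 5, 6],   # said they applied EP best
])
below = np.tril(table, k=-1).sum()  # e.g. opinion EP, reality CR or BT
above = np.triu(table, k=1).sum()
print(f"below diagonal = {below}, above diagonal = {above}")
# A clear excess below the diagonal mirrors the reported bias towards EP.
```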

There are several possible interpretations of these results: 1) we do not know whether the opinion on the best-applied technique is accurate (that is, whether it really is the best-applied technique); 2) possibly due to the change in faults, technique performance is worse in this replication than in the original study; and 3) participants may have misunderstood the question. Interviewing participants, or asking them in the questionnaire about the reasons for their answers, would have helped to clarify this last issue.

As regards participants’ favourite technique (OT2), the results are similar. This opinion does not predict technique effectiveness, since all kappa values in the fourth column of Table 31 denote lack of agreement (κ<0.4, poor, in all cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to all 95% CIs, meaning that agreement by chance cannot be ruled out). Again, we find there is bias, as the Stuart-Maxwell and McNemar-Bowker tests can reject the null hypotheses of marginal homogeneity (χ²(2, N=38)=11.931, p=0.003) and symmetry (χ²(3, N=38)=11.974, p=0.007), respectively. Looking at the light and dark grey cells in Table 33, we again find that there is bias towards EP. There are rather more participants who prefer EP despite being more effective using CR or BT (12 and 5, respectively) than participants who prefer CR or BT despite being more effective using EP (1 in both cases). Note that the bias between CR and EP is more pronounced. It is very unlikely that participants misinterpreted this question; it simply seems that the technique they like most is not typically the most effective.

Finally, with respect to the technique that is easiest to apply (OT3), we find that the results are exactly the same as for the preferred technique. However, as we saw for OT2, the preferred technique is not a good predictor of effectiveness (see the third row of Table 31), and there is bias towards EP (see the light and dark grey cells in Table 33). These results are in line with a common claim in SE, namely that developers should not base their decisions on their opinions, as these are biased. Again, it should be noted that participants might not have interpreted the question as we expected. Further research is necessary.

As far as the simplest program is concerned, we find, as we did for the techniques, that it is not a good predictor of the program in which most defects were detected (the overall and per-program kappa values in Table 34 denote lack of agreement: κ<0.4, poor, in all cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to the 95% CIs, except for ntree, meaning that agreement by chance cannot be ruled out). Unlike the opinions on techniques, we were not able to find any bias this time, as neither the null hypothesis of marginal homogeneity (χ²(2, N=38)=1.621, p=0.445) nor that of symmetry (χ²(3, N=38)=3.286, p=0.350) can be rejected. This result suggests that the programs that participants perceive as the simplest are not necessarily the ones in which most defects were found. Again, note that participants may have interpreted “the simplest program” differently than we expected.

7.2.3 Findings

Our findings suggest:

– Participants’ opinions should not drive their decisions.

– Participants prefer EP (they think they applied it best, like it best and find it easiest to apply), and rate CR and BT equally.

– All three programs are equally frequently perceived as being the simplest.

– The programs that participants perceive as the simplest are not the ones in which the highest number of defects was found.

These results should be understood within the validity limits of the study.

7.3 RQ3: Comparing Perceptions and Opinions

7.3.1 RQ3.1: Comparing Perceptions and Opinions

In this section, we look at whether participants’ perceptions of technique effectiveness are biased by their opinions about the techniques. According to the kappa results shown in the fourth column of Table 35 (PT1-OT1), the results are compatible with agreement between the technique perceived to be the most effective and the technique participants think they applied best, overall and per technique, except for BT, where lack of agreement cannot be ruled out (κ>0.4, fair to good, in all cases; and in all cases but BT, 0 does not belong to the 95% CI, meaning that agreement by chance can be ruled out).

This is an interesting finding, as it suggests that participants think that technique effectiveness is related to how well the technique is applied. Technique performance certainly decreases if a technique is not applied properly. It is no less true, however, that techniques have intrinsic characteristics that may lead to some defects not being detected. In fact, the controlled experiment includes some faults that certain techniques are unable to detect. A possible explanation for this result is that the evaluation apprehension threat is materializing.

On the other hand, the kappa values in the fourth column of Table 35 (PT1-OT2) reveal a lack of agreement for CR and BT between the preferred technique and the technique perceived as most effective (κ<0.4, poor, in both cases; and although the upper bounds of the 95% CIs show agreement, 0 belongs to the 95% CIs, meaning that agreement by chance cannot be ruled out), whereas overall, lack of agreement cannot be ruled out (κ<0.4, poor; the upper bound of the 95% CI shows agreement, and 0 does not belong to the 95% CI, meaning that agreement by chance can be ruled out). Finally, there is agreement (κ>0.4, fair to good) in the case of EP.

This means that, in the case of EP, participants tend to associate their favourite technique with the technique perceived as most effective, contrary to the findings for CR and BT. This is more likely to be due to EP being the technique that many more participants like best (so the chances of a match are higher compared to the other techniques) than to there actually being a real match. With respect to directionality whenever there is disagreement, the results of the Stuart-Maxwell and McNemar-Bowker tests show that the null hypotheses of marginal homogeneity (χ²(2, N=37)=8.355, p=0.015) and symmetry (χ²(3, N=37)=8.444, p=0.038) can be rejected.

Looking at the light grey cells in Table 36, we find that there are more participants who claim to have applied CR best yet prefer EP than vice versa (8 versus 1). This means that the mismatch between the technique that participants like best and the technique that they perceive as most effective can largely be put down to participants who like EP best perceiving CR to be more effective.

The results for the agreement between the technique that is easiest to apply and the technique perceived as most effective are exactly the same as for the preferred technique (see the third row of Table 35). This means that, for EP, participants equate the technique that they find easiest to apply with the one that they regard as most effective. This does not hold for the other two techniques. Likewise, the mismatch between the technique that is easiest to apply and the technique perceived as most effective can largely be put down to participants who find EP easiest to apply perceiving CR to be more effective (see Table 36).

As mentioned earlier, we found that participants have a correct perception of the program in which they detected most defects. Table 37 shows that participants do not associate the simplest program with the program in which most defects were detected (PP1-OP1). This is striking, as it would be logical for it to be easier to find defects in the simplest program. As illustrated by the fact that the null hypotheses of marginal homogeneity (χ²(2, N=37)=3.220, p=0.200) and symmetry (χ²(3, N=37)=4.000, p=0.261) cannot be rejected, we were not able to find bias in any of the cases where there is disagreement. A possible explanation for this result is that participants are not properly interpreting what “simple” means.

7.3.2 RQ3.2: Comparing Opinions

Finally, we study the possible relation between the opinions themselves. Looking at Table 38, we find that participants equate the technique they think they applied best with their favourite technique and with the technique they found easiest to apply (κ>0.4, fair to good, overall and per technique; and 0 does not belong to the 95% CIs, meaning that agreement by chance can be ruled out). It makes sense that the technique that participants found easiest to apply should be the one that they think they applied best and like best. Typically, people like easy things (or perhaps we think things are easy because we like them). In this respect, we can conclude that participants’ opinions about the techniques all point in the same direction.

7.3.3 Findings

Our findings suggest:

– Participants’ perceptions of technique effectiveness are related to how well they think they applied the techniques. They tend to think it is they, rather than the techniques, who are the obstacle to achieving higher effectiveness (a possible evaluation apprehension threat has materialized).

– We have not been able to find a relationship between the technique participants like best or find easiest to apply and perceived effectiveness. Note, however, that the technique participants think they applied best is not necessarily the one that they really applied best.

– Participants do not associate the simplest program with the program in which they detected most defects. This could be due to participants not properly interpreting the concept “simple”.

– Opinions are consistent with each other.

Again, these results are confined to the validity limits imposed by the study.

:::info Authors:

  1. Sira Vegas
  2. Patricia Riofrío
  3. Esperanza Marcos
  4. Natalia Juristo

:::

:::info This paper is available on arxiv under CC BY-NC-ND 4.0 license.

:::

