Benchmark Software Testing
…
7 pages
Abstract
I've been paged at 2 a.m. because "the app feels slow." No stack trace. No crash. Just vibes. That's usually when we discover nobody ever defined what fast actually means. That's the hole benchmark software testing is supposed to fill. Early in my career, we shipped a feature-heavy release to prod. All unit tests were green. Load tests? "We'll do it later." Traffic doubled after a marketing push, CPU spiked, latency crept from 200ms to 1.8s, and users bailed. Postmortem verdict: no baseline. No benchmark. Just assumptions. We fixed the bug, but the real failure was the process. If you don't measure performance against a known standard, you're not engineering. You're guessing. Tools like Keploy exist because guessing doesn't scale.
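To make "measure against a known standard" concrete, here is a minimal sketch in Python. It is not how Keploy or any particular tool works. The function names, the nearest-rank percentile method, and the 10% tolerance are all assumptions chosen for illustration; the 200 ms and 1.8 s figures are simply the ones from the story above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_against_baseline(samples_ms, baseline_p95_ms, tolerance=0.10):
    """Compare the observed p95 latency with a recorded baseline.

    `tolerance` is a hypothetical allowance (10% by default) for normal
    run-to-run noise; pick a value that matches your own measurements.
    """
    p95 = percentile(samples_ms, 95)
    limit = baseline_p95_ms * (1 + tolerance)
    return {"p95_ms": p95, "limit_ms": limit, "regressed": p95 > limit}


if __name__ == "__main__":
    # Illustrative data only: 90 requests at 200 ms, 10 requests at 1800 ms.
    after_traffic_spike = [200] * 90 + [1800] * 10
    print(check_against_baseline(after_traffic_spike, baseline_p95_ms=200))
```

The point is less the arithmetic than the habit: once a baseline number is written down, a check like this can fail a build instead of waiting for a 2 a.m. page.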
Shubham Jha