Authored by Dilpesh Bhesania (IBM) and Richard Jordan (Nationwide Building Society)
“Quality is not an act; it is a habit” – Aristotle. This quote is relevant today, whether that is in relation to a Product, Process, or a team. Quality is at the centre of everything an organisation strives to achieve. Quality can be subjective but having a mechanism to demonstrate the quality position can be of great significance. In the case of application/product delivery there are various reasons as to why teams may choose to measure the quality position, some of which are outlined in this article.
In the traditional world of Waterfall delivery, the perceived quality of a product was validated by a separate Testing team who would design and run a series of tests to determine if the solution met the requirements that were defined by the business or organisation. This approach raised the question of who is ultimately accountable for quality. The relevant point here, however, is the type of metrics that would have been used to assess the quality of a product, process or organisation.
The most apparent measure of quality in traditional delivery was attributed to the number of defects that were raised, whether in totality or the rate at which they were detected. This mustn’t be confused with Defect Density which is a completely separate metric. The total number of defects raised is a simple numerical value where the higher the number the poorer the perceived quality.
Other traditional measures of quality included:
- Test Status – Passed/Failed/Blocked/No Run
- Test Schedule – Progress against Plan
- Planned vs Actual
These metrics were presented in various flavours split between applications, platforms, teams etc. coupled with calling out top blockers, RAG statuses, and path to green and so on. Areas more mature in their reporting would extend this to include metrics such as the total effort and/or cost associated with the testing activities, feature stability etc.
While these metrics served a purpose in demonstrating testing progress, they did not really demonstrate the quality of the product or of the processes being adopted. More importantly, they failed to show a joined-up, consistent view of how quality was being considered across the delivery lifecycle.
A slightly more controversial viewpoint is that, in a traditional delivery construct, teams would provide metrics because that is what management told them to produce. Project Managers would want to see numbers which generally focused on the volume of tests, with a view that “more tests mean better progress/quality!”. This put the wrong focus onto metrics, as it became increasingly difficult to demonstrate the value of Testing and, more importantly, to demonstrate Quality.
With an ever-growing shift towards agile methodologies comes the common misconception that adopting them instantly leads to more collaborative ways of working. This isn’t always the case, especially when an organisation’s culture and mindset are still very much traditional.
While it’s true that quality isn’t isolated to a particular area, this article attempts to highlight some of the measures which can be employed to demonstrate the quality position of the product and to embed that view across the lifecycle. These metrics encourage engineering teams to adopt a “Quality first” approach and actively seek out areas of improvement or where efficiencies can be made, rather than defaulting to the Test function to use these metrics to reflect progress.
It will also call out some efficiency measures which can complement the overall reporting structure; however, the key is to help people shift into a mindset that metrics are there to enable valuable insight into teams, processes and products. Blindly producing ‘throw-away’ metrics because “that’s the way it’s always been done” is a hindrance to an organisation’s Quality transformation journey.
Why teams should measure – Purpose
There are many reasons why teams would use certain metrics to report progress, efficiency and quality positions. For example:
- Evaluate Product Health – Having a visual and/or numerical indication of how “healthy” your product is can give you a quick indication of where focus needs to be applied to either course correct (feedback loops) or maintain current trajectory.
- Determine Progress – These allow you to understand by when a particular goal or story will be met. They also highlight any impediments a team may be facing which are hindering progress and ultimately the quality of a product.
- Requirements Coverage and Traceability – Quickly identify where gaps exist when it comes to coverage and identify areas of risk
- Uncover Inefficiencies – This can be from multiple viewpoints, be it feature delivery, code rework, value realisation, time to market, the list goes on. Having a set of metrics which can demonstrate where inefficiencies are within the product lifecycle helps to target areas of remediation.
- Articulate Product Complexity – You could ask “Why do I care about Product Complexity?”. Traditionally many organisations probably wouldn’t have even thought about considering this as a quality indicator, let alone measuring it. But a poorly managed product, whether through coding standards, bloated features, poor/inaccurate design etc. will naturally lead to increased rework and costs but more importantly it will result in eroded value. Having a way to measure this, such as McCabe’s Cyclomatic Complexity, helps to understand just how complex a product may be. Just because a product is complex doesn’t mean it needs to be complicated. A metric such as this can help provide clarity around adequate unit test coverage, for example.
- Adherence to Acceptance Criteria / Definition of Done
- Articulation of Risk / Problem Areas – Metrics can provide meaningful insight into what risks may be apparent, both in the product and in the process. We seldom see metrics which help to articulate the risk position, and this can further play into other areas called out in other sections.
- Create Feedback Loops – The frequency of issuing metrics can create additional feedback loop instances, or simply create one that never existed.
- Investment Payoffs – Quantify how certain investments or changes in strategy have helped increase the quality of a product or the way in which a team operates.
- Automation Effectiveness – This can help determine if your Automation strategy for example is helping you to deliver your product(s) without compromising quality.
- Accountability – Ultimately everyone should take a level of accountability when it comes to Quality; after all, it’s everyone’s problem. You can’t hide behind metrics!
- Speed to Market / Product Release – The end user wants functionality, fast. This demand keeps growing in today’s constantly evolving technology landscape, and having a view of how quickly these features are released provides insight into whether your consumer base is satisfied.
- Trend Analysis / Continuous Improvement – Quickly determine if improvements are paying off or if quality is on a decline. Having this view allows for earlier course correction and remediation.
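As an illustration of the Product Complexity point above, the sketch below approximates McCabe’s Cyclomatic Complexity for a piece of Python source by counting decision points with the standard `ast` module. It is a simplified approximation under stated assumptions, not a replacement for a dedicated analysis tool: the set of node types counted and the names `cyclomatic_complexity` and `SAMPLE` are illustrative choices made here.

```python
import ast

# Node types treated as decision points in this simplified sketch (an assumption).
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + the number of decision points."""
    tree = ast.parse(source)
    complexity = 1  # a single straight-line path through the code
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds one extra path per additional operand.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical function used only to exercise the counter.
SAMPLE = '''
def grade(score):
    if score >= 90 and score <= 100:
        return "A"
    elif score >= 50:
        return "pass"
    for _ in range(3):
        pass
    return "fail"
'''

sample_complexity = cyclomatic_complexity(SAMPLE)
print(sample_complexity)
```

A higher value suggests more independent paths through the code, which in turn hints at how many unit tests may be needed to exercise them.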
The DevOps Research and Assessment (DORA) Programme
The DORA was a seven-year research programme which, from 2014, looked at various data points from over 32,000 professionals globally across the IT industry. It was acquired by Google in 2018. For the first 5 years, the DORA group published a set of annual reports to provide a benchmark for DevOps practices while also providing direction and guidance for teams on how to continuously improve their outputs.
In 2018 the DORA published a book, “Accelerate”, in which the team identified a core set of metrics which they claimed demonstrate the software development and delivery capabilities of teams. The premise is that these 4 metrics help teams to make informed, data-driven decisions to continuously improve the rate of delivery, improve practices and ways of working, and ensure the product remains reliable. From the study it was established that top performing teams will look to consistently improve across these 4 metrics.¹
- Change Lead Time – This looks at how long it takes a team to have code running successfully in Production from it being committed in development environments. It can show how mature the delivery process is or, equally, highlight areas of inefficiency. Longer lead times can often be attributed to reasons such as not having CI/CD pipelines, shared environments, isolated development and testing teams and cumbersome Route to Live processes. Teams that aim for “elite” status are able to have code running successfully in Production in a day, while others that aren’t as mature may only have monthly, quarterly or even half yearly deployments. This increases the risk of trying to deliver too much and possibly creating poor user experiences or system outages.
- Deployment Frequency – This looks specifically at how often changes are pushed into Production. Interestingly this can be applied at any level i.e. a view can be taken holistically across the organisation, but equally where there are some teams that are more mature in their DevOps capabilities then these metrics can be applied locally too. The underlying insight which is being sought still holds true in either case. Teams aim to deliver smaller, more frequent changes into Production ultimately speeding up benefits realisation for the end user. The added benefit here is that as these changes tend to be much smaller, the risk of production failures or regression is significantly reduced. Teams that aim for “elite” status have almost on-demand deployment capabilities while others that aren’t as mature in their DevOps capability may only have monthly, quarterly or even half yearly releases. The deployment frequency itself can be easily calculated and this should be used to generate insight into not only how teams can get quicker but also if there are other underlying impediments which may be hindering faster and more frequent releases into Production.
- Change Failure Rate – This looks at how often a change pushed into Production results in a failure. This can be a total outage, or degraded service performance or availability. Ultimately, it’s any time a change that enters Production requires a fix. Typically, by measuring the Change Failure Rate a team can assess the maturity of its deployment process, whether this happens to be manual or part of an integrated pipeline. It can also, and probably more importantly, highlight quality issues which may have leaked through from earlier in the Product lifecycle. The focus should always be on delivering quality features to Production, and not on the amount teams are able to deliver: quality vs. quantity. Target failure rates can be set by the organisation in line with factors such as risk appetite but ideally teams should be aiming for near-zero failure rates. More mature or “elite” teams will have near-zero failure as a result of their organisational or team culture but also due to the other metrics called out throughout this article.
- Mean Time to Recovery (MTTR) – This focuses on being able to measure the reliability of your systems and applications. It looks at how long it takes for an organisation to recover from an outage or incident. It’s inevitable that failures will occur but being able to recover from these failures is key. Teams aiming for “elite” status will be able to recover in minutes or hours whereas others that may not have mature incident management processes may take much longer, in some cases weeks or even months. This information can also provide other meaningful insight, such as being able to determine whether an organisation has sufficient alerting and monitoring capabilities or whether the size and number of Production deployments need to be revisited. As mentioned previously, smaller, more iterative releases reduce the risk of production failures and this in turn results in a continued positive user experience.
The DORA metrics have been referenced here to demonstrate that there is no single way to measure quality, service stability or organisational maturity. However, in conjunction with other metrics, they can be a good starting point for teams and organisations.
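The four DORA metrics described above can be derived from a simple log of Production deployments. The following is a minimal sketch, not an official DORA calculation: the `Deployment` record shape, the `dora_summary` function and the choice to measure recovery time from the moment of deployment are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """One Production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it was running in Production
    failed: bool = False                    # did it require a fix in Production?
    restored_at: Optional[datetime] = None  # when service was recovered, if it failed


def _mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def dora_summary(deployments: List[Deployment], period_days: int) -> Dict[str, object]:
    """Summarise the four DORA metrics for a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the outage starts when the failing change is deployed.
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "change_lead_time": _mean(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "change_failure_rate_pct": len(failures) * 100 / len(deployments),
        "mean_time_to_recovery": _mean(recoveries) if recoveries else None,
    }


# Illustrative data only.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8), failed=True,
               restored_at=t0 + timedelta(hours=9)),
]
summary = dora_summary(sample, period_days=1)
print(summary)
```

In practice the deployment records would come from the CI/CD and incident management tooling rather than being typed in by hand, and the reporting period would be chosen to match the team’s cadence.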
What teams should measure – Quality and Efficiency Indicators
The table below extends further on DORA to cover just some of the metrics that can be used to demonstrate the quality and efficiency position of a team, product or organisation. Links are made to one or more of the DORA metrics. Not all metrics can, or should, link back to DORA and this is to illustrate the point that 4 metrics alone will not be sufficient to provide holistic insight. To reemphasise, these are not prescriptive but they are intended to help shift the mindset from “We must provide metrics because that is what is being asked” to “What are these metrics actually telling me about my organisation’s position on quality?”.
Metric/Indicator | Type | Description / Primary Objective | Calculation | Additional Information | DORA Link |
---|---|---|---|---|---|
Automated Vs Manual Testing | Efficiency | Primarily this metric helps to understand the overall split between Automated and Manual testing, along with the journey being made to move towards a higher level of automation. Higher levels of automation generally result in more efficiency, not only because of the faster associated run times but also because it frees up time for people to work on other areas/tasks. | Automation Coverage = (Total No. of Automated Tests / Total No. of Tests) * 100 | This metric can also supplement the position of retaining a number of manual tests by helping articulate why certain elements should not be automated e.g., higher levels of complexity, limited rate of execution, low rate of change, low risk etc. | Change Lead Time, Deployment Frequency |
Test Suite Run / Execution Time | Efficiency | This can help understand how long it takes for a set of tests to be executed, which can either be part of a DevOps/CI/CD pipeline or standalone. Coupled with the measure for Automation it showcases just how quickly value can be demonstrated. | Run/Execution Time = End Time – Start Time | While the time factor alone may not seem to provide value, it actually starts to prompt questions such as: | Change Lead Time, Deployment Frequency |
Automation Test Stability | Efficiency | This can be incredibly useful in determining how stable the automation collateral itself is. If tests are continually failing or are in constant need of refactoring, then this is a good indication that the tests themselves are not stable enough when testing the product. There should be an upward trend in stability over time. | Automation Test Stability = (Total No. of Failures / Total No. of Executions) * 100 | Ideally this should be calculated on a per test basis so as not to skew the overall view of stability, as only a subset of tests could be problematic for example. As calculated, this is a failure percentage, so improving stability shows as a falling value. | Change Lead Time |
In Sprint Automation vs Out of Sprint (Value Realisation) | Efficiency | This can be a good indicator of how much value is realised from within the sprint itself by means of automation. Any new features delivered within the sprint should be coupled with automation test assets, making integration into CI/CD pipelines much more effective. | In Sprint Automation = (Total No. of Tests Created In Sprint / Total No. of Tests Created Out of Sprint) * 100 | – | Change Lead Time |
Defect Density | Quality | This can help provide a good indication of problematic areas within the overall product. Defect Density can be calculated for a product in its entirety or it can be broken down into smaller areas such as modules, technology types etc. It’s another good feedback mechanism to sprint teams in terms of where additional focus may be needed. Typically, the Defect Density is measured per KLOC (1000 Lines of Code). | Defect Density = (Total No. of Defects / Total Lines of Code) * 1000 | This can also be used to demonstrate why a team is only able to achieve a particular velocity, as a higher number of defects will impede progress. | Change Failure Rate |
Build Stability | Quality | Demonstrating build stability can start to generate an insight into the quality of the code that is being output. It also starts to highlight where impediments may exist within the sprint and where focus needs to be applied in order to remove that impediment completely or to alleviate some of the pressures. A high number of build failures may indicate an issue with the DevOps toolchain or CI/CD pipeline for example. Collection of this metric may only be possible if teams are utilising a DevOps toolchain. | Build Stability = (Total No. of Build Failures / Total No. of Builds) * 100 | It can also link into the stability of the underlying code, and this can be indicated in a number of ways e.g. Unit Test Coverage, Code Churn etc. These will be expanded on further. As calculated, this is a failure percentage, so a lower value indicates a more stable build. | Deployment Frequency |
Unit Test Coverage | Quality / Completeness | This is a great way of understanding how much of the source code has been tested. It can be done through a number of methods such as: Line Coverage – This is how many lines of code have been tested | Statement Coverage = (Total No. of Executed Statements / Total No. of Statements) * 100; Branch Coverage = (Total No. of Executed Branches / Total No. of Branches) * 100; Line Coverage = (Total No. of Executed Lines / Total No. of Lines) * 100 | As well as providing an insight to early product quality issues, having a robust set of Unit Tests can demonstrate the breadth and depth of coverage for the product under test. Ultimately the higher the coverage at these early stages, the better the chance of finding material issues. | Change Lead Time, Deployment Frequency |
Code Churn | Quality | Fundamentally, this is “Rework” and measuring the level of code churn can help in identifying problematic areas and time/effort expended on making changes. It’s accepted that rework is inevitable but code that undergoes change on a constant basis could be an early indicator of poor quality. It is not unusual for code churn levels to vary throughout the product lifecycle, for example being extremely high during initial development through to low and fairly stable following product release. | Total Code Churn = Lines of Code Added + Lines of Code Deleted + Lines of Code Modified | It can be a good indicator of what defects to possibly expect once code has been released to Production. Demonstrating a higher level of code churn as a sprint/release/production deadline approaches could be a warning signal of fault-prone code. | Change Lead Time, Deployment Frequency |
Performance Monitoring | Quality / Coverage | Performance is critical in today’s world and having a way to monitor this is key especially where user demand is increasing. Having a view of where there may be poor SLA adherence, memory leaks, non-compliant Garbage Collection etc. will help target areas of focus. This can be monitored release upon release to establish trends and fix issues which may arise. | While there are many low-level calculations that can be made from a Performance perspective, below are some of the areas which could be considered when assessing performance quality at the early stages of the Product lifecycle: · Compliance with architectural practices · Connection pooling Vs Static connections · Memory management · Technology aligned practice adherence e.g. Object Oriented · Constant benchmarking against agreed SLAs | Trends in system or application characteristics can be monitored to understand if degradation is being introduced. This can include: · Load Analysis · Stress Profiles · Endurance / Soak Analysis | MTTR |
Reliability | Quality / Coverage | The focus on reliability is now bigger than ever before, and it continues to grow. Organisations as well as the end user want reliable systems and applications, so having a way to measure this is incredibly useful to determine where flaws may exist. Reliability has an incredibly strong link into resilience and measuring the reliability aspects of a product or application can provide strong insight into potential resiliency issues. Rolling metrics can be set to any interval or time period. | Mean Time Between Failures = Total Operational Time / Total No. of Failures (Rolling Metric); Average Failure Rate = (Total Production Failures / Total No. of Components Deployed) * 100; Mean Time to Repair = Total Time Spent on Repairs / Total No. of Repairs (Rolling Metric); Mean Time to Recover = Total Downtime / Total No. of Incidents (Rolling Metric) | Reliability metrics can also help you understand the effectiveness of your Incident Management and support processes. If the average time taken to resolve an issue is growing, then it could indicate an issue with the processes in place. Software or code complexity links in closely with reliability, as generally higher complexity code will lead to reliability issues. | Change Failure Rate, MTTR |
Security | Quality / Coverage | Knowing the quality position of your code from a Security standpoint is critical, and code should be able to stand up to attacks. A 2021 report from IBM put the average cost of a data breach at $4.24 million, so it is clear why security is instrumental when it comes to producing quality products.² It’s not enough to assume that security is at the forefront of everyone’s mind, therefore having the ability to track threat resistance, code scan output trends, or the applications that meet compliance requirements provides you with a profile of what level of risk you’re carrying. Rolling metrics can be set to any interval or time period. | Application Infiltration Rate = (Total No. of Infiltrated Applications / Total No. of Applications) * 100; Security Defect Density = (Total No. of Security Defects / Total Lines of Code) * 1000; Vulnerability Creation Rate = Total No. of Vulnerabilities Created (Rolling Metric); Vulnerability Remediation Rate = Total No. of Vulnerabilities Remedied (Rolling Metric); Vulnerability Growth Rate = Vulnerability Creation Rate – Vulnerability Remediation Rate | This may help to provide insight into the types of tools or practices which could be considered in the product lifecycle, for example Static Application Security Testing (SAST) or Dynamic Application Security Testing (DAST) tools. It could identify weaknesses in vulnerability libraries and coding standards or highlight capability concerns. | MTTR |
Maintainability | Quality | This is all about how easily code can be updated with regards to improvements, removing redundancies, and addressing defects. There is a direct link between the success of a Product and the ability to maintain its code. It’s a consistent activity to address new demand, making code more efficient, addressing security vulnerabilities, or simply making performance optimisations. It is also useful to understand how well code can be maintained or, more importantly, how quickly it can be made operational again following an outage. A metric such as the ‘Mean Time to Repair’ can be used to determine just how efficiently this can be done and highlight possible areas of improvement. | Mean Time To Repair (MTTR) = Total Time Spent on Repairs / Total No. of Repairs. There is a compound metric that can be used to help determine the Maintainability Index (MI). The higher the Index the better the maintainability of the code. MI = 171 – 5.2 * ln(Halstead Volume) – 0.23 * (Cyclomatic Complexity) – 16.2 * ln(Number of Statements)³ | There are a number of additional measures that can be employed to help quantify the Maintainability of code. These can include: · Level of Coupling · Level of Cohesion · Duplicated Code · Naming Convention Consistency | MTTR |
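Several of the calculations in the table are simple ratios and can be scripted directly. The sketch below expresses a handful of them as Python functions; the function and parameter names are hypothetical, the input figures are invented for illustration, and the Maintainability Index follows the formula exactly as quoted in the table.

```python
import math


def percentage(part: float, whole: float) -> float:
    """(part / whole) * 100, rejecting an empty denominator."""
    if whole == 0:
        raise ValueError("denominator must be non-zero")
    return part * 100 / whole


def automation_coverage(automated_tests: int, total_tests: int) -> float:
    """Automation Coverage = (Automated Tests / Total Tests) * 100."""
    return percentage(automated_tests, total_tests)


def defect_density_per_kloc(defects: int, lines_of_code: int) -> float:
    """Defect Density = (Defects / Lines of Code) * 1000."""
    if lines_of_code == 0:
        raise ValueError("lines_of_code must be non-zero")
    return defects * 1000 / lines_of_code


def mean_time_between_failures(operational_hours: float, failures: int) -> float:
    """MTBF = Total Operational Time / Total No. of Failures."""
    if failures == 0:
        raise ValueError("no failures recorded in this window")
    return operational_hours / failures


def vulnerability_growth(created: int, remedied: int) -> int:
    """Vulnerability Growth Rate = Creation Rate - Remediation Rate."""
    return created - remedied


def maintainability_index(halstead_volume: float,
                          cyclomatic_complexity: float,
                          statements: int) -> float:
    """MI = 171 - 5.2*ln(V) - 0.23*CC - 16.2*ln(statements)."""
    return (171
            - 5.2 * math.log(halstead_volume)
            - 0.23 * cyclomatic_complexity
            - 16.2 * math.log(statements))


# Invented example figures for one reporting window.
print(automation_coverage(150, 200))
print(defect_density_per_kloc(30, 20_000))
print(mean_time_between_failures(720, 3))
print(vulnerability_growth(created=12, remedied=9))
print(maintainability_index(1000, 10, 200))
```

Keeping each formula in its own small function makes it easier to apply them per module or per test, which the table suggests for metrics such as Defect Density and Automation Test Stability.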
Where should teams make metrics available?
Metrics can be made available on any platform that is being used by the teams. There is no hard and fast rule; they just need to be consistent. For example, don’t have some metrics being reported in Confluence and some in Jira as that doesn’t make sense and it doesn’t give stakeholders all the information in a single place. After all, it’s all about being efficient.
Platforms such as Confluence, Jira, and Test Management tools can definitely be used. If more is being done in terms of CI/CD, then it may be useful to configure reports from the DevOps toolchain to demonstrate certain capabilities.
Now more than ever access to information needs to be streamlined and having something that can provide the insight to enable decisions to be made is incredibly important. Stakeholders don’t want to be searching in multiple places to get the information they need.
Who should be using these metrics?
Ideally these reports should be automated and take input from the various toolsets that are being used to support all the activities performed within the delivery functions. This can cover design, build, test, infrastructure, pipeline activities, build releases etc. which enable the above metrics to be reported against. These can be owned by, for example:
- Sprint Teams
- Scrum Masters
- Quality Engineers/SMEs/Leads
- Technology Leads
- Support Teams
- Incident Management
- Product Owners
Reporting is not enough. There needs to be tangible and meaningful insight gained from the use of these metrics to allow teams to constantly improve and strive to deliver in a fast, nimble and more iterative manner. It has to be a consolidated effort across all areas of the delivery function to recognise the areas which are doing well but equally take accountability for those that are not.
When should these metrics be collected?
There is no hard and fast rule on the frequency of collecting metrics. They can be collected as frequently as deemed necessary, as it is all in aid of fast feedback and enabling better quality. Therefore, having an automated reporting capability would enable reports to be distributed daily and to be used in various ceremonies, be it stand ups or retrospectives. It can also assist in backlog prioritisation and identifying key areas of focus to address possible quality concerns.
- Daily Report generated by automated hourly metric collections (of course this depends on team construct, velocity etc.)
- Weekly / End of Sprint Summary – Enabling trend analysis and continuous improvement
- Monthly – Feeding up into key business/stakeholder meetings etc.
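The daily report built from hourly collections could be sketched as a simple roll-up. The example below is illustrative only: it assumes each collection yields a timestamp and a single numeric value (here, a build failure percentage), and the names `daily_averages` and `trend` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def daily_averages(samples: Iterable[Tuple[datetime, float]]) -> Dict[str, float]:
    """Group timestamped metric samples by calendar day and average them."""
    by_day: Dict[str, List[float]] = defaultdict(list)
    for taken_at, value in samples:
        by_day[taken_at.date().isoformat()].append(value)
    # ISO dates sort chronologically, so the result is in date order.
    return {day: mean(values) for day, values in sorted(by_day.items())}


def trend(averages: Dict[str, float]) -> str:
    """Compare the first and last day in the window: 'up', 'down' or 'flat'."""
    values = list(averages.values())
    if len(values) < 2 or values[0] == values[-1]:
        return "flat"
    return "up" if values[-1] > values[0] else "down"


# Invented samples: build failure percentage collected on two days.
samples = [
    (datetime(2024, 3, 4, 9), 10.0),
    (datetime(2024, 3, 4, 10), 20.0),
    (datetime(2024, 3, 5, 9), 5.0),
    (datetime(2024, 3, 5, 10), 5.0),
]
report = daily_averages(samples)
print(report, trend(report))
```

The same roll-up can be widened to a sprint or a month for the weekly and monthly views, with the trend feeding retrospectives or stakeholder meetings.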
The metrics in this article are by no means prescriptive, and should not be taken as such. It is clear there is no single metric which will provide holistic or complete insight into the quality of software/product/code. They are there to serve as examples on how an organisation can start to include certain elements within their reporting and, more importantly, derive meaningful insight into the overall Quality journey. As mentioned, the frequency of capturing these metrics is down to the individual teams producing and/or consuming them but agreeing on consistency of generation will be key. It may be worthwhile pinning metric generation to key delivery milestones, or simply in line with sprints. As long as there is value to be gained from the insight provided by the metrics then the frequency is fluid.
If your organisation is looking to start capturing metrics, then trying to measure everything may be a little overwhelming and it may become difficult to see the wood for the trees. Starting with a smaller subset of metrics in line with what your stakeholders are most interested in seeing will allow you to drive more insight and ultimately influence greater change. Ask yourself, “What would I choose if I were just starting to report metrics?”.
The art of storytelling is one that is backed by robust metrics. As the phrase goes “Knowledge is Power” and knowing what to do with what these metrics are telling you could be the difference between success and failure. After all, “Quality is not an act, it is a habit”.
NB: All views expressed throughout this article are strictly personal and are those of the authors. They are not a reflection of the organisations they represent.
1 https://www.devops-research.com/research.html
2 https://www.ibm.com/uk-en/security/data-breach
3 https://www.ibm.com/docs/en/addi/6.0.0?topic=reports-maintainability-index-report