The Power of Three (Defect Categories) December 5, 2012

Posted by Tim Rodgers in Project management, Quality.

Over the last few weeks of software projects at HP we would have cross-functional discussions about open defects to essentially decide whether or not to fix them. We considered the probability or frequency of the defect’s occurrence, and the severity or impact of the defect to the end-user, then assigned the defect to a category that was supposed to ensure that the remaining time on the project was spent addressing the most important outstanding issues.

I don’t remember exactly how many different categories we had in those days (at least five), but for some reason we spent hours struggling with the “correct” classification for each defect. I do recall a lot of hair-splitting over the distinction between “critical,” “very high,” and “high” which seemed very important at the time. Regardless, everyone wanted their favorites in a high-priority category to make it more likely that they would get fixed.

I think we could have saved a lot of time if we had used just three categories: (1) must fix, (2) might fix, and (3) won’t fix. That covers it. Nothing else is necessary. The first group are those defects you must fix before release. The second group are the ones that you’ll address if you have time after you run out of the must-fix defects. The third group are the ones you aren’t going to fix regardless of how much time you have. To be fair, the might-fix defects should be ranked in some order of priority, but at that point you’ve already addressed the must-fix defects and it won’t matter much which defects you choose.

I’m not a psychologist, but I think there’s a big difference between trying to classify things into three categories vs. trying to classify things into more than three categories. I think the human brain gets a little overwhelmed by too many choices. Two might seem better than three because it forces a binary selection, but I think it’s a good idea to compromise and allow for a “maybe” category rather than endure endless indecision.

Calling Attention to Internal Quality Costs December 3, 2012

Posted by Tim Rodgers in Process engineering, Quality.

Electronics manufacturing services and contract manufacturers typically operate with very small profit margins, and Foxconn was no exception when I managed a factory quality team there in the late 00s. I assumed this would be a receptive environment for an initiative targeting cost-of-quality (COQ): measuring the costs of prevention processes, appraisal processes (test and inspection), and internal failures (rework and repair); and then working to reduce those costs. External failures that lead to product returns and other warranty costs get a lot of attention because they’re visible to customers and end-users and can ultimately impact loyalty and future business.

Initially it didn’t seem that anyone noticed or cared much about the cost of the internal processes required to minimize those external failures. Some people understood that testing and inspection are non-value-added activities and targets for elimination in any lean manufacturing environment; nevertheless, they were assumed to be necessary regardless of how much attention was given to defect prevention. Of course any variability in a manufacturing process leads to higher cost (see Taguchi loss functions), but it was even harder to get attention for that idea in this environment. It was challenging enough to get the support to collect the basic cost data and assemble a Pareto diagram.
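
For readers who haven’t built one, the arithmetic behind a Pareto diagram is simple: sort the cost categories from largest to smallest and track the cumulative share. Here is a minimal Python sketch. The category names and amounts are hypothetical, chosen only for illustration; they are not data from the factory.

```python
def pareto(costs):
    """Return (category, cost, cumulative_share) rows, largest cost first."""
    total = sum(costs.values())
    rows = []
    running = 0.0
    for category, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        running += cost
        rows.append((category, cost, running / total))
    return rows


# Hypothetical monthly internal quality costs (illustrative numbers only).
example_costs = {
    "scrap": 9000,
    "rework and repair": 42000,
    "operator training": 6000,
    "end-of-line test": 25000,
    "visual inspection": 18000,
}

for category, cost, cumulative in pareto(example_costs):
    print(f"{category:<20} {cost:>8,} {cumulative:7.1%}")
```

The cumulative column shows at a glance how few categories account for most of the cost, which is usually where the conversation should start.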

What I finally realized was that cost savings weren’t appreciated unless they were presented in discrete, quantized bundles. Dollar (or RMB) savings claimed on a spreadsheet don’t have the same visible or practical impact as headcount reduced or processes eliminated. For example, I started working with the industrial engineers to figure out what improvement in our internal end-of-line yield would enable us to eliminate exactly one repair station for each production line. When audit inspections and reliability testing of production units consistently exceeded required quality levels, I argued for reduced sampling rates that allowed us to re-allocate the headcount to value-added activities.

The performance measures for the factory included units per headcount and units per production hour. I finally got traction with my COQ program when I targeted changes to directly impact those measures, reducing headcount and cycle time without compromising quality.

Managing Quality Without Data: Don’t Try This At Home November 16, 2012

Posted by Tim Rodgers in Management & leadership, Quality.

I know from experience how hard it is to measure, analyze, improve, and control quality in a high-stakes or high-volume production environment. There’s never enough data, and there’s constant pressure to draw conclusions and make decisions. What can we do to fix this defect? Did the process change work? When can we get the line running again? Sample sizes are usually too small to determine whether differences are statistically significant. One data point on a run chart will be taken as evidence that things are trending in the right direction. One defective product chosen at random “proves” that the whole batch is bad.
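
To see how little a small sample can tell you, here is a rough Python sketch of a one-sided Fisher exact test comparing defect counts before and after a process change. The function name and the counts are hypothetical, and the test is one reasonable choice for small samples, not the only one.

```python
from math import comb


def fisher_one_sided(defects_before, n_before, defects_after, n_after):
    """Probability of seeing `defects_after` or fewer defects in the 'after'
    sample if the process change made no difference (hypergeometric tail)."""
    total_defects = defects_before + defects_after
    total_units = n_before + n_after
    denominator = comb(total_units, n_after)
    lowest = max(0, total_defects - n_before)  # fewest defects 'after' could hold
    p_value = 0.0
    for k in range(lowest, defects_after + 1):
        ways = comb(total_defects, k) * comb(total_units - total_defects, n_after - k)
        p_value += ways / denominator
    return p_value


# Hypothetical example: 4 defects in 20 units before, 1 defect in 20 units after.
p = fisher_one_sided(4, 20, 1, 20)
print(f"one-sided p-value: {p:.3f}")
```

With these made-up numbers the defect rate appears to drop from 20% to 5%, yet the p-value is roughly 0.17, so a result this good would turn up about one time in six even if the change did nothing.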

It’s worse when there’s no data at all, by which I mean the lack of a reliable source of objective, unfiltered, and unbiased data. You can’t run an effective quality system without links to the factory’s internal information systems: ERP, MRP, and other shop floor control and measurement systems. Data that’s automatically collected and reported in the course of normal production is less likely to be manipulated to make the situation look better (or worse) than it is. Data that’s manually collected is acceptable but less trustworthy, unless repeatability and reproducibility have already been established for operators.

There are too many people and constituencies with a vested interest when it comes to quality, people who want to believe that quality is always good, or at least good enough. It’s not fun for anyone when the production line is down or field failures are up. It’s easy to discount or ignore data as outliers when they don’t fit the desired story. There are also real situations where data may be suppressed or even fabricated. I think you’re better off with no data than with data that’s been compromised, but of course the better solution is to improve data integrity before making any changes to improve quality.

Common Fallacies That Cause People to Doubt Statistics October 23, 2012

Posted by Tim Rodgers in baseball, Process engineering, Quality.

Lately I’ve been reviewing some old textbooks and work files as part of my preparation for the ASQ Six Sigma Black Belt certification exam in March. It’s interesting, and I think often amusing, to contrast the principles of inferential statistics and probability theory with the ways they’re used in the real world. I think people tend to underestimate how easy it is to misuse statistical methods, or at least apply them incorrectly, and this can lead them to undervalue all statistical analysis, regardless of whether or not the methods were applied correctly.

I see this in baseball and political commentary all the time, particularly in the way people selectively or incorrectly use numbers to defend their point of view, while at the same time mocking those people who use numbers (correctly or not) to defend a different point of view.

Here are a few of the more-common mistakes that I’ve seen in the workplace:

1. Conclusions based on small sample sizes or selective sampling. Yes, we often have to make do with less data than we’d like, but that makes it especially important to put confidence intervals around our conclusions and stay open-minded about the possibility of a completely different version of reality. Also, a sample is supposed to represent the larger population, and we have to beware of sampling bias that excludes relevant members of the population and skews any findings based on that sample. Otherwise the findings are meaningful only for a subset of the population.

2. Unknown or uncontrolled measurement variability. We often assume that our measurement processes are completely trustworthy without considering the possible effects of variability due to equipment or people. If the variance of the measurement process exceeds the variance of the underlying processes that we’re trying to measure, we can’t possibly know what’s really going on.

3. Confusing independent vs. dependent events. There is no such thing as “the law of averages.” If you flip a coin 10 times and it comes up heads every time, the probability of a heads coming up on the 11th flip is still 50%. The results of those previous coin flips do not exert any influence whatsoever on future outcomes, assuming each coin flip is considered a single event. That being said, the event “eleven consecutive coin flips of heads” is an extremely unlikely event (about 1 chance in 2,048 for an honest coin). If you take a large enough sample, the sample statistics will approximate the population statistics (50% heads and 50% tails for an honest coin). This is the law of large numbers, which is sometimes loosely, and not quite accurately, described as “regression to the mean.”

4. Seeing a trend where none exists. This is usually the result of confirmation bias, where we start with a conclusion and look for data to support it, which sometimes leads to selection bias, where we exclude data that doesn’t fit the expected behavior. Often we’re so eager for signs of improvement that we accept as proof a single data point that’s in the right direction. This is why it’s important to apply hypothesis tests to determine whether the before and after samples represent statistically significant differences. It’s also why we should never fiddle with a process that varies randomly but operates within control limits.

5. Correlation does not imply causation. You may be able to draw a regression line through a scatter plot, but that doesn’t necessarily mean there’s a cause-and-effect relationship between the two variables. This is where we have to use engineering judgment or even common sense. Earlier this year the Atlanta Braves baseball team lost 16 consecutive games that were played on a Monday. No one has been able to explain how winning or losing a baseball game could possibly be caused by the day of the week. A related logical fallacy is post hoc, ergo propter hoc (after it, therefore because of it). Chronological sequence does not imply causation, either.
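
The coin-flip claims in item 3 are easy to check with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the coin is assumed to be honest with independent flips.

```python
import random


def prob_all_heads(n_flips, p_heads=0.5):
    """Probability that n independent flips all come up heads."""
    return p_heads ** n_flips


def heads_fraction(n_flips, rng):
    """Simulate n flips of an honest coin and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# Eleven heads in a row is rare as a whole sequence...
print(prob_all_heads(11))   # 0.5 ** 11, i.e. 1 chance in 2048
# ...but any single flip, including the 11th, is still 50/50.
print(prob_all_heads(1))

# The fraction of heads settles toward 50% as the number of flips grows.
rng = random.Random(2012)   # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    print(n, heads_fraction(n, rng))
```

The small samples can land well away from 50%, which is exactly why conclusions drawn from ten data points deserve wide error bars.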

Ramp Readiness Indicators for Product Development October 3, 2012

Posted by Tim Rodgers in Management & leadership, Product design, Quality.

Sometimes you have little or no control over the launch date for a new product because of customer commitments or competitive market pressure. You have to fill your product fulfillment pipeline to the channel, and it can be costly to delay the start of the production ramp until you’ve finished tweaking the design and everything is running perfectly at the factory. Regardless of whether the start-of-ramp is flexible or not, it’s important to assess ramp readiness and address any issues while there’s still time, before the stress of meeting delivery quotas and schedules.

A lot of people seem to think of ramp readiness as checking things off a to-do list, verifying that manufacturing processes are in-place and ready to build products. Are the shop MRP systems aligned with the bill-of-materials? Is the production line balanced to meet capacity requirements? Are all the work stations and process steps equipped with the necessary tools? Are work instructions published and approved, and operators trained? Are part suppliers delivering according to the appropriate JIT schedules?

Those are good and important things to know, but if you want to avoid quality problems and delivery interruptions after ramp, then you need to know much more.

1. Is the product design stable? Early prototype builds should provide feedback to improve the design, and in the ideal world, the design is complete and stable before the final prototype build. In the real world, that doesn’t mean the design can’t be changed, but the bar needs to be higher as the ramp date approaches, requiring more review and higher-level approval, or else there won’t be enough time to make the necessary changes in production. Indicators here include the trend in design change requests, the completeness of the BOM (including final, approved drawings showing all critical-to-function and critical-to-assembly dimensions and other requirements), and the percent of defects found during prototype builds that are attributed to design issues. The design is an input to the production process, and a design that’s still changing after ramp will surely cause trouble.

2. Are the factory processes capable and stable? Some people like to look at production yields and test results during prototype builds as a sign of readiness, and the defects found during these builds will certainly guide improvements in the design or the manufacturing processes. However, prototype build volumes are always small and provide limited insight about what will happen after the ramp. Process capability means the suppliers and the factory can produce products that are in-spec, but you also need to identify and manage the sources of variability that affect process stability. Are the critical dimensions of all parts in-spec and meeting Cpk requirements? Are critical production processes operating within SPC control limits? Has a gauge repeatability and reproducibility (GRR) analysis been completed for all tools, jigs, and fixtures? Are process documents under change control? Have all pre-ramp waivers been addressed, resolved, and closed?
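
As a reminder of what a Cpk requirement actually checks, here is a minimal Python sketch using the standard formula, Cpk = min(USL − mean, mean − LSL) / (3 × standard deviation). The measurements, spec limits, and the 1.33 threshold are all hypothetical; the real requirement comes from your own drawings and customer agreements.

```python
from statistics import mean, stdev


def cpk(measurements, lsl, usl):
    """Process capability index from a sample and its lower/upper spec limits."""
    mu = mean(measurements)
    sigma = stdev(measurements)  # sample standard deviation
    return min(usl - mu, mu - lsl) / (3 * sigma)


# Hypothetical critical dimension in mm, with spec limits 9.90 to 10.10.
sample = [10.02, 9.98, 10.01, 9.99, 10.00, 10.03, 9.97, 10.00]
REQUIRED_CPK = 1.33  # assumed threshold, for illustration only

value = cpk(sample, lsl=9.90, usl=10.10)
print(f"Cpk = {value:.2f}, meets requirement: {value >= REQUIRED_CPK}")
```

Eight parts is far too few to trust the estimate, which is the same small-volume caveat that applies to prototype builds in general.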

Ramp readiness requires more than checking items off a list. It requires attention to design, supply chain, and production process readiness to prevent costly quality hiccups when you can least afford it.

Why Keep Testing If You Never Find Defects? September 26, 2012

Posted by Tim Rodgers in Process engineering, Quality.

When I started working at the factory in China I inherited a quality system that included long checklists of visual inspections, almost all of which had been specified by our customer. I’ve never been a big fan of inspections, especially those that rely on subjective human judgment, but I’ll admit that in some cases they’re a quick (but costly) way of detecting defects.

Anyway, one set of inspections seemed especially puzzling. At the end of the production line the customer required an audit of a sample of the finished goods, including the accessories, localized printed materials, and the final packaging before everything was loaded on pallets for the shipping container. Boxed units were taken off the line and opened, and everything inside that had a barcode was removed from the box and hand-scanned to verify that the box contained everything it was supposed to contain.

What made this a head-scratcher was that the end of the production line was ten feet away, where the finished goods, the accessories, and the localized printed materials were each individually barcode-scanned and then put in the box. If the operator tried to put the wrong thing in the box, something that wasn’t on the approved bill-of-materials, it would be detected by the scanner. I wasn’t really surprised to discover that the end-of-line audit never found any missing or wrong parts, and we had a discussion with our customer about the value of the audit.

This brings me to my question: if you never find any defects, is the test or inspection routine still effective or useful? Can’t we get rid of it? Is this test any good?

Let me pause here for a moment and emphasize (again) that you don’t achieve quality by testing or inspections, especially at the end of the production line. Nevertheless, I think we can agree that an audit program, strategically placed in the process flow, can be a useful tool for verifying that customer requirements are being met.

When you’re considering eliminating or changing a test, I think you have to start by asking: What customer requirement does this test correspond to? What defect is this test supposed to find? If the test never fails and doesn’t find those defects, that either means there are no defects (or, at least no defects of that category), or the test isn’t capable of finding them, possibly because of bad test design or bad test execution; which is why the capability of the test should be investigated and verified. By the way, that’s how we discovered the problem with the barcode-scan audit described above: bad design and bad execution.

Assuming it’s a good test, correctly implemented and capable of finding defects that are linked to customer requirements, there are several options if you’re still not finding defects:

1. You can raise the quality standards and tighten the spec limits that define failures. That may give you more failures and an opportunity to eliminate a root cause or reduce variability somewhere in the process. Reducing variability is always a good thing, but you have to consider the cost to do so, and whether this is really a high-value opportunity.

2. You can reduce the audit frequency. Maybe you really have a design and a process that doesn’t generate defects, but maybe it would still be a good idea to check on it from time to time, wouldn’t it?

3. You can eliminate the test altogether. This is a risky move because you’re voluntarily giving up an opportunity to verify that some customer requirement is being met. Before eliminating the test I’d make sure there’s some other way to verify the customer requirement.
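
One way to put numbers on these options is to ask what a run of clean audits actually proves. If n randomly chosen units are inspected and none fail, the defect rate could still be as high as 1 − (1 − C)^(1/n) at confidence level C. The Python sketch below assumes independent random samples and a test that really can detect the defect; the function names are made up for illustration.

```python
from math import ceil, log


def max_defect_rate(n_inspected, confidence=0.95):
    """Upper bound on the true defect rate after n inspections with zero failures."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_inspected)


def samples_needed(max_rate, confidence=0.95):
    """Zero-failure sample size needed to claim the defect rate is below max_rate."""
    return ceil(log(1.0 - confidence) / log(1.0 - max_rate))


print(f"{max_defect_rate(300):.4f}")  # about 0.01: 300 clean units still allow ~1% defects
print(samples_needed(0.01))           # 299 clean units to claim under 1% at 95% confidence
```

This helps when deciding how far the audit frequency can be reduced: halving the sample roughly doubles the defect rate you can no longer rule out.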

Reducing the cost-of-poor-quality by shifting from appraisal activities to prevention activities is certainly a worthy goal, but we shouldn’t be too quick to stop testing just because we aren’t finding any defects.

The True Root Cause of Field Failures September 16, 2012

Posted by Tim Rodgers in Management & leadership, Organizational dynamics, Product design, Quality.

Field failures are always unfortunate, sometimes costly, and they will always happen. Preventing field failures should certainly be the goal of all product development, manufacturing, and quality assurance organizations. However, given that failures will occur, I think the more significant issue is whether or not the organization can learn and make the necessary changes to eliminate the root cause.

Unfortunately many teams perform only a superficial investigation of root cause and address only the immediate, proximate cause of the failure. This might be a batch of defective parts, or an incomplete work instruction, or a badly-specified assembly drawing. The corrective action in these cases may be targeted to ensure that this particular field failure won’t happen again, but the question that should be asked is: “How do we know that a similar problem won’t happen?”

You have to dig deeper to understand if the failure was a low-probability, outlier event, or if there’s a systemic issue in the design, or production, or testing, or management, or other processes that will surely cause another, potentially more-serious problem. This is harder because it requires the organization to look in the mirror. Everyone knows about the Five Whys as a tool to determine root cause, but I don’t think the majority of organizations have the political will and courage to ask why those parts or work instructions or drawings were bad in the first place.

Was there a missing process, or was the process not followed? Failure to understand responsibilities or dependencies or expectations? Conflicting business priorities? Lack of consistent management support? Shortcuts in the name of expediency or productivity or cost reduction? These are the kinds of root causes that will continue to result in field failures despite the most-dedicated “whack-a-mole” efforts.

Maybe that’s OK. Maybe the business’s leadership is willing to take a chance on quality instead of doing what’s necessary to create a culture that values quality. To be fair, every business has to decide what level of quality is good enough, and what they’re willing to do to achieve those goals. If that’s the case, then they shouldn’t be surprised when field failures occur. “The fault, dear Brutus, is not in our stars, but in ourselves.”

Design-Limited Quality September 3, 2012

Posted by Tim Rodgers in Product design, Quality, Supply chain.

When I joined Hewlett-Packard in 1988 I was assigned to a team that was working on a design for manufacturability manual for printed circuit board designers. Our primary objective was to provide performance and cost information that could be used to guide decisions about different design options. My favorite project during that time was a predictive model to estimate the manufacturing yield of a PCB design based on a composite “complexity” metric. Because we were an internal supplier, I was able to look at the actual lot yields for hundreds of active part numbers with known design parameters, so it seemed like a fairly straightforward exercise to experiment with different regression models to find an optimum fit between complexity and yield.

This turned out to be a lot more complicated than I expected, mainly because manufacturing yields are not normally distributed. The simple arithmetic mean of a bunch of individual lot yields is pretty much meaningless, and the time interval between lots meant that the process itself wasn’t the same each time. (For those who care about the technical details, when you plot individual lot yields for a large population of lots it seems to fit a Weibull distribution from reliability engineering, which isn’t really surprising if you think about it.)

What I was really looking for was the theoretical maximum yield enabled by a given set of design parameters, but what I had were the actual lot yields, each of which was influenced by the inherent variability of parts and manufacturing processes, including workmanship. For the purposes of the DFM manual it wasn’t necessary to predict the actual yield; it was enough to provide a model to compare the theoretical yield for two or more design options. I was in for a lot of data crunching, but in the end we got a useful model. More than ten years later I still saw copies of that DFM manual on the desks of PCB designers.

The point is that product quality depends both on the design and all the steps required to create a product from that design. Eliminating special causes and reducing the variability of parts and processes can help approach the theoretical maximum yield, but the design establishes an upper limit of quality that cannot be exceeded in the real world without improving the design itself. This is another reason why it’s so important to stabilize and ultimately freeze the design so the manufacturing processes can stabilize before the ramp to full production.

By the way, I don’t think any practical, cost-effective design can guarantee 100% yield at the factory, or have zero field failures. A single lot may have 100% yield, but that’s just a sample of a larger population of all possible combinations of part and process variation, assuming the same process each time. If you could do a Monte Carlo simulation that accounts for all sources of manufacturing variability (assuming only common causes), and run the simulation to a very large number of trials, you could come up with a pretty good estimate of the maximum yield. But, even if all the processes were running at six sigma, you would still have some small percentage of non-conformance.
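
Here is a toy version of that kind of Monte Carlo simulation in Python. It is a sketch under strong assumptions: every source of variation is an independent, normally distributed characteristic with common-cause variation only, and a unit is good only if every characteristic lands inside its spec limits. The three identical process steps are hypothetical.

```python
import random


def simulate_yield(steps, trials, seed=0):
    """Estimate yield as the fraction of simulated units with every
    characteristic inside its spec limits.

    `steps` is a list of (mean, sigma, lsl, usl) tuples, one per characteristic.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    good = 0
    for _ in range(trials):
        if all(lsl <= rng.gauss(mu, sigma) <= usl for mu, sigma, lsl, usl in steps):
            good += 1
    return good / trials


# Hypothetical design: three centered characteristics with spec limits at +/- 3 sigma.
three_sigma_steps = [(0.0, 1.0, -3.0, 3.0)] * 3

print(f"estimated maximum yield: {simulate_yield(three_sigma_steps, 100_000):.4f}")
```

Each characteristic alone is in-spec about 99.73% of the time, so three of them together top out near 99.2%. Adding more characteristics to the list pushes that ceiling lower without any change to the processes themselves.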

Measuring Test Effectiveness: Three Views August 20, 2012

Posted by Tim Rodgers in Product design, Quality.

I’ve managed testing organizations supporting both hardware and software development teams, and I often had to explain to folks that it’s not possible to “test-in quality.” Quality is determined by the design and the realization of that design, whether through software coding or hardware manufacturing. Testing can measure how closely that realization matches the design intent and customer requirements, but testing can’t improve the quality of a poorly-conceived design (and neither can a flawless execution of that design).

So, how can you measure the effectiveness of a test program? Here are three ways that make sense to me:

1. Testing should verify all design requirements and all possible failure modes. This means there should be at least one test case associated with each functional, cosmetic, reliability, regulatory, or other requirement. Also, each failure mode predicted from an FMEA review should have a corresponding test to determine the design margin.

2. Testing should be designed to eliminate escapes, or at least make their occurrence a statistical improbability that is acceptable to the business. An escape is any defect found by the end-user. It may not be economically feasible to achieve zero defects, but any reported escape is an opportunity to improve test coverage. Is there an existing test case corresponding to this defect? Was the test performed and reported correctly? Does this test have to be run on a larger sample size to improve the confidence level?

3. Testing should report defects that get fixed. Testing is buying information, and if that information has no value to the organization, then it’s not a good use of resources. When I managed software quality I looked at the “signal-to-noise ratio,” or the percentage of all defects reported that were fixed by the development team. Defects that are not fixed are either potential escapes that should be discussed as a business risk, or they’re “noise” that wastes money and distracts the development team.
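
The “signal-to-noise ratio” in item 3 is a simple percentage, sketched below in Python. The resolution labels and the sample data are hypothetical; a real defect tracker will have its own status names.

```python
from collections import Counter


def fix_rate(resolutions):
    """Fraction of reported defects that the development team fixed."""
    counts = Counter(resolutions)
    total = sum(counts.values())
    return counts["fixed"] / total if total else 0.0


# Hypothetical resolutions for ten reported defects.
reported = [
    "fixed", "fixed", "deferred", "fixed", "not a defect",
    "fixed", "duplicate", "fixed", "deferred", "fixed",
]

print(f"signal-to-noise: {fix_rate(reported):.0%}")
print(Counter(r for r in reported if r != "fixed"))  # what the "noise" is made of
```

Breaking out the unfixed defects by reason is the useful part: deferred defects belong in the business-risk discussion, while duplicates and non-defects point to problems in how tests are written or reported.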

It may not be possible to test-in quality, but poorly-designed testing will surely frustrate your efforts to achieve the required level of quality.

The Zero Defects Quest July 1, 2012

Posted by Tim Rodgers in Management & leadership, Product design, Quality.

A few years ago I had a conversation with some of the engineers in a quality team I managed. The engineers were struggling to understand the business’s stated goal of “zero defects,” puzzled over what it would take to get there and worried about the consequences of falling short. They wanted the goal changed to something less demanding that could be realistically achieved, or some acknowledgement by senior management that “zero defects” was nothing more than a slogan to inspire employees, customers, and shareholders.

I was sympathetic. I understood that it was probably impossible to achieve zero defects given the complexity of the design and the influence of many inputs and processes that were difficult to control. I believe in the concept of good-enough quality, which weighs the level of quality and reliability of the organization’s output against the cost required to achieve it. Beyond the level of quality required or expected by the customer, I believe there’s an asymptotic relationship between defects and cost, and working to further reduce the number of defects will require ever-increasing expenses that are unlikely to be recovered through higher prices or larger market share.

I was, however, concerned about the slippery slope and what would happen if we became accustomed to a lower level of quality as an acceptable deliverable. I wanted to create a culture where people were dissatisfied with any defect, regardless of whether it was economically feasible to address all defects. I wanted to change the goal from “zero defects” to “all defects analyzed, root cause determined, prevention strategies proposed, and resolution (or not) openly agreed to.” It wasn’t an inspiring slogan that fit on a poster, but when people realized that every defect would require discussion as an opportunity for improved quality, quality started to improve without a significant increase in cost. I don’t think we would have gotten there with an unrealistic goal that was cynically ignored.
