THE COMPETITIVE SEMICONDUCTOR MANUFACTURING HUMAN
RESOURCES PROJECT:
Second Interim Report
CSM32
Clair Brown, Editor
14. Statistical Tools for Industry Data
Linda Sattler
Appendix
Multidimensional Scaling Distances: A Modification of the Simple
Matching Coefficient Technique
Let two objects, X and Y, have a number, p, of True/False
variables associated with them, denoted by (x_{1}, ..., x_{p})
and (y_{1}, ..., y_{p}), where x_{i} = 1 or 0 depending on whether
the ith variable is True or False. Set

    a = \sum_{i=1}^{p} x_{i} y_{i},    d = \sum_{i=1}^{p} (1 - x_{i})(1 - y_{i}),

so that a + d is the number of variables both objects have
scored identically. Sokal and Michener’s measure of similarity
between X and Y is given by

    S(X, Y) = \frac{a + d}{p}.

This method is modified slightly to account for missing data
by considering only the variables for which both fabs X and Y have
data (missing variables are not used). To accommodate the High/Medium/Low
scores, the similarity measure used is simply the fraction of the
variables considered (those that are not missing for either fab) on
which fabs X and Y have identical scores. The distance measure between
fabs X and Y is then

    D(X, Y) = 1 - \frac{m_{XY}}{n_{XY}},

where n_{XY} is the number of variables for which both fabs have data
and m_{XY} is the number of those variables scored identically.
These distance measures are constructed without quantifying the
data. As well, variables that have responses from only a few companies
can still be used for those companies’ distance metrics. In
other words, all available data are actually used.
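
As a minimal sketch of this distance, assuming scores are stored as short strings and a missing value is stored as None (both are illustrative choices, as is the function name matching_distance), the calculation might look like this:

```python
from typing import Optional, Sequence

# Assumed encoding: "T"/"F" or "H"/"M"/"L" for scores, None for a missing value.
Score = Optional[str]


def matching_distance(x: Sequence[Score], y: Sequence[Score]) -> Optional[float]:
    """Return 1 minus the fraction of jointly observed variables scored identically."""
    if len(x) != len(y):
        raise ValueError("both fabs must be scored on the same variables")
    # Keep only the variables for which both fabs have data.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return None  # no shared data, so the distance is undefined
    matches = sum(1 for a, b in pairs if a == b)
    return 1.0 - matches / len(pairs)


fab_x = ["T", "F", "H", None, "L"]
fab_y = ["T", "T", "H", "M", None]
# Three variables are jointly observed and two of them match.
print(matching_distance(fab_x, fab_y))
```

No numeric coding of the scores is needed here, since only equality of scores is tested.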
Principal Components Analysis: Adjustments and Data Coding
In order to use principal components analysis, however, the data
need to be altered. The True/False scores were coded 1/0 and the
High/Medium/Low scores were coded 1.0/0.5/0.0. Adjustments for missing
data are needed. The first attempt, substituting the mean of a variable
for the missing variable, produced bad results. Although substituting
the mean for the missing values does not alter the principal components
themselves in any way (they would have the same coefficients if
the fabs with missing data were not present), fabs with a lot of
missing data were always shown "close" together. Instead
of finding organizational similarities, "missing value"
similarities were found. The next attempt was more fruitful. Instead
of replacing a missing value with the mean of the variable, it
was replaced with the fab’s value for the variable with the
largest correlation. For example, if fab X is missing a value, x_{i},
in variable i and variable i has the largest correlation with variable
j, let x_{i} = x_{j}. If x_{j} is also missing, the variable with the next
largest correlation is used, and so on. This procedure is used by Breiman
et al. [0] with Classification and Regression Trees, and they show
it to work well even with a large amount of missing data. After
these data adjustments, principal components analysis may be used.
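
The coding and the correlation-based substitution can be sketched as follows. This is an illustration under stated assumptions, not the code used for the report: the names CODES, pearson and impute are hypothetical, correlations are computed over the fabs that have both variables, and "largest correlation" is read as the largest signed correlation.

```python
from math import sqrt
from typing import List, Optional

Row = List[Optional[float]]

# Coding from the text: True/False -> 1/0, High/Medium/Low -> 1.0/0.5/0.0.
CODES = {"T": 1.0, "F": 0.0, "H": 1.0, "M": 0.5, "L": 0.0}


def pearson(u: Row, v: Row) -> float:
    """Correlation over fabs where both variables are observed (0.0 if undefined)."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mu = sum(a for a, _ in pairs) / len(pairs)
    mv = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mu) * (b - mv) for a, b in pairs)
    su = sqrt(sum((a - mu) ** 2 for a, _ in pairs))
    sv = sqrt(sum((b - mv) ** 2 for _, b in pairs))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def impute(data: List[Row]) -> List[Row]:
    """Fill each missing x_i with the fab's value for the best-correlated variable."""
    cols = [list(c) for c in zip(*data)]
    p = len(cols)
    out = [row[:] for row in data]
    for i in range(p):
        # Rank the other variables by correlation with variable i, largest first.
        ranked = sorted((j for j in range(p) if j != i),
                        key=lambda j: pearson(cols[i], cols[j]), reverse=True)
        for r, row in enumerate(data):
            if row[i] is None:
                # Use the first ranked variable this fab actually has a value for.
                out[r][i] = next((row[j] for j in ranked if row[j] is not None), None)
    return out


raw = [["T", "H", "T"], ["F", "L", "T"], ["T", None, "F"], ["F", "L", "F"]]
coded = [[CODES[v] if v is not None else None for v in row] for row in raw]
print(impute(coded)[2])  # the third fab's missing value is filled from variable 0
```

Substituted values are always read from the original data, so an imputed value is never reused to fill another gap.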
Robustness of Graphs: Simulation Technique Used
1. i = 0;
2. Choose a random set of factors. Call this set F_{i};
3. Find the distance measures for the set F_{i}, call this set of measures
D_{i};
4. Find the multidimensional scaling graph, G_{i}, for the set D_{i};
5. Let i = i + 1;
6. If i < 100, repeat steps 2 through 5.
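
Steps 1 through 6 can be sketched as a loop. The sketch below is illustrative only: the size of each random set of factors (subset_size) is not stated in the text and is an assumption, as are the function names and the seed. The sketch also accumulates the mean of each pairwise distance over the runs. The MDS embedding in step 4 is left as a comment, since it would normally come from a statistics library.

```python
import random
from typing import List, Optional, Sequence

Score = Optional[str]  # None marks a missing value (assumed encoding)


def distance(x: Sequence[Score], y: Sequence[Score],
             factors: Sequence[int]) -> Optional[float]:
    """Matching distance restricted to the chosen factors; None if no shared data."""
    shared = [k for k in factors if x[k] is not None and y[k] is not None]
    if not shared:
        return None
    same = sum(1 for k in shared if x[k] == y[k])
    return 1.0 - same / len(shared)


def average_distances(fabs: List[Sequence[Score]], subset_size: int,
                      runs: int = 100, seed: int = 0) -> List[List[Optional[float]]]:
    """Mean pairwise distance over `runs` random sets of factors."""
    rng = random.Random(seed)
    n, p = len(fabs), len(fabs[0])
    totals = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):  # steps 1, 5 and 6: repeat 100 times by default
        factors = rng.sample(range(p), subset_size)  # step 2
        for a in range(n):
            for b in range(n):
                d = distance(fabs[a], fabs[b], factors)  # step 3
                if d is not None:
                    totals[a][b] += d
                    counts[a][b] += 1
        # Step 4: the MDS graph for this run's distances would be computed here.
    return [[totals[a][b] / counts[a][b] if counts[a][b] else None
             for b in range(n)] for a in range(n)]


fabs = [["T", "F", "H", "L"], ["T", "F", "H", "L"], ["F", "T", "L", "H"]]
for row in average_distances(fabs, subset_size=2):
    print(row)
```

Pairs with no jointly observed factors in a given run are skipped for that run rather than counted as zero.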
To find the "average" multidimensional scaling graph,
the mean of the distance measures, D_{i}, over i is taken. Call this
\bar{D}. The multidimensional scaling graph for the set \bar{D},
denoted \bar{G}, is taken as the "average" graph.
It is somewhat difficult to get an idea of how the fabs on this graph may vary.
Two sets of distance measures, such as the measures D_{i} from simulation i
and the mean measures \bar{D}, may be very much alike, but their MDS
graphs, G_{i} and \bar{G}, may be mirror images of each other (or flipped
or rotated as well; see Figure 14-13). Thus, the same fabs can appear in far
different positions on G_{i} and \bar{G}. To get bounds, a heuristic approach
is taken. The Euclidean distances of each fab's position on G_{i} from its
position on \bar{G} are taken along the X axis, e^{x}_{ij}, and along
the Y axis, e^{y}_{ij} (where i is the simulation number and j is the fab number).
Some of these distances will be extremely large (as in the case
where G_{i} is a mirror image of \bar{G}) and others should be fairly small
(when G_{i} has the fabs in the same general position as \bar{G}). In this
analysis the smallest 25% of the distances come from the graphs that have
the fabs in the same general position as \bar{G} (this should be checked
individually for each data set). Rectangles are then formed
around the fab positions on \bar{G}, using the 25% quantile distances to
form the box (see Figure 14-14).
Figure 14-13. Similar distances, but very different positions on the MDS graph
Figure 14-14. Creating the error rectangles around the MDS fabs
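
The rectangle heuristic can be sketched as follows, assuming the average graph and each simulated graph are available as lists of (x, y) positions, one per fab. The function name error_rectangles, the data layout, and the reading of the 25% quantile as the half-width and half-height of each box are assumptions made for illustration.

```python
from statistics import quantiles
from typing import List, Tuple

Point = Tuple[float, float]


def error_rectangles(average: List[Point],
                     simulated: List[List[Point]]) -> List[Tuple[float, float, float, float]]:
    """Return (x_min, x_max, y_min, y_max) for each fab on the average graph."""
    boxes = []
    for j, (ax, ay) in enumerate(average):
        # Distance of fab j in each simulated graph from its average position,
        # taken separately along the X axis and the Y axis.
        dx = [abs(graph[j][0] - ax) for graph in simulated]
        dy = [abs(graph[j][1] - ay) for graph in simulated]
        # First quartile: large distances from mirrored or rotated graphs fall above it.
        qx = quantiles(dx, n=4, method="inclusive")[0]
        qy = quantiles(dy, n=4, method="inclusive")[0]
        boxes.append((ax - qx, ax + qx, ay - qy, ay + qy))
    return boxes


average_graph = [(1.0, 0.5)]
simulated_graphs = [[(1.1, 0.5)], [(0.9, 0.6)], [(1.2, 0.4)], [(-1.0, 0.5)], [(1.0, 0.5)]]
print(error_rectangles(average_graph, simulated_graphs))
```

In the example, the fourth simulated graph is a mirror image, so its large X distance lies well above the first quartile and does not widen the box.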
Percent Alike Graphs: Creating Similarity Scores
The similarity measures are similar to those used in Multidimensional
Scaling, but modified so that there is a similarity measure of fab
i in group A with ALL of the fabs in group B, instead of similarity
with just a single fab. This is done for each fab, i, in group A
as follows:
1. Let c_{ij} = the number of factors fabs i and j have common scores for
in organizational area X, where i \in A and j \in B;
2. Let n_{ij} = the number of factors fabs i and j both have that are not
missing in organizational area X, where i \in A and j \in B;
3. C_{i} = \sum_{j \in B} c_{ij} in organizational area X, for all fabs j \in B;
4. N_{i} = \sum_{j \in B} n_{ij} in organizational area X, for all fabs j \in B;
5. S_{i}^{AB} = C_{i} / N_{i}.
The factors are summed first before finding a percentage because
the number of missing factors in any particular area may be large
enough to create extreme percentages (such as 0% or 100%) if n_{ij} is small. To
get a similarity measure of fab i in group A with all of the fabs
(excepting i) in group A, the same technique is used for each fab i \in A and
j \in A, j \ne i. These measures are called S_{i}^{AA}.
These measures need to be graphed so one can easily see if any differences
exist. A bar graph was created with the following significant measures.
1. min(S_{i}^{AB}) and min(S_{i}^{AA}) over all i \in A;
2. median(S_{i}^{AB}) and median(S_{i}^{AA}) over all i \in A;
3. max(S_{i}^{AB}) and max(S_{i}^{AA}) over all i \in A.
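
A minimal sketch of these scores follows, assuming each fab is a list of scores with None for a missing value and that an organizational area is given as a list of factor positions. The function name percent_alike and the sample data are hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence

Score = Optional[str]  # None marks a missing value (assumed encoding)


def percent_alike(fab: Sequence[Score], others: List[Sequence[Score]],
                  area: Sequence[int]) -> Optional[float]:
    """Similarity of one fab with all fabs in `others` over the factors in `area`."""
    common = 0  # identical scores, summed over all fabs in `others`
    usable = 0  # jointly non-missing factors, summed over all fabs in `others`
    for other in others:
        for k in area:
            if fab[k] is not None and other[k] is not None:
                usable += 1
                if fab[k] == other[k]:
                    common += 1
    # Counts are summed before dividing, so one sparse pair cannot dominate.
    return common / usable if usable else None


group_a = [["H", "T", None], ["H", "F", "L"]]
group_b = [["H", "T", "L"], ["M", "T", None]]
area = [0, 1, 2]

between = [percent_alike(f, group_b, area) for f in group_a]
within = [percent_alike(f, group_a[:i] + group_a[i + 1:], area)
          for i, f in enumerate(group_a)]

# Values for the bar graph (this assumes no score came back as None).
for name, fn in (("min", min), ("median", median), ("max", max)):
    print(name, fn(between), fn(within))
```

The within-group list leaves each fab out of its own comparison set, matching the "excepting i" condition in the text.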
Common Value Analysis: Technique Description
Suppose one wishes to find the individual factors of Group A that
differ significantly from Group B. The crudity of the scoring is
important in this analysis, because if a variable has too many values
it is difficult to find a representative value for a group. Common
Value Analysis works as follows:
1. Find the most common value for factor i in group A, v_{i}^{A}. If more
than half the group has missing values, the "common value"
is missing as well.
2. Find the most common value for factor i in group B, v_{i}^{B}. If more
than half the group has missing values, the "common value"
is missing as well.
3. If v_{i}^{A} \ne v_{i}^{B} and neither "common value" is missing, label this
factor a possible "interesting" variable.
The benefit of this analysis is to reduce the number of variables
to a reasonable number that can be checked one by one.
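
The three steps can be sketched as follows. This is an illustration under stated assumptions: missing values are stored as None, the function names are hypothetical, and ties for the most common value (which the text does not address) are broken in favor of the value seen first.

```python
from collections import Counter
from typing import List, Optional, Sequence

Score = Optional[str]  # None marks a missing value (assumed encoding)


def common_value(scores: Sequence[Score]) -> Score:
    """Most common observed score, or None if more than half the group is missing."""
    observed = [s for s in scores if s is not None]
    missing = len(scores) - len(observed)
    if not observed or missing > len(scores) / 2:
        return None
    # Ties go to the value encountered first (an assumption, not from the text).
    return Counter(observed).most_common(1)[0][0]


def interesting_factors(group_a: List[Sequence[Score]],
                        group_b: List[Sequence[Score]]) -> List[int]:
    """Positions of factors whose common values differ between the two groups."""
    flagged = []
    for i in range(len(group_a[0])):
        value_a = common_value([fab[i] for fab in group_a])  # step 1
        value_b = common_value([fab[i] for fab in group_b])  # step 2
        if value_a is not None and value_b is not None and value_a != value_b:
            flagged.append(i)  # step 3
    return flagged


group_a = [["H", "T", None], ["H", "F", None], ["M", "T", "L"]]
group_b = [["L", "T", "L"], ["L", "T", "H"], ["M", "T", "L"]]
print(interesting_factors(group_a, group_b))
```

Only the first factor is flagged in this example: the second has the same common value in both groups, and the third is missing for more than half of group A.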
End of Chapter 14
