

Second Interim Report
Clair Brown, Editor

14. Statistical Tools for Industry Data
Linda Sattler


Multidimensional Scaling Distances - A Modification of the Simple Matching Coefficient Technique
Let two objects, X and Y, each have p True/False variables associated with them, denoted by $(x_1, \ldots, x_p)$ and $(y_1, \ldots, y_p)$, where $x_i = 1$ or $0$ depending on whether the ith variable is True or False. Set

$$a = \sum_{i=1}^{p} x_i y_i, \qquad d = \sum_{i=1}^{p} (1 - x_i)(1 - y_i),$$

where a + d is the number of variables both objects have identically scored. Sokal and Michener’s measure of similarity between X and Y is given by

$$s(X, Y) = \frac{a + d}{p}.$$
This method is modified slightly to account for missing data by considering only the variables for which both fabs X and Y have data (missing variables are not used). To accommodate the High/Medium/Low scores, the similarity measure used is simply the fraction of variables on which fabs X and Y have identical scores, out of the variables considered (those that are not missing). The distance measure between fabs X and Y is then

$$\mathrm{dist}(X, Y) = 1 - \frac{\text{number of variables scored identically by X and Y}}{\text{number of variables with data for both X and Y}}.$$
These distance measures are constructed without quantifying the data. As well, variables that have responses from only a few companies can still be used for those companies’ distance metrics. In other words, all available data are actually used.
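As an illustration, the distance between two fabs can be sketched in a few lines of Python. This is a minimal sketch, not the report's implementation: it assumes missing responses are stored as `None`, it takes the distance to be one minus the matching fraction, and the function name `matching_distance` and the sample scores are hypothetical.

```python
def matching_distance(x, y):
    """Distance between two fabs: 1 minus the fraction of jointly
    observed variables that are scored identically.

    x, y: equal-length sequences of scores (True/False, "High"/"Medium"/"Low", ...)
          with None marking a missing response (an assumption of this sketch).
    Returns None when the two fabs share no observed variables.
    """
    # Keep only the variables for which both fabs have data.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return None
    identical = sum(1 for a, b in pairs if a == b)
    return 1.0 - identical / len(pairs)


# Hypothetical scores for two fabs; None marks a missing response.
fab_x = [True, False, "High", None, "Low"]
fab_y = [True, True, "High", "Medium", None]
dist_xy = matching_distance(fab_x, fab_y)
print(dist_xy)
```

Only the first three variables are observed for both fabs in this example, so the distance is computed over those three alone.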
Principal Components Analysis - Adjustments and Data Coding
To use principal components analysis, however, the data must be coded numerically. The True/False scores were coded 1/0 and the High/Medium/Low scores were coded 1.0/0.5/0.0. Adjustments for missing data are also needed. The first attempt, substituting the mean of a variable for each of its missing values, produced poor results. Although substituting the mean for the missing values does not alter the principal components themselves in any way (they would have the same coefficients if the fabs with missing data were not present), fabs with a lot of missing data were always shown "close" together. Instead of finding organizational similarities, "missing value" similarities were found.
The next attempt was more fruitful. Instead of replacing a missing value with the mean of the variable, it was replaced with the fab’s value for the variable with which the missing variable has the largest correlation. For example, if fab X is missing a value, $x_i$, in variable i, and variable i has its largest correlation with variable j, let $x_i = x_j$. If $x_j$ is also missing, the variable with the next largest correlation is used, continuing in this manner. This procedure is used by Breiman et al. [0] with Classification and Regression Trees, and they show it to work well even with a large amount of missing data. After these data adjustments, principal components analysis may be used.
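The substitution rule can be sketched as follows. This is a rough illustration under stated assumptions rather than the procedure's actual code: missing values are stored as `None`, correlations are computed over the fabs that have both variables present, "largest correlation" is read as the largest signed Pearson correlation, and the names `pearson` and `impute_by_correlation` are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation over entries where both values are present.
    Returns None if it cannot be computed."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mu = sum(a for a, _ in pairs) / n
    mv = sum(b for _, b in pairs) / n
    suv = sum((a - mu) * (b - mv) for a, b in pairs)
    suu = sum((a - mu) ** 2 for a, _ in pairs)
    svv = sum((b - mv) ** 2 for _, b in pairs)
    if suu == 0 or svv == 0:
        return None
    return suv / sqrt(suu * svv)


def impute_by_correlation(data):
    """data: list of rows (one per fab) of coded scores, None = missing.
    Returns a copy in which each missing value is replaced by the same
    fab's value for the most highly correlated variable that is present."""
    n_vars = len(data[0])
    cols = [[row[k] for row in data] for k in range(n_vars)]
    filled = [list(row) for row in data]
    for i in range(n_vars):
        # Rank the other variables by correlation with variable i, largest first.
        ranked = []
        for j in range(n_vars):
            if j != i:
                r = pearson(cols[i], cols[j])
                if r is not None:
                    ranked.append((r, j))
        ranked.sort(reverse=True)
        for row_idx, row in enumerate(data):
            if row[i] is None:
                # Fall through to the next most correlated variable if needed.
                for _, j in ranked:
                    if row[j] is not None:
                        filled[row_idx][i] = row[j]
                        break
    return filled


# Hypothetical coded data: rows are fabs, columns are variables.
data = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.5],
    [None, 0.0, 1.0],
    [0.5, None, 0.5],
]
filled = impute_by_correlation(data)
```

Substitutes are always drawn from the original data, never from previously filled values, and a value stays missing if no correlated variable is available for that fab.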

Robustness of Graphs - Simulation Technique Used
1. Set i = 0;
2. Choose a random set of factors. Call this set $F_i$;
3. Find the distance measures for the set $F_i$; call this set of measures $D_i$;
4. Find the multidimensional scaling graph, $G_i$, for the set $D_i$;
5. Let i = i + 1;
6. If i < 100, repeat steps 2 through 5.
To find the "average" multidimensional scaling graph, the mean of the distance measures, $D_i$, over i is taken. Call this $\bar{D}$. The multidimensional scaling graph for the set $\bar{D}$, denoted $\bar{G}$, is taken as the "average" graph.
Gauging how much the fabs on this graph may vary is somewhat difficult. Two distance measures, $D_i$ and $D_k$, may be very much alike, but their MDS graphs may be mirror images of each other (flipped or rotated as well; see Figure 14-13). Thus, the same fabs can appear in very different positions on $G_i$ and $G_k$. To get bounds, a heuristic approach is taken. The Euclidean distances of $G_i$ from $\bar{G}$ are taken along the X axis, $dx_{ij}$, and along the Y axis, $dy_{ij}$ (where i is the simulation number and j is the fab number). Some of these distances will be extremely large (as in the case where $G_i$ is a mirror image of $\bar{G}$) and others should be fairly small (when $G_i$ has fabs in the same general position as $\bar{G}$). In this analysis the first 25% quantile of distances comes from the graphs that have the fabs in the same general position as $\bar{G}$ (this should be checked individually depending on the data set). Rectangles are then formed around the fab positions on $\bar{G}$ using the 25% quantile distances to form the box (see Figure 14-14).
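The averaging of distance measures over the simulation runs might look like the sketch below. Several details are assumptions, since the text does not specify them: the size of each random factor set (`subset_size`), sampling without replacement, the fixed `seed`, and the `None` coding for missing data. The multidimensional scaling step itself is left to an external MDS routine and is not shown.

```python
import random


def matching_distance(x, y):
    """1 minus the fraction of jointly observed variables scored identically."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return None
    return 1.0 - sum(1 for a, b in pairs if a == b) / len(pairs)


def distance_matrix(fabs, factors):
    """Pairwise distances between fabs using only the chosen factors."""
    sub = [[fab[k] for k in factors] for fab in fabs]
    return [[matching_distance(r, c) for c in sub] for r in sub]


def mean_distance_matrix(fabs, n_runs=100, subset_size=None, seed=0):
    """Average the distance matrices from n_runs random factor sets.
    Entries undefined in a run (no shared data) are skipped for that run."""
    rng = random.Random(seed)
    n, p = len(fabs), len(fabs[0])
    size = subset_size or max(1, p // 2)  # assumed default: half the factors
    total = [[0.0] * n for _ in range(n)]
    count = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        factors = rng.sample(range(p), size)  # one random set of factors
        d = distance_matrix(fabs, factors)
        for r in range(n):
            for c in range(n):
                if d[r][c] is not None:
                    total[r][c] += d[r][c]
                    count[r][c] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else None for c in range(n)]
        for r in range(n)
    ]


# Hypothetical True/False scores (1/0) for three fabs; None = missing.
fabs = [
    [1, 0, 1, 1, None, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, None, 1, 1, 1],
]
mean_d = mean_distance_matrix(fabs, n_runs=100, subset_size=3, seed=1)
```

The resulting matrix `mean_d` is what would be passed to the MDS routine to produce the "average" graph.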

Figure 14-13. Similar Distances, but very different positions on the MDS graph

Figure 14-14. Creating the Error Rectangles around the MDS Fabs
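The error rectangles can be sketched as below, given the fab coordinates on the "average" graph and on each simulated graph. This is a hedged illustration: it assumes the 25% quantile distance is used as the half-width of the box along each axis, which the text does not state explicitly, and it uses the "inclusive" quantile definition from Python's `statistics` module. The function name and the coordinates are hypothetical.

```python
from statistics import quantiles


def error_rectangles(avg_pos, run_pos):
    """avg_pos: list of (x, y) per fab on the average graph.
    run_pos: list of runs, each a list of (x, y) per fab.
    Returns one (x_min, x_max, y_min, y_max) rectangle per fab."""
    rects = []
    for j, (xbar, ybar) in enumerate(avg_pos):
        # Per-axis distances of fab j from its average position, over all runs.
        dx = [abs(run[j][0] - xbar) for run in run_pos]
        dy = [abs(run[j][1] - ybar) for run in run_pos]
        # First quartile: large distances from mirrored graphs fall above it.
        qx = quantiles(dx, n=4, method="inclusive")[0]
        qy = quantiles(dy, n=4, method="inclusive")[0]
        rects.append((xbar - qx, xbar + qx, ybar - qy, ybar + qy))
    return rects


# Hypothetical positions for two fabs; the last run is a mirrored graph.
avg_pos = [(0.0, 0.0), (1.0, 1.0)]
run_pos = [
    [(0.1, 0.0), (1.0, 1.2)],
    [(-0.2, 0.1), (0.9, 1.0)],
    [(0.3, -0.1), (1.1, 0.9)],
    [(0.0, 0.2), (1.2, 1.1)],
    [(-5.0, 0.0), (-1.0, 1.0)],
]
rects = error_rectangles(avg_pos, run_pos)
```

Because only the lower quartile is used, the mirrored run in the example has no influence on the box sizes, which is the point of the heuristic.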

Percent Alike Graphs - Creating Similarity Scores
The similarity measures are similar to those used in Multidimensional Scaling, but modified so that there is a similarity measure of fab i in group A with ALL of the fabs in group B, instead of similarity with just a single fab. This is done for each fab, i, in group A as follows:
1. Let $c_{ij}$ = the number of factors for which fabs i and j have common scores in organizational area X, where $j \in B$;
2. Let $n_{ij}$ = the number of factors for which neither fab i nor fab j has a missing score in organizational area X, where $j \in B$;
3. $C_i = \sum_{j} c_{ij}$ and $N_i = \sum_{j} n_{ij}$ in organizational area X, for all fabs $j \in B$;
4. $S_i^{AB} = C_i / N_i$ in organizational area X, for all fabs $i \in A$;
The factors are summed first before finding a percentage because the number of missing factors in any particular area may be large enough to create extreme percentages (such as 0% or 100%) if $n_{ij}$ is small. To get a similarity measure of fab i in group A with all of the fabs (excepting i) in group A, the same technique is used for each fab $j \in A$ with $j \neq i$. These measures are called $S_i^{AA}$.
These measures need to be graphed so one can easily see if any differences exist. A bar graph was created with the following summary measures:
1. min($S_i^{AB}$) and min($S_i^{AA}$) over all $i \in A$;
2. median($S_i^{AB}$) and median($S_i^{AA}$) over all $i \in A$;
3. max($S_i^{AB}$) and max($S_i^{AA}$) over all $i \in A$.
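A compact sketch of the percent-alike calculation for one organizational area follows. It is illustrative only: missing scores are assumed to be stored as `None`, and the function names and sample groups are hypothetical. The key point it demonstrates is that the counts are summed over all comparison fabs before the ratio is taken.

```python
from statistics import median


def common_and_present(fab_i, fab_j):
    """Return (number of identical scores, number of factors present in both)."""
    pairs = [(a, b) for a, b in zip(fab_i, fab_j)
             if a is not None and b is not None]
    return sum(1 for a, b in pairs if a == b), len(pairs)


def percent_alike(fab_i, others):
    """Similarity of one fab with a whole group: sum counts first, then divide."""
    common = present = 0
    for fab_j in others:
        c, n = common_and_present(fab_i, fab_j)
        common += c
        present += n
    return common / present if present else None


def summarize(group_a, group_b):
    """min / median / max of the between-group and within-group scores."""
    between = [percent_alike(f, group_b) for f in group_a]
    within = [percent_alike(f, group_a[:k] + group_a[k + 1:])
              for k, f in enumerate(group_a)]
    result = {}
    for name, scores in (("between", between), ("within", within)):
        vals = [s for s in scores if s is not None]
        result[name] = (min(vals), median(vals), max(vals))
    return result


# Hypothetical scores in one organizational area; None = missing.
group_a = [[1, 0, "High"], [1, 1, "High"], [1, 0, None]]
group_b = [[0, 0, "Low"], [0, 1, "Low"]]
summary = summarize(group_a, group_b)
```

The two triples in `summary` correspond to the bars in the graph: one set for group A against group B and one for group A against itself.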
Common Value Analysis - Technique Description
Suppose one wishes to find the individual factors of Group A that differ significantly from Group B. The coarseness of the scoring is important in this analysis, because if a variable has too many values it is difficult to find a representative value for a group. Common Value Analysis works as follows:
1. Find the most common value for factor i in group A, $v_i^A$. If more than half the group has missing values, the "common value" is missing as well.
2. Find the most common value for factor i in group B, $v_i^B$. If more than half the group has missing values, the "common value" is missing as well.
3. If $v_i^A \neq v_i^B$ and neither "common value" is missing, label this factor a possible "interesting" variable.
The benefit of this analysis is that it reduces the number of variables to a reasonable number that can be checked one by one.
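The three steps can be sketched in Python as follows. This is a minimal sketch under assumptions: missing scores are stored as `None`, ties for the most common value are broken by first occurrence (the text does not say how ties are handled), and the function names and sample groups are hypothetical.

```python
from collections import Counter


def common_value(scores):
    """Most common non-missing score, or None if more than half are missing."""
    present = [s for s in scores if s is not None]
    n_missing = len(scores) - len(present)
    if not present or n_missing > len(scores) / 2:
        return None
    # Ties are resolved by first occurrence (an assumption of this sketch).
    return Counter(present).most_common(1)[0][0]


def interesting_factors(group_a, group_b):
    """Indices of factors whose common values differ between the two groups."""
    flagged = []
    for i in range(len(group_a[0])):
        a = common_value([fab[i] for fab in group_a])
        b = common_value([fab[i] for fab in group_b])
        if a is not None and b is not None and a != b:
            flagged.append(i)
    return flagged


# Hypothetical scores: rows are fabs, columns are factors; None = missing.
group_a = [["High", 1, None], ["High", 0, None], ["Low", 1, 1]]
group_b = [["Low", 1, 0], ["Low", 1, 0], ["High", 1, 1]]
flagged = interesting_factors(group_a, group_b)
print(flagged)
```

In this example the third factor is not flagged even though the groups may differ on it, because more than half of group A is missing that score.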

End of Chapter 14



© 2005 Institute for Research on Labor and Employment. 
2521 Channing Way # 5555 
Berkeley, CA 94720-5555