Classification and Structuring Methods in Data Analysis

Data analysis manuals (Aldenderfer and Blashfield, 1984; Everitt, 1993; Hair et al., 1992; Kim and Mueller, 1978; Lebart et al., 1984) provide a detailed presentation of the mathematical logic on which classification and structuring methods are based. We have chosen here to define these methods and their objectives, and consider the preliminary questions that would confront a researcher wishing to use them.

1. Definitions and Objectives

Classifying, condensing, categorizing, regrouping, organizing, structuring, summarizing, synthesizing and simplifying are just some of the operations that can be performed on a data set using classification and structuring methods. Taking this list as a starting point, we can formulate three propositions. First, the different methods of classification and structuring are aimed at condensing a relatively large set of data to make it more intelligible. Second, classifying data is a way of structuring it (that is, if not actually highlighting an inherent structure within the data, at least presenting it in a new form). Finally, structuring data (that is, highlighting key or general factors) is a way of classifying it – essentially by associating objects (observations, individuals, cases, variables, characteristics, criteria) with these key or general factors. Associating objects with particular dimensions or factors boils down to classifying them into the categories represented by these dimensions or factors.

The direct consequence of the above propositions is that, conceptually, the difference between methods of classification and methods of structuring is relatively slim. Although traditionally observations (individuals, cases, firms) are classified, and variables (criteria, characteristics) are structured, there is no reason, either conceptually or technically, why variables cannot be classified or observations structured.

While there are many different ways to classify and structure data, these methods are generally grouped into two types: cluster analysis and factor analysis. The main aim of cluster analysis is to group objects into homogeneous classes, with those objects in the same class being very similar and those in different classes being very dissimilar. For this reason, cluster analysis falls into the domain of ‘taxonomy’ – the science of classification. However, while it is possible to classify in a subjective and intuitive way, cluster analyses are automatic methods of classification using statistics. ‘Typology’, ‘cluster analysis’, ‘automatic classification’ and ‘numeric taxonomy’ are actually synonymous terms. Part of the reason for the diversity of terms is that cluster analyses have been used in many different disciplines such as biology, psychology, economics and management – where they are used, for example, to segment a firm’s markets, sectors or strategies. In management, cluster analyses are often used in exploratory research or as an intermediary step during confirmatory research.
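To make this concrete, the short sketch below groups a handful of hypothetical firms into homogeneous classes using Ward's hierarchical clustering in Python. The two indicators, their values and the choice of three classes are assumptions made purely for illustration; they are not drawn from any of the studies cited here.

```python
# A minimal cluster analysis sketch: grouping hypothetical firms into
# homogeneous classes from two illustrative strategy indicators.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = observations (firms), columns = variables (e.g. R&D intensity, advertising intensity).
X = np.array([
    [0.12, 0.30],
    [0.10, 0.28],
    [0.45, 0.05],
    [0.50, 0.07],
    [0.30, 0.60],
])

# Ward's method merges, at each step, the two classes whose fusion least
# increases within-class variance, so the resulting groups stay homogeneous.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into three classes
print(labels)  # similar firms receive the same class label
```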

Strategic management researchers have often needed to gather organizations together into large groupings to make them easier to understand. Even early works on strategic groups (Hatten and Schendel, 1977), organizational clusters (Miles and Snow, 1978; Mintzberg, 1989), taxonomies (Galbraith and Schendel, 1983) or archetypes (Miller and Friesen, 1978) were already following this line of thinking. Barney and Hoskisson (1990) followed by Ketchen and Shook (1996) have provided an in-depth discussion and critique of the use of these analyses.

The main objective of factor analysis is to simplify data by highlighting a small number of general or key factors. Factor analysis combines different statistical techniques to enable the internal structure of a large number of variables and/or observations to be examined, with the aim of replacing them with a small number of characteristic factors or dimensions.
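As a rough illustration of this idea, the sketch below simulates four observed variables driven by a single underlying dimension and then recovers that dimension with a factor analysis. The simulated data, the loadings and the use of scikit-learn's FactorAnalysis are assumptions of the example rather than part of the authors' method.

```python
# A minimal factor analysis sketch: four correlated indicators are replaced
# by a single underlying factor (simulated data for illustration only).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
factor = rng.normal(size=(100, 1))                        # hidden dimension
loadings = np.array([[0.9, 0.8, 0.7, 0.6]])               # assumed loadings
X = factor @ loadings + 0.3 * rng.normal(size=(100, 4))   # four observed variables

fa = FactorAnalysis(n_components=1)
scores = fa.fit_transform(X)   # one factor score per observation
print(fa.components_)          # estimated loadings: how each variable relates to the factor
```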

Factor analysis can be used in the context of confirmatory or exploratory research (Stewart, 1981). Researchers who are examining the statistical validity of observable measurements of theoretical concepts (Hoskisson et al., 1993; Venkatraman, 1989; Venkatraman and Grant, 1986) use factor analysis for confirmation. This procedure is also followed by authors who have prior knowledge of the structure of the interrelationships among their data, knowledge they wish to test. In an exploratory context, researchers do not specify the structure of the relationship between their data sets beforehand. This structure emerges entirely from the statistical analysis, with the authors commenting upon and justifying the results they have obtained. This approach was adopted by Garrette and Dussauge (1995) when they studied the strategic configuration of interorganizational alliances.

It is possible to combine cluster and factor analysis in one study. For example, Dess and Davis (1984) had recourse to both these methods for identifying generic strategies. Lewis and Thomas (1990) did the same to identify strategic groups in the British grocery sector. More recently, Dess et al. (1997) applied the same methodology to analyze the performance of different entrepreneurial strategies.

2. Preliminary Questions

Researchers wishing to use classification and structuring methods need to consider three issues: the content of the data to be analyzed, the need to prepare the data before analyzing it and the need to define the notion of proximity between sets of data.

2.1. Data content

Researchers cannot simply take the available data just as they find it and immediately apply classification and structuring methods. They have to think about data content and particularly about its significance and relevance. In assessing the relevance of our data, we can focus on various issues, such as identifying the objects to be analyzed, fixing spatial, temporal or other boundaries, or counting observations and variables.

The researcher must determine from the outset whether he or she wishes to study observations (firms, individuals, products, decisions, etc.) or their characteristics (variables). Indeed, the significance of a given data set can vary greatly depending on which objects (observations or variables) researchers prioritize in their analysis. A second point to clarify relates to the spatial, temporal or other boundaries of the data. Defining these boundaries is a good way of judging the relevance of a data set. It is therefore extremely useful to question whether the boundaries are natural or logical in nature, whether the objects of a data set are truly located within the chosen boundaries, and whether all the significant objects within the chosen boundaries are represented in the data set. These last two questions link the issue of data-set boundaries with that of counting objects (observations or variables). Studies focusing on strategic groups can provide a good illustration of these questions. In such work, the objects to be analyzed are observations rather than variables. The time frames covered by the data can range from one year to several. Most frequently, the empirical context of the studies consists of sectors, and the definition criteria are those of official statistical organizations (for example, Standard Industrial Classification – SIC).

Another criterion used in defining the empirical context is that of geographic or national borders. A number of research projects have focused on American, British or Japanese industries. Clearly, the researcher must consider whether it is relevant to choose national borders to determine an empirical context. When the sectors studied are global or multinational, such a choice is hardly appropriate. It is also valid to ask whether a sector or industry defined by an official nomenclature (for example SIC) is the relevant framework within which to study competitive strategies. One can check the relevance of such frameworks by questioning either experts or those actually involved in the system. Finally, one needs to ask whether data covering a very short time span is relevant, and to consider, more generally, the significance of cross-sectional data. When studying the dynamics of strategic groups or the relationship between strategic groups and performance, for example, it is important to study a longer time span.

As for the number and nature of observations and variables, these depend greatly on the way the data is collected. Many studies now make use of commercially established databases (Pims, Compustat, Value Line, Kompass, etc.), which give researchers access to a great amount of information. Determining the number of variables to include leads us to the problem of choosing which ones are relevant. Two constraints need to be respected: sufficiency and non-redundancy. The sufficiency constraint demands that no relevant variable should be omitted, and the non-redundancy constraint insists that no relevant variable should appear more than once, either directly or indirectly. These two constraints represent extreme requirements – in reality it is difficult to entirely fulfill them both, but clearly the closer one gets to fulfilling them, the better the results will be. To resolve these selection difficulties, the researcher can turn to theory, existing literature, or to expertise. Generally, it is preferable to have too many variables rather than too few, particularly in an exploratory context (Ketchen and Shook, 1996).

The problem of the number of observations to include poses the same constraints: sufficiency and non-redundancy. For example, in the study of strategic groups, the sufficiency constraint demands that all firms operating within the empirical context are included in the study. The non-redundancy constraint insists that no firm should appear among the observations more than once. The difficulty is greater here than when determining which variables to include. In fact, the increase in diversification policies, mergers, acquisitions and alliances makes it very difficult to detect relevant strategic entities (or strategic actors). One solution lies in basing the study on legal entities. As legal entities are subject to certain obligations in their economic and social activities, this choice at least has the merit of enabling access to a minimum of economic and social information relating to the study at hand. Here, even more than in the case of variables, sector-based expertise must be used. Identifying relevant observations (such as strategic actors in a study of strategic groups) is an essentially qualitative process.

As a rule, researchers need to consider whether they have enough observations at their disposal. For factor analyses, some specialists recommend more than 30 observations per variable, and even as many as 50 or 100. Others say there must be 30 or 50 more observations than there are variables. There are also those who recommend four or five times more observations than variables. Hair et al. (1992) have argued that these criteria are very strict – they point out that quite often, researchers have to handle data in which the number of observations is hardly double the number of variables. Generally, when the number of observations or variables seems insufficient, the researcher must be doubly careful in interpreting the results.
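The small helper below simply codifies these competing rules of thumb so that a given data set can be checked against them; the figures of 120 observations and 15 variables are hypothetical.

```python
# Check a (hypothetical) data set against the sample-size rules of thumb quoted above.
def check_sample_size(n_obs: int, n_vars: int) -> dict:
    return {
        "observations per variable (30, 50 or even 100 suggested)": n_obs / n_vars,
        "observations minus variables (30 to 50 suggested)": n_obs - n_vars,
        "at least 4-5 times more observations than variables": n_obs >= 4 * n_vars,
        "at least twice as many observations as variables": n_obs >= 2 * n_vars,
    }

print(check_sample_size(120, 15))  # e.g. a study of 120 firms described by 15 variables
```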

2.2. Preparing data

Preparing data ahead of applying classification and structuring methods is essentially a question of tackling the problems of missing values and outliers, and of standardizing variables.

Missing values The problem of missing values can be dealt with in a number of ways, depending both on the analysis envisaged and the number of observations or variables involved.

Cluster analysis programs automatically exclude observations in which any values are missing. The researcher can either accept this imposed situation, or attempt to estimate the missing values (for example, by replacing the missing value with an average or very common value). If the researcher replaces the missing values with a fixed value – using, for instance, the mean or the mode of the variable in question – there is a risk of creating artificial classes or dimensions. This is because having an identical value recurring often in the data set will increase the proximity of the objects affected.
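The sketch below contrasts the two options just described – dropping incomplete observations, as most clustering programs do, and imputing the variable mean – on a small, invented data set; the column names are assumptions for the example.

```python
# Two common treatments of missing values before a cluster analysis:
# dropping incomplete observations, or imputing each variable's mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sales_growth": [0.10, np.nan, 0.25, 0.05],
    "rd_intensity": [0.04, 0.06, np.nan, 0.03],
})

complete_cases = df.dropna()          # what most cluster analysis programs impose by default
mean_imputed = df.fillna(df.mean())   # replaces each gap with the variable's mean

# Caution (as noted above): repeated identical imputed values can create
# artificial proximity between the observations concerned.
print(len(complete_cases), mean_imputed.isna().sum().sum())
```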

The question of missing data is, therefore, all the more important if a large number of values are missing, or if these missing values relate to observations or variables that are essential to the quality of the analysis.

Outliers The question of how to treat outliers is also an important issue, as most of the proximity measurements from which classification and structuring algorithms are developed are very sensitive to the existence of such points. An outlier is an anomalous object, in that it is very different from the other objects in the database. The presence of outliers can greatly distort analysis results, transforming the scatter of points into a compact mass that is difficult to examine. For this reason it is recommended that the researcher eliminates them from the database during cluster analysis and reintegrates them after obtaining classes from less atypical data. Outliers can then supplement results obtained using less atypical data, and can enrich the interpretation of these results. For example, an outlier may have the same profile as the members of a class that has been derived through analysis of more typical data. In such a case, the difference is, at most, one of degree – and the outlier can be assigned to the class whose profile it matches. Equally, an outlier may have a profile markedly different from any of the classes that have resulted from the analysis of more typical data. Here, the difference is one of nature, and the researcher must explain the particular positioning of the outlier in relation to the other objects. Researchers can use their intuition, seek expert opinions on the subject, or refer to theoretical propositions which justify the existence or presence of an outlier.
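A simple way to operationalize this recommendation is sketched below: observations lying more than three standard deviations from the mean on any variable are set aside before clustering and kept for later comparison with the class profiles obtained. The simulated data and the threshold of three are conventions assumed for the example, not prescriptions from the chapter.

```python
# A simple outlier screen: flag atypical observations, cluster the rest first,
# then reintegrate the flagged points when interpreting the classes.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))            # fifty "typical" observations
X = np.vstack([X, [[12.0, 15.0]]])      # one clearly atypical observation

z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
is_outlier = (z > 3).any(axis=1)        # more than 3 standard deviations on any variable

X_typical = X[~is_outlier]              # cluster these first
X_outliers = X[is_outlier]              # compare afterwards with the class profiles obtained
print(np.where(is_outlier)[0])          # index of the flagged observation
```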

Standardizing variables After attending to the questions of missing values and outliers, the researcher may need to carry out a third manipulation to prepare the data: standardizing, or normalizing, the variables. This operation allows the same weight to be attributed to all of the variables that have been included in the analysis. It is a simple statistical operation that in most cases consists of centering and reducing variables to a zero mean and a standard deviation equal to one. This operation is strongly recommended by certain authors – such as Ketchen and Shook (1996) – when database variables have been measured using different scales (for example, turnover, surface area of different factories in square meters, number of engineers, etc.). Although standardization is not essential if database variables have been measured using comparable scales, this has not prevented some researchers from conducting statistical analyses on untreated variables and then on standardized variables so as to compare the results. Here, the solution is to select the analysis with the greater validity.
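The centering-and-reducing operation itself is straightforward, as the sketch below shows on three invented firms measured in very different units; the variables and values are illustrative only.

```python
# Standardizing variables: rescale each variable to mean zero and standard
# deviation one, so that turnover, surface area in square meters and
# headcounts carry the same weight in the analysis.
import numpy as np

X = np.array([
    [1_200_000.0, 3500.0, 12.0],   # turnover, factory surface (m2), number of engineers
    [  800_000.0, 1200.0,  4.0],
    [2_500_000.0, 6000.0, 30.0],
])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # approximately 0 for every variable
print(X_std.std(axis=0).round(6))   # exactly 1 for every variable
```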

Some specialists remain skeptical about how useful the last two preparatory steps really are (for example, Aldenderfer and Blashfield, 1984). Nevertheless, it is worthwhile for researchers to compare the results of analyses obtained with and without standardizing variables and integrating extreme data (outliers). If the results are found to be stable, the validity of the classes or dimensions identified is strengthened.

2.3. Data proximity

The notion of proximity is central to classification and structuring algorithms, all of which are aimed at grouping more similar objects together and separating those that are farthest removed from each other. Two types of measurement are generally employed: distance measurements and similarity measurements. In general, distance measurements are used for cluster analyses and similarity measurements for factor analyses.

Researchers’ choices are greatly limited by the kind of analyses they intend to carry out and, above all, by the nature of their data (category or metric). With category data, the appropriate measurement to use is the chi-square distance. With metric data, the researcher can use the correlation coefficient for factor analyses and Euclidean distance for cluster analyses. Mahalanobis distance is recommended in place of Euclidean distance in the specific case of strong collinearity among variables. It must be noted that factor analyses function exclusively with similarity measurements, whereas cluster analyses can be used with both distance measurements and, although it is very rare, similarity measurements (for example, the correlation coefficient).
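To fix ideas, the sketch below computes the measures mentioned for metric data – Euclidean distance, Mahalanobis distance and the correlation coefficient – on a small simulated data set (the chi-square distance for category data is not shown). The data and library choices are assumptions of the example.

```python
# Proximity measures on metric data: Euclidean and Mahalanobis distances between
# two observations, and the correlation coefficient between two variables.
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))      # 30 observations described by 3 metric variables

d_euclidean = euclidean(X[0], X[1])

# Mahalanobis distance uses the inverse covariance matrix of the variables,
# which corrects for strong collinearity among them.
VI = np.linalg.inv(np.cov(X, rowvar=False))
d_mahalanobis = mahalanobis(X[0], X[1], VI)

# Correlation between two variables (columns): a similarity measure.
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(d_euclidean, d_mahalanobis, r)
```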

In classifying observations, distance measurements will associate observations that are close across all of the variables while similarity measurements will associate observations that have the same profile – that is, that take their extreme values from the same variables. It can be said that similarity measurements refer to profile while distance measurements refer to position.
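A two-observation example makes the distinction tangible: in the sketch below (with invented values), a small firm and a much larger firm share the same profile, so a similarity measure judges them identical while a distance measure places them far apart.

```python
# Profile versus position: same profile, very different levels.
import numpy as np
from scipy.spatial.distance import euclidean

a = np.array([1.0, 5.0, 2.0, 6.0])       # small firm
b = np.array([10.0, 50.0, 20.0, 60.0])   # large firm with the same profile, scaled up

print(np.corrcoef(a, b)[0, 1])   # 1.0: identical profiles (similarity)
print(euclidean(a, b))           # large value: very different positions (distance)
```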

A researcher may then quite possibly obtain different results depending on the proximity measurement (similarity or distance) used. If the results of classification or structuring are stable whichever proximity measurements are used, a cluster or factor structure probably exists. If the results do not correspond, however, it could be either because the researcher measured different things, or because there is no real cluster or factor structure present.

Source: Thietart, Raymond-Alain et al. (2001), Doing Management Research: A Comprehensive Guide, SAGE Publications Ltd, 1st edition.
