Journal of Diplomatic Language

JDL II:1 (2005)

HAMLET: A Multidimensional Scaling Approach to Text-Oriented Policy Analysis

by

Dr. Alan Brier
Associate Member, ESRC National Centre for Research Methods
University of Southampton, UK

Bruno Hopp
Central Archive for Empirical Social Research
University of Cologne, Germany

This paper reviews the facilities available in the latest version of HAMLET II for Microsoft Windows©, which are illustrated by an analysis of debates on criminality in the Lithuanian parliament, the Seimas, between May and July 1994. In particular, it demonstrates the application of Procrustean Individual Differences Scaling (PINDIS) to compare the speeches of the different political groups, including the final unanimous resolution, and to identify the main sources of difference between these contributions.


Polonius: "What do you read, my lord?"
Hamlet: "Words, words, words."
Hamlet (II,ii,194)

Introduction

The quantitative approach described here is in a long-established tradition of Content Analysis[1], by which words which occur together in relatively close proximity in the same context are interpreted as relating to a common theme or concept in the discourse studied. In contrast to the period of initial formation, it is now open to researchers to conduct this sort of analysis, and to use it to compare different text sources, with the computing resources available on their desktop, as a matter of routine.

The HAMLET software originated as a modest tool to count joint occurrences of words in a vocabulary list of interest to the researcher, within a specified unit of context, following the idea of Osgood and his colleagues (1957)[2], who proposed using a set of fixed units of 120 words to analyse the utterances of American psychoanalytic subjects. The present version, HAMLET II, offers the option of different context specifications for counting co-occurrences of words, according to the level of analysis required by the individual researcher.
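The counting step can be illustrated with a minimal Python sketch. It is not HAMLET's own code: the function name `count_joint_frequencies`, the choice of sentences as context units, the treatment of a trailing asterisk as a stem wildcard, and the English sample text are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def count_joint_frequencies(text, search_list):
    """Count individual and joint frequencies of search terms per context unit.

    Assumptions of this sketch: a context unit is a sentence ended by
    '.', '?' or '!', and a search term ending in '*' matches any word
    that begins with the letters before the asterisk.
    """
    units = [u for u in re.split(r"[.?!]", text) if u.strip()]
    individual = Counter()  # number of units in which each term occurs
    joint = Counter()       # number of units in which each pair co-occurs
    for unit in units:
        words = re.findall(r"\w+", unit.lower())
        present = set()
        for term in search_list:
            stem = term.lower()
            if stem.endswith("*"):
                hit = any(w.startswith(stem[:-1]) for w in words)
            else:
                hit = stem in words
            if hit:
                present.add(term)
        individual.update(present)
        # each pair is stored once, in alphabetical order of the terms
        joint.update(combinations(sorted(present), 2))
    return len(units), individual, joint


sample = ("Crime must be fought. We fight crime and corruption! "
          "Is control of crime possible?")
t, fi, fij = count_joint_frequencies(
    sample, ["crime*", "fight*", "control*", "corrupt*"])
print(t, dict(fi), dict(fij))
```

Each term is counted at most once per context unit, so the frequencies are expressed in units of context rather than in raw word counts.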

After standardising the raw joint frequencies to take account of the varying overall frequencies of the words involved, the result can be regarded as a matrix of similarities between the set of words used, and submitted to some kind of structural or categorical analysis to make the relationships more visible. Earlier work in this tradition tended to apply factor analysis, but for textual analysis the model of non-metric multidimensional scaling is much more appropriate, as it does not require an assumption that the variables are continuously measured. The basic MDS (Multi-Dimensional Scaling) model seeks to reproduce only the order of the distances between the points representing the variables of the matrix, while reducing the dimensionality of the space within which they are represented. Not only does this seem more appropriate to the representation of relationships based on co-occurrences of words in natural languages, it also has the additional practical benefit that these can usually be adequately visualized in no more than three dimensions, where factor analysis would require appreciably more. Finally, as Borg (1993, 1997) has convincingly demonstrated, interpretation of results of this kind of analysis can be made with regard to facet theory, where partitions and orientations of the space are relevant and its primary dimensions are simply a representational convenience in viewing the relationships between the points.
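What it means to reproduce only the order of the distances can be made concrete with a small check. The following Python sketch is not the scaling algorithm itself; it merely measures, for a given configuration of points, how often a higher similarity between two words corresponds to a smaller distance between their points. The function name `order_agreement` and the example values are hypothetical.

```python
from itertools import combinations
from math import dist


def order_agreement(similarity, points):
    """Share of comparisons in which the more similar pair is also the closer pair.

    similarity: symmetric matrix (list of lists) of similarity values
    points:     one coordinate tuple per row of the matrix
    """
    pairs = list(combinations(range(len(points)), 2))
    agree = total = 0
    for (a, b), (c, d) in combinations(pairs, 2):
        s_diff = similarity[a][b] - similarity[c][d]
        if s_diff == 0:
            continue  # tied similarities carry no order information
        d_diff = dist(points[c], points[d]) - dist(points[a], points[b])
        total += 1
        if s_diff * d_diff > 0:  # higher similarity, smaller distance
            agree += 1
    return agree / total if total else 1.0


sim = [[1.0, 0.8, 0.2],
       [0.8, 1.0, 0.3],
       [0.2, 0.3, 1.0]]
print(order_agreement(sim, [(0, 0), (1, 0), (4, 0)]))  # order fully preserved
print(order_agreement(sim, [(0, 0), (4, 0), (1, 0)]))  # order fully reversed
```

A non-metric scaling routine searches for the configuration that makes a criterion of this ordinal kind as good as possible in the chosen number of dimensions; the absolute sizes of the similarity values play no part.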

In terms of qualitative data analysis, the approach depends on the formulation of a list of theoretically relevant words, the co-occurrences of which can be identified, with respect to a meaningful unit of context, in the texts to be submitted to analysis. Recent experience has shown that trying to dispense with this, and identifying instead all words subject to co-occurrence, for example within sentences or paragraphs of a text, saves little time and is liable to produce more errors than a 'dictionary-based' approach.[3] The 'search list' can be of varying complexity, and its constituent items can vary in their linguistic and grammatical nature according to the nature of the research itself. Its main items can become category names, each defined by an associated list of words assigned to the category. Once this 'search list' has been defined, it can be applied to a series of texts, and used to compare them systematically, using Procrustean Individual Differences Scaling (PINDIS). HAMLET II for Windows is unique in offering the means to carry out these stages easily within a single graphical user interface. Provision is made for the lexicographic conventions of major European languages, including those using Cyrillic and Greek characters. It is possible to define additional character sets for special purposes. For example, one recent application has been to old German epic texts which have to be transliterated for computer-assisted analysis.

An example of using HAMLET for Windows

The following example illustrates the processes involved in applying HAMLET II. It is based on analyses of a series of debates on crime in the Lithuanian Parliament, the Seimas, initially reported by Dr. Aleksandras Dobryninas of the University of Vilnius[4]. The first two debates took place on May 18, the third on May 26, and the fourth on July 19, 1994. All sessions were broadcast by Lithuanian National Radio and there were commentaries in the main Lithuanian newspapers and TV news programs.

The original research was concerned with the use of "fight"- and "control"-rhetorics and the issue of crime in Lithuanian political discourse. The hypothesis was that "fight"-rhetoric expressed the discourse of the previous communist ideology, and "control"-rhetoric the newly-emerging Western-style liberal and democratic orientation. A predominance of the first could suggest a return to the previous semi-military form of control of society. The second type of rhetoric could be interpreted as showing an intention to adopt a Western type of social regulation.

The material for analysis consisted of five groups:

 · speeches of representatives of the Executive and Judicial branches, (15,596 words);
 · speeches of members of the Seimas belonging to the ruling majority party, (7,491 words);
 · speeches of members of the Seimas belonging to the left-centrist opposition, (5,254 words);
 · speeches of members of the Seimas belonging to the rightist opposition, (10,986 words);
 · the concluding Seimas Resolution of July 19, 1994 (prepared by the ruling majority Labour Democratic Party and accepted without a vote), (466 words)[5].

Preliminary contextual analysis of all five texts was made using the HAMLET auxiliary Keyword-in-Context routine KWIC, and five groups of words related to 'crime', 'control' and 'fight' and their lemmata were initially selected, as follows (the asterisk at the end of most of the Lithuanian words indicates that words beginning with the letters shown were to be treated as equivalent, taking account of plural and declined forms of the same term):

Applying HAMLET with this vocabulary list to the speeches of members of the executive and judiciary produced the following statistics of individual and joint frequencies of occurrences within sentences:


15596 words read from the text file.
455 of these were in the search list, and
1206 context-units were counted.

JOINT FREQUENCIES .............................

for SENTENCES punctuated [ . ? ! ]

To obtain a clearer picture of the overall structure, HAMLET offers two forms of graphic representation of the information in this matrix. Since the words listed clearly occur with quite different individual frequencies, it is appropriate first to convert the raw frequencies into a coefficient of co-occurrence which takes this into account. The measure chosen here is the simple Jaccard coefficient of similarity for dichotomous data[6].

Individual word frequencies (fi) are counted together with joint frequencies (fij) for all possible pairs of words, and the corresponding standardised joint frequencies are calculated, by default, as

sij = fij / (fi + fj - fij) ,

where fij and fi, fj refer respectively to joint and individual frequencies of words i and j in a given vocabulary list, expressed in units of context in each case.

The coefficient excludes consideration of occasions when neither property, in this case neither of a pair of words, is present. It represents the probability that both of a pair of attributes are present in an object, here a context-unit, when only those objects exhibiting one or the other are considered. It has an expected value of

E(sij) = fi . fj / [t(fi + fj) - fi . fj] ,

since the expected value of fij is fi · fj / t ,

where t is the total number of context-units counted in the text, and fi , fj and fij as before are the individual and joint frequencies of the words i and j, expressed in context-units.
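As a worked illustration, the Jaccard coefficient in its standard form and the expected value given above can be written as two short Python functions. The function names and the example frequencies (40, 25 and 10) are hypothetical; only the total of 1206 context-units is taken from the output shown earlier.

```python
def jaccard(f_ij, f_i, f_j):
    """Jaccard coefficient: joint frequency over the number of units containing either word."""
    either = f_i + f_j - f_ij
    return f_ij / either if either else 0.0


def expected_jaccard(f_i, f_j, t):
    """Expected value E(sij) = fi.fj / [t(fi + fj) - fi.fj] for t context-units."""
    return (f_i * f_j) / (t * (f_i + f_j) - f_i * f_j)


# Hypothetical frequencies for two words in a text of 1206 context-units
f_i, f_j, f_ij, t = 40, 25, 10, 1206
print(jaccard(f_ij, f_i, f_j))        # observed coefficient
print(expected_jaccard(f_i, f_j, t))  # value expected under independence
```

Substituting the expected joint frequency fi.fj / t into the coefficient gives the same value as `expected_jaccard`, which is a convenient check on the algebra. An observed coefficient well above the expected value indicates that two words occur together more often than their individual frequencies alone would suggest.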

As an alternative and for purposes of comparison, it is possible to employ instead Sokal's matching coefficient, in which the number of joint non-occurrences is included in the numerator and denominator of the calculation. In the terms already outlined, this coefficient is

cij = (fij + t - (fi + fj - fij)) / t

since the term to be added to the numerator and denominator is t-(fi+fj-fij).
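The matching coefficient can be sketched in the same way. The function name and the example frequencies are again hypothetical, with 1206 context-units as in the output shown earlier.

```python
def simple_matching(f_ij, f_i, f_j, t):
    """Matching coefficient: units with both words or with neither, over all t units."""
    neither = t - (f_i + f_j - f_ij)  # joint non-occurrences
    return (f_ij + neither) / t


# Two words occurring in 40 and 25 units, together in 10, out of 1206 units
print(simple_matching(10, 40, 25, 1206))
```

Because most context-units contain neither word of a given pair, joint non-occurrences dominate this coefficient when the words are infrequent, and its values then lie close to 1 for almost every pair. This is one practical reason for treating it as an alternative for comparison rather than as the default.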

Both coefficients are also, of course, indifferent to the order in which the words in each pair occur, and depend on a sensible choice of context-unit being made in reading the text. In all of the following results, the context units were sentences of the original text.

STANDARDISED JOINT INDEX VALUES ........................

The most recent HAMLET II version[7] offers two alternative methods of hierarchical clustering, which apply different criteria in assigning the individual vocabulary entries to clusters: the "connectedness", "single link" or "nearest neighbour" method looks for the greatest similarity between an unassigned item and those contained in existing clusters; the "diameter", "complete linkage" or "furthest neighbour" method defines the similarity between groups as the similarity between their least similar pair of individual items.

For the same data, the "connectedness" method produces the following, slightly different, dendrogram, which is offered here for purposes of comparison. Where well-defined "compact spherical" clusters exist, these methods tend to produce similar results. The "connectedness" method, on the other hand, is useful for detecting the absence of any distinct structure.[8] For the present data, it is clear that there is a definite structure to be observed and that it makes sense to continue with the analysis.
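The difference between the two clustering criteria can be seen in a naive Python sketch. This is a textbook agglomerative procedure, not HAMLET's implementation; the function name `agglomerate` and the similarity values in the example matrix are invented for illustration.

```python
def agglomerate(similarity, labels, method="complete"):
    """Naive agglomerative clustering on a symmetric similarity matrix.

    method="single":   similarity between clusters is that of their MOST similar pair
                       ("connectedness", nearest neighbour)
    method="complete": similarity between clusters is that of their LEAST similar pair
                       ("diameter", furthest neighbour)
    Returns the merges in order as (cluster_a, cluster_b, similarity_level).
    """
    pick = max if method == "single" else min
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = pick(similarity[i][j]
                         for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((tuple(labels[i] for i in clusters[x]),
                       tuple(labels[i] for i in clusters[y]), s))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


labels = ["crime", "fight", "control", "corruption"]
sim = [[1.00, 0.60, 0.10, 0.40],   # invented values, for illustration only
       [0.60, 1.00, 0.05, 0.20],
       [0.10, 0.05, 1.00, 0.15],
       [0.40, 0.20, 0.15, 1.00]]
for method in ("single", "complete"):
    print(method, agglomerate(sim, labels, method))
```

With these invented values both methods join the items in the same order, but the complete-linkage merges occur at lower similarity levels, since a cluster is judged by its least similar members. With less clear-cut data the order of merging can differ as well.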

The primary interest in developing HAMLET was, however, to be able to apply Multidimensional Scaling methods to the matrices of joint-occurrences derived as has been described. The scaling procedure in HAMLET for Windows is a specially adapted version of the Michigan-Nijmegen Smallest Space Analysis routine (MINISSA) (Lingoes and Roskam, 1973)[9]. This offers a convenient and robust means of reducing the information contained in the matrix to a smaller dimensionality, while preserving as far as possible the relative magnitudes of the similarity values between the main vocabulary items. It would, after all, be difficult to regard these values as representing any more than an ordinal level of measurement. An additional advantage, in comparison with calculating the principal components of the matrix or applying a method of factor analysis, is that it is usually sufficient to represent the final configuration in a space of at most three dimensions.

Applying MINISSA to the above data produces the following representation reduced into three dimensions of the relationships between the terms considered in the speeches of members of the executive and judiciary. HAMLET allows the plot to be rotated and annotated, and the configuration to be moved in relation to the axes, which are essentially arbitrary in these results.

The following figure shows the same data reduced to a 2-dimensional configuration, with the clusters of the diameter method outlined. In this case, it can be seen that the diameter-method cluster analysis and MDS are producing more-or-less equivalent results.

The attraction of Multidimensional Scaling applied in textual analysis of this sort is immediately apparent: the items which are of central concern as reflected in the joint-occurrences of the defined vocabulary items tend to appear, literally, in the centre of the resulting configuration. In this case, "crime" clearly appears as the "central concern", which should be no surprise, given the nature of the speeches considered.

The same procedure can be applied to each of the sets of texts of the Seimas debates. Here, for example, is the result of MINISSA scaling of a matrix derived from the speeches of those Dobryninas calls the "rightist opposition", that is, the old Communists. It can be seen that the representation differs from that of the speeches of members of the executive and judiciary although there are similar features.

This time, "crime" and "fight" occupy the same position in the plot, so that the label "fight" is overwritten. "Corruption" is closely related, while the language of "control", as expected, appears isolated at some distance from the other terms shown, as it played little part in the contributions of this part of the political spectrum in Lithuania.

The resolution of the Seimas passed at the end of the debates runs to only 466 words, and the vocabulary applied in its analysis in Professor Dobryninas' project consisted of only 4 main entries, which leads to numerical problems in MINISSA if scaling into more than two dimensions is attempted. Bearing in mind these limitations, however, the two-dimensional scaling shows again that "fight" and "crime" have the closest association, while "control" and "corruption" each occur separately in the resolution. The resolution, too, appears heavily marked by the old usage.

Comparing texts

So far, we have been concerned only with representations of individual texts of the different political groupings. HAMLET further allows comparison of these representations, to produce a representation of the relationships between the different contributions to the debates and to identify more precisely the sources of differences. This is done with a version of Lingoes and Borg's Procrustean Individual Differences Scaling (PINDIS)[10], which takes as its input the configurations produced by MINISSA from the matrices of joint frequencies for the separate texts to be compared.

In the PINDIS procedure, the various individual configurations - representing the individual texts selected - are first centred at the origin and normed to unit length. They are then subjected to an initial series of transformations - rescaling, translation, rotation and reflection - which preserve the orderings of the distances between the points corresponding to the words of the original configurations. These are iteratively applied until the individual configurations have reached the greatest possible conformity with each other. The resulting 'centroid' configuration represents a kind of median of the individual configurations considered, and is subsequently used as a reference point for the examination of the similarities and differences between them. Alternatively, it is possible for the user to provide a hypothetical configuration as the reference configuration.
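The initial fitting stage can be sketched for two-dimensional configurations as follows. This is a generic Procrustes fit written for illustration, not the PINDIS routine itself: the function names, the fixed number of iterations and the use of the first configuration as the starting reference are all assumptions of the sketch.

```python
from math import atan2, cos, sin, sqrt


def centre_and_norm(points):
    """Centre a 2-D configuration at the origin and scale it to unit length."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    norm = sqrt(sum(x * x + y * y for x, y in centred))
    return [(x / norm, y / norm) for x, y in centred]


def fit_to(target, points):
    """Rotate (and, if it helps, reflect) `points` to lie as close as possible to `target`."""
    best = None
    for candidate in (points, [(x, -y) for x, y in points]):  # as is, then reflected
        a = sum(tx * x + ty * y for (tx, ty), (x, y) in zip(target, candidate))
        b = sum(ty * x - tx * y for (tx, ty), (x, y) in zip(target, candidate))
        theta = atan2(b, a)  # angle minimising the sum of squared distances
        c, s = cos(theta), sin(theta)
        fitted = [(x * c - y * s, x * s + y * c) for x, y in candidate]
        loss = sum((tx - fx) ** 2 + (ty - fy) ** 2
                   for (tx, ty), (fx, fy) in zip(target, fitted))
        if best is None or loss < best[0]:
            best = (loss, fitted)
    return best[1]


def centroid_configuration(configs, iterations=10):
    """Fit all configurations to their running average; return (centroid, fitted configs)."""
    configs = [centre_and_norm(c) for c in configs]
    centroid = configs[0]  # starting reference: an assumption of this sketch
    for _ in range(iterations):
        configs = [fit_to(centroid, c) for c in configs]
        n = len(configs)
        centroid = centre_and_norm(
            [(sum(c[k][0] for c in configs) / n, sum(c[k][1] for c in configs) / n)
             for k in range(len(centroid))])
    return centroid, configs


# The second configuration is the first one reflected, rotated, enlarged and shifted
a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
b = [(5.0, 5.0), (5.0, 8.0), (11.0, 5.0)]
centroid, fitted = centroid_configuration([a, b])
print(fitted[0])
print(fitted[1])
```

Because all of these transformations leave the relative distances within each configuration unchanged, two configurations that differ only in position, size, orientation or reflection are brought into exact coincidence. Any discrepancy that remains after this stage therefore reflects a genuine difference in the relationships between the words.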

As only relatively small matrices are available for the present example, analysis has been confined to two-dimensional results, as these seem to be reasonably stable and meet acceptable criteria of fit in the initial MINISSA scalings.

The results of a PINDIS analysis are displayed in HAMLET in a series of graphics, beginning with the configuration adopted as the reference point for further comparisons. The centroid, if applied, will be of varying use in summarising the data, depending on the extent of the variation between the separate text sources considered. It is shown here for the sake of completeness.

The next graphic is probably the most useful as a representation of the process of comparison: it shows the ordering of the original configurations (the 'subjects') which are being compared, in the space represented by the dimensions of the centroid, or the initial hypothetical configuration, whichever is being applied. The plotted coordinates are the optimal normalized dimension weights required to bring the individual subjects into conformity with the centroid/hypothesis (i.e. multiplied by the corresponding column sums of squares of the centroid/hypothesis).

If the resolution is excluded from the analysis, this plot suggests that the discourse of the left-centrist opposition is the furthest removed from the other three, which themselves form a fairly tight cluster, with the Executive and Judiciary and ruling majority members in an intermediate position between it and the rightist opposition.

If, however, we are prepared to treat the categories not recorded in the configuration for the Seimas resolution as implicit, and therefore not mentioned, rather than ignored as insignificant, it becomes possible to include the resolution in a PINDIS analysis.[11] The interesting alternative result then shows the resolution itself also occupying an intermediate position, and the rightist opposition at a greater distance from the majority. The language of the Executive and Judiciary can also be seen to be closer to that of the left-centrist opposition.

The detailed PINDIS output listing reports the results of submitting the individual configurations to a series of decreasingly stringent distortions,[12] which now permit relative distances to change in systematic ways. The results can serve to highlight the precise nature of these policy changes.

The first of these, dimensional weighting, has already been encountered in plotting the subject space. This shows how far each subject can be interpreted using the same dimensions, by attributing to them different salience in the separate text sources. The next model combines differing dimensional weights with differing dimensional orientations, analogous to the introduction of oblique or correlated axes in factor analysis. This is the equivalent in PINDIS to the INDSCAL (Individual Differences Scaling) procedure in Multidimensional Scaling.[13]

Two further transformations are, however, of potentially greater interest in comparing configurations based on different texts. PINDIS additionally reports the results of allowing differing vector weights and vector orientations to apply. This is equivalent to moving the points representing the individual words or categories in the various texts separately, in pairwise relation to one another, to fit them more closely to the reference configuration. Substantial differences in the weights and/or direction cosines at this level immediately draw attention to differences in the contextual relationships of individual words in the texts compared. This offers a useful tool, for example, in monitoring subtle contextual changes in relevant language use between sources, as well as those occurring over time.

For simplicity, only this model will be considered here, to give an idea of how the detailed results may be read descriptively. A direction cosine of 1.0 means the orientation is not changed, while a value of -1.0 means the corresponding vector has been rotated through 180 degrees. The weights represent differences in length of the transformed vectors from their original positions. Among the four groups of speakers, it is the rightist opposition which can be made to fit least well to the centroid by these transformations; only the short Seimas resolution fits less well. The fit values reported are the correlations between the individual distance orderings and those of the centroid. The direction cosine of 0.0506 shows that it is the use of the word "fight" which is most distinctive in their contributions to these debates, when compared with the centroid.

**** Analytic Solutions for Individual Configurations ****
Perspective Models - Vector Weighting

1) Executive and Judiciary     Fit --- S(VZ,X) = 0.879266
2) Ruling_majority             Fit --- S(VZ,X) = 0.949822
3) Left-centrist_opposition    Fit --- S(VZ,X) = 0.942617
4) Rightist_opposition         Fit --- S(VZ,X) = 0.861445
5) Seimas Resolution           Fit --- S(VZ,X) = 0.787837
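The reading of direction cosines and weights described above can be illustrated with a few lines of Python. The sketch compares the position vector of a single point before and after a transformation; the function name is hypothetical, and taking the weight as the ratio of the two vector lengths is an assumption made for illustration rather than a statement of how PINDIS computes its weights.

```python
from math import hypot


def vector_change(original, transformed):
    """Return (direction cosine, length ratio) for a point's position vector.

    Both points are (x, y) tuples measured from the origin of the
    configuration and must not lie at the origin itself.
    """
    (x0, y0), (x1, y1) = original, transformed
    len0, len1 = hypot(x0, y0), hypot(x1, y1)
    cosine = (x0 * x1 + y0 * y1) / (len0 * len1)
    return cosine, len1 / len0


print(vector_change((1.0, 0.0), (2.0, 0.0)))   # same direction, doubled in length
print(vector_change((1.0, 0.0), (-1.0, 0.0)))  # rotated through 180 degrees
print(vector_change((1.0, 0.0), (0.0, 3.0)))   # rotated through 90 degrees
```

A cosine close to zero, such as the value of 0.0506 reported for "fight", thus corresponds to a vector that has had to be turned through almost a right angle to approach the reference configuration.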

The detailed PINDIS results also offer the most accurate means of comparing the configurations for each of the groups compared, as these have, after all, been rescaled and reoriented to remove any idiosyncratic effects which may have been present in the original MINISSA scalings. The following plot for the Executive and Judiciary, for example, is already familiar in its overall appearance due to the small number of categories which have been considered in this example. The minor shifts in orientation are characteristic of the variations which occur in applying multidimensional scaling techniques to data of this kind.

Each of these PINDIS models introduces further parameters into the estimation process in addition to the assumptions already applied in MINISSA in scaling the original matrices, so that the results frequently cannot be regarded as statistically reliable, but the method remains a convenient and useful form of description of relationships between the texts.

Conclusion

The general conclusion, not very surprisingly, supports Professor Dobryninas' original observations. The main focus of the speeches was depressingly familiar, and continued the well-established language of 'fighting crime' which he regarded at the time as characteristic of earlier times. It would be interesting, although beyond the scope of this brief review, to see how the methods described might be applied to subsequent developments in Lithuanian public political discourse.

Notes

[1] See, for example, Iker (1974), Weber (1990), Krippendorff (2003).

[2] Osgood et al. (1957), Brier (1985). See also Brier and Reiter (1989), Brier and Hauschild (1996).

[3] See Landmann and Zuell (2004). They compared the results of applying their TEXTPACK (http://www.gesis.org/software/textpack/) and Woelfel's Catpac (1998) to leading articles from the New York Times and the Frankfurter Allgemeine Zeitung during the invasion of Iraq in 2003.

[4] The data are from an original study by Professor Aleksandras Dobryninas, University of Vilnius (1996, 1997), which was part of a NATO research project "Democratic Changes and Crime Control in Lithuania: Compiling New Criminological Discourses". The authors are indebted to Professor Dobryninas for permission to use his data.

[5] All speeches were recorded and published as official material in Lietuvos Respublikos Seimo ketvirtoji sesija, (1994) Nos. 133, 136, 155, Vilnius: Lietuvos Respublikos Seimas.

[6] The reader is referred to Coxon (1982), chapter 2, Everitt and Rabe-Hesketh (1996), chapter 2, or to Sokal and Sneath (1963) for general treatments of measures of similarity between dichotomous variables.

[7] HAMLET II, currently available as a time-limited Beta version. See http://www.apb.cwc.com/homepage.htm. A version for Linux is in preparation.

[8] See, for example, Everitt (1974), pp.74-78.

[9] See Coxon (1982), chapter 3, Borg and Groenen (1997).

[10] Lingoes and Borg (1979), pp.491-519; Gower and Dijksterhuis (2004), pp.169-171; Borg and Groenen (1997).

[11] The authors are indebted to Dr. Ekkehard Mochmann, University of Cologne, for this suggestion, which has been incorporated into HAMLET II.

[12] R. Langeheine (1980) describes tests of significance to permit the evaluation both of single transformations in PINDIS and of improvements in fit between the various transformations. His tables offer criterion values to test the hypothesis that the fit obtained could be generated by purely random configurations. Each of the PINDIS models introduces further parameters into the estimation process in addition to the assumptions already applied in MINISSA in scaling the original matrices, so that the results frequently cannot be regarded as statistically reliable, and the method remains, at best, a convenient form of description of relationships between the texts.

[13] Carroll and Chang (1970); Gower and Dijksterhuis (2004), pp.171-174; Coxon (1982), pp.190-199, 230-232.

References

Borg I. (1992): Grundlagen und Ergebnisse der Facettentheorie, Verlag Hans Huber, Bern.

Borg I. and Groenen P. (1997): Modern Multidimensional Scaling: Theory and Applications, Springer, Berlin/Heidelberg.

Brier A.P. (1985): HAMLET: A Pascal Program to Count Joint Frequencies of Words in a Text, in: Siegener Periodicum für internationale empirische Sozialwissenschaft, 4,1, pp.177-196.

Brier A.P. and Reiter A. (1989): Methode zur Ermittlung von Ähnlichkeitswerten von Kontextwörtern und ihre Anwendung in einer ideologiekritischen Zeitschriftsanalyse, in: Manfred Thaller, Albert Müller (Hg.), Computer in den Geisteswissenschaften: Konzepte und Berichte, Studien zur Historischen Sozialwissenschaft Band 7, Campus, Frankfurt/Main.

Brier A.P. with Hauschild I. (1996): Dimensions of Change in Political Discourse. The Changing Social Order in Germany and the Changing Use of Language in Political Discourse - Some Remarks on an Interdisciplinary Research Project, in: Heinrich Best, Ulrike Becker, Arnaud Marks (eds) Social Sciences in Transition: Social Science Information Needs and Provision in a Changing Europe, Informationszentrum Sozialwissenschaften, Bonn, pp.305-326.

Carroll J.D. and Chang J.J. (1970): Analysis of individual differences in multidimensional scaling via n-way generalization of 'Eckart-Young' decomposition, in: Psychometrika, 35, pp.283-319.

Coxon A.P.M. (1982): The User's Guide to Multidimensional Scaling, London, Heinemann.

Dobryninas A. (1996) Democratic Change and Crime Control in Lithuania: Compiling New Criminological Discourses (Vilnius, Lithuania). NATO: Individual Democratic Institutions Research Fellowships 1994-1996. http://www.nato.int/acad/fellow/94-96/dobrynin/home.htm

Dobryninas A. (1997) "Crime Control in Lithuanian Political Discourse", in Šiuolaikinio socialinio diskurso analize (A.J. Greimo centro studijos / 4: Semiotika), (Vilnius: Baltos lankos.), p. 26-40 (Lith.)

Everitt B. (1974): Cluster Analysis, London, Heinemann.

Everitt B.S. and Rabe-Hesketh S. (1996): The Analysis of Proximity Data, Arnold.

Gower J. and Dijksterhuis G.B. (2004): Procrustes Problems, Oxford University Press.

Iker H.P. (1974): An historical note on the use of word-frequency contiguities in content analysis, in: Computers and the Humanities, 8.

Krippendorff K. (2003): Content Analysis: An Introduction to its Methodology, 2nd Edition, Thousand Oaks, CA., Sage Publications.

Landmann J. and Zuell C. (2004): Computerunterstützte Inhaltsanalyse ohne Diktionär?, in: ZUMA-Nachrichten 54.

Langeheine R. (1980): Approximate Norms and Significance Tests for the LINGOES-BORG Procrustean Individual Differences Scaling (PINDIS), Institut für die Pädagogik der Naturwissenschaften, University of Kiel.

Lingoes J.C. and Roskam E.E. (1973): A mathematical and empirical study of two multidimensional scaling algorithms, in: Psychometrika, 38 (Supplement).

Lingoes J.C. and Borg I. (1979): A direct approach to individual differences scaling using increasingly complex transformations, in: Psychometrika, 44, pp.491-519.

Osgood C., Suci G.J. and Tannenbaum P.H. (1957): The Measurement of Meaning, Urbana, Ill., University of Illinois Press.

Sokal R.R. and Sneath P.H. (1963): Principles of Numerical Taxonomy, N.Y., Freeman.

Weber R.P. (1990): Basic Content Analysis, Sage University Papers Series on Quantitative Applications in the Social Sciences, no. 07-49, 2nd Edition, Thousand Oaks, CA., Sage Publications.

Woelfel J.K. (1998): User's Guide. Catpac II, New York, Rah Press.
