An Application of Cluster Analysis and Multiple Scaling to the Question of "Hands" and "Languages" in the Voynich Manuscript

M. D'Imperio
4000 Cathedral Ave. NW #106B
Washington, DC 20016
28 January 1992

ABSTRACT.

This paper describes the results of an exploratory study of Currier's theory using cluster analysis and multiple scaling, and presents additional results of a second study with expanded data sets, using multiple scaling alone. A clear separation was found between pages assigned by Currier to his statistical "languages" A and B. Biological-B pages formed the most compact cluster in all the plots. There was little indication of a separation between pages from different sections of the manuscript with different kinds of pictures. There was some evidence for a separation of hand 4 from hand 1 within the herbal section, language A, and hand 5 from hand 2 within the herbal section, language B. Suggestions for further research are offered.

[This paper is an extensively revised and updated version of an earlier paper in an in-house technical journal, dated 20 June 1978. It includes corrected and expanded data sets.]

On 30 November 1976, a one-day seminar entitled "New Research on the Voynich Manuscript" was held in the Washington, DC metropolitan area. Two detailed presentations by Captain Prescott H. Currier provided high points of that occasion. In them, and in the supporting paper printed as Appendix A of the Proceedings, he set forth his theory that there were several different scribes involved in the production of the Voynich Manuscript, and that their individuality was attested not only by characteristic "hands", reliably distinguishable by eye, but also by statistically distinct "languages". If this hypothesis could be confirmed, it would provide students of the manuscript with an important new insight into the problem. This paper describes the results of an exploratory study of Currier's theory using cluster analysis and multiple scaling.

Currier's Hypothesis.

The Voynich Manuscript is a rather long document, comprising more than 200 pages (somewhere between 210 and 250, depending on how large fold-out pages are counted) of writing in an unknown script and bizarre colored drawings of a wide variety of subject matter. Currier's findings concern contrasts he has seen between sets of pages in certain sections of the manuscript, leading him to classify the pages into sub-groupings. This is an approach quite different from that of other students of the manuscript, who almost invariably have considered it the monolithic production of one author or scribe. Here are a few highlights drawn from Currier's exposition of his theory at the 1976 conference:

"The first twenty-five folios in the herbal section are obviously in one hand and one 'language', which I call 'A'....The second twenty-five folios are in two hands, very obviously the work of at least two different men [A and B]. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five or thirty folios) is in two 'languages' [A and B], and each 'language' is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own 'language'....Now with this information available, I went through the rest of the manuscript...and in four other places I discovered the same phenomena I associated with 'language' B.... The biological section is all in one 'language' (B) and one hand." [2, p. 20 ff.]

He has found indications of different hands and the two different "languages" in other sections of the manuscript (the pharmaceutical, astrological, and "recipe" or "star-paragraph" sections). His identifications of hands and "languages" are summarized in the Seminar Proceedings [2, p. 68]. Currier's use of the word "language" in this context is not intended to imply different underlying natural languages (e.g., Greek as against Latin, or German as against French). He is referring to patterns of statistical characteristics that seem to be consistently associated with the hand of one writer as contrasted with the hand of the other writer. Certain symbols are more likely to occur together or to appear in certain positions of the text groups in folios showing one hand than in folios showing the other. In summing up his findings, Currier states that he feels quite certain of at least five, and perhaps as many as eight, different hands in the manuscript as a whole, but only two statistical "languages".

Several of us, after attending his presentation, confirmed Currier's suggestions to our own satisfaction by replicating his procedure of comparing pages in the herbal section (where the contrasts between the hand of Scribe 1 and that of Scribe 2 are especially striking), and verifying both the consistent differences in style of writing and the accompanying difference in occurrences of certain symbols. Nevertheless, since so many other approaches to the problems posed by the Voynich Manuscript have been fraught with subjectivity and self-delusion, it seemed important to place Currier's findings on a more objectively demonstrable and secure basis, and to attempt to verify them by an independent statistical study. If in fact it can be shown that different scribes used the writing system in consistently distinct ways, we may be able to exploit these differences to tell us something about the way the system worked and how it was constructed.

Cluster Analysis

In considering statistical tools for investigating Currier's hypothesis, I decided upon cluster analysis as an appropriate method. Cluster analysis algorithms are available as computer programs and are widely employed in the social and natural sciences for classifying collections of objects into subsets based on similarities and dissimilarities with respect to a list of scores or observations. The methods can also be used to reveal which of a group of objects is most like another single object in the group. So long as a set of observations has been made, such that every object under study has been scored, rated, or labelled for all the same properties or variables, the clustering techniques can be applied to reveal subgroups among the objects. Within each cluster, objects are more like each other with respect to those variables than like objects in other clusters. This approach seemed likely to reveal the sets of similar pages within the Voynich Manuscript that Currier's theory called for, if in fact they were reflected in the statistical properties of the text. (For the interested reader not already familiar with cluster analysis, a number of more or less readable works are available. The references I used, considerably out of date now perhaps, but still usable, were Everitt 1974 [3] and Anderberg 1973 [1].)

There are numerous ways of carrying out cluster analysis, and the published computer programs embody various combinations of these, considered by their designers to offer some special advantage for certain applications. In general, however, the analysis involves the following stages: 1) deciding upon a group of objects that constitutes a good sample of the groupings or clusters hypothesized by the analyst; 2) deciding upon the observations to be made across all the objects; 3) taking the measurements, scores, rankings, labellings, etc. of each object for each observation; 4) choosing a measure of "distance" (dissimilarity) or, alternatively, a measure of association (similarity) appropriate to the case; 5) computing the distances (or associations) between each object and every other with respect to the observations; and finally, 6) applying the clustering algorithm to the triangular matrix of distances or associations resulting from step 5. The clustering procedures may be "agglomerative" (beginning with one object and iteratively joining other objects to it to form a cluster), or "divisive" (starting with all the objects in one big group and successively splitting them into dissimilar subgroups until no further splits can be made). Within the agglomerative methods, there are further choices among methods of linking new objects to existing clusters: "single linkage" methods focus upon the dissimilarity between nearest neighbors in a cluster, "complete linkage" methods focus upon the dissimilarity between the farthest neighbors, and "average linkage" methods upon the average dissimilarity among neighbors. As might be expected, some methods are better at stringing out objects in long, thin chains, while others are better at dealing with globular clumps.
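As a rough illustration of stages 4 through 6 and of the three linkage rules just described, here is a minimal Python sketch. It is not one of the programs used in this study. The function name agglomerate, the page labels, and the two-dimensional scores are all invented for the example, and Euclidean distance is simply assumed as the dissimilarity measure. The sketch follows the common variant in which every object starts as its own cluster and the two closest clusters are merged at each step.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters, linkage="average"):
    """Merge the closest pair of clusters until n_clusters remain.

    points: dict mapping an object label to a tuple of scores.
    linkage: "single" (nearest neighbors), "complete" (farthest
    neighbors), or "average" (mean distance over all cross pairs).
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda ds: sum(ds) / len(ds),
    }[linkage]

    # Stages 4 and 5: Euclidean distance between every pair of objects
    # (the "triangular matrix", stored here as a dict keyed by pair).
    d = {frozenset(pair): dist(points[pair[0]], points[pair[1]])
         for pair in combinations(points, 2)}

    def gap(cluster_pair):
        a, b = cluster_pair
        return combine([d[frozenset((x, y))] for x in a for y in b])

    # Stage 6: start with one cluster per object, then merge repeatedly.
    clusters = [frozenset([label]) for label in points]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=gap)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Invented scores for four hypothetical pages, for illustration only.
pages = {"p1": (0.0, 0.0), "p2": (0.0, 1.0), "p3": (5.0, 5.0), "p4": (5.0, 6.0)}
for cluster in agglomerate(pages, 2, linkage="single"):
    print(sorted(cluster))
```

With data this cleanly separated, all three linkage rules give the same two clusters; the rules differ mainly on chained or overlapping data.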

Unavoidably, as is the case with many sophisticated statistical tools, there is a real danger of imposing spurious structure upon the data if the techniques are badly chosen or inappropriately used. A factor analysis or multiple correlation method will find "factors" or "vectors" of some sort in any data, however difficult they may be to interpret or put to use. Similarly, a cluster analysis will always find some "clusters", and it is up to us to pay attention to the indications of significance (the relative compactness of the clusters, the strength of their internal bonds, and the relative distance between clusters) as shown by the statistical measures which the programs provide as a part of the printout. The interpretation of results is unavoidably circular; we propose a certain structure in the group of objects under study, we perform the computation, and we are happy if we see what we expected, or at least something that makes sense in terms of our original hypothesis, however revised.
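One simple way to put numbers on "relative compactness" and "relative distance between clusters" is to compare the mean distance between points inside a cluster with the mean distance between points of two different clusters. The Python sketch below does only that. It is an illustration under my own assumptions, not the statistic printed by any of the programs mentioned in this paper; the function names and the coordinates are invented.

```python
from itertools import combinations, product
from math import dist


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def compactness(cluster):
    """Mean distance between pairs of points inside one cluster.

    The cluster must contain at least two points.
    """
    return mean(dist(p, q) for p, q in combinations(cluster, 2))


def separation(cluster_a, cluster_b):
    """Mean distance between points drawn from two different clusters."""
    return mean(dist(p, q) for p, q in product(cluster_a, cluster_b))


# Invented two-dimensional coordinates, for illustration only.
tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
loose = [(10.0, 0.0), (14.0, 0.0), (10.0, 3.0)]

print("compactness, tight:", compactness(tight))
print("compactness, loose:", compactness(loose))
print("separation:", separation(tight, loose))
```

A cluster is convincing to the extent that its compactness figure is small compared with its separation from the other clusters.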

A frequent use of cluster analysis is in studying the genetic similarities among species of plants or animals, based on some set of chemical or physiological properties. Cluster analysis has been employed in statistical pattern recognition, to discriminate printed letters, geometric shapes, or other visual forms. In archaeology, it has been used to classify groups of artifacts gathered by surface collection over a site; clusters of similar objects concentrated within certain areas of a site were found to indicate different human activities (e.g., "women's" activities: cooking, spinning, making pots, vs "men's" activities: weapons manufacture, hunting, herding). A particularly interesting application of cluster analysis to Egyptian archaeology deserves special mention [4]. It seems worth describing at more length, since it demonstrates so dramatically the usefulness of these tools when appropriately applied.

In 1898, a large cache of Egyptian royal mummies was found in the Valley of the Kings. These mummies, having been plundered and vandalized by tomb robbers, had then been gathered together by a later Egyptian ruler, rewrapped, and deposited in two new hiding places. In the process of reburial, the identities of certain mummies were obscured (at least for the modern archaeologist). One in particular, referred to by archaeologists as the "elder lady", or more objectively, as Egyptian Museum Catalog Number 61070, was of special interest since it appeared from certain evidence (e.g., the position of the hands) to be that of a queen. It was suggested that this lady might be Queen Hatshepsut or Queen Tiye (mother of the heretical pharaoh Akhenaton). Various coordinated studies were made, including data from conventional full-body x-rays, standardized x-rays of the head known as "cephalograms", and scanning electron microprobe comparison of hair samples known to be from Queen Tiye and those from the unidentified mummy. Several different cluster analysis algorithms were applied to sets of scores obtained from cephalogram studies of the unknown lady and ten other mummies of Egyptian queens. The analysis showed clearly that the head measurements of the unknown matched those of Queen Tiye's mother more closely than those of any other queen. This finding was strongly supported by the close match between the hair samples known to belong to Queen Tiye (and obtained from a keepsake in the tomb of another family member) and hair from the unidentified "elder lady".

Application to Voynich Manuscript Pages

In the first exploratory study, described in an earlier version of this paper, I applied four different clustering programs and one multiple scaling program to data from 40 manuscript pages chosen from the herbal section, said by Currier to represent language A-hand 1 and language B-hand 2, and from the biological section, language B-hand Z. Of all the programs I tried, the MINISSA multiple scaling program seemed to produce the clearest and most interesting results. All of the programs showed "Biological-B" to be the most cohesive cluster, and all discriminated most language B pages pretty clearly from language A. Since these results were based on simple monographic frequency counts of the initial 350 to 400 symbols on each page, without any of the more detailed digraphic or trigraphic properties or other positional properties of languages A and B described by Currier, I considered them to be encouraging. I undertook a more extensive study, still using monographic frequencies, and including pages from other sections and some other hands, for a total of 62 pages, and using the MINISSA program alone. This paper describes that study.
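For readers who want to see what a monographic frequency count amounts to in practice, the following Python sketch counts each symbol of a chosen alphabet in the first few hundred symbols of a page. It is a minimal illustration, not the procedure actually used to prepare my data: the function name, the seven-symbol alphabet, and the sample string are invented, and the rule that spaces and unlisted characters are skipped without counting toward the limit is my assumption for the example.

```python
from collections import Counter


def monographic_counts(text, alphabet, limit=400):
    """Count each symbol of `alphabet` in the first `limit` symbols of `text`.

    Spaces and any characters not in `alphabet` are skipped and do not
    count toward the limit (an assumption made for this sketch).
    """
    symbols = [ch for ch in text if ch in alphabet][:limit]
    counts = Counter(symbols)
    # One count per symbol, in alphabet order, so that every page
    # yields a vector of the same length.
    return [counts[ch] for ch in alphabet]


# Invented sample string and a hypothetical 7-symbol alphabet.
alphabet = "OAE89SR"
print(monographic_counts("SOE 8AR OE89", alphabet))  # [2, 1, 2, 2, 1, 1, 1]
```

Each page then contributes one vector of counts, and the distances between these vectors are what the clustering and scaling programs work from.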

Results of the MINISSA Multiple Scaling Program

The MINISSA (Michigan-Israel-Netherlands Integrated Smallest Space Analysis) program was run on my enlarged data set by a colleague, Mr. Douglas Cope, whose help both in running the programs and aiding me in interpreting the outputs I greatly appreciate. The program employed a somewhat different statistical model of the data; instead of finding clusters of objects in an abstract "space", it mapped the "locations" of the objects within such a space: a "Euclidean metric space", whose two dimensions could be assigned a meaning appropriate to the hypothesis held by the investigator. I supplied 62 sets of frequency counts, for the 24 common symbols shown in Currier's transcription alphabet, taken from the first 350 to 400 characters on each of 62 pages. Figure 1a shows a list of these pages, with Currier's hand and language identifications and the presumed "subject matter" based on the nature of the pictures on the page. I made a second run, with 38 pages chosen from Herbal A and B only, to explore further the contrast between the two languages and four different hands within the herbal section of the manuscript (that is, presumably, holding the "subject matter" variable constant, if the pictures in fact have some relationship to the content of the text--an assumption questioned by some students). I hoped that I might reveal some clustering of the different "hands" within the two languages, as well as showing more clearly the separation between languages A and B. Figure 1b shows the pages included in this second study.
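The idea of mapping objects to "locations" in a two-dimensional Euclidean space can be illustrated with classical (metric) multidimensional scaling. This is a deliberately simpler, swapped-in technique and not the MINISSA algorithm, which as I understand it fits only the rank order of the dissimilarities. The Python sketch below assumes Euclidean input distances, uses plain power iteration to find the two leading eigenvectors, and works on four invented points; the function name and parameters are my own.

```python
import random
from math import dist, sqrt


def classical_mds(d, dims=2, sweeps=500, seed=0):
    """Place n objects in `dims` dimensions from an n-by-n distance matrix."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    grand = sum(row) / n
    # Double-center the squared distances to get an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
         for i in range(n)]

    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _axis in range(dims):
        # Power iteration converges on the dominant eigenvector of b.
        v = [rng.random() for _ in range(n)]
        for _sweep in range(sweeps):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n))  # eigenvalue estimate
        scale = sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(v[i] * scale)
        # Deflate b so that the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Four invented points; only their mutual distances are given to the routine.
pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]
d = [[dist(p, q) for q in pts] for p in pts]
for xy in classical_mds(d):
    print([round(c, 3) for c in xy])
```

The recovered map reproduces the original distances, but its orientation is arbitrary: it may be rotated or reflected, which is why the axes of such a plot have no built-in meaning and must be interpreted by the investigator.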

62-Point Data Set: Several Hands, Two Languages

Figure 2 shows a drawing adapted from the "vector plot" produced by the MINISSA program for the inclusive 62-point data set. The vertical axis seems to be related to the "language" distinction, though the actual separation between the A and B data points is a diagonal line, indicated by dashes on the figure, slanting from upper left to the lower extremity of the vertical axis. The horizontal axis does not seem to have much meaning in the "map", as about a third of the data points are above it, the rest below it. It may have a meaning with respect to "Biological B, hand Z", all of whose data points are relatively tightly grouped below the horizontal axis. Biological-B is the tightest, most significant cluster by far, as was the case in all the earlier exploratory studies I made. These pages probably represent the most homogeneous subset of text, and a good subject for analytic study.

The language A pages, from the herbal and pharmaceutical sections, are scattered over the left third of the plot, and do not present a very cohesive picture. The horizontal axis may have some meaning with respect to hand 4 as opposed to hand 1, since all the hand 4 folios (with some question as to hand for pharmaceutical-A) lie below that axis. Only two out of 38 language B data points lie slightly to the left of the diagonal line separating the rest from language A, and not one of the 24 A data points lies over the line on the B side. The B data points, from the cosmological, herbal, biological, and pharmaceutical sections, are scattered pretty widely over the right-hand side of the plot. It would appear that the two statistical languages are clearly distinct, but we cannot say much about distinctions among the hands, based on the monographic frequency counts alone.

38-Point Data Set: Two Hands for Each Language

Figure 3 shows the plot for the more restricted data set. The vertical axis appears to separate the two languages very well; only one herbal-A page lies very slightly on the "B" side of the axis, and no herbal-B page lies on the "A" side. All the hand 4 pages fall below the horizontal axis again, but the other hands are all on both sides of it, and hands 1 and 2 are widely scattered over the space. The four hand 5 pages are all to the left of the hand 2 pages in herbal B. Again, it appears that the monographic frequency counts give a clear separation of the two languages, but do not tell us much about the "subjects" or the hands within languages.

Conclusions

A clear separation was found between pages assigned by Currier to his statistical "languages" A and B. Biological-B pages formed the most compact cluster in all the plots. There was little indication of a separation between pages from different sections of the manuscript with different kinds of pictures. There was some evidence for a separation of hand 4 from hand 1 within the herbal section, language A, and hand 5 from hand 2 within the herbal section, language B.

Some Suggestions for Future Research

If all the pages were in machine-readable form, it might be worthwhile to make more runs contrasting various sets of pages, with more complete samples of each hand and "subject", and using today's presumably more powerful programs. A set of observations more closely related to the specific statistical contrasts between languages A and B might be developed, and used to produce plots that might verify Currier's identifications and, perhaps, identify the "language" of additional pages. The discriminating characteristics mentioned by Currier in his 1976 paper (pp. 63-65) were as follows:

     Herbal A                                    Herbal B
                                                          
     final {89} rare                       final {89} very frequent
     {SOE}, {SOR} common groups            {SOE}, {SOR} relatively rare
     {SAN}, {SAM} moderately common        {SAN}, {SAM} rare
     initial {SOP} common                  initial {SOP} rare
     initial {Q} common                    initial {Q} rare
     few "unattached finals"               many "unattached finals"
           ({AD AN AM A3; AJ AK AL A5; AE AG AH A1; AR AT AU A0}, and
           the same series with initial {O} instead of {A}, surrounded
           by spaces)

     Characteristics especially strong in Biological-B:

     given a group beginning with {4O}, the preceding group almost
                   always ends in {9}.

     given a group beginning with {S/Z}, the preceding group often
                   ends in {E/N}.

     initial {S/Z} are less frequent at the beginning of lines.

I would like to see studies using counts of the occurrences of these features as input to clustering or multiple scaling programs as a start. Capt. Currier can probably suggest additional characteristics if he has made further studies since his earlier paper. In addition, we may be able to find additional discriminating features by carrying out exhaustive counts and analyses on sets of pages within the entire corpus of text, and within subsets indicated by clusters we may find.
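To suggest how such feature counts might be gathered, here is a minimal Python sketch covering a few of the contrasts listed above. It rests on several assumptions of mine: that a page is available as a list of text groups in Currier's transcription alphabet, that {SOE}, {SOR}, {SAN} and {SAM} are to be counted as whole groups, and that line breaks can be ignored. The function names and the sample groups are invented and are not taken from any real transcription.

```python
def currier_features(groups):
    """Count a few of Currier's A/B contrasts in one page of text groups."""
    return {
        "final_89": sum(g.endswith("89") for g in groups),
        "SOE_or_SOR": sum(g in ("SOE", "SOR") for g in groups),
        "SAN_or_SAM": sum(g in ("SAN", "SAM") for g in groups),
        "initial_SOP": sum(g.startswith("SOP") for g in groups),
        "initial_Q": sum(g.startswith("Q") for g in groups),
        "groups": len(groups),
    }


def share_preceded_by_9(groups):
    """Of the groups beginning with 4O, the share whose predecessor ends in 9.

    Returns None when no group beginning with 4O has a predecessor.
    """
    pairs = [(prev, cur) for prev, cur in zip(groups, groups[1:])
             if cur.startswith("4O")]
    if not pairs:
        return None
    return sum(prev.endswith("9") for prev, _ in pairs) / len(pairs)


# Invented groups, for illustration only.
sample = "4OFC89 SOE 4OFAN SC89 4OE SOR Q89".split()
print(currier_features(sample))
print(share_preceded_by_9(sample))  # 0.5
```

Dividing each count by the number of groups on the page would give rates that can be compared across pages of different length before they are passed to a clustering or scaling program.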

References

1. Anderberg, M.R., Cluster Analysis for Applications, Academic Press, New York, 1973.

2. D'Imperio, M.E., "New Research on the Voynich Manuscript: Proceedings of a Seminar," Washington, DC, 30 November 1976.

3. Everitt, B.S., Cluster Analysis, Halsted Press, 1974.

4. Harris, J.E., et al., "Mummy of the 'Elder Lady' in the Tomb of Amenhotep II: Egyptian Museum Catalog Number 61070," Science, vol. 200, 9 June 1978, pp. 1149-1151.

        Fig. 1a - Pages Included in 62-Point Data Set

LANGUAGE A (24)

Herbal-A, Hand 1 (16)
        1R, 3R, 8R, 17R, 24R, 21R, 22R, 35V, 37V, 49R, 56V, 42R, 30R,
        56R, 49V

Herbal-A, Hand 4 (6)
        87R, 87V, 90R1, 90V1, 96R, 96V

Pharmaceutical-A, Hand 4 (2)
        99V, 100R

LANGUAGE B (38)

Herbal-B, Hand 2 (12)
        31R, 39R, 34R, 39V, 31V, 34V, 41R, 43R, 43V, 46R, 46V, 48V

Biological-B, Hand Z (12)
        75R, 79V, 76R, 77R, 75V, 76V, 77V, 78R, 78V, 79R, 80R, 80V

Recipes (Star-Paragraphs)-B, Hands X and Y (8)
        103R, 103V, 104R, 104V, 105R, 105V, 106R, 106V

Cosmological-B, Hand 3 (2)
        85/86V4, 85/86V3


        Fig. 1b - Pages Included in 38-Point Data Set

LANGUAGE A (22)

Herbal-A, Hand 1 (16)
        1R, 3R, 8R, 17R, 24R, 21R, 22R, 35V, 37V, 49R, 56V, 42R, 30R,
        56R, 49V

Herbal-A, Hand 4 (6)
        87R, 87V, 90R1, 90V1, 96R, 96V

LANGUAGE B (16)

Herbal-B, Hand 2 (12)
        31R, 39R, 34R, 39V, 31V, 34V, 41R, 43R, 43V, 46R, 46V, 48V

Herbal-B, Hand 5 (4)
        94R, 94V, 95R1, 95V1


        [Graphs for Fig. 2 and Fig. 3 are not reproduced here. --- JJG]