M. D'Imperio 4000 Cathedral Ave. NW #106B Washington, DC 20016 28 January 1992
[This paper is an extensively revised and updated version of an earlier paper in an in-house technical journal, dated 20 June 1978. It includes corrected and expanded data sets.]
On 30 November 1976, a one-day seminar entitled "New Research on the Voynich Manuscript" was held in the Washington, DC metropolitan area. Two detailed presentations by Captain Prescott H. Currier provided high points of that occasion. In them, and in the supporting paper printed as Appendix A of the Proceedings, he set forth his theory that there were several different scribes involved in the production of the Voynich Manuscript, and that their individuality was attested not only by characteristic "hands", reliably distinguishable by eye, but also by statistically distinct "languages". If this hypothesis could be confirmed, it would provide students of the manuscript with an important new insight into the problem. This paper describes the results of an exploratory study of Currier's theory using cluster analysis and multiple scaling.
"The first twenty-five folios in the herbal section are obviously in one hand and one 'language', which I call 'A'....The second twenty-five folios are in two hands, very obviously the work of at least two different men [A and B]. In addition to this fact, the text of this second portion of the herbal section (that is, the next twenty-five or thirty folios) is in two 'languages' [A and B], and each 'language' is in its own hand. This means that, there being two authors of the second part of the herbal section, each one wrote in his own 'language'....Now with this information available, I went through the rest of the manuscript...and in four other places I discovered the same phenomena I associated with 'language' B.... The biological section is all in one 'language' (B) and one hand." [2, p. 20 ff.]
He has found indications of different hands and the two different "languages" in other sections of the manuscript (the pharmaceutical, astrological, and "recipe" or "star-paragraph" sections). His identifications of hands and "languages" are summarized in the Seminar Proceedings [2, p. 68]. Currier's use of the word "language" in this context is not intended to imply different underlying natural languages (e.g., Greek as against Latin, or German as against French). He is referring to patterns of statistical characteristics that seem to be consistently associated with the hand of one writer as contrasted with the hand of the other writer. Certain symbols are more likely to occur together or to appear in certain positions of the text groups in folios showing one hand than in folios showing the other. In summing up his findings, Currier states that he feels quite certain of at least five, and perhaps as many as eight, different hands in the manuscript as a whole, but only two statistical "languages".
Several of us, after attending his presentation, confirmed Currier's suggestions to our own satisfaction by replicating his procedure of comparing pages in the herbal section (where the contrasts between the hand of Scribe 1 and that of Scribe 2 are especially striking), and verifying both the consistent differences in style of writing and the accompanying difference in occurrences of certain symbols. Nevertheless, since so many other approaches to the problems posed by the Voynich Manuscript have been fraught with subjectivity and self-delusion, it seemed important to place Currier's findings on a more objectively demonstrable and secure basis, and to attempt to verify them by an independent statistical study. If in fact it can be shown that different scribes used the writing system in consistently distinct ways, we may be able to exploit these differences to tell us something about the way the system worked and how it was constructed.
There are numerous ways of carrying out cluster analysis, and the published computer programs embody various combinations of these, considered by their designers to offer some special advantage for certain applications. In general, however, the analysis involves the following stages: 1) deciding upon a group of objects that constitutes a good sample of the groupings or clusters hypothesized by the analyst; 2) deciding upon the observations to be made across all the objects; 3) taking the measurements, scores, rankings, labellings, etc. of each object for each observation; 4) choosing a measure of "distance" (dissimilarity) or, alternatively, a measure of association (similarity) appropriate to the case; 5) computing the distances (or associations) between each object and every other with respect to the observations; and finally, 6) applying the clustering algorithm to the triangular matrix of distances or associations resulting from step 5. The clustering procedures may be "agglomerative" (beginning with one object and iteratively joining other objects to it to form a cluster), or "divisive" (starting with all the objects in one big group and successively splitting them into dissimilar subgroups until no further splits can be made). Within the agglomerative methods, there are further choices among methods of linking new objects to existing clusters: "single linkage" methods focus upon the dissimilarity between nearest neighbors in a cluster, "complete linkage" methods focus upon the dissimilarity between the farthest neighbors, and "average linkage" methods upon the average dissimilarity among neighbors. As might be expected, some methods are better at stringing out objects in long, thin chains, while others are better at dealing with globular clumps.
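Stages 4 through 6 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in this study: the Euclidean distance, the invented frequency vectors, and the names `euclidean`, `distance_matrix`, `linkage_distance`, and `agglomerate` are all introduced here for the example. The sketch follows the common agglomerative scheme in which every object starts as its own cluster and the two closest clusters are merged at each step.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Stage 4: one possible dissimilarity between two observation vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(objects):
    """Stage 5: triangular matrix of distances, keyed by index pair (i, j), i < j."""
    return {(i, j): euclidean(objects[i], objects[j])
            for i, j in combinations(range(len(objects)), 2)}


def linkage_distance(c1, c2, dist, method):
    """Dissimilarity between two clusters under the chosen linkage rule."""
    pairs = [dist[(min(i, j), max(i, j))] for i in c1 for j in c2]
    if method == "single":      # nearest neighbors
        return min(pairs)
    if method == "complete":    # farthest neighbors
        return max(pairs)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(objects, n_clusters, method="average"):
    """Stage 6: merge the two closest clusters until n_clusters remain."""
    dist = distance_matrix(objects)
    clusters = [[i] for i in range(len(objects))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(clusters[p[0]], clusters[p[1]], dist, method),
        )
        clusters[a] = clusters[a] + clusters[b]   # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Invented relative frequencies of three symbols on six pages (not real counts).
pages = [
    (0.10, 0.30, 0.60), (0.12, 0.28, 0.60), (0.09, 0.33, 0.58),
    (0.40, 0.10, 0.50), (0.42, 0.08, 0.50), (0.38, 0.12, 0.50),
]
result = agglomerate(pages, n_clusters=2, method="average")
print(result)
```

With well-separated toy data such as this, all three linkage rules recover the same two groups; the rules begin to disagree when the objects form chains or overlapping clumps.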
Unavoidably, as is the case with many sophisticated statistical tools, there is a real danger of imposing spurious structure upon the data if the techniques are badly chosen or inappropriately used. A factor analysis or multiple correlation method will find "factors" or "vectors" of some sort in any data, however difficult they may be to interpret or put to use. Similarly, a cluster analysis will always find some "clusters", and it is up to us to pay attention to the indications of significance (the relative compactness of the clusters, the strength of their internal bonds, and the relative distance between clusters) as shown by the statistical measures which the programs provide as a part of the printout. The interpretation of results is unavoidably circular; we propose a certain structure in the group of objects under study, we perform the computation, and we are happy if we see what we expected, or at least something that makes sense in terms of our original hypothesis, however revised.
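One simple way to look at compactness against separation is to compare the mean distance between objects in the same cluster with the mean distance between objects in different clusters. The sketch below is a stand-in chosen for illustration, not one of the measures printed by the programs discussed here; the function name `compactness_and_separation`, the Euclidean distance, and the four sample points are assumptions.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two observation vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compactness_and_separation(objects, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i, j in combinations(range(len(objects)), 2):
        d = euclidean(objects[i], objects[j])
        (within if labels[i] == labels[j] else between).append(d)
    if not within or not between:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return mean(within), mean(between)


# Invented points: two tight pairs lying far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

w_good, b_good = compactness_and_separation(points, ["A", "A", "B", "B"])
w_bad, b_bad = compactness_and_separation(points, ["A", "B", "A", "B"])

ratio_good = w_good / b_good   # small: clusters are compact and far apart
ratio_bad = w_bad / b_bad      # larger than 1: the grouping cuts across the data
print(ratio_good, ratio_bad)
```

A ratio well below 1 suggests compact, well-separated clusters; a ratio near or above 1 suggests that the proposed grouping is not supported by the distances.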
A frequent use of cluster analysis is in studying the genetic similarities among species of plants or animals, based on some set of chemical or physiological properties. Cluster analysis has been employed in statistical pattern recognition, to discriminate printed letters, geometric shapes, or other visual forms. In archaeology, it has been used to classify groups of artifacts gathered by surface collection over a site; clusters of similar objects concentrated within certain areas of a site were found to indicate different human activities (e.g., "women's" activities: cooking, spinning, making pots, vs "men's" activities: weapons manufacture, hunting, herding). A particularly interesting application of cluster analysis to Egyptian archaeology deserves special mention [4]. It seems worth describing at more length, since it demonstrates so dramatically the usefulness of these tools when appropriately applied.
In 1898, a large cache of Egyptian royal mummies was found in the Valley of the Kings. These mummies, having been plundered and vandalized by tomb robbers, had then been gathered together by a later Egyptian ruler, rewrapped, and deposited in two new hiding places. In the process of reburial, the identities of certain mummies were obscured (at least for the modern archaeologist). One in particular, referred to by archaeologists as the "elder lady", or more objectively, as Egyptian Museum Catalog Number 61070, was of special interest since it appeared from certain evidence (e.g., the position of the hands) to be that of a queen. It was suggested that this lady might be Queen Hatshepsut or Queen Tiye (mother of the heretical pharaoh Akhenaton). Various coordinated studies were made, including data from conventional full-body x-rays, standardized x-rays of the head known as "cephalograms", and scanning electron microprobe comparison of hair samples known to be from Queen Tiye and those from the unidentified mummy. Several different cluster analysis algorithms were applied to sets of scores obtained from cephalogram studies of the unknown lady and ten other mummies of Egyptian queens. The analysis showed clearly that the head measurements of the unknown matched those of Queen Tiye's mother more closely than those of any other queen. This finding was strongly supported by the close match between the hair samples known to belong to Queen Tiye (and obtained from a keepsake in the tomb of another family member) and hair from the unidentified "elder lady".
The language A pages, from the herbal and pharmaceutical sections, are scattered over the left third of the plot, and do not present a very cohesive picture. The horizontal axis may have some meaning with respect to hand 4 as opposed to hand 1, since all the hand 4 folios (with some question as to hand for Pharmaceutical-A) lie below that axis. Only two out of 38 language B data points lie slightly to the left of the diagonal line separating the rest from language A, and not one of the 24 A data points lies over the line on the B side. The B data points, from the cosmological, herbal, biological, and pharmaceutical sections, are scattered pretty widely over the right-hand side of the plot. It would appear that the two statistical languages are clearly distinct, but we cannot say much about distinctions among the hands, based on the monographic frequency counts alone.
Herbal A                            Herbal B
final {89} rare                     final {89} very frequent
{SOE}, {SOR} common groups          {SOE}, {SOR} relatively rare
{SAN}, {SAM} moderately common      {SAN}, {SAM} rare
initial {SOP} common                initial {SOP} rare
initial {Q} common                  initial {Q} rare
few "unattached finals"             many "unattached finals"

("Unattached finals": {AD AN AM A3; AJ AK AL A5; AE AG AH A1; AR AT AU A0}, and the same series with initial {O} instead of {A}, surrounded by spaces.)

Characteristics especially strong in Biological-B:
  - Given a group beginning with {4O}, the preceding group almost always ends in {9}.
  - Given a group beginning with {S/Z}, the preceding group often ends in {E/N}.
  - Initial {S/Z} are less frequent at the beginning of lines.

I would like to see studies using counts of the occurrences of these features as input to clustering or multiple scaling programs as a start. Capt. Currier can probably suggest additional characteristics if he has made further studies since his earlier paper. In addition, we may be able to find additional discriminating features by carrying out exhaustive counts and analyses on sets of pages within the entire corpus of text, and within subsets indicated by clusters we may find.
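Counts of this kind are straightforward to produce once a page has been transcribed. The sketch below shows one way it might be done. It assumes a transcription in which each symbol is a single character and text groups are separated by spaces, treats {SOE}, {SOR}, {SAN}, and {SAM} as whole groups, and looks for the "preceding group" only within the same line. The function names and the sample strings are invented for illustration; the strings are not actual manuscript text.

```python
def page_features(lines):
    """Proportion of text groups on one page showing each listed feature.

    `lines` is a list of strings, one per manuscript line, with text groups
    separated by spaces (an assumed transcription format).
    """
    groups = [g for line in lines for g in line.split()]
    total = len(groups)
    if total == 0:
        raise ValueError("page has no text groups")
    counts = {
        "final_89": sum(g.endswith("89") for g in groups),
        "SOE_or_SOR": sum(g in ("SOE", "SOR") for g in groups),
        "SAN_or_SAM": sum(g in ("SAN", "SAM") for g in groups),
        "initial_SOP": sum(g.startswith("SOP") for g in groups),
        "initial_Q": sum(g.startswith("Q") for g in groups),
    }
    # Use proportions rather than raw counts so pages of different length compare.
    return {name: c / total for name, c in counts.items()}


def preceded_by_final_9(lines):
    """Of groups beginning with 4O, the fraction whose preceding group ends in 9.

    Only pairs within the same line are examined (an assumption).
    Returns None when no group on the page begins with 4O after another group.
    """
    hits = cases = 0
    for line in lines:
        groups = line.split()
        for prev, cur in zip(groups, groups[1:]):
            if cur.startswith("4O"):
                cases += 1
                hits += prev.endswith("9")
    return hits / cases if cases else None


# Invented two-line "page" in the assumed transcription format.
toy_page = ["SOE 4OFCC89 SAN", "QOE SC89 4OFAN SOR"]

features = page_features(toy_page)
conditional = preceded_by_final_9(toy_page)
print(features)
print(conditional)
```

Each page then yields a short vector of proportions, which can serve as the observations for a distance computation and a clustering or scaling run.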
2. D'Imperio, M.E., "New Research on the Voynich Manuscript: Proceedings of a Seminar," Washington, DC, 30 November 1976.
3. Everitt, B.S., Cluster Analysis, Halsted Press, 1974.
4. Harris, J.E., et al., "Mummy of the 'Elder Lady' in the Tomb of Amenhotep II: Egyptian Museum Catalog Number 61070," Science, vol. 200, 9 June 1978, pp. 1149-1151.
Fig. 1a - Pages Included in 62-Point Data Set

LANGUAGE A (24)
  Herbal-A, Hand 1 (16): 1R, 3R, 8R, 17R, 24R, 21R, 22R, 35V, 37V, 49R, 56V, 42R, 30R, 56R, 49V
  Herbal-A, Hand 4 (6): 87R, 87V, 90R1, 90V1, 96R, 96V
  Pharmaceutical-A, Hand 4 (2): 99V, 100R

LANGUAGE B (38)
  Herbal-B, Hand 2 (12): 31R, 39R, 34R, 39V, 31V, 34V, 41R, 43R, 43V, 46R, 46V, 48V
  Herbal-B, Hand Z (12): 75R, 79V, 76R, 77R, 75V, 76V, 77V, 78R, 78V, 79R, 80R, 80V
  Recipes (Star-Paragraphs)-B, Hands X and Y (8): 103R, 103V, 104R, 104V, 105R, 105V, 106R, 106V
  Cosmological-B, Hand 3 (2): 85/86V4, 85/86V3

Fig. 1b - Pages Included in 38-Point Data Set

LANGUAGE A (22)
  Herbal-A, Hand 1 (16): 1R, 3R, 8R, 17R, 24R, 21R, 22R, 35V, 37V, 49R, 56V, 42R, 30R, 56R, 49V
  Herbal-A, Hand 4 (6): 87R, 87V, 90R1, 90V1, 96R, 96V

LANGUAGE B (16)
  Herbal-B, Hand 2 (12): 31R, 39R, 34R, 39V, 31V, 34V, 41R, 43R, 43V, 46R, 46V, 48V
  Herbal-B, Hand 5 (4): 94R, 94V, 95R1, 95V1

[Graphs not reproduced for Fig. 2 and Fig. 3; maybe we should scan them and convert to Postscript somehow --- JJG]