15 March 2012

This series is intended to provide helpful insights into the Materialytics Sequencing Station (M2S®). Beginning in 2002, we wanted to determine whether it was possible to trace a gemstone back to its host country (to “determine its provenance”) and, ideally, to its host mine. What we learned has significantly changed our approach to this and related questions.

We started work on material identification and sourcing by using laser spectroscopy, a relatively new technology, to determine the provenance of gems given no documentation, just the stones themselves. Actually, the idea of laser spectroscopy is almost as old as lasers, but practical systems had to wait for the development of a number of electronic and information-processing technologies.

One reason for starting with gems was that a number of pioneers, notably Eduard Gübelin, had, with great effort, created a large, valuable body of literature on determining the origin of gems. These pioneers realized that gems were not only chemically distinctive, but also contained inclusions that are clues to their origin. Researchers in the field made the most of this insight, compiling data over the decades and developing steadily better techniques for chemical analysis of stones and comparison of their inclusions.

In the beginning, we expected to build on the work of others to advance geochemical analysis. We looked for key indicators that seemed important: telltale ratios of elements, and inclusions that occurred only in association with certain sites. We did pretty well, using laser spectroscopy as our tool and tweaking conventional mathematical analysis and comparison techniques to squeeze the most out of the data. Even so, we didn’t really do better than anybody else; our results were not practically useful.

For one thing, the volume of data produced by our instruments far exceeded our data processing capabilities. Like everyone before us, we had to use expert human opinion (ours, and that of people who know a lot more about geochemistry than we do) to decide which data were important to process and which could safely be discarded. That decision had to be made for every class of samples, creating a new model each time to reduce the work to a manageable amount.

That was a fatal flaw: all conclusions rested ultimately on somebody’s expert opinion, not on objective observation. We don’t know, and may never know, what is important and what is not. We backed off, and came at it from a different angle. If, in fact, materials from different sites are consistently different, and materials from the same site (the same mine, district, country, or region…take your pick of scale) are consistently similar, then the information necessary to determine origin must be available in the analytical data. We know we’re looking at it in the graphs our systems draw, but where is it? The graphs are full of what looks like “noise” that is typically discarded at once. Is the information in that noise? Is it in the relationships between areas of noise, or between the apparent noise and the data we thought were important?

The key insight is that we don’t have to know. We just compare signals with signals, whether we understand them or not. That is, we store in a Reference Database the mathematical descriptions of a large number of samples whose provenance is well documented. (That’s a LARGE number, the bigger the better…thousands or tens of thousands when possible.) The description comprises everything in the data collected from each sample, the noise, the oddities we can’t account for, everything. We then analyze an unknown sample, develop a mathematical description of it in the same way, and compare that description with every description in the Reference Database. If the “unknown” matches a “known,” we assume that the samples came from the same place.
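As a minimal sketch of this compare-everything idea (not the M2S implementation), the Python below keeps every channel of each description and returns the origin of the closest reference. The cosine similarity measure, the helper names (`cosine_similarity`, `best_match`, `synthetic_site`, `measure`), the site labels, and the randomly generated data are all assumptions made for illustration; the text above does not say how descriptions are built or compared.

```python
import math
import random


def cosine_similarity(a, b):
    """Similarity of two equal-length descriptions (assumed measure, not from the source)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(unknown, reference_db):
    """Compare the unknown with every reference entry; return (origin, score) of the closest."""
    best_origin, best_score = None, -1.0
    for origin, description in reference_db:
        score = cosine_similarity(unknown, description)
        if score > best_score:
            best_origin, best_score = origin, score
    return best_origin, best_score


# Synthetic stand-in data: each hypothetical site has a 200-channel signature,
# and every measurement of that site is the signature plus random jitter.
rng = random.Random(0)


def synthetic_site(n_channels=200):
    return [rng.random() for _ in range(n_channels)]


def measure(signature, jitter=0.05):
    return [value + rng.gauss(0.0, jitter) for value in signature]


sites = {name: synthetic_site() for name in ("site_A", "site_B", "site_C")}

# Reference Database: many documented samples per site, nothing discarded.
reference_db = [
    (name, measure(signature))
    for name, signature in sites.items()
    for _ in range(20)
]

# An "unknown" sample that, in this toy setup, really comes from site_B.
unknown = measure(sites["site_B"])
origin, score = best_match(unknown, reference_db)
print(origin, round(score, 3))
```

Note that no channel is singled out as important anywhere in the sketch: the comparison runs over the whole description, which is the point of the approach.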

That assumption is based on further empirical indications. So far, in blind tests, when Reference Samples are re-tested as unknowns, they always match themselves in the Reference Database, and samples known to come from a given deposit match Reference Samples from that same deposit. Average accuracy over a wide range of materials in all testing exceeds 98%. Accuracy rises as the Reference Database grows.
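One way such a check could be scored, sketched under stated assumptions: re-test each reference entry as an unknown against the full database (the self-match test), then again with that entry held out, so it can only match other samples from its deposit. The hold-one-out procedure, the cosine similarity, the function names, and the synthetic data are illustrative assumptions, not the protocol behind the figures quoted above.

```python
import math
import random


def similarity(a, b):
    """Cosine similarity between two descriptions (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def classify(unknown, reference_db):
    """Return the origin label of the most similar reference entry."""
    return max(reference_db, key=lambda entry: similarity(unknown, entry[1]))[0]


def self_match_rate(reference_db):
    """Fraction of entries that match their own origin when re-tested as unknowns."""
    hits = sum(classify(desc, reference_db) == origin for origin, desc in reference_db)
    return hits / len(reference_db)


def held_out_accuracy(reference_db):
    """Fraction of entries matched to the right origin with the entry itself removed."""
    hits = 0
    for i, (origin, desc) in enumerate(reference_db):
        others = reference_db[:i] + reference_db[i + 1:]
        hits += classify(desc, others) == origin
    return hits / len(reference_db)


# Synthetic stand-in database: three hypothetical deposits, ten samples each.
rng = random.Random(1)
signatures = {
    name: [rng.random() for _ in range(200)]
    for name in ("deposit_1", "deposit_2", "deposit_3")
}
reference_db = [
    (name, [v + rng.gauss(0.0, 0.05) for v in sig])
    for name, sig in signatures.items()
    for _ in range(10)
]

self_rate = self_match_rate(reference_db)
held_out = held_out_accuracy(reference_db)
print(self_rate, held_out)
```

The numbers this prints describe only the toy data and say nothing about real materials.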

At no point in this process do we decide what is important, and what is not. No models are made. The system is not told what to look for. The data guide the process. 

This approach applies not only to gems and other geomaterials (including meteorites; note that much of the world’s nickel is mined from the site of a large, ancient meteor strike), but to such things as manufactured items and pharmaceuticals. In a pilot study, we have even compared apples to apples and oranges to oranges, because the subject keeps coming up in jest. 

Our approach is conceptually simple, though it may be a bit difficult to grasp at first, and it is challenging to implement. Future brief explanations here will discuss that implementation.

(For clarification of terms, please review the definitions on the M2S Terminology page.)