The Overlooked Realities of Sampling Bias in the Fossil Record

Nussaibah’s talk at GSA Annual Meeting 2020

View abstract here


The estimated number of extant species on Earth is around 5 million, and around 250,000 fossil species have been described so far. The described fossil species therefore amount to only about 5% of the number of species alive today. Yet the fossil record covers billions of years of Earth’s history, whereas today’s biota is only a snapshot.

There is no doubt that the number of fossil species in a complete fossil record would far exceed the number of extant species. This is not what we see, because the fossil record is riddled with biases. These include taphonomic biases, which refer to the probability and quality of preservation of an organism; physical biases, such as the geological setting and age of the rock; and finally sampling biases, such as geographical location and human effort. Together they have produced the fossil record that we know today and that is used to compute, for example, global diversity curves. While the biological and physical biases of the fossil record are fairly well understood, the human biases are not.

The aim of this study is to quantify the socio-economic biases that also contribute to the sampling issue of fossil datasets. For this, we use the Paleobiology Database, focusing on occurrence data from publications published between 1990 and 2020. We chose this database over others because it is the one used for global diversity analyses. We are also in the process of mining affiliation data from these publications. So far, 60% of the affiliations have been compiled. We also use economic data, namely Gross Domestic Product (GDP) and the funding allocated to research, obtained from the World Bank. The English Proficiency Index, taken from the largest world ranking of adult English-speaking skills, and the Global Peace Index, as a proxy for political stability, were also used. The Global Peace Index ranges from 1 to around 3.5, with the highest values representing the least peaceful countries, i.e. countries with ongoing conflicts or wars.

We look specifically at the relationship between socio-economic factors and the countries where the fossil collections are sampled or where the researchers are based. For this we use structural equation modelling, a multivariate statistical technique that combines factor analysis and multiple regression analysis. We also look at research and collaboration networks and how they determine the sampling locations. Network analysis of this kind is a common way of analysing social structures.

Based on the Paleobiology Database, we see that most of the fossil collections are located in North America and Europe. A small portion also comes from the East Asia and Pacific region. The most sampled countries also belong to the group of high-income countries, suggesting that income group is a factor affecting fossil collection worldwide.

Of the publications in the Paleobiology Database, 82% were in English. The next most common languages were French and German. The remaining 10% are spread across 13 other languages. Peer-reviewed journals usually cater to an English-speaking audience, so it comes as no surprise that English is the dominant language here.

The number of collections per square kilometre was positively correlated with research funding: the more money a country spent on research, the more fossil collections were recorded in that country. The number of fossil collections was also negatively correlated with the Global Peace Index, i.e. it increased with political stability. There was also a correlation between research funding and English proficiency.
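
As a minimal sketch of what such a correlation check involves, the snippet below computes a Pearson correlation coefficient by hand. The talk does not say which correlation measure was used, so Pearson is an assumption here, and every number in the lists is made up for illustration only and is not data from the study. The names pearson_r, funding, collections_per_km2 and peace_index are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Made-up illustrative values, NOT data from the study.
funding = [0.2, 0.5, 1.1, 2.0, 2.8]               # hypothetical research funding
collections_per_km2 = [1.0, 1.5, 2.5, 4.0, 5.5]   # hypothetical collection densities
peace_index = [3.2, 2.9, 2.1, 1.6, 1.2]           # hypothetical GPI (higher = less peaceful)

r_funding = pearson_r(funding, collections_per_km2)
r_peace = pearson_r(peace_index, collections_per_km2)
print(f"funding vs collections: r = {r_funding:.2f}")
print(f"peace index vs collections: r = {r_peace:.2f}")
```

With toy values arranged like this, the first coefficient is positive and the second negative, which is the direction of the relationships described above. Because a higher Global Peace Index means a less peaceful country, a negative coefficient corresponds to more collections in more stable countries.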

Based on the structural equation modelling, we identify two plausible socio-economic scenarios that lead to sampling bias. In the first, GDP determines the amount of research funding allocated, which in turn determines how much fossil data can be collected. Fossil data collection is also independently influenced by political stability. In the second scenario, GDP again determines the research funding, and fossil collection is related to both research funding and English proficiency. We also identify a relationship between English proficiency and research funding, which could be due to other factors not currently taken into consideration. A strength of structural equation modelling is that it can include so-called latent variables, such as quality of life or happiness, which cannot be observed directly and are instead defined using one or several other variables. This is one of the ways that we plan to expand on this model.
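
One way to read the two scenarios is as small systems of linear equations, one per dependent variable. The notation below is an illustrative assumption and not the model specification used in the study: G stands for GDP, F for research funding, P for the Global Peace Index, E for English proficiency and C for fossil collections, with hypothetical path coefficients a, b, c and error terms ε. Political stability is taken here to act on fossil collection, and the relationship between English proficiency and research funding is written as an undirected covariance.

```latex
% Scenario 1: GDP -> funding -> collections, plus political stability -> collections
\begin{align}
  F &= a_1 G + \varepsilon_F \\
  C &= b_1 F + b_2 P + \varepsilon_C
\end{align}

% Scenario 2: GDP -> funding; funding and English proficiency -> collections
\begin{align}
  F &= a_2 G + \varepsilon_F \\
  C &= c_1 F + c_2 E + \varepsilon_C \\
  \operatorname{Cov}(E, \varepsilon_F) &= \phi
\end{align}
```

Under this reading, fitting a scenario means estimating its coefficients from the country-level data.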

Based on model selection using AIC and BIC values, we determine that the second scenario best explains the sampling issue. AIC (Akaike information criterion) and BIC (Bayesian information criterion) are scores used to compare and select models; the selected model is the one with the lowest AIC and BIC values.
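
For readers unfamiliar with these scores, the sketch below shows the standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations and ln L the maximised log-likelihood. The log-likelihoods, parameter counts and sample size are invented purely to show the mechanics and are not results from the study. The names fits, scores, best_by_aic and best_by_bic are hypothetical.

```python
from math import log

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * log(n_obs) - 2 * log_likelihood

# Invented numbers for illustration only, NOT results from the study.
fits = {
    "scenario_1": {"log_likelihood": -120.0, "n_params": 5},
    "scenario_2": {"log_likelihood": -110.0, "n_params": 6},
}
n_obs = 100  # hypothetical number of observations

# Map each model name to its (AIC, BIC) pair
scores = {
    name: (
        aic(fit["log_likelihood"], fit["n_params"]),
        bic(fit["log_likelihood"], fit["n_params"], n_obs),
    )
    for name, fit in fits.items()
}

# The preferred model is the one with the lowest score.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(scores)
print(best_by_aic, best_by_bic)
```

Both scores reward a better fit and penalise extra parameters, with BIC penalising them more heavily once n exceeds about 8, so the two criteria can disagree on real data.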

However, it is also important to look at where researchers do their fieldwork to collect fossil data. Rounded arrows here show fossils collected by researchers from the same region, while straight arrows represent collection by researchers from other regions.
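
The arrows in such a diagram can be thought of as weighted edges in a directed network, from the researchers' region to the region where the fossils were collected. The sketch below shows one simple way to tally them, assuming each record has already been reduced to a (researcher region, sampling region) pair. The records are placeholders, not entries from the Paleobiology Database, and summarise_flows is a hypothetical helper.

```python
from collections import Counter

# Placeholder (researcher_region, sampling_region) pairs, NOT real database records.
records = [
    ("Europe", "Europe"),
    ("Europe", "Africa"),
    ("North America", "Africa"),
    ("South America", "South America"),
    ("Europe", "Europe"),
]

def summarise_flows(pairs):
    """Split edge counts into within-region loops and between-region flows."""
    edges = Counter(pairs)
    # Same source and destination: the "rounded arrows"
    within = {src: n for (src, dst), n in edges.items() if src == dst}
    # Different source and destination: the "straight arrows"
    between = {(src, dst): n for (src, dst), n in edges.items() if src != dst}
    return within, between

within_region, between_regions = summarise_flows(records)
print("within region:", within_region)
print("between regions:", between_regions)
```

The counts could then serve as edge weights when drawing the network or comparing how much of a region's sampling is done by local researchers.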

In Europe, most fossil data in the last 30 years were collected by researchers from European institutions. These researchers, together with North American researchers, are also the ones contributing the most to fossil data collection in African regions, with very little contribution from African researchers. In South America, on the other hand, local researchers are the ones driving fossil data collection. Although these countries are not in the high-income group, legislation on fossil heritage may have restricted the export of fossils to other countries, and investment in science funding promoted the contribution of local researchers to paleontological studies. We also plan to investigate the strength of these relationships by looking at the size and connectivity of these networks. This will allow us to determine the most important actors driving data collection in a specific region or country. Our model focusing on the countries of researchers could explain more of the global distribution of fossil occurrences.

GDP and political stability influence research funding, and higher GDP was also correlated with higher English proficiency. Both higher research funding and higher English proficiency then contributed to more sampling by these researchers. It is also important to note that the Paleobiology Database community consists mostly of researchers from institutions in Europe and Northern America, regions that also have some of the highest levels of English proficiency, GDP and political stability in the world. The bias that we see in the data could therefore also result from the choice of publications entered into the database rather than from the fossil collection itself.

The published literature represents only a small proportion of the palaeontological data housed in museum collections. Other hidden data can take the form of non-English publications, which may not have been compiled because of language barriers or simply because no online copy exists. This was one issue that we encountered while mining our affiliation data. It adds yet another dimension that has not been considered here.

We also cannot deny the colonial legacy in the field of paleontology. One example is the Brachiosaurus specimen on display at the Natural History Museum in Berlin. The skeleton was assembled from fossils of three individuals recovered by German paleontologists in the 1900s in what was then known as German East Africa and is now Tanzania. In short, these fossils were acquired during the German occupation of Tanzania and, following the logic applied to other colonial-period artifacts, the museum’s retention of the fossils makes it complicit in the colonial agenda. The question then becomes: why should a German museum have the right to hold and display these fossils? This is one of the reasons why some Tanzanian politicians are demanding these fossils back. In addition, although German paleontologists have traditionally received the credit for this discovery, it was in fact local residents, who knew the bones and used them in religious rites, who guided them to these fossils, and their contributions have been completely erased from history. This is not the only museum facing repatriation requests. It also shows that including fossil data from museums might simply exacerbate the pattern that we observe here. It is clear that the distribution of fossil data across the world goes beyond just biological and physical processes.

We can see that the choices of fieldwork location made by researchers from higher-income countries have shaped the fossil occurrence distribution that we observe today. This can explain both the high number of fossil sites in these high-income countries and the global sampling distribution. In many cases where researchers from high-income countries carry out fieldwork in lower-income countries, the resulting publications show that very often no local researchers were involved. This pattern, which extends beyond the three decades we focus on, has created a scientific hierarchy in which paleontological knowledge is held by the high-income countries, especially in Europe and North America. The first step towards research that is more ethical and democratic is to admit and acknowledge the problem: knowledge in paleontology, and across scientific disciplines, is driven by power relations.