Research Article
Corresponding author: Melanie Werner (melanie.werner@uni-hamburg.de). Academic editor: Janet Franklin
© 2024 Melanie Werner, Johannes Weidinger, Jürgen Böhner, Udo Schickhoff, Maria Bobrowski.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Werner M, Weidinger J, Böhner J, Schickhoff U, Bobrowski M (2024) Instagram data for validating Nothofagus pumilio distribution mapping in the Southern Andes: A novel ground truthing approach. Frontiers of Biogeography 17: e140606. https://doi.org/10.21425/fob.17.140606
The availability of valid, non-biased species occurrence data has always been a major challenge for biodiversity research and modelling studies. Data from open-source databases or remote sensing are promising approaches to increase the availability of species occurrence data. However, these data may contain spatial, temporal, and taxonomic biases or require ground truthing. In recent years, social media has received attention for its contribution to species occurrence data sampling and ground truthing approaches. The wide reach of social media platforms allows for rapid and large-scale analyses.
Here we introduce a novel Instagram ground truthing approach to validate the occurrence mapping of Nothofagus pumilio across its entire distribution range. The treeline species of the southern Andes has been extensively studied in small-scale studies, but large-scale modelling approaches are largely missing due to limited accessibility to treeline sites resulting in a lack of occurrence data. The content posted on the social media platform Instagram consists of images and videos in which the species N. pumilio and its location can be identified. By searching for suitable posts using hashtags and location tags, we created 1238 Instagram ground truthing points. We compared the performance of our dataset with open-source data from the Global Biodiversity Information Facility (GBIF) through visual, quantitative, and bias analyses, acknowledging that both social media-based and Citizen Science data are subject to sampling and spatial biases due to collection in human-accessible areas. The Instagram ground truthing points were subsequently used to validate remote sensing occurrence data, generated using Sentinel-2 level 2A data and Supervised Classification. The combined approach – Instagram ground truthing and remote sensing – allows for the collection of occurrence data over the entire latitudinal range of N. pumilio, covering approximately 2000 km.
The use of social media content provides potentially important contributions to species occurrence data sampling and ground truthing.
In our study, we introduce a novel ground truthing approach for species occurrence data sampling based on Instagram data.
Instagram ground truthing points, combined with Supervised Classification, generate species occurrence data of Nothofagus pumilio over its entire distribution range in the southern Andes.
The performance of the Instagram ground truthing points is evaluated by comparison with existing data from the GBIF database.
Our Instagram ground truthing approach demonstrates a new way of sampling species occurrence data and can be applied to other suitable species and study areas.
Citizen Science, Ecological Modelling, Ground Truthing, Instagram, Occurrence Data Sampling, Remote Sensing, Social Media, Southern Andes, Supervised Classification
Quantifying spatial and temporal distribution of species and analysing underlying ecological requirements has become increasingly important in high elevation regions due to climate and environmental change (
Species occurrence data are mainly collected through field research and made available in publications and databases (
More recently, Citizen Science projects and social media are becoming crucial for surveying species occurrences (
While both Citizen Science and social media enhance the sampling of occurrence data, the data remain spatially biased, as records are predominantly collected from locations accessible to humans (
The availability of data for treeline species is limited; however, such species are likely to be suitable candidates for analysis using remote sensing and social media-based species occurrence data sampling. N. pumilio forms an abrupt treeline in the orotemperate belt of the southern Andes in mono-species forest stands (
In this study, we demonstrate the large-scale sampling of N. pumilio occurrence data using Sentinel-2 imagery and Supervised Classification. To validate the spatial occurrence data, we introduce a novel Instagram ground truthing approach, leveraging occurrence points derived from the social media platform Instagram (www.instagram.com). We hypothesise that this Instagram-based method, due to a high volume of potentially suitable posts and our sampling approach, will generate more occurrence points with reduced spatial bias compared to datasets from the open-source GBIF database. Spatial bias in the resulting species occurrence data is further mitigated by incorporating remote sensing data, which enhances both the quantity and spatial coverage of occurrence information. Unlike presence-only point datasets, remote sensing data provide presence-absence datasets, offering more comprehensive opportunities for ecological modelling.
Nothofagus pumilio (Poepp et Endl.) Krasser (southern or lenga beech) is the dominant subalpine tree species in the southern Chilean and Argentinean Andes between 35°S and 56°S, encompassing a latitudinal distribution range of more than 2000 km (
The distribution area of the species along the Andean cordillera follows an elevational gradient from north to south, while the west-east extent also depends on precipitation. N. pumilio forms mono-species forests between 1600 m and 2000 m in the northern parts, whereas the elevational limit decreases to 400 m in the southernmost range at Tierra del Fuego (
As a first step, we developed the Instagram ground truthing approach to ensure proper validation of large-scale remotely sensed occurrence data of N. pumilio. Additionally, we compared the Instagram ground truthing points quantitatively with existing occurrence data from the GBIF database. We used the social media platform Instagram (www.instagram.com) to sample the Instagram ground truthing points. Although other social media platforms have been used in studies sampling species occurrences, Instagram has largely been overlooked. We nevertheless see a significant advantage in using it: Instagram users can post both photos and short videos, and the platform’s lack of text-only posts makes it especially suitable for our approach. At the same time, Instagram is one of the largest social media platforms with 2 billion users worldwide (
We started the Instagram ground truthing approach in 2021 and repeated it in 2022. The analysis consisted of 3 steps: 1) Potential contributions from publicly accessible profiles were searched for using the search bar embedded in the Instagram user interface and two specific search options: hashtags (#nothofaguspumilio and #lenga for exact species information) and places or landscape features (by location tags, locations in hashtags or usernames for exact locations). 2) Posts were checked using a strict catalogue of criteria (Table
When we manually transferred the locations to a map in step 3), simple descriptions of the locations were not sufficient. The locations still had to be clearly traceable by landscape features (see Table
Criteria for selecting Instagram posts to generate Nothofagus pumilio occurrence data. All these criteria must be fulfilled for the image to be included in the analysis.
Criterion | Element or Example |
---|---|
Typical characteristics of Nothofagus pumilio | morphological characteristics (leaves, branches, habitus); autumn colouring; abrupt treeline; mono-species forest |
Concrete location information | geographical tag; location hashtag; location description in the caption |
Recognisable landscape elements | glaciers; mountain peaks or ranges; rivers, lakes; roads; tourist points, cities, villages; coastlines |
Fitting hashtags | hashtags describing the location or the plant |
Picture criteria | no persons in focus; no photo montages; no emojis; no extreme (colour) falsifications |
If all conditions were met, we manually transferred the determined occurrence to a map with SAGA GIS (
Example of the Instagram ground truthing approach at Laguna Capri, Argentina. Nothofagus pumilio can be identified by its habitus and leaves in the foreground, its autumn-colouring and the abrupt treeline in the background. The lake itself and Mount Fitz Roy are reliable landscape elements. A location tag, the post description and hashtags also refer to the location (used with permission by Instagram user
Analysing multispectral, medium spatial resolution satellite data like Sentinel-2 leads to cost-efficient and robust results in tree species classification over large spatial extents (
We trained our Supervised Classification with training areas including three classes (1 = N. pumilio, 2 = Evergreen vegetation, 3 = Low vegetation). Training areas were created using summer, autumn and winter Sentinel-2 scenes at selected sites across the range. The winter data made evergreen vegetation clearly recognisable. Autumn colouring at the treeline indicated N. pumilio. We tested various classification algorithms for Supervised Classification, including well-performing standard algorithms like Maximum Likelihood, Minimum Distance, and Spectral Angle, as well as the decision tree-based Random Forest algorithm. The performance of these algorithms was measured using overall accuracy and the Kappa value (
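For readers less familiar with the two performance metrics, the following minimal sketch shows how overall accuracy and a Kappa value can be derived from a confusion matrix. It assumes that the Kappa value refers to Cohen's Kappa, as is usual in classification accuracy assessment. The function name and the example counts are hypothetical and serve only as an illustration; this is not the implementation or the data used in the study.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's Kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of reference class i
    that were assigned to class j by the classifier.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Agreement expected by chance, from row and column totals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(row[j] for row in confusion) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for the three classes (illustration only):
# 1 = N. pumilio, 2 = evergreen vegetation, 3 = low vegetation.
example_matrix = [
    [50, 2, 1],
    [3, 45, 2],
    [1, 4, 42],
]
accuracy, kappa = overall_accuracy_and_kappa(example_matrix)
print(f"overall accuracy = {accuracy:.2f}, Kappa = {kappa:.2f}")
```

Kappa is lower than overall accuracy whenever part of the observed agreement could have arisen by chance, which is why both values are usually reported together.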
We classified summer and autumn data separately and subsequently extracted the N. pumilio occurrence from each result and merged it into one layer. The result was further refined using three different masks. The class “low vegetation/grassland” was classified particularly reliably in the summer data, and the class “evergreen” in the autumn data, so the merged result was masked with these two classes to remove pixels that may have been misclassified as N. pumilio. In the north of the study area, N. pumilio occurs only at higher elevations, and other deciduous species at lower elevations were misclassified as N. pumilio. To remove these false occurrences, we used a Digital Surface Model (DSM, ALOS Global Digital Surface Model “ALOS World 3D – 30m (AW3D30)”, Jaxa EORC 2023) to exclude occurrences below the high-elevation mono-species forests (thresholds: 800 m from 35°S to 40°S; 500 m from 40°S to 45°S; 250 m from 45°S to 50°S).
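The elevational correction can be illustrated with a short sketch. The thresholds and latitude bands are taken from the text; everything else is an assumption made for illustration: the function names are hypothetical, a band boundary such as 40°S is assigned to the more southerly band, and no correction is applied outside 35°S to 50°S. The actual masking in the study was carried out on raster data in a GIS.

```python
# Latitude bands in degrees south with the minimum elevation (m)
# below which classified N. pumilio pixels are discarded.
ELEVATION_THRESHOLDS = [
    (35.0, 40.0, 800.0),
    (40.0, 45.0, 500.0),
    (45.0, 50.0, 250.0),
]


def min_elevation(lat_south):
    """Return the threshold for a latitude in degrees south, or None if no band applies."""
    for band_north, band_south, threshold in ELEVATION_THRESHOLDS:
        # Assumption: the northern edge is inclusive, the southern edge exclusive.
        if band_north <= lat_south < band_south:
            return threshold
    return None


def keep_pixel(is_pumilio, lat_south, elevation_m):
    """Decide whether a pixel classified as N. pumilio survives the elevation mask."""
    if not is_pumilio:
        return False
    threshold = min_elevation(lat_south)
    # Assumption: outside the listed bands no elevational correction is applied.
    return threshold is None or elevation_m >= threshold


# A pixel at 37°S and 700 m is removed; the same elevation at 42°S is kept.
print(keep_pixel(True, 37.0, 700.0), keep_pixel(True, 42.0, 700.0))
```

Keeping the thresholds in a single table makes it easy to test alternative values against the trade-off between accuracy and data loss.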
(A) Sentinel-2 autumn image at the Perito Moreno Glacier, Argentina, (B) Scene Classification Layer of the Sentinel-2 scene; the green area, class 4, shows vegetation, (C) the masked Sentinel-2 scene and, (D) classification result with three classes (red = Nothofagus pumilio, dark green = evergreen vegetation, light green = low vegetation/grassland).
The large-scale remote sensing data on the occurrence of N. pumilio was validated using the Instagram ground truthing points. This process involved verifying whether the Instagram ground truthing points align with the spatial distribution derived by Supervised Classification. Additionally, we used occurrence data from the GBIF database to also validate the spatial distribution and to compare it with the Instagram ground truthing points visually, quantitatively and with a sampling bias analysis. The GBIF database provides data on species of all taxa according to the open-source principle. The Secretariat in Copenhagen coordinates data from various sources, such as museums, research publications, and Citizen Science projects, and makes them available (GBIF
ChatGPT (GPT-4 and GPT-3.5; available at https://chat.openai.com/) was used to enhance sentence structure and grammar in individual sentences.
Numerous posts found by hashtags and location tags were reviewed in 2021 and 2022, resulting in 1238 traceable occurrence points. In total we found 297 suitable posts published between 2017 and 2022. Most posts were published in 2021. A total of 460 points were placed at the actual location of the posts, and 778 points were placed in the visible background area (mainly autumn coloured treeline locations). Posts with specific species information using the hashtags #nothofaguspumilio or #lenga provided 61 occurrence points. Posts with detailed location information, where N. pumilio is recognisable, contributed significantly to the occurrence data.
Fig.
Visualised results of the “sampbias” analysis, indicating the sampling rate based on the influence of bias factors (gazetteers: cities, roads, rivers and lakes). In comparison, the Instagram ground truthing approach (IGTA) dataset (A) displays more homogeneous sampling, whereas the GBIF dataset (B) shows undersampled areas, represented by dark blue regions.
We found that the OpenCV Supervised Classification and Random Forest algorithms demonstrated the best performance. For the summer classification, the overall accuracy was 0.93 with a Kappa value of 0.89, and for the autumn classification, the overall accuracy was 0.97 with a Kappa value of 0.96 (see Suppl. material
In particular, the area extending east of the Northern and Southern Patagonian Ice Fields shows an accurate classification result: Three tree species dominate these areas, with N. pumilio and N. antarctica being deciduous species and N. betuloides being an evergreen species (
Errors occur mainly due to high cloud cover and shadow effects. At the southern tip of Chile and Argentina, high cloud cover leads to data coverage problems. An area where the Sentinel-2 scene is not fully available in the sensing period of 2019 to 2022 and other scenes were almost completely covered by clouds is shown in Fig.
(A) Nothofagus pumilio occurrence determined by Supervised Classification and (B) occurrence of Nothofagus pumilio at Perito Moreno Glacier, with a classification result corresponding to the natural conditions, shown in comparison with the autumn Sentinel-2 scene in (C). All Instagram ground truthing points (blue, B) cover the identified occurrence.
(A) A data gap at the southern tip of Chile. This is caused by missing Sentinel-2 data in the sensing period between 2019 and 2022 and very high cloud cover. (B) A valley with mountain shadows. To avoid errors in the spectral signals, these areas were removed during analysis using the Sentinel-2 Scene Classification Layer. However, this leads to a gap in the classification result and errors in the validation with the Instagram ground truthing points (blue).
We validated the classification result with the Instagram ground truthing and GBIF points by checking whether the points match the spatial occurrence. Out of 1238 Instagram ground truthing points, 1142 points (92.25 %) are congruent with the remote sensing data. The remaining 96 points (7.75 %) lie outside the areas classified as N. pumilio. These errors are probably due to mountain shadows and missing data, as we show in the results. Of the GBIF points, 157 (28.14 %) align with the spatial occurrence, while 401 (71.86 %) do not. However, many of these points lie just outside the determined spatial occurrence. Here, too, errors can occur due to shadows and missing data. Other reasons may include the uncertainty of the coordinates, records being taken on roads or paths next to the occurrence rather than directly in the plant stand, or individual trees or stands being recorded in urban areas, evergreen forest stands, or open areas with low vegetation, which the classification does not categorise as N. pumilio areas. Fig.
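The congruence check described above amounts to a point-in-raster lookup. The sketch below shows the principle under simplifying assumptions: the classification is represented as a small boolean grid with square cells and a known upper-left corner, and all names and example values are hypothetical. It does not reproduce the GIS workflow used in the study.

```python
def congruence(points, classified, origin_x, origin_y, cell_size):
    """Count points that fall on cells classified as the target species.

    points: iterable of (x, y) map coordinates.
    classified: 2D list of booleans, with row 0 at the northern edge.
    origin_x, origin_y: coordinates of the raster's upper-left corner.
    Returns (congruent points, total points, share in percent).
    """
    inside = 0
    total = 0
    for x, y in points:
        total += 1
        col = int((x - origin_x) // cell_size)
        row = int((origin_y - y) // cell_size)
        in_bounds = 0 <= row < len(classified) and 0 <= col < len(classified[0])
        # Points outside the raster or on unclassified cells count as not congruent.
        if in_bounds and classified[row][col]:
            inside += 1
    share = 100.0 * inside / total if total else 0.0
    return inside, total, share


# Toy example: a 2 x 2 raster with 20 m cells and four validation points.
toy_raster = [
    [True, False],
    [False, True],
]
toy_points = [(5.0, 35.0), (25.0, 35.0), (30.0, 10.0), (100.0, 100.0)]
result = congruence(toy_points, toy_raster, 0.0, 40.0, 20.0)
print(result)
```

The shares reported above follow the same arithmetic: 1142 of 1238 Instagram ground truthing points correspond to 92.25 %, and 157 of 558 GBIF points to 28.14 %.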
(A) Example of Nothofagus pumilio occurrence points from the GBIF database (GBIF points, yellow) with an uncertainty in coordinates, so that the points are located in a lake and not at the actual sampling location and (B) GBIF points in Ushuaia, where some raster cells, at locations of GBIF points, were not classified as vegetation.
The availability of sufficient, non-biased species occurrence data has always been a major problem for ecological modelling studies (
We include the sampling of ground truthing points on Instagram, and thus the re-use of social media posts in the realm of Citizen Science, although this is controversially discussed. According to the vignette study by
N. pumilio is suitable for the Instagram approach due to its distinctive characteristics and visibility in satellite images at the treeline in mono-species forest stands. Additionally, its occurrence in national parks, where tourists often take photos, increases the number of Instagram posts. The species’ autumn colouring further enhances its aesthetic appeal, leading to even more posts. These advantages also benefit occurrence data sampling in unstructured Citizen Science projects. In unstructured projects, user behaviour of Citizen Science participants leads to observations with spatial bias (
Data from Citizen Science projects are improving and even approaching the quality of expert data (
A limitation of the Instagram ground truthing approach is that it is time-consuming and not automated. While sampling by Citizen Science projects is a way of data collection in which data can be collected particularly quickly and cost-effectively with a large reach (
Another limitation is the accuracy of the recording date. Caution is needed regarding the exact timing of when a photo was taken and posted on Instagram, as the publication date may not correspond to the actual date the photo was captured. In our analysis, we primarily focused on recent posts, covering the period from 2017 to 2022, while the GBIF data includes records dating as far back as 1981. Verifying the accuracy of the Instagram date is essential (if necessary, by contacting the post creators), particularly for temporal distribution analyses of species. “Historical” data may not be available on Instagram at all. Additionally, discrepancies in acquisition dates between Sentinel-2 and Instagram data may introduce potential sources of error in the validation process. We used Sentinel-2 data from 2019 to 2022, while Instagram posts date back to 2017. Changes in forest stands, such as deforestation or forest fires, could result in discrepancies.
The occurrence in mono-species forest stands at the treeline and the phenology of N. pumilio allow the precise creation of training areas and a reliable Supervised Classification result. With the Sentinel-2 level 2A data at a resolution of 20 m, we achieved an accurate classification over a very large study area of about 2000 km latitudinal extent in the southern Andes. Such a resolution is sufficient for subsequent modelling, which often uses climate data at a resolution of 30 arcseconds (~1 km), for example, WorldClim and Bioclim data (e.g.,
A different source of error is that in some cases stands of N. antarctica were classified as stands of N. pumilio in high and low elevation areas. The two deciduous species are very similar, both phenologically and ecologically, and often share the same range. Hybrids of the two species are also possible (
In more southerly areas, N. pumilio and N. antarctica dominate as deciduous species. In the north, however, many other deciduous species occur at lower elevations, while N. pumilio occurs only at the treeline. For this reason, an elevational correction of the result was necessary. Occurrences below subalpine forest stands of N. pumilio have been removed. Individual stands below these are also less relevant for treeline modelling studies. The thresholds (800 m, 500 m, 250 m) were estimated from literature data and from the classification result in order to obtain the most accurate classification result with the least data loss.
Citizen Science and social media-based occurrence sampling is developing and improving rapidly, becoming an important source of species occurrence data, especially for large-scale modelling approaches where alternatives are limited. However, the resulting data are not free from bias and need to be filtered and verified before being used in applications such as ecological models. We conclude that using social media posts on Instagram in a structured Instagram ground truthing approach leads to less-biased occurrence data for N. pumilio in comparison with GBIF data. Sampling biases are further minimised by combining the Instagram ground truthing method with Supervised Classification, because large-scale occurrence data are generated across the entire distribution range of the species rather than only in the urban or tourist locations where most pictures underlying Citizen Science observations or Instagram posts are taken. We further conclude that the Instagram ground truthing approach is a novel method that can complement occurrence data sampling methods and be applied to other suitable species. However, it is essential that landscape elements are visible in the posts, which is more likely for landscape images and less so for detailed images of smaller herbaceous plant or animal species. Future work could focus on automating the search for Instagram posts using the Instagram API and AI technology, replacing the time-consuming manual search and further increasing the availability of suitable posts. We believe that using social media can unlock significant potential for species occurrence data sampling and thus promote research on species in remote and high-elevation regions. Furthermore, the spatial occurrence data of N. pumilio enable presence-absence modelling approaches that can provide detailed insights into the current and future distribution of N. pumilio.
We would like to thank all the Instagram users who have supported us with their posts, with information about the location where their pictures were taken and sharing information about the occurrence of N. pumilio and N. antarctica. Special thanks to user fernando.v.fotografia for allowing us to publish his post.
All authors participated in the project design. MW conceived and developed the idea, collected and analysed the data, and drafted the first manuscript. JW and MB were significantly involved in the conception of the idea and in the development and implementation of the analyses; MB also managed the whole project. JB and US reviewed and edited the manuscript, providing their expertise on the topic and assistance with the literature. All authors have read and agreed to the published version of the manuscript.
The Instagram ground truthing points and the spatial species occurrence data generated in our study are currently in the process of being published with the open access data provider of the University of Hamburg.
Supplementary material: The following materials are available as part of the online article at https://escholarship.org/uc/fb.
Results of the “sampbias” analysis and information on the performance of the Supervised Classification results testing different classification algorithms (.xlsx)