Flagging potentially “non-wild” species’ records

I often work with species occurrence data collected by numerous scientists and volunteers. While invaluable, these data sometimes contain records collected from non-wild sources that would be inappropriate to include in most studies. Real examples I have seen include herbarium specimens georeferenced to the store at which the plant was bought, polar bears and giraffes georeferenced to the zoos in which they died, and a bridal bouquet flower georeferenced to the location of a wedding I officiated (and no, I did not collect it and certainly would not have georeferenced it!).

Below is a non-exhaustive set of “flag words” that can be used to scan the “notes” fields of occurrence records to help flag potentially undesirable records. Each flagged record should be investigated individually: in some cases, these keywords will catch legitimate records (e.g., “zoo” will flag phrases like “deposited in XYZ Zoological Museum”). The “grow” keyword for plants can flag a lot of legitimate records (e.g., “growing in a cedar glade”). However, I have found that it can catch cases you would not have thought to search for. For example, we recently found that one of our models was based on a specimen “growing on the campus of XYZ University”… I would not have thought to search for “campus,” but “grow” would have caught it.

Having worked with animals and plants, I list below flag words for each. I am sure there are keywords that would be appropriate for other types of organisms (e.g., “petri” for fungi and bacteria). I have formatted these assuming you are working in R.

Flag words for animals (be sure to also search for non-English equivalents of these for the locale from which records were obtained):

flags <- c('zoo', 'cage', 'aquarium', 'experiment', 'captive', 'pet', 'tame', 'house', 'bought', 'store')

Flag words for plants (be sure to also search for non-English equivalents of these for the locale from which records were obtained):

flags <- c('garden', 'arboretum', 'zoo', 'campus', 'nursery', 'greenhouse', 'glasshouse', 'shadehouse', 'green house', 'glass house', 'shade house', 'pot', 'experiment', 'treatment', 'bought', 'store', 'bouquet', 'vase')
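
To get a feel for how loosely these flag words match, here is a small sketch using made-up notes. The `toy_notes` vector, its contents, and the `demo_flags` name are invented for illustration; they are not real records. Because matching is on plain substrings, “zoo” also hits “Zoological Museum” and “pot” also hits “spotted,” which is why each flagged record deserves a manual look.

```r
# Hypothetical notes, invented for illustration only
toy_notes <- c(
   'Deposited in XYZ Zoological Museum',
   'Died at the city zoo',
   'Spotted along a forest trail',
   'Grown in a pot on a balcony',
   'Collected in a cedar glade'
)

# A small subset of flag words, kept separate so "flags" is not overwritten
demo_flags <- c('zoo', 'pot')

# One column per flag word; TRUE where the lower-cased note contains it
hits <- sapply(
   demo_flags,
   function(flag) grepl(flag, tolower(toy_notes), fixed = TRUE)
)

hits # rows follow the order of toy_notes
```

In this toy case, the first and third notes are flagged even though they describe a museum deposit and a field sighting, respectively.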

Here is a simple R block to flag potentially concerning records. I assume the data frame with the occurrence data is called “x” and it has a column named “notes.” If there is more than one relevant column, you should repeat this script or adjust it to search all relevant fields at once. The script returns the notes that contain the potentially concerning text.

concerns <- integer() # stores index of concerning records

for (flag in flags) {

   # indices of records whose lower-cased notes contain this flag word as a literal substring
   concern <- which(grepl(flag, tolower(x$notes), fixed = TRUE))
   concerns <- c(concerns, concern)

}

concerns <- sort(unique(concerns)) # vector of unique indices with concerning records
x$notes[concerns] # view concerning notes and manually screen
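
If you need to search more than one field at once, something like the following might work. This is only a sketch: the function name `flag_records`, its arguments, and the toy data frame (including the “locality” column) are hypothetical and not from any package. It pastes the chosen columns together for each record and also reports which flag words matched, which can speed up manual screening.

```r
# Hypothetical helper: search several text columns at once
flag_records <- function(x, fields, flags) {

   # paste the chosen columns into one lower-cased string per record
   # (missing values become the literal text "na" after pasting)
   cols <- unname(as.list(x[fields]))
   text <- tolower(do.call(paste, c(cols, sep = ' | ')))

   # for each record, list the flag words found in its combined text
   matched <- vapply(text, function(txt) {
      found <- flags[vapply(flags, function(flag) grepl(flag, txt, fixed = TRUE), logical(1))]
      paste(found, collapse = ', ')
   }, character(1), USE.NAMES = FALSE)

   # keep only records with at least one match
   keep <- nzchar(matched)
   data.frame(
      row = which(keep),
      matched = matched[keep],
      x[keep, fields, drop = FALSE],
      row.names = NULL,
      stringsAsFactors = FALSE
   )

}

# Made-up data for illustration; the "locality" column is an assumption
toy <- data.frame(
   notes = c('Growing in a cedar glade', 'Purchased plant', NA),
   locality = c('State forest', 'Garden store, Main St.', 'University campus greenhouse'),
   stringsAsFactors = FALSE
)

flagged <- flag_records(
   toy,
   fields = c('notes', 'locality'),
   flags = c('garden', 'campus', 'greenhouse', 'store')
)
flagged # view flagged records and manually screen
```

As with the loop above, the output is only a list of candidates; each record still needs to be checked by eye.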

~ Adam

Please do not georeference specimens collected from wedding bouquets.