1 Abstract

As pollinators and producers of numerous human-consumed products, honey bees have great ecological, economic and health importance. The composition of their bacteriota, for which the available knowledge is limited, is essential for their body’s functioning. Based on our survey, we performed a metagenomic analysis of samples collected by repeated sampling. We used geolocations that represent the climatic types of the study area over two nutritionally extreme periods (March and May) of the collection season. Regarding bacteriome composition, a significant difference was found between the samples from March and May. The samples’ bacteriome from March showed a significant composition difference between cooler and warmer regions. However, there were no significant bacteriome composition differences among the climatic classes of samples taken in May. Based on our results, one may conclude that the composition of healthy core bacteriomes in honey bees varies depending on the climatic and seasonal conditions. This is likely due to climatic factors and vegetation states determining the availability and nutrient content of flowering plants. The results of our study prove that in order to gain a thorough understanding of a microbiome’s natural diversity, we need to obtain the necessary information from extreme ranges within the host’s healthy state.

Apis mellifera

Apis mellifera

In this study we sequenced 40 bee metagenome and sought the answers for the following:

  • Spatial difference in metagenomes
  • Geographical pattern of viruses and bacteria
  • Environmental pattern in metagenome
  • Temporal/Periodical change in metagenome

To allow for the investigation of the temporal variation in metagenome we did two sampling on the same twenty bee family. Selection of the sampling locations (bee families) were done with a stratified random method (e.g. GRTS in spsurvey package). After contacting the local beekeepers, we did one sampling in March and one in May. For a tutorial on how the selection of the sampling sites worked see the section of Selecting sampling locations

The published paper can be found here:

Natural diversity of the honey bee (Apis mellifera) gut bacteriome in various climatic and seasonal states

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0273844

Data Availability: The short read data of samples are publicly available and accessible through the PRJNA685398 from the NCBI Sequence Read Archive (SRA).

Citation: Papp M, Békési L, Farkas R, Makrai L, Judge MF, Maróti G, et al. (2022) Natural diversity of the honey bee (Apis mellifera) gut bacteriome in various climatic and seasonal states. PLoS ONE 17(9): e0273844. https://doi.org/10.1371/journal.pone.0273844

2 Sampling locations

Sampling design and sample collection

The study’s main goal was to understand the natural variability in the gut bacteriome of healthy honey bees (Apis mellifera). To measure seasonal variation, two sampling occasions were planned, one at the onset and one at the peak of the honey producing season. However, as season is not the only variable that could be considered when one is interested in the factors that could affect the bacteriome (climate could be an important environmental factor as well), we have determined our samples to be representative of Hungary on the climate level. To obtain such samples, we conducted a stratified spatial random sampling [23] as detailed below.

We gathered the 10 year average of the yearly growing degree days (GDD) with base 10 [24, 25] and the yearly total precipitation data for all the 175 local administrative units (level 1, hereafter refered to as LAU) in Hungary. Meteorological data for the period 2008–2017 was gathered from the ERA-Interim reanalysis data repository [26] by the spatial resolution of 0.125°. We defined the two categories for our environmental variables as cooler-warmer and less-more for GDD and precipitation respectively.

Regarding GDD, the lower two quartiles were classified as cooler and the upper two quartiles as warmer. For precipitation, the yearly mean below the country-wide median was assumed as less and above the median as more. Each LAU was categorised by its own climatic variables (Fig 1). We created separate strata for each combinations of our two environmental variables.

The figure shows the location of the sampling locations in March (top) and in May (bottom). Some of the bee families have been moved during these 2 months and some of them even switched category according to precipitation or GDD. This has been accounted for in the study.

More detail about sampling site selection is here here

3 Clean the initial excel table

We provide the cleaned the data file, but here is the code used for cleaning the initial data to give feeling about the possible errors that might have entered the pipeline.

# The results arrived in an excel sheet
# Here we clean it and check for discrepancies

library(RSQLite)
library(sp)
library(sf)

m = matrix(c(1, 'Bácsalmási', 'JO',  
             2, 'Bicskei', 'IP',  
             3, 'Ceglédi', 'JO',  
             4, 'Celldömölki', 'IP',  
             5, 'Csongrádi', 'JO',  
             6, 'Derecskei', 'JP',  
             7, 'Dombóvári', 'JP',  
             8, 'Edelényi', 'IP',  
             9, 'Egri', 'IO',  
             10, 'Hajdúböszörményi', 'JP',  
             11, 'Hatvani', 'IO',  
             12, 'Kiskunhalasi', 'JO',  
             13, 'Miskolci', 'IP',  
             14, 'Nagyatádi', 'JP',  
             15, 'Sümegi', 'IP',  
             16, 'Szentlőrinci', 'JP',  
             17, 'Tatai', 'IO',  
             18, 'Törökszentmiklósi', 'JO',  
             19, 'Vásárosnaményi', 'IP',  
             20, 'Záhonyi', 'IP'), nc=3, byrow=T)

smpls = data.frame(id=as.numeric(m[,1]), jaras=m[,2], grp=m[,3])
smpls$gdd = substr(smpls$grp,1,1)
smpls$prec = substr(smpls$grp,2,2)

i = 1
library(readxl)
dat = as.data.frame(read_excel('../data/minden_minta.xls', sheet=i))
mintak = data.frame(meheszet=i, geo=c(dat[3,2], dat[3,3]), datum=c(dat[4,2], dat[4,3]))

for(i in 2:20){
  dat = as.data.frame(read_excel('../data/minden_minta.xls', sheet=i))
  mintak = rbind(mintak, data.frame(meheszet=i, geo=c(dat[3,2], dat[3,3]), datum=c(dat[4,2], dat[4,3])))
}

mintak[which(mintak$meheszet==7),'geo'] = '46.4063333; 18.2502167'
# mintak[which(mintak$meheszet==14),'geo'] = '47.9440500; 20.3596167' # indicated badly in the table
mintak[which(mintak$meheszet==14),'geo'] = '46.20143; 17.45692'
mintak[which(mintak$meheszet==16),'geo'] = '46.0808333; 17.9911111'

mintak[16,'geo'] = mintak[15,'geo']

mintak$geo = gsub(',', ';', mintak$geo)

mintak$geo[mintak$geo=="N 46Ëš25.752'      E 20Ëš04.431'"] = '46.4292000; 20.0738500'
mintak$geo[mintak$geo=="47 35.644     19 18.077"] = '47.5940667; 19.3012833'
mintak$geo[mintak$geo=="N 47Ëš56'38''                                                E 20Ëš21'23''"] = '47.9438889; 20.3563889'
mintak$geo[mintak$geo=="48 03.737   20 21.424"] = '48.0622833; 20.3570667'
mintak$geo[mintak$geo=="N 47Ëš46'31''                                                E 19Ëš40'12''"] = '47.7752778; 19.6700000'
mintak$geo[mintak$geo=="47 48 17          19 41 39"] = '47.8047222; 19.6941667'

mintak$geo = gsub(' ', '', mintak$geo)
m = matrix(unlist(lapply(lapply(strsplit(mintak$geo, ';'), as.numeric), sort)), nc=2, byrow=T)

mintak$lon = m[,1]
mintak$lat = m[,2]
mintak$sid = rep(c(1,2), 20)

# There were replacements of families between the two sampling dates
mintak$csaladok = 'kezdő'
mintak$csaladok[mintak$sid==2] = 'ua'

mintak$csaladok[which(mintak$meheszet==2 & mintak$sid==2)] = 'switch, got in the pool'
mintak$csaladok[which(mintak$meheszet==13 & mintak$sid==2)] = 'switch, did not get in the pool'
mintak$csaladok[which(mintak$meheszet==6 & mintak$sid==2)] = 'was elsewhere, did not get in the pool'
mintak$csaladok[which(mintak$meheszet==16 & mintak$sid==2)] = 'was elsewhere, got in the pool'

# Load geographical data
drv = dbDriver("SQLite") 
con = dbConnect(drv, dbname='../data/db.sqlite', loadable.extensions=TRUE)
lext = dbGetQuery(con, "SELECT load_extension('/usr/lib/x86_64-linux-gnu/mod_spatialite.so')")

library(RpostGIS) # From https://github.com/solymosin/RpostGIS

sql = "select id as gid, name, st_astext(geom) as geom from admin7"
jaras.tab = dbGetQuery(con, sql)
jaras.sp = wkt2sp(geoms=jaras.tab, gcol='geom', idcol='gid')
proj4string(jaras.sp) = CRS('+init=epsg:4326')
jaras.sf = st_as_sf(jaras.sp)
st_crs(jaras.sf) = 4326

sql = "select id as gid, name, st_astext(st_transform(geom,4326)) as geom from admin8b where name='Budapest'"
bp = dbGetQuery(con, sql)
bp = wkt2sp(geoms=bp, gcol='geom', idcol='gid')
proj4string(bp) = CRS('+init=epsg:4326')
bp = st_as_sf(bp)

# create new columns
jaras.sf = rbind(jaras.sf, bp)

save.image(file='../_checkpoint/1_cleaned_data.RData')

4 Prepare dataframe from weather data for further usage

library(RSQLite)
library(weathermetrics)
library(sf)
library(spsurvey)
library(raster)
library(tmap)
library(RpostGIS)
options(stringsAsFactors=F)
drv = dbDriver("SQLite")
con = dbConnect(drv, dbname='../data/db.sqlite', loadable.extensions=TRUE)
lext = dbGetQuery(con, "SELECT load_extension('/usr/lib/x86_64-linux-gnu/mod_spatialite.so')")
 
sql = "select id as gid, name, st_astext(geom) as geom from admin7"
jaras.tab = dbGetQuery(con, sql)
jaras.sp = wkt2sp(geoms=jaras.tab, gcol='geom', idcol='gid')
proj4string(jaras.sp) = CRS('+init=epsg:4326')
jaras.sf = st_as_sf(jaras.sp)
st_crs(jaras.sf) = 4326
jaras.sf = st_transform(jaras.sf, 23700)
 
jaras.c = st_centroid(jaras.sf)
jaras.c.sp = as_Spatial(jaras.c)

##################################################################################
 
jaras.c.sp@data$selected=1:175
 
##################################################################################
fs = list.files(path='../weatherdata', pattern='HU_SFC.nc$', full.names=T)
fs = fs[grep('interim_0.125', fs)]

res.m = matrix(nr=length(jaras.sp), nc=length(fs))

for(n in 1:length(fs)){
 f = fs[n]
 T2m = brick(f, varname='2t')
 j.lst = extract(T2m, jaras.sp)
 for(i in 1:length(j.lst)){
   pj = apply(j.lst[[i]], 2, mean)
   h6 = matrix(pj, nc=4, byrow=T)
   rownames(h6) = gsub('00.00.00', '', gsub('X', '', matrix(names(pj), nc=4, byrow=T)[,1]))
   napi = convert_temperature(rowMeans(h6), old_metric='kelvin', new_metric='celsius')
   gdd = napi-10
   gdd[which(gdd<0)]=0
   res.m[i,n] = sum(gdd)
 }

}

jaras.sf$gdd = rowMeans(res.m)

fs = list.files(path='../weatherdata', pattern='HU_SFC.prec.nc$', full.names=T)

fs = fs[grep('interim_0.125', fs)]

res.m = matrix(nr=length(jaras.sp), nc=length(fs))
for(n in 1:length(fs)){
 f = fs[n]
 tp = brick(f, varname='tp')
 j.lst = extract(tp, jaras.sp)
 for(i in 1:length(j.lst)){
   pj = apply(j.lst[[i]], 2, mean)
   h6 = matrix(pj, nc=2, byrow=T)
   rownames(h6) = substr(gsub('X', '', matrix(names(pj), nc=2, byrow=T)[,1]), 1,11)
   res.m[i,n] = sum(h6)*1000
 }
}
jaras.sf$y10prec = rowMeans(res.m)
# Here we select sampling locations
# Since it depends on the selected levels for each stratum, the random seed of grts and more inportantly the actual human/bee family factor the original data is saved into data/Original_env.RData
jaras.sf$gddkatb = cut(jaras.sf$gdd, breaks=c(0, 1306, 1650, 2000))
levels(jaras.sf$gddkatb) = c('-', 'I', 'J')

jaras.sf$preckatb = cut(jaras.sf$y10prec, breaks=c(0,533,580,741))
levels(jaras.sf$preckatb) = c('-', 'O', 'P')

jaras.sf$kombib = paste(jaras.sf$gddkatb, jaras.sf$preckatb, sep='')
jaras.c.sp@data$kombib = jaras.sf$kombib

round((table(jaras.c.sp@data$kombib)/175)*20)
## 
## IO IP JO JP 
##  3  7  5  5
strata_n = c(IO=3, IP=7, JO=5, JP=5)

set.seed(12)

set.seed(1970)
library(spsurvey)
strat_eqprob = grts(jaras.sf, n_base=strata_n, stratum_var = "kombib")

library(tmap)
tm_shape(jaras.sf) +
  tm_polygons("kombib",
              title="GDD x TP",
              legend.format=list(text.separator='-'), textNA='No', lwd=0
  )  +
  tm_shape(strat_eqprob$sites_base) +
  tm_dots(size=1.0) +
  tm_layout(inner.margins=0, legend.position = c(.85, .1), outer.margins=0, frame=F)