In [82]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[82]:

Investigating the occurrece of Q613H Delta variant

Other notation of the same mutation: p.Gln613His

Caveat:

This notebook was run on 15th October 2021, when 613,984 samples (from the 1,654,147 of sequenced samples at covid19dataportal) have been processed by the variant calling pipeline. The weekly number of samples is not uniform and we have a lag, so many of the recent samples are not processed yet.

In [1]:
import pandas as pd
import pandas.io.sql as sqlio
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import psycopg2
In [2]:
def psql(query):
    conn = psycopg2.connect("dbname=coveo user=****** password=******* host=postgres.coveo port=5432")
    dummy = sqlio.read_sql_query(query, conn)
    conn.close()
    return dummy

Delta variant defining mutations

There are four so called defining mutations. We have thelineage_def table in the SQL database that contains info for all catalogued lineages.

In [23]:
query = '''
SELECT * FROM lineage_def
WHERE variant_id='Delta'
'''
lineage_delta = psql(query)
lineage_delta
Out[23]:
variant_id pango type_variant amino_acid_change protein_codon_position ref_protein alt_protein gene effect snpeff_original_mut ref_pos_alt ref alt pos description
0 Delta B.1.617.2 VOC T19R 19.0 T R S missense_variant None C21618G C G 21618 This variant was first detected in India and r...
1 Delta B.1.617.2 VOC L452R 452.0 L R S missense_variant None T22917G T G 22917 This variant was first detected in India and r...
2 Delta B.1.617.2 VOC T478K 478.0 T K S missense_variant None C22995A C A 22995 This variant was first detected in India and r...
3 Delta B.1.617.2 VOC P681R 681.0 P R S missense_variant None C23604G C G 23604 This variant was first detected in India and r...

Delta variants with 1-4 defining mutations

First check how many Delta variant mutations are present in samples at each week. Here we collect information by combining the vcf, the lineage_def and the meta tables. They contain the genomic position and mutation, variant definition and dates and other metadata, repectively. In the figures we mark May 2021 as the date when Delta variants start to dominate.

In [61]:
%%time
query = '''
SELECT m.*, l.amino_acid_change, v.af, v.dp 
     FROM vcf v JOIN lineage_def l on v.pos=l.pos and v.ref=l.ref and v.alt=l.alt 
     JOIN meta m on v.ena_run = m.ena_run
     WHERE ann_num=1 
     AND af>3/SQRT(dp)
     AND variant_id = 'Delta'
     AND date_isoyear IN (2020,2021);
'''

delta_all = psql(query)
print(delta_all.shape)

# (481998, 42)
# CPU times: user 8.34 s, sys: 1.63 s, total: 9.98 s
# Wall time: 25.3 s
(481998, 42)
CPU times: user 8.34 s, sys: 1.63 s, total: 9.98 s
Wall time: 25.3 s

Convert year and week number to proper date type.

In [ ]:
_years = delta_all.date_isoyear.values.astype(int).astype(str)
_weeks = delta_all.date_isoweek.values.astype(int).astype(str)
_sep = ['-']*len(_years)
_dayofweek = ['-1']*len(_years)
_ = np.core.defchararray.add(np.core.defchararray.add(np.core.defchararray.add(_years,_sep),_weeks),_dayofweek)
weekDate = [datetime.datetime.strptime(_d, "%Y-%W-%w") for _d in _]   
delta_all['weekDate'] = weekDate

Count the number of samples that contain the different defining mutations.

In [64]:
delta_all_weekly = delta_all.groupby(['weekDate','amino_acid_change']).size().reset_index().rename(columns={0:'cnt'})
In [101]:
fig, ax = plt.subplots(2,1,figsize=[16,8])
fig.suptitle('Weekly # of samples with various Delta defining mutations on linear and logarithmic scales')


for aa in lineage_delta.amino_acid_change.values:
    _ = delta_all_weekly[delta_all_weekly.amino_acid_change==aa]
    ax[0].plot(_.weekDate,_.cnt,label=aa)
    
ax[0].set_xlabel("date")
ax[0].set_ylabel("# of samples")
ax[0].legend()
ax[0].set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax[0].axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')

for aa in lineage_delta.amino_acid_change.values:
    _ = delta_all_weekly[delta_all_weekly.amino_acid_change==aa]
    ax[1].semilogy(_.weekDate,_.cnt,label=aa)
    
ax[1].set_xlabel("date")
ax[1].set_ylabel("# of samples")
ax[1].legend()
ax[1].set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax[1].axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')
Out[101]:
<matplotlib.lines.Line2D at 0x7f3cc06f8190>

In one sample one or several of the mutations can be present. Check the number of defining mutations.

In [77]:
_ = delta_all.groupby(['weekDate','ena_run']).size().reset_index().rename(columns={0:'n'})
number_of_deltadefining = _.groupby(['weekDate','n']).size().reset_index().rename(columns={0:'cnt'})
In [102]:
fig, ax = plt.subplots(2,1,figsize=[16,8])
fig.suptitle('Weekly # of samples with 1,2,3 or all 4 Delta defining mutations on linear and logarithmic scales')


for i in range(1,5):
    _ = number_of_deltadefining[number_of_deltadefining.n==i]
    ax[0].plot(_.weekDate,_.cnt,label=i)
    
ax[0].set_xlabel("date")
ax[0].set_ylabel("# of samples")
ax[0].legend()
ax[0].set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax[0].axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')

for i in range(1,5):
    _ = number_of_deltadefining[number_of_deltadefining.n==i]
    ax[1].semilogy(_.weekDate,_.cnt,label=i)
    
ax[1].set_xlabel("date")
ax[1].set_ylabel("# of samples")
ax[1].legend()
ax[1].set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax[1].axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')
Out[102]:
<matplotlib.lines.Line2D at 0x7f3cbf504250>

Observations:

  • For most samples all the defining mutations are present, but there are few samples that have one or less than all four of the mutations earlier. Especially L452R emerged around 2020-09 and appeared in ~1000 samples per week before the "complete" Delta variants with all four defining mutations took over.

  • Before the complete Delta variant appeared the individual defining mutations appered in different samples. I.e. the different mutations (maybe except T19R) were present for a few weeks, but in separate samples (see the number of coincident mutations on the second set of figures above).

  • Samples with 1,2 and 3 (less than the all four) defining mutations are still present in ~100 samples per week. These may be lack of detections or true variants.

Search for samples with Q613H (Gln613His) mutation

The vcf table in the database contains per basis information and the meta table contians dates, and other metadata. Since the tables are indexed, it is very fast to search the whole database, we can get the information in less than 10 seconds.

In [83]:
%%time
query = '''SELECT * 
 FROM vcf JOIN meta on vcf.ena_run = meta.ena_run WHERE hgvs_p = 'p.Gln613His' 
     AND ann_num=1 
     AND af>3/SQRT(dp)
     AND clean_host = 'Homo sapiens'
     AND date_isoyear IN (2020,2021)
     '''
Q613H = psql(query)
print(Q613H.shape)

# (1172, 74)
# CPU times: user 53.9 ms, sys: 2.53 ms, total: 56.4 ms
# Wall time: 7.18 s
(1172, 74)
CPU times: user 53.9 ms, sys: 2.53 ms, total: 56.4 ms
Wall time: 7.18 s

In our dataset have found 1172 samples with Q613H mutation.

In [84]:
Q613H.head()
Out[84]:
ena_run chrom pos ref alt qual filter dp af sb ... collected_by broker_name center_name sample_capture_status fastq_ftp collection_date_submitted checklist clean_collection_date date_isoweek date_isoyear
0 ERR5289876 NC_045512 23401 G T NaN PASS 413 0.949153 NaN ... not provided COVID-19 Genomics UK Consortium Department of Pathology, University of Cambridge active surveillance in response to outbreak ftp.sra.ebi.ac.uk/vol1/fastq/ERR528/006/ERR528... 2021-01-16 ERC000033 2021-01-16 2 2021
1 SRR14403671 NC_045512 23401 G T 6175.0 PASS 234 0.764957 0.0 ... Quest Diagnostics Incorporated None None None ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/071/SRR144... 2021-02-11 None 2021-02-11 6 2021
2 ERR5329912 NC_045512 23401 G T 49314.0 PASS 40169 0.998183 0.0 ... None None Wellcome Sanger Institute active surveillance in response to outbreak ftp.sra.ebi.ac.uk/vol1/fastq/ERR532/002/ERR532... 2021-02-05 ERC000011 2021-02-05 5 2021
3 ERR5742282 NC_045512 23401 G T 49314.0 PASS 12000 0.998417 0.0 ... None None Wellcome Sanger Institute active surveillance in response to outbreak ftp.sra.ebi.ac.uk/vol1/fastq/ERR574/002/ERR574... 2021-04-07 ERC000011 2021-04-07 14 2021
4 ERR5905646 NC_045512 23401 G T 49314.0 PASS 18144 0.996473 0.0 ... None None Wellcome Sanger Institute active surveillance in response to outbreak ftp.sra.ebi.ac.uk/vol1/fastq/ERR590/006/ERR590... 2021-04-20 ERC000011 2021-04-20 16 2021

5 rows × 74 columns

Convert year and week number to proper date type.

In [85]:
_years = Q613H.date_isoyear.values.astype(int).astype(str)
_weeks = Q613H.date_isoweek.values.astype(int).astype(str)
_sep = ['-']*len(_years)
_dayofweek = ['-1']*len(_years)
_ = np.core.defchararray.add(np.core.defchararray.add(np.core.defchararray.add(_years,_sep),_weeks),_dayofweek)
weekDate = [datetime.datetime.strptime(_d, "%Y-%W-%w") for _d in _]   
Q613H['weekDate'] = weekDate
In [100]:
fig, ax = plt.subplots(1,1,figsize=[16,4])
fig.suptitle('Allele frequency of all the samples with Q613H mutations')

ax.plot(Q613H.weekDate,Q613H.af,'o', alpha=0.3)
ax.set_xlabel("date")
ax.set_ylabel("AF")
ax.set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax.axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')
Out[100]:
<matplotlib.lines.Line2D at 0x7f3cc0bc3bd0>

Observation:

Q613H mutations were present in (relatively) many samples before the onset of Delta variant.

By country statistics

Normalized Delta sample share percentages by countries of origin

In [147]:
delta_all.groupby('clean_country').size()/delta_all.shape[0]*100
Out[147]:
clean_country
Angola             0.047511
Bangladesh         0.015975
Canada             0.001660
Estonia            0.017842
Greece             0.291495
India              0.000830
Iraq               0.000207
Ireland            2.307064
Israel             0.002905
Malawi             0.012241
Mozambique         0.003320
Netherlands        0.118050
Pakistan           0.012033
Peru               0.001245
Portugal           0.068672
Romania            0.000830
Rwanda             0.006224
Slovakia           0.081743
South Africa       0.285893
Spain              0.454359
Switzerland        0.005809
Turkey             0.002282
USA                8.342773
United Kingdom    87.913850
Zimbabwe           0.005187
dtype: float64

Normalized Q613H sample share percentages by countries of origin

In [148]:
Q613H.groupby('clean_country').size()/Q613H.shape[0]*100
Out[148]:
clean_country
Australia          0.085324
Austria            0.085324
Brazil             0.085324
Estonia            0.085324
Greece             0.682594
India              0.341297
Ireland            0.426621
Malawi             0.170648
Netherlands        0.085324
Pakistan           0.085324
Poland             0.170648
Portugal           0.682594
Rwanda             2.645051
South Africa       1.365188
Spain              1.109215
Turkey             0.085324
USA               19.624573
United Kingdom    71.672355
Zimbabwe           0.511945
dtype: float64

Observation

  • USA, Rwanda, South Africa, Spain are somewhat over represented in the Q613H set, w.r.t. Delta statistics.

Check samples with both Q613H and Delta defining mutations

Calculate the number of delta defining mutations ...

In [125]:
_definingMutCnt = delta_all.groupby('ena_run').size().reset_index().rename(columns={0:'n'})

... and merge it with the Q613H table

In [131]:
delta_Q613H = pd.merge(Q613H.iloc[:,1:][['ena_run','weekDate','dp','af']],_definingMutCnt,left_on='ena_run',right_on='ena_run',how='left')
delta_Q613H['n'] = delta_Q613H['n'].replace(np.nan, 0)

... and count weekly stats.

In [135]:
weekly_delta_Q613H = delta_Q613H.groupby(['weekDate','n']).size().reset_index().rename(columns={0:'cnt'})
In [150]:
fig, ax = plt.subplots(2,1,figsize=[16,8])
fig.suptitle('Samples with Q613H mutation and 0,1,2,3 or all 4 Delta defining mutations')

pid = 1
for i in range(0,5):
    _ = weekly_delta_Q613H[weekly_delta_Q613H.n==i]
    ax[pid].plot(_.weekDate,_.cnt,'o-',label=i)
    
ax[pid].set_xlabel("date")
ax[pid].set_ylabel("# of samples")
ax[pid].legend()
ax[pid].set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax[pid].axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')

pid = 0
_ = weekly_delta_Q613H.groupby('weekDate').sum().reset_index().rename(columns={0:'cnt'})

ax[pid].plot(_.weekDate,_.cnt,'o-',label='All Q613H')
    
ax[pid].set_xlabel("date")
ax[pid].set_ylabel("# of samples")
ax[pid].legend()
ax[pid].set_xlim(datetime.datetime.strptime("2020-1-1", "%Y-%W-%w"), datetime.datetime.strptime("2021-40-1", "%Y-%W-%w"))
ax[pid].axvline(x=datetime.datetime.strptime("2021-18-1", "%Y-%W-%w"),c='gray')
Out[150]:
<matplotlib.lines.Line2D at 0x7f3cc471bb10>

Observations:

  • Q613H mutations were present in (relatively) many samples before the onset of Delta variant but their overall number is very small relative to the total number of samples.
  • With acknowledging the caveats that we have a significant backlog in sample processing, the overall number of Q613H samples are decreasing. On the other hand the most recent samples that have Q613H usually also have all the four Delta defining mutations, too.
  • An interpretation: before May 2021, Q613H could survive alone, but now it cannot without the Delta mutations.