Emerging mutations in all samples

Version 2021.11.03 GISAID samples

Goal:

Find emerging mutations that were not present few weeks ago, but now start to dominate many samples. We do not distinguish type of mutations (indels, etc.), only count any type at the position where it is registered.

Data

We use the multiple sequence alingment (MSA) and metadata of all GISAID SARS-CoV-2 sequences. As of 2021.11.03 it is 4.8M samples. For all positions of all samples in the MSA FASTA files the difference from the reference EPI_ISL_402124 is recorded and loaded into an SQL database. For this dataset this provided almost 4 billion rows. Since many of these are early deletions ('-') or non-detections ('n') we focus only on more real mutations, and reduce the data table to 158M mutations.

Analysis

We looked for mutations, that have very low prevalence before August 2021 (week 32 of 2021) and higher prevalence (at least 40x affected samples than before) after that. This date is selected as a date that is beyond the relatively wide spread of the original Delta variant.

Technical isues:

  • Due to the downloaded data format, the initial database upload and indexing is long, may take several hours
  • Regular update may be faster if the only updates of MSA and metadata files' can be obtained. (TODO: now it seems that only th ecumulative is available)
  • After loading and indexind SQL database is fast an efficient for such studies. Most queries run in few minutes or seconds.

Results:

We could identify mutations with very recent rapid growth. The two top positions are interestingly not in the S (spike) gene:

  • position:15520, 657 samples before and 38862 samples after. Pol (RNA-directed RNA polymerase nsp12) gene. Responsible for replication and transcription of the viral RNA genome.
  • position:8834, 324 samples before and 24139 samples after. NSP4 (Non-structural protein 4) gene. Participates in the assembly of virally-induced cytoplasmic double-membrane vesicles necessary for viral replication.

As a function of time there is clear rapid growing trend. Some other positions also show growing with lower overall count.

Acknowledgement

We gratefully acknowledge both the originating and submitting laboratories for the sequence data in GISAID EpiCoV on which the SARS-CoV-2 variants data are partially based.

In [6]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[6]:
In [3]:
# load python libraries
import pandas as pd
import pandas.io.sql as sqlio
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import psycopg2

import datetime
In [4]:
# this connects to database and executes a query
def psql(query):
    conn = psycopg2.connect(
            dbname='csabai',
            host='pg.coveo',
            user='csabai',
            password='ohXai7ee'
        )
    dummy = sqlio.read_sql_query(query, conn)
    conn.close()
    return dummy
In [5]:
def psql_execute(query):
    conn = None

    try:
        conn = psycopg2.connect(
            dbname='csabai',
            host='pg.coveo',
            user='csabai',
            password='ohXai7ee'
        )
        conn.autocommit = True
    except:
        print("Connection failed.")

    if(conn != None):
        cursor = conn.cursor()
        cursor.execute(query)
        
    conn.close()
    return True

Number of rows and processed and all samples

In [223]:
query='''SELECT COUNT(DISTINCT seqid) AS number_of_samples FROM merged_gisaid_20211103;'''
psql(query)
Out[223]:
number_of_samples
0 4798115
In [222]:
query='''SELECT reltuples AS number_of_mutations FROM pg_class where relname = 'merged_gisaid_20211103';'''
psql(query)
Out[222]:
number_of_mutations
0 158626000.0

Select sample statistics before and after the pivot day

In [157]:
%%time
query = '''
DROP TABLE IF EXISTS after_week32_gisaid_20211103;
SELECT pos, SUM(cnt) AS cnt, AVG(CAST(cnt AS REAL)) AS avg 
INTO after_week32_gisaid_20211103
FROM perweek_perpos_gisaid_20211103
WHERE date_isoyear=2021 AND date_isoweek>32 GROUP BY pos;

DROP TABLE IF EXISTS before_week32_gisaid_20211103;
SELECT pos, SUM(cnt) AS cnt, AVG(CAST(cnt AS REAL)) AS avg 
INTO before_week32_gisaid_20211103
FROM perweek_perpos_gisaid_20211103 
WHERE date_isoyear BETWEEN 2020 AND 2021 AND date_isoweek<=32 GROUP BY pos;
'''
psql_execute(query)

query = ''' SELECT * FROM after_week32_gisaid_20211103'''
after32_n = psql(query)
after32_n = after32_n.sort_values('pos')

query = ''' SELECT * FROM before_week32_gisaid_20211103'''
before32_n = psql(query)
before32_n= before32_n.sort_values('pos')
CPU times: user 81.8 ms, sys: 24.1 ms, total: 106 ms
Wall time: 1.29 s
In [224]:
_ = pd.merge(before32_n, after32_n, how='outer', on='pos')
_ = _.fillna(0)
_['cnt_x_norm'] = _['cnt_x']/_.cnt_x.sum()
_['cnt_y_norm'] = _['cnt_y']/_.cnt_y.sum()
_['avg_x_norm'] = _['avg_x']/_.cnt_x.sum()
_['avg_y_norm'] = _['avg_y']/_.cnt_y.sum()

Compare mutations before and after the pivot day

We look for mutation positions with very few samples before, and many samples after the pivot day. These points would be in the upper left corner of the plot. We are not interested in positions that already have high prevelance for a long time.

In [241]:
plt.figure(figsize=[8,8])
plt.plot(_.cnt_x,_.cnt_y,'.')
#plt.loglog(np.array([0,4e-2]),np.array([0,4e-2]),'-')
plt.xlabel('# of mutated samples before')
plt.ylabel('# ofmutated samples after')
plt.title('Each dot corresponds to a SARS-CoV-2 genome position. \n Axes are the counts of samples that have the given mutation')
#plt.plot(np.array([1,1e6]),np.array([1,1e6]),'-')
plt.plot(np.array([1,1e6]),np.array([1,1e6]),'-',label='1x')
plt.legend()
Out[241]:
<matplotlib.legend.Legend at 0x7f29c086e1d0>
In [242]:
plt.figure(figsize=[8,8])
plt.plot(_[_.cnt_x<1000].cnt_x,_[_.cnt_x<1000].cnt_y,'.')

plt.xlabel('# of mutated samples before')
plt.ylabel('# ofmutated samples after')
plt.title('Zoomed version. Each dot corresponds to a SARS-CoV-2 genome position. \n Axes are the counts of samples that have the given mutation')
plt.plot(np.array([1,1e3]),np.array([1,40*1e3]),'-',label='40x')
plt.legend()
Out[242]:
<matplotlib.legend.Legend at 0x7f29b94c6610>
In [261]:
plt.figure(figsize=[8,8])
plt.loglog(_.avg_x,_.avg_y,'.')

#plt.plot(np.array([1,3e6]),np.array([1,3e6]),'-')

plt.xlabel('# of mutated samples before')
plt.ylabel('# ofmutated samples after')
plt.title('Normalized, Log scale. Each dot corresponds to a SARS-CoV-2 genome position. \n Axes are the normalized counts of samples that have the given mutation')
#plt.plot(np.array([1,1e6]),np.array([1,1e6]),'-')
plt.loglog(np.array([0,1e5]),np.array([0,40*1e5]),'-',label='40x')
plt.loglog(np.array([0,1e5]),np.array([0,1e5]),'-',label='1x')
plt.legend()
Out[261]:
<matplotlib.legend.Legend at 0x7f29b3481e90>

The most quickly emerging mutations

In [196]:
_[(_.cnt_x<1000) & (_.cnt_y>40*_.cnt_x)].sort_values('cnt_y')
Out[196]:
pos cnt_x avg_x cnt_y avg_y cnt_x_norm cnt_y_norm avg_x_norm avg_y_norm
11195 11195 14.0 1.166667 871.0 174.200000 1.410650e-07 0.000017 1.175541e-08 0.000003
14551 14551 10.0 1.000000 2090.0 298.571429 1.007607e-07 0.000041 1.007607e-08 0.000006
2307 2307 60.0 2.142857 3102.0 282.000000 6.045641e-07 0.000060 2.159157e-08 0.000005
8834 8834 324.0 11.172414 24139.0 2011.583333 3.264646e-06 0.000468 1.125740e-07 0.000039
15520 15520 657.0 38.647059 38862.0 3238.500000 6.619977e-06 0.000754 3.894104e-07 0.000063
In [197]:
_[(_.avg_x<1000) & (_.avg_y>40*_.avg_x)].sort_values('avg_y')
Out[197]:
pos cnt_x avg_x cnt_y avg_y cnt_x_norm cnt_y_norm avg_x_norm avg_y_norm
13797 13797 38.0 2.000000 1045.0 87.083333 3.828906e-07 0.000020 2.015214e-08 0.000002
10967 10967 22.0 1.466667 553.0 92.166667 2.216735e-07 0.000011 1.477823e-08 0.000002
12421 12421 60.0 2.000000 941.0 134.428571 6.045641e-07 0.000018 2.015214e-08 0.000003
11195 11195 14.0 1.166667 871.0 174.200000 1.410650e-07 0.000017 1.175541e-08 0.000003
23506 23506 113.0 4.913043 1305.0 217.500000 1.138596e-06 0.000025 4.950416e-08 0.000004
71 71 120.0 2.608696 2628.0 219.000000 1.209128e-06 0.000051 2.628539e-08 0.000004
7352 7352 75.0 3.125000 2480.0 248.000000 7.557051e-07 0.000048 3.148771e-08 0.000005
2307 2307 60.0 2.142857 3102.0 282.000000 6.045641e-07 0.000060 2.159157e-08 0.000005
14551 14551 10.0 1.000000 2090.0 298.571429 1.007607e-07 0.000041 1.007607e-08 0.000006
25434 25434 180.0 4.390244 5132.0 427.666667 1.813692e-06 0.000100 4.423640e-08 0.000008
24984 24984 209.0 5.500000 5281.0 440.083333 2.105898e-06 0.000102 5.541837e-08 0.000009
8834 8834 324.0 11.172414 24139.0 2011.583333 3.264646e-06 0.000468 1.125740e-07 0.000039
15520 15520 657.0 38.647059 38862.0 3238.500000 6.619977e-06 0.000754 3.894104e-07 0.000063
In [200]:
posList = _[(_.avg_x<1000) & (_.avg_y>40*_.avg_x)].sort_values('avg_y',ascending=False).pos.values
posList
Out[200]:
array([15520,  8834, 24984, 25434, 14551,  2307,  7352,    71, 23506,
       11195, 12421, 10967, 13797])
In [169]:
%%time

query ='''
SELECT m.*,l.location,wl.cnt as wl_cnt, w.cnt as w_cnt 
FROM merged_gisaid_20211103 m JOIN location l ON m.location_code=l.location_code 
JOIN samplesperweek_perlocation_gisaid_20211103 wl 
ON wl.date_isoyear=m.date_isoyear AND wl.date_isoweek=m.date_isoweek AND wl.location_code=m.location_code
JOIN samplesperweek_gisaid_20211103 w 
ON w.date_isoyear=m.date_isoyear AND w.date_isoweek=m.date_isoweek  
    WHERE pos IN (13797, 10967, 12421, 11195, 23506, 71,  7352,  2307, 14551, 25434, 24984,  8834, 15520)
'''
posAll40x_w32 = psql(query)
CPU times: user 861 ms, sys: 160 ms, total: 1.02 s
Wall time: 1.94 s
In [170]:
posAll40x_w32['weekdate'] = [datetime.datetime.strptime('-'.join([str(_d[0]),str(_d[1]),'1']), "%Y-%W-%w") for _d in posAll40x_w32[['date_isoyear','date_isoweek']].values]
In [263]:
_ = posAll40x_w32.groupby(['pos','weekdate','w_cnt']).size().reset_index().rename(columns={0:'cnt1'})
_ = _.sort_values('weekdate')

Percentage of total sequences at a given collectin_date that carries mutation at the position indicated in the figure title

In [265]:
d1 =  datetime.datetime.strptime("2020-05-01", "%Y-%m-%d")
d2 =  datetime.datetime.strptime("2021-11-10", "%Y-%m-%d")
for p in posList:
    _1 = _[(_.pos==p) & (_.weekdate>d1) & (_.weekdate<d2)]
    plt.figure(figsize=[16,8])
    plt.plot(_1.weekdate,_1.cnt1/_1.w_cnt,'.-',lw=3)
    plt.xlim(d1,d2)
    plt.xlabel('pos')
    plt.ylabel('Affected sample ratio')
    #plt.ylim(0,max(_1.cnt1/_1.w_cnt))
    plt.ylim(0,1)
    plt.title('position='+str(p))
    plt.show()

Emerging mutations on logarithmic scale

In [271]:
d1 =  datetime.datetime.strptime("2020-05-01", "%Y-%m-%d")
d2 =  datetime.datetime.strptime("2021-11-10", "%Y-%m-%d")
for p in posList:
    _1 = _[(_.pos==p) & (_.weekdate>d1) & (_.weekdate<d2)]
    plt.figure(figsize=[16,8])
    plt.semilogy(_1.weekdate,_1.cnt1/_1.w_cnt,'.-',lw=3)
    plt.xlim(d1,d2)
    plt.xlabel('pos')
    plt.ylabel('Affected sample ratio')
    #plt.ylim(0,max(_1.cnt1/_1.w_cnt))
    plt.ylim(1e-5,1)
    plt.title('Log y scale.   position='+str(p))
    plt.show()

Locations

The non-uniform, non-representative collection and sharing of data makes location identification problematic

In [272]:
posAll40x_w32.columns
Out[272]:
Index(['pos', 'ref', 'alt', 'seqid', 'virus_name', 'accession_id',
       'collection_date', 'submission_date', 'patient_age', 'n_content',
       'gc_content', 'sequence_length', 'type_code', 'location_code',
       'additional_location_information_code', 'host_code', 'gender_code',
       'clade_code', 'pango_lineage_code', 'pangolin_version_code',
       'variant_code', 'is_reference_code', 'is_complete_code',
       'is_high_coverage_code', 'is_low_coverage_code', 'date_isoweek',
       'date_isoyear', 'location', 'wl_cnt', 'w_cnt', 'weekdate'],
      dtype='object')
In [349]:
dPivot = datetime.datetime.strptime("2021-08-01", "%Y-%m-%d")
_ = posAll40x_w32[ (posAll40x_w32.pos==15520) & (posAll40x_w32.weekdate>dPivot)]
loc = _.groupby(['weekdate','location','wl_cnt']).size().reset_index().rename(columns={0:'cnt_perweek_perlocation'})
In [350]:
loc['country'] = [c[1] for c in loc.location.str.split(' / ').values]
In [351]:
loc_country = loc.groupby(['weekdate','country']).sum().reset_index()
loc_country['normCnt'] = loc_country['cnt_perweek_perlocation']/loc_country['wl_cnt']
loc_country = loc_country.sort_values('normCnt',ascending=False)
In [352]:
locSum = loc_country.groupby('country').sum().reset_index().sort_values('normCnt',ascending=False)
locSum.head(20)
Out[352]:
country wl_cnt cnt_perweek_perlocation normCnt
4 Pakistan 92 44 2.364789
10 United Kingdom 430890 37955 1.887371
9 USA 8311 624 0.901479
1 France 1560 231 0.852746
8 U.S. Virgin Islands 57 18 0.842857
6 Reunion 57 11 0.638528
2 Martinique 22 2 0.196429
3 Norway 1203 8 0.092637
0 Finland 568 7 0.058439
5 Puerto Rico 67 2 0.029851
7 Spain 44 1 0.022727
In [354]:
 posAll40x_w32[ (posAll40x_w32.pos==15520)].sort_values('weekdate',ascending=True)[['location','weekdate']].head(20)
Out[354]:
location weekdate
64573 Asia / Georgia / Tbilisi 2020-09-07
19358 North America / USA / Arizona 2020-09-28
58914 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-19
60645 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-19
62026 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-19
60139 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-19
61362 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-19
70688 Europe / France / Auvergne-Rhone-Alpes / Isere 2020-10-19
43214 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-26
26001 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-26
29316 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-26
31718 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-26
15169 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-10-26
61587 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-11-09
60834 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-11-09
89069 Europe / France / Auvergne-Rhone-Alpes / Isere 2020-11-09
62814 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-11-09
86989 Europe / France / Auvergne-Rhone-Alpes / Haute... 2020-11-09
28777 Europe / France / Auvergne-Rhone-Alpes / Rhone 2020-11-16
57053 North America / USA / Arizona 2020-11-23
In [ ]: