Find emerging mutations that were not present a few weeks ago but now start to dominate many samples. We do not distinguish the type of mutation (indels, etc.); we only count any type at the position where it is registered.
We use the multiple sequence alignment (MSA) and metadata of all GISAID SARS-CoV-2 sequences. As of 2021.11.03 it is 4.8M samples. For all positions of all samples in the MSA FASTA files the difference from the reference EPI_ISL_402124 is recorded and loaded into an SQL database. For this dataset this provided almost 4 billion rows. Since many of these are early deletions ('-') or non-detections ('n') we focus only on more real mutations, and reduce the data table to 158M mutations.
We looked for mutations that have very low prevalence before August 2021 (week 32 of 2021) and much higher prevalence (at least 40x more affected samples than before) after that. This date is selected because it is beyond the relatively wide spread of the original Delta variant.
We could identify mutations with very recent rapid growth. The two top positions are interestingly not in the S (spike) gene:
As a function of time there is a clear, rapidly growing trend. Some other positions also show growth with a lower overall count.
We gratefully acknowledge both the originating and submitting laboratories for the sequence data in GISAID EpiCoV on which the SARS-CoV-2 variants data are partially based.
# Notebook convenience cell: injects a small jQuery snippet into the rendered
# notebook that hides all code cells by default and adds a button to toggle
# their visibility (useful when sharing the notebook as a report).
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
# load python libraries
import pandas as pd
import pandas.io.sql as sqlio
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import psycopg2
import datetime
# this connects to database and executes a query
def psql(query):
    """Run *query* against the GISAID Postgres database and return a DataFrame.

    Parameters
    ----------
    query : str
        A SQL statement that returns rows (SELECT ...).

    Returns
    -------
    pandas.DataFrame with the query result.
    """
    # NOTE(review): credentials are hardcoded in the notebook -- move them to
    # environment variables or a config file before sharing this file.
    conn = psycopg2.connect(
        dbname='csabai',
        host='pg.coveo',
        user='csabai',
        password='ohXai7ee'
    )
    try:
        # FIX: the original leaked the connection if read_sql_query raised;
        # try/finally guarantees the connection is always closed.
        return sqlio.read_sql_query(query, conn)
    finally:
        conn.close()
def psql_execute(query):
    """Execute a SQL statement that returns no rows (DDL/DML) with autocommit.

    Parameters
    ----------
    query : str
        One or more SQL statements (e.g. DROP TABLE / SELECT ... INTO ...).

    Returns
    -------
    bool: True on success, False if the connection could not be established.
    """
    conn = None
    try:
        conn = psycopg2.connect(
            dbname='csabai',
            host='pg.coveo',
            user='csabai',
            password='ohXai7ee'
        )
        # autocommit so DDL takes effect without an explicit COMMIT
        conn.autocommit = True
    except Exception:
        # FIX: narrowed from a bare `except:` (which also swallows
        # KeyboardInterrupt/SystemExit), and report failure to the caller
        # instead of falling through and returning True.
        print("Connection failed.")
        return False
    try:
        # FIX: close the cursor and connection even if execute() raises.
        with conn.cursor() as cursor:
            cursor.execute(query)
    finally:
        conn.close()
    return True
# Number of distinct samples in the merged GISAID mutation table.
query='''SELECT COUNT(DISTINCT seqid) AS number_of_samples FROM merged_gisaid_20211103;'''
psql(query)
# Approximate row count (~number of recorded mutations) from table statistics;
# pg_class.reltuples is the planner's estimate -- much cheaper than COUNT(*)
# on a table of this size, but only approximate.
query='''SELECT reltuples AS number_of_mutations FROM pg_class where relname = 'merged_gisaid_20211103';'''
psql(query)
%%time
query = '''
DROP TABLE IF EXISTS after_week32_gisaid_20211103;
SELECT pos, SUM(cnt) AS cnt, AVG(CAST(cnt AS REAL)) AS avg
INTO after_week32_gisaid_20211103
FROM perweek_perpos_gisaid_20211103
WHERE date_isoyear=2021 AND date_isoweek>32 GROUP BY pos;
DROP TABLE IF EXISTS before_week32_gisaid_20211103;
SELECT pos, SUM(cnt) AS cnt, AVG(CAST(cnt AS REAL)) AS avg
INTO before_week32_gisaid_20211103
FROM perweek_perpos_gisaid_20211103
WHERE date_isoyear BETWEEN 2020 AND 2021 AND date_isoweek<=32 GROUP BY pos;
'''
psql_execute(query)
# Load the two per-position summary tables and merge them on position.
query = ''' SELECT * FROM after_week32_gisaid_20211103'''
after32_n = psql(query).sort_values('pos')
query = ''' SELECT * FROM before_week32_gisaid_20211103'''
before32_n = psql(query).sort_values('pos')
# Outer merge keeps positions seen in only one period; missing counts become 0.
# Suffixes: *_x = before the pivot week, *_y = after.
_ = pd.merge(before32_n, after32_n, how='outer', on='pos').fillna(0)
# Normalize by the total mutation count of each period so the two can be compared.
total_before = _.cnt_x.sum()
total_after = _.cnt_y.sum()
_['cnt_x_norm'] = _['cnt_x'] / total_before
_['cnt_y_norm'] = _['cnt_y'] / total_after
# NOTE(review): the avg norms divide the weekly AVERAGE by the total COUNT of
# the period -- confirm this mixed normalization is intended.
_['avg_x_norm'] = _['avg_x'] / total_before
_['avg_y_norm'] = _['avg_y'] / total_after
We look for mutation positions with very few samples before, and many samples after, the pivot day. These points would be in the upper left corner of the plot. We are not interested in positions that already had high prevalence for a long time.
# Scatter of per-position sample counts: before the pivot (x) vs. after (y).
# Emerging mutations sit in the upper-left region (rare before, common after).
plt.figure(figsize=[8,8])
plt.plot(_.cnt_x, _.cnt_y, '.')
plt.xlabel('# of mutated samples before')
# FIX: label typo ("ofmutated" -> "of mutated")
plt.ylabel('# of mutated samples after')
plt.title('Each dot corresponds to a SARS-CoV-2 genome position. \n Axes are the counts of samples that have the given mutation')
# diagonal reference line: equal prevalence before and after
plt.plot(np.array([1,1e6]), np.array([1,1e6]), '-', label='1x')
plt.legend()
# Zoom into positions that were rare before the pivot (cnt_x < 1000).
plt.figure(figsize=[8,8])
rare_before = _[_.cnt_x < 1000]
plt.plot(rare_before.cnt_x, rare_before.cnt_y, '.')
plt.xlabel('# of mutated samples before')
# FIX: label typo ("ofmutated" -> "of mutated")
plt.ylabel('# of mutated samples after')
plt.title('Zoomed version. Each dot corresponds to a SARS-CoV-2 genome position. \n Axes are the counts of samples that have the given mutation')
# reference line marking the 40x growth threshold used for selection below
plt.plot(np.array([1,1e3]), np.array([1,40*1e3]), '-', label='40x')
plt.legend()
# Same comparison on the WEEKLY AVERAGE counts, log-log scale.
plt.figure(figsize=[8,8])
plt.loglog(_.avg_x, _.avg_y, '.')
# FIX: the axes show avg_x/avg_y (weekly averages), not raw counts, and the
# original labels had the "ofmutated" typo.
plt.xlabel('avg weekly # of mutated samples before')
plt.ylabel('avg weekly # of mutated samples after')
plt.title('Normalized, Log scale. Each dot corresponds to a SARS-CoV-2 genome position. \n Axes are the normalized counts of samples that have the given mutation')
# NOTE(review): the x=0 endpoints are dropped on a log axis, so only the
# upper end of these reference lines renders; a small positive start value
# would draw them fully.
plt.loglog(np.array([0,1e5]), np.array([0,40*1e5]), '-', label='40x')
plt.loglog(np.array([0,1e5]), np.array([0,1e5]), '-', label='1x')
plt.legend()
# Candidate emerging mutations by TOTAL counts: fewer than 1000 affected
# samples before the pivot and at least 40x growth after it.
_[(_.cnt_x<1000) & (_.cnt_y>40*_.cnt_x)].sort_values('cnt_y')
# Same selection applied to the WEEKLY AVERAGE counts.
_[(_.avg_x<1000) & (_.avg_y>40*_.avg_x)].sort_values('avg_y')
# Final candidate position list (by weekly averages), strongest growth first.
posList = _[(_.avg_x<1000) & (_.avg_y>40*_.avg_x)].sort_values('avg_y',ascending=False).pos.values
posList
%%time
# Pull every mutation record at the candidate positions, joined with:
#   location : human-readable sampling location
#   wl_cnt   : total samples per week per location (denominator for local ratios)
#   w_cnt    : total samples per week worldwide (denominator for global ratios)
# NOTE(review): the position list in the IN (...) clause is hardcoded -- it
# must stay in sync with posList computed above; regenerate it whenever the
# selection thresholds or the dataset change.
query ='''
SELECT m.*,l.location,wl.cnt as wl_cnt, w.cnt as w_cnt
FROM merged_gisaid_20211103 m JOIN location l ON m.location_code=l.location_code
JOIN samplesperweek_perlocation_gisaid_20211103 wl
ON wl.date_isoyear=m.date_isoyear AND wl.date_isoweek=m.date_isoweek AND wl.location_code=m.location_code
JOIN samplesperweek_gisaid_20211103 w
ON w.date_isoyear=m.date_isoyear AND w.date_isoweek=m.date_isoweek
WHERE pos IN (13797, 10967, 12421, 11195, 23506, 71, 7352, 2307, 14551, 25434, 24984, 8834, 15520)
'''
posAll40x_w32 = psql(query)
# Convert each (ISO year, ISO week) pair to the calendar date of its Monday.
# BUGFIX: the original format "%Y-%W-%w" parses with non-ISO directives (%W is
# the Monday-based week-of-year, %Y the calendar year), which mislabels weeks
# by up to seven days around year boundaries. date_isoyear/date_isoweek come
# from Postgres isoyear/isoweek, so the matching ISO 8601 directives are
# %G (ISO year), %V (ISO week) and %u (ISO weekday, 1 = Monday).
posAll40x_w32['weekdate'] = [
    datetime.datetime.strptime('-'.join([str(_d[0]), str(_d[1]), '1']), "%G-%V-%u")
    for _d in posAll40x_w32[['date_isoyear', 'date_isoweek']].values
]
# Weekly count of affected samples per position (w_cnt kept as the denominator).
_ = posAll40x_w32.groupby(['pos', 'weekdate', 'w_cnt']).size().reset_index().rename(columns={0: 'cnt1'})
_ = _.sort_values('weekdate')
# Time series of the affected-sample ratio for each candidate position
# (linear y scale), restricted to the well-sampled date range.
d1 = datetime.datetime.strptime("2020-05-01", "%Y-%m-%d")
d2 = datetime.datetime.strptime("2021-11-10", "%Y-%m-%d")
for p in posList:
    _1 = _[(_.pos == p) & (_.weekdate > d1) & (_.weekdate < d2)]
    plt.figure(figsize=[16, 8])
    # ratio = samples carrying the mutation / all samples that week
    plt.plot(_1.weekdate, _1.cnt1 / _1.w_cnt, '.-', lw=3)
    plt.xlim(d1, d2)
    # FIX: the x axis is the sampling week, not the genome position
    plt.xlabel('week')
    plt.ylabel('Affected sample ratio')
    plt.ylim(0, 1)
    plt.title('position=' + str(p))
    plt.show()
# Same time series on a logarithmic y scale, which makes the early
# exponential-growth phase visible.
d1 = datetime.datetime.strptime("2020-05-01", "%Y-%m-%d")
d2 = datetime.datetime.strptime("2021-11-10", "%Y-%m-%d")
for p in posList:
    _1 = _[(_.pos == p) & (_.weekdate > d1) & (_.weekdate < d2)]
    plt.figure(figsize=[16, 8])
    plt.semilogy(_1.weekdate, _1.cnt1 / _1.w_cnt, '.-', lw=3)
    plt.xlim(d1, d2)
    # FIX: the x axis is the sampling week, not the genome position
    plt.xlabel('week')
    plt.ylabel('Affected sample ratio')
    plt.ylim(1e-5, 1)
    plt.title('Log y scale. position=' + str(p))
    plt.show()
The non-uniform, non-representative collection and sharing of sequence data makes identifying the location of origin problematic.
# Inspect the available columns before the per-country breakdown.
posAll40x_w32.columns
# Geographic breakdown for the top candidate position (15520) after the pivot.
dPivot = datetime.datetime.strptime("2021-08-01", "%Y-%m-%d")
_ = posAll40x_w32[ (posAll40x_w32.pos==15520) & (posAll40x_w32.weekdate>dPivot)]
# per-week, per-location counts of affected samples
loc = _.groupby(['weekdate','location','wl_cnt']).size().reset_index().rename(columns={0:'cnt_perweek_perlocation'})
# GISAID locations look like "Continent / Country / ..."; take the country part.
# (Assumes every location string has at least two ' / '-separated fields -- TODO confirm.)
loc['country'] = [c[1] for c in loc.location.str.split(' / ').values]
loc_country = loc.groupby(['weekdate','country']).sum().reset_index()
# ratio of affected samples within each country-week
loc_country['normCnt'] = loc_country['cnt_perweek_perlocation']/loc_country['wl_cnt']
loc_country = loc_country.sort_values('normCnt',ascending=False)
# NOTE(review): summing the weekly ratios over weeks yields a rough ranking
# score per country, not a true prevalence -- interpret with care.
locSum = loc_country.groupby('country').sum().reset_index().sort_values('normCnt',ascending=False)
locSum.head(20)
# earliest submissions carrying the mutation at position 15520
posAll40x_w32[ (posAll40x_w32.pos==15520)].sort_values('weekdate',ascending=True)[['location','weekdate']].head(20)