Flu NA data exploration

Simple machine learning methods for genotype to phenotype prediction

Author
Affiliation

Istvan Csabai

ELTE Eötvös Loránd University

Published

June 8, 2023

Short summary
In this notebook, we explore the relation between mutations and the antigenic properties of the flu NA protein. In collaboration with Ron Fouchier (EMC), Daniel Remondini (UNIBO), and Francesco Durazzi (UNIBO)

1 Data

1.1 The titer table

Titer data for 301 strains measured against 25 sera. The dates of the flu strains are encoded in the strain names.

Code
# imports used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.manifold import MDS
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# df = pd.read_json('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8.ace')
df = pd.read_csv('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_dist.csv')
df = df.rename(columns={"Unnamed: 0":"strain"})
df
strain F12050 F12052 F12021 F12026 F12027 F12029 F12031 F12034 F12035 ... F12048 F13039 F13041 F13045 F14004 F20028 F20030 F20031 F20034 F20035
0 M1/57 1.574416 3.402557 2.898778 1.942968 7.229467 7.093932 7.939687 9.517563 10.359909 ... 9.615901 12.123658 12.828431 12.156852 13.593330 11.677755 14.704255 15.289147 15.123345 16.697156
1 B1/68 2.600630 1.430592 3.880875 2.615082 6.864780 6.216929 6.578444 7.853448 7.738927 ... 8.441434 9.317535 10.502744 9.998033 11.486862 9.712056 12.089350 12.497773 12.437677 13.903903
2 BI/16190/68 3.011607 1.195503 4.316032 2.675801 6.538964 5.848507 6.188585 7.434459 7.276061 ... 8.140292 8.973482 10.118619 9.659428 11.182185 9.304252 11.655434 12.054789 12.000709 13.458979
3 BI/21793/72 7.559219 4.792659 8.788941 5.179885 1.874458 1.322527 1.207710 2.748615 5.021993 ... 7.520932 9.757911 9.416751 9.660916 11.522350 7.563519 10.007944 10.281167 10.315149 11.478862
4 BI/1761/76 7.689574 4.992261 8.939485 5.262361 1.545108 1.211213 1.464863 2.929424 5.314439 ... 7.522217 9.992182 9.556238 9.789711 11.645269 7.641819 10.178269 10.485362 10.496395 11.672114
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
296 rec A/Hong Kong/2671/2019 15.341979 12.378078 16.883226 13.812782 10.841371 10.086912 10.042302 9.007271 6.301289 ... 8.565258 5.863700 4.496721 6.152295 6.982223 5.022823 1.221563 1.358903 0.831859 1.841209
297 rec A/SouthAustralia/34/2019 15.753855 12.787633 17.182149 14.297536 11.494104 10.685924 10.389738 9.263233 6.418597 ... 9.602043 6.270547 5.450944 7.065334 7.865231 6.094661 2.147825 0.836129 1.483691 0.680975
298 rec A/Netherlands/16/2020 15.090016 12.109388 16.643504 13.518047 10.459409 9.719283 9.698070 8.667638 6.036871 ... 8.283305 5.851700 4.367475 6.031341 6.947526 4.717432 1.189683 1.515029 0.981041 2.132923
299 rec A/Netherlands/122/2020 16.855485 13.849411 18.172795 15.362140 12.403461 11.611856 11.043661 9.755267 7.122762 ... 11.133307 7.770089 7.105910 8.720389 9.514380 7.596464 3.792909 2.257337 3.140030 1.052983
300 rec A/Cambodia/e0826360/2020 16.228734 13.254948 17.613645 14.776721 11.951957 11.143226 10.745410 9.561260 6.756559 ... 10.239361 6.802866 6.088806 7.695222 8.460796 6.726722 2.793297 1.354244 2.126825 0.184532

301 rows × 25 columns

Code
# sanity check: are any of the 301 titer rows exact duplicates?
df.iloc[:,1:].shape, df.iloc[:,1:].drop_duplicates().shape

1.2 Amino acid sequences

Aligned neuraminidase (NA) sequences for the 301 flu strains.

  • BI/16190/68 is one of the earliest strains, so we may take it as the reference (the same strain was used in the HA study)
Code
fastaNames = ! grep ">" ../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta | sed 's/>//g'
tableNames = df['strain']
#tableNames[:5]
set(fastaNames)==set(tableNames.values) # check if both data sources have the same samples
Code
from Bio import SeqIO

ref_id = 'BI/16190/68'

fasta_seq = []
fasta_id = []

fnam = "../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta"

fasta_sequences = SeqIO.parse(open(fnam),'fasta')       
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    fasta_seq.append(sequence)
    fasta_id.append(name)
    if name==ref_id:
        ref_seq1nt = sequence

fasta_df = pd.DataFrame(np.column_stack([fasta_id,fasta_seq]),columns=['strain','seq'])
Code
# SeqIO truncates record ids at the first whitespace, so restore the full strain names
fasta_df['strain'] = fastaNames
fasta_df
strain seq
0 AK/4/93 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECNSPP...
1 AM/1609/77 MNPNQKIITIGSVSLTIATVCFLMQIAILVTTVTLHFKQYECDSPP...
2 AM/4112/92 MNPNQKIITIGSVSLTIATICFLMQITILVTTVTLHFKQYECNSPP...
3 AT/211/89 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
4 AT/3572/88 MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
... ... ...
296 rec A/Switzerland/8060/2017 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
297 rec A/Switzerland/9715293/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
298 wt A/Netherlands/2249/2013 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
299 wt A/Perth/16/2009 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
300 wt A/Victoria/361/2011 MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...

301 rows × 2 columns

Code
fasta_df.iloc[:,1:].shape, fasta_df.iloc[:,1:].drop_duplicates().shape
Code
df = pd.merge(fasta_df,df,on="strain")
df.head(3)
Code
# the strain name ends in a 2- or 4-digit year; expand two-digit years
# (values below 23 belong to the 2000s, remaining small values to the 1900s)
years = df['strain'].str.split("/").str[-1].values.astype(int)
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)
years
Code
dfy = df.copy()
#dfy.iloc[:,2:-1] = dfy.iloc[:,2:-1].reindex(sorted(dfy.iloc[:,2:-1].columns), axis=1)
dfy['year'] = years
dfy = dfy.sort_values('year')
fig, ax = plt.subplots(1,1,figsize=(16,8))
names = dfy.iloc[:,2:-1].columns
strains = dfy['strain'].values
yy = dfy.year.values 
img = ax.imshow(dfy.iloc[:,2:-1].T,aspect='auto',interpolation='nearest')
plt.title("All titer values")
plt.xlabel("strains ordered by years")
plt.ylabel("sera")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
#ax.set_xticks(range(len(strains)))
#ax.set_xticklabels(strains,rotation=45,size=3)
ax.set_xticks(range(len(yy)))
ax.locator_params(nbins=10, axis='x')
ax.set_xticklabels(yy,rotation=45)
_ =fig.colorbar(img,label="measured titer")

1.3 Remove duplicates

Code
print("Number of all strains:",len(df))
df = df.drop_duplicates(subset=['seq'])
print("Number strains after removing duplicates:",len(df))
Number of all strains: 301
Number strains after removing duplicates: 203
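
How do the 301 strains collapse into 203 unique sequences? A quick sketch (using fasta_df, which still holds all 301 strains, so it can be run even after the drop above):

Code
# Sketch: sizes of the groups of strains that share an identical NA sequence
dup_sizes = fasta_df['seq'].value_counts()
print((dup_sizes > 1).sum(), "sequences are shared by more than one strain")
print(dup_sizes[dup_sizes > 1].sum(), "strains belong to such groups")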
Code
# recompute the years for the deduplicated strain set
years = df['strain'].str.split("/").str[-1].values.astype(int)
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)

1.4 MDS projection of titer data

There are 25 measured values for each strain. These 25-dimensional data vectors are projected down to 3D with multidimensional scaling (MDS). This method tries to preserve the topological information of the original data. Please note that re-running MDS on the same data may yield a different projection depending on the random_state seed value.

Code
embedding = MDS(n_components=3, random_state=42)
#Xmds_transformed = embedding.fit_transform(Xmds[:,np.argsort(model2.feature_importances_)[::-1][:200]])
titer_mds = embedding.fit_transform(df.iloc[:,2:].values)
titer_mds.shape
# Wall time: 57.6 s
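
The seed dependence is easy to demonstrate (a sketch; the seed values are arbitrary): two runs reach comparable stress, but the resulting point clouds differ by a rotation or reflection.

Code
# Sketch: two MDS runs with different seeds give similar stress values
# but differently oriented embeddings
emb_a = MDS(n_components=3, random_state=42).fit(df.iloc[:,2:].values)
emb_b = MDS(n_components=3, random_state=7).fit(df.iloc[:,2:].values)
print(emb_a.stress_, emb_b.stress_)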
Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0].set_xlabel("titer MDS0")
axs[0].set_ylabel("titer MDS1")
#axs[0].set_title('titer01')

axs[1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('titer02')
axs[1].set_xlabel("titer MDS0")
axs[1].set_ylabel("titer MDS2")

im = axs[2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('titer12')
axs[2].set_xlabel("titer MDS1")
axs[2].set_ylabel("titer MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("3D MDS projections of the titer data")

Try rotating to find an optimal view in 3D!

Code
%matplotlib notebook
Code
fig = px.scatter_3d(x=titer_mds[:,0],y=titer_mds[:,1],z=titer_mds[:,2],
              color=years)
fig.show()

2 Blosum representation of amino acids and mutations

In bioinformatics, a BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices score alignments between evolutionarily divergent protein sequences and are based on local alignments. There are different versions; we will use BLOSUM62.
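
For reference, newer Biopython versions (≥ 1.75) ship the same matrix, so the hand-typed table below can be cross-checked (a sketch):

Code
# Sketch: load BLOSUM62 from Biopython and spot-check two entries
from Bio.Align import substitution_matrices

bl62 = substitution_matrices.load("BLOSUM62")
print(bl62["W", "W"], bl62["W", "A"])  # expected: 11.0 and -3.0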

Code
# https://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

#   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
blosum62_raw = np.array ([
['A',  4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0, -2, -1,  0], 
['R', -1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3, -1,  0, -1], 
['N', -2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3,  3,  0, -1], 
['D', -2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3,  4,  1, -1], 
['C',  0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1, -3, -3, -2], 
['Q', -1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2,  0,  3, -1], 
['E', -1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['G',  0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3, -1, -2, -1], 
['H', -2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3,  0,  0, -1], 
['I', -1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3, -3, -3, -1], 
['L', -1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1, -4, -3, -1], 
['K', -1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2,  0,  1, -1], 
['M', -1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1, -3, -1, -1], 
['F', -2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1, -3, -3, -1], 
['P', -1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2, -2, -1, -2], 
['S',  1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2,  0,  0,  0], 
['T',  0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0, -1, -1,  0], 
['W', -3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3, -4, -3, -2], 
['Y', -2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1, -3, -2, -1], 
['V',  0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4, -3, -2, -1], 
['B', -2, -1,  3,  4, -3,  0,  1, -1,  0, -3, -4,  0, -3, -3, -2,  0, -1, -4, -3, -3,  4,  1, -1], 
['Z', -1,  0,  0,  1, -3,  3,  4, -2,  0, -3, -3,  1, -1, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['X',  0, -1, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2,  0,  0, -2, -1, -1, -1, -1, -1]
])

blosum62 = pd.DataFrame(blosum62_raw).rename(columns={0:'AminoAcid'})
cols = blosum62.columns.drop('AminoAcid')
blosum62[cols] = blosum62[cols].apply(pd.to_numeric, errors='coerce')

# gaps ('-') are scored as minus the diagonal (self-match) value of the
# aligned amino acid; the '-' vs '-' entry is set to 0
gap_scores = list(-np.diag(blosum62_raw[:,1:].astype(int)))
blosum62['-'] = gap_scores
blosum62.loc[len(blosum62)] = ['-'] + gap_scores + [0]

aminoAcids = list(blosum62.AminoAcid.values)
blosum62_matrix = blosum62.iloc[:,1:].values.astype(float)
#
blosum62.columns = np.concatenate([['AminoAcids'],aminoAcids])

blosum62
AminoAcids A R N D C Q E G H ... P S T W Y V B Z X -
0 A 4 -1 -2 -2 0 -1 -1 0 -2 ... -1 1 0 -3 -2 0 -2 -1 0 -4
1 R -1 5 0 -2 -3 1 0 -2 0 ... -2 -1 -1 -3 -2 -3 -1 0 -1 -5
2 N -2 0 6 1 -3 0 0 0 1 ... -2 1 0 -4 -2 -3 3 0 -1 -6
3 D -2 -2 1 6 -3 0 2 -1 -1 ... -1 0 -1 -4 -3 -3 4 1 -1 -6
4 C 0 -3 -3 -3 9 -3 -4 -3 -3 ... -3 -1 -1 -2 -2 -1 -3 -3 -2 -9
5 Q -1 1 0 0 -3 5 2 -2 0 ... -1 0 -1 -2 -1 -2 0 3 -1 -5
6 E -1 0 0 2 -4 2 5 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -5
7 G 0 -2 0 -1 -3 -2 -2 6 -2 ... -2 0 -2 -2 -3 -3 -1 -2 -1 -6
8 H -2 0 1 -1 -3 0 0 -2 8 ... -2 -1 -2 -2 2 -3 0 0 -1 -8
9 I -1 -3 -3 -3 -1 -3 -3 -4 -3 ... -3 -2 -1 -3 -1 3 -3 -3 -1 -4
10 L -1 -2 -3 -4 -1 -2 -3 -4 -3 ... -3 -2 -1 -2 -1 1 -4 -3 -1 -4
11 K -1 2 0 -1 -3 1 1 -2 -1 ... -1 0 -1 -3 -2 -2 0 1 -1 -5
12 M -1 -1 -2 -3 -1 0 -2 -3 -2 ... -2 -1 -1 -1 -1 1 -3 -1 -1 -5
13 F -2 -3 -3 -3 -2 -3 -3 -3 -1 ... -4 -2 -2 1 3 -1 -3 -3 -1 -6
14 P -1 -2 -2 -1 -3 -1 -1 -2 -2 ... 7 -1 -1 -4 -3 -2 -2 -1 -2 -7
15 S 1 -1 1 0 -1 0 0 0 -1 ... -1 4 1 -3 -2 -2 0 0 0 -4
16 T 0 -1 0 -1 -1 -1 -1 -2 -2 ... -1 1 5 -2 -2 0 -1 -1 0 -5
17 W -3 -3 -4 -4 -2 -2 -3 -2 -2 ... -4 -3 -2 11 2 -3 -4 -3 -2 -11
18 Y -2 -2 -2 -3 -2 -1 -2 -3 2 ... -3 -2 -2 2 7 -1 -3 -2 -1 -7
19 V 0 -3 -3 -3 -1 -2 -2 -3 -3 ... -2 -2 0 -3 -1 4 -3 -2 -1 -4
20 B -2 -1 3 4 -3 0 1 -1 0 ... -2 0 -1 -4 -3 -3 4 1 -1 -4
21 Z -1 0 0 1 -3 3 4 -2 0 ... -1 0 -1 -3 -2 -2 1 4 -1 -4
22 X 0 -1 -1 -1 -2 -1 -1 -1 -1 ... -2 0 0 -2 -1 -1 -1 -1 -1 1
23 - -4 -5 -6 -6 -9 -5 -5 -6 -8 ... -7 -4 -5 -11 -7 -4 -4 -4 1 0

24 rows × 25 columns

2.1 BLOSUM62 substitution w.r.t. the reference sequence

BI/16190/68 is one of the earliest strains, so we take it as the reference (the same was used in the HA study). We calculate the BLOSUM62 substitution vectors for the mutations in each strain with respect to this reference.

Please note that the FASTA file contained spurious extra single-amino-acid deletions at the last positions; these were removed.

Intra-sequence deletions are encoded with minus the diagonal (self-substitution) BLOSUM62 score of the matching reference amino acid.
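
A toy example of the encoding (a sketch with made-up three-residue sequences, using the blosum62_matrix and aminoAcids objects built above):

Code
# Sketch: encode a toy query against a toy reference; an identical residue gets
# the (positive) diagonal score, a deletion gets minus the reference diagonal
toy_ref = "MNP"
toy_seq = "MK-"
vec = [ blosum62_matrix[aminoAcids.index(a),aminoAcids.index(r)]
        for a, r in zip(toy_seq, toy_ref) ]
print(vec)  # [5.0, 0.0, -7.0]: M->M = +5, N->K = 0, deleted P -> -7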

Code
dummy = []
for x in df.iterrows():
    idx1 = x[1]['strain']
    #print(idx1)
    # there are sporadic lower case letters and probably false indels
    seq1 = x[1]['seq'].upper()
    if len(seq1)>len(ref_seq1nt):
        seq1 = seq1.replace('-','')
#    else: # spurious single deletions are replaced with ref
#        ii = seq1.find('-')
#        if ii>=0:
#            s = list(seq1)
#            s[ii] = ref_seq1nt[ii]
#            seq1 = ''.join(s)
    dummy.append([idx1,
                  [ blosum62_matrix[aminoAcids.index(seq1[i]),aminoAcids.index(ref_seq1nt[i])]  for i in range(len(seq1)) ]])
        
ref_blosum62 = pd.DataFrame(dummy,columns=['strain','blosum62'])

#CPU times: user 128 ms, sys: 15.7 ms, total: 143 ms
Code
embedding = MDS(n_components=3, random_state=42)
blosum_mds= embedding.fit_transform(np.row_stack(ref_blosum62['blosum62'].values))
blosum_mds.shape
#1min 16s

The sequence projections are similar to, but not identical with, the titer MDS projections. This is expected: beyond rotations, mirror images, and top-bottom flips, other small differences can occur. On the other hand, these projections also correlate nicely with the date, perhaps even better than the titer projections. This makes sense, as sequences evolve steadily with time while antigenic changes do not always follow.
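
One way to quantify the similarity (a sketch, assuming SciPy is available): Procrustes analysis aligns the two point clouds up to translation, rotation, reflection and scaling, and returns a disparity score.

Code
# Sketch: disparity between the titer and sequence embeddings after optimal alignment
from scipy.spatial import procrustes

mtx1, mtx2, disparity = procrustes(titer_mds, blosum_mds)
print(disparity)  # smaller values mean more similar configurations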

Code
%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(blosum_mds[:,0],blosum_mds[:,1],c=years, cmap=cmap, s=siz)
#axs[0].set_title('blosum01')
axs[0].set_xlabel("BLOSUM62 MDS0")
axs[0].set_ylabel("BLOSUM62 MDS1")

axs[1].scatter(blosum_mds[:,0],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('blosum02')
axs[1].set_xlabel("BLOSUM62 MDS0")
axs[1].set_ylabel("BLOSUM62 MDS2")


im = axs[2].scatter(blosum_mds[:,1],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('blosum12')
axs[2].set_xlabel("BLOSUM62 MDS1")
axs[2].set_ylabel("BLOSUM62 MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("Projections of the BLOSUM62 encoded sequence data")

Try rotating to find an optimal view in 3D!

Code
fig = px.scatter_3d(x=blosum_mds[:,0],y=blosum_mds[:,1],z=blosum_mds[:,2],
              color=years)
fig.show()

2.2 Predict date from titer

Up to now we used unsupervised projections. Now we perform basic machine learning regressions: part of the data is used as a training set to fit the relations, and the model is then evaluated on an independent test set. The train-test split can be done randomly or based on dates. The first option makes prediction easier, since similar data can be present in both the training and the test set. Extrapolating future properties would be the most interesting; interpolation is easier.

First, let us check whether we can estimate the date from the titer values. This may not have any direct use, but it informs us about the relation between the properties and, in a sense, measures the information content.
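
Since the same RMS and MAD evaluation recurs in every experiment below, a small helper could wrap it (a sketch; the name report_errors is ours):

Code
# Sketch of a helper computing the error measures used throughout this notebook
def report_errors(y_true, y_pred, label=""):
    rms = np.sqrt(np.mean((y_pred - y_true)**2))  # root mean square error
    mad = np.mean(np.abs(y_pred - y_true))        # mean absolute deviation
    print(label, 'RMS:', rms, ' MAD:', mad)
    return rms, mad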

Code
%matplotlib inline
X = df.iloc[:,2:].values 
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.39037686470911254
Test RMS: 3.866197790653565  MAD: 2.5511748530778915

As expected, it is hard to extrapolate into the future, since the training set contains no future dates. The exact values are off, but a trend is visible.

Code

X = df.iloc[:,2:].values 
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.3762572373169865
Test RMS: 17.695494435932396  MAD: 15.699180703931628

2.3 Predict date from amino acids

The same as before, but now the model's inputs are not the titer values but the BLOSUM62-encoded sequences.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Blosum62")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.2968988425143859
Test RMS: 1.8825734305831914  MAD: 0.9007159532010938

2.4 Important positions

The model allows us to list the AA positions that were most influential in estimating the date of the strain.

Code

#  - train on all data
plt.figure(figsize=(12,6))
model.fit(X, years)
plt.bar(x=range(len(model.feature_importances_)),height=model.feature_importances_)
plt.plot(model.feature_importances_,'o')
plt.xlabel("NA - amino acid position")
plt.ylabel("Feature importance")
Text(0, 0.5, 'Feature importance')

Code
top20 = np.argsort(model.feature_importances_)[::-1]
dftop_date = pd.DataFrame(index=top20.astype(int), data=model.feature_importances_[top20],columns=['Importance']).head(20).T
dftop_date.rename_axis(columns="Position")
Position 142 380 140 148 368 248 42 214 400 337 401 366 371 146 463 345 434 220 198 389
Importance 0.33694 0.172388 0.103847 0.071032 0.060787 0.03746 0.035222 0.02284 0.017074 0.015146 0.012363 0.012339 0.010792 0.008597 0.008354 0.007542 0.007469 0.007191 0.005671 0.005365
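
The positions above are 0-based indices into the alignment columns, which may differ from standard NA numbering. To see what actually varies at the top position, we can tabulate the residues there (a sketch, using position 142 from the table above):

Code
# Sketch: residues occurring at one influential alignment column
pos = 142  # 0-based alignment index of the top feature above
print(df['seq'].str.upper().str[pos].value_counts())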

2.5 Extrapolate future

As with the titers, it is not possible to guess future dates, but the trend is even better correlated.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5,c='orange', label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.3508946620338886
Test RMS: 15.45656849472422  MAD: 13.64992307797976

2.6 Interpolate

Now the test set is not the last decades but the strains dated between 1990 and 2000. Interpolation is pretty good in this range.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange',label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()
Train RMS: 0.28871900588519556
Test RMS: 2.0734810260544148  MAD: 1.6127714884907578

2.7 Predict titer MDS 3D map coordinates from BLOSUM62 (not that important)

We use the BLOSUM62-encoded sequences as input and try to estimate the 3 MDS titer coordinates. The task is not easy: the model must also mimic the MDS projection, which is a complex nonlinear transformation.

For the training set the correlation is very good, but that may be the result of overfitting; on the test set the errors are larger.

Code
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []

for i in range(3):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    print('Train RMS:',RMS_train)
    print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
for i in range(3):
    axs[i].scatter(y_train[:,i],trains[i], alpha=1, label='train')
    axs[i].scatter(y_test[:,i],tests[i], alpha=0.5, c='orange', label= 'test')
    axs[i].set_title("Titer MDS "+str(i)+" from Blosum62")
_ = plt.legend()
Train RMS: 0.8467581574082075
Test RMS: 2.3886533922034943  MAD: 1.7520059195020274
Train RMS: 1.5348564419999207
Test RMS: 3.141217283960353  MAD: 2.3909111793389775
Train RMS: 1.490509233475044
Test RMS: 3.3366229858250156  MAD: 2.5586576079737684
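
To check how much of the good training fit is overfitting, a cross-validated error can be computed for a single coordinate (a sketch; 5-fold CV with default hyperparameters):

Code
# Sketch: 5-fold cross-validated RMS for titer MDS coordinate 0
from sklearn.model_selection import cross_val_score

cv_rms = -cross_val_score(GradientBoostingRegressor(), X, y[:,0],
                          cv=5, scoring='neg_root_mean_squared_error').mean()
print(cv_rms)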

Code

%matplotlib inline
fig, axs = plt.subplots(2, 3, figsize=(16,12), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0,0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0,0].set_title('titer01')
axs[0,1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,1].set_title('titer02')
im = axs[0,2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,2].set_title('titer12')

axs[1,0].scatter(trains[0],trains[1],c=year_train, cmap=cmap, s=siz)
axs[1,0].scatter(tests[0],tests[1],c=year_test, cmap=cmap, s=siz)
axs[1,0].set_title('predicted titer01')

axs[1,1].scatter(trains[0],trains[2],c=year_train, cmap=cmap, s=siz)
axs[1,1].scatter(tests[0],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,1].set_title('predicted titer02')

im = axs[1,2].scatter(trains[1],trains[2],c=year_train, cmap=cmap, s=siz)
im = axs[1,2].scatter(tests[1],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,2].set_title('predicted titer12')

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')

2.7.1 Importance scores

Here we can get importance scores for each of the three MDS-coordinate models. They differ: for each coordinate, different positions appear to be important.

Code
# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()

model = GradientBoostingRegressor()

nTiter = y.shape[1]
importance_blosum_MDS3D = []

for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_MDS3D.append(model.feature_importances_)
importance_blosum_MDS3D = np.array(importance_blosum_MDS3D)
Code
fig, ax = plt.subplots(1,1,figsize=(16,4))
names = np.arange(len(importance_blosum_MDS3D),dtype=int).astype(str)
img = ax.imshow((importance_blosum_MDS3D+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")

Code
x = importance_blosum_MDS3D.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_MDS = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_MDS.rename_axis(columns="Position")
Position 368 335 42 244 312 337 198 369 246 80 436 401 220 17 400 140 247 92 430 346
Importance 0.244614 0.069826 0.064236 0.061688 0.05997 0.053575 0.048406 0.037802 0.032628 0.029656 0.026239 0.020708 0.018504 0.016499 0.013726 0.011957 0.010052 0.008089 0.007759 0.00667
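
Several positions appear both here and in the date-prediction top list above; the overlap can be computed directly (a sketch, using the two top-20 tables built above):

Code
# Sketch: positions important both for date prediction and for the titer MDS map
common = set(dftop_date.columns) & set(dftop_MDS.columns)
print(sorted(common))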

2.8 Predict individual titer values from blosum - Random split

This is more interesting: can we predict the original 25 titer values (not the projected MDS) from the sequences?

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]

for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)


#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

The correlation is not bad. Let us look at all the titer values cumulatively in one plot.

Code

rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN  RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Random split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.9 Predict individual titer values from blosum - date split

Same as above, but instead of a random train-test split, the strains dated 1990-2000 form the test set and all others the training set.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = y[idx]
y_test = y[~idx]

#X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
#year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
#idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

Code

rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN before 1990 and after 2000    RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST 1990-2000    RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Date split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.10 Importance scores

These scores may be more relevant for the immune properties. It is interesting that different positions come up with high importance scores for different sera. Here there is no train-test split; we use all the data.

Code
# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
importance_blosum_titer = []
for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_titer.append(model.feature_importances_)
importance_blosum_titer = np.array(importance_blosum_titer)
Code
fig, ax = plt.subplots(1,1,figsize=(16,8))
names = df.iloc[:,2:].columns
img = ax.imshow((importance_blosum_titer+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")
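
To make the per-serum differences explicit, the single most important position for each serum can be listed (a sketch):

Code
# Sketch: the most important alignment position for each serum
top_pos = importance_blosum_titer.argmax(axis=1)
for serum, pos in zip(names, top_pos):
    print(serum, pos)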

We can get a cumulative top list by averaging the importance scores over all the titer predictions.

Code
x = importance_blosum_titer.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_blosum_titer = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_blosum_titer.rename_axis(columns="Position")
Position 337 335 22 343 80 198 369 17 92 312 436 41 368 220 42 247 196 338 244 149
Importance 0.129865 0.091392 0.074904 0.058259 0.047636 0.036571 0.036069 0.035766 0.035186 0.033318 0.033074 0.033059 0.031427 0.022868 0.020601 0.015602 0.014866 0.014212 0.013396 0.012325
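
Finally, the mean importance profile of the titer models can be compared with that of the MDS-coordinate models (a sketch; a high correlation would suggest both views pick up similar positions):

Code
# Sketch: Pearson correlation between the two mean importance profiles
prof_titer = importance_blosum_titer.mean(axis=0)
prof_mds = importance_blosum_MDS3D.mean(axis=0)
print(np.corrcoef(prof_titer, prof_mds)[0,1])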