Flu NA data exploration

Short summary

In this notebook, we explore the relation between mutations and antigenic properties of flu NA protein. In collaboration with Ron Fouchier (EMC), Daniel Remondini (UNIBO), Francesco Durazzi (UNIBO)

1 Data

1.1 The titer table

Titer data for 301 strains measured against 24 sera. Date of the flu strains are encoded in the sample names.

Code

###---#| output: false
# df = pd.read_json('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8.ace')
df = pd.read_csv('../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_dist.csv')
df = df.rename(columns={"Unnamed: 0":"strain"})
df

	strain	F12050	F12052	F12021	F12026	F12027	F12029	F12031	F12034	F12035	...	F12048	F13039	F13041	F13045	F14004	F20028	F20030	F20031	F20034	F20035
0	M1/57	1.574416	3.402557	2.898778	1.942968	7.229467	7.093932	7.939687	9.517563	10.359909	...	9.615901	12.123658	12.828431	12.156852	13.593330	11.677755	14.704255	15.289147	15.123345	16.697156
1	B1/68	2.600630	1.430592	3.880875	2.615082	6.864780	6.216929	6.578444	7.853448	7.738927	...	8.441434	9.317535	10.502744	9.998033	11.486862	9.712056	12.089350	12.497773	12.437677	13.903903
2	BI/16190/68	3.011607	1.195503	4.316032	2.675801	6.538964	5.848507	6.188585	7.434459	7.276061	...	8.140292	8.973482	10.118619	9.659428	11.182185	9.304252	11.655434	12.054789	12.000709	13.458979
3	BI/21793/72	7.559219	4.792659	8.788941	5.179885	1.874458	1.322527	1.207710	2.748615	5.021993	...	7.520932	9.757911	9.416751	9.660916	11.522350	7.563519	10.007944	10.281167	10.315149	11.478862
4	BI/1761/76	7.689574	4.992261	8.939485	5.262361	1.545108	1.211213	1.464863	2.929424	5.314439	...	7.522217	9.992182	9.556238	9.789711	11.645269	7.641819	10.178269	10.485362	10.496395	11.672114
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
296	rec A/Hong Kong/2671/2019	15.341979	12.378078	16.883226	13.812782	10.841371	10.086912	10.042302	9.007271	6.301289	...	8.565258	5.863700	4.496721	6.152295	6.982223	5.022823	1.221563	1.358903	0.831859	1.841209
297	rec A/SouthAustralia/34/2019	15.753855	12.787633	17.182149	14.297536	11.494104	10.685924	10.389738	9.263233	6.418597	...	9.602043	6.270547	5.450944	7.065334	7.865231	6.094661	2.147825	0.836129	1.483691	0.680975
298	rec A/Netherlands/16/2020	15.090016	12.109388	16.643504	13.518047	10.459409	9.719283	9.698070	8.667638	6.036871	...	8.283305	5.851700	4.367475	6.031341	6.947526	4.717432	1.189683	1.515029	0.981041	2.132923
299	rec A/Netherlands/122/2020	16.855485	13.849411	18.172795	15.362140	12.403461	11.611856	11.043661	9.755267	7.122762	...	11.133307	7.770089	7.105910	8.720389	9.514380	7.596464	3.792909	2.257337	3.140030	1.052983
300	rec A/Cambodia/e0826360/2020	16.228734	13.254948	17.613645	14.776721	11.951957	11.143226	10.745410	9.561260	6.756559	...	10.239361	6.802866	6.088806	7.695222	8.460796	6.726722	2.793297	1.354244	2.126825	0.184532

301 rows × 25 columns

1.2 Amino acid sequences

Aligned neuraminidase (NA) sequennces for the 301 flu strains.

BI/16190/68 is one of the earliest cases, we may take that as reference (Same was used in the HA study)

Code

fastaNames = ! grep ">" ../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta | sed 's/>//g'
tableNames = df['strain']
#tableNames[:5]
set(fastaNames)==set(tableNames.values) # check if both data sources have the same samples

Code

from Bio import SeqIO

ref_id = 'BI/16190/68'

fasta_seq = []
fasta_id = []

fnam = "../NA_data/Mless_PA45792_HK5694_PE1609_5Num_Titers_lessCO286_NL11801_NL36306_OK588_clusters8_alignment.fasta"

fasta_sequences = SeqIO.parse(open(fnam),'fasta')       
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    fasta_seq.append(sequence)
    fasta_id.append(name)
    if name==ref_id:
        ref_seq1nt = sequence

fasta_df = pd.DataFrame(np.column_stack([fasta_id,fasta_seq]),columns=['strain','seq'])

Code

# SeqIO does not parse names with whitespaces
fasta_df['strain'] = fastaNames
fasta_df

	strain	seq
0	AK/4/93	MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECNSPP...
1	AM/1609/77	MNPNQKIITIGSVSLTIATVCFLMQIAILVTTVTLHFKQYECDSPP...
2	AM/4112/92	MNPNQKIITIGSVSLTIATICFLMQITILVTTVTLHFKQYECNSPP...
3	AT/211/89	MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
4	AT/3572/88	MNPNQKIITIGSVSLTIATICFLMQIAILVTTVTLHFKQYECSSPP...
...	...	...
296	rec A/Switzerland/8060/2017	MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
297	rec A/Switzerland/9715293/2013	MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
298	wt A/Netherlands/2249/2013	MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
299	wt A/Perth/16/2009	MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...
300	wt A/Victoria/361/2011	MNPNQKIITIGSVSLTISTICFFMQIAILITTVTLHFKQYEFNSPP...

301 rows × 2 columns

Code

df = pd.merge(fasta_df,df,on="strain")
df.head(3)

Code

years = df['strain'].str.split("/").str[-1].values.astype(int)
years = np.array([ y+(y<2000)*1900 for y in [y+(y<23)*2000 for y in years ]], dtype=int)
years

1.3 MDS projection of titer data

There are 24 measured values for each strain. These 24 dimensional data vectors are projected down to 3D with multidimensional scaling (MDS). This method tries to keep the topology indformation of the original data. Please note that a new MDS projection of the same data may provide a different projection depending on the random_state seed value.

Code

embedding = MDS(n_components=3, random_state=42)
#Xmds_transformed = embedding.fit_transform(Xmds[:,np.argsort(model2.feature_importances_)[::-1][:200]])
titer_mds = embedding.fit_transform(df.iloc[:,2:].values)
titer_mds.shape
# Wall time: 57.6 s

Code

%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0].set_xlabel("titer MDS0")
axs[0].set_ylabel("titer MDS1")
#axs[0].set_title('titer01')

axs[1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('titer02')
axs[1].set_xlabel("titer MDS0")
axs[1].set_ylabel("titer MDS2")

im = axs[2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('titer12')
axs[2].set_xlabel("titer MDS1")
axs[2].set_ylabel("titer MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("3D MDS projections of the titer data")

Try rotating to optimal view in 3D!

Code

%matplotlib notebook

Code

fig = px.scatter_3d(x=titer_mds[:,0],y=titer_mds[:,1],z=titer_mds[:,2],
              color=years)
fig.show()

2 Blosum representation of amino acids and mutations

In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. There are different versions, we will use the BLOSUM62

Code

# https://www.ncbi.nlm.nih.gov/Class/FieldGuide/BLOSUM62.txt

#   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
blosum62_raw = np.array ([
['A',  4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0, -2, -1,  0], 
['R', -1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3, -1,  0, -1], 
['N', -2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3,  3,  0, -1], 
['D', -2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3,  4,  1, -1], 
['C',  0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1, -3, -3, -2], 
['Q', -1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2,  0,  3, -1], 
['E', -1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['G',  0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3, -1, -2, -1], 
['H', -2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3,  0,  0, -1], 
['I', -1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3, -3, -3, -1], 
['L', -1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1, -4, -3, -1], 
['K', -1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2,  0,  1, -1], 
['M', -1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1, -3, -1, -1], 
['F', -2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1, -3, -3, -1], 
['P', -1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2, -2, -1, -2], 
['S',  1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2,  0,  0,  0], 
['T',  0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0, -1, -1,  0], 
['W', -3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3, -4, -3, -2], 
['Y', -2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1, -3, -2, -1], 
['V',  0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4, -3, -2, -1], 
['B', -2, -1,  3,  4, -3,  0,  1, -1,  0, -3, -4,  0, -3, -3, -2,  0, -1, -4, -3, -3,  4,  1, -1], 
['Z', -1,  0,  0,  1, -3,  3,  4, -2,  0, -3, -3,  1, -1, -3, -1,  0, -1, -3, -2, -2,  1,  4, -1], 
['X',  0, -1, -1, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2,  0,  0, -2, -1, -1, -1, -1, -1]
])

blosum62 = pd.DataFrame(blosum62_raw).rename(columns={0:'AminoAcid'})
cols = blosum62.columns.drop('AminoAcid')
blosum62[cols] = blosum62[cols].apply(pd.to_numeric, errors='coerce')

blosum62['-'] = -np.diag(blosum62_raw[:,1:].astype(int))
#blosum62.loc[len(blosum62)] = list(np.append('-',np.append(-np.diag(blosum62_raw[:,1:].astype(int)),0) ))
_ = list(-np.diag(blosum62_raw[:,1:].astype(np.int64)))
blosum62.loc[len(blosum62)] = blosum62.loc[len(blosum62)-1]
blosum62.iloc[len(blosum62)-1,0] = '-'
blosum62.iloc[len(blosum62)-1,1:] = _ + [0]

aminoAcids = list(blosum62.AminoAcid.values)
blosum62_matrix = blosum62.iloc[:,1:].values.astype(float)
#
blosum62.columns = np.concatenate([['AminoAcids'],aminoAcids])

blosum62

	AminoAcids	A	R	N	D	C	Q	E	G	H	...	P	S	T	W	Y	V	B	Z	X	-
0	A	4	-1	-2	-2	0	-1	-1	0	-2	...	-1	1	0	-3	-2	0	-2	-1	0	-4
1	R	-1	5	0	-2	-3	1	0	-2	0	...	-2	-1	-1	-3	-2	-3	-1	0	-1	-5
2	N	-2	0	6	1	-3	0	0	0	1	...	-2	1	0	-4	-2	-3	3	0	-1	-6
3	D	-2	-2	1	6	-3	0	2	-1	-1	...	-1	0	-1	-4	-3	-3	4	1	-1	-6
4	C	0	-3	-3	-3	9	-3	-4	-3	-3	...	-3	-1	-1	-2	-2	-1	-3	-3	-2	-9
5	Q	-1	1	0	0	-3	5	2	-2	0	...	-1	0	-1	-2	-1	-2	0	3	-1	-5
6	E	-1	0	0	2	-4	2	5	-2	0	...	-1	0	-1	-3	-2	-2	1	4	-1	-5
7	G	0	-2	0	-1	-3	-2	-2	6	-2	...	-2	0	-2	-2	-3	-3	-1	-2	-1	-6
8	H	-2	0	1	-1	-3	0	0	-2	8	...	-2	-1	-2	-2	2	-3	0	0	-1	-8
9	I	-1	-3	-3	-3	-1	-3	-3	-4	-3	...	-3	-2	-1	-3	-1	3	-3	-3	-1	-4
10	L	-1	-2	-3	-4	-1	-2	-3	-4	-3	...	-3	-2	-1	-2	-1	1	-4	-3	-1	-4
11	K	-1	2	0	-1	-3	1	1	-2	-1	...	-1	0	-1	-3	-2	-2	0	1	-1	-5
12	M	-1	-1	-2	-3	-1	0	-2	-3	-2	...	-2	-1	-1	-1	-1	1	-3	-1	-1	-5
13	F	-2	-3	-3	-3	-2	-3	-3	-3	-1	...	-4	-2	-2	1	3	-1	-3	-3	-1	-6
14	P	-1	-2	-2	-1	-3	-1	-1	-2	-2	...	7	-1	-1	-4	-3	-2	-2	-1	-2	-7
15	S	1	-1	1	0	-1	0	0	0	-1	...	-1	4	1	-3	-2	-2	0	0	0	-4
16	T	0	-1	0	-1	-1	-1	-1	-2	-2	...	-1	1	5	-2	-2	0	-1	-1	0	-5
17	W	-3	-3	-4	-4	-2	-2	-3	-2	-2	...	-4	-3	-2	11	2	-3	-4	-3	-2	-11
18	Y	-2	-2	-2	-3	-2	-1	-2	-3	2	...	-3	-2	-2	2	7	-1	-3	-2	-1	-7
19	V	0	-3	-3	-3	-1	-2	-2	-3	-3	...	-2	-2	0	-3	-1	4	-3	-2	-1	-4
20	B	-2	-1	3	4	-3	0	1	-1	0	...	-2	0	-1	-4	-3	-3	4	1	-1	-4
21	Z	-1	0	0	1	-3	3	4	-2	0	...	-1	0	-1	-3	-2	-2	1	4	-1	-4
22	X	0	-1	-1	-1	-2	-1	-1	-1	-1	...	-2	0	0	-2	-1	-1	-1	-1	-1	1
23	-	-4	-5	-6	-6	-9	-5	-5	-6	-8	...	-7	-4	-5	-11	-7	-4	-4	-4	1	0

24 rows × 25 columns

2.1 BLOSUM62 substitution w.r.t the ref seq

BI/16190/68 is one of the earliest cases, we may take that as reference. Same was used in the HA study. We calculate the BLOSUM62 substitution vectors for the mutations in each strain with respect to this reference.

Please note that in the FASTA file there were spurious single amino-acid extra deletions at the last positions, they were deleted.

Intra-sequence deletions are encoded with the negative value of the matching AA.

Code

dummy = []
for x in fasta_df.iterrows():
    idx1 = x[1]['strain']
    #print(idx1)
    # there are sporadic lower case letters and probably false indels
    seq1 = x[1]['seq'].upper()
    if len(seq1)>len(ref_seq1nt):
        seq1 = seq1.replace('-','')
#    else: # spurious single deletions are replaced with ref
#        ii = seq1.find('-')
#        if ii>=0:
#            s = list(seq1)
#            s[ii] = ref_seq1nt[ii]
#            seq1 = ''.join(s)
    dummy.append([idx1,
                  [ blosum62_matrix[aminoAcids.index(seq1[i]),aminoAcids.index(ref_seq1nt[i])]  for i in range(len(seq1)) ]])
        
ref_blosum62 = pd.DataFrame(dummy,columns=['strain','blosum62'])

#CPU times: user 128 ms, sys: 15.7 ms, total: 143 ms

Code

embedding = MDS(n_components=3, random_state=42)
blosum_mds= embedding.fit_transform(np.row_stack(ref_blosum62['blosum62'].values))
blosum_mds.shape
#1min 16s

The sequence projections are similar, but not the same as the titer MDS projections. But we did not expect it either as beyond rotations and mirror images, top-bottom flip, other small differences are expected. On the other hand these projections also nicely correlate with the date, maybe even better than the titer projections. This can be understood, as sequences evolve with time, but sometimes antigenic changes do not follow.

Code

%matplotlib inline
fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0].scatter(blosum_mds[:,0],blosum_mds[:,1],c=years, cmap=cmap, s=siz)
#axs[0].set_title('blosum01')
axs[0].set_xlabel("BLOSUM62 MDS0")
axs[0].set_ylabel("BLOSUM62 MDS1")

axs[1].scatter(blosum_mds[:,0],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[1].set_title('blosum02')
axs[1].set_xlabel("BLOSUM62 MDS0")
axs[1].set_ylabel("BLOSUM62 MDS2")


im = axs[2].scatter(blosum_mds[:,1],blosum_mds[:,2],c=years, cmap=cmap, s=siz)
#axs[2].set_title('blosum12')
axs[2].set_xlabel("BLOSUM62 MDS1")
axs[2].set_ylabel("BLOSUM62 MDS2")

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')
_ = plt.suptitle("Projections of the BLOSUM62 encoded sequence data")

Try rotating to optimal view in 3D!

Code

fig = px.scatter_3d(x=blosum_mds[:,0],y=blosum_mds[:,1],z=blosum_mds[:,2],
              color=years)
fig.show()

2.2 Predict date from titer

Up to now we used unsupervised projections. Now we will perform basic machine learning regressions. Part of the data will be used as training set to fit the relations, than it will be evaluated on an independent test set. The train test split can be done randomly or based on dates. The first option makes the predictions easier as similar data can be present in the training and test set. Extrapolating future properties would be the most interesting, interpolation is easier.

First let check if we can estimate the date from the titer values. This may not has any use, but can inform us on the relation between properties and in a sense measure the information content.

Code

%matplotlib inline
X = df.iloc[:,2:].values 
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()

Train RMS: 0.5035031707908746
Test RMS: 2.725250995631021  MAD: 1.986651428801343

As expected it is hard to extrapolate future as the training set does not contain future dates. The exact values are not correct, but a trend can be seen.

Code


X = df.iloc[:,2:].values 
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()

Train RMS: 0.5910112899690743
Test RMS: 17.360129600702983  MAD: 15.309164997196916

2.3 Predict date from amino acids

The same as before, but now the model’s inputs are not the titer values, but the BLOSUM62 encoded sequences.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)

plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange', label= 'test')
plt.title("Random split - Blosum62")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()

Train RMS: 0.44203423102499495
Test RMS: 0.9996396331279678  MAD: 0.6920104662905252

2.4 Important positions

The model allows us to list the AA positions that were most influential in estimating the date of the strain.

Code


#  - train on all data
plt.figure(figsize=(12,6))
model.fit(X, years)
plt.bar(x=range(len(model.feature_importances_)),height=model.feature_importances_)
plt.plot(model.feature_importances_,'o')
plt.xlabel("NA - amino acid position")
plt.ylabel("Feature importance")

Text(0, 0.5, 'Feature importance')

Code

top20 = np.argsort(model.feature_importances_)[::-1]
dftop_date = pd.DataFrame(index=top20.astype(int), data=model.feature_importances_[top20],columns=['Importance']).head(20).T
dftop_date.rename_axis(columns="Position")

Position	142	140	148	380	368	42	220	400	214	366	371	345	401	266	248	463	312	146	198	337
Importance	0.368931	0.120453	0.104074	0.08837	0.054158	0.044541	0.037659	0.025245	0.018386	0.01421	0.011318	0.011128	0.009755	0.009006	0.008715	0.008032	0.006504	0.006491	0.00637	0.004328

2.5 Extrapolate future

As for the titer it is not possible to guess future dates, but we see even better correlated trend.

Code



X = np.row_stack(ref_blosum62['blosum62'].values)
idx = years<2000
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5,c='orange', label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()

Train RMS: 0.423688698237624
Test RMS: 16.237387002262007  MAD: 14.246689010423573

2.6 Interpolate

Now not the last decades are kept as test set, but dates between 1990 and 2000. Interpolation is pretty good for this range.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = years[idx]
y_test = years[~idx]

model = GradientBoostingRegressor()
#model = LinearRegression()

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

RMS_train = np.sqrt(np.mean((y_pred_train-y_train)**2))
RMS_test = np.sqrt(np.mean((y_pred_test-y_test)**2))
MAD_test = np.mean(np.abs(y_pred_test-y_test))
print('Train RMS:',RMS_train)
print('Test RMS:',RMS_test,' MAD:',MAD_test)


plt.figure(figsize=(6,6))
plt.scatter(y_train,y_pred_train, alpha=1, label='train')
plt.scatter(y_test,y_pred_test, alpha=0.5, c='orange',label= 'test')
plt.title("Date split - Titer")
plt.xlabel("True date")
plt.ylabel("Predicted date")
_ = plt.legend()

Train RMS: 0.40164925862060424
Test RMS: 2.506124099751673  MAD: 2.3518982738108387

2.7 Predict titer MDS 3D map coordinates from BLOSUM62 (not that important)

Use BLOSUM62 encoded sequences as input and try to estimate the 3 MDS titer coordinates. The task is not easy, now the model also tries to mimic the MDS projection, that is a complex nonlinear transformation.

For the training set the correlation is very good, but that may be result of overfitting. For the test set there are larger errors.

Code

X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []

for i in range(3):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    print('Train RMS:',RMS_train)
    print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

fig, axs = plt.subplots(1, 3, figsize=(16,6), constrained_layout=True)
for i in range(3):
    axs[i].scatter(y_train[:,i],trains[i], alpha=1, label='train')
    axs[i].scatter(y_test[:,i],tests[i], alpha=0.5, c='orange', label= 'test')
    axs[i].set_title("Titer MDS "+str(i)+" from Blosum62")
_ = plt.legend()

Train RMS: 1.686085448601462
Test RMS: 3.4235557468050892  MAD: 2.5787491053749187
Train RMS: 1.76055945608206
Test RMS: 2.9031495256923745  MAD: 2.1939659988484204
Train RMS: 1.0749697143048584
Test RMS: 2.189875130864545  MAD: 1.702785397625311

Code


%matplotlib inline
fig, axs = plt.subplots(2, 3, figsize=(16,12), constrained_layout=True)
cmap = 'viridis' #'gist_rainbow_r'
siz=100
axs[0,0].scatter(titer_mds[:,0],titer_mds[:,1],c=years, cmap=cmap, s=siz)
axs[0,0].set_title('titer01')
axs[0,1].scatter(titer_mds[:,0],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,1].set_title('titer02')
im = axs[0,2].scatter(titer_mds[:,1],titer_mds[:,2],c=years, cmap=cmap, s=siz)
axs[0,2].set_title('titer12')

axs[1,0].scatter(trains[0],trains[1],c=year_train, cmap=cmap, s=siz)
axs[1,0].scatter(tests[0],tests[1],c=year_test, cmap=cmap, s=siz)
axs[1,0].set_title('predicted titer01')

axs[1,1].scatter(trains[0],trains[2],c=year_train, cmap=cmap, s=siz)
axs[1,1].scatter(tests[0],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,1].set_title('predicted titer02')

im = axs[1,2].scatter(trains[1],trains[2],c=year_train, cmap=cmap, s=siz)
im = axs[1,2].scatter(tests[1],tests[2],c=year_test, cmap=cmap, s=siz)
axs[1,2].set_title('predicted titer12')

cbar = fig.colorbar(im,  shrink=1.0)
cbar.set_label('year')

2.7.1 Importance scores

Here we can get importance scores for all 3D coordinate model. They differ, as for each if them different positions seem to be important.

Code

# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = titer_mds.copy()

model = GradientBoostingRegressor()

nTiter = y.shape[1]
importance_blosum_MDS3D = []

for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_MDS3D.append(model.feature_importances_)
importance_blosum_MDS3D = np.array(importance_blosum_MDS3D)

Code

fig, ax = plt.subplots(1,1,figsize=(16,4))
names = np.arange(len(importance_blosum_MDS3D),dtype=int).astype(str)
img = ax.imshow((importance_blosum_MDS3D+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")

Code

x = importance_blosum_MDS3D.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_MDS = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_MDS.rename_axis(columns="Position")

Position	368	42	140	152	244	337	401	367	312	198	436	246	371	220	247	335	389	343	308	80
Importance	0.273574	0.102225	0.051161	0.046234	0.042765	0.039661	0.039077	0.037007	0.028736	0.022507	0.021308	0.01982	0.018747	0.017828	0.016492	0.011336	0.011193	0.010379	0.008198	0.007892

2.8 Predict individual titer values from blosum - Random split

This is more interesting: can we predict the original 24 titer values (not the projected MDS) from the sequences?

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()
X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]

for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)


#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

The correlation is not bad. Let’s see cumulatively for all 24 titer values in one plot.

Code


rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN  RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Random split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.9 Predict individual titer values from blosum - date split

Same as above but not random train-test split rather date range 1990-2000 is in the test and all others are in the train set.

Code


X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

idx = (years<1990)|(years>2000)
#X_train, X_test,y_train,y_test = train_test_split(X, years, test_size=0.8, random_state=14325)
X_train = X[idx] 
X_test = X[~idx]
y_train = y[idx]
y_test = y[~idx]

#X_train, X_test,y_train,y_test = train_test_split(X, y, test_size=0.5, random_state=14323)
#year_train, year_test = train_test_split(years, test_size=0.5, random_state=14323)
#idx_train, idx_test = train_test_split(range(len(years)), test_size=0.5, random_state=14323)

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
for i in range(nTiter):
    model.fit(X_train, y_train[:,i])

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    #RMS_train = np.sqrt(np.mean((y_pred_train-y_train[:,i])**2))
    #RMS_test = np.sqrt(np.mean((y_pred_test-y_test[:,i])**2))
    #MAD_test = np.mean(np.abs(y_pred_test-y_test[:,i]))
    #print('Train RMS:',RMS_train)
    #print('Test RMS:',RMS_test,' MAD:',MAD_test)
    trains.append(y_pred_train)
    tests.append(y_pred_test)

#fig, axs = plt.subplots(6, 4, figsize=(16,6), constrained_layout=True)
names = df.iloc[:,2:].columns
plt.figure(figsize=(16,16), constrained_layout=True)
for i in range(nTiter):
    ax = plt.subplot(6, 4, i + 1)
    ax.scatter(y_train[:,i],trains[i], alpha=0.5, label='train')
    ax.scatter(y_test[:,i],tests[i], alpha=0.5, c='orange' , label= 'test')
    #rTrain = np.corrcoef(y_train[:,i],trains[i])[0,1]
    rmsTest = np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
    rTest = np.corrcoef(y_test[:,i],tests[i])[0,1]
    ax.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
    ax.legend()
_ = plt.suptitle("Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)
#plt.legend()

Code


rmsTest = np.sqrt(np.mean((y_test.flatten()-np.array(tests).T.flatten())**2))
#np.sqrt(np.mean((y_test[:,i]-tests[i])**2))
rTest = np.corrcoef(y_test.flatten(),np.array(tests).T.flatten())[0,1]

rmsTrain = np.sqrt(np.mean((y_train.flatten()-np.array(trains).T.flatten())**2))
rTrain = np.corrcoef(y_train.flatten(),np.array(trains).T.flatten())[0,1]

#plt.figure(figsize=(6,6), constrained_layout=True)
fig, axs = plt.subplots(1, 2, figsize=(12,6), constrained_layout=True)

axs[0].scatter(y_train.flatten(),np.array(trains).T.flatten(), alpha=0.5, label='train')
axs[0].set_title("TRAIN before 1990 and after 2000    RMS:"+str(rmsTrain)[:4]+"    R:"+str(rTrain)[:4])
axs[0].set_xlabel("True titer")
axs[0].set_ylabel("Estimated titer")
axs[0].legend()
axs[1].scatter(y_test.flatten(),np.array(tests).T.flatten(), alpha=0.5, c='orange', label= 'test')
axs[1].set_title("TEST 1990-2000    RMS:"+str(rmsTest)[:4]+"    R:"+str(rTest)[:4])
axs[1].set_xlabel("True titer")
axs[1].set_ylabel("Estimated titer")
#plt.set_title(names[i]+" $RMS_{test}$"+str(rmsTest)[:4]+" $R_{test}$"+str(rTest)[:4])
axs[1].legend()
_ = plt.suptitle("Date split - Individual titer estimates from BLOSUM62 endcoded sequences",fontsize=14)

2.10 Importance scores

Thes scores may be more relevant for immune properties. Interesting to see that for different sera, different positions come up with high importance scores. Now there is no train-test split, we use all the data.

Code

# train on all
X = np.row_stack(ref_blosum62['blosum62'].values)
y = df.iloc[:,2:].values.copy()

model = GradientBoostingRegressor()
#model = LinearRegression()

trains = []
tests = []
nTiter = df.iloc[:,2:].values.shape[1]
importance_blosum_titer = []
for i in range(nTiter):
    model.fit(X, y[:,i])
    importance_blosum_titer.append(model.feature_importances_)
importance_blosum_titer = np.array(importance_blosum_titer)

Code

fig, ax = plt.subplots(1,1,figsize=(16,8))
names = df.iloc[:,2:].columns
img = ax.imshow((importance_blosum_titer+1e-7),aspect='auto',interpolation='nearest')
plt.title("")
plt.xlabel("NA amino acid position")
plt.ylabel("Titer ID")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
_ =fig.colorbar(img,label="importance")

We can get a cumulative top list by averaging the importance scores for the 24 titer predictions.

Code

x = importance_blosum_titer.mean(axis=0)
srt = np.argsort(x)[::-1]
dftop_blosum_titer = pd.DataFrame(index=srt.astype(int), data=x[srt],columns=['Importance']).head(20).T
dftop_blosum_titer.rename_axis(columns="Position")

Position	337	335	343	80	198	369	22	436	92	42	41	368	312	149	220	431	247	338	17	246
Importance	0.120426	0.098993	0.067251	0.050272	0.044978	0.043522	0.039964	0.039215	0.038438	0.037392	0.036661	0.03622	0.032844	0.027331	0.020773	0.018414	0.016897	0.013058	0.012213	0.011736