VeraoULisboa_INESCID

Personalized medicine

Some molecular markers were profiled for individuals with and without disease X.
Let us load the data acquired from this cohort study:

In [1]:

import pandas as pd
df = pd.read_csv('expression.csv')
df.head(20)

Out[1]:

	mRNA A	protein A	mRNA B	protein B	class
0	5.8	16.2	17.3	15.4	healthy
1	7.2	17.4	16.9	13.2	healthy
2	7.7	18.2	14.1	13.0	healthy
3	8.8	30.0	9.1	8.9	healthy
4	10.0	21.1	8.1	6.9	healthy
5	10.8	23.3	7.2	5.7	healthy
6	11.1	24.4	6.9	5.1	healthy
7	11.4	26.1	5.8	4.7	healthy
8	12.8	27.2	5.6	4.6	healthy
9	12.9	27.4	5.4	4.1	healthy
10	11.1	26.1	8.4	8.2	ill
11	11.8	26.3	8.2	8.0	ill
12	12.4	27.1	6.5	7.7	ill
13	12.9	28.6	6.4	6.3	ill
14	13.3	30.1	5.6	6.1	ill
15	13.9	30.3	5.1	5.8	ill
16	14.1	30.7	4.4	5.8	ill
17	14.2	31.3	3.9	4.7	ill
18	15.1	32.1	3.6	4.5	ill
19	18.1	34.9	2.5	2.1	ill

Let us now plot the population-wide expression of genes A and B and of their corresponding proteins:

In [4]:

import matplotlib.pyplot as plt
import seaborn as sns

dfm = df.melt(id_vars='class', var_name='columns')
sns.displot(dfm, x='value', col='columns', hue="class", kind="kde", rug=True,
            col_wrap=2, fill=True, facet_kws={'sharey': True, 'sharex': True})
plt.show()

1. Can you form some hypotheses about what may explain the differences between RNA and protein expression?

Answer notes

Expression of molecules for case-control individuals:
- gene/protein A is generally more highly expressed in ill individuals (e.g. an oncogene in the case of cancer)
- gene/protein B is generally less expressed in ill individuals (e.g. a tumor suppressor gene in the case of cancer)
- gene/protein A appears to be slightly more discriminative of the illness condition than B
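
One way to put a number on "more discriminative" is the fraction of (ill, healthy) pairs that a single marker orders correctly, which corresponds to the area under the ROC curve for that marker alone. The sketch below is not part of the original analysis: it retypes the table above into plain Python lists so that it runs without `expression.csv`, and the helper name `separation` is hypothetical.

```python
# Values retyped from the table above as (healthy, ill) lists per marker.
markers = {
    "mRNA A": ([5.8, 7.2, 7.7, 8.8, 10.0, 10.8, 11.1, 11.4, 12.8, 12.9],
               [11.1, 11.8, 12.4, 12.9, 13.3, 13.9, 14.1, 14.2, 15.1, 18.1]),
    "protein A": ([16.2, 17.4, 18.2, 30.0, 21.1, 23.3, 24.4, 26.1, 27.2, 27.4],
                  [26.1, 26.3, 27.1, 28.6, 30.1, 30.3, 30.7, 31.3, 32.1, 34.9]),
    "mRNA B": ([17.3, 16.9, 14.1, 9.1, 8.1, 7.2, 6.9, 5.8, 5.6, 5.4],
               [8.4, 8.2, 6.5, 6.4, 5.6, 5.1, 4.4, 3.9, 3.6, 2.5]),
    "protein B": ([15.4, 13.2, 13.0, 8.9, 6.9, 5.7, 5.1, 4.7, 4.6, 4.1],
                  [8.2, 8.0, 7.7, 6.3, 6.1, 5.8, 5.8, 4.7, 4.5, 2.1]),
}


def separation(healthy, ill):
    """Fraction of (ill, healthy) pairs ordered in the marker's dominant direction.

    Ties count as half a pair; 0.5 means no separation, 1.0 perfect separation.
    """
    wins = sum(1.0 if i > h else 0.5 if i == h else 0.0
               for i in ill for h in healthy)
    frac = wins / (len(ill) * len(healthy))
    # Direction-agnostic: A rises with illness, B falls.
    return max(frac, 1.0 - frac)


scores = {name: separation(h, i) for name, (h, i) in markers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On this table both A markers score higher than both B markers, which is consistent with the note above. With only 10 individuals per class the scores are rough estimates.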

Expression of RNA versus proteins:
- protein A is more abundant than mRNA A: proteins generally have a longer lifetime than mRNA, and each mRNA molecule can be translated many times
- yet the expression of protein B is in line with that of mRNA B: a good share of protein B may be migrating to tissues other than the ones profiled
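
The protein-versus-RNA comparison in these notes can be checked with a simple ratio of column means. This is a minimal sketch that retypes the four columns of the table above so it runs without `expression.csv`. It assumes the mRNA and protein measurements are on comparable scales, which the source does not state.

```python
from statistics import mean

# Columns retyped from the table above (rows 0-9 healthy, rows 10-19 ill).
mrna_a = [5.8, 7.2, 7.7, 8.8, 10.0, 10.8, 11.1, 11.4, 12.8, 12.9,
          11.1, 11.8, 12.4, 12.9, 13.3, 13.9, 14.1, 14.2, 15.1, 18.1]
protein_a = [16.2, 17.4, 18.2, 30.0, 21.1, 23.3, 24.4, 26.1, 27.2, 27.4,
             26.1, 26.3, 27.1, 28.6, 30.1, 30.3, 30.7, 31.3, 32.1, 34.9]
mrna_b = [17.3, 16.9, 14.1, 9.1, 8.1, 7.2, 6.9, 5.8, 5.6, 5.4,
          8.4, 8.2, 6.5, 6.4, 5.6, 5.1, 4.4, 3.9, 3.6, 2.5]
protein_b = [15.4, 13.2, 13.0, 8.9, 6.9, 5.7, 5.1, 4.7, 4.6, 4.1,
             8.2, 8.0, 7.7, 6.3, 6.1, 5.8, 5.8, 4.7, 4.5, 2.1]

# Ratio of mean protein level to mean mRNA level for each gene.
ratio_a = mean(protein_a) / mean(mrna_a)
ratio_b = mean(protein_b) / mean(mrna_b)
print(f"protein A / mRNA A: {ratio_a:.2f}")
print(f"protein B / mRNA B: {ratio_b:.2f}")
```

A ratio well above 1 for A and close to 1 for B is the pattern the two notes above try to explain.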

Let us now consider data science techniques! We learn the following decision tree from the given data:

In [12]:

from sklearn import tree
X, y = df.drop('class', axis=1), df['class']
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
plt.figure(figsize=(11, 9))
tree.plot_tree(clf, feature_names=X.columns.values.tolist(),
               class_names=['healthy', 'ill'], impurity=False, filled=False)
plt.show()
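
To see how a decision tree picks its first question, the sketch below searches every marker for the `value <= threshold` split with the lowest weighted Gini impurity (Gini is the default criterion of `DecisionTreeClassifier`). It is a simplified illustration written in plain Python on the retyped table, not scikit-learn's implementation, so the fitted tree plotted above may differ in its details. The helper names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (weighted impurity, threshold) of the best `value <= threshold` split."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2  # midpoint between neighbouring observed values
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, thr)
    return best


# Columns retyped from the expression table (rows 0-9 healthy, rows 10-19 ill).
labels = ["healthy"] * 10 + ["ill"] * 10
columns = {
    "mRNA A": [5.8, 7.2, 7.7, 8.8, 10.0, 10.8, 11.1, 11.4, 12.8, 12.9,
               11.1, 11.8, 12.4, 12.9, 13.3, 13.9, 14.1, 14.2, 15.1, 18.1],
    "protein A": [16.2, 17.4, 18.2, 30.0, 21.1, 23.3, 24.4, 26.1, 27.2, 27.4,
                  26.1, 26.3, 27.1, 28.6, 30.1, 30.3, 30.7, 31.3, 32.1, 34.9],
    "mRNA B": [17.3, 16.9, 14.1, 9.1, 8.1, 7.2, 6.9, 5.8, 5.6, 5.4,
               8.4, 8.2, 6.5, 6.4, 5.6, 5.1, 4.4, 3.9, 3.6, 2.5],
    "protein B": [15.4, 13.2, 13.0, 8.9, 6.9, 5.7, 5.1, 4.7, 4.6, 4.1,
                  8.2, 8.0, 7.7, 6.3, 6.1, 5.8, 5.8, 4.7, 4.5, 2.1],
}

splits = {name: best_split(vals, labels) for name, vals in columns.items()}
root_feature = min(splits, key=lambda name: splits[name][0])
root_impurity, root_threshold = splits[root_feature]
print(root_feature, round(root_threshold, 2), round(root_impurity, 3))
```

On the retyped table this search selects `mRNA A`, with a threshold that falls between the observed values 11.4 and 11.8. A full tree repeats the same search inside each resulting group until the groups are pure.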

2. A friend of ours was profiled, yielding the following record: (mRNA-A=11, protein-A=18, mRNA-B=5.5, protein-B=5.4)
Provide your guess on whether our friend tested positive for disease X.

In [13]:

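# after your guess, discover the result by running this code
clf.predict([[11, 18, 5.5, 5.4]])[0]  # our friend's record; wrap in print() to display it
clf.predict([[15, 21, 4, 3]])[0]      # a second record; as the last expression, only this result is echoed below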

Out[13]:

'ill'

Genomics, epigenomics, and glycoproteomics

The gravity of disease X varies considerably across the 10 diseased individuals in the monitored population.

Let us now explore the role of gene C in the gravity of disease X. What we know regarding gene C:
- mutation C1 is associated with 30-60% malformed C proteins when it occurs in one allele and 70-100% malformed C proteins when it occurs in both alleles;
- lack or excess of glycosylation of protein C can be further associated with protein malformations;
- the level of methylation in the promoter region of gene C varies widely across individuals.

Let us load the data acquired from this second cohort study:

In [14]:

import pandas as pd
df = pd.read_csv('omics.csv')
df.head(20)

Out[14]:

	mutation C1	glycosylation C	methylation C (0-1)	gravity (1-5)
0	none	lack	0.01	1
1	one allele	normal	0.10	1
2	none	normal	0.01	2
3	one allele	lack	0.20	3
4	both alleles	excessive	0.30	3
5	one allele	excessive	0.20	3
6	one allele	lack	0.40	4
7	both alleles	excessive	0.60	4
8	both alleles	excessive	0.40	5
9	both alleles	excessive	0.30	5

3. Let us now see how the collected markers are correlated with disease gravity. Any guess?

In [15]:

sns.lmplot(data=df, x='gravity (1-5)', y='methylation C (0-1)', line_kws={'color': 'g'})
plt.show()

In [152]:

fig, ax = plt.subplots(1,2,figsize=(10,3))
sns.histplot(data=df, x="gravity (1-5)", hue="mutation C1", kde=True, bins=5, binrange=(0.5,5.5), ax=ax[0])
sns.histplot(data=df, x="gravity (1-5)", hue="glycosylation C", kde=True, bins=5, binrange=(0.5,5.5), ax=ax[1])
plt.show()

Answer notes:

- Genomics: mutated C alleles are associated with increased gravity (e.g. a hampered tumor suppressor gene unable to regulate proliferation in cancer)
- Epigenomics: higher promoter methylation (associated with lower expression) is correlated with higher gravity (e.g. cancer with C being a tumor suppressor/DNA repair gene)
- Glycoproteomics: excessive glycosylation is also associated with higher disease gravity (e.g. by negatively impacting the final structure of protein C, hence hampering its function)
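
The three associations above can also be summarized numerically. The following sketch retypes the 10 rows of the second table so that it runs without `omics.csv`. It reports the mean gravity per mutation and glycosylation category and the Pearson correlation between methylation and gravity. The helper names are hypothetical, and treating the 1-5 gravity score as a number to be averaged is a simplifying assumption.

```python
from math import sqrt
from statistics import mean

# Rows retyped from the table above:
# (mutation C1, glycosylation C, methylation C, gravity).
rows = [
    ("none", "lack", 0.01, 1),
    ("one allele", "normal", 0.10, 1),
    ("none", "normal", 0.01, 2),
    ("one allele", "lack", 0.20, 3),
    ("both alleles", "excessive", 0.30, 3),
    ("one allele", "excessive", 0.20, 3),
    ("one allele", "lack", 0.40, 4),
    ("both alleles", "excessive", 0.60, 4),
    ("both alleles", "excessive", 0.40, 5),
    ("both alleles", "excessive", 0.30, 5),
]


def mean_gravity_by(index):
    """Average gravity for each category found in column `index` of `rows`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[3])
    return {category: mean(values) for category, values in groups.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


by_mutation = mean_gravity_by(0)
by_glycosylation = mean_gravity_by(1)
r_methylation = pearson([r[2] for r in rows], [r[3] for r in rows])

print("mean gravity by mutation:", by_mutation)
print("mean gravity by glycosylation:", by_glycosylation)
print(f"methylation vs gravity: r = {r_methylation:.2f}")
```

Mean gravity rises from no mutation to one allele to both alleles, and the correlation with methylation is positive. In this table every individual with both alleles mutated also shows excessive glycosylation, so these 10 rows cannot separate the two effects.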

We sequenced gene C; the protein it encodes has the following amino acid sequence (one-letter code):
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGP

4. Using UniProt, can you identify what type of gene this is? What does the encoded protein look like?

Link: https://www.uniprot.org/

Answer, using the BLAST tool in UniProt: p53 (a tumor suppressor, encoded by the TP53 gene, that is linked to approximately 50% of cancers)

Link: https://www.uniprot.org/uniprotkb/P04637/entry#structure
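
Before pasting a sequence into BLAST it helps to check which alphabet it uses, since that decides between a protein and a nucleotide search. The sequence above contains letters such as M, E, P and Q, which are not nucleotide codes. The sketch below applies that check with a rough heuristic; `guess_alphabet` is a hypothetical helper, and a short protein made only of A, C, G and T residues would fool it.

```python
# Sequence copied from the text above.
seq = "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGP"


def guess_alphabet(sequence):
    """Rough guess: 'nucleotide' if only A/C/G/T/U/N letters occur, else 'protein'."""
    letters = set(sequence.upper())
    return "nucleotide" if letters <= set("ACGTUN") else "protein"


kind = guess_alphabet(seq)
print(kind, len(seq))
```

A result of `protein` means the query belongs in a protein BLAST search against UniProtKB rather than a nucleotide search.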

Now, let us assume that malformation of protein C is only observed in cancer cells.
We want to check whether white blood cells can detect cells with malformed protein C and kill them.
This can be done by marking malformed C (whether the malformation is driven by mutations or by glycosylation).
Yet, to determine whether gene C is eligible for successful marking, we first need to check whether the encoded protein is able to leave the cell interior and stay at the cell surface.

5. Knowing that C is CD47, use UniProt to check how likely it is to find the CD47 protein at the surface of the target cells.

Answer: High. CD47 is generally observed at the cell surface, hence it is a good candidate for marking/immunotherapy.
CD47 is an oncogene, aiding cell proliferation.
Link: https://www.uniprot.org/uniprotkb/Q08722/entry#subcellular_location