
Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics
Department of Computer Science and Information Theory

The Curse of Intrinsic Dimensionality in Genome Expression Classification

TDK paper

Written by: Kristóf Marussy, year III
Advisor: Dr. Krisztián Buza, lecturer
Semmelweis University, Institute of Genomic Medicine and Rare Disorders

22nd October 2014

Contents

Rövid összefoglaló
Abstract
1 Introduction
2 Background
  2.1 Machine Learning for Gene Expressions
  2.2 Dimensionality Reduction
  2.3 Classification
3 Hubness in Gene Expression Data
  3.1 Distance Metrics
  3.2 Nearest Neighbour Graph
  3.3 Measurement of Hubness
4 Hubness-aware Gene Expression Classification
  4.1 Hubness-aware Classifiers
  4.2 Evaluation Methods
  4.3 Results and Discussion
5 Conclusions and Future Work
References

Rövid összefoglaló

The analysis of gene expression profiles is an important tool in medical risk assessment, diagnostic and prognostic applications (Alon et al., 1999; Bhattacharjee et al., 2001; Sotiriou et al., 2003). The recent spread of next-generation sequencing technologies has led to growing interest in the predictive classification of gene expression data. A patient's gene expression profile may contain the expression values of several thousand genes, so gene expression instances can be represented as vectors in a high-dimensional Euclidean space. In such high-dimensional spaces, classifiers must contend with a collection of phenomena known as the curse of dimensionality. One of the most significant aspects of the curse is hubness, which has been the subject of considerable recent research. Hubness is observed in data sets of high intrinsic dimensionality in the form of the emergence of hubs (Radovanovic et al., 2010a). By hubs we mean instances that are similar to a surprisingly large number of other instances. In the application domain at hand, instances are the gene expression profiles of patients, and their similarity can be measured with distance functions, for example the Euclidean distance. Hubness is often accompanied by the appearance of bad hubs, which belong to a different class than the instances similar to them. Such hubs can reduce the accuracy of traditional classification algorithms (Radovanovic et al., 2010a). Hubness-aware classifiers are designed to exploit the presence of hubs, thereby avoiding the harmful effect of bad hubs (Tomasev et al., 2014; Tomasev and Mladenic, 2012). In this paper, I compare the behaviour of the most commonly used traditional and hubness-aware classifiers on gene expression data. The experiments are carried out on publicly available data sets (Alon et al., 1999; Bhattacharjee et al., 2001; Sotiriou et al., 2003) using cross-validation and statistical significance tests.

Abstract

Gene expression profiles were found to be highly relevant for safety assessment, diagnostics and prognostics applications (Alon et al., 1999; Bhattacharjee et al., 2001; Sotiriou et al., 2003). Recent advancements in high-throughput sequencing technology have led to increasing interest in predictive classification models for gene expression data. A patient's gene expression profile may contain expression values of thousands of genes. Therefore, gene expression instances are represented as vectors in a very high-dimensional Euclidean space. Classification in such high-dimensional spaces is challenged by a collection of phenomena known as the curse of dimensionality. One of the most prominent, recently explored aspects of the curse is hubness, which was found to be related to the intrinsic dimensionality of the data (Radovanovic et al., 2010a). By hubness we mean the emergence of hubs, instances that are similar to a surprisingly large number of other instances. In our application, instances are patients' gene expression profiles, and similarity may be determined by a distance function, e.g. the Euclidean distance. Hubs may frequently be bad hubs, which have a different class label than the instances they are similar to. These hubs may mislead traditional classification methods (Radovanovic et al., 2010a). However, hubness-aware classifiers, which were explicitly designed to take advantage of hubs, are able to mitigate hubness artefacts (Tomasev et al., 2014; Tomasev and Mladenic, 2012). In my work, I explore the use of various traditional and hubness-aware classifiers on gene expression data. The comparisons are performed with cross-validation and significance testing on publicly available data sets (Alon et al., 1999; Bhattacharjee et al., 2001; Sotiriou et al., 2003).

1 Introduction

The central dogma of molecular biology describes the information flow inside cells between biopolymers (Lesk, 2012):

    DNA --(transcription)--> mRNA --(translation)--> Protein.

Deoxyribonucleic acid (DNA) is a two-stranded double-helix biopolymer consisting of the nitrogen-containing nucleobases guanine (G), adenine (A), thymine (T) and cytosine (C) and a phosphate-deoxyribose backbone. Ribonucleic acid (RNA) has a similar structure; however, it is single-stranded and contains uracil (U) instead of thymine and ribose instead of deoxyribose. Parts of the DNA sequence are transcribed into messenger RNA (mRNA) sequences in a process regulated by complex mechanisms according to changes in the environment inside and outside the cell.
The mRNA transcript eventually finds its way to a ribosome of the cell, where it is translated into a polypeptide sequence built from the 21 (in some organisms, 22) proteinogenic amino acids. After various physical and chemical modifications of the polypeptide chain, the end product is a protein. Proteins are an extremely important class of biomolecules. They take part in virtually all kinds of functions within a cell: structural proteins are responsible for stiffness and rigidity, enzymes catalyse most intracellular chemical reactions, and antibodies bind to and neutralise foreign substances; proteins also regulate RNA transcription, carry signals, and transport substances between cells.

The information required to synthesise proteins is contained in hereditary units of the DNA called genes. A gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions" (Pearson, 2006). The alleles of a gene are its variations found in nature. The Human Genome Project has revealed that there are about 20,000-25,000 genes in the human genome (International Human Genome Sequencing Consortium, 2004).

Gene expression studies look at the mRNA present in a specimen to determine which genes are currently used in the cell to synthesise proteins and which of their alleles are expressed. In microarray methods, a set of probes is created to capture the alleles of interest (Antal et al., 2014). In contrast, RNA sequencing (RNASeq) methods, which are becoming increasingly popular due to the widespread use of high-throughput sequencing, read the bases of the mRNA directly and can hence discover yet unknown alleles. Gene expression data may be used to categorise subtypes of diseases and effects of a treatment, and to find genes participating in a specific disease or biochemical process (Kaminski and Friedman, 2002).
While gene expression measurement methods are gradually becoming more available, microarray and RNASeq measurements are still relatively expensive. For example, processing a sample with an Affymetrix Gene Profiling Array, widely used in clinical research, may cost $400-$700 including reagents and processing (Science Exchange, 2014). These costs, together with the difficulty of recruiting a large number of patients for studies, especially in the case of rare diseases, mean that gene expression studies are usually limited to a couple hundred specimens. The limited sample size, combined with the large number of genes and alleles, makes data mining for gene expression data a "small n, large p" problem (Aruliah et al., 2006). The high dimensionality of gene expression vectors ("large p") gives rise to phenomena known as the curse of dimensionality (Bellman, 1957), while the small number of vectors ("small n") means that the data sets are very sparse.

Machine learning and data mining applications continue to be challenged by the dimensionality and sparsity of gene expression data despite the importance of gene expression studies in clinical safety assessment, diagnostics and prognostics. In this work, we explore the effects of hubness, a significant aspect of the curse of dimensionality, which is related to the intrinsic dimensionality of data sets. After a discussion of related work in Chapter 2, we study how hubness appears in real-world gene expression data sets in Chapter 3. Chapter 4 concerns the effects of hubness in prediction tasks based on gene expression vectors. We conclude our study in Chapter 5.

2 Background

In this chapter, we describe some data mining and machine learning approaches for gene expression.

2.1 Machine Learning for Gene Expressions

Machine learning tasks are usually categorised as either unsupervised or supervised. In unsupervised tasks, only a set of instances $X = \{x_i\}_{i=1}^n$ is available for the learning algorithm.
In the domain of gene expression data mining, instances are vectors of gene expression values, i.e. $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,t}, \ldots, x_{i,p}) \in \mathbb{R}^p$, where $x_{i,t}$ is the expression value of the $t$th gene of the $i$th patient and $p$ is the total number of genes considered. The vectors $X = \{x_i\}$ together form the data matrix $\mathbf{X}$. The unsupervised learning algorithm attempts to analyse the structure of the input data. For example, in clustering, similar instances are grouped into clusters $C_1, C_2, \ldots, C_k$ by means of a partition $X = C_1 \cup C_2 \cup \cdots \cup C_k$.

In supervised tasks, labels are available in addition to the instances; thus, the data set takes the form $D = \{(x_i, y_i)\}_{i=1}^n$, where $y_i$ belongs to the set $Y$ of all possible labels. In most problems, the learning algorithm attempts to construct a predictor $M$ from $D$. The predictor attempts to predict the label $y'$ of a yet-unseen instance $x'$. In regression tasks, the possible labels are the real numbers, i.e. $Y = \mathbb{R}$. In classification, $Y$ is a finite set of discrete class labels, for example $Y = \{-1, +1\}$, where $-1$ means healthy tissue and $+1$ means cancerous tissue. For a more comprehensive review, we refer to Antal et al. (2014, Chapter 8) in the context of gene expression data and to Witten et al. (2011) in the context of general machine learning.

2.2 Dimensionality Reduction

Dimensionality reduction methods reduce the $p$-dimensional vectors $X \subset \mathbb{R}^p$ into a more manageable form. The output is a set of lower-dimensional vectors $X' \subset \mathbb{R}^q$ with $q \ll p$. Because a cDNA microarray may be complementary to many thousands of genes, a large fraction of the genes measured may be irrelevant to the disease or biochemical process studied. Feature selection methods attempt to select the relevant attributes, i.e. genes. This process is usually supervised, because we are interested in the genes that most likely determine the class labels.
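As a small illustration of supervised feature selection, genes can be ranked by a per-gene statistic of class separation; a common choice is the two-sample t-statistic. The following is a minimal sketch in Python with NumPy, not code from the paper; the function and variable names are illustrative:

```python
import numpy as np

def t_statistic_selection(X, y, q):
    """Rank genes by the absolute two-sample t-statistic and keep the top q.

    X : (n, p) data matrix of expression values; y : (n,) labels in {-1, +1}.
    Returns the indices of the q selected genes.
    """
    pos, neg = X[y == +1], X[y == -1]
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    # Welch-style standard error per gene
    se = np.sqrt(pos.var(axis=0, ddof=1) / len(pos) +
                 neg.var(axis=0, ddof=1) / len(neg))
    t = np.abs(mean_diff / se)
    return np.argsort(-t)[:q]  # indices of the q highest-scoring genes

# Toy example: 6 patients, 4 genes; gene 0 is made strongly differentially expressed.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([+1, +1, +1, -1, -1, -1])
X[y == +1, 0] += 10.0
selected = t_statistic_selection(X, y, q=2)  # gene 0 ranks first
```

The selected column indices can then be used to restrict the data matrix to the chosen genes.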
On the other hand, feature construction methods create entirely new representations of $X$, not limited to the attributes already present. This projection can be done in both a supervised and an unsupervised manner.

Feature Selection

Statistical indicators may be used to select relevant genes, including the t-statistic (Golub et al., 1999), the twoing rule, information gain, the Gini index, max minority, sum minority and sum of variances (Murthy, 1998). An implementation of these rules is available in the RankGene software package (Su et al., 2003).

Gene Ontology (Ashburner et al., 2000) is a dictionary (ontology) for expressing the roles of genes in biological processes. Under the null hypothesis, genes of a certain biological function are distributed among the overexpressed genes according to the hypergeometric distribution. Relevant genes may be selected according to the rejection of this null hypothesis. Other hypothesis-testing-based methods for determining the significance of genes include Gene Set Enrichment Analysis (GSEA; Subramanian et al., 2005) and Significance Analysis of Microarrays (SAM; Tusher et al., 2001).

Feature Construction

Principal Component Analysis (PCA). Principal Component Analysis is an unsupervised dimensionality reduction method that constructs an orthonormal system of principal axes $\{u_1, u_2, \ldots, u_p\}$. Each principal axis corresponds to a linear combination of features (genes). The principal axes are selected such that projection onto the first $q$ principal axes maximises the variance of the projected data $X'$. The projected vectors $x'_i \in X'$ are calculated from the corresponding original vectors $x_i \in X$ as follows:

    x'_i = (u_1^T x_i, u_2^T x_i, \ldots, u_q^T x_i) \in \mathbb{R}^q.    (2.1)

Let

    \bar{x} = \frac{1}{n} \sum_{x \in X} x    (2.2)

denote the data set mean and

    S = \frac{1}{n} \sum_{x \in X} (x - \bar{x})(x - \bar{x})^T    (2.3)

denote the data set covariance matrix. The principal axes are the eigenvectors $\{u_i\}$ of $S$, sorted in decreasing order of the eigenvalues $\lambda_i$ (Bishop, 2006, Chapter 12).
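The construction in equations (2.1)-(2.3) can be sketched with NumPy's eigendecomposition. This is a minimal sketch, not the paper's implementation; it centres the data before projecting, and the names are illustrative:

```python
import numpy as np

def pca_project(X, q):
    """Project the rows of X onto the first q principal axes (eqs. 2.1-2.3).

    X : (n, p) data matrix; returns the (n, q) matrix of projected vectors.
    """
    X_centered = X - X.mean(axis=0)            # subtract the data set mean (2.2)
    S = X_centered.T @ X_centered / len(X)     # covariance matrix S (2.3)
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort axes by decreasing variance
    U_q = eigvecs[:, order[:q]]                # first q principal axes u_1, ..., u_q
    return X_centered @ U_q                    # x'_i = (u_1^T x_i, ..., u_q^T x_i)

# Points lying nearly on a line are captured well by a single principal axis.
X = np.array([[2.0, 0.1], [4.0, 0.2], [6.0, 0.3], [8.0, 0.4]])
X_reduced = pca_project(X, q=1)  # shape (4, 1)
```

Because `eigh` is intended for symmetric matrices such as $S$, it is preferable here to the general `eig` routine.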
Therefore, to obtain a $q$-dimensional representation of $X$, we find the $q$ largest eigenvalues of $S$ and project the instance vectors onto the corresponding eigenvectors.

Eigengenes. The eigengene method of gene expression feature construction (Alter et al., 2000) relies on the Singular Value Decomposition (SVD) of the data matrix. The data matrix $\mathbf{X}$ is decomposed into a product of smaller matrices $\mathbf{X} = U \Sigma V^T$, where $\Sigma$ is a $q \times q$ diagonal matrix and $V$ is orthogonal, i.e. $V^T V = I$. The left-singular vectors of $\mathbf{X}$, which are the columns of $U$, correspond to $q$ eigenarrays, and the right-singular vectors of $\mathbf{X}$, which are the columns of $V$, correspond to $q$ eigengenes. $V$ contains the calculated expression levels of the eigengenes in the instances of the original data set $X$, while $U$ contains the expression levels of the genes of the original data set in the eigenarrays. The eigenexpression matrix $\Sigma$ connects the eigengenes and the eigenarrays.

2.3 Classification

In classification problems, the classifier $M$ aims to maximise the classification accuracy, that is, the probability of correctly predicting a class label:

    a(M) = P(M(x') = y').    (2.4)

Figure 2.1: Classification for gene expression data.

To approximate the true accuracy $a(M)$, the labelled data set $D$ is partitioned into a disjoint training set $D_{train}$ and testing set $D_{test}$. The learning algorithm only has access to the training set when it constructs the classifier $M$. The accuracy may be approximated as

    \hat{a}(M) = \frac{|\{(x_i, y_i) \in D_{test} : M(x_i) = y_i\}|}{|D_{test}|}.    (2.5)

Figure 2.1 illustrates classification with training and testing sets for gene expression data.

K-Nearest Neighbours (k-NN). The k-nearest neighbour classifier (k-NN) is a popular classification method where the class label of an instance $x'$ is predicted by a majority vote of its $k$ nearest neighbours in the training set $D_{train}$. Similarity is measured by some distance function $d(\cdot, \cdot)$. Distance measures are discussed in detail in Section 3.1.
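A minimal sketch of k-NN prediction together with the accuracy estimate of eq. (2.5), assuming the Euclidean distance; the function names are illustrative, not from the paper:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote of its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]

def estimated_accuracy(X_train, y_train, X_test, y_test, k=3):
    """Approximate a(M) on a held-out testing set, as in eq. (2.5)."""
    hits = sum(knn_predict(X_train, y_train, x, k) == y
               for x, y in zip(X_test, y_test))
    return hits / len(X_test)

# Toy example: two well-separated classes, as in the healthy/cancerous labelling.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y_train = np.array([-1, -1, -1, +1, +1, +1])
X_test = np.array([[0.05, 0.05], [5.05, 5.05]])
y_test = np.array([-1, +1])
acc = estimated_accuracy(X_train, y_train, X_test, y_test, k=3)  # 1.0
```

The `nearest` indices also identify which training patients drove the prediction, which is the interpretability property discussed next.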
A crucial aspect of nearest neighbour classification is its interpretability. The prediction output is decided only by the $k$ nearest neighbours of $x'$. These few vectors and the corresponding patients can be inspected by a human expert, and further conclusions may be drawn.

Support Vector Machines (SVM). Support Vector Machines employ projections of the data into very high-dimensional spaces and attempt to linearly separate the classes with a maximum-margin hyperplane, one that is as far from the data points as possible. In practice, dual representations are used, and the projections are realised by a kernel function $K(x_i, x_j)$. Support Vector Machines can be successfully applied to gene expression data (W. Lin and Chen, 2013; Mukherjee, 2003). For a detailed overview of SVMs and other sparse kernel machines, we refer to Bishop (2006, Chapter 7).

3 Hubness in Gene Expression Data

Gene expression data sets contain the expression values of thousands of genes. This means that instances in gene expression data sets are represented by vectors in a very high-dimensional Euclidean space. For example, the Breast Cancer data set (Sotiriou et al., 2003) contains 7650 features measured by cDNA microarray chips. Therefore, instances of the Breast Cancer data set are vectors in $\mathbb{R}^{7650}$, i.e. the space of 7650-dimensional vectors. Data sets of such high dimensionality give rise to a collection of phenomena known as the curse of dimensionality. A prominent aspect of the curse is hubness (Radovanovic et al., 2010a): the emergence of hubs in the data set, instances which are similar to a surprisingly large number of other instances.

Scale-Free Networks

The term hub comes from Albert et al. (1999) in the context of scale-free network analysis. Scale-free networks are random graphs whose vertex degrees are distributed according to a power law, $P(\text{out-deg}(x) = n) \propto n^{-\gamma}$.
Scale-free networks include the World Wide Web with hyperlinks between sites as edges, the Internet with network connections as edges, movie actor and scientist collaboration graphs, cellular chemical reaction networks, and ecological networks. In these graphs, there are a few vertices with a surprisingly large number of adjacent edges. These vertices, located in the thick tail of the power-law distribution, are the hubs of scale-free networks.

In the rest of this chapter, we consider networks created from gene expression data sets according to the similarity of gene expression profiles.

3.1 Distance Metrics

To determine the similarity of instances in gene expression data, we consider four widely used distance measures: the Euclidean, Manhattan, maximum and cosine distances. The Euclidean and Manhattan distances are special cases of the $l_r$ or Minkowski distance.

Definition 3.1. The $l_r$ or Minkowski distance ($r > 0$) of two gene expression vectors is calculated as

    d_r(x_i, x_j) = \|x_i - x_j\|_r = \left( \sum_t |x_{i,t} - x_{j,t}|^r \right)^{1/r},    (3.1)

where $x_i$ and $x_j$ are the gene expression profiles of the $i$th and $j$th patients, and $x_{i,t}$ is the expression value of the $i$th patient and $t$th gene.

Definition 3.2. The Euclidean distance is the $l_2$-distance of two vectors,

    d_2(x_i, x_j) = \|x_i - x_j\|_2 = \sqrt{ \sum_t (x_{i,t} - x_{j,t})^2 }.    (3.2)

Definition 3.3. The Manhattan distance is the $l_1$-distance of two vectors,

    d_1(x_i, x_j) = \|x_i - x_j\|_1 = \sum_t |x_{i,t} - x_{j,t}|.    (3.3)

Definition 3.4. The maximum or supremum distance of two vectors, often denoted as their $l_\infty$-distance, is the maximum of their componentwise absolute differences,

    d_\infty(x_i, x_j) = \max_t |x_{i,t} - x_{j,t}|.    (3.4)

Definition 3.5. The cosine similarity is the cosine of the angle between two vectors $x_i$ and $x_j$,

    \cos \theta_{i,j} = \frac{x_i \cdot x_j}{\|x_i\|_2 \, \|x_j\|_2} = \frac{\sum_t x_{i,t} x_{j,t}}{\sqrt{\sum_t x_{i,t}^2} \sqrt{\sum_t x_{j,t}^2}}.    (3.5)

The cosine distance is defined to be small if two instances are similar; thus, it can be used with distance-based algorithms such as k-nearest neighbours.
It has the form

    d_{cos}(x_i, x_j) = 1 - \cos \theta_{i,j}.    (3.6)

Note that the cosine distance is not a proper metric, because the triangle inequality $d_{cos}(x_1, x_2) \le d_{cos}(x_1, x_3) + d_{cos}(x_2, x_3)$ may not always be satisfied. The classification algorithms used in this work do not require the distance function to satisfy the metric axioms; therefore, the cosine distance may be employed without modification.

3.2 Nearest Neighbour Graph

We can construct the directed k-nearest neighbour graph $G = (V, E)$ of a data set $X$ according to a distance function $d(\cdot, \cdot)$, with an edge from each instance to each of its $k$ nearest neighbours.
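Such a graph construction can be sketched as follows; the sketch assumes the Euclidean distance, and the in-degree summary (how often an instance occurs among the nearest neighbours of others, the quantity in which hubs stand out) is an illustrative addition, not the paper's notation:

```python
import numpy as np

def knn_graph(X, k=3):
    """Directed k-nearest neighbour graph of the rows of X.

    Returns a list of directed edges (i, j): instance i has instance j
    among its k nearest neighbours (i itself is excluded).
    """
    edges = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                    # an instance is not its own neighbour
        for j in np.argsort(dists)[:k]:
            edges.append((i, int(j)))
    return edges

def in_degrees(edges, n):
    """In-degree of each vertex; hubs have surprisingly high in-degree."""
    deg = np.zeros(n, dtype=int)
    for _, j in edges:
        deg[j] += 1
    return deg

# The central point [0.5, 0.5] is among the 2 nearest neighbours of every
# other point, so it receives the largest in-degree.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
edges = knn_graph(X, k=2)
deg = in_degrees(edges, len(X))
```

This quadratic-time loop is adequate for data sets of a few hundred specimens, the typical size of gene expression studies.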
