Célia Talma Martins de Pinho Valente Oliveira Gonçalves. A Tool for Text Mining in Molecular Biology Domains


Célia Talma Martins de Pinho Valente Oliveira Gonçalves

A Tool for Text Mining in Molecular Biology Domains

Thesis submitted to the Faculdade de Engenharia da Universidade do Porto for the degree of Doctor in Informatics Engineering

Departamento de Engenharia Informática
Faculdade de Engenharia da Universidade do Porto

Prof. Doutor Eugénio Oliveira
Prof. Doutor Rui Camacho

April 2013

To my daughter, Edna, truly my most lovely dream.
To my husband, Litos, my everlasting eternal love.
To my parents: Maria Isaura and José de Pinho Valente. Especially to honor the memory of my dear late Father, the most wonderful Father in the world.

Acknowledgments

At the end of my thesis I would like to thank everyone who made this thesis possible and an unforgettable experience for me.

It is with deep gratitude that I acknowledge my supervisors, Prof. Eugénio Oliveira and Prof. Rui Camacho, whom I genuinely respect and admire. I would like to thank Prof. Eugénio Oliveira for the constant support, motivation and patience. I would also like to express my gratitude to Prof. Rui Camacho for the constant guidance, support and invaluable help during the course of this research work.

Many thanks to my friend, Prof. Nuno Fonseca, who gave the embryonic idea of this thesis and was always available. I wish to thank all the LIACC-NIAD&R colleagues who have accompanied me.

Last but not least, I am infinitely grateful to my parents, brother and sister, for their unconditional love and support. A very special word to my Mother and Father, for being present in all moments of my life, for the unconditional love and support they gave me all of my life, and for encouraging me to pursue my studies.
A word to my sister, Elsa Marina, who was always a friend, a sister and a mother, whenever I needed her. The day I finished my master's she was the first one to encourage me to pursue a PhD degree. My heartfelt thanks to her for encouraging me. I cannot forget her daughters, my two loving nieces, Joana Aléxia and Diana Zénia, whom I love very much and whose love and joy make my life much better.

A word to my brother, Zé Manuel, who looked after Edna together with my mother, whenever it was necessary, since she was born.

A word to the memory of my grandmother Hermínia, because it was she who taught me to write, read and make my first calculations in primary school. A word to my godmother, Maria Ursulina, who also taught me how to enjoy studying.

Of course, my deepest word of care goes to the love of my life, my husband Litos, to whom I cannot find words to express my gratitude, for being an amazing husband and such a dedicated father. He has accompanied this thesis since it began, and has helped me with his deep knowledge of programming. Thank you for the continuous encouragement and for always believing in my skills. I could not have accomplished it without you.

Another deep word of love goes to my loving daughter, Edna Sofia, who is the most beautiful and loving treasure of my life, my sweet little angel. She was my precious gift at the end of the thesis. Her smile and her joy make my life truly meaningful. Both of you are my daily source of inspiration.

The last word goes, with endless care, to the most special Father in the world, who lives forever in my heart. How I miss you, how I wish you were here! This work has been carried out always remembering the words of encouragement you gave me throughout my life. All this work was meant for you. We will all be together once again one day...

English text reviewed by Ricardo Tavares

Abstract

Researchers need to be constantly aware of the work that has been done in their research area.
Nowadays, most publications are available on the Internet. However, there is usually an overwhelming amount of information, making it impossible for a researcher to be aware of everything that is available. Accessing the relevant information through traditional keyword search engines still results in a huge list of publications to read, many of which are irrelevant. To tackle this problem, research in Text Mining and Information Retrieval has been applied to identify the most relevant publications out of the enormous amount of documents made available on the Internet.

Molecular Biologists have some routine tasks that we believe may be automatically accomplished through the application of Machine Learning techniques. We have identified one of those tasks: given a set of genomic or proteomic sequences, return a set of related sequences and a set of papers with information relevant to the study of such sequences. To properly implement this task we have to solve two main problems: to fetch a set of relevant papers; and to sort by relevance the papers resulting from the previous stage.

In this thesis we propose a novel method of Information Retrieval, based on Machine Learning techniques, to address the problem of retrieving relevant papers from MEDLINE. We have developed a new Information Retrieval methodology involving the dynamic construction of a classifier in real time for classifying MEDLINE papers. The methodology works as follows. A set of papers associated with a set of sequences of interest is retrieved from the NCBI database. A data set is constructed using the NCBI retrieved papers, taken as positive examples, and an equal number of papers randomly sampled from MEDLINE, taken as negative examples. The negative examples are constrained to share MeSH terms with the positive ones. This data set is used by a Machine Learning algorithm to induce a classifier.
The induced classifier is used to retrieve a set of relevant papers from MEDLINE. Since the retrieved set of papers is usually very large, a ranking of the set is performed (the second step). To address this second problem we propose a multi-criteria ranking function. We have used a new methodology to evaluate it automatically. The ranking function is a weighted combination of MeSH terms, number of citations, author's h-index, author's number of publications, journal impact factor and the similarity factor of the journal where the original sequences were published.

A web-based search tool was fully implemented, integrating all the scientific contributions mentioned.

Resumo

Researchers need to stay constantly informed about the work carried out in their research area. At present, most publications are available on the Internet. However, there is usually such a vast quantity of information that it is impossible for a researcher to keep up with all of it. Accessing the relevant information through traditional keyword-based search engines results in a huge list of publications for the researcher to read, with a high number of irrelevant ones. To solve this problem, research in Text Mining and Information Retrieval has been applied to identify the most relevant publications among the gigantic quantity of documents available on the Internet.

Molecular Biology researchers routinely perform a set of tasks that we believe can be automated by applying Machine Learning techniques. We identified one of these tasks: given a set of genomic or proteomic sequences, return a set of related sequences and a set of papers with information relevant to the study of those same sequences. To solve this task we had to solve two main problems: finding and returning a set of relevant papers; and sorting that same set of papers by relevance.

In this thesis we propose a new Information Retrieval method based on Machine Learning techniques to solve the problem of retrieving relevant papers from MEDLINE. We developed a new Information Retrieval methodology that involves the dynamic construction of a classifier in real time to classify MEDLINE papers. The methodology works as follows. A set of papers associated with a set of sequences of interest is retrieved from the NCBI database. A data set is built using the retrieved papers, which constitute the positive examples, together with a set of papers chosen at random from MEDLINE, which constitute the negative examples. The negative examples are required to share MeSH terms with the positive examples. This data set is used by the Machine Learning algorithms to build a classifier. This classifier is then used to retrieve a set of relevant papers from MEDLINE.

Since the set of retrieved papers is normally very large, this set is then sorted (the second step of the process). To solve this second problem we propose a multi-criteria ranking function. We use a new methodology to tune this function automatically. The proposed ranking function is a weighted combination of the following factors: MeSH terms, number of citations, the author's h-index, the author's number of publications, the journal impact factor and the similarity factor of the journal where the original sequences were published.

A Web-based tool integrating all the scientific contributions mentioned was fully implemented.
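The data set construction described above (positive examples from NCBI, an equal number of random MEDLINE negatives that share MeSH terms with the positives) can be sketched in Python. This is only an illustration under stated assumptions, not the thesis implementation: the function name `build_dataset`, the dictionary layout and the reading of "share MeSH terms" as "at least one MeSH term in common with the positive set" are all hypothetical.

```python
import random


def build_dataset(positives, medline_pool, seed=0):
    """Build a balanced data set of (paper_id, label) pairs.

    positives and medline_pool map a paper id to its set of MeSH terms.
    Both the layout and the sharing rule below are assumptions made for
    this sketch, not details taken from the thesis.
    """
    rng = random.Random(seed)
    # All MeSH terms seen in the positive examples.
    positive_terms = set().union(*positives.values())
    # Candidate negatives: not a positive, and sharing at least one term.
    candidates = sorted(
        pid
        for pid, terms in medline_pool.items()
        if pid not in positives and terms & positive_terms
    )
    if len(candidates) < len(positives):
        raise ValueError("not enough candidate negatives sharing MeSH terms")
    # Sample as many negatives as there are positives.
    negatives = rng.sample(candidates, len(positives))
    labelled = [(pid, 1) for pid in sorted(positives)]
    labelled += [(pid, 0) for pid in negatives]
    return labelled


# Toy usage with made-up identifiers and terms.
positives = {"p1": {"DNA", "Genes"}, "p2": {"Proteins"}}
pool = {
    "p1": {"DNA", "Genes"},
    "m1": {"DNA"},
    "m2": {"Lipids"},
    "m3": {"Proteins", "Mice"},
    "m4": {"Genes"},
}
dataset = build_dataset(positives, pool, seed=1)
```

The resulting labelled pairs would then be turned into feature vectors and passed to whichever learning algorithm induces the classifier; that step is left out here because the abstract does not fix a single algorithm.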
Résumé

Researchers need to be kept permanently informed about the work carried out in their field of research. At present, most publications are available on the Internet. However, there is generally such a large quantity of information that a researcher cannot keep up with it. Accessing the relevant information through traditional keyword-based search engines yields enormous lists of publications for the researcher to read, with a very high number of items of no interest or importance. To solve this problem, research in Text Mining and Information Retrieval has at times been applied to identify publications, ordered by degree of importance, among the gigantic mass of documents available on the Internet.

Molecular Biology researchers usually have a set of routine tasks that we believe can be automated by applying Machine Learning techniques. Given a set of genomic or proteomic sequences, the task is to return a set of related sequences and a set of papers relevant to the study of those same sequences. To solve this task we had to find the solution to two main problems: finding and returning a set of relevant papers; and sorting that same set of papers by relevance.

In this thesis we propose a new Information Retrieval method based on Machine Learning techniques, with the aim of solving the problem of retrieving relevant papers from MEDLINE. We developed a new Information Retrieval methodology that involves the dynamic construction of a classifier in real time to classify MEDLINE papers. The methodology works as follows: a set of papers associated with a set of sequences of interest is retrieved from the NCBI database. A data set is built with the retrieved papers, which constitute the positive examples, together with a set of papers chosen at random from MEDLINE, which constitute the negative examples. The negative examples are required to share MeSH terms with the positive examples. This data set is used by the Machine Learning algorithms to build a classifier. This classifier is then used to retrieve a set of relevant papers from MEDLINE.

Since the set of retrieved papers is normally very large, this set is then sorted (the second step of the process). To solve this second problem, we propose a multi-criteria ranking function. We use a new methodology to tune this function automatically. The ranking function is a weighted combination of the following factors: MeSH terms, number of citations, the author's h-index, the author's number of publications, the journal impact factor and the similarity factor of the journals where the original sequences were published.

A Web-based tool integrating all the scientific contributions mentioned was fully developed.
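The multi-criteria ranking function described above is a weighted combination of six factors. A minimal Python sketch follows, assuming each factor has already been normalised to a comparable scale (the abstract does not say how) and using made-up coefficient values. The names `PaperFeatures`, `rank_score` and `rank_papers` are hypothetical; only the order of the coefficients C1 to C6 follows the one given in the thesis's list of tables.

```python
from dataclasses import dataclass


@dataclass
class PaperFeatures:
    """Per-paper factors, assumed here to be pre-normalised to [0, 1]."""
    mesh_terms: float          # weighted by C1
    citations: float           # weighted by C2
    h_index: float             # weighted by C3
    impact_factor: float       # weighted by C4
    publications: float        # weighted by C5
    journal_similarity: float  # weighted by C6


def rank_score(features, weights):
    """Weighted sum of the six factors; weights is the tuple (C1, ..., C6)."""
    values = (
        features.mesh_terms,
        features.citations,
        features.h_index,
        features.impact_factor,
        features.publications,
        features.journal_similarity,
    )
    if len(weights) != len(values):
        raise ValueError("expected six coefficients, C1 to C6")
    return sum(w * v for w, v in zip(weights, values))


def rank_papers(papers, weights):
    """Return paper ids sorted from highest to lowest score."""
    return sorted(
        papers,
        key=lambda pid: rank_score(papers[pid], weights),
        reverse=True,
    )


# Toy usage: coefficients and feature values are invented for illustration.
weights = (0.5, 0.25, 0.25, 0.0, 0.0, 0.0)
papers = {
    "a": PaperFeatures(1.0, 0.5, 0.0, 0.0, 0.0, 0.0),
    "b": PaperFeatures(0.0, 1.0, 1.0, 0.0, 0.0, 0.0),
}
ordering = rank_papers(papers, weights)
```

A plain weighted sum keeps each coefficient directly interpretable, which suits the tuning of coefficients that the abstract mentions; how the thesis actually normalises the factors and selects the coefficients is not stated in this section.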
Contents

Abstract
Resumo
Résumé
Contents
Tables Index
Figures Index

1 Introduction
    Context and Problem
    BioTextRetriever
    Research Questions
    Thesis Objectives
    Key Contributions
    Structure of the thesis

2 Information Retrieval and Text Mining
    Information Retrieval
    Boolean Model
    2.1.2 Vector Space Model
    Probabilistic Model
    Performance Evaluation of an Information Retrieval System
    Text Mining
    Pre-Processing Algorithms for Text Mining
    Document Indexing
    Classifier Learning
    Support Vector Machine
    Naïve Bayes Classifier
    K-Nearest Neighbor
    Rocchio Algorithm
    Decision Trees
    Ensemble Classifiers
    Inductive Logic Programming
    Classifier Evaluation
    Tools for Information Retrieval and Text Mining
    General Tools for and Text Mining
    Tools for Bioinformatics
    Ranking Methodologies
    Summary

3 A tool for Text Mining in Molecular Biology's Domains
    BioTextRetriever Overview
    The MEDLINE Local DataBase
    Extensions to the available Information
    Pre-Processing MEDLINE data
    3.5 Assessing the Pre-Processing Techniques
    How to use BioTextRetriever
    Summary

4 From Sequences to Papers
    Constructing a data set
    Complementing the data sets
    Original sequences characterization
    Assessing the method of choosing irrelevant papers
    Classifier Construction Process
    Evaluating the Alternatives
    Alternative 1
    ILP Results
    Comparing propositional and ILP results
    Alternative 2
    Alternative 3
    Alternative 4
    Alternative 5
    Comparing the five alternatives
    Summary

5 Ranking MEDLINE
    The global procedure for the Ranking Function
    The Ranking Function
    Number of MeSH terms
    Author's Number of Publications
    Number of citations
    h-index
    Journal Impact Factor
    5.2.6 Journal Similarity Factor
    Choosing the Ranking Function Coefficients
    Experimental Settings
    Summary

6 Conclusions and Further Research
    Thesis Overview
    Further Research

References
A Annexes
Index

List of Tables

2.1 Confusion Matrix
Machine Learning algorithms used in the study
Summary of tools for generic text mining applications
Summary of tools for bio text mining applications
Summary of tools for bio text mining applications (cont)
LDB characterization
Machine Learning algorithms used in the study
Pre-processing combinations and data sets characterization
Results summary table
Characterization of data sets of the NMV type
Characterization of data sets of types MRV and RV. In MRV the data sets of Table 4.1 were completed with randomly selected papers. In RV all negative examples are randomly selected
Machine Learning algorithms used in the study
Accuracy results (%) in NMV data sets. In all data sets the best value is not different from the second best value in a statistically significant way, according to Student's t-test (α = 0.05) [SAC07]. Student's t-test is a widely used method for comparing the means of two samples
Accuracy results (%) in MRV data sets. In all data sets the best value is not different from the second best value in a statistically significant way, according to Student's t-test (α = 0.05)
Accuracy results (%) in RV data sets. In all data sets the best value is not different from the second best value in a statistically significant way, according to Student's t-test (α = 0.05)
4.7 Algorithms' average accuracy (%) for NMVs, MRVs and RVs. There is statistical significance (Student's t-test, α = 0.05) between the best and second best methods
Characterization of data sets used to assess the 5 alternatives
Accuracy results (%) in data sets. In all data sets the best value is not different from the second best value in a statistically significant way, according to Student's t-test (α = 0.05)
Prolog predicates used
ILP results over the same data sets used with the propositional algorithms
Comparing ILP and propositional results using accuracy (%)
Ranking of the algorithms in comparison
Ensemble's accuracy results in alternative 2. The bold values are statistically different from the second best values according to Student's t-test (α = 0.05)
Alternative 3 Results
Alternative 4 Results
Alternative 5 Results
Best accuracies achieved in each alternative
Comparing the five alternatives ranking position
The table shows an example of four papers associated with the input sequences
The α value represents the devaluation coefficient in decreasing order of the paper's scientific age
Example of h-index computation for h=4 in this case
Example of four publication journals of four papers associated to the input sequences
Characterization of data sets regarding the number of attributes and the number of positive and negative examples
Characterization of data sets used to tune the coefficients of the ranking function. Column six represents the total number of relevant papers classified as relevant by BioTextRetriever. The last column represents the percentage of relevant papers classified as relevant by BioTextRetriever
5.7 Best combinations results for the fourteen data sets described in Table 5.6. C1 represents the coefficient weight for the number of MeSH terms; C2 represents the coefficient weight for the number of citations; C3 represents the coefficient weight for the author h-index; C4 represents the coefficient weight for the impact factor; C5 represents the coefficient weight for the number of publications and C6 represents the coefficient weight for the Journal Similarity Factor
Individual combination results for the fourteen data sets described in Table 5.6 for the three combinations presented in Table [...]
A.1 Accuracy Classification results. Values with an attached * were obtained interrupting Weka after 3 months of execution. Each cell shows the accuracy with the standard deviation in parentheses. Bold values are the best results for each data set. The value for cross validation is 10. The last column presents the average accuracy of all the algorithms for each data set
A.2 True Positive Classification results. Values with an attached * were obtained interrupting Weka after 3 months of execution. Each cell shows the true positives with the standard deviation in parentheses. Bold values are the best results for each data set. The value for cross validation is 10. The last column presents the average true positives of all the algorithms for each data set
A.3 F-Measure Classification results. Each cell shows the f-measure with the standard deviation in parentheses. Bold values are the best results for each data set. The value for cross validation is 10. The last column presents the average f-measure of all the algorithms for each data set

List of Figures

1.1 An example of a DNA structure (U.S. National Library of Medicine) (a) and protein