UNIVERSITÀ DEGLI STUDI DI PAVIA
Facoltà di Lettere e Filosofia
Corso di Laurea Specialistica in Linguistica Teorica e Applicata

TOWARDS A DISCOURSE RESOURCE FOR ITALIAN: DEVELOPING AN ANNOTATION SCHEMA FOR ATTRIBUTION

Supervisor: Prof.ssa Irina Prodanof
Co-supervisor: Dott.ssa Claudia Soria
Co-supervisor: Prof.ssa Cecilia Maria Andorno
Thesis by: Silvia Pareti
Academic Year 2008/2009

"Nobody believes the official spokesman... but everybody trusts an unidentified source." (Ron Nessen)

Abstract

This thesis investigates the complex phenomenon of attribution and addresses the issue of annotating attribution relations. By means of a pilot study, it develops a possible annotation schema to be applied to the Italian Syntactic Semantic Treebank (ISST), a corpus of newspaper articles. Attribution is the relation holding between assertions, but also e.g. beliefs, feelings and intentions, and the agents they belong to (e.g. The minister says that taxes will rise in 2010). As this relation deeply affects the way we perceive information, information should not be considered in isolation from its source. Recognising attribution is fundamental in order to deal with the reliability of information and with opinions.

The development of an annotation schema for attribution aims at providing a resource in which information is overtly linked to its source. Such an annotated resource could serve a number of purposes, especially in the fields of Information Retrieval, Multi-Perspective Question Answering and Opinion Mining. To date, attribution has only been annotated when associated with a discourse connective or one of its arguments (Prasad et al., 2007), or only at the document, sentence (Skadhauge and Hardt, 2005) or even word level (Wiebe, 2002), thus approaching the phenomenon only partially. The present study addresses attributions independently and also regards them as a discourse phenomenon. After analysing the features and issues connected to attribution, e.g.
scope definition, nested attributions, factuality of the relation and co-reference resolution, an annotation schema will be proposed following the identification of a set of possible attribution devices. To test its feasibility and accuracy against data, a pilot annotation will be performed on a portion of the ISST corpus. This will allow the definition of annotation guidelines and the identification of additional issues that remained unnoticed at the theoretical level. In order to select a suitable tool to perform the pilot annotation, several available annotation tools will be compared.

This thesis not only contributes constructively to the development of a discourse resource for Italian, but also approaches attribution relations from a new, independent perspective, raising problematic issues and providing a deeper account of the phenomenon. Further developments of the project should perform a complete pilot annotation of all the types of attribution and features intended to be included and develop, together with the appropriate tool, a final annotation schema to be applied to the whole corpus.

Acknowledgments

It might seem banal, but every time a challenging project is over it is useful to look back and consider who made it possible. Not only in order to recognise other people's merits and efforts, but especially to realise that we have not been alone. It is because of that very feeling that every time an endeavour finally comes to its conclusion, I can once again think about starting a new one. I can start by remembering how much of what I am I owe to the Erasmus Scheme and the chances it gave me, first as a student and recently as an intern, to study and research at two amazing UK universities: Reading University and the University of Edinburgh.
First of all I would like to thank my supervisor at Edinburgh, Bonnie Webber, for the unforgettable opportunity and the many hours she devoted to listening to my progress and my many doubts, every time with a solution to propose or the name of someone who might have one. Echoes of the enlightening conversations I had there with Theresa Wilson, Janyce Wiebe, Jean Carletta, Nicoletta Calzolari, John Niekrasz, Katja Markert and other colleagues can be found in this thesis, as they were fundamental in shaping my choices and widening my perspective on the topic. A special acknowledgement is also due to Rashmi Prasad, who patiently answered all my e-mails, providing me with material and clarifications about the PDTB and with precious suggestions. The contacts I had with Roser Saurí, Tommaso Caselli and Massimo Poesio were also constructive. Lastly, I cannot forget the contribution of Jasmine to the revision of the thesis and the technical and loving support Gregor unfailingly provided.

Contents

Abstract
Acknowledgments
List of Figures and Tables
    List of Figures
    List of Tables
1 Introduction
    An Independent Approach to Attribution
    Methodology
    Terminology
    Outline of the Thesis
2 Discourse and Attribution
    What is Discourse?
    Definition
    Theories of Discourse
    Coherence and Cohesion
    Constituency vs. Dependency
    Discourse Annotation Projects
    RST-DT
    The Penn Discourse TreeBank PDTB
    Other Projects
    Attribution
    Towards a Definition of Attribution
    Are Attribution Relations a Discourse Phenomenon?
    Related Studies
    GraphBank
    Opinion Corpus
    PDTB - The Penn Discourse TreeBank
    2.5 Summary
3 An Analysis of Attribution
    The Components of Attribution
    The Source
    The Content
    Elements Functioning as Cue
    Some Issues
    Nested Attributions
    Source of the Source
    Multiple Sources, Contents, Cues
    Co-reference Resolution
    Scope Definition
    Summary
4 Features to Include in the Annotation
    Type
    Assertion
    Belief
    Fact
    Eventuality
    Issues Concerning Type Definition
    Source
    Writer
    Arbitrary
    Other
    Factuality
    Factual
    Non-factual
    Scopal Change
    Scopal Polarity
    Other Elements Affecting the Factuality
    4.5 Summary
5 Performing a Pilot Annotation
    Corpus
    ISST Architecture
    Subcorpus Selection
    Tool Selection
    Requirements
    Comparison of Available Tools
    Selection and Tool Specifics
    Setting MMAX
    Scheme
    Customization
    Style
    Feasibility of the Schema and Issues
    Summary
6 Annotation Schema and Guidelines
    Text Spans Selection
    Source Span
    Cue Span
    Content Span
    Supplement
    Feature Annotation Guidelines
    Type Attribute
    Factuality Attribute
    Scopal Change Attribute
    Source Type Attribute
    Collecting a List of Italian Cues
    Extracting Verb Cues from the PDTB
    Summary
7 Conclusion
    Future Work
    And Beyond
Bibliography
Abbreviations and Acronyms
Appendix 1 MMAX2 Code
Appendix 2 Italian Attribution Cues
Appendix 3 PDTB Verb Cues

List of Figures and Tables

List of Figures

Figure A - Reported news example
Figure B - RST schemas
Figure C - (L-TAG) Tree examples (Cristea and Webber, 1997)
Figure D - Sense classification of discourse connectives in the PDTB
Figure E - Graphic extra-linguistic attribution
Figure F - Newspaper article source
Figure G - Nested attribution schema
Figure H - Truth values of a nested content
Figure I - Design Process
Figure J - ISST orthographic level (sole002)
Figure K - ISST morpho-syntactic level (sole002)
Figure L - ISST syntactic constituent level (sole002)
Figure M - ISST table format
Figure N - GATE annotation environment
Figure O - GATE annotation exported in XML
Figure P - Knowtator annotation environment
Figure Q - Knowtator annotation exported in XML
Figure R - MMAX2 Project Wizard
Figure S - MMAX2 Base Data (ISST cs001)
Figure T - The annotation of cue, content and source as separate levels
Figure U - MMAX2 Annotation window
Figure V - MMAX2 Annotation window (attributes)
Figure W - MMAX2 Annotation of relations
Figure X - Nested attributions visible through handles
Figure Y - Attribution relation components
Figure Z - Annotation, text spans selection
Figure AA - Annotation, elements which could function as a markable
Figure BB - Annotation, attributes selection

List of Tables

Table 1 - Factuality values (Saurí and Pustejovsky, 2008)
Table 2 - N. of articles selected per section
Table 3 - Knowtator/MMAX2 feature comparison
Table 4 - Annotation schema features
Table 5 - Factuality and Scopal change values assignment

1 Introduction

Discourse relations represent a fundamental aspect of discourse understanding and generation. Research in many areas, such as Information Extraction, Discourse Generation and Question Answering, would therefore benefit from a discourse-annotated corpus as a basis for their studies.
The aim of this thesis is to contribute towards providing Italian with complete linguistic resources, in particular by designing and testing the addition of a discourse level of annotation to the ISST corpus, a multi-level annotated corpus of Italian newspaper texts. The corpus already consists of five levels of annotation: orthographic, morpho-syntactic, syntactic (constituents), syntactic (dependencies) and semantic. The addition of a layer for discourse annotation comes as a natural development of the ISST corpus.

Most of the work in this framework, to date, concentrates on analysing and annotating discourse connectives or anaphoric relations. For the purpose of the present study, however, these issues will not be addressed and the focus will be on attribution relations. This topic is especially relevant for research dealing with Information Retrieval, Multi-Perspective Question Answering and Opinion Mining. Tools able to discern information according to the relevance of its source, or to identify different opinions with regard to a given topic, would dramatically improve the quality of the information we are constantly exposed to.

People increasingly refer to the internet as a source of information and knowledge, interrogating search engines instead of encyclopaedias or experts. A number of projects, most recently Microsoft's search engine Bing, are trying to outperform Google and break its monopoly, with little success, as they introduce interesting small changes without remarkably improving the reliability of the responses to our queries. Search engines usually classify the source only at the macro-level, i.e. the webpage a certain text or piece of information was taken from. The urge to retrieve answers quickly does not always allow users to take into consideration the context in which the information was found, or to address the troublesome question: Where does this knowledge come from?
Quite often, for example, we hear people supporting their views by stating that they read something about them on the internet, or even that the internet says it. This generalisation is also due to the difficulty of linking the information to its exact source, often hidden by several levels of attribution, all nested one inside another like a Matryoshka doll. The practice of reporting information is particularly pervasive in the journalistic field, and especially in news reviews, where what is stated is always second hand, if it does not originate from even further away. In the example below (Figure A), the website First Bell reports that the UK has the largest gender gap in science achievement. This, however, is according to the UK's Telegraph, which in turn reports a study from the OECD whose data is taken from the Program for International Student Assessment (2006).

Figure A - Reported news example

In the last few years the Web has become the indistinct repository of all human knowledge. However, although it surely is the shallowest source of the data we learn from it, it is never the only one. Knowing all the passages a certain statement has gone through is fundamental, as is e.g. knowing its temporal anchor, in order to verify its veracity, understand it and interpret it. Just consider example (1) below:

(1) According to The Times the President wants to buy the Amazon Forest and turn the trees into toothpicks.

The intention attributed to the President seems to come from a trustworthy source, The Times, and would hopefully provoke immediate reactions, at least from environmentalists. But what if this statement was part of another attribution relation, as in the paragraph that follows (2)?

(2) According to The Times the President wants to buy the Amazon Forest and turn the trees into toothpicks. The comedian pronounced these words, joking about the President's disregard for environmental issues.
A last remark concerns the utility and importance of developing such a project in a language other than English. First of all, findings and results from studies on the English language cannot always and entirely be valid for other languages. Secondly, the importance and life of a language also depend on these efforts to make it available for every possible use. Having language resources for Italian means providing support for studies and research and allowing the development of tools specific to this language, thus enabling its speakers to rely on it for the full range of their needs. Lastly, developing resources in several languages provides precious data for inter-linguistic comparison, thus making it possible to identify aspects which are common to all languages and aspects which are peculiar to each.

1.1 An Independent Approach to Attribution

Being able to automatically link together attributed material and its source would represent a big advantage for a number of tasks. At present, this is still not possible. A manually annotated corpus for attribution is surely not the solution; however, it represents an important step towards it. Studies aiming at developing tools for the recognition of attribution would in fact need a complete description of how the phenomenon functions and is expressed, together with an already annotated corpus against which to test their reliability.

Although attribution relations have already been annotated in a few other projects (Wolf and Gibson, 2005; Wiebe, 2002; Prasad et al., 2007), a systematic and independent account of the phenomenon is still lacking. Studies aiming at capturing the complexity of discourse relations recognise the importance of attribution, but reserve a rather secondary role for it (Wolf and Gibson, 2005; Prasad et al., 2007). Other approaches instead distance themselves from discourse and assume a more independent perspective, or pair attribution with subjective language (Wiebe, 2002).
None of them, however, investigates attribution completely, as they limit the annotation to only some of the attribution levels: word, clause, sentence or document.

In the present project, attribution relations will be investigated as the starting point towards the construction of a discourse resource for Italian and not as an additional feature of it. Moreover, all levels of attribution will be considered and annotated. This way of proceeding will allow the topic to be explored independently from other discourse relations, reaching a deeper understanding and a broader account of the phenomenon.

1.2 Methodology

In order to annotate the corpus for attribution, some preliminary work needs to be carried out. First of all, attribution relations have to be analysed in order to identify their characteristics and spot issues which represent an obstacle to the annotation. Afterwards, possible solutions to these problems will be proposed and an annotation schema outlined. The schema will then be applied to a section of the corpus with the help of an annotation tool.

The tool has been selected after comparing and testing several available software applications. The choice of the annotation tool poses constraints on the annotation schema, as its limited functionality determines what is feasible and what is not (e.g. some tools do not allow the selection of overlapping text spans). Although ideally the tool is determined by the annotation schema and should be developed according to its requirements, this was not realistic at this stage. Having to rely on an existing tool, the initial annotation schema proposed will therefore be adapted to the selected tool. Performing the pilot annotation will raise additional issues and determine new changes to the annotation schema.
The schema will then reach its final stage, a proposal for the annotation of attributions, which should be applicable to the rest of the corpus with the help of annotators, presumably leading to good inter-annotator agreement.

1.3 Terminology

Before moving on, it is appropriate to briefly introduce some of the terminology employed. Although text is used in linguistics to refer to "any passage, spoken or written, of whatever length, that does form a unified whole" (Halliday and Hasan, 1976:1), as the texts within the scope of this study are solely newspaper articles, the term will refer to written language only. The account of attribution provided in this thesis should also hold for the spoken language; however, further investigations are necessary in order to determine to what extent this is true. When generally discussing attribution, the term writer will mostly be employed to refer to both the writer and the speaker of the text.

Discourse is often characterised as a coherent text, as opposed to text lacking a semantic unity. As incoherent texts will not be taken into consideration, both discourse and text will generally be used to refer to a coherent unit of language. The lexical material signalling an attribution relation will mostly be identified as cue or text anchor.

1.4 Outline of the Thesis

In the second chapter, the framework of discourse studies will be briefly introduced together with a survey of discourse annotation projects. Afterwards, attribution will be defined and projects involving its annotation reviewed.

The third chapter presents the phenomenon of attribution and provides an analysis of its constitutive elements, with particular attention to the elements expressing them in the text. Some of the most problematic issues connected to attribution relations and their annotation are also investigated.

A first annotation schema proposal is described in chapter four. The description focuses on the features to include in the annotation.
These attributes and their possible values are carefully analysed and described with the help of examples.

The fifth chapter illustrates the stages towards performing a pilot project in order to test the feasibility of the schema on the corpus. These include the specification of the tool requirements, the analysis and selection of the most suitable tool among the ones currently available and the setting of the selected tool. Afterwards, some additional issues, or issues identified through the pilot annotation, are also presented.

In the sixth chapter the final annotation schema proposed for the annotation of attribution relations is briefly summarised and guidelines concerning the annotation are provided, as they have been adopted for the pilot, in order to facilitate the selection of the relevant text spans and the assignment of the attribute values.

In the last chapter conclusions are drawn and future developments discussed.

2 Discourse and Attribution

2.1 What is Discourse?

Definition

Aristotle already understood it and warned us in his Metaphysics that "the whole is more than the sum of its parts". This also holds for such wholes as texts, where the meaning deriving from the juxtaposition of clauses, as pointed out by Moore and Wiemer-Hastings (2003), may not coincide with the meaning of the individual clauses and may imply more than that. Discourse could therefore be defined as propositions in context (Péry-Woodley and Scott, 2007).

Units of language are usually organised in a coherent way, and researchers agree that coherent text has a structure and that understanding the way it functions is fundamental for the understanding of discourse (Grosz and Sidner, 1986; Hobbs, 1985). This structure needs to be taken into consideration when dealing with natural language generation but also with tasks such as co-reference resolution,