Terminology Extraction System based on Vocabulary Space

Description
German-Japan NL WS in Sapporo, 2003/7/4. Terminology Extraction System based on Vocabulary Space. Hiroshi Nakagawa, Information Technology Center, The University of Tokyo. 歩留まり: Bu-Domari: Success rate ?? 横持ち: Side take:

Transcript
German-Japan NL WS in Sapporo, 2003/7/4
Terminology Extraction System based on Vocabulary Space
Hiroshi Nakagawa, Information Technology Center, The University of Tokyo
  • 歩留まり (bu-domari): success rate ??
  • 横持ち (literally "side take"): transportation between a main transportation hub (such as an airport or train station) and the destination or starting point.
  • 玉掛け (literally "ball hinge"): to operate a power shovel
  • Really useful and interesting terminologies
  • Long Compound Nouns
  • German
  • German-Japan
  • German-Japan natural
  • German-Japan natural language
  • German-Japan natural language processing
  • German-Japan natural language processing workshop
  • German-Japan natural language processing workshop program
  • German-Japan natural language processing workshop program chair
  • German-Japan natural language processing workshop program chair and
  • German-Japan natural language processing workshop program chair and ACL
  • German-Japan natural language processing workshop program chair and ACL2003
  • German-Japan natural language processing workshop program chair and ACL2003 general
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory
  • German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka
  • Long compound nouns (NPs) are a source of information about terminology
  • Objective
  • An up-to-date domain terminology dictionary is the gateway to various technological and academic fields.
  • For this, we first need high-quality terminology for the target domain.
  • What corpus? An ordinary corpus or Web pages?
  • Concepts
  • Methodological classification:
  • Supervised Learning based extraction
  • finding the most influential features
  • surrounding patterns of the target expression
  • technology developed for the named entity (NE) task
  • Statistics based extraction (our target)
  • document space based statistics
  • linguistic structure based formalisms, such as syntactic or semantic structure
  • vocabulary space based statistics (our target)
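  • Document space versus Vocabulary space
  • [Figure: document space versus vocabulary space, illustrated with strings such as abc, ab, lmn, xy, xy abc and lmn xy taken from the Web]
  • Document space based statistics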
  • Old fashioned
  • Weight term candidates based on their occurrence in the document space (a corpus or the Web), and rank them in descending order.
  • term frequency or tf*idf for basic nouns
  • To extract compound nouns: a contingency matrix and co-occurrence based decisions with MI, χ², Dice, etc.
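The co-occurrence measures named above (MI, χ², Dice) can all be computed from the 2x2 contingency table of a word pair. The following is a minimal sketch using the textbook definitions of these measures; the function name `association_scores`, the cell names `n11` to `n22`, and the example counts are illustrative assumptions and do not come from the talk.

```python
import math

def association_scores(n11, n12, n21, n22):
    """Score a word pair (w1, w2) from its 2x2 contingency table.

    n11: bigrams where w1 is followed by w2
    n12: bigrams where w1 is followed by another word
    n21: bigrams where another word is followed by w2
    n22: bigrams with neither w1 first nor w2 second
    (Hypothetical cell names, chosen for this sketch.)
    """
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22   # marginals: w1 / not w1
    col1, col2 = n11 + n21, n12 + n22   # marginals: w2 / not w2

    # Pointwise mutual information: log2(observed / expected co-occurrence).
    mi = math.log2(n11 * n / (row1 * col1)) if n11 > 0 else float("-inf")

    # Pearson chi-square, closed form for a 2x2 table.
    denom = row1 * row2 * col1 * col2
    chi2 = n * (n11 * n22 - n12 * n21) ** 2 / denom if denom else 0.0

    # Dice coefficient: shared occurrences over total occurrences.
    dice = 2 * n11 / (row1 + col1) if (row1 + col1) else 0.0
    return {"mi": mi, "chi2": chi2, "dice": dice}

# Made-up counts: the pair co-occurs 10 times in 100 bigrams.
scores = association_scores(10, 10, 10, 70)
print(scores)
```

A pair whose words are statistically independent, for example the counts (4, 16, 16, 64), gets MI = 0 and χ² = 0 under these definitions, which is why such measures are used to decide whether a word sequence is a genuine compound.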
  • Linguistic Structure based method
  • Syntactic structure
  • POS pattern like {adj (noun)+}
  • phrasal verbs, etc.
  • Semantic structure of compound nouns
  • Predicate argument structure (e.g. Pustejovsky)
  • Case frame of predicate
  • Single and compound nouns are not treated equally.
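As an illustration of the POS pattern approach, the sketch below matches a tag pattern over an already tagged sentence. It reads the pattern {adj (noun)+} as "optional adjectives followed by one or more nouns", which is an assumption about the slide's shorthand. The tag names (`ADJ`, `NOUN`), the function name `match_pos_pattern`, and the hand-tagged sentence are likewise hypothetical; a real system would obtain the tags from a POS tagger or morphological analyzer.

```python
import re

def match_pos_pattern(tagged):
    """Return the word sequences whose POS tags match ADJ* NOUN+.

    tagged: list of (word, tag) pairs, already POS tagged.
    """
    # Encode each tag as one character so a regex can scan the tag sequence.
    code = {"ADJ": "A", "NOUN": "N"}
    tag_string = "".join(code.get(tag, "x") for _, tag in tagged)

    candidates = []
    for m in re.finditer(r"A*N+", tag_string):
        words = [word for word, _ in tagged[m.start():m.end()]]
        candidates.append(" ".join(words))
    return candidates

# Hand-tagged toy sentence (not from the talk's corpus).
tagged = [("we", "PRON"), ("use", "VERB"), ("natural", "ADJ"),
          ("language", "NOUN"), ("processing", "NOUN"), ("for", "ADP"),
          ("term", "NOUN"), ("extraction", "NOUN")]
print(match_pos_pattern(tagged))
# ['natural language processing', 'term extraction']
```

Because the pattern yields whole noun phrases, single nouns and compound nouns come out of different matches and are typically scored by different means, which is the asymmetry the slide points out.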
  • Vocabulary space based method
  • Statistics of vocabulary space such as
  • Statistics of embedded relation (C-value)
  • How many compound nouns the target noun makes (LR = our proposal)
  • Application of link structure analysis of Web pages: (PageRank, HITS)
  • Single and compound nouns are treated equally
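The statistic "how many compound nouns the target noun makes" can be gathered by counting, for every simple noun, which nouns appear directly to its left and right inside compound nouns. The following is a minimal sketch under stated assumptions: the function name `adjacency_counts`, the input format (a list of simple nouns plus a frequency), and the toy data are illustrative, with the toy data modelled on the "trigram" example used in this talk.

```python
from collections import Counter, defaultdict

def adjacency_counts(compound_nouns):
    """Count left and right neighbours of each simple noun.

    compound_nouns: iterable of (list_of_simple_nouns, frequency).
    Returns two dicts mapping a noun to a Counter of its left
    (respectively right) neighbours, weighted by frequency.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for nouns, freq in compound_nouns:
        for prev, cur in zip(nouns, nouns[1:]):
            left[cur][prev] += freq    # prev stands directly left of cur
            right[prev][cur] += freq   # cur stands directly right of prev
    return left, right

# Toy data (assumed): compounds around the simple noun "trigram".
toy = [
    (["noun", "trigram"], 3),
    (["character", "trigram"], 1),
    (["class", "trigram"], 1),
    (["trigram", "statistics"], 2),
    (["trigram", "acquisition"], 1),
]
left, right = adjacency_counts(toy)
print(sum(left["trigram"].values()), len(left["trigram"]))    # 5 3
print(sum(right["trigram"].values()), len(right["trigram"]))  # 3 2
```

Since the counts are defined per simple noun, a score for a compound noun can be built from the counts of its parts, which is how single and compound nouns end up being treated in the same way.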
  • Our objective
  • Experimental analysis and evaluation of various term extraction methods with
  • Test collection (TMREC) corpus
  • Web page corpus
  • Domain dictionaries on the Web or on CD-ROM as the gold standard
  • Term extraction system repository
  • Gensen Web (言選Web)
  • http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html
  • Finally, an automatic builder for an up-to-date domain term dictionary
  • ATR by compound noun statistics: 言選Web (Gensen Web)
  • Automatic term extraction from WEB pages
  • Step1. Term candidate extraction
  • separating the text at stop-words (or using a morphological analyzer) to generate candidates
  • Step 2. Scoring candidates to rank them
  • our scoring mechanism is innovative and unique
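Step 1 can be sketched for English text as follows: the text is cut at stop-words and punctuation, and each remaining run of words becomes a term candidate. This is only a rough sketch and not the Gensen Web implementation. The stop-word list, the function name `extract_candidates`, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "is", "are", "in", "on", "for",
              "and", "to", "by", "we", "with", "from"}

def extract_candidates(text, stop_words=STOP_WORDS):
    """Split text at stop-words and punctuation; count the remaining runs."""
    counts = Counter()
    # Punctuation also ends a candidate, so cut the text into chunks first.
    for chunk in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for token in chunk.split():
            if token in stop_words:
                if current:
                    counts[" ".join(current)] += 1
                current = []
            else:
                current.append(token)
        if current:
            counts[" ".join(current)] += 1
    return counts

text = "The knowledge base is built from the learner model. We evaluate the knowledge base."
print(extract_candidates(text))
```

Without POS information the candidates also include non-nouns such as "built" and "evaluate", which is one reason a morphological analyzer is the alternative named in the step, and why the ranking in Step 2 matters.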
  • Domain specific terms expressing domain concepts: about 85% are compound nouns, about 15% are simple nouns
  • Simple noun: a noun that cannot be divided into shorter nouns
  • Compound noun: an uninterrupted sequence of simple nouns
  • Our purpose is extracting domain specific terms, including compound and simple nouns, from a domain corpus automatically.
  • Scoring of Simple Nouns
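  • Example: left and right neighbours of the simple noun N = "trigram"
  • Left nouns L_i (freq.): noun (3), character (1), class (1); n = 3
  • Right nouns R_j (freq.): statistics (2), acquisition (1); m = 2
  • LN(trigram) = 3 + 1 + 1 = 5, RN(trigram) = 2 + 1 = 3
  • Principle: a simple noun which contributes to making a large number of compound nouns has a high score.
  • Scoring of compound nouns
  • GM(Compound Noun): for a compound noun CN = N_1 N_2 ... N_L, $GM(CN) = \left( \prod_{i=1}^{L} (LN(N_i)+1)(RN(N_i)+1) \right)^{1/(2L)}$
  • GM(CN) is a geometric mean which does not depend on the length of CN.
  • New scoring function: FGM(CN)
  • If CN occurs independently, then $FGM(CN) = GM(CN) \times f(CN)$, where f(CN) is the number of independent occurrences of CN (i.e. CN does not appear as a part of a longer CN).
  • Ex. $GM(\mathrm{trigram}) = ((5+1) \times (3+1))^{1/2} = 4.9$; if f(trigram) = 5, then FGM(trigram) = 24.5
  • Modified C-value
  • Modify C-value (Frantzi & Ananiadou, 1996) to be able to score a simple noun: $MC\text{-}value(a) = length(a) \times \left( freq(a) - \frac{t(a)}{c(a)} \right)$
  • length(a): number of simple nouns that make up a
  • freq(a): frequency of a
  • t(a): frequency of candidate compound nouns including a
  • c(a): number of distinct candidate compound nouns including a
  • Experimental Evaluations
  • Data used in our experiment was developed by NII.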
  • The corpus is a manually POS-tagged Japanese corpus, and the gold standard is a set of manually extracted terms developed for the NTCIR1 TMREC task
  • (Artificial Intelligence field: 1,870 paper abstracts)
  • Gold-standard consists of manually extracted 8,843 domain specific terms
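The GM and FGM scores used in this evaluation can be sketched in a few lines, following the worked "trigram" example from the scoring slides (LN = 5, RN = 3, f = 5). The function names `gm` and `fgm` and the dictionary-based inputs are assumptions made for this sketch. How a term with no independent occurrence should be scored is not recoverable from the slides, so the sketch simply multiplies by f(CN).

```python
def gm(nouns, ln, rn):
    """Geometric mean of (LN+1)(RN+1) over the simple nouns of a term.

    nouns: the simple nouns that make up the term, in order.
    ln, rn: dicts giving LN(N) and RN(N) for each simple noun N;
            nouns missing from the dicts count as 0.
    """
    product = 1.0
    for noun in nouns:
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    # The exponent 1/(2L) makes the score independent of term length L.
    return product ** (1.0 / (2 * len(nouns)))

def fgm(nouns, ln, rn, independent_freq):
    """GM weighted by the number of independent occurrences of the term.

    Assumption: terms with independent_freq == 0 simply score 0 here.
    """
    return gm(nouns, ln, rn) * independent_freq

# Values from the slide's example for the simple noun "trigram".
ln = {"trigram": 5}
rn = {"trigram": 3}
print(round(gm(["trigram"], ln, rn), 1))      # 4.9
print(round(fgm(["trigram"], ln, rn, 5), 1))  # 24.5
```

Ranking all candidates by `gm` or `fgm` in descending order and comparing the top of the list with the gold-standard terms gives the kind of complete-match and partial-match counts evaluated in this experiment.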
  • Complete and partial match by GM (baseline)
  • [Figure: partial match (contained) and complete match by GM]
  • Number of completely matched terms by FGM and MC-value
  • [Figure: MCval − GM and FGM − GM]
  • Number of partially matched terms by FGM and MC-value
  • [Figure: FGM − GM and MCval − GM]
  • Average length (every 100 terms) of extracted terms
  • [Figure: MC-value, GM, FGM]
  • Top scored 20 terms by GM
  • candidate terms, frequency
  • 1. 知識(knowledge) 787 ○
  • 2. 学習知識(learning knowledge) 1 ○
  • 3. 学習(learning) 255 ○
  • 4. 言語的知識(linguistic knowledge) 2 ○
  • 5. 知識システム(knowledge system) 14 ○
  • 6. 学習システム(learning system) 16 ○
  • 7. 問題知識(problem knowledge) 3 ×
  • 8. 学習問題(learning problem) 5 ○
  • 9. 言語的(linguistic) 1 ○
  • 10. システム(system) 861 ○
  • Top scored 20 terms by GM (cont'd)
  • 11. 問題(problem) 561 ○
  • 12. 論理的知識(logical knowledge) 1 ○
  • 13. 学習支援システム(learning assistance system) 3 ○
  • 14. 設計知識(design knowledge) 29 ○
  • 15. 学習問題解決システム(learning problem solver) 1 ○
  • 16. 学習支援(learning assistance) 9 ○
  • 17. 言語的情報(linguistic information) 3 ○
  • 18. 知識モデル(knowledge model) 3 ○
  • 19. 設計システム(design system) 6 ○
  • 20. システム設計(system design) 1 ○
  • Top scored 20 terms by FGM
  • candidate terms, frequency
  • 1. 知識(knowledge) 787 ○
  • 2. システム(system) 861 ○
  • 3. 問題(problem) 561 ○
  • 4. 学習(learning) 255 ○
  • 5. 学習者(learner) 383 ○
  • 6. モデル(model) 356 ○
  • 7. 情報(information) 382 ○
  • 8. 問題解決(problem solving) 186 ○
  • 9. 設計(design) 183 ○
  • 10. 知識ベース(knowledge base) 149 ○
  • Top scored 20 terms by FGM (cont'd)
  • 11. 推論(inference) 162 ○
  • 12. 支援(assistance) 87 ×
  • 13. 知識表現(knowledge representation) 74 ○
  • 14. エージェント(agent) 256 ○
  • 15. 学習者モデル(learner's model) 57 ○
  • 16. 機能(function) 294 ×
  • 17. 設計者(designer) 69 ○
  • 18. 対話(dialogue) 205 ○
  • 19. 言語(language) 75 ○
  • 20. 対象(object) 293 ○
  • Top scored 20 terms by MC-value
  • candidate terms, frequency
  • 1. 学習者(learner) 383 ○
  • 2. 問題解決(problem solving) 186 ○
  • 3. システム(system) 861 ○
  • 4. 知識(knowledge) 787 ○
  • 5. 研究(research) 651 ×
  • 6. 本稿(this paper) 594 ×
  • 7. 手法(method) 562 ×
  • 8. 問題(problem) 561 ○
  • 9. 知識ベース(knowledge base) 149 ○
  • 10. 論文(paper) 453 ×
  • Top scored 20 terms by MC-value (cont'd)
  • 11. 方法(method, way to do) 426 ×
  • 12. 支援システム(assistance system) 18 ×
  • 13. 計算機(computer) 128 ○
  • 14. 情報(information) 382 ○
  • 15. モデル(model) 356 ○
  • 16. 自然言語(natural language) 63 ○
  • 17. 我々(we) 332 ×
  • 18. 有効性(effectiveness) 160 ×
  • 19. エキスパートシステム(expert system) 78 ○
  • 20. ユーザ(user) 297 ○
  • Precision (complete match) of each method (N1, N2: top two systems of NTCIR1)
  • Precision (partial match) of each method
  • Precision of each method when a large number of terms is extracted (N1, N2: top two systems of NTCIR1)
  • Conclusions-1
  • New statistical methods for ATR, which are basically based on how many nouns adjoin the simple noun in question to form compound nouns.
  • FGM
  • best at extracting a small number (up to 1400) of high quality domain specific terms
  • longer terms including correct terms are better extracted by FGM or GM
  • MC-value
  • strong at extracting a large number (up to 6000) of domain specific terms
  • Conclusions-2
  • The Web is perceived as a gigantic knowledge resource, but it is yet to be fully utilized.
  • Terminology in various domains is sure to be the gateway to the domain for novices and even for experts.
  • A more readily usable ATR system is needed.