DEVELOPING AN AUTOMATIC METADATA EXTRACTION (METEX) SYSTEM FROM ELECTRONIC DOCUMENTS. A MASTER'S THESIS in Software Engineering, Atılım University

DEVELOPING AN AUTOMATIC METADATA EXTRACTION (METEX) SYSTEM FROM ELECTRONIC DOCUMENTS

A THESIS SUBMITTED TO THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES OF ATILIM UNIVERSITY BY MURAT ÖZMERT IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN THE DEPARTMENT OF SOFTWARE ENGINEERING

JUNE 2007

Approval of the Graduate School of Natural and Applied Sciences
Prof. Dr. Selçuk Soyupak, Director

I certify that this thesis satisfies all the requirements as a thesis for the degree of Master of Science.
Asst. Prof. Dr. Çiğdem Turhan, Head of Department

This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Master of Science.
Asst. Prof. Dr. Nergiz Ercil Çağıltay, Supervisor

Examining Committee Members
Prof. Dr. İbrahim Akman
Asst. Prof. Dr. Çiğdem Turhan
Asst. Prof. Dr. Nergiz Ercil Çağıltay
Instructor Gül Tokdemir
Computer Engineer (MS) Arzu Serpen

ABSTRACT

DEVELOPING AN AUTOMATIC METADATA EXTRACTION (METEX) SYSTEM FROM ELECTRONIC DOCUMENTS

Özmert, Murat
M.S., Software Engineering Department
Supervisor: Asst. Prof. Dr. Nergiz Ercil Çağıltay
June 2007, 96 pages

Today, companies invest heavily in Enterprise Resource Planning (ERP) solutions to manage their enterprise activities more effectively, easily and productively, and they try to keep their information systems up to date. Yet while making these investments, they still rely on manpower to transfer significant data from hard copy documents into their information systems. This tedious chain of processes causes several losses. One of the most common problems in production-oriented companies is the time lost extracting data from publications and printed documents.

Manual data input into ERP software slows down the work processes of the company and, because of the high mistyping rate, may cause incorrect data to enter the system. This costs the company time and productivity. Data retrieval from massive amounts of technical content can be a challenge for data input operators. Moreover, every technical publication and each of its subgroups has its own structure, the data extracted differ from one publication to another, and the data belong to different object groups in the target information system; all of this creates further challenges. When these issues are considered together, automatic metadata extraction processes gain importance. Once the software development phase of an information system is complete, data collection should be finished in a short period of time so that the system can begin serving its users as soon as possible (for example, a dam must be filled with water before it can produce electricity).

This study is a descriptive case study that analyzes metadata extraction processes to support information systems. It was conducted in a real-world logistics domain that has predefined (standard) structured technical documentation to feed its information system. The study aims to guide other studies in better organizing their infrastructure to support their information systems in a reliable domain. This thesis presents a framework that extracts metadata from massive electronic technical documents and transforms it into XML. The basic structure of the developed system, its development processes, the basic services it provides, and similar systems are also described. To show the gains of the developed system, the durations of the processes in the classical (manual) system and in the developed system are evaluated and compared.

Keywords: Data Extraction, Metadata, Electronic document, XML transformation, ERP data input.
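The framework summarized in the abstract reads text from electronic documents, pulls out metadata fields, and writes them as XML. A minimal Python sketch of that idea follows. It is not the thesis's implementation: the sample text, the field names `part_number` and `date`, both regular expressions, and the `record` root tag are assumptions made only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical field patterns, invented for this sketch.
# They are not the expressions used by METEX.
FIELD_PATTERNS = {
    "part_number": re.compile(r"Part\s+No\.?\s*:\s*([A-Z0-9-]+)"),
    "date": re.compile(r"Date\s*:\s*(\d{2}/\d{2}/\d{4})"),
}


def extract_metadata(text):
    """Return a dict holding the first match of every pattern found in text."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


def to_xml(metadata, root_tag="record"):
    """Serialize the extracted fields as one flat XML element."""
    root = ET.Element(root_tag)
    for name, value in metadata.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


# Assumed sample of text already converted from a source document.
sample = "Part No: AB-1234\nDate: 15/06/2007\nDescription: Headlight assembly"
metadata = extract_metadata(sample)
xml_text = to_xml(metadata)
print(xml_text)
```

Keeping the patterns in a dictionary separates the document-specific rules from the extraction loop, so a different publication layout would only need a different set of patterns.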
ÖZ

DEVELOPMENT OF AN AUTOMATIC DATA EXTRACTION (METEX) TOOL FROM ELECTRONIC DOCUMENTS

Özmert, Murat
M.S., Department of Software Engineering
Thesis Supervisor: Asst. Prof. Dr. Nergiz Ercil Çağıltay
June 2007, 96 pages

Today, companies allocate large resources to, and invest heavily in, Enterprise Resource Planning (ERP) automation solutions in order to carry out their enterprise activities more effectively, easily and productively. In this way they constantly try to upgrade the technology they use. Yet while investing this much in these systems, companies laboriously transfer the information in various technical documents into their information systems using manpower. This leads to various losses. One of the most common problems in production-oriented companies is the loss incurred in extracting data from high-volume technical documents for loading into the target system. Data entry into the ERP system using manpower can slow processes down and, because of the high error rate, can cause incorrect information to be transferred into the system. For an organization, this means a loss of productivity and time. Finding and obtaining data from high-volume technical documents for loading into the target system can be a very laborious job for data entry operators. In addition, the fact that each technical document and its subgroups have their own structure, the differences in the data obtained from each document, and the fact that the data belong to different object groups in the target information system create further difficulties.

When all these issues are considered, automatic data extraction processes gain importance. The data collection processes of an information system whose software is complete and which is ready to serve its users must be completed in the shortest possible time (just as electricity cannot be produced if a dam reservoir is not filled with water).

This study is a descriptive study that analyzes data extraction operations for the purpose of supporting information systems. It was carried out in a real logistics domain that has a standard technical document structure for feeding its information systems. The aim of the study is to guide those who want to organize their infrastructure better in supporting information systems in a reliable domain. This thesis presents a structure that extracts data from high-volume technical documents and converts them into XML format. Within this scope, the basic structure of the developed system, its design/development phases, the basic functions provided and similar systems are summarized. To better express the gains of the developed system, the conclusion section evaluates and compares the data entry times for manual entry and for entry with the developed system.

Keywords: Data extraction, metadata, electronic document, XML transformation, ERP data entry.

To My Wife, Sonsoles & To My Daughter, Elif

ACKNOWLEDGEMENTS

I am grateful to my advisor, Asst. Prof. Dr. Nergiz Ercil Çağıltay, for her guidance, motivation, patience, and the support that she has provided throughout the course of this study. I should also express my appreciation to Asst. Prof. Dr. Çiğdem Turhan for her continuous encouragement since [...]. I would like to thank my family. Without their support and love, I could never have tasted any achievement in my life.

TABLE OF CONTENTS

ABSTRACT
ÖZ
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER
1. INTRODUCTION
  Printed Documents & ERP Systems
  Manual Data Entry
  Data Collection/Input Methods
  The Aim of the Case Study
BACKGROUND AND RELATED WORKS
  General Concepts and Terms
  Previous Works and Commercially Available Products
  Evaluation and Comparison of the Products
IMPLEMENTATION
  3.1 Project Definition
  System Overview
  Provided Functionalities
  Requirements & Expectations
  General System Architecture
  Source Files to be Processed
  Description Argument and Description File Used
  PDF to Text Conversion
  Text to Excel Conversion
  Man-In-The-Loop
  Excel to XML Conversion
  How Does METEX Work?
  Technology Used
EXPERIMENTAL RESULTS
  Document Structure
  Manual Data Input Transactions in The Target ERP System
  Total Duration Calculation
CONCLUSION
  Future Work
REFERENCES
APPENDICES
  A. ENTITY RELATIONSHIP DIAGRAMS RELATIVE TO CHAPTER
  B. DATAFLOW DIAGRAMS RELATIVE TO CHAPTER
  C. SEQUENCE DIAGRAMS RELATIVE TO CHAPTER
  D. REGULAR EXPRESSIONS RELATIVE TO CHAPTER
  E. PSEUDO CODES RELATIVE TO CHAPTER

LIST OF TABLES

Table 2.1 An Example of A Simple Metadata Record
Table 2.2 Updated Tool Categories
Table 2.3 Batch-Oriented Tool Categories
Table 2.4 Updated On-line/Real-Time/Interactive Tool Categories
Table 3.1 Description Argument/File And Their Functions
Table 4.1 Time Portions for Manual Data Input Steps in ERP System

LIST OF FIGURES

Figure 1.1 Possible Data Collection Methods
Figure 2.1 Information Extraction in Context [12]
Figure 2.2 Business-to-Business Communications with XML/EDI [16]
Figure 2.3 An Example of XML Code [16]
Figure 2.4 The Classical Approaches to Search for Regular Expressions in a Text
Figure 2.5 The Filtering Approach to Search for Regular Expressions in a Text
Figure 2.6 A Sample Web Page
Figure 2.7 PDF Files Retrieved From Web Page Using Web2DB Software
Figure 2.8 DataMite Software Screenshot
Figure 2.9 Pattern Editor Software Screenshot
Figure 3.1 General System Architecture
Figure 3.2 Sample Portion of PDF Source File
Figure 3.3 Sample Portion of Excel Source File
Figure 3.4 Sample FSA (Finite State Automata) for a Sample Regular Expression
Figure 3.5 Portion of the Access File Contains ERP's XML Id-Tags
Figure 3.6 Sample PDF File That Contains Date Information
Figure 3.7 Extraction of Date Information From Text Body
Figure 3.8 The Source Code that Executes the Parsing Process
Figure 3.9 The FSA for Above Given Regular Expression
Figure 3.10 The Excel File Contains Extracted Data
Figure 3.11 Description File Contains XML-Tags Correspond to Excel Data
Figure 3.12 Specific Data in the XML File
Figure 3.13 PDF To Excel Converter Main Window
Figure 3.14 Screenshot of Open-File-Dialog-Box
Figure 3.15 Presentation of PDF To Text Conversion Result
Figure 3.16 Results of Text Processing Using Regular Expression
Figure 3.17 Screenshot of Excel-File-Creation Window
Figure 3.18 Save-Work-Area Functionality
Figure 3.19 Excel To XML Converter Main Window
Figure 3.20 Load Excel and Select Sheet
Figure 3.21 Successfully Loaded Excel File
Figure 3.22 Failure on Loading Excel File
Figure 3.23 Load Access File
Figure 3.24 Result of Column Pairing of Loaded Access and Excel Files
Figure 3.25 Failure on Column Pairing of Loaded Access and Excel Files
Figure 3.26 Creation of XML File
Figure 3.27 Saving an XML File
Figure 3.28 Sample PDF File
Figure 4.1 Experimental Document Breakdowns
Figure 4.2 Sample Data Set and Data Locations in the Printed Technical Publication
Figure 4.3 Sample Sub Data Set and Data Locations in the Printed Technical Publication
Figure 4.4 Manual Data Input in the Target ERP System
Figure 4.5 Manual Data Input for the Subsystem (1) in the Target ERP System
Figure 4.6 Manual Data Input for the Subsystem (2) in the Target ERP System
Figure 4.7 Manual Data Input for the Subsystem (3) in the Target ERP System
Figure 4.8 Manual Data Input for the Subsystem (4) in the Target ERP System

LIST OF ABBREVIATIONS

CSV: Comma Separated Values
DOM: Document Object Model
DTD: Document Type Definition
ERP: Enterprise Resource Planning
GUI: Graphical User Interface
HTML: Hypertext Markup Language
METEX: Metadata Extraction System
ODBC: Open Database Connectivity
SCM: Supply Chain Management
SQL: Structured Query Language
TCP/IP: Transmission Control Protocol/Internet Protocol
XML: Extensible Markup Language
XSL: Extensible Stylesheet Language
W3C: World Wide Web Consortium
NFA: Nondeterministic Finite Automaton
DFA: Deterministic Finite Automaton
IE: Information Extraction

CHAPTER 1

INTRODUCTION

This chapter provides general background for the concepts studied in this thesis.

1.1 Printed Documents & ERP Systems

Despite the growing use of data capture, the business world still relies heavily on paper for account applications and other transactions. An estimated $15 billion is spent each year in the United States to key in and validate data. Considerable savings could be achieved by automating this process. The amount of data captured electronically is increasing. The number of documents generated both electronically and on paper will increase to 20 trillion over the next five years [1]. Although the percentage of printed documents will decline from 90 percent of the total today to about 40 percent, this volume still represents a fourfold increase in the total number of document pages, to 4 trillion pages [2]. Information drives business, and 80% of all corporate content is unstructured in nature. The volume of information produced is growing at 50% annually. On average, a knowledge worker generates about 800 MB of content each year [3].

Nowadays companies need to analyze the technical publications produced by manufacturers, extract the data they need from those publications, and enter the data into their information systems. At the same time, companies invest heavily in Enterprise Resource Planning (ERP) solutions to manage their enterprise activities more effectively, easily and productively. Accordingly, they try to update their information systems regularly.
However, while making these investments, companies still migrate large amounts of information from hard copy documents into their information systems by manpower [4]. This costs time and productivity. Installations of new, highly integrated systems, such as ERP systems, may compound the problem of human error. Internal controls may be lacking due to deadline pressures and the complexity of the installation. In fact, approximately 40% of participants in a recent set of interviews pointed out that the internal controls for ERP systems are often inadequate [5]. Data entry errors occur frequently. One explanation may be that humans lapse into habits of mind during data entry because they mistakenly assume that computerized systems have controls to catch most errors, when modern systems may not [5].

1.2 Manual Data Entry

Manual data entry is the slowest and most error-prone way to record data [6]. Human error in manual data entry is a concern, as is the potential for technical problems or equipment failure in automated methods of data collection [7]. One of the most common problems companies face is the time lost extracting data from publications, which slows down their work processes. Manual data entry into an ERP system slows down the work processes of the company and may cause incorrect data to enter the system. Entering data into the information system manually takes considerable time and wastes manpower. Another concern is the rate of human error in such a repetitive task. When a part number is entered incorrectly into an accounting information system, what is the impact? If good preventive, detective, and corrective controls are in place, the impact will be little, if any. However, if the appropriate controls are not in place, human error can have big ramifications.
For example, when an Army clerk mis-keyed a part number, the product ordered was a seven-ton marine anchor instead of a headlight for a jeep, and the huge anchor was delivered [5]. Detecting, monitoring and correcting wrong data is another concern. When the time and manpower lost in this process are taken into account, the total cost of entering data becomes very high [4].

Data retrieval from massive amounts of technical content can be a challenge for data input operators. Multiple users and departments need to collaborate to input data into the target information systems. In logistics processes, extracting data from high-volume technical publications by evaluating and analyzing them is a very hard and tedious process. Because of the large volume of data, the job requires many people working in parallel. Manual data entry is not only time consuming; it also carries a high potential for errors [3]. Data read from technical publications and entered manually by system users may result in invalid data in the system. Invalid data entered through user error causes financial loss, wasted time, and sometimes violations of system security and safety.

When information gets captured, it is often a manual process to validate it, separate it from other content, and classify it based on content type or sensitivity [32]. This type of manual process is inefficient, insecure, and error prone. There are many potential points of failure, no built-in security measures, and no built-in compliance management. Mistakes made in these areas can destroy a business's credibility and reputation and result in costly legal fees [3].

The drawbacks of manual data entry can be grouped under the following topics:

Error-Prone: Manual data entry is error-prone. Even the best workers rarely maintain the vigilance and consistency required to achieve 100% data accuracy for significant periods of time [8].

Cost Driven: Manual entry is not always a cost-effective solution. The ERP system usually dictates that users follow a certain path when entering data so as not to violate data relations and data integrity. It also requires users to know the ERP system's requirements well. This raises costs through user training, which consumes resources such as time and money.

Source Definition & Data Preparation: Documents published in foreign languages require the person responsible for data entry to know that language at least well enough to understand the documents. It takes a long time to find the required data in the technical documents. The users doing data entry need to be experts in the functionality of the work processes. Every technical publication and each of its subgroups has its own structure, the data extracted differ from one publication to another, and the data belong to different object groups in the information system; all of this creates several challenges. Data entered into the target ERP system can be in various media and in several standards.

Update: In addition to the primary data migration, the technical data changes that manufacturers publish as technology advances must be reflected in the system frequently during the product life cycle. For each new product entering the system, the technical publications and their detail levels need to be evaluated, and the data collection process must then be repeated.

1.3 Data Collection/Input Methods

The possible methods of data input into the ERP system can be listed as follows:
Manual data entry
Automatically loading data obtained from electronic documents into the ERP system
Hybrid solutions

Figure 1.1 illustrates the possible methods for data collection.

Figure 1.1 Possible Data Collection Methods

There is no one-size-fits-all process for data conversion and capture.
The most appropriate method should be selected for the specific project, whether it is keyboarding, optical character recognition, automated processes or double key entry with verification [9].

1.4 The Aim of the Case Study

To address the problems associated with manual data entry, an automated chain of processes was needed. The large volume of data on printed or electronic technical publications makes it difficult to enter the data into the system manually. To break the