ROD: Object Database for the Ruby Language


STUDIA INFORMATICA 2012, Volume 33, Number 2A (105)

Aleksander POHL
Jagiellonian University, Department of Computational Linguistics

ROD: RUBY OBJECT DATABASE
(Polish title: ROD: OBIEKTOWA BAZA DANYCH DLA JĘZYKA RUBY)

Summary. Ruby Object Database is an open-source object database designed for storing and accessing data that rarely changes. The primary reason for designing it was to create a storage facility for natural language dictionaries and corpora. It is optimized for reading speed and ease of use.

Keywords: object database, Ruby, natural language processing

1. Introduction

ROD (Ruby Object Database) is an open-source object database distributed under the MIT/X11 license and available at [...]. ROD is designed for storing and accessing data that rarely changes. It is the opposite of an RDBMS, since the data is not normalized and joins are much faster. It is the opposite of in-memory databases, since it is designed to cover out-of-core data sets (10 GB and more). It is also the opposite of simple key-value stores, since it provides an expressive object-oriented interface.

The primary reason for designing it was to create a storage facility for natural language dictionaries and corpora. The data in a fully fledged dictionary is interconnected in many ways, so the relational model (joins) introduces an unacceptable performance hit. The size of corpora forces them to be kept on disk. In-memory databases like Redis [11] are not suited for large corpora.
They would also require the data of a dictionary to be kept mostly in RAM, which is not needed (in most cases only a fraction of the data is used at the same time). Last but not least, key-value stores, however fast, provide an interface that is not expressive enough for Natural Language Processing tasks. That is why a storage facility was designed that minimizes the number of disk reads and overcomes the defects of the storage systems mentioned above.

The database is accessible via the Ruby language [8], which is both its data definition and data manipulation language. Thanks to Ruby's great expressiveness and true object orientation, data manipulation is as easy as if a domain-specific language had been defined, while still giving full access to a modern and very powerful programming language.

2. Motivation

The primary reason for designing ROD was to create a facility for storing linguistic data that would be easily accessible from Ruby. This was motivated by research in Natural Language Processing (NLP) and the lack of an object-oriented database suited to the specific needs of NLP.

The primary resources used in NLP are machine-readable dictionaries and corpora. These resources have several features that are not very common in information technology in general. First of all, dictionaries rarely change: although natural languages evolve, this process is quite slow. Assuming that a dictionary has reached some maturity, it does not need to be updated more than several times a year. The situation is similar for corpora: they are composed of texts, which are not modified after being incorporated. These types of resources tend to accumulate data rather than change or remove their contents, and in this respect they are similar to data warehouses. However, the second feature of these resources makes them very different from data warehouses, namely the number of types of relations and the number of relations that are registered for the data.
If one wishes to build a decent dictionary for Polish, one has to consider the following: the relation between a word form and its lemma (e.g. dogs and dog; Polish has many more inflected forms than English, 14 in the case of a typical noun), the relation between a word and its senses (e.g. grain: a small granular particle, a weight unit, a seed-like fruit; Princeton WordNet 3.0 [2] registers 11 senses for grain), the different relations between senses of words (e.g. hyperonymy, hyponymy, meronymy, holonymy, troponymy etc.), the derivational relations between words (e.g. house and housing) and similar. These might be further enriched with statistical data, so the data model is quite complex. In the case of corpora the situation is similar: for a fully annotated corpus there would be many syntactic and semantic relations involved. If we used a relational database to store such data, obtaining full semantic or syntactic information, even for a single word form or a text segment, would require a very large number of table joins, causing unacceptable performance. For this reason lexicographers tend to use SGML and (recently) XML rather than relational databases to store dictionaries and textual data.

The last important feature of these resources is their size. If it were relatively small, a good choice for such data would be in-memory databases like memcached or Redis, which offer very good performance, even for highly interrelated data. But the size of dictionaries and corpora is often much larger than the available physical memory, even on modern machines. The size of a fully fledged dictionary is two or three orders of magnitude larger than the number of entries (hundreds of thousands for words and millions for word forms), while the sizes of the largest corpora are counted in tens or hundreds of gigabytes. On the other hand, the performance provided by in-memory databases is not really needed for these resources.
Although it is good to have the largest dictionary and the largest corpus available, the data they provide is never needed all at once. Usually only a small fraction of the data is processed at any one moment, so it is not necessary to keep all of it in operating memory. Only a good memory manager is needed: one that keeps frequently accessed data in RAM and removes rarely accessed data.

Such requirements are not very common in other fields of information technology, so there are not many general-purpose storage systems that would suit them, the only exception being graph databases. On the other hand, the semi-structured graph model offered by these systems is too general for such needs. The data models of dictionaries and corpora, although they evolve, are rather fixed, mostly because a change in the structure makes sense only if new data is available for the whole dictionary or corpus, and obtaining it is usually very expensive.

Besides the features of the linguistic resources, an important factor taken into account when designing ROD was the language for accessing and processing the data. Instead of designing a new one like SQL or SPARQL, it was assumed that Ruby is a very good choice when it comes to navigating the data. If the database provided an object-oriented interface with simple indexing, more complex queries could easily be expressed in that language, thanks to its fully object-oriented nature and simple yet very powerful syntax.

3. Related Work

Designing a new data storage system should always have a good rationale. There are so many storage systems that there is a good chance the problem at hand has already been solved, in which case it does not make sense to implement yet another home-made storage facility. On the other hand, the number of available solutions makes it quite hard to find the one that suits best (and the best choice may change during the development of the client system).
Thus the review of related work will focus on the systems used in Polish NLP as well as the systems available for Ruby.

Finite state machines and finite state transducers are the primary means of obtaining taggings and lemmas from inflected word forms [1,10] (e.g. pies+noun:plural:nominative for psy (dogs)). The whole dictionary for a given language is transformed into a finite state transducer, where each state transition corresponds to a letter in the analysed word form. The result of the analysis is a lemma, or several lemmas in the case of ambiguity, of the word plus the corresponding taggings. Since many word forms share some of their letter sequences, the information is highly compressed, and such a system works with the very high performance characteristic of finite state machines. For the specific task of providing a lemma and a tagging for a word form, finite state transducers seem to be the best option.

But they fall short when it comes to providing more information about the word in question. The first problem is that the result of the analysis might be ambiguous. At least for Polish, the lemma plus the tagging is not enough to distinguish words such as rządy (governments) and rzędy (rows), which are both nouns of the same gender and have the same lemma: rząd. The other problem is that if one wishes to obtain more than a lemma and a tagging (such as the senses of the word), the result of the analysis has to be parsed once again, which introduces a significant processing cost. By contrast, in such a case an object database returns an object or objects that are distinguishable merely by their abstract ids and may provide any additional data via a unified object interface (a method call).

Other data stores used in NLP are engines built to store and efficiently query corpora, not only via keywords but also via various features of the words.
For example Poliqarp, a corpus engine built by IPI PAN [6], allows large amounts of text to be stored and queried with the Poliqarp query language by lemma as well as by a specific part of speech or other morphosyntactic features such as gender, case or number. Although Poliqarp provides quite an expressive query language, its problem is similar to that of finite state transducers: it does not provide an object-oriented interface to the data. So if a developer wishes to remember some result, he or she has to remember the query sent to the server and the offset of the interesting result. This significantly impairs subsequent data retrieval performance, especially when the query returned many results. The other problem is that the data is transferred in the form of semi-structured text, so the result of the query has to be parsed, which further impairs performance. As a result the system has to be augmented with another storage engine to remember the issued queries or the processed results, which makes it impractical as a standalone data storage solution.

The last interesting related work in the field of NLP is the access layer built on top of PolNet, one of the two WordNets built for Polish. [3] describes the architecture of the POLINT-112-SMS system and the reasons for building a custom query language on top of the XML-based data store used to store the WordNet with which the POLINT system interacts. The author argues that a direct integration of the WordNet with the system implementation language (Prolog is given as an example) would introduce high coupling between the NLP system and the storage system. But the author also indicates that adopting a generic solution such as an SQL database, an XML store or an RDF store with a SPARQL interface would yield a system that is less suited to NLP tasks, such as navigation over the WordNet structure or reasoning over the data: the queries would be much more verbose and less meaningful for the developers.
So it would be harder to maintain interoperability between the systems.

It is true that general-purpose data manipulation languages like SQL or SPARQL are more verbose than the language provided by the access layer. It is also true that direct integration of the data with the system would introduce high coupling. Still, it is not obvious that, if an object-oriented interface were provided, queries expressed in the same language as the client system's implementation language would not be concise enough. Most of the examples provided by the author are easily expressible in modern programming languages like Ruby or Python, so there is no need to create such a domain-specific language. This solution is more powerful, since it is much easier to write code in such general-purpose languages than to extend the syntax and implement the semantics of a domain-specific language.

As for Ruby, there are many solutions that bring the benefits of object orientation into the data storage world. However, we compare only the best known and most used library, ActiveRecord [7], since it is the most popular object-relational mapper for Ruby. The important omission is Neo4j [9], a graph database which seems to be the most similar to ROD. The reason is that this database is available only for JRuby, a Ruby implementation for the Java Virtual Machine, while ROD is targeted at MRI, the primary Ruby implementation, written in C.

4. Database Design

4.1. Overview of implementation

The database features described in the Motivation section impose two primary constraints on its design: raw read performance and the ease of use provided by the Ruby interface. Ruby does not seem to be the best choice for implementing the fastest data storage engine, which is why the core of the database is implemented in C. Still, its interface is pure Ruby and mimics ActiveRecord, the well-known object-relational mapper for Ruby used as the default ORM in the Ruby on Rails framework.
To bridge the gap between the fully object-oriented, dynamically typed Ruby and the procedural, statically typed C, the RubyInline library is used. It was created to allow developers to replace slow but critical Ruby code with a C implementation. Its name comes from the fact that the C code is written directly in a Ruby class, so there is no need to maintain separate C files or to compile them manually. The C code is generated and compiled when the Ruby code containing it is run for the first time. As a result a shared library is created and linked at run time, while the dynamic nature of Ruby allows the already defined classes to be extended with new methods. The RubyInline library tracks changes in the Ruby/C code, so the C code is recompiled only if the developer provides new methods or changes the C implementation.

This library allows ROD to generate C code providing access to the database for each class that is intended to be stored in the database. This ensures the high performance provided by statically typed C while maintaining the flexible interface of dynamically typed Ruby. It is no surprise that this design introduces some restrictions on the types of Ruby values that can be stored in the database. But these restrictions are softened by mechanisms provided by the database itself: it is very easy to create a new type storable in the database, and there is a serialization mechanism that allows immediate Ruby values to be stored (if referential integrity does not have to be maintained).

Access method

The access method of the database is provided by the mmap system call, available on most modern operating systems, Linux in particular. This call maps a file stored on disk to a memory region. The pages containing the data are loaded only when the particular memory address is accessed.
This greatly simplifies the implementation of the data access routines and transfers the responsibility for memory management to the operating system. The only operations that have to be handled by the database are growing the data file and re-mapping it. The OS is responsible for the rest: reading the data, maintaining the buffered pages and writing the data back to disk. So the primary memory manager of the database is just the memory manager of the OS. Such a solution should suit the needs of most developers. If it does not, it is usually possible to select a different memory manager for the whole OS.

Data model

The primary storage unit of ROD is a Ruby object, but the core implementation language is C, so the data model of the database is mapped directly to C data types. This means that any Ruby object that is to be persisted in ROD has to be represented as a C struct. Thanks to the RubyInline library this struct is defined at run time by calling Ruby methods, and it is used to represent the attributes and associations of the persisted object.

There are several basic rules for mapping between Ruby objects and C structs. The first is that the name of the attribute or association in the class is used as the name, or as the prefix of the names, of the corresponding fields in the C struct. The second is that the name of the struct is a transformation of the fully qualified name of the Ruby class, with colons replaced by underscores. The last is that the C structs of a single class constitute a contiguous array in memory and are uniquely identified by their index (that is, the offset in the corresponding memory-mapped file). This index is the database identifier of the object; it is named rod_id and its smallest value is [...].

Attributes

The database defines several types of attributes: atomic, fixed-length attributes; atomic, variable-length attributes; and complex attributes.
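As a rough illustration of the data model rules (struct naming and records addressed by their index), the following plain-Ruby sketch packs fixed-size records into one contiguous buffer. It is not ROD's implementation: the helper names, the record layout and the zero-based index are assumptions made for this example, and Array#pack stands in for a real C struct in a memory-mapped file.

```ruby
# Illustrative sketch only: every name below is hypothetical, not ROD's API.

# Struct name derived from the fully qualified Ruby class name.
# Assumption: every single colon is replaced by one underscore.
def struct_name(class_name)
  class_name.gsub(":", "_")
end

# One record: a 32-bit int, a 64-bit unsigned integer and a double.
# Array#pack adds no padding, unlike a compiler laying out a C struct.
RECORD_FORMAT = "l<Q<E"
RECORD_SIZE = [0, 0, 0.0].pack(RECORD_FORMAT).bytesize

# Append a record to the buffer and return its index.
# Assumption: indices start at 0 in this sketch.
def append_record(buffer, count, total, weight)
  buffer << [count, total, weight].pack(RECORD_FORMAT)
  buffer.bytesize / RECORD_SIZE - 1
end

# Fetch a record by index: its byte offset is index * RECORD_SIZE.
def read_record(buffer, index)
  buffer.byteslice(index * RECORD_SIZE, RECORD_SIZE).unpack(RECORD_FORMAT)
end

records = String.new(encoding: Encoding::BINARY)
append_record(records, 14, 150_000, 0.25)
id = append_record(records, -3, 2**40, 1.5)

puts struct_name("Dictionary::WordForm")
p read_record(records, id)
```

Because every record has the same size, looking up an object by its identifier is a single multiplication followed by a read at that offset, which is what makes an index usable as a database identifier.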
Atomic, fixed-length attributes hold Ruby values, such as those of the Fixnum, Integer and Float types, that can be mapped directly to atomic C data types. In fact, at present only the following C types are used: int, unsigned long and double. The first is used to map values of the Ruby Fixnum type; the second, positive values of the Ruby Integer type that are less than or equal to the maximum value of unsigned long; and the third, values of the Ruby Float type, with the restrictions imposed by the double type. If there is such an attribute in the persisted class, a corresponding field is created in the C struct, and the value is stored using Ruby's built-in Ruby-to-C mapping macros and read using the built-in C-to-Ruby macros.

Ruby strings are represented by atomic, variable-length attributes, which are stored in a flat file. Since Ruby strings may contain zeros that do not terminate the string, they are identified by their offset in the file and their length. These two values are represented in the C struct as the _offset and _length fields respectively, prefixed with the name of the attribute.

Complex attributes have to be further divided into two groups: immediate values that do not have to preserve referential integrity, such as Ruby arrays of integers or hashes of strings, and complex Ruby objects that have to preserve referential integrity. Values of the first type are marshaled using Ruby's built-in marshal function or encoded in the JSON format (the method is left as a choice for the user). They are stored in the same way as strings, with the exception that strings are always UTF-8 encoded before being stored. This also applies to values of atomic Ruby types other than Integer, Fixnum and Float, such as Symbol. If the complex values are supposed to preserve referential integrity, they have to be defined by classes that use the ROD storage mechanism, and such attributes have to be transformed into singular associations.
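The storage of variable-length values by offset and length can be sketched as follows. This is a minimal illustration under stated assumptions, not ROD's code: the helper names are hypothetical, a StringIO stands in for the flat file, and Marshal and the json standard library stand in for the two serialization choices mentioned above.

```ruby
require "json"
require "stringio"

# Illustrative sketch only: all helper names are hypothetical, not ROD's API.

# Append raw bytes at the end of the flat store and return [offset, length],
# the two values a record would keep in its _offset and _length fields.
def store_bytes(io, bytes)
  io.seek(0, IO::SEEK_END)
  offset = io.pos
  io.write(bytes)
  [offset, bytes.bytesize]
end

# Read exactly `length` bytes starting at `offset`.
def load_bytes(io, offset, length)
  io.seek(offset)
  io.read(length)
end

# Strings are UTF-8 encoded before being stored; embedded zero bytes are safe
# because the length is recorded explicitly.
def store_string(io, string)
  store_bytes(io, string.encode(Encoding::UTF_8))
end

def load_string(io, offset, length)
  load_bytes(io, offset, length).force_encoding(Encoding::UTF_8)
end

# Immediate complex values are serialized with Marshal or JSON (caller's choice).
def store_value(io, value, format: :marshal)
  store_bytes(io, format == :json ? JSON.generate(value) : Marshal.dump(value))
end

def load_value(io, offset, length, format: :marshal)
  bytes = load_bytes(io, offset, length)
  format == :json ? JSON.parse(bytes.force_encoding(Encoding::UTF_8)) : Marshal.load(bytes)
end

flat = StringIO.new(String.new(encoding: Encoding::BINARY))
word = store_string(flat, "rz\u0105d\0x")
list = store_value(flat, [1, 2, 3], format: :json)

p load_string(flat, *word)
p load_value(flat, *list, format: :json)
```

Values stored this way are copied on every load, so two records pointing at equal data do not share an object; this is why the approach only fits values that do not need referential integrity.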
In most circumstances transforming such attributes into singular associations should not be a problem, since Ruby incorporates the uniform accessor pattern, so from the point of view of the object's interface attributes are indistinguishable from singular associations.

Singular Associations

Singular associations, that is 1-to-1 and n-to-1 associations from the point of view of the class on the left, are treated as follows: the C struct defines an unsigned long field, with the name of the association as its prefix and an _id suffix, that stores the rod_id of the referenced