Probabilistic Reasoning with Naïve Bayes and Bayesian Networks

Zdravko Markov (1), Ingrid Russell
July 2007

(1) Corresponding author: Department of Computer Science, Central Connecticut State University, 1615 Stanley Street, New Britain, CT.

Overview

Bayesian Networks (BN), also called Belief Networks, are a powerful knowledge representation and reasoning mechanism. A Bayesian network represents events and the causal relationships between them as conditional probabilities involving random variables. Given the values of a subset of these variables (evidence variables), a BN can compute the probabilities of another subset of variables (query variables). Bayesian networks can also be created automatically (learned) from statistical data (examples). The well-known Machine Learning algorithm Naïve Bayes is actually a special case of a Bayesian network. This project allows students to experiment with and use the Naïve Bayes algorithm and Bayesian networks to solve practical problems. This includes collecting data from real domains (e.g. web pages), converting these data into a proper format so that conditional probabilities can be computed, and using Bayesian networks and the Naïve Bayes algorithm to compute probabilities and solve classification tasks.
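To make the relationship between the two algorithms concrete: both rest on Bayes' theorem, and Naïve Bayes adds the simplifying assumption that the attributes a_1, ..., a_n are conditionally independent given the class C, which corresponds to a Bayesian network with a single class node pointing to every attribute node. The formulas below are the standard formulation of this idea, added here for reference rather than taken from the project text:

\[
P(C \mid a_1, \ldots, a_n) = \frac{P(C)\, P(a_1, \ldots, a_n \mid C)}{P(a_1, \ldots, a_n)}
\]

\[
P(C \mid a_1, \ldots, a_n) \propto P(C) \prod_{i=1}^{n} P(a_i \mid C)
\]

A general Bayesian network drops this independence assumption and instead factors the joint distribution according to an explicit graph of dependencies; Naïve Bayes is the special case with the simplest such graph.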
Objectives

The aim of this project is to expose students to two important reasoning and learning algorithms, Naïve Bayes and Bayesian Networks, and to explore their relationship in the context of solving practical classification problems. In particular, the objectives of the project are:

- Learning the basics of the Bayesian approach to Machine Learning and the Bayesian Networks approach to Probabilistic Reasoning in AI.
- Gaining experience in using recent software applications in these areas for solving practical problems.
- Better understanding the fundamental concepts of Bayesian Learning and Probabilistic Reasoning and their relationship in the more general context of knowledge representation and reasoning mechanisms in AI.

Project Description

Similarly to the Web document classification project (http://uhaweb.hartford.edu/compsci/ccli/wdc.htm), this project has three main steps: Data Collection, Data Preparation, and Machine Learning. The first two steps of the two projects are basically the same. In fact, documents and data sets in Weka's ARFF format prepared for the former project can also be used for this Bayesian reasoning project. We describe these steps again here because there are some differences in the software tools used, which make the first two steps of the current project more straightforward. The basic difference is in the third step, which includes Bayesian classification and experiments with Bayesian networks.

Data Collection

Generally, the data collection step in Machine Learning involves gathering cases, examples, or instances of objects from different types or categories (classes), so that at the ML step a model of these data can be created and later used to identify groups or similarities in the data (in unsupervised learning) or to predict the class of new objects (in supervised learning). Although any other kind of object may be used for the purposes of this project, we suggest collecting web or text documents. The area of text/web document classification has been used in other ML projects, and students who have worked on those projects have already gained some experience in this area. Results from other projects may be compared to the results from Bayesian learning, which would show the advantages and drawbacks of different algorithms applied to the same data.

And last but not least, text documents can easily be collected from the Web along with their class labels (topics or user preferences). This issue was discussed in detail in two other ML projects related to the Web: Web document classification (http://uhaweb.hartford.edu/compsci/ccli/wdc.htm) and Web user profiling (http://uhaweb.hartford.edu/compsci/ccli/wup.htm).

To illustrate the process of data collection we use a small part of the Web: a set of pages describing the departments in the School of Arts and Sciences at CCSU. Each of these pages is a short textual description of the department. Figure 1 shows a directory page that includes links to the department pages; one of the department pages is shown in Figure 2. As we are going to classify these (and also new) documents into categories, at this point we have to determine a class label for each department page. These labels can be given independently of the document content; however, as the classification will be based on the document content, some semantic mapping between content and class should be established. For this purpose we can roughly split the departments into two groups: sciences (class label A) and humanities (class label B). Table 1 shows these two groups (the department names are abbreviated).

Class A (sciences):   Anthropology, Biology, Chemistry, Computer, Economics, Geography, Math, Physics, Political, Psychology, Sociology
Class B (humanities): Justice, Languages, Music, Philosophy, Communication, English, Art, History, Theatre

Table 1. Documents grouped by class.

[Figure 1. A directory page for a collection of web documents.]

[Figure 2. A sample web document.]

Finally, the web documents are converted into plain text files. The idea is to strip the HTML tags and leave only the text (although the HTML structure can carry some semantics associated with the document class, we ignore it to simplify the processing and document representation). The conversion to text can be done in various ways, manually or with tools; a scripted approach is sketched at the end of this section. For a small number of documents we suggest using Internet Explorer, which allows the currently loaded HTML document to be saved with the Save As option, with Save as type: Text File (*.txt).

The deliverable for the data collection step is a set of plain text files grouped into categories. As an illustration we suggest that students look into the collection of text files describing the A&S departments, available from the data repository in the folder CCSU departments.
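Students who want to automate the HTML-to-text conversion can script it. Below is a minimal sketch in Python using only the standard library; the folder names html_pages and text_pages are illustrative assumptions, not part of the project materials.

# strip_html.py: convert saved .html department pages into plain .txt files.
# A sketch only; html_pages/ and text_pages/ are assumed folder names.
from html.parser import HTMLParser
from pathlib import Path

class TextExtractor(HTMLParser):
    """Collects character data, skipping the contents of script/style tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip > 0:
            self.skip -= 1

    def handle_data(self, data):
        if self.skip == 0:
            self.chunks.append(data)

    def text(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.chunks).split())

Path("text_pages").mkdir(exist_ok=True)
for html_file in Path("html_pages").glob("*.html"):
    parser = TextExtractor()
    parser.feed(html_file.read_text(encoding="utf-8", errors="ignore"))
    out = Path("text_pages") / (html_file.stem + ".txt")
    out.write_text(parser.text(), encoding="utf-8")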
Data Preparation

Now we are going to convert the text documents into data sets in Weka's ARFF format, to be used later in the Machine Learning phase. With this conversion we will actually illustrate the basic concepts of the Vector Space document model, which plays an important role in Information Retrieval and Web Search. We suggest that students read Chapter 1 of book [2] (available for free download from Wiley) before continuing with the steps described below.

1. Download and install the Weka data mining system, version 3.4 (http://www.cs.waikato.ac.nz/~ml/weka/). Read the documentation and try some examples to familiarize yourself with its use (e.g. load and explore the weather data provided with the installation).

2. Create a string data file in ARFF format (see the description of the ARFF format in the Weka documentation). Follow the directions below.

First create a concatenation of all text documents (a text corpus) obtained from the data collection step and save it in a single text file, where each document is represented on a separate line in plain text format. For example, this can be done by loading all text files in MS Word and then saving the file in plain text format without line breaks. Other editors may be used for this purpose too. Students with programming experience may want to write a program to automate this step; a sketch of such a script appears at the end of this section.

Once the file with the text corpus is created, enclose each line in it (an individual document's content) in quotation marks ("), add the document name at the beginning of the line and the document class at the end, all separated by commas, and add a file header at the beginning of the file, as shown below (the @relation name is arbitrary; the data row is truncated for brevity):

@relation departments_string

@attribute document_name string
@attribute document_content string
@attribute document_class string

@data
Anthropology, "anthropology anthropology anthropology consists ...", A

This representation uses three attributes, document_name, document_content, and document_class, all of type string. Each row in the data section represents one of the initial text documents. Note that the number of attributes and the order in which they are listed in the header should correspond to the comma-separated items in the data section. An example of such a string data file is Departments-string.arff, available from the data repository in the folder Weka data.

3. Create term count, Boolean, and TFIDF data sets. Load the string data file into Weka using the Open file button in Preprocess mode. After successful loading, the system shows some statistics about the number of attributes (3), their type (string), and the number of instances (rows in the data section, i.e. documents). Choose the StringToNominal filter and apply it (one at a time) to the first attribute, document_name, and then to the last attribute (index 3), document_class. Then choose the StringToWordVector filter and apply it with outputWordCounts=true. You may also change the settings of onlyAlphabeticTokens and useStoplist to see how the results change. As Weka moves the class attribute to second place, move it back to last by using the Copy filter and the Remove button. The result of all these steps is a Weka data set that uses a term count representation for the documents. An example of this data set is the file Departments-counts.arff, available from the data repository in the folder Weka data.

Now we have a document-term matrix loaded in Weka. Press the Edit button to see it in tabular format, where you can also change its content or copy it to other applications (e.g. MS Excel). Once created in Weka, the table can be stored in an ARFF file through the Save option.

[Screenshot: part of the document-term table for the CCSU departments.]

Weka can also show some interesting statistics about the attributes. The class distribution over the values of each attribute (including the document name) is shown at the bottom of the Selected attribute area. With Visualize All we can see the class distribution over all attributes. If we change the class to document_name, we can see the distribution of terms over documents as bar diagrams. Examine the diagrams (the color indicates the document) and find the most specific terms for each document. For example, compare the diagrams of "anthropology" and "chair" and explain the difference. Which one is more representative, and of which document?

[Screenshot: term distributions over documents shown as bar diagrams.]

Similarly, we can create the Boolean and TFIDF representations of the document collection. Examples of these representations are provided in the files Departments-binary.arff and Departments-TFIDF.arff, available from the data repository in the folder Weka data. Read the comment at the beginning of each file, which explains the steps taken to create it. To obtain the Boolean representation, apply the NumericToBinary filter to the term count representation, and see what changes in the diagrams. For the TFIDF representation, use the original string representation and apply the StringToWordVector filter with IDFTransform=true. Examine the document-term table and the bar diagrams and explain why some columns (e.g. "chair" and "website") are filled with only zeros.

The deliverable for the data preparation step is the three data files (Boolean, term count, and TFIDF), as well as some statistics (tables and diagrams) and an analysis of the attribute and class distributions (see the questions above).
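The corpus-building directions in step 2 are easy to automate. The following Python sketch reads the plain text files produced at the data collection step and writes the string ARFF file directly; the folder name text_pages, the output name departments.arff, and the relation name are assumptions, while the class mapping follows Table 1.

# make_arff.py: build the string ARFF file from a folder of .txt documents.
# A sketch only; folder, output, and relation names are assumed.
from pathlib import Path

# Class label for each document, following Table 1 (A = sciences, B = humanities).
CLASS_OF = {
    "Anthropology": "A", "Biology": "A", "Chemistry": "A", "Computer": "A",
    "Economics": "A", "Geography": "A", "Math": "A", "Physics": "A",
    "Political": "A", "Psychology": "A", "Sociology": "A",
    "Justice": "B", "Languages": "B", "Music": "B", "Philosophy": "B",
    "Communication": "B", "English": "B", "Art": "B", "History": "B",
    "Theatre": "B",
}

with open("departments.arff", "w", encoding="utf-8") as out:
    out.write("@relation departments_string\n\n")
    out.write("@attribute document_name string\n")
    out.write("@attribute document_content string\n")
    out.write("@attribute document_class string\n\n")
    out.write("@data\n")
    for txt in sorted(Path("text_pages").glob("*.txt")):
        name = txt.stem
        # One document per line: collapse line breaks and escape double quotes.
        content = " ".join(txt.read_text(encoding="utf-8").split())
        content = content.replace('"', "'")
        out.write(f'{name}, "{content}", {CLASS_OF[name]}\n')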
Bayesian Classification and Reasoning

At this step we use Naïve Bayes and Bayesian networks for text document classification and look further into the underlying mechanisms of Bayesian reasoning. Before applying these algorithms, however, we have to reduce the number of attributes in our data sets through feature selection. This will have two positive effects on our experiments: first, it will improve the classification accuracy; second, it will simplify, and indeed make possible, the visual inspection and evaluation of the Bayesian network model, which would otherwise involve thousands of variables.

Feature selection

In the area of text document classification, the number of attributes is usually much higher than the number of documents. This poses a problem for many learning algorithms, further aggravated by the presence of too many 0's in the document-term matrix. An obvious solution to this problem is to select a subset of words (terms) that best represent the document collection with respect to the classification task. This process is generally called feature (attribute) selection, and it aims at removing the attributes that are irrelevant to the classification task.

There is an interesting experiment that we discuss hereafter. It investigates the relationship between the number of attributes, their relevance, and the accuracy of classification. Attributes may be ranked by relevance using various algorithms. Weka provides a good number of algorithms for this purpose, available through the Attribute Selection filter (the same algorithms can be used in Select attributes mode). We are not going into the details of these algorithms, but interested students may read the Feature Selection section in Chapter 5 of book [2].

To illustrate this process, let's apply InfoGainAttributeEval to our Boolean data set, Departments-binary.arff. Weka now shows the attributes in a different order, starting with document_name, research, science, etc. Note that document_name appears to be the most relevant attribute, which is obvious, as it uniquely distinguishes every single document. However, this attribute cannot be used to classify new documents, as they will have different names (or IDs). That is why, in most cases, when it comes to generating a classification model we remove the ID attribute. Now, after doing so, let us look at the document-term table (press the Edit button in Preprocess). There is an obvious difference from the same table showing the attributes in alphabetical order: the number of 0's in the leftmost columns is far smaller. It is also interesting to look at the rightmost columns: there are columns with all 1's (e.g. chair, phone, website) and ones with almost all 0's. These are obviously irrelevant attributes (explain why).

The next step is to run the Naïve Bayes algorithm on the current data (with ordered attributes). We will repeat this with different numbers of attributes selected from the beginning of the list. Let's try 1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300, 400, 500, and 600, and record the classification accuracy with 10-fold cross-validation (the default test option). Note that we don't count the class attribute, which is actually used in the process of attribute ranking.

[Graph: cross-validation accuracy vs. number of attributes; the markers on the curve indicate the data points 1, 2, 3, 5, etc.]

With one attribute (research) the algorithm performs relatively well (85% accuracy), and with 2 and 3 it achieves 100% accuracy. Adding more attributes generally decreases the accuracy, as they become less relevant. We also see some fluctuations, especially at the right end of the curve, which may be a result of inconsistency between the attribute selection method and attribute relevance with respect to the Naïve Bayes classification method. The effect of adding irrelevant attributes can be explained generally by the fact that an important assumption for using the Naïve Bayes algorithm is violated: that all attributes have the same relevance with respect to the classification task. The reason for making this assumption can be found in the way Naïve Bayes computes the class likelihoods: they are simply products of the conditional probabilities of all attribute values given the class value.

The results of the above series of experiments show some optimal intervals for the number of attributes needed to achieve maximum accuracy: [2, 3] and [20, 30]. Generally, these results depend on two factors: the choice of data representation and the attribute selection method used. As a deliverable for this step we suggest that students create similar graphs with the other two data sets (term counts and TFIDF), and also use different attribute selection algorithms, to see how these two factors affect the performance of Naïve Bayes and consequently determine the optimal number (and the actual subset) of attributes. The Subset evaluator may also be used to see how Weka finds an optimal subset of attributes with different data sets. A sketch of how this experiment can be scripted outside Weka is given below.
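The rank-then-cross-validate loop can be reproduced outside Weka. The sketch below uses scikit-learn, which the original project does not use, so treat it as an analogous experiment rather than the project's procedure: it ranks Boolean term attributes by information gain (mutual information with the class) and measures 10-fold cross-validated Naïve Bayes accuracy for increasing numbers of top-ranked attributes.

# feature_curve.py: Naive Bayes accuracy vs. number of top-ranked attributes.
# A sketch using scikit-learn (an assumption; the project itself uses Weka).
# X is a documents-by-terms 0/1 numpy array, y an array of class labels.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB

def accuracy_curve(X, y, sizes=(1, 2, 3, 5, 10, 20, 30, 50, 100)):
    # Rank attributes by information gain with respect to the class.
    scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    order = np.argsort(scores)[::-1]  # best-ranked attributes first
    # Plain (unstratified) 10-fold CV; with 20 documents each fold holds 2.
    folds = KFold(n_splits=10, shuffle=True, random_state=0)
    curve = []
    for k in sizes:
        Xk = X[:, order[:k]]  # keep only the k top-ranked terms
        acc = cross_val_score(BernoulliNB(), Xk, y, cv=folds).mean()
        curve.append((k, acc))
    return curve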
Bayesian learning and classification

At this step we use the data set with 5 attributes, with which Naïve Bayes achieved 90% accuracy in the experiments described in the previous section. First we look closely into the classification model built by Naïve Bayes, and then we investigate how the Bayesian network extends it and thus improves the classification accuracy. Running Naïve Bayes with 10-fold cross-validation produces the following output (some parts of it are skipped for brevity):

=== Run information ===
Scheme:     weka.classifiers.bayes.NaiveBayes
Instances:  20
Attributes: 6
            research
            science
            concentrations
            ba
            social
            document_class
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

Class A: Prior probability = 0.55
research:       Discrete Estimator. Counts = 4 9  (Total = 13)
science:        Discrete Estimator. Counts = 5 8  (Total = 13)
concentrations: Discrete Estimator. Counts = 8 5  (Total = 13)
ba:             Discrete Estimator. Counts = 5 8  (Total = 13)
social:         Discrete Estimator. Counts = 8 5  (Total = 13)

Class B: Prior probability = 0.45
research:       Discrete Estimator. Counts = 10 1 (Total = 11)
science:        Discrete Estimator. Counts = 10 1 (Total = 11)
concentrations: Discrete Estimator. Counts = 10 1 (Total = 11)
ba:             Discrete Estimator. Counts = 1 10 (Total = 11)
social:         Discrete Estimator. Counts = 10 1 (Total = 11)

=== Stratified cross-validation ===
Correctly Classified Instances        18               90 %

=== Confusion Matrix ===
 a b   <-- classified as
 9 2 | a = A
 0 9 | b = B

Looking at this output, two important observations can be made. First, each attribute has only one of its values occurring in class B. This is indicated by the fact that one of the counts is always 1, which means that the actual count is 0 (according to the Laplace estimator used by the algorithm, each value count is incremented by 1). Second, the confusion matrix indicates that all documents from actual class B are always classified as B, while two documents from class A are wrongly classified as B. Obviously the second observation is a result of the first one (explain why). A look at the class distribution over the attribute values, produced by the Visualize All option in Preprocess mode, confirms this conclusion.

[Screenshot: class distributions over the values of the five selected attributes.]

For example, all documents with research=1 (the right bar) fall in class A (blue color), which means that no documents with research=1 fall in class B. Consequently, the conditional probability P(research=1 | B) = 0. In fact, the Laplace estimator is used to avoid zero probabilities like this, because they make the whole product zero and thus ignore the contribution of the other attributes. At the same time, however, the conditional probability P(research=0 | B) will be almost 1 (not exactly 1, because of the Laplace correction). Thus a document with research=0, science=0, concentrations=0, ba=1, and social=0 will be clearly classified in class B, because the product P(research=0 | B) P(science=0 | B) P(concentrations=0 | B) P(ba=1 | B) P(social=0 | B) will be much higher than the corresponding product for class A. A small computation illustrating this is given below.
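We can verify these numbers by recomputing the two class scores directly from the counts printed in the classifier output above. The short Python sketch below is our addition, not part of the project materials; the counts are copied from the output, where each pair is (count for value 0, count for value 1), already Laplace-corrected.

# nb_product.py: recompute the Naive Bayes class scores from the Weka counts.
# A sketch; the counts come from the classifier output shown above.
counts = {
    "A": {"research": (4, 9), "science": (5, 8), "concentrations": (8, 5),
          "ba": (5, 8), "social": (8, 5)},
    "B": {"research": (10, 1), "science": (10, 1), "concentrations": (10, 1),
          "ba": (1, 10), "social": (10, 1)},
}
prior = {"A": 0.55, "B": 0.45}
doc = {"research": 0, "science": 0, "concentrations": 0, "ba": 1, "social": 0}

score = {}
for c in ("A", "B"):
    p = prior[c]
    for attr, value in doc.items():
        pair = counts[c][attr]
        p *= pair[value] / sum(pair)  # Laplace-smoothed P(attr=value | c)
    score[c] = p

total = sum(score.values())
for c in ("A", "B"):
    print(c, round(score[c] / total, 3))
# Prints approximately: A 0.051, B 0.949, i.e. the document is decisively class B.

The resulting distribution of roughly 0.05 to 0.95 in favor of class B is exactly the kind of decisive prediction discussed next.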
This can be seen in the Weka output if we look at the probability distribution in the predictions (after checking Output predictions in More options):

=== Predictions on test data ===
inst#, actual, predicted, error, probability distribution

[Predictions table: for each of the 20 documents, the actual class, the predicted class, an error flag, and the probability distribution over classes A and B.]

For all documents from actual class B, the class distribution decisively predicts class B. The predictions also show that the two errors (marked with + in the error column) are the two documents from actual class A that are classified as B.