Naïve Discriminative Learning: Theoretical and Experimental Observations

Stefan Evert (1) & Antti Arppe (2)
(1) Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
(2) University of Alberta, Edmonton, Canada
QITL-6, Tübingen, 6 Nov 2015

S. Evert & A. Arppe NDL: Theory & Experiments Tübingen, 6 Nov / 53

Outline
1 Introduction: Naïve Discriminative Learning; An example
2 Mathematics: The Rescorla-Wagner equations; The Danks equilibrium; NDL vs. the Perceptron vs. least-squares regression
3 Insights: Theoretical insights; Empirical observations; Conclusion

Introduction: Naïve Discriminative Learning

Objectives
- Explain the mathematical foundations of Naïve Discriminative Learning (NDL) in one place and in a consistent way
- Highlight the theoretical similarities of NDL with linear/logistic regression and the single-layer perceptron
- Present some empirical simulations of stochastic NDL learners, in light of the theoretical insights

Naïve Discriminative Learning
- Baayen (2011); Baayen et al. (2011)
- Incremental learning equations for direct associations between cues and outcomes (Rescorla and Wagner 1972)
- Equilibrium conditions (Danks 2003)
- Implementation as R package ndl (Arppe et al. 2014)
- Naive: cue-outcome associations are estimated separately for each outcome (this independence assumption is similar to a naive Bayesian classifier)
- Discriminative: cues predict outcomes based on the total activation level = sum of direct cue-outcome associations
- Learning: incremental learning of association strengths

The Rescorla-Wagner equations (1972)
- Represent incremental associative learning and subsequent on-going adjustments to an accumulating body of knowledge
- Changes in cue-outcome association strengths:
  - No change if a cue is not present in the input
  - Increased if the cue and outcome co-occur
  - Decreased if the cue occurs without the outcome
  - If the outcome can already be predicted well (based on all input cues), adjustments become smaller
- Only the results of the incremental adjustments to the cue-outcome associations are kept; there is no need to remember the individual adjustments, however many there are

Danks (2003) equilibrium conditions
- Presume an ideal stable "adult" state, where all cue-outcome associations have been fully learnt; further data points should then have no impact on the cue-outcome associations
- Provide a convenient short-cut to calculating the final cue-outcome association weights resulting from incremental learning, using relatively simple matrix algebra
- Most learning parameters of the Rescorla-Wagner equations drop out of the Danks equilibrium equation
- Circumvent the problem that a simulation of an R-W learner usually does not converge to a stable state unless the learning rate is gradually decreased

Traditional vs. linguistic applications of R-W
- Traditionally: simple controlled experiments on item-by-item learning, with only a handful of cues and perfect associations
- Natural language: full of choices among multiple possible alternatives (phones, words, or constructions), which are influenced by a large number of contextual factors, and which often show weak to moderate tendencies towards one or more of the alternatives rather than a single unambiguous decision
- These messy, complex types of problems are a key area of interest in modeling and understanding language use
- Application of R-W in the form of a Naïve Discriminative Learner to such linguistic classification problems is successful in practice and can throw new light on research questions

Related work
- R-W vs. perceptron (Sutton and Barto 1981, p. 155f)
- R-W vs. least-squares regression (Stone 1986, p. 457)
- R-W vs. logistic regression (Gluck and Bower 1988, p. 234)
- R-W vs. neural networks (Dawson 2008)
- Similarities are also mentioned by many other authors...

Introduction: An example

Simple vs. complex settings: QITL-1 revisited
- Arppe and Järvikivi (2002, 2007)
- Person (first person singular or not) and Countability (collective or not) of agent/subject of the Finnish verb synonym pair miettiä vs. pohtia 'think, ponder'
[Figure: forced-choice (dispreferred vs. preferred), frequency (rare vs. frequent, relative), and acceptability (unacceptable vs. acceptable) results for the combinations miettiä+sg1, pohtia+coll, miettiä+coll, and pohtia+sg1]
QITL-1 through the lens of NDL
[Figures: association weight over time t, one panel each for AgentGroup and pohtia, AgentGroup and miettiä, PersonFirst and pohtia, PersonFirst and miettiä]

QITL-1 through the lens of QITL-6 (courtesy of Dagmar Divjak)
[Figure: axes TRY.acceptability$Probability and TRY.acceptability$ACCEPTABILITY]

Simple vs. complex settings: QITL-2 revisited
- Number of linguistic features/cues in context per each outcome
[Figure: axes n(features) and Frequency]

QITL-4 revisited: NDL vs. statistical classifiers
Table: Classification diagnostics for models fitted to the Finnish data set (n = 3404).
Columns: $\lambda_{\text{prediction}}$, $\tau_{\text{classification}}$, accuracy
Models compared:
- Polytomous logistic regression (One-vs-rest)
- Polytomous mixed logistic regression (Poisson reformulation), with 1|Section, 1|Author, and 1|Section + 1|Author
- Support Vector Machine
- Memory-Based Learning (TiMBL)
- Random Forests
- Naive Discriminative Learning
The Rescorla-Wagner equations
- Goal of naïve discriminative learner: predict an outcome $O$ based on presence or absence of a set of cues $C_1, \ldots, C_n$
- An event $(c, o)$ is formally described by indicator variables

$$
c_i = \begin{cases} 1 & \text{if } C_i \text{ is present} \\ 0 & \text{otherwise} \end{cases}
\qquad
o = \begin{cases} 1 & \text{if } O \text{ results} \\ 0 & \text{otherwise} \end{cases}
$$

- Given cue-outcome associations $v = (V_1, \ldots, V_n)$ of the learner, the activation level of the outcome $O$ is

$$
\sum_{j=1}^{n} c_j^{(t)} V_j^{(t)}
$$

- Associations $v^{(t)}$ as well as cue and outcome indicators $(c^{(t)}, o^{(t)})$ depend on time step $t$
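The activation level defined above is simply a dot product of the cue indicators with the current association strengths. A minimal sketch, assuming a made-up event with three cues; the function name `activation` and all numbers are illustrative and do not come from the slides:

```python
from typing import Sequence


def activation(cues: Sequence[int], weights: Sequence[float]) -> float:
    """Activation of outcome O: the sum over j of c_j * V_j."""
    if len(cues) != len(weights):
        raise ValueError("cues and weights must have the same length")
    return sum(c * v for c, v in zip(cues, weights))


# Hypothetical event with n = 3 cues, of which C_1 and C_3 are present.
c = [1, 0, 1]
v = [0.2, 0.5, 0.1]  # made-up association strengths V_1, V_2, V_3
print(activation(c, v))  # only V_1 and V_3 contribute
```

Cues that are absent (indicator 0) contribute nothing, so only the associations of the cues present in the event enter the prediction.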
The Rescorla-Wagner equations
Rescorla and Wagner (1972) proposed the R-W equations for the change in associations given an event $(c, o)$:

$$
\Delta V_i = \begin{cases}
0 & \text{if } c_i = 0 \\
\alpha_i \beta_1 \left( \lambda - \sum_{j=1}^{n} c_j V_j \right) & \text{if } c_i = 1 \wedge o = 1 \\
\alpha_i \beta_2 \left( 0 - \sum_{j=1}^{n} c_j V_j \right) & \text{if } c_i = 1 \wedge o = 0
\end{cases}
$$

with parameters
- $\lambda > 0$: target activation level for outcome $O$
- $\alpha_i > 0$: salience of cue $C_i$
- $\beta_1 > 0$: learning rate for positive events ($o = 1$)
- $\beta_2 > 0$: learning rate for negative events ($o = 0$)

The Widrow-Hoff rule
The W-H rule (Widrow and Hoff 1960) is a widely-used simplification of the R-W equations, with parameters
- $\lambda = 1$: target activation level for outcome $O$
- $\alpha_i = 1$: salience of cue $C_i$
- $\beta_1 = \beta_2 = \beta > 0$: global learning rate for positive and negative events

Substituting these values gives

$$
\Delta V_i = \begin{cases}
0 & \text{if } c_i = 0 \\
\beta \left( 1 - \sum_{j=1}^{n} c_j V_j \right) & \text{if } c_i = 1 \wedge o = 1 \\
\beta \left( 0 - \sum_{j=1}^{n} c_j V_j \right) & \text{if } c_i = 1 \wedge o = 0
\end{cases}
\;=\; c_i \, \beta \left( o - \sum_{j=1}^{n} c_j V_j \right)
$$
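The two update rules can be sketched in a few lines. This is only an illustration of the equations as written above, not the code of the ndl package; the function names `rw_update` and `wh_update`, the default parameter values, and the example events are assumptions made for this sketch.

```python
from typing import List, Optional, Sequence


def rw_update(weights: Sequence[float], cues: Sequence[int], outcome: int,
              alpha: Optional[Sequence[float]] = None,
              beta1: float = 0.1, beta2: float = 0.1,
              lam: float = 1.0) -> List[float]:
    """One Rescorla-Wagner step: returns the updated association strengths."""
    if alpha is None:
        alpha = [1.0] * len(weights)  # assumed default: equal salience
    act = sum(c * v for c, v in zip(cues, weights))
    # Positive events move the activation towards lambda, negative ones towards 0.
    scaled_error = beta1 * (lam - act) if outcome == 1 else beta2 * (0.0 - act)
    # Associations of absent cues (c_i = 0) are left unchanged.
    return [v + (a * scaled_error if c == 1 else 0.0)
            for v, c, a in zip(weights, cues, alpha)]


def wh_update(weights: Sequence[float], cues: Sequence[int], outcome: int,
              beta: float = 0.1) -> List[float]:
    """One Widrow-Hoff step: Delta V_i = c_i * beta * (o - sum_j c_j V_j)."""
    act = sum(c * v for c, v in zip(cues, weights))
    return [v + c * beta * (outcome - act) for v, c in zip(weights, cues)]


# Two made-up events: cues 1 and 3 with the outcome, then cues 1 and 2 without it.
v = wh_update([0.0, 0.0, 0.0], cues=[1, 0, 1], outcome=1)
v = wh_update(v, cues=[1, 1, 0], outcome=0)
print(v)
```

With `alpha` set to 1 for all cues, `lam = 1` and `beta1 == beta2`, `rw_update` performs the same step as `wh_update`, which mirrors how the W-H rule is obtained from the R-W equations.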
A simple example: German noun plurals
- Events in order of presentation: Bäume, Flasche, Baum, Gläser, Flaschen, Latte, Hütten, Glas, Bäume, Füße
- Outcome $o$: pl? (is the word a plural form?)
- Cues: $c_1$ = e, $c_2$ = n, $c_3$ = s, $c_4$ = umlaut, $c_5$ = dbl cons, $c_6$ = bgrd
[Table: time step $t$, word, outcome indicator $o$, and cue indicators $c_1, \ldots, c_6$ for each event]
[Figure: association strengths $V_1, \ldots, V_6$ (cues e, n, s, umlaut, dbl cons, bgrd) and activation $\sum_j c_j V_j$ after each event, built up step by step over the event sequence]
A stochastic NDL learner
- A specific event sequence $(c^{(t)}, o^{(t)})$ will only be encountered in controlled experiments
- For applications in corpus linguistics, it is more plausible to assume that events are randomly sampled from a population of event tokens $(c^{(k)}, o^{(k)})$ for $k = 1, \ldots, m$
  - event types listed repeatedly proportional to their frequency
- I.i.d. random variables $c^{(t)} \sim c$ and $o^{(t)} \sim o$
  - distributions of $c$ and $o$ determined by population
- NDL can now be trained for an arbitrary number of time steps, even if the population is small (as in our example)
  - study asymptotic behaviour of learners
  - convergence to a stable "adult" state of associations

A stochastic NDL learner: Effect of the learning rate β
[Figures: association strength $V_i$ over time, one plot per value of the learning rate β]
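A stochastic learner of this kind can be simulated by drawing events at random from a small population and applying the W-H update at each step. The sketch below is an illustration only: the population, its frequencies, the learning rates, and the function name `train_stochastic` are all made up for the example and are not the data or settings behind the plots.

```python
import random
from typing import List, Sequence, Tuple

Event = Tuple[Tuple[int, ...], int]  # (cue indicators c, outcome indicator o)


def train_stochastic(population: Sequence[Event], beta: float,
                     steps: int, seed: int = 0) -> List[float]:
    """Train a Widrow-Hoff learner on events sampled i.i.d. from the population."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    v = [0.0] * len(population[0][0])
    for _ in range(steps):
        cues, outcome = rng.choice(population)  # uniform draw over event tokens
        act = sum(c * w for c, w in zip(cues, v))
        v = [w + c * beta * (outcome - act) for c, w in zip(cues, v)]
    return v


# Hypothetical population with two cues. Event types are listed repeatedly in
# proportion to their (made-up) frequency: cue 1 is followed by the outcome in
# 3 of 4 cases, cue 2 never is.
population: List[Event] = (
    [((1, 0), 1)] * 3 + [((1, 0), 0)] * 1 + [((0, 1), 0)] * 2
)

for beta in (0.5, 0.1, 0.01):
    weights = train_stochastic(population, beta=beta, steps=2000)
    print(beta, [round(w, 3) for w in weights])
```

Because cue 1 is an imperfect predictor in this population, its association keeps being pushed up and down by individual events. Comparing runs with different values of `beta` (and different seeds) is one way to explore how strongly the learning rate affects these fluctuations.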
The Danks equilibrium

Expected activation levels
Since we are interested in the general behaviour of a stochastic NDL, it makes sense to average over many individual learners to obtain expected associations $E[V_j^{(t)}]$:

$$
E\big[V_i^{(t+1)}\big] = E\big[V_i^{(t)}\big] + E\big[\Delta V_i^{(t)}\big]
$$

$$
\begin{aligned}
E\big[\Delta V_i^{(t)}\big]
&= E\Big[ c_i \, \beta \Big( o - \sum_{j=1}^{n} c_j V_j^{(t)} \Big) \Big] \\
&= \beta \, E[c_i o] - \beta \, E\Big[ c_i \sum_{j=1}^{n} c_j V_j^{(t)} \Big] \\
&= \beta \, E[c_i o] - \beta \sum_{j=1}^{n} E\big[ c_i c_j V_j^{(t)} \big] \\
&= \beta \, E[c_i o] - \beta \sum_{j=1}^{n} E[c_i c_j] \, E\big[V_j^{(t)}\big] \\
&= \beta \Big( \Pr(C_i, O) - \sum_{j=1}^{n} \Pr(C_i, C_j) \, E\big[V_j^{(t)}\big] \Big)
\end{aligned}
$$

- The fourth line holds because $c_i$ and $c_j$ are independent from $V_j^{(t)}$
- The last line holds because $c_i$, $c_j$ and $o$ are indicator variables: $E[c_i o] = \Pr(C_i, O)$; $E[c_i c_j] = \Pr(C_i, C_j)$
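A stable state requires that the expected change is zero. Setting the last expression above to zero suggests the linear system $\sum_j \Pr(C_i, C_j) V_j = \Pr(C_i, O)$ for $i = 1, \ldots, n$, in which β no longer appears. The sketch below assumes that this is the equilibrium condition in question and solves it for a small made-up population with plain Gaussian elimination; the population and the helper names `solve_linear` and `equilibrium_weights` are hypothetical.

```python
from typing import List, Sequence, Tuple

Event = Tuple[Tuple[int, ...], int]  # (cue indicators c, outcome indicator o)


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: equilibrium is not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def equilibrium_weights(population: Sequence[Event]) -> List[float]:
    """Weights V with sum_j Pr(C_i, C_j) V_j = Pr(C_i, O) for every cue i."""
    m = len(population)
    n = len(population[0][0])
    p_cc = [[sum(c[i] * c[j] for c, _ in population) / m for j in range(n)]
            for i in range(n)]
    p_co = [sum(c[i] * o for c, o in population) / m for i in range(n)]
    return solve_linear(p_cc, p_co)


# Hypothetical population: cue 1 is followed by the outcome in 3 of 4 cases,
# cue 2 never is.
population: List[Event] = (
    [((1, 0), 1)] * 3 + [((1, 0), 0)] * 1 + [((0, 1), 0)] * 2
)
print(equilibrium_weights(population))
```

The co-occurrence probabilities are simple relative frequencies over the event tokens, so the whole computation needs only counts from the population and no simulation of individual learning steps. If the matrix of cue co-occurrence probabilities is singular, for instance when two cues always occur together, this sketch raises an error rather than choosing one of the possible solutions.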
