k-layer neural networks: High capacity scoring functions + tips on how to train them
- Emory Greene
- 5 years ago
Transcription
1 k-layer neural networks: High capacity scoring functions + tips on how to train them
2 A new class of scoring functions
Before: linear scoring function $s = Wx + b$ (input $x$, output $s$).
Now: 2-layer neural network $s_1 = W_1 x + b_1$, $h = \max(0, s_1)$, $s = W_2 h + b_2$ (input $x$, hidden activations $h$, output $s$).
[Figure: network diagrams of the linear scoring function and of the 2-layer network.]
3 Not restricted to two layers
2-layer neural network: $s_1 = W_1 x + b_1$, $h = \max(0, s_1)$, $s = W_2 h + b_2$.
3-layer neural network: $s_1 = W_1 x + b_1$, $h_1 = \max(0, s_1)$, $s_2 = W_2 h_1 + b_2$, $h_2 = \max(0, s_2)$, $s = W_3 h_2 + b_3$.
[Figure: network diagrams of the 2-layer and 3-layer networks.]
4 Some terminology
3-layer neural network:
- $s_1 = W_1 x + b_1$, where $W_1$ is $m_1 \times d$.
- 1st hidden layer activations: $h_1 = \max(0, s_1)$ (apply non-linearity via activation function).
- $s_2 = W_2 h_1 + b_2$, where $W_2$ is $m_2 \times m_1$.
- 2nd hidden layer activations: $h_2 = \max(0, s_2)$ (apply non-linearity via activation function).
- Output responses: $s = W_3 h_2 + b_3$, where $W_3$ is $c \times m_2$.
Sometimes referred to as a 2-hidden-layer neural network.
[Figure: diagram of the 3-layer network.]
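The layer equations and matrix shapes above can be sketched in NumPy. This is only an illustration: the sizes `d`, `m1`, `m2`, `c`, the random weights and the helper name `forward` are assumptions, not part of the slides.

```python
import numpy as np

def forward(x, weights, biases):
    """Scores of a k-layer ReLU network for one input column vector x."""
    h = x
    # Hidden layers: linear map followed by the ReLU non-linearity.
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)
    # Final layer: linear map only, giving the output scores s.
    return weights[-1] @ h + biases[-1]

# Hypothetical sizes: d inputs, m1 and m2 hidden nodes, c classes.
d, m1, m2, c = 4, 5, 3, 2
rng = np.random.default_rng(0)
shapes = [(m1, d), (m2, m1), (c, m2)]  # W1 is m1 x d, W2 is m2 x m1, W3 is c x m2
weights = [rng.normal(0.0, 0.1, size=shape) for shape in shapes]
biases = [np.zeros((shape[0], 1)) for shape in shapes]

x = rng.normal(size=(d, 1))
s = forward(x, weights, biases)  # one score per class
```

The same function covers the 2-layer case by passing two weight matrices instead of three.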
5 Computational graph of our 2-layer neural network
$x \to s_1 = W_1 x + b_1 \to h = \max(0, s_1) \to s = W_2 h + b_2$, with parameter nodes $W_1, b_1$ feeding the first operation and $W_2, b_2$ feeding the last.
6 2-layer neural network with probabilistic outputs
$x \to s_1 = W_1 x + b_1 \to h = \max(0, s_1) \to s = W_2 h + b_2 \to p = \mathrm{softmax}(s)$, with parameter nodes $W_1, b_1$ and $W_2, b_2$.
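A minimal sketch of the softmax node that turns scores into probabilities. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here as an assumption; the slide does not mention it.

```python
import numpy as np

def softmax(s):
    """Map a score vector s to a probability vector p."""
    # Shifting by max(s) leaves the result unchanged but avoids overflow in exp.
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
```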
7 Effect of the number of hidden nodes in a 2-layer network
[Figure: panels for m = 3, m = 20, m = 30 and m = 100.] m is the number of nodes in the hidden layer. No regularization.
8 Result depends on parameter initialization
[Figure: panels for m = 3, m = 20, m = 30 and m = 100.] m is the number of nodes in the hidden layer. No regularization. Different random parameter initialization from the previous slide.
9 Effect of regularization
$J(\mathcal{D}, \lambda, \Theta) = \sum_{(x,y) \in \mathcal{D}} l(x, y, \Theta) + \lambda\, r(\Theta)$
[Figure: panels for λ = 0, λ = .001, λ = .01 and λ = .1.] m = 100 nodes in the hidden layer. $L_2$ regularization.
Do not use the size of the neural network as a regularizer. Use stronger regularization instead.
10 High-level overview of how to train a network
Mini-batch GD (or variant). Loop:
1. Sample a batch of the training data.
2. Forward propagate it through the graph and calculate the loss/cost.
3. Backward propagate to calculate the gradients.
4. Update the parameters using the gradient.
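The four steps of the loop can be sketched for a 2-layer ReLU network with softmax cross-entropy loss and an $L_2$ penalty. Everything specific here is an assumption made for illustration: the toy two-blob data set, the sizes, the learning rate `eta`, the penalty `lam` and the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): two Gaussian blobs in 2-D.
n, d, m, c = 200, 2, 10, 2
X = np.vstack([rng.normal(-1.0, 0.5, (n // 2, d)), rng.normal(1.0, 0.5, (n // 2, d))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

W1, b1 = rng.normal(0, 0.1, (m, d)), np.zeros(m)
W2, b2 = rng.normal(0, 0.1, (c, m)), np.zeros(c)
eta, lam, batch_size = 0.1, 1e-3, 20

def loss_and_grads(Xb, yb):
    nb = len(yb)
    # Forward pass: s1 = W1 x + b1, h = max(0, s1), s = W2 h + b2, p = softmax(s).
    S1 = Xb @ W1.T + b1
    H = np.maximum(0.0, S1)
    S = H @ W2.T + b2
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.log(P[np.arange(nb), yb]).mean() + lam * (np.sum(W1**2) + np.sum(W2**2))
    # Backward pass: gradient of the loss w.r.t. scores, then back through each node.
    G = P.copy()
    G[np.arange(nb), yb] -= 1.0
    G /= nb
    gW2 = G.T @ H + 2 * lam * W2
    gb2 = G.sum(axis=0)
    GH = (G @ W2) * (S1 > 0)  # ReLU passes gradient only where s1 > 0
    gW1 = GH.T @ Xb + 2 * lam * W1
    gb1 = GH.sum(axis=0)
    return loss, (gW1, gb1, gW2, gb2)

initial_loss, _ = loss_and_grads(X, y)
for step in range(300):
    idx = rng.choice(n, size=batch_size, replace=False)  # 1. sample a batch
    loss, grads = loss_and_grads(X[idx], y[idx])         # 2. forward, 3. backward
    for param, grad in zip((W1, b1, W2, b2), grads):     # 4. update (in place)
        param -= eta * grad
final_loss, _ = loss_and_grads(X, y)
```

Comparing `initial_loss` with `final_loss` gives a quick check that the loop is actually reducing the cost.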
11 Options for activation functions
- Sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
- tanh: $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
[Figure: plots of σ(x), tanh(x) and max(0, x) against x.]
The activation function is applied independently to each element of the score vector.
12 Options for activation functions
- Leaky ReLU: $\max(0.1x, x)$
- ELU: $\mathrm{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(\exp(x) - 1) & \text{otherwise} \end{cases}$
[Figure: plots of max(0.1x, x) and ELU(x) against x.]
The activation function is generally applied independently to each element of the vector.
13 Options for activation functions
- Sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
- tanh: $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
[Figure: plots of σ(x), tanh(x) and max(0, x) against x.]
In modern networks ReLU is the most common activation function.
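The activation functions listed on these slides are short enough to write out element-wise in NumPy (tanh is available directly as `np.tanh`). The default `alpha=1.0` for ELU is an assumption; the slides leave α unspecified.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    # max(0.1x, x): identity for x > 0, a small slope for x < 0.
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # expm1(x) = exp(x) - 1; the inner minimum keeps exp from overflowing for large x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-2.0, 0.0, 3.0])
outputs = {f.__name__: f(x) for f in (sigmoid, np.tanh, relu, leaky_relu, elu)}
```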
14 Better understanding of gradient flows during BackProp has helped training of neural networks Understanding Effect of Activation Functions
15 Sigmoid
$\sigma(x) = \frac{1}{1 + \exp(-x)}$
[Figure: plots of σ(x) and dσ(x)/dx against x.]
Problems:
1. Saturated activations kill the gradient flow.
2. Sigmoid outputs are not zero-centered.
3. exp() is expensive to compute.
16 tanh
$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
[Figure: plots of tanh(x) and d tanh(x)/dx against x.]
Properties:
1. Squashes numbers to the range [-1, 1].
2. Tanh outputs are zero-centered.
3. Saturated activations kill the gradients.
17 Rectified Linear Unit (ReLU)
$\mathrm{ReLU}(x) = \max(0, x)$
[Figure: plots of max(0, x) and d max(0, x)/dx against x.]
Pros:
1. Does not saturate for large positive x.
2. Very computationally efficient.
3. In practice, training of a ReLU network converges much faster than one with sigmoid/tanh activation functions.
Cons:
4. Output is not zero-centered.
5. Negative activations have zero gradient, which freezes some of the parameter weights.
18 Effect of weight initialization & activation function on gradient flow
19 Some activation histograms
Initialize a 10-layer network with 500 nodes at each layer. Use a tanh activation function at each layer. Initialize weights with small random numbers. Generate random input data ($N(0, 1^2)$) with d =
[Figure: histograms of activations at each layer, Layer 1 to Layer 8.]
20 Change the initialization to bigger random numbers
Almost all neurons completely saturated, either -1 or +1 ⇒ gradients will be all zero. (Remember the picture of the gradient of tanh.)
[Figure: histograms of activations at each layer, Layer 1 to Layer 8.]
21 Change the initialization to Xavier initialization
Initialize a 10-layer network with 500 nodes at each layer. Use a tanh activation function at each layer. Initialize weights with Xavier initialization: $W_{i,lm} \sim N(w; 0, 1/500)$. Generate random input data ($N(0, 1^2)$) with d =
[Figure: histograms of activations at each layer, Layer 1 to Layer 8.]
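The three initialization experiments can be reproduced roughly with the sketch below, which tracks the standard deviation of the tanh activations per layer instead of drawing histograms. Several values are assumptions, since the slides do not state them: the input dimension is taken to be 500, "small" is taken as weight standard deviation 0.01, "bigger" as 1.0, and biases are omitted.

```python
import numpy as np

def activation_stds(weight_std, n_layers=10, width=500, n_samples=200, seed=0):
    """Standard deviation of the tanh activations at each layer."""
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, 1.0, size=(n_samples, width))  # inputs drawn from N(0, 1)
    stds = []
    for _ in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(width, width))
        h = np.tanh(h @ W)
        stds.append(float(h.std()))
    return stds

small = activation_stds(0.01)                   # activations shrink towards zero
large = activation_stds(1.0)                    # activations saturate near -1 / +1
xavier = activation_stds(1.0 / np.sqrt(500.0))  # weight variance 1/500
```

Printing the three lists side by side shows the collapse, the saturation and the comparatively stable spread under Xavier initialization.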
22 Lessening the effect of initialization: Batch normalization
23 Batch Normalization
Want unit Gaussian activations at each layer? Just make them unit Gaussian!
Idea introduced in: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, arXiv.
Consider activations at some layer $j$ for a batch: $s_1^{(j)}, s_2^{(j)}, \ldots, s_n^{(j)}$.
To make each dimension unit Gaussian, apply
$\hat{s}_i^{(j)} = \mathrm{diag}(\sigma_1, \ldots, \sigma_m)^{-1} \left( s_i^{(j)} - \mu \right)$
where $\mu = \frac{1}{n} \sum_{i=1}^{n} s_i^{(j)}$ and $\sigma_p^2 = \frac{1}{n} \sum_{i=1}^{n} \left( s_{i,p}^{(j)} - \mu_p \right)^2$.
24 Batch Normalization
Usually apply normalization after the fully connected layer and before the non-linearity. Therefore for a k-layer network have
- for $i = 1, \ldots, k-1$
    for $(x^{(i-1)}, y) \in \mathcal{D}$
      Apply the ith linear transformation to the batch: $s^{(i)} = W_i x^{(i-1)} + b_i$
    end
    Compute the batch mean and variances of the ith layer:
      $\mu = \frac{1}{|\mathcal{D}|} \sum_{s^{(i)} \in \mathcal{D}} s^{(i)}$, $\sigma_j^2 = \frac{1}{|\mathcal{D}|} \sum_{s^{(i)} \in \mathcal{D}} \left( s_j^{(i)} - \mu_j \right)^2$ for $j = 1, \ldots, m_i$
    for $(s^{(i)}, y) \in \mathcal{D}$
      Apply BN and the activation function: $\hat{s}^{(i)} = \mathrm{BatchNormalise}(s^{(i)}, \mu, \sigma_1, \ldots, \sigma_{m_i})$, $x^{(i)} = \max(0, \hat{s}^{(i)})$
    end
  end
- Apply the final linear transformation: $s^{(k)} = W_k x^{(k-1)} + b_k$
25 Batch Normalization: Scale & shift range
Can also allow the network to squash and shift the range of the $\hat{s}^{(i)}$'s at each layer:
$\tilde{s}^{(i)} = \gamma^{(i)} \odot \hat{s}^{(i)} + \beta^{(i)}$
Can learn the $\gamma^{(i)}$'s and $\beta^{(i)}$'s and add them as parameters of the network. To keep things simple this added complexity is often omitted.
26 Benefits of Batch Normalization
- Improves gradient flow through the network.
- Reduces the strong dependence on initialization.
⇒ Learn deeper networks more reliably.
- Allows higher learning rates.
- Acts as a form of regularization.
If training a deep network, you should use Batch Normalization.
27 Batch Normalization at Test Time At test time do not have a batch. Instead fixed empirical mean and variances of activations at each level are used. These quantities estimated during training (with running averages).
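A minimal sketch of batch normalization with separate training and test behaviour, including the optional scale and shift. The class name `BatchNorm`, the running-average factor `momentum=0.9` and the small constant `eps` inside the square root are assumptions for illustration; this is not a library API and it omits the backward pass.

```python
import numpy as np

class BatchNorm:
    """Batch normalization for a (batch, features) array of scores."""

    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(n_features)   # learnable scale
        self.beta = np.zeros(n_features)   # learnable shift
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, S, training):
        if training:
            mu, var = S.mean(axis=0), S.var(axis=0)
            # Running averages collected during training, used at test time.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        S_hat = (S - mu) / np.sqrt(var + self.eps)
        return self.gamma * S_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm(3)
for _ in range(200):
    out = bn(rng.normal(5.0, 2.0, size=(64, 3)), training=True)
# At test time a single example is normalized with the stored statistics.
test_out = bn(np.full((1, 3), 5.0), training=False)
```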
28 Babysitting the training process
29 Training neural networks is not completely trivial
Several hyper-parameters affect the quality of your training. These include
- learning rate
- degree of regularization
- network architecture
- hyper-parameters controlling weight initialization
If these (potentially correlated) hyper-parameters are not appropriately set ⇒ you will not learn an effective network.
There are multiple quantities you should monitor during training. These quantities indicate
- a reasonable hyper-parameter setting and/or
- how the hyper-parameter settings could be changed for the better.
30 What to monitor during training
31 Monitor & Visualize the loss/cost curve Evolution of your training loss is telling you something! Typical training loss over time
32 Telltale sign of a bad initialization
33 Monitor & visualize the accuracy
Gap between training and validation accuracy indicates the amount of over-fitting.
Over-fitting ⇒ should increase regularization during training:
- increase the degree of $L_2$ regularization
- more dropout
- use more training data.
34 Monitor & visualize the accuracy
Gap between training and validation accuracy indicates the amount of over-fitting.
Under-fitting ⇒ model capacity not high enough:
- increase the size of the network
35 Optimization of the training hyper-parameters
36 Hyperparameters to adjust
- Initial learning rate.
- Learning rate decay schedule.
- Regularization strength
  - $L_2$ penalty
  - Dropout strength
37 Cross-validation strategy
Do a coarse-to-fine cross-validation in stages.
Stage 0: Identify the range of feasible learning rates & regularization penalties. (Usually done interactively, training only for a few updates.)
Stage 1: Broad search. Goal is to narrow the search range. Only run training for a few epochs.
Stage 2: Finer search. Increase training times.
Stage ...: Repeat Stage 2 as necessary.
Use performance on the validation set to identify good hyper-parameter settings.
38 Prefer random search to grid search randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012
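A small sketch of random search over the learning rate and $L_2$ penalty. Sampling on a log scale, the search ranges, and the stand-in objective `toy_score` are all assumptions for illustration; in practice `evaluate` would train for a few epochs and return the validation accuracy.

```python
import math
import random

def log_uniform(rng, low, high):
    """Sample from [low, high] uniformly on a log10 scale."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

def random_search(evaluate, n_trials, rng):
    """Return the best (score, setting) over randomly drawn settings."""
    best = None
    for _ in range(n_trials):
        setting = {
            "learning_rate": log_uniform(rng, 1e-5, 1e-1),  # assumed search range
            "l2_penalty": log_uniform(rng, 1e-6, 1e-1),     # assumed search range
        }
        score = evaluate(setting)
        if best is None or score > best[0]:
            best = (score, setting)
    return best

def toy_score(setting):
    # Hypothetical stand-in for "train briefly, then measure validation accuracy":
    # it simply prefers learning rates close to 1e-3.
    return -abs(math.log10(setting["learning_rate"]) + 3.0)

best_score, best_setting = random_search(toy_score, n_trials=50, rng=random.Random(0))
```

For the coarse-to-fine strategy, the ranges passed to `log_uniform` would be narrowed around the best settings after each stage.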
39 Parameter Updates: Variations of Stochastic Gradient Descent
40 One weakness of SGD
SGD can be very slow... Example: use SGD to find the optimum of $f(x) = \exp(.5\, x^T \Sigma x)$.
[Figure: iso-contours of f(x) with the SGD path; 150 iterations, η = .01.]
SGD has trouble navigating ravines as it oscillates across the bottom of the ravine. Could increase the learning rate, but a larger learning rate ⇒ the optimizer is more likely to diverge. Unfortunately, ravines are common around local optima.
41 Solution: SGD with momentum
Introduce a momentum vector as well as the gradient vector. Let $\gamma \in [0, 1]$ and let $v$ be the momentum vector:
$v^{(t+1)} = \gamma v^{(t)} + \eta \nabla_x f(x^{(t)})$ (update vector)
$x^{(t+1)} = x^{(t)} - v^{(t+1)}$
Typically set $\gamma$ somewhere in the range [.9, .99].
[Figure: vector diagram of the momentum update from $x^{(t)}$ to $x^{(t+1)}$.]
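The momentum update can be sketched on a toy ravine. The quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 20)$ and the starting point are assumptions chosen for illustration; they are not the toy problem plotted on the slides.

```python
import numpy as np

A = np.diag([1.0, 20.0])  # assumed toy ravine: f(x) = 0.5 * x^T A x

def grad(x):
    return A @ x

def sgd_momentum(grad, x0, eta=0.01, gamma=0.9, n_iters=150):
    x = x0.astype(float).copy()
    v = np.zeros_like(x)
    for _ in range(n_iters):
        v = gamma * v + eta * grad(x)  # momentum (update) vector
        x = x - v                      # parameter update
    return x

x0 = np.array([-2.0, 1.0])
x_sgd = sgd_momentum(grad, x0, gamma=0.0)  # gamma = 0 recovers plain gradient descent
x_mom = sgd_momentum(grad, x0, gamma=0.9)
```

With the same step size and iteration count, the momentum run ends much closer to the optimum at the origin than the plain run, which crawls along the shallow direction.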
42 How and why momentum helps
How? Momentum helps accelerate SGD in the appropriate direction. Momentum dampens the oscillations of default SGD ⇒ faster convergence.
[Figure: optimization path on the toy problem; γ = .9, η = .01, 150 iterations.]
Why? For dimensions whose gradient is constantly changing, the entries in the update vector are damped. For dimensions whose gradient is approximately constant, the entries in the update vector are not damped.
43 Momentum not the complete answer
When using momentum ⇒ can pick up too much speed in one direction ⇒ can overshoot the local optimum.
[Figure: optimization path on the toy problem; γ = .9, η = .03.]
44 Solution: Nesterov accelerated gradient (NAG)
Look and measure ahead. Use the gradient at an estimate of the parameters at the next iteration. Let $\gamma \in [0, 1]$, then
$e^{(t+1)} = x^{(t)} - \gamma v^{(t)}$ (estimate of $x^{(t+1)}$)
$v^{(t+1)} = \gamma v^{(t)} + \eta \nabla_x f(e^{(t+1)})$ (update vector)
$x^{(t+1)} = x^{(t)} - v^{(t+1)}$
Typically $\gamma$ is set to .9.
[Figure: vector diagrams comparing the momentum update and the NAG update.]
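The three NAG equations translate line by line into code. As an assumption for illustration, the sketch runs on a toy quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 20)$ rather than on the slides' toy problem.

```python
import numpy as np

A = np.diag([1.0, 20.0])  # assumed toy ravine: f(x) = 0.5 * x^T A x

def grad(x):
    return A @ x

def nag(grad, x0, eta=0.01, gamma=0.9, n_iters=150):
    x = x0.astype(float).copy()
    v = np.zeros_like(x)
    for _ in range(n_iters):
        e = x - gamma * v              # look-ahead estimate of the next x
        v = gamma * v + eta * grad(e)  # update vector uses the gradient at e
        x = x - v
    return x

x0 = np.array([-2.0, 1.0])
x_nag = nag(grad, x0)
```

The only difference from the momentum update is the point at which the gradient is evaluated.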
45 How and why NAG helps The anticipatory update prevents the algorithm having too large updates and overshooting. Algorithm has increased responsiveness to the landscape of f. (γ =.9, η =.01, 150 iterations) Note: NAG shown to greatly increase the ability to train RNNs: Bengio, Y., Boulanger-Lewandowski, N. & Pascanu, R. Advances in Optimizing Recurrent Networks, (2012).
46 Improvements to NAG Want to adapt the updates to each individual parameter. Perform larger or smaller updates depending on the landscape of the cost function. Family of algorithms with adaptive learning rates - AdaGrad - AdaDelta - RMSProp - Adam
47 AdaGrad
For a cleaner statement introduce some notation: $g_t = \nabla_x f(x^{(t)})$ and $g_t = (g_{t,1}, \ldots, g_{t,d})^T$.
Keep a record of the sum of the squares of the gradients w.r.t. each $x_i$ up to time $t$:
$G_{t,i} = \sum_{j=1}^{t} g_{j,i}^2$
The AdaGrad update step for each dimension is
$x_i^{(t+1)} = x_i^{(t)} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$
Usually set $\epsilon$ = 1e-8 and $\eta$ = .01.
J. Duchi, E. Hazan & Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Journal of Machine Learning Research, 2011.
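A sketch of the AdaGrad update that also records the per-dimension effective learning rate $\eta / \sqrt{G_{t,i} + \epsilon}$. The toy quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 20)$ is an assumption for illustration.

```python
import numpy as np

A = np.diag([1.0, 20.0])  # assumed toy ravine: f(x) = 0.5 * x^T A x

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def adagrad(grad, x0, eta=0.01, eps=1e-8, n_iters=150):
    x = x0.astype(float).copy()
    G = np.zeros_like(x)  # running sum of squared gradients per dimension
    rates = []
    for _ in range(n_iters):
        g = grad(x)
        G += g * g
        rate = eta / np.sqrt(G + eps)  # effective learning rate per dimension
        x = x - rate * g
        rates.append(rate.copy())
    return x, np.array(rates)

x0 = np.array([-2.0, 1.0])
x_ada, rates = adagrad(grad, x0)
```

Because `G` only ever grows, every column of `rates` is non-increasing over the iterations.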
48 AdaGrad's convergence on our toy problem
[Figure: optimization path; ε = 1e-8, η = .01, 150 iterations.]
49 Big weakness of AdaGrad
Each $g_{t,i}^2$ is non-negative.
⇒ Each $G_{t,i} = \sum_{j=1}^{t} g_{j,i}^2$ keeps growing during training.
⇒ The effective learning rate $\eta / \sqrt{G_{t,i} + \epsilon}$ shrinks and eventually → 0.
⇒ Updates of $x^{(t)}$ stop.
50 AdaDelta
Devised as an improvement to AdaGrad. Tackles AdaGrad's convergence to zero of the learning rate as t increases.
AdaDelta's two central ideas:
- scale the learning rate based on the previous gradient values (like AdaGrad) but only using a recent time window,
- include an acceleration term (like momentum) by accumulating prior updates.
M. Zeiler, ADADELTA: An Adaptive Learning Rate Method,
51 Technical details of AdaDelta
Compute the gradient vector $g_t$ at the current estimate $x^{(t)}$.
Update the average of previous squared gradients (AdaGrad-like step):
$G_{t,i} = \rho\, G_{t-1,i} + (1 - \rho)\, g_{t,i}^2$
Compute the update vector:
$u_{t,i} = \frac{\sqrt{U_{t-1,i} + \epsilon}}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$
Compute the exponentially decaying average of updates (momentum-like step):
$U_{t,i} = \rho\, U_{t-1,i} + (1 - \rho)\, u_{t,i}^2$
The AdaDelta update step:
$x_i^{(t+1)} = x_i^{(t)} - u_{t,i}$
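The four AdaDelta steps can be sketched as follows. The values `rho=0.95` and `eps=1e-6` are assumptions (the slide gives no defaults), as is the toy quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 20)$.

```python
import numpy as np

A = np.diag([1.0, 20.0])  # assumed toy ravine: f(x) = 0.5 * x^T A x

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def adadelta(grad, x0, rho=0.95, eps=1e-6, n_iters=150):
    x = x0.astype(float).copy()
    G = np.zeros_like(x)  # decaying average of squared gradients
    U = np.zeros_like(x)  # decaying average of squared updates
    for _ in range(n_iters):
        g = grad(x)
        G = rho * G + (1 - rho) * g * g
        u = np.sqrt(U + eps) / np.sqrt(G + eps) * g  # uses U from the previous step
        U = rho * U + (1 - rho) * u * u
        x = x - u
    return x

x0 = np.array([-2.0, 1.0])
x_delta = adadelta(grad, x0)
```

Note that no learning rate η appears: the ratio of the two running averages sets the step size.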
52 Adaptive Moment Estimation (Adam)
Computes adaptive learning rates for each parameter. How?
- Stores an exponentially decaying average of past gradients $m^{(t)}$ and past squared gradients $v^{(t)}$.
- $m^{(t)}$ and $v^{(t)}$ are estimates of the first and second moments, respectively, of the gradient in each dimension.
- Uses the variance + mean² estimate to damp the update in dimensions with a high second moment.
D. P. Kingma & J. L. Ba, Adam: a Method for Stochastic Optimization, International Conference on Learning Representations, 2015.
53 Update equations for Adam
Let $g_t = \nabla_x f(x^{(t)})$:
$m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\, g_t$
$v^{(t+1)} = \beta_2 v^{(t)} + (1 - \beta_2)\, g_t \odot g_t$
Set $m^{(0)} = v^{(0)} = 0$ ⇒ $m^{(t)}$ and $v^{(t)}$ are biased towards zero (especially during the initial time-steps). Counter these biases by setting:
$\hat{m}^{(t+1)} = \frac{m^{(t+1)}}{1 - \beta_1^{t+1}}, \quad \hat{v}^{(t+1)} = \frac{v^{(t+1)}}{1 - \beta_2^{t+1}}$
The Adam update rule:
$x^{(t+1)} = x^{(t)} - \frac{\eta}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}\, \hat{m}^{(t+1)}$
Suggested default values: $\beta_1 = .9$, $\beta_2 = .999$, $\epsilon = 10^{-8}$.
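A sketch of the Adam update with bias correction. The step size `eta=0.001` is an assumption (the slide lists defaults only for β₁, β₂ and ε), and the toy quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(1, 20)$ is used purely for illustration.

```python
import numpy as np

A = np.diag([1.0, 20.0])  # assumed toy ravine: f(x) = 0.5 * x^T A x

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def adam(grad, x0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=150):
    x = x0.astype(float).copy()
    m = np.zeros_like(x)  # first-moment estimate, starts at 0
    v = np.zeros_like(x)  # second-moment estimate, starts at 0
    for t in range(1, n_iters + 1):  # t counts the updates made so far
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([-2.0, 1.0])
x_one = adam(grad, x0, n_iters=1)
x_adam = adam(grad, x0)
```

After bias correction the very first step has magnitude η in every dimension, regardless of the gradient's scale, which `x_one` illustrates.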
54 Adam's performance on our toy problem
[Figure: optimization path; default parameter settings, 150 iterations.]
55 Comparison of different algorithms on our toy problem
[Figure: two panels comparing Adam, AdaGrad, NAG, Momentum and SGD; left: ε = 1e-8, γ = .9, η = .01, 150 iterations; right: ε = 1e-8, γ = .9, η = .03, 150 iterations.]
56 Which optimizer to use?
- Data sparse ⇒ likely to achieve best results using one of the adaptive learning-rate methods.
- Using the adaptive learning-rate methods ⇒ won't need to tune the learning rate (much!).
- RMSProp, AdaDelta, and Adam are very similar algorithms that do well in similar circumstances.
- Adam slightly outperforms RMSProp near the end of optimization.
- Adam might be the best overall choice.
- But vanilla SGD (without momentum) and a simple learning rate annealing schedule may be sufficient. But time until finding a local minimum may be long...
57 Annealing the learning rate
58 Useful to anneal the learning rate
When training deep networks, it is usually helpful to anneal the learning rate over time. Why?
- Stops the parameter vector from bouncing around too widely.
- ⇒ can reach into deeper, but narrower parts of the loss function.
But knowing when to decay the learning rate is tricky!
Decay too slowly ⇒ waste computations bouncing around chaotically with little improvement.
Decay too aggressively ⇒ system unable to reach the best position it can.
59 Common approaches to learning rate decay
- Step decay: After every nth epoch set $\eta = \alpha \eta$ where $\alpha \in (0, 1)$. (Instead, sometimes people monitor the validation loss and reduce the learning rate when this loss stops improving.)
- Exponential decay: $\eta = \eta_0 e^{-kt}$ where $t$ is the iteration number (either w.r.t. number of update steps or epochs). Then $\eta_0$ and $k$ are hyper-parameters.
- 1/t decay: $\eta = \frac{\eta_0}{1 + kt}$
Step decay is the most common. Better to decay conservatively and train for longer.
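The three schedules are one-liners. The default values of `alpha`, `every` and `k` below are arbitrary assumptions for illustration.

```python
import math

def step_decay(eta0, epoch, alpha=0.5, every=10):
    """Multiply the learning rate by alpha after every `every` epochs."""
    return eta0 * alpha ** (epoch // every)

def exponential_decay(eta0, t, k=0.1):
    """eta = eta0 * exp(-k t)."""
    return eta0 * math.exp(-k * t)

def inverse_time_decay(eta0, t, k=0.1):
    """1/t decay: eta = eta0 / (1 + k t)."""
    return eta0 / (1.0 + k * t)

schedule = [step_decay(0.1, epoch) for epoch in range(30)]
```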
More information- 1 - **** d(lns) = (µ (1/2)σ 2 )dt + σdw t
- 1 - **** These answers indicate the solutions to the 2014 exam questions. Obviously you should plot graphs where I have simply described the key features. It is important when plotting graphs to label
More informationBarrier Option. 2 of 33 3/13/2014
FPGA-based Reconfigurable Computing for Pricing Multi-Asset Barrier Options RAHUL SRIDHARAN, GEORGE COOKE, KENNETH HILL, HERMAN LAM, ALAN GEORGE, SAAHPC '12, PROCEEDINGS OF THE 2012 SYMPOSIUM ON APPLICATION
More informationPredictive Model Learning of Stochastic Simulations. John Hegstrom, FSA, MAAA
Predictive Model Learning of Stochastic Simulations John Hegstrom, FSA, MAAA Table of Contents Executive Summary... 3 Choice of Predictive Modeling Techniques... 4 Neural Network Basics... 4 Financial
More informationUnderstanding Deep Learning Requires Rethinking Generalization
Understanding Deep Learning Requires Rethinking Generalization ChiyuanZhang 1 Samy Bengio 3 Moritz Hardt 3 Benjamin Recht 2 Oriol Vinyals 4 1 Massachusetts Institute of Technology 2 University of California,
More informationA Novel Prediction Method for Stock Index Applying Grey Theory and Neural Networks
The 7th International Symposium on Operations Research and Its Applications (ISORA 08) Lijiang, China, October 31 Novemver 3, 2008 Copyright 2008 ORSC & APORC, pp. 104 111 A Novel Prediction Method for
More informationStock market price index return forecasting using ANN. Gunter Senyurt, Abdulhamit Subasi
Stock market price index return forecasting using ANN Gunter Senyurt, Abdulhamit Subasi E-mail : gsenyurt@ibu.edu.ba, asubasi@ibu.edu.ba Abstract Even though many new data mining techniques have been introduced
More informationGradient Descent and the Structure of Neural Network Cost Functions. presentation by Ian Goodfellow
Gradient Descent and the Structure of Neural Network Cost Functions presentation by Ian Goodfellow adapted for www.deeplearningbook.org from a presentation to the CIFAR Deep Learning summer school on August
More informationReinforcement learning and Markov Decision Processes (MDPs) (B) Avrim Blum
Reinforcement learning and Markov Decision Processes (MDPs) 15-859(B) Avrim Blum RL and MDPs General scenario: We are an agent in some state. Have observations, perform actions, get rewards. (See lights,
More informationCSE 473: Artificial Intelligence
CSE 473: Artificial Intelligence Markov Decision Processes (MDPs) Luke Zettlemoyer Many slides over the course adapted from Dan Klein, Stuart Russell or Andrew Moore 1 Announcements PS2 online now Due
More informationFX Smile Modelling. 9 September September 9, 2008
FX Smile Modelling 9 September 008 September 9, 008 Contents 1 FX Implied Volatility 1 Interpolation.1 Parametrisation............................. Pure Interpolation.......................... Abstract
More informationPredicting Bitcoin Exchange Rate Values Can Machine Learning Algorithms Help?
Predicting Bitcoin Exchange Rate Values Can Machine Learning Algorithms Help? Student: Kevin Su dmersen (ID: 1791791) Supervisor: Piotr Jelonek Date: September 12, 2018 University of Warwick Abstract Predicting
More informationSDMR Finance (2) Olivier Brandouy. University of Paris 1, Panthéon-Sorbonne, IAE (Sorbonne Graduate Business School)
SDMR Finance (2) Olivier Brandouy University of Paris 1, Panthéon-Sorbonne, IAE (Sorbonne Graduate Business School) Outline 1 Formal Approach to QAM : concepts and notations 2 3 Portfolio risk and return
More informationLearning from Data: Learning Logistic Regressors
Learning from Data: Learning Logistic Regressors November 1, 2005 http://www.anc.ed.ac.uk/ amos/lfd/ Learning Logistic Regressors P(t x) = σ(w T x + b). Want to learn w and b using training data. As before:
More informationInternational Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18, ISSN
International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18, www.ijcea.com ISSN 31-3469 AN INVESTIGATION OF FINANCIAL TIME SERIES PREDICTION USING BACK PROPAGATION NEURAL
More informationApplication of Soft-Computing Techniques in Accident Compensation
Application of Soft-Computing Techniques in Accident Compensation Prepared by Peter Mulquiney Taylor Fry Consulting Actuaries Presented to the Institute of Actuaries of Australia Accident Compensation
More informationMacroeconomics of the Labour Market Problem Set
Macroeconomics of the Labour Market Problem Set dr Leszek Wincenciak Problem 1 The utility of a consumer is given by U(C, L) =α ln C +(1 α)lnl, wherec is the aggregate consumption, and L is the leisure.
More informationIran s Stock Market Prediction By Neural Networks and GA
Iran s Stock Market Prediction By Neural Networks and GA Mahmood Khatibi MS. in Control Engineering mahmood.khatibi@gmail.com Habib Rajabi Mashhadi Associate Professor h_mashhadi@ferdowsi.um.ac.ir Electrical
More informationHandout 8: Introduction to Stochastic Dynamic Programming. 2 Examples of Stochastic Dynamic Programming Problems
SEEM 3470: Dynamic Optimization and Applications 2013 14 Second Term Handout 8: Introduction to Stochastic Dynamic Programming Instructor: Shiqian Ma March 10, 2014 Suggested Reading: Chapter 1 of Bertsekas,
More informationTrinomial Tree. Set up a trinomial approximation to the geometric Brownian motion ds/s = r dt + σ dw. a
Trinomial Tree Set up a trinomial approximation to the geometric Brownian motion ds/s = r dt + σ dw. a The three stock prices at time t are S, Su, and Sd, where ud = 1. Impose the matching of mean and
More informationAn enhanced artificial neural network for stock price predications
An enhanced artificial neural network for stock price predications Jiaxin MA Silin HUANG School of Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR S. H. KWOK HKUST Business
More informationOption Pricing Using Bayesian Neural Networks
Option Pricing Using Bayesian Neural Networks Michael Maio Pires, Tshilidzi Marwala School of Electrical and Information Engineering, University of the Witwatersrand, 2050, South Africa m.pires@ee.wits.ac.za,
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Monte Carlo Methods Mark Schmidt University of British Columbia Winter 2019 Last Time: Markov Chains We can use Markov chains for density estimation, d p(x) = p(x 1 ) p(x }{{}
More informationExtend the ideas of Kan and Zhou paper on Optimal Portfolio Construction under parameter uncertainty
Extend the ideas of Kan and Zhou paper on Optimal Portfolio Construction under parameter uncertainty George Photiou Lincoln College University of Oxford A dissertation submitted in partial fulfilment for
More informationAnalysing the IS-MP-PC Model
University College Dublin, Advanced Macroeconomics Notes, 2015 (Karl Whelan) Page 1 Analysing the IS-MP-PC Model In the previous set of notes, we introduced the IS-MP-PC model. We will move on now to examining
More informationIN finance applications, the idea of training learning algorithms
890 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 4, JULY 2001 Cost Functions and Model Combination for VaR-Based Asset Allocation Using Neural Networks Nicolas Chapados, Student Member, IEEE, and
More informationCSEP 573: Artificial Intelligence
CSEP 573: Artificial Intelligence Markov Decision Processes (MDP)! Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer, Dan Klein, Pieter Abbeel, Stuart Russell or Andrew Moore 1 Outline
More informationOutline. 1 Introduction. 2 Algorithms. 3 Examples. Algorithm 1 General coordinate minimization framework. 1: Choose x 0 R n and set k 0.
Outline Coordinate Minimization Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University November 27, 208 Introduction 2 Algorithms Cyclic order with exact minimization
More informationDesign and implementation of artificial neural network system for stock market prediction (A case study of first bank of Nigeria PLC Shares)
International Journal of Advanced Engineering and Technology ISSN: 2456-7655 www.newengineeringjournal.com Volume 1; Issue 1; March 2017; Page No. 46-51 Design and implementation of artificial neural network
More informationCharacterization of the Optimum
ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 5. Portfolio Allocation with One Riskless, One Risky Asset Characterization of the Optimum Consider a risk-averse, expected-utility-maximizing
More informationChapter 5 Univariate time-series analysis. () Chapter 5 Univariate time-series analysis 1 / 29
Chapter 5 Univariate time-series analysis () Chapter 5 Univariate time-series analysis 1 / 29 Time-Series Time-series is a sequence fx 1, x 2,..., x T g or fx t g, t = 1,..., T, where t is an index denoting
More informationNeuro-Genetic System for DAX Index Prediction
Neuro-Genetic System for DAX Index Prediction Marcin Jaruszewicz and Jacek Mańdziuk Faculty of Mathematics and Information Science, Warsaw University of Technology, Plac Politechniki 1, 00-661 Warsaw,
More informationParallel Multilevel Monte Carlo Simulation
Parallel Simulation Mathematisches Institut Goethe-Universität Frankfurt am Main Advances in Financial Mathematics Paris January 7-10, 2014 Simulation Outline 1 Monte Carlo 2 3 4 Algorithm Numerical Results
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Monte Carlo Methods Mark Schmidt University of British Columbia Winter 2018 Last Time: Markov Chains We can use Markov chains for density estimation, p(x) = p(x 1 ) }{{} d p(x
More informationUtility Indifference Pricing and Dynamic Programming Algorithm
Chapter 8 Utility Indifference ricing and Dynamic rogramming Algorithm In the Black-Scholes framework, we can perfectly replicate an option s payoff. However, it may not be true beyond the Black-Scholes
More informationSublinear Time Algorithms Oct 19, Lecture 1
0368.416701 Sublinear Time Algorithms Oct 19, 2009 Lecturer: Ronitt Rubinfeld Lecture 1 Scribe: Daniel Shahaf 1 Sublinear-time algorithms: motivation Twenty years ago, there was practically no investigation
More informationEco504 Spring 2010 C. Sims FINAL EXAM. β t 1 2 φτ2 t subject to (1)
Eco54 Spring 21 C. Sims FINAL EXAM There are three questions that will be equally weighted in grading. Since you may find some questions take longer to answer than others, and partial credit will be given
More informationGamma. The finite-difference formula for gamma is
Gamma The finite-difference formula for gamma is [ P (S + ɛ) 2 P (S) + P (S ɛ) e rτ E ɛ 2 ]. For a correlation option with multiple underlying assets, the finite-difference formula for the cross gammas
More informationAn adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity
An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity Coralia Cartis, Nick Gould and Philippe Toint Department of Mathematics,
More information