Neuro-Dynamic Programming for Fractionated Radiotherapy Planning

1 Neuro-Dynamic Programming for Fractionated Radiotherapy Planning
Geng Deng, Michael C. Ferris
University of Wisconsin at Madison
Conference on Optimization and Health Care, February 2006

2 Background
Optimal delivery plan: deliver the ideal dose to the target while avoiding critical organs and normal tissue.

3 Fractionated radiotherapy (Dynamic problem)
- Treatments usually last several weeks
  - Limits burning
  - Allows healthy tissue to recover
- Types of day-to-day error: registration error, internal organ motion, tumor shrinkage, and non-rigid transformation
- Current approach: constant policy
- New option: the true dose delivered can be measured during individual treatments
  - Update the treatment plan day to day (online policy)
  - Compensate for errors

4 Problem overview
- State and state transition:
  $$x_{k+1}(i) = x_k(i) + u_k(i + \omega_k), \quad \forall i \in \mathcal{T}. \tag{1}$$
- Consider simple shifts in each direction
- Known error distributions
- Accumulation of errors
- Determine the dose ($u_k$) to apply to minimize the final error
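The state transition x_{k+1}(i) = x_k(i) + u_k(i + ω_k) can be illustrated on a one-dimensional voxel grid. This is a minimal sketch, not the authors' implementation: the function name `apply_fraction`, the six-voxel grid, and the rule that dose planned outside the grid counts as zero are all assumptions made for the example.

```python
def apply_fraction(x, u, shift):
    """Return x_next with x_next[i] = x[i] + u[i + shift].

    Assumption: planned dose at positions outside the grid is treated as zero.
    """
    n = len(x)
    x_next = []
    for i in range(n):
        j = i + shift                      # planned position that lands on voxel i
        delivered = u[j] if 0 <= j < n else 0.0
        x_next.append(x[i] + delivered)
    return x_next


x0 = [0.0] * 6                             # no dose accumulated yet
u0 = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]        # plan: one unit on voxels 2 and 3
x1 = apply_fraction(x0, u0, shift=1)       # a shift error of one voxel
print(x1)                                  # the dose lands on voxels 1 and 2 instead
```

With a nonzero shift the planned dose reaches neighbouring voxels, which is the kind of error that accumulates over the treatment.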

5 Dynamic programming formulation
- Minimize the cost-to-go function starting at $x_0$:
  $$J_0(x_0) = \min \; E\left[ \sum_{k=0}^{N-1} g(x_k, x_{k+1}, u_k) + J_N(x_N) \right]$$
  $$\text{s.t.} \quad x_{k+1}(i) = x_k(i) + u_k(i + \omega_k), \quad u_k \in U(x_k), \quad k = 0, 1, \ldots, N-1. \tag{2}$$
- $J_N(x_N)$ is the final cost function:
  $$J_N(x_N) = \sum_{i \in \mathcal{T}} c(i) \, \lvert x_N(i) - T(i) \rvert$$
- $g(x_k, x_{k+1}, u_k)$ is the immediate cost of dose delivered outside the target:
  $$g(x_k, x_{k+1}, u_k) = \sum_{i + \omega_k \notin \mathcal{T}} c(i + \omega_k) \, u_k(i + \omega_k)$$
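As a small worked instance of the final cost, the sketch below computes the weighted absolute deviation of the accumulated dose from the ideal dose over the target voxels. The absolute-deviation form, the function name `final_cost`, and all numbers are assumptions for illustration; the immediate cost g is not covered here.

```python
def final_cost(x_final, target_dose, weights, target_voxels):
    """J_N(x_N): sum over target voxels i of c(i) * |x_N(i) - T(i)|."""
    return sum(
        weights[i] * abs(x_final[i] - target_dose[i])
        for i in target_voxels
    )


target_dose = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]   # hypothetical ideal dose T(i)
weights = [1.0] * 6                            # hypothetical weights c(i)
target_voxels = [2, 3]                         # hypothetical target set

x_final = [0.0, 0.2, 0.9, 1.0, 0.0, 0.0]       # accumulated dose after N fractions
print(final_cost(x_final, target_dose, weights, target_voxels))
```

Only voxel 2 deviates from its ideal dose in this example, so the cost is the weighted shortfall on that voxel.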

6 An iterative formulation
- The cost-to-go function at stage $k$ can be formulated as:
  $$J_k(x_k) = \min_{u_k \in U(x_k)} E\left[ g(x_k, x_{k+1}, u_k) + J_{k+1}(x_{k+1}) \right]$$
- Bellman's equation!
- This is a finite horizon dynamic programming problem.
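The backward recursion can be sketched by brute force on a deliberately tiny instance. Everything concrete here is an assumption made for the example: the four-voxel grid, the three-point shift distribution, the restriction of controls to scaled copies of the constant dose, and setting the immediate cost g to zero so that only the final cost matters.

```python
SHIFTS = {-1: 0.25, 0: 0.5, 1: 0.25}   # assumed distribution of the shift omega_k
TARGET = [0.0, 1.0, 1.0, 0.0]          # assumed ideal dose T(i)
TARGET_VOXELS = [1, 2]                 # assumed target set
N = 3                                  # number of fractions


def step(x, u, shift):
    """State transition: x_next[i] = x[i] + u[i + shift], zero outside the grid."""
    n = len(x)
    return [x[i] + (u[i + shift] if 0 <= i + shift < n else 0.0) for i in range(n)]


def final_cost(x):
    """Unweighted absolute deviation from the ideal dose on the target."""
    return sum(abs(x[i] - TARGET[i]) for i in TARGET_VOXELS)


def cost_to_go(x, k, scales=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """J_k(x) by exhaustive recursion over candidate controls and shifts."""
    if k == N:
        return final_cost(x)
    best = float("inf")
    for a in scales:
        u = [a * t / N for t in TARGET]            # candidate control
        expected = sum(
            p * cost_to_go(step(x, u, w), k + 1, scales)
            for w, p in SHIFTS.items()
        )
        best = min(best, expected)
    return best


print(cost_to_go([0.0] * 4, 0))                    # optimal expected final error
print(cost_to_go([0.0] * 4, 0, scales=(1.0,)))     # constant policy only
```

Each extra stage multiplies the work by the number of control and shift combinations, which gives a feel for why exhaustive recursion stops being practical after a few stages.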

7 Existing policies
We will compare the following policies:
- Constant policy: $u_k = T / N$
- Reactive policy (online policy): $u_k = \max(0, T - x_k) / (N - k)$
- Modified reactive policy (online policy): $u_k = a \cdot \max(0, T - x_k) / (N - k)$
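The three policies translate directly into code when applied voxel by voxel. The sketch below assumes dose profiles are plain Python lists; the function names and the example numbers are illustrative only.

```python
def constant_policy(target, n_fractions):
    """u_k = T / N: the same dose in every fraction."""
    return [t / n_fractions for t in target]


def reactive_policy(target, x, k, n_fractions, a=1.0):
    """u_k = a * max(0, T - x_k) / (N - k).

    a = 1 gives the reactive policy; a != 1 gives the modified reactive policy.
    """
    remaining = n_fractions - k            # fractions still to be delivered
    return [a * max(0.0, t - xi) / remaining for t, xi in zip(target, x)]


target = [0.0, 2.0, 2.0, 0.0]              # hypothetical ideal total dose T
x = [0.0, 0.5, 1.5, 0.0]                   # hypothetical dose delivered after 2 of 4 fractions

print(constant_policy(target, 4))              # [0.0, 0.5, 0.5, 0.0]
print(reactive_policy(target, x, 2, 4))        # [0.0, 0.75, 0.25, 0.0]
print(reactive_policy(target, x, 2, 4, a=2))   # [0.0, 1.5, 0.5, 0.0]
```

The reactive policy spreads the remaining deficit evenly over the remaining fractions, so the under-dosed voxel receives more than the constant plan would give it.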

8 Why do we use NDP?
- Bellman's equation:
  $$u_k(x_k) = \arg\min_{u_k \in U(x_k)} E\left[ g(x_k, x_{k+1}, u_k) + J_{k+1}(x_{k+1}) \right]$$
  $$\text{s.t.} \quad x_{k+1}(i) = x_k(i) + u_k(i + \omega_k) \tag{3}$$
- Exact dynamic programming has difficulty handling more than 4 stages because of the dimensionality of the state space.
- NDP approximates the cost-to-go function $J_k(x_k)$ with a simply structured function $\tilde{J}_k(x_k, r_k)$.
- NDP solves the problem fast.
- NDP obtains sub-optimal solutions.
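Once an approximate cost-to-go is available, the control at each stage follows from a one-step lookahead over a finite set of candidate controls. The sketch below assumes a three-point shift distribution, a zero immediate cost, and a stand-in approximation `distance_to_target` in place of a trained one; all of these are hypothetical choices for illustration.

```python
SHIFTS = {-1: 0.25, 0: 0.5, 1: 0.25}   # assumed distribution of the shift omega_k
TARGET = [0.0, 1.0, 1.0, 0.0]          # assumed ideal dose T(i)
TARGET_VOXELS = [1, 2]                 # assumed target set


def step(x, u, shift):
    """State transition: x_next[i] = x[i] + u[i + shift], zero outside the grid."""
    n = len(x)
    return [x[i] + (u[i + shift] if 0 <= i + shift < n else 0.0) for i in range(n)]


def distance_to_target(x):
    """Hypothetical stand-in for the approximate cost-to-go of the next state."""
    return sum(abs(x[i] - TARGET[i]) for i in TARGET_VOXELS)


def lookahead_control(x, candidates, j_tilde):
    """Pick the candidate u that minimizes E[j_tilde(x_next)] over the shifts."""
    def expected_cost(u):
        return sum(p * j_tilde(step(x, u, w)) for w, p in SHIFTS.items())
    return min(candidates, key=expected_cost)


x = [0.0, 0.5, 0.5, 0.0]                       # half of the dose already delivered
candidates = [
    [0.0, 0.0, 0.0, 0.0],                      # deliver nothing
    [0.0, 0.5, 0.5, 0.0],                      # deliver exactly the deficit
    [0.0, 1.0, 1.0, 0.0],                      # deliver twice the deficit
]
print(lookahead_control(x, candidates, distance_to_target))
```

The minimization is a scan over a handful of candidates rather than a search over all feasible doses, which is what keeps the online computation cheap.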

9 Approximation architectures for $\tilde{J}(x, r)$
- Neural network: the inputs are features $f_i(x)$ extracted from the state
- Heuristic mapping:
  $$\tilde{J}(x, r) = r_0 + \sum_{i=1}^{I} r_i H_{u_i}(x)$$
  where $H_{u_i}(x)$ is the heuristic cost-to-go obtained by applying policy $u_i$.
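The heuristic mapping can be sketched by computing each heuristic cost-to-go as the expected final cost of following one fixed policy to the end of treatment, then combining them linearly. The tiny grid, the shift distribution, the zero immediate cost, and the choice of the constant and reactive policies as the heuristics are assumptions for this example, and the weights r are arbitrary rather than fitted.

```python
SHIFTS = {-1: 0.25, 0: 0.5, 1: 0.25}   # assumed distribution of the shift omega_k
TARGET = [0.0, 1.0, 1.0, 0.0]          # assumed ideal dose T(i)
TARGET_VOXELS = [1, 2]                 # assumed target set
N = 3                                  # number of fractions


def step(x, u, shift):
    """State transition: x_next[i] = x[i] + u[i + shift], zero outside the grid."""
    n = len(x)
    return [x[i] + (u[i + shift] if 0 <= i + shift < n else 0.0) for i in range(n)]


def final_cost(x):
    return sum(abs(x[i] - TARGET[i]) for i in TARGET_VOXELS)


def constant(x, k):
    return [t / N for t in TARGET]


def reactive(x, k):
    return [max(0.0, t - xi) / (N - k) for t, xi in zip(TARGET, x)]


def heuristic_cost(policy, x, k):
    """H_u(x): expected final cost when `policy` is followed from stage k to N."""
    if k == N:
        return final_cost(x)
    u = policy(x, k)
    return sum(
        p * heuristic_cost(policy, step(x, u, w), k + 1)
        for w, p in SHIFTS.items()
    )


def j_tilde(x, k, r, policies=(constant, reactive)):
    """Heuristic mapping: r[0] + sum_i r[i] * H_{u_i}(x)."""
    return r[0] + sum(
        ri * heuristic_cost(pol, x, k) for ri, pol in zip(r[1:], policies)
    )


x0 = [0.0] * 4
print(heuristic_cost(constant, x0, 0), heuristic_cost(reactive, x0, 0))
print(j_tilde(x0, 0, r=[0.0, 0.5, 0.5]))   # arbitrary illustrative weights
```

Evaluating each heuristic requires simulating a policy through the remaining stages, so this architecture is more expensive to evaluate than one built on simple summary features.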

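10 Approximate policy iteration
- Policy: given $x_k$ and $\tilde{J}(\cdot, r_k)$, Bellman's equation yields the control $\hat{u}_k$
- Generate sample trajectories $\{x_{0i}, x_{1i}, \ldots, x_{Ni}\}$, $i = 1, \ldots, M$
- Evaluate costs $c(x_{ki})$
- Estimate parameters $r_k$ by solving a least squares problem in $r_k$:
  $$\min_{r_k} \sum_{i=1}^{M} \left\lvert \tilde{J}_k(x_{ki}, r_k) - c(x_{ki}) \right\rvert^2$$
- Simulation and evaluation steps alternate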
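The parameter estimation step is an ordinary least squares fit of sampled costs against the outputs of the approximation architecture. The sketch below covers only that step, for a linear architecture with a single heuristic term, r0 + r1 * h. The function name `fit_linear` and the sample values are made up for illustration and are not results from the talk.

```python
def fit_linear(h_values, costs):
    """Least-squares fit of costs ~ r0 + r1 * h using the closed-form solution.

    Assumes the h values are not all identical (otherwise r1 is undefined).
    """
    m = len(h_values)
    h_mean = sum(h_values) / m
    c_mean = sum(costs) / m
    sxx = sum((h - h_mean) ** 2 for h in h_values)
    sxy = sum((h - h_mean) * (c - c_mean) for h, c in zip(h_values, costs))
    r1 = sxy / sxx
    r0 = c_mean - r1 * h_mean
    return r0, r1


# Hypothetical samples: heuristic value at each sampled state and the
# cost observed from that state along a simulated trajectory.
h_samples = [0.0, 1.0, 2.0, 3.0]
cost_samples = [1.0, 3.0, 5.0, 7.0]

r0, r1 = fit_linear(h_samples, cost_samples)
print(r0, r1)
```

In the full scheme this fit would be repeated for each stage and alternated with fresh simulations under the policy implied by the updated parameters.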

11 Computational experiments
- Test a simple one-dimensional case and a real problem: head and neck
- Use 5 candidate policies at each stage
- Test in high and low volatility scenarios
- Use two approximation architectures:
  - Neural network: the features $f_i(x_k)$ used are average dose, standard deviation of dose, and curvature of the dose distribution
  - Heuristic mapping: the heuristic policies used are the constant policy, the reactive policy, and the modified reactive policy with $a = 2$
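The three neural network features can be sketched for a one-dimensional dose profile as follows. The slides do not say how curvature is measured, so the mean absolute second difference used here is an assumption, as is the use of the population standard deviation.

```python
from statistics import mean, pstdev


def dose_features(x):
    """Return (average dose, standard deviation, curvature) for a 1-D profile.

    Curvature is taken as the mean absolute second difference, which is
    an assumed definition chosen for this illustration.
    """
    second_diffs = [x[i - 1] - 2 * x[i] + x[i + 1] for i in range(1, len(x) - 1)]
    curvature = mean([abs(d) for d in second_diffs]) if second_diffs else 0.0
    return mean(x), pstdev(x), curvature


flat = [1.0, 1.0, 1.0, 1.0]        # uniform dose: no spread, no curvature
peaked = [0.0, 1.0, 0.0]           # a single hot voxel

print(dose_features(flat))
print(dose_features(peaked))
```

Features like these compress a full dose profile into a few numbers, so the network input stays small regardless of the number of voxels.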

12 Performance of approximate policy iteration
[Figure: final error versus policy iteration number]

13 Comparison results in the head and neck problem
The figures show results for different policies in the high volatility case.
[Figure: expected error versus time period for the constant, reactive, and NDP policies. Left panel: neural network (NN) architecture. Right panel: heuristic mapping (HE) architecture.]
- NDP > Reactive > Constant
- NN and HE give comparable results, but HE takes much longer to compute
- Online policies require more computational effort

14 Conclusions
- Online policies with extra information outperform offline policies
- The DP method is inapplicable in practice
- NDP reduces computation time and produces approximately optimal policies
- Implemented on real patient data
- Future work:
  - Explore more policies
  - Consider different types of error
  - Fast computation
