Mean Reversion Stochastic Models vs Reinforcement Learning in Pair Trading: A Comparative Study

Nov 22, 2025

This weekend I feel motivated to share some of my experiments. This time I want to describe a model I was working on as a piece of personal research, which subsequently gave rise to a related article titled Deep Reinforcement Learning in Continuous Action Spaces for Pair Trading: A Comparative Study of A2C and PPO, of which I am a co-author. The differences lie in the subset of asset classes, the methodology, and the Reinforcement Learning (RL) models. This version is also available in Spanish here.

⚠️ Experimental Notice:
This article presents ongoing research, and the accompanying code is still under construction and verification. Results, methodologies, and conclusions may be updated as the study progresses. Feedback and suggestions are welcome!

Introduction

Pair trading is a market-neutral strategy used in trading desks across various asset classes. The core idea is to profit from temporary deviations in the price relationship between two related assets. According to the Adaptive Market Hypothesis (AMH), markets incorporate factors beyond pure rationality, making deterministic dynamics impossible and opening the door to stochastic modeling approaches.

This work explores whether reinforcement learning can outperform traditional stochastic mean reversion models in identifying optimal entry levels for pair trading strategies.

Research Questions

  1. Is there any relationship between mean reversion models and machine learning when applied to a pair trading strategy?
  2. For a pair whose spread is stable around its mean over a given time horizon, does one model perform better in terms of profitability?
  3. Are the optimal entry levels of each model significantly different?

Theoretical Framework

Pair Trading Strategy

Pair trading is defined as a market-neutral strategy where profits are sought by arbitraging temporary deviations in the price relationship of a pair of assets. The spread signal is typically defined as:

Subtraction method:

\[\text{Signal} = S_A(t) - \beta \cdot S_B(t)\]

Ratio method:

\[\text{Signal} = \frac{S_A(t)}{S_B(t)}\]

For this study, the spread model was defined using logarithmic prices:

\[X(t) = \ln(S_A(t)) - \gamma \cdot \ln(S_B(t))\]

where $S_i(t)$ is the quote of asset $i$, $\gamma > 0$ represents the investment ratio between long and short positions, and since the conversion factor is 1:1 for the analyzed assets, $\gamma = 1$.
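
As a quick illustration, here is a minimal sketch of how this log-price spread can be computed from two aligned price series; the series and numbers below are placeholders, not the study's data.

```python
import numpy as np
import pandas as pd

def log_spread(prices_a: pd.Series, prices_b: pd.Series, gamma: float = 1.0) -> pd.Series:
    """X(t) = ln(S_A(t)) - gamma * ln(S_B(t)) on the common dates of both series."""
    aligned = pd.concat({"A": prices_a, "B": prices_b}, axis=1).dropna()
    return np.log(aligned["A"]) - gamma * np.log(aligned["B"])

# Placeholder prices; gamma = 1 because the conversion factor is 1:1 for the analyzed assets
dates = pd.date_range("2012-01-02", periods=5, freq="B")
s_a = pd.Series([100.0, 101.2, 100.8, 102.1, 101.5], index=dates)
s_b = pd.Series([99.5, 100.9, 100.1, 101.8, 101.0], index=dates)
print(log_spread(s_a, s_b))
```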

Ornstein-Uhlenbeck Mean Reversion Model

The Ornstein-Uhlenbeck (OU) process is one of the most recognized mean reversion models. Introduced by physicists Leonard Salomon Ornstein and George Eugene Uhlenbeck in 1930, it was originally proposed to model the velocity of a particle undergoing Brownian motion subject to friction.

The stochastic differential equation (SDE) for the OU process with Lévy jumps is:

\[dX(t) = -\rho \cdot (X(t) - \theta) \cdot dt + \sigma \cdot dL(t)\]

Where:

  • $\rho$ is the mean reversion speed parameter
  • $\theta$ is the long-term average
  • $\sigma$ is the process volatility
  • $dL(t)$ is a Lévy process (replacing the traditional Brownian motion $dB$ to account for heavy tails)

Key Properties

  • The increment of $X$ has a deterministic component $\rho \, (\theta-X_t)\,dt$, known as the drift
  • It also has a stochastic diffusion component $\sigma \, dL_t$
  • When $X_t$ is above/below $\theta$, the drift pushes $dX_t$ down/up, so the next value of $X$ tends to be lower/higher than the current one

Conditional Density Function

Based on (Leung & Li, 2016), the conditional density of $X_{t_i}$ given $X_{t_{i-1}}$, with time increment $\Delta t = t_i - t_{i-1}$ (here $\mu$ denotes the mean reversion speed, i.e. the $\rho$ above), is:

\[f^{OU}(x_i|x_{i-1};\theta,\mu, \sigma) = \frac{1}{\sqrt{2\pi\widetilde{\sigma}^2}}\exp\left(- \frac{(x_i - x_{i-1}e^{-\mu\Delta t}-\theta (1-e^{-\mu\Delta t}))^2}{2\widetilde{\sigma}^2} \right)\]

where:

\[\widetilde{\sigma}^2 = \sigma^2\frac{1-e^{-2\mu\Delta t}}{2\mu}\]
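
For intuition, here is a minimal sketch of the log-likelihood implied by this Gaussian transition density; it ignores the heavy-tailed Lévy extension discussed above, and the parameter names simply follow the formulas.

```python
import numpy as np

def ou_log_likelihood(x: np.ndarray, theta: float, mu: float, sigma: float, dt: float = 1.0) -> float:
    """Sum of ln f^OU(x_i | x_{i-1}; theta, mu, sigma) over consecutive observations."""
    x_prev, x_next = x[:-1], x[1:]
    var_tilde = sigma**2 * (1.0 - np.exp(-2.0 * mu * dt)) / (2.0 * mu)
    mean = x_prev * np.exp(-mu * dt) + theta * (1.0 - np.exp(-mu * dt))
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var_tilde)
                        - (x_next - mean) ** 2 / (2.0 * var_tilde)))

# Calibration would maximize this over (theta, mu, sigma), e.g. by applying
# scipy.optimize.minimize to the negative log-likelihood.
```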

Generalized Hyperbolic Distribution (GHYP)

The generalized hyperbolic family of probability distributions is defined as a mixture between a multivariate normal distribution and a generalized inverse Gaussian (GIG) distribution, see (Weibel et al., 2022). A vector $X$ follows a generalized hyperbolic distribution if:

\[X \stackrel{d}{=} \mu + W\gamma + \sqrt{W}AZ\]

where:

  • $Z \stackrel{d}{=} N_k(0,I_k)$
  • $A \in \mathbb{R}^{d \times k}$
  • $\mu, \gamma \in \mathbb{R}^d$
  • $W \geq 0$ is a scalar random variable, independent of $Z$, with generalized inverse Gaussian distribution $GIG(\lambda, \chi, \psi)$

Density Function

The GHYP density function is:

\[f_X(x) = \frac{(\sqrt{\psi/\chi})^\lambda (\psi+\gamma' \Sigma^{-1}\gamma)^{\frac{d}{2}-\lambda}}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} \frac{K_{\lambda-\frac{d}{2}}(\sqrt{(\chi+Q(x))(\psi+\gamma' \Sigma^{-1}\gamma)})e^{(x-\mu)' \Sigma^{-1}\gamma}}{(\sqrt{(\chi+Q(x))(\psi+\gamma' \Sigma^{-1} \gamma)})^{\frac{d}{2}-\lambda}}\]

Special Cases

  • Multivariate hyperbolic: When $\lambda = \frac{d+1}{2}$
  • Normal Inverse Gaussian (NIG): When $\lambda = \frac{1}{2}$
  • Variance-Gamma (VG): When $\chi=0$ and $\lambda > 0$
  • Generalized hyperbolic Student-t: When $\psi=0$ and $\lambda<0$

Parameters Estimation

Regarding the estimation of GHYP distribution parameters: in the multivariate case, a modified Expectation-Maximization (EM) algorithm is used, known as Multi-Cycle Expectation Conditional Maximization (MCECM). For the univariate GHYP distribution, parameters are estimated by maximizing the log-likelihood.

A relevant feature of these algorithms is that they are commonly employed to estimate parameters of probability distributions when the observed data are incomplete, see (Dempster et al., 1977); notably, they rely on an iterative procedure.

Reinforcement Learning

Reinforcement learning (RL) is a method for learning to act by mapping environmental situations ($S_t$) to response actions ($a_t$), maximizing a numerical reward value. Unlike supervised learning, RL doesn’t receive examples classified as correct or incorrect; instead, its learning mechanism operates through positive and negative rewards—learning through trial and error.

RL Method Classification

Following the classification in (Konda & Tsitsiklis, 2003) and (Wu et al., 2020), RL methods fall into three categories:

  1. Critic-only: Methods that rely on value function approximation, oriented toward learning an approximate solution to the Bellman equation
  2. Actor-only: Methods that rely on a family of previously parameterized policies, using the performance gradient to update the improvement direction
  3. Actor-Critic: Methods that take advantage of both, where the critic part uses value function approximation, which is subsequently updated by the actor’s policy parameters

Key Components

  • Policy ($\pi_t$): The mechanism by which the agent learns from its environment—the mapping of possible states to possible actions
  • Reward signal ($r_t$): The objective in the RL process, represented by a number indicating whether a decision was positive or negative
  • Value function ($\nu_{\pi}$ or $Q_{\pi}$): The long-term reward level
  • Environment model: Determines the dynamics of the environment, used for planning

Value Function

The value function can be expressed as the expected value of future rewards $r_t$ discounted by $\gamma$:

\[\nu_*(a) = \mathbb{E}[R_t|A_t=a]\]

However, since $\nu_*(a)$ is only known with certainty once action $a$ has actually been taken, the value function must be estimated through the action-value function $Q_t(s,a)$:

\[Q_*(s,a) = \max_\pi \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2}+ \dots \mid S_t = s, A_t = a, \pi]\]

where $\gamma$ is the discount factor, $0 \leq \gamma \leq 1$.

Policy Update Rules

Q-learning (off-policy):

\[Q(S_t,A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1}+\gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)]\]

SARSA (on-policy):

\[Q(S_t,A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1}+\gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)]\]

The key difference is that Q-learning directly considers the action that maximizes the subsequent state, while SARSA only considers the value of the policy applied to the selected action.
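
To make the distinction concrete, here is a minimal sketch of the two tabular updates (variable names are illustrative, not taken from the study's code):

```python
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      alpha: float, gamma: float) -> None:
    """Off-policy: bootstrap on the greedy action in the next state."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int, a_next: int,
                 alpha: float, gamma: float) -> None:
    """On-policy: bootstrap on the action actually selected in the next state."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

In both cases the learning rate $\alpha$ controls how strongly the TD error adjusts the current estimate.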

RL Algorithms by Category

| Category | Algorithms |
|---|---|
| Value-based | Q-learning, SARSA, DQN |
| Policy-based | REINFORCE, PG, TRPO |
| Actor-critic | DPG, PPO, DDPG, SAC, A2C |
| Others | Model-based RL, Multi-Agent RL |

Model Implementation

Data and Methodology

Two liquid ETFs in the local market were selected:

  • iShares xxxxxxxxx ETF (ABC)
  • iShares xxxxxxxxx ETF (XYZ)

Both are cross-listed in the US and local markets under the same currency, eliminating the need for currency conversion. The ETFs show different spread dynamics: ABC exhibits stable mean reversion behavior, while XYZ presents temporal shocks in spread levels.

Dataset:

  • Optimization/Training: 2,370 daily observations (01/02/2012 to 02/01/2021)
  • Testing: 345 samples for performance evaluation
  • Total: 2,715 observations

Stochastic Mean Reversion Model

Estimating ρ (Mean Reversion Speed)

Following the methodology in (Goncu & Akyildirim, 2016), we construct the centered variable $\widetilde{X} = X(t) - \bar{X}$, discretize, and rewrite the SDE:

\[\Delta \widetilde{X}(t) = - \rho \cdot \widetilde{X}(t) \cdot \Delta t + \sigma \cdot \Delta L(t)\]

\[\widetilde{X}(t+1) = (1-\rho \cdot \Delta t) \cdot \widetilde{X}(t)+\widetilde{\epsilon}(t)\]

where the innovations $\widetilde{\epsilon}(t)$ are independent and identically distributed with a GHYP distribution. We estimate $\Omega = (1-\rho \cdot \Delta t)$ by minimizing the error.

Results:

  • $\hat{\Omega}_{ABC} = 0.2825$
  • $\hat{\Omega}_{XYZ} = 0.9811$

Solving for $\rho$:

\[\rho = \frac{1 - \Omega}{\Delta t}\]
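
A minimal sketch of this estimation step, assuming the spread is available as a NumPy array of daily observations (so $\Delta t = 1$ day):

```python
import numpy as np

def estimate_omega_rho(spread: np.ndarray, dt: float = 1.0) -> tuple[float, float]:
    """Estimate Omega = 1 - rho * dt by regressing X~(t+1) on X~(t) without intercept."""
    x_tilde = spread - spread.mean()
    x_t, x_next = x_tilde[:-1], x_tilde[1:]
    omega = float(np.dot(x_t, x_next) / np.dot(x_t, x_t))  # OLS slope through the origin
    rho = (1.0 - omega) / dt
    return omega, rho

# Hypothetical usage on a synthetic mean-reverting series
rng = np.random.default_rng(0)
x = np.zeros(2370)
for t in range(1, x.size):
    x[t] = 0.3 * x[t - 1] + 0.01 * rng.standard_normal()
print(estimate_omega_rho(x + 1.0))  # Omega should come out close to 0.3
```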

Calibrating the Diffusion Factor

The residuals $\widetilde{\epsilon}(t)$ show excess kurtosis, justifying the use of a Lévy process with GHYP distribution rather than normal distribution. The parameters were estimated using maximum likelihood:

| Asset | $\lambda$ | $\alpha$ | $\chi$ | $\psi$ | $\mu$ | $\sigma$ | $\gamma$ |
|---|---|---|---|---|---|---|---|
| ABC | 1 | 4.612e-3 | 1.06e-5 | 2.0017 | 5.5251e-4 | 0.0102 | -5.477e-4 |
| XYZ | 1 | 1.476e-5 | 1.09e-10 | 2 | 9.451e-4 | 0.01279 | -9.4646e-4 |
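
The study fits the residuals with the ghyp R package (Weibel et al., 2022). As a rough Python stand-in, the sketch below fits SciPy's `genhyperbolic` distribution (available in SciPy ≥ 1.8) by maximum likelihood; note that its (p, a, b, loc, scale) parameterization differs from the $(\lambda, \alpha, \chi, \psi, \mu, \sigma, \gamma)$ convention in the table, and the residuals here are synthetic placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = 0.01 * rng.standard_t(df=5, size=2369)  # placeholder for the regression residuals

# Generic MLE fit; initial shape guesses (p=1, a=1, b=0) keep the optimizer in a valid region
p, a, b, loc, scale = stats.genhyperbolic.fit(residuals, 1, 1, 0)
print(dict(p=p, a=a, b=b, loc=loc, scale=scale))

# Compare against a normal fit to confirm that the heavy tails matter
ll_ghyp = stats.genhyperbolic.logpdf(residuals, p, a, b, loc, scale).sum()
ll_norm = stats.norm.logpdf(residuals, *stats.norm.fit(residuals)).sum()
print(f"log-likelihood GHYP: {ll_ghyp:.1f}  normal: {ll_norm:.1f}")
```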

Calibrating Entry/Exit Levels

Entry and exit levels are inspired by (Leung & Li, 2016), which frames the problem in terms of the arrival (first passage) times of the process. For a short position in the pair, the profit and loss function is defined as:

\[\nu^1(c,T) = \begin{cases} c & \tau_1\leq T \\ X(0) - X(T) & \tau_1 > T \end{cases}\]

where $\tau_1 = \inf \{t>0; X(t)=\bar{X} | X(0) = \bar{X}+c \}$ and $c = X(0) - \theta$.

For a long position, redefining $c = \theta - X(0)$:

\[\nu^1(c,T) = \begin{cases} c & \tau_1\leq T \\ X(T) - X(0) & \tau_1 > T \end{cases}\]

The expected utility is:

\[\mathbb{E}[\nu^1(c,T)] = P(\tau_1<T)\,c+(1-P(\tau_1<T))\,(X(0)-\mathbb{E}[X(T)\mid\tau_1>T])\]

To find the value of $X(T)$, we use the generalized formula:

\[\widetilde{X}_t = \Omega^t \widetilde{X}_0 + \sum_{i=1}^t{\Omega^{t-i} \widetilde{\epsilon}_i}\]

Monte Carlo Simulation:

  • 100,000 simulations with 300 steps each
  • Time horizon $T$ up to 250 days (1 trading year)
  • Range of $c$ from 1 up to the observed maximum/minimum for short/long positions (a simulation sketch follows below)
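
Here is a minimal sketch of this simulation for a short entry, using Student-t innovations as a stand-in for the fitted GHYP noise; every number below is illustrative rather than the study's calibration.

```python
import numpy as np

def expected_short_pnl(c: float, omega: float, sigma_eps: float, T: int,
                       n_paths: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[nu^1(c, T)] for a short entry at X(0) = theta + c.

    Paths follow X~(t+1) = Omega * X~(t) + eps(t); Student-t innovations stand in
    for the fitted GHYP noise (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, c)              # demeaned spread starts c above the mean
    hit = np.zeros(n_paths, dtype=bool)  # has the path crossed the mean yet?
    for _ in range(T):
        eps = sigma_eps * rng.standard_t(df=5, size=n_paths)
        x = omega * x + eps
        hit |= x <= 0.0
    # Profit c if the mean was hit before T, otherwise X(0) - X(T) = c - x
    pnl = np.where(hit, c, c - x)
    return float(pnl.mean())

# Illustrative scan over candidate entry levels c (parameters are placeholders)
for c in (0.005, 0.01, 0.02, 0.03):
    print(c, round(expected_short_pnl(c, omega=0.2825, sigma_eps=0.01, T=250), 5))
```

The long side is symmetric: start the demeaned path at $-c$ and record a hit when it rises back to zero.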

Optimal Levels Found:

  • Short position: $c_{short} = 1.017104$
  • Long position: $c_{long} = 0.982084$

Reinforcement Learning Model

Model Components

  • Action space: 3 discrete actions (Hold, Buy, Sell)
  • Observation space: Historical spread range ± 10%
  • Policy update: Both Q-learning (off-policy) and SARSA (on-policy)

Reward Function

The calibration takes into account the observation in (Yuan, 2019) that including strongly penalized wrong scenarios helps the learning process; here this is done by scaling losses by a factor of 3. For a short position:

\[r_t(a_t|\text{Position}=\text{Short}, t < T) = \begin{cases} x_t - \theta & x_t > \theta \\ -3 (\theta - x_t) & \text{otherwise} \end{cases}\]

For a long position:

\[r_t(a_t|\text{Position}=\text{Long}, t < T) = \begin{cases} \theta - x_t & \theta > x_t \\ -3 (x_t - \theta) & \text{otherwise} \end{cases}\]

The negative reward is scaled by a factor of 3 so that counterproductive actions have a greater impact during learning.
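
A minimal sketch of these reward functions, with the sign convention that wrong-side entries are penalized three times as hard:

```python
def short_reward(x_t: float, theta: float, penalty: float = 3.0) -> float:
    """Reward for acting on a short signal: positive above the mean, scaled penalty below it."""
    return x_t - theta if x_t > theta else -penalty * (theta - x_t)

def long_reward(x_t: float, theta: float, penalty: float = 3.0) -> float:
    """Reward for acting on a long signal: positive below the mean, scaled penalty above it."""
    return theta - x_t if theta > x_t else -penalty * (x_t - theta)
```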

Hyperparameter Optimization

The delta $\delta$, i.e. the temporal-difference (TD) error, is defined as the difference between the expected and observed values in the Q-value update:

For Q-learning:

\[\delta_{Q\text{-learning}} = R_{t+1}+\gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\]

For SARSA:

\[\delta_{SARSA} = R_{t+1}+\gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\]

Analysis of $\alpha$ (learning rate) and $\gamma$ (discount factor):

  • $\alpha$ shows an inverse relationship with $\delta$
  • $\gamma$ shows a direct relationship with $\delta$

Optimal Parameters

| Position | $\alpha$ | $\gamma$ | Algorithm | $\delta$ | Episodes for Convergence |
|---|---|---|---|---|---|
| Short | 0.8 | 0.2 | SARSA | 0.175 | 1,000 |
| Short | 0.8 | 0.2 | Q-learning | 0.175 | 680 |
| Long | 0.75 | 0.2 | SARSA | 0.25 | 3,000 |
| Long | 0.75 | 0.2 | Q-learning | 0.25 | 13,000 |

Training Details:

  • Episodes: 5,000
  • Steps per episode: 30
  • $\epsilon$-greedy strategy for exploration
  • Random selection of time series segments for each episode

RL Optimal Levels

The optimal levels were selected by finding the state $X_j$ whose index maximizes the Q-value in the corresponding action column:

\[X_{\text{short/long}}^* = X_j, \qquad j = \arg\max_i Q_{i,\,\text{short/long}}\]
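
A minimal sketch of this extraction step, assuming a Q-table whose columns are ordered (hold, buy, sell) and a `states` array holding the discretized spread levels; the column indices are assumptions of this sketch.

```python
import numpy as np

def optimal_levels(Q: np.ndarray, states: np.ndarray,
                   buy_col: int = 1, sell_col: int = 2) -> tuple[float, float]:
    """Return the spread levels whose rows maximize the buy (long) and sell (short) Q-values."""
    lower = float(states[np.argmax(Q[:, buy_col])])   # long entry threshold
    upper = float(states[np.argmax(Q[:, sell_col])])  # short entry threshold
    return lower, upper
```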

Results:

  • Upper threshold (short position): $c_{RL-upper} = 1.022086$
  • Lower threshold (long position): $c_{RL-lower} = 0.993484$

Results and Comparison

Optimal Entry Levels

| Asset | Model | $P_{short}$ | $P_{long}$ | $\theta$ |
|---|---|---|---|---|
| ABC | Stochastic Mean Reversion | 1.017104 | 0.982084 | 1.000292 |
| ABC | Reinforcement Learning | 1.022086 | 0.993484 | |
| XYZ | Stochastic Mean Reversion | 1.01 | 0.9755 | 0.9695 |
| XYZ | Reinforcement Learning | 1.0106 | 0.9737 | |

Key Observation: RL assigns thresholds asymmetrically around the mean $\theta$, with the lower level closer to $\theta$ than in the stochastic model.

Trading Rules

  1. Since ABC ETF is denominated in USD in both US and local markets, no exchange rate is needed
  2. Long position: Sell ABC in US market while simultaneously buying ABC in local market
  3. Short position: Buy ABC in US market while simultaneously selling ABC in local market
  4. All positions liquidated when spread crosses mean value $\theta=1.0002921$
  5. Multiple positions in the same direction allowed, but no simultaneous long and short positions
  6. Each operation: 1 share of asset A and B simultaneously
  7. Forced exit after 30 days if no favorable level reached

Number of Positions Opened

| Asset | Model | Short Operations | Long Operations | Total |
|---|---|---|---|---|
| ABC | Stochastic Mean Reversion | 9 | 12 | 21 |
| ABC | Reinforcement Learning | 4 | 70 | 74 |
| XYZ | Stochastic Mean Reversion | 124 | 115 | 139 |
| XYZ | Reinforcement Learning | 124 | 115 | 139 |

Analysis: For ABC, the RL model generated significantly more long operations (70 vs 12) but fewer short operations (4 vs 9), consistent with its asymmetric threshold placement.

Profit & Loss Calculation

\[P\&L_{\text{Short/Long}} = P_{t1_{\text{sell/buy}}}^{foreign} - P_{t1_{\text{buy/sell}}}^{local} - P_{t2_{\text{buy/sell}}}^{foreign} + P_{t2_{\text{sell/buy}}}^{local}\]

where subscripts indicate timing (t1 = open, t2 = close) and superscripts indicate market (foreign/local).
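
A minimal sketch of this per-trade P&L, with illustrative prices:

```python
def trade_pnl(open_foreign: float, open_local: float,
              close_foreign: float, close_local: float) -> float:
    """P&L of one pair trade: sell one leg and buy the other at t1, reverse both at t2.

    For a short spread position the foreign leg is sold at t1 and bought back at t2;
    for a long spread position the roles are reversed, which flips the sign.
    """
    return open_foreign - open_local - close_foreign + close_local

# Illustrative numbers: open by selling foreign at 101.2 and buying local at 100.9,
# close by buying foreign back at 100.4 and selling local at 100.5
print(trade_pnl(101.2, 100.9, 100.4, 100.5))  # 0.4
```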

Performance Metrics

The following metrics were chosen to provide a comprehensive evaluation of both profitability and risk-adjusted performance for each model:

  • Profit Metrics: Measure absolute and relative returns, as well as the consistency of winning trades.
  • Risk Metrics: Assess volatility, drawdowns, downside risk, and gain-loss ratios to capture the risk profile of each strategy.
  • Risk-Adjusted Metrics: Include Sharpe, Sortino, and Calmar ratios to evaluate returns relative to risk taken.

This selection ensures a balanced comparison between the stochastic mean reversion (MR) and reinforcement learning approaches, highlighting both their strengths and limitations in practical pair trading scenarios.
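
For reference, here is a hedged sketch of how the risk-adjusted metrics could be computed from a series of per-trade returns; the annualization and risk-free-rate conventions are simplified and may differ from those used in the study.

```python
import numpy as np

def sharpe(returns: np.ndarray) -> float:
    return float(returns.mean() / returns.std(ddof=1))

def sortino(returns: np.ndarray) -> float:
    downside = returns[returns < 0]
    return float(returns.mean() / downside.std(ddof=1))

def max_drawdown(returns: np.ndarray) -> float:
    equity = np.cumsum(returns)  # additive equity curve
    return float(np.min(equity - np.maximum.accumulate(equity)))

def calmar(returns: np.ndarray) -> float:
    mdd = max_drawdown(returns)
    return float(returns.sum() / abs(mdd)) if mdd != 0 else float("inf")

# Illustrative usage on synthetic per-trade returns
rng = np.random.default_rng(2)
r = rng.normal(0.01, 0.05, size=200)
print(sharpe(r), sortino(r), max_drawdown(r), calmar(r))
```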

| Metric Group | Metric | Asset | Stochastic MR | Reinforcement Learning |
|---|---|---|---|---|
| Profit Metrics | Profit Rate | ABC | 939.38% | 855.06% |
| | | XYZ | 174.95% | 174.95% |
| | Win Rate | ABC | 90.47% | 79.72% |
| | | XYZ | 51.88% | 51.88% |
| Risk Metrics | Volatility | ABC | 57.8% | 459.31% |
| | | XYZ | 884.27% | 884.27% |
| | Max Drawdown | ABC | 0 | -2.45 |
| | | XYZ | -2.54 | -2.54 |
| | Downside Deviation | ABC | 2.33 | 2.42 |
| | | XYZ | 0.177 | 0.177 |
| | Gain-Loss Ratio | ABC | 5.12 | 0.984 |
| | | XYZ | 0.3484 | 0.3484 |
| Risk-Adjusted | Sharpe Ratio | ABC | 2.73 | 1.45 |
| | | XYZ | 0.360 | 0.360 |
| | Sortino Ratio | ABC | 3.43 | 1.92 |
| | | XYZ | 0.6808 | 0.6808 |
| | Calmar Ratio | ABC | – | -1.90 |
| | | XYZ | -0.0474 | -0.0474 |

Key Findings

For ABC (stable mean reversion):

  • Stochastic model shows superior performance across most metrics:
    • Higher profit rate (939.38% vs 855.06%)
    • Higher win rate (90.47% vs 79.72%)
    • Much lower volatility (57.8% vs 459.31%)
    • No drawdown vs -2.45
    • Better risk-adjusted returns (Sharpe: 2.73 vs 1.45, Sortino: 3.43 vs 1.92)
  • RL model generates more trades but with higher volatility

For XYZ (structural breaks):

  • Both models produce identical results
  • Both models found very similar optimal thresholds
  • Performance is inferior across all metrics compared to ABC
  • This suggests the spread doesn’t follow stable mean reversion

Conclusions

Model Relationships

  1. Monte Carlo Connection: Both the RL and the stochastic mean reversion models use Monte Carlo methods to estimate expected values: of the profit and loss function in the stochastic model and of the Q-values in RL.

  2. Bellman Equation: Both approaches leverage the Bellman equation for optimization—explicitly in RL and through stochastic optimization in the mean reversion model. RL can be considered analogous to stochastic optimization using Monte Carlo methods.

Performance Analysis

  1. Spread Behavior Matters: Both models are effective for processes with stable mean reversion behavior (ABC) but show limited differentiation for spreads with structural breaks (XYZ).

  2. Asymmetric vs Symmetric Thresholds:
    • The stochastic model produces relatively symmetric levels around the mean
    • The RL model shows asymmetric levels, closer to $\theta$ on the long side
    • For ABC, the symmetric approach (stochastic) yielded better risk-adjusted returns
  3. Trading Horizon Impact: The threshold levels are influenced by the strategy’s cut-off horizon (30 days in this case).

Practical Considerations

While the RL model showed impressive absolute returns, several real-world frictions were not incorporated:

  • Transaction costs
  • Bid-ask spreads
  • Tax implications in each market
  • Margin costs for short selling

Within the academic exercise framework, the stochastic mean reversion model demonstrated superior performance for pair trading on assets with stable mean-reverting spreads, particularly when considering risk-adjusted metrics.

Advantages of Each Approach

Stochastic Mean Reversion Model:

  • More predictable behavior
  • Lower volatility
  • Better risk-adjusted returns for stable spreads
  • Simpler interpretation and implementation
  • Symmetric thresholds around mean

Reinforcement Learning Model:

  • Learns from both good and bad decisions through reward mechanisms
  • Analogous to solving Bellman equations for optimization
  • Can adapt to complex reward structures
  • Potential for improvement with deep learning enhancements
  • More flexible in asymmetric market conditions

Future Research Directions

  1. Different Asset Pairs: Test models on assets that are not cross-listed, especially pairs with minor structural changes

  2. Deep Learning Integration: Enhance RL models with:
    • Deep Q-Networks (DQN)
    • Deep deterministic policy gradients (DDPG)
    • Convolutional layers for feature extraction
    • Meta-learning approaches
  3. Real-World Frictions: Incorporate transaction costs, slippage, and tax implications

  4. Alternative Distributions: Explore other heavy-tailed distributions beyond GHYP for the diffusion factor

  5. Multi-Asset Strategies: Extend to portfolios of multiple pairs simultaneously

Technical Implementation Notes

Monte Carlo Simulation Parameters

Stochastic Model:

  • 3 million random numbers generated
  • Grid: 250 × 12,000 (250 days × 12,000 scenarios)
  • GHYP parameters estimated via maximum likelihood

Reinforcement Learning:

  • Q-value matrix dimensions: 504 × 3 (states × actions)
  • Episode length: randomly selected 20-30 day windows
  • Exploration: $\epsilon$-greedy strategy
  • State space: discretized spread range

Algorithm Pseudocode

Initialize: episodes=5000, maxSteps=30, ε=ε*, α=α*, γ=γ*
Initialize: Q(nS, nA) ← 0

For episode_i = 1 to episodes:
    env ← reset_environment()
    a_t ← select_action(ε)
    step_t ← 0
    
    While not terminated:
        r_t ← reward(a_t)
        step_t ← step_t + 1
        s_t ← new_state
        a_t ← select_action(ε)
        Q(nS, nA) ← update_Q(a_t, a_{t-1}, s_t, s_{t-1})
        terminated ← evaluate(conditions)

Where:

  • reset_environment(): Randomly selects a time segment of length maxSteps
  • select_action(ε): Implements ε-greedy policy
  • update_Q(): Applies Q-learning or SARSA update rule
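
For completeness, here is a minimal runnable Python version of this loop on a toy discretized-spread setup; the environment, state grid, exploration rate, and synthetic data are simplified stand-ins for the study's implementation.

```python
import numpy as np

def train_q(spread: np.ndarray, theta: float, n_states: int = 504, n_actions: int = 3,
            episodes: int = 5000, max_steps: int = 30, eps: float = 0.1,
            alpha: float = 0.8, gamma: float = 0.2, sarsa: bool = False,
            seed: int = 0) -> np.ndarray:
    """Tabular Q-learning / SARSA on a discretized spread series (toy stand-in)."""
    rng = np.random.default_rng(seed)
    lo, hi = spread.min() * 0.9, spread.max() * 1.1  # historical spread range +/- 10% (rough stand-in)
    grid = np.linspace(lo, hi, n_states)
    Q = np.zeros((n_states, n_actions))              # actions: 0=hold, 1=buy, 2=sell

    def state(x: float) -> int:
        return int(np.clip(np.searchsorted(grid, x), 0, n_states - 1))

    def reward(x: float, a: int) -> float:
        if a == 2:   # sell / short signal
            return x - theta if x > theta else -3.0 * (theta - x)
        if a == 1:   # buy / long signal
            return theta - x if theta > x else -3.0 * (x - theta)
        return 0.0   # hold

    def policy(s: int) -> int:
        return int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))

    for _ in range(episodes):
        start = int(rng.integers(0, len(spread) - max_steps))  # random time-series segment
        s = state(spread[start])
        a = policy(s)
        for t in range(1, max_steps):
            r = reward(spread[start + t], a)
            s_next = state(spread[start + t])
            a_next = policy(s_next)
            target = r + gamma * (Q[s_next, a_next] if sarsa else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q

# Illustrative run on a synthetic mean-reverting spread around theta = 1.0
rng = np.random.default_rng(3)
x = np.ones(2370)
for t in range(1, x.size):
    x[t] = 1.0 + 0.3 * (x[t - 1] - 1.0) + 0.01 * rng.standard_normal()
Q = train_q(x, theta=1.0)
print(Q.shape)
```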

References

This research draws on the intersection of stochastic calculus, optimization theory, and modern machine learning (Baldi, 2017; Sutton & Barto, 2020; Powell, 2022). The comparison reveals that while reinforcement learning shows promise, classical stochastic models remain highly effective for well-behaved mean-reverting processes (Bertram, 2009; Bertram, 2010; Leung & Li, 2016), particularly when incorporating realistic risk management considerations.

The Ornstein-Uhlenbeck mean reversion model (Ornstein & Uhlenbeck, 1930; Schwartz, 1997) has been extensively studied for pair trading applications (Goncu & Akyildirim, 2016; Zeng & Lee, 2014; Avellaneda & Lee, 2010). The use of generalized hyperbolic distributions (Konlack & Wilcox, 2014; Weibel et al., 2022) for modeling heavy-tailed distributions provides a more realistic representation of market dynamics compared to normal distributions (Madan et al., 1999; Carr & Wu, 2004).

On the reinforcement learning side, Q-learning and SARSA algorithms (Sutton & Barto, 2020; Kaelbling et al., 1998) have been applied to trading strategies (Chakole et al., 2021; Carapuco et al., 2018; Wu et al., 2020). Recent studies have explored actor-critic methods (Konda & Tsitsiklis, 2003) and deep reinforcement learning approaches (Plaat, 2022; Dong et al., 2020) for financial applications (Sun et al., 2021; Carta et al., 2021; Kowalik et al., 2019).

The theoretical foundations draw from the Efficient Market Hypothesis (Fama, 1970) and its evolution into the Adaptive Market Hypothesis (Lo, 2004), acknowledging behavioral factors (Shiller, 2014) and even chaotic dynamics (Klioutchinov et al., 2017; Minsky, 1979) in financial markets. The pair trading strategy itself is well-documented in quantitative finance literature (Isichenko, 2021; Chan, 2013; Vidyamurthy, 2004; Krauss, 2015), building on modern portfolio theory (Elton et al., 2014).

Parameter estimation techniques include maximum likelihood methods (Aït-Sahalia, 2002; Mejía, 2018; Dempster et al., 1977) and specialized approaches for stochastic processes (Haress & Hu, 2021; Minnis, 2012).

The key insight is not that one model is universally superior, but rather that model selection should be driven by the characteristics of the underlying spread dynamics. For stable mean-reverting spreads, stochastic models provide excellent risk-adjusted returns with lower volatility. For more complex or non-stationary spreads, enhanced RL approaches with deep learning may offer advantages—an avenue ripe for future exploration.

Keywords: mean reversion, stochastic models, reinforcement learning, pair trading, statistical arbitrage, Ornstein-Uhlenbeck, Q-learning, SARSA, quantitative finance, algorithmic trading


Bibliography

  1. Leung, T., & Li, X. (2016). Optimal mean reversion: Mathematical analysis and practical applications. World Scientific.
  2. Weibel, M., Breymann, W., & Luthi, D. (2022). ghyp: A package on generalized hyperbolic distributions. CRAN repository.
  3. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1–22.
  4. Yuan, Y. (2019). A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning. Knowledge-Based Systems, 175, 107–117.
  5. Baldi, P. (2017). Stochastic calculus: An introduction through theory and exercises. Springer.
  6. Sutton, R., & Barto, A. (2020). Reinforcement learning: An introduction. The MIT Press.
  7. Powell, W. (2022). Reinforcement learning and stochastic optimization: A unified framework for sequential decisions. Wiley.
  8. Bertram, W. (2009). Optimal trading Strategies for Ito diffusion processes. Physica A: Statistical Mechanics and Its Applications, 338, 2865–2873.
  9. Bertram, W. (2010). Analytic solution for optimal statistical arbitrage trading. Physica A: Statistical Mechanics and Its Applications, 389, 2234–2243.
  10. Ornstein, L., & Uhlenbeck, G. (1930). On the theory of the Brownian motion. Physical Review, 36(5), 823–841.
  11. Schwartz, E. (1997). Stochastic behavior of commodity prices: Implications for valuation and hedging. Journal of Finance, 52(2), 923–973.
  12. Goncu, A., & Akyildirim, E. (2016). A stochastic model for commodity pairs trading. Quantitative Finance, 16(12), 1843–1857.
  13. Zeng, Z., & Lee, C.-G. (2014). Pairs trading: optimal threshold and profitability. Quantitative Finance, 14(11), 1881–1893.
  14. Avellaneda, M., & Lee, J.-H. (2010). Statistical arbitrage in the US equity markets. Quantitative Finance, 10(7), 761–782.
  15. Konlack, V., & Wilcox, D. (2014). A comparison of generalized hyperbolic distribution models for equity returns. Journal of Applied Mathematics, 15.
  16. Madan, D., Carr, P., & Chang, E. (1999). The Variance Gamma process and option pricing. European Finance Review, 2(1), 79–105.
  17. Carr, P., & Wu, L. (2004). Time-changed Levy processes and option pricing. Journal of Financial Economics, 71(1), 113–141.
  18. Kaelbling, L., Littman, M., & Cassandra, A. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.
  19. Chakole, J., Kolhe, M., Mahapurush, G., Yadav, A., & Kurhekar, M. (2021). A Q-learning agent for automated trading in equity stock markets. Expert Systems with Applications, 163, 1–12.
  20. Carapuco, J., Neves, R., & Horta, N. (2018). Reinforcement learning applied to forex trading. Applied Soft Computing Journal, 73, 783–794.
  21. Wu, X., Chen, H., Wang, J., Troiano, L., Loia, V., & Fujita, H. (2020). Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences, 538, 142–158.
  22. Konda, V., & Tsitsiklis, J. (2003). On actor-critic algorithms. SIAM J. Control Optim, 42(4), 1143–1166.
  23. Plaat, A. (2022). Deep reinforcement learning. Springer.
  24. Dong, H., Ding, Z., & Zhang, S. (2020). Deep reinforcement learning: Fundamentals, research and applications. Springer.
  25. Sun, S., Wang, R., & An, B. (2021). Reinforcement learning for quantitative trading. ArXiv EPrint.
  26. Carta, S., Corriga, A., Ferreira, A., & Podda, A. (2021). A multi-layer and multi-ensemble stock trader using deep learning. Applied Science, 51, 889–905.
  27. Kowalik, P., Kjellevold, A., & Gropen, S. (2019). A deep reinforcement learning approach for stock trading [Master's thesis]. Norwegian University of Science and Technology.
  28. Fama, E. (1970). Efficient capital markets: A review of theory and empirical work. Journal of Finance, 25(2), 383–417.
  29. Lo, A. (2004). The adaptive markets hypothesis: Market efficiency from an evolutionary perspective. The Journal of Portfolio Management.
  30. Shiller, R. (2014). Speculative Asset Prices. The American Economic Review, 104(6), 1486–1517.
  31. Klioutchinov, I., Sigova, M., & Beizerov, N. (2017). Chaos theory in finance. Procedia Computer Science, 119, 368–375.
  32. Minsky, H. (1979). The financial instability hypothesis: An interpretation of Keynes and an alternative to 'standard' theory. Nebraska Journal of Economics and Business, 16(1), 5–18.
  33. Isichenko, M. (2021). Quantitative Portfolio Management: The art and science of statistical arbitrage. Wiley Finance Series.
  34. Chan, E. (2013). Algorithmic trading: winning strategies and their rationale. Wiley Trading Series.
  35. Vidyamurthy, G. (2004). Pairs trading: Quantitative methods and analysis. Wiley Finance.
  36. Krauss, C. (2015). Statistical arbitrage pairs trading strategies: Review and outlook (No. 9; Number 9). Institut für Wirtschaftspolitik und Quantitative Wirtschaftsforschung, Nürnberg - Working Paper.
  37. Elton, E., Brown, S., Gruber, M., & Goetzman, W. (2014). Modern portfolio theory and investment analysis. Wiley.
  38. Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica, 70(1), 223–262.
  39. Mejía, C. (2018). Calibration of exponential Ornstein-Uhlenbeck process when spot prices are visible through the maximum log-likelihood method. Example with gold prices. Spring Open Journal: Advances in Difference Equations, 269.
  40. Haress, E., & Hu, Y. (2021). Estimation of all parameters in the fractional Ornstein-Uhlenbeck model under discrete observations. Statistical Inference for Stochastic Processes, 24, 327–351.
  41. Minnis, M. (2012). Mean reverting Levy based processes. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2086485