Pair trading strategy using reinforcement learning algorithms (PPO and A2C)

Apr 21, 2025

Deep Reinforcement Learning for Pair Trading

This is a brief summary of a recently achieved personal goal: conducting a comprehensive study of reinforcement learning (RL) strategies using real-world parameters and a rigorous evaluation process. Specifically, the research investigates RL algorithms—Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C)—applied to pair trading strategies. The findings are detailed in two articles (Quintero et al., 2024; Quintero et al., 2025).

For readers who want to learn about pair trading and/or statistical arbitrage, I suggest the book The Art and Science of Statistical Arbitrage by Isichenko. It covers the fundamentals of the strategy and some of the statistical methods used to manage it. My approach only partially followed those principles, but always with the fundamentals in mind.

From Experimental Frameworks to Comparative Strategy Analysis

The project was performed in three stages.

  • The first stage corresponds to my master's thesis in Quantitative Finance (QF), where I framed the strategy using three learning environments. The first two environments adopt a strategy different from classic pair trading, with discrete action spaces; the third respects the rules of pair trading but uses a continuous action space. This combination of strategies and action spaces was introduced as a confirmation exercise, since RL models have gained a lot of attention in algorithmic trading. The results were mixed, and the main takeaway of my analysis is that not all RL policies are deterministic, so the model's output is exposed to variance when stochastic policies are involved. In effect, this first experiment revealed that, because PPO and A2C use stochastic policies, an analysis based on a single path is not enough to rely on: the evaluation must be run over a number of simulations, Monte Carlo style (see the evaluation sketch after this list).

  • The second stage is a conference paper published in the proceedings of the 11th Workshop on Engineering Applications (WEA 2024) in Barranquilla, Colombia, co-authored with Diego Leon, Javier Sandoval, and German Hernandez. This paper helped us identify some drawbacks of the first analysis. The main one, which I did not mention in the previous paragraph, was the arbitrary inclusion of technical indicators, motivated by my experience in traditional trading. It was an assumption from which we expected to see an effect. Our agnostic analysis of their influence suggested they could lend some support to the output; however, once phase 2 was performed we realized that the technical indicators do not provide the model with useful information for making decisions, contrary to what we had assumed in the experiment (Quintero et al., 2024).

  • The third stage is summarized in a case study article where, taking the output of stage 2 into account, the signal was redefined and the technical analysis indicators were removed. Additionally, fine-tuning of the model parameters was performed using the Ray framework (see the tuning sketch after this list). The output of this new approach suggests a slight improvement, mainly attributed to the signal used, which allows the RL agents to identify opportunities to enter and exit positions in a pair trading strategy (Quintero et al., 2025).
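
To make the Monte Carlo-style evaluation mentioned in the first stage concrete, here is a minimal sketch. It assumes a trained stable-baselines3 PPO model and a gymnasium-style trading environment (PairTradingEnv is a hypothetical name; a skeleton of such an environment is sketched further below in the Phase 1 section). It is illustrative, not the exact thesis code.

```python
# Illustrative sketch: evaluating a stochastic policy over many independent
# episodes, Monte Carlo style, instead of trusting a single path.
# Assumes a trained stable-baselines3 model and a gymnasium-style environment.
import numpy as np
from stable_baselines3 import PPO

def evaluate_policy_mc(model, env, n_runs=100):
    """Run the policy n_runs times with stochastic actions and collect
    the cumulative reward of each run."""
    finals = []
    for _ in range(n_runs):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            # deterministic=False keeps the policy's sampling noise,
            # which is exactly why a single path is not reliable on its own
            action, _ = model.predict(obs, deterministic=False)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        finals.append(total)
    return np.mean(finals), np.std(finals)

# model = PPO.load("ppo_pair_trading")                      # hypothetical checkpoint
# mean_r, std_r = evaluate_policy_mc(model, PairTradingEnv(test_spread))
```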
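
For the fine-tuning mentioned in the third stage, the sketch below shows the general shape of a Ray Tune search over PPO hyperparameters. The search space, the run_training helper, and the Sharpe-ratio metric are assumptions for illustration, not the exact setup of the paper.

```python
# Illustrative Ray Tune sketch: sample hyperparameters, train an agent,
# and report an out-of-sample metric so Tune can pick the best trial.
from ray import tune

def train_ppo(config):
    # run_training is a hypothetical helper that trains a PPO agent on the
    # pair-trading environment with the sampled hyperparameters and returns
    # an out-of-sample Sharpe ratio.
    sharpe = run_training(learning_rate=config["lr"],
                          gamma=config["gamma"],
                          n_steps=config["n_steps"])
    return {"sharpe": sharpe}

search_space = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.95, 0.999),
    "n_steps": tune.choice([128, 256, 512, 1024]),
}

tuner = tune.Tuner(
    train_ppo,
    param_space=search_space,
    tune_config=tune.TuneConfig(metric="sharpe", mode="max", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)
```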

I have to acknowledge that the work done by the contributors of FinRL boosted our analysis, since they provide a framework for executing this sort of study. The benefit of it being open source is that we could dive into the bowels of the library. From a technical perspective, I can highlight what I learned about the different functions that describe the RL agents, how small changes in them can have a significant impact on learning, and the interaction with OpenAI's gym library, among other things. FinRL offers a good number of agents, and I encourage the reader to read about them here.
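
As a small illustration of that point about agent configuration, the following sketch instantiates the two agents with stable-baselines3, the library FinRL builds on. The environment name and the hyperparameter values are assumptions for the example, not the project's exact settings.

```python
# Minimal sketch: instantiating the two agents compared in the papers with
# stable-baselines3. `PairTradingEnv` is a hypothetical environment (a
# skeleton is sketched in the Phase 1 section below).
from stable_baselines3 import A2C, PPO

env = PairTradingEnv(train_spread)  # hypothetical gym-style environment

# Small changes in these arguments (clip range, entropy bonus, rollout
# length, learning rate) can noticeably change what the agent learns.
ppo = PPO("MlpPolicy", env, learning_rate=3e-4, n_steps=256,
          clip_range=0.2, ent_coef=0.01, verbose=0)
a2c = A2C("MlpPolicy", env, learning_rate=7e-4, n_steps=5,
          ent_coef=0.01, verbose=0)

ppo.learn(total_timesteps=100_000)
a2c.learn(total_timesteps=100_000)
```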

Data

The selected instruments for these experiments were stocks. More specifically, we used a subset of pairs identified as candidates after rigorously filtering all combinations of S&P 500 constituents as of October 17, 2022. We ran statistical tests and a measure to determine the candidates; more details are in the phase 1 document. This task took several hours on my laptop, I have to say, but it was worth it.
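
As an illustration of the kind of screening involved (the exact tests and measure are detailed in the phase 1 document), the sketch below filters ticker pairs with an Engle-Granger cointegration test. The sp500_prices DataFrame and the 5% threshold are assumptions for the example.

```python
# Illustrative pair screening: test every pair of tickers for cointegration
# and keep the pairs whose p-value falls below a threshold.
from itertools import combinations
import pandas as pd
from statsmodels.tsa.stattools import coint

def screen_pairs(prices: pd.DataFrame, p_threshold: float = 0.05):
    """prices: DataFrame of close prices, one column per ticker.
    Returns (ticker_a, ticker_b, p_value) tuples sorted by p-value."""
    candidates = []
    for a, b in combinations(prices.columns, 2):
        # Engle-Granger two-step cointegration test
        _, p_value, _ = coint(prices[a], prices[b])
        if p_value < p_threshold:
            candidates.append((a, b, p_value))
    return sorted(candidates, key=lambda x: x[2])

# candidates = screen_pairs(sp500_prices)   # hypothetical price DataFrame
```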

Now, the project is summarized in two research studies exploring the use of Deep Reinforcement Learning (DRL) for pair trading — evolving from an experimental prototype to a comparative evaluation using well-known DRL algorithms.


Conference paper and journal article

📘 Phase 1: Experimental DRL Framework for Pair Trading

  • 📄 Title: Reinforcement Learning Model Applied in a Pair Trading Strategy
  • 📚 Source: Springer WEA, 2024
  • 🔗 DOI: https://doi.org/10.1007/978-3-031-74595-9_3

🧪 Highlights

  • Developed three customized RL environments simulating pair trading as a single-agent control task over a subset of 5 stock pairs from the S&P 500.
  • Continuous action space: the agent decides both position direction and size.
  • Reward function: portfolio value, discounting trading costs.
  • The observation (state) space includes technical indicators (see the environment sketch below).
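
For concreteness, here is a minimal, hypothetical skeleton of such an environment: a continuous position in [-1, 1], an observation containing the pair's spread and the current position (technical indicators would be appended to the observation), and a reward equal to the step change in portfolio value net of trading costs. It is a sketch under those assumptions, not the environment used in the paper.

```python
# Minimal, illustrative pair-trading environment skeleton (gymnasium API).
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class PairTradingEnv(gym.Env):
    def __init__(self, spread, cost=0.001):
        super().__init__()
        self.spread = np.asarray(spread, dtype=np.float32)  # spread signal of the pair
        self.cost = cost
        # continuous action: signed position size in [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        # observation: current spread value and current position
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.position = 0, 0.0
        return np.array([self.spread[self.t], self.position], dtype=np.float32), {}

    def step(self, action):
        new_pos = float(np.clip(action[0], -1.0, 1.0))
        # reward: change in portfolio value from holding the spread position,
        # minus a proportional cost for changing the position
        pnl = self.position * (self.spread[self.t + 1] - self.spread[self.t])
        pnl -= self.cost * abs(new_pos - self.position)
        self.position = new_pos
        self.t += 1
        terminated = self.t >= len(self.spread) - 1
        obs = np.array([self.spread[self.t], self.position], dtype=np.float32)
        return obs, pnl, terminated, False, {}
```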

✅ Results

  • PPO showed the most consistent performance in Sharpe ratio and cumulative return.
  • A2C and DDPG had higher variance—highlighting sensitivity to hyperparameters and reward design.
  • RL models did not consistently outperform classical methods in pair trading strategies.

🚀 Phase 2: Comparative Study of A2C and PPO with a New Signal

  • 📄 Title: Deep Reinforcement Learning in Continuous Action Spaces for Pair Trading: A Comparative Study of A2C and PPO
  • 📚 Source: SN Computer Science, 2025
  • 🔗 DOI: https://doi.org/10.1007/s42979-025-03854-0

🧠 Methodology

  • Same universe as in phase 1.
  • Different signal: this time it is adjusted by the moving average of the last month to remove the trend (see the sketch after this list).
  • Agents were trained in a different environment where the action space is continuous, but they are allowed to oscillate depending on the band of the strategy.
  • Technical indicators were removed from the observation space.
  • Comparison against the classical pair trading method reveals an improvement with respect to phase 1, but still without outperforming classical methods.
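
A rough sketch of the kind of detrended signal described above (the exact construction is in the article): the log-price spread of a pair adjusted by a trailing one-month moving average. The hedge ratio, the use of log prices, and the 21-trading-day window are assumptions for illustration.

```python
# Illustrative detrended spread: spread of the pair minus its trailing
# one-month moving average, so the trend component is removed.
import numpy as np
import pandas as pd

def detrended_spread(price_a: pd.Series, price_b: pd.Series,
                     hedge_ratio: float, window: int = 21) -> pd.Series:
    spread = np.log(price_a) - hedge_ratio * np.log(price_b)
    return spread - spread.rolling(window).mean()

# The strategy's bands can then be defined on this signal, e.g. as multiples
# of its rolling standard deviation, and the agent's continuous action is
# bounded within each band.
```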

📊 Key Results

  • Both A2C and PPO partially outperformed the benchmark.
  • PPO showed faster convergence and lower drawdowns.
  • A bounded action space for each of the bands in the pair trading strategy improves the learning process.

🔁 Summary of Contributions

| Feature | Phase 1 | Phase 2 |
| --- | --- | --- |
| Environment Design | ✅ Custom built | ✅ Reused and validated |
| DRL Algorithms Tested | A2C, PPO, DDPG | A2C, PPO |
| Dataset | S&P 500 subset, 2018–2023 | S&P 500 subset, 2018–2023 |
| Reward Design | Portfolio value, including costs | Portfolio value, including costs |
| Benchmark Comparison | ✅ Yes (mean-reversion) | ✅ Yes (mean-reversion) |
| Real-world Pair Trading Viability | Does not outperform classical method | Partially outperforms classical method |

🔮 Future Directions

  • Introduce multi-agent frameworks for long/short coordination.
  • Use microstructure features (e.g., order book signals, bid/ask spreads).
  • Impact analysis of including technical indicators.
  • Use of transformer architectures.

📂 Resources


📚 Full References 📢

  1. Quintero, C., Leon, D., Sandoval, J., & Hernandez, G. (2024). Reinforcement Learning Model Applied in a Pair Trading Strategy. In J. C. Figueroa-García, F. S. Garay-Rairán, G. J. Hernández-Pérez, & Y. Díaz-Gutierrez (Eds.), Applied Computer Science in Engineering (Vol. 2094, pp. 29–42). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-74595-9_3
  2. Quintero, C., Leon, D., Sandoval, J., & Hernandez, G. (2025). Deep Reinforcement Learning in Continuous Action Spaces for Pair Trading: A Comparative Study of A2C and PPO. SN Computer Science, 6(3), 348. https://doi.org/10.1007/s42979-025-03854-0