Pair trading strategy using reinforcement learning algorithms (PPO and A2C)

Apr 21, 2025

Deep Reinforcement Learning for Pair Trading

This is a brief summary of a personal goal I achieved recently: a full study of a reinforcement learning trading strategy, built with real parameters and taken through the complete process needed to assess its output. Concretely, it is a study of the reinforcement learning (RL) algorithms Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C) applied to a pair trading strategy.

For readers who want to learn about pair trading and statistical arbitrage, I suggest the book The Art and Science of Statistical Arbitrage by Isichenko. It contains the fundamentals of the strategy and some statistical methods to manage it. As said before, my approach was different in that I followed those principles only partially, while still keeping the fundamentals in mind.

From Experimental Frameworks to Comparative Strategy Analysis

The project was performed in three stages. The first corresponds to my master's thesis, where I explored the strategy using three learning environments: the first two implemented strategies other than pair trading with discrete action spaces, while the third respected the rules of pair trading but with a continuous action space. The results were mixed, but the takeaway that marked my analysis in this first experiment is that not all RL policies are deterministic. Since PPO and A2C use stochastic policies, as is usual, an analysis based on a single path is not something to rely on: the analysis must be done over a number of simulations, that is, as a Monte Carlo simulation.
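
To make this concrete, below is a minimal sketch of such a Monte Carlo evaluation, assuming a stable-baselines3-style agent and a gymnasium-style environment that exposes a hypothetical `portfolio_value` entry in `info`. It is illustrative only, not the exact code used in the experiments.

```python
import numpy as np


def evaluate_stochastic_policy(model, env, n_runs=100):
    """Run many independent rollouts of a stochastic policy.

    PPO and A2C sample actions from a distribution, so a single episode
    is not representative; aggregate statistics over n_runs instead.
    """
    final_values = []
    for _ in range(n_runs):
        obs, _ = env.reset()
        done = False
        info = {}
        while not done:
            # deterministic=False keeps the policy stochastic at evaluation time
            action, _ = model.predict(obs, deterministic=False)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
        # "portfolio_value" is a hypothetical key exposed by the environment
        final_values.append(info.get("portfolio_value", np.nan))
    return np.mean(final_values), np.std(final_values)
```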

Afterwards comes what I call phase 1, the first conference paper that we participated in as a team (details below). This paper helped us identify some drawbacks in the first analysis. The main one, which I did not mention in the foregoing paragraph, was the arbitrary inclusion of technical indicators based on my experience in manual trading; it was simply an assumption. Even though our agnostic analysis of their influence suggested they could lend some support to the output, when phase 2 was performed we realized they do not provide much information to the model once the signal was changed. And the signal was changed because the one used in phase 2 makes more sense under the mean-reversion assumption, which is widely used in the industry.

I have to acknowledge that the work done by the contributors of FinRL boosted our analysis, since they provide a framework for executing this sort of study. The benefit of it being open source is that we could dive into the bowels of the library. We also learned a lot during the process, given its interaction with the gym library from OpenAI, among others. FinRL offers a good number of agents, which I encourage you to read about here.
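
For readers unfamiliar with the stack: FinRL builds on agents from stable-baselines3 and environments that follow the gym interface. The sketch below shows the general shape of training PPO and A2C on the same environment; it is a simplified illustration under those assumptions, not the project's actual training script.

```python
import gymnasium as gym
from stable_baselines3 import A2C, PPO


def train_agents(env: gym.Env, total_timesteps: int = 200_000):
    """Train PPO and A2C on the same environment for a like-for-like comparison.

    `env` stands in for the custom pair trading environments described in
    phases 1 and 2; any gym-style environment with a continuous action
    space works for the sake of illustration.
    """
    ppo_model = PPO("MlpPolicy", env, verbose=0)
    ppo_model.learn(total_timesteps=total_timesteps)

    a2c_model = A2C("MlpPolicy", env, verbose=0)
    a2c_model.learn(total_timesteps=total_timesteps)
    return ppo_model, a2c_model


# Placeholder usage with a built-in continuous-control task:
# ppo, a2c = train_agents(gym.make("Pendulum-v1"), total_timesteps=10_000)
```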

Data

The selected instruments for these experiments were stocks. More specifically, a subset of pairs that we concluded were candidates after rigorously filtering all combinations of S&P 500 constituents as of 17 October 2022. We ran statistical tests and a selection measure to determine the candidates; more details are in the phase 1 document. I have to say this task took several hours on my laptop, but it was worth it.
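
The exact filtering criteria are described in the phase 1 document. As an illustration of the kind of screen involved, the sketch below runs an Engle-Granger cointegration test over every combination of tickers and keeps the pairs below a p-value threshold; the test choice and the threshold are assumptions for the example, not necessarily what we used.

```python
from itertools import combinations

import pandas as pd
from statsmodels.tsa.stattools import coint


def screen_pairs(prices: pd.DataFrame, p_threshold: float = 0.05) -> pd.DataFrame:
    """Screen every stock combination for cointegrated pairs.

    `prices` holds one price series per ticker (columns). Looping over all
    combinations is what makes this step take hours for an S&P 500-sized
    universe.
    """
    rows = []
    for a, b in combinations(prices.columns, 2):
        pair = prices[[a, b]].dropna()
        _, p_value, _ = coint(pair[a], pair[b])
        if p_value < p_threshold:
            rows.append({"stock_a": a, "stock_b": b, "p_value": p_value})
    return pd.DataFrame(rows).sort_values("p_value")
```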

Now, the project is summarized in two research studies exploring the use of Deep Reinforcement Learning (DRL) for pair trading — evolving from an experimental prototype to a comparative evaluation using well-known DRL algorithms.


Conference paper and journal article

📘 Phase 1: Experimental DRL Framework for Pair Trading

  • 📄 Title: Reinforcement Learning Model Applied in a Pair Trading Strategy
  • 📚 Source: Springer WEA, 2024
  • 🔗 DOI: Link to Chapter

🧪 Highlights

  • Developed three customized RL environments simulating pair trading as a single-agent control task over a subset of 5 stock pairs from the S&P 500.
  • Continuous action space: the agent decides both position direction and size (see the environment sketch after this list).
  • Reward function: Portfolio value, discounting trading costs.
  • The action space includes technical indicators.
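
As a rough illustration of what such an environment can look like (not the paper's implementation; the spread input, the cost model, and the single-dimensional observation are simplifying assumptions), a gymnasium-style skeleton could be:

```python
import gymnasium as gym
import numpy as np


class PairTradingEnvSketch(gym.Env):
    """Simplified pair trading environment (illustration only).

    The action is one number in [-1, 1]: its sign is the position direction
    (long/short the spread), its magnitude the position size. The reward is
    the change in portfolio value net of trading costs.
    """

    def __init__(self, spread: np.ndarray, cost_rate: float = 0.001):
        super().__init__()
        self.spread = spread
        self.cost_rate = cost_rate
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(1,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.position, self.portfolio_value = 0, 0.0, 1.0
        return np.array([self.spread[0]], dtype=np.float32), {}

    def step(self, action):
        target = float(np.clip(action[0], -1.0, 1.0))
        cost = self.cost_rate * abs(target - self.position)  # pay for rebalancing
        self.position = target
        pnl = self.position * (self.spread[self.t + 1] - self.spread[self.t])
        self.portfolio_value += pnl - cost
        self.t += 1
        terminated = self.t >= len(self.spread) - 1
        obs = np.array([self.spread[self.t]], dtype=np.float32)
        return obs, pnl - cost, terminated, False, {"portfolio_value": self.portfolio_value}
```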

✅ Results

  • PPO showed the most consistent performance in Sharpe ratio and cumulative return.
  • A2C and DDPG had higher variance—highlighting sensitivity to hyperparameters and reward design.
  • RL models do not consistently outperform classical methods in pair trading strategies.

🚀 Phase 2: Comparative Study of A2C and PPO with a New Signal

  • 📄 Title: Deep Reinforcement Learning in Continuous Action Spaces for Pair Trading: A Comparative Study of A2C and PPO
  • 📚 Source: SN Computer Science, 2025
  • 🔗 DOI: Link to Article

🧠 Methodology

  • Same universe as in phase 1.
  • Different signal: this time it is adjusted by the moving average of the last month to remove the trend (see the sketch after this list).
  • Agents were trained in a different environment where the action space is still continuous, but actions are allowed to oscillate within bounds that depend on the band of the strategy.
  • Technical indicators were removed from the action space.
  • Comparison against the classical pair trading method reveals an improvement with respect to phase 1, but still without outperforming classical methods.
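
The exact signal construction is in the paper; a minimal sketch of the general idea follows, where the hedge ratio and the 21-day window (roughly one trading month) are illustrative assumptions.

```python
import pandas as pd


def detrended_spread(price_a: pd.Series, price_b: pd.Series,
                     hedge_ratio: float, window: int = 21) -> pd.Series:
    """Mean-reversion signal: spread minus its trailing moving average.

    With window=21 (roughly one trading month), the agent observes the
    deviation from the local mean instead of the raw, possibly trending,
    spread.
    """
    spread = price_a - hedge_ratio * price_b
    return spread - spread.rolling(window).mean()
```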

📊 Key Results

  • Both A2C and PPO partially outperformed the benchmark.
  • PPO showed faster convergence and lower drawdowns.
  • Bounding the action space for each of the bands in the pair trading strategy improves the learning process.

🔁 Summary of Contributions

| Feature | Phase 1 | Phase 2 |
|---|---|---|
| Environment Design | ✅ Custom built | ✅ Reused and validated |
| DRL Algorithms Tested | A2C, PPO, DDPG | A2C, PPO |
| Dataset | S&P 500 subset, 2018–2023 | S&P 500 subset, 2018–2023 |
| Reward Design | Portfolio value, including costs | Portfolio value, including costs |
| Benchmark Comparison | ✅ Yes (mean-reversion) | ✅ Yes (mean-reversion) |
| Real-world Pair Trading Viability | Does not outperform the classical method | Partially outperforms the classical method |

🔮 Future Directions

  • Introduce multi-agent frameworks for long/short coordination.
  • Use microstructure features (e.g., order book signals, bid/ask spreads).
  • Impact analysis of including technical indicators.
  • Use of transformer architectures.

📂 Resources


📢 Citation

If you find this work helpful, please cite the following:

```bibtex
@inproceedings{deep_reinforcement_learning_pair_trading_strategy_ppo_a2c_conferencepaper,
  title     = {Reinforcement Learning Model Applied in a Pair Trading Strategy},
  author    = {Quintero, Cristian and Leon, Diego and Sandoval, Javier and Hernandez, German},
  booktitle = {Applied Computer Science in Engineering – WEA 2024},
  year      = {2024},
  publisher = {Springer}
}

@article{deep_reinforcement_learning_pair_trading_strategy_ppo_a2c_comparative,
  title     = {Deep Reinforcement Learning in Continuous Action Spaces for Pair Trading: A Comparative Study of A2C and PPO},
  author    = {Quintero, Cristian and Leon, Diego and Sandoval, Javier and Hernandez, German},
  journal   = {SN Computer Science},
  year      = {2025},
  publisher = {Springer}
}
```
