The world of cryptocurrency trading has long been a playground for innovation, where cutting-edge technologies like artificial intelligence and machine learning are reshaping how we think about automated investing. One of the most compelling developments in recent years is the use of deep reinforcement learning (DRL) to build intelligent trading agents capable of making profitable decisions—without human intervention.
In this article, we dive into a groundbreaking experiment by AI engineer Adam King, who developed a Python-based Bitcoin trading bot that achieved an astonishing 850% return over a test period. While past models showed promise, they lacked consistency. This new version introduces critical upgrades: LSTM networks, Bayesian optimization, advanced feature engineering, and risk-adjusted reward functions—all designed to boost performance and stability.
Let’s explore how this high-performing trading agent was built, optimized, and tested—using real data and proven techniques.
Improving the Deep Reinforcement Learning Model
At the core of this project lies a deep reinforcement learning framework, specifically Proximal Policy Optimization (PPO, via the Stable Baselines PPO2 implementation). The original model used a basic Multi-Layer Perceptron (MLP) policy for decision-making but struggled with volatility and inconsistent returns.
To overcome these limitations, two major improvements were introduced:
1. Replacing MLP with LSTM Networks
Traditional neural networks treat each input independently, which is problematic for time-series data like cryptocurrency prices. Enter Long Short-Term Memory (LSTM) networks—a type of Recurrent Neural Network (RNN) that maintains an internal memory state across time steps.
By replacing the MLP with an LSTM-based policy network, the agent can now:
- Retain historical price behavior without relying on sliding windows.
- Dynamically update its internal state at each time step.
- Learn complex temporal patterns from sequential data more effectively.
This change allows the bot to understand trends, cycles, and momentum shifts in Bitcoin prices with far greater accuracy than before.
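In code, the change amounts to swapping the policy class passed to PPO2. Here is a minimal sketch, assuming the Stable Baselines (v2) library that the PPO2 name comes from; BitcoinTradingEnv and train_df are hypothetical stand-ins for the project's own Gym trading environment and training data:

```python
# A minimal sketch of the MLP-to-LSTM policy swap (Stable Baselines v2).
# BitcoinTradingEnv and train_df are hypothetical stand-ins.
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: BitcoinTradingEnv(train_df)])

model = PPO2(
    MlpLstmPolicy,   # recurrent policy with an internal LSTM state
    env,
    nminibatches=1,  # recurrent policies need n_envs divisible by nminibatches
    verbose=1,
)
model.learn(total_timesteps=100_000)
```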
2. Ensuring Data Stationarity
Cryptocurrency price data is inherently non-stationary, meaning it exhibits trends, seasonality, and shifting variances—making predictions unreliable.
To fix this, the dataset undergoes preprocessing:
- Log transformation: Reduces exponential growth bias and stabilizes variance.
- Differencing: Removes trends by calculating period-over-period changes (i.e., returns).
The result? A stationary time series where statistical properties remain constant over time—ideal for machine learning models.
Validation via the Augmented Dickey-Fuller (ADF) test confirmed stationarity with a p-value of 0.00, rejecting the null hypothesis of non-stationarity with high confidence.
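The whole preprocessing and validation step fits in a few lines. A sketch, assuming a pandas DataFrame df with a 'Close' column:

```python
# Sketch of the stationarity preprocessing and check.
import numpy as np
from statsmodels.tsa.stattools import adfuller

log_close = np.log(df['Close'])          # log transform stabilizes variance
log_returns = log_close.diff().dropna()  # differencing removes the trend

adf_stat, p_value, *_ = adfuller(log_returns)
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}')
# A p-value near 0 rejects the null hypothesis of a unit root
# (i.e., non-stationarity), matching the article's result.
```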
Feature Engineering: Enhancing the Observation Space
A powerful model needs high-quality inputs. To give the agent deeper market insight, advanced feature engineering techniques were applied.
Adding Technical Indicators
Using Python's ta library, 32 technical indicators (58 features in total) were evaluated for relevance. To avoid redundancy:
- Correlation analysis grouped similar indicators by category (momentum, volatility, trend).
- Within each group, highly correlated features were dropped, keeping only the least correlated ones.
After filtering out highly correlated features (|r| > 0.5), 38 of the 58 features remained in the observation space, including RSI, MACD, Bollinger Bands, and moving averages.
These provide contextual signals about market conditions, helping the agent anticipate reversals or breakouts.
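A simplified sketch of this step, assuming a DataFrame with OHLCV columns named as below. The article groups indicators by category before filtering; this version simply drops one of every feature pair whose absolute correlation exceeds 0.5:

```python
# Generate the full ta feature set, then prune redundant features.
import numpy as np
import ta

df = ta.add_all_ta_features(
    df, open='Open', high='High', low='Low',
    close='Close', volume='Volume', fillna=True)

feature_cols = [c for c in df.columns
                if c not in ('Open', 'High', 'Low', 'Close', 'Volume')]
corr = df[feature_cols].corr().abs()

# Scan the upper triangle so each pair is considered exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]
df = df.drop(columns=to_drop)
```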
Incorporating SARIMAX Predictions
Beyond technicals, a SARIMAX model (Seasonal AutoRegressive Integrated Moving Average with eXogenous factors) was used to forecast future prices.
Why SARIMAX?
- Handles seasonality and trends natively.
- Provides prediction intervals—giving the agent confidence levels.
- Helps distinguish between high-uncertainty and high-opportunity scenarios.
These predictions and their confidence bands are fed into the agent’s input layer, enriching its situational awareness.
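As a sketch of how this can be wired up with statsmodels, assuming the stationary log-return series from the preprocessing step; the model orders here are illustrative, not the article's exact settings:

```python
# Fit SARIMAX on the stationary returns and expose the one-step
# forecast plus its confidence interval as observation features.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(log_returns,
                order=(1, 0, 1),               # (p, d, q), illustrative
                seasonal_order=(1, 0, 0, 24))  # 24-period seasonality for hourly data
fit = model.fit(disp=False)

forecast = fit.get_forecast(steps=1)
point = forecast.predicted_mean.iloc[0]     # next-period return forecast
lower, upper = forecast.conf_int().iloc[0]  # 95% interval by default
obs_features = [point, lower, upper]        # appended to the agent's observation
```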
Reward Function Optimization: Beyond Simple Profits
Early versions of the bot used profit maximization as the sole reward signal. While effective in bull markets, this led to overtrading and catastrophic drawdowns during downturns.
To create a more robust strategy, three risk-adjusted reward metrics were tested alongside the original profit-based reward:
1. Sortino Ratio – Focused on Downside Risk
Unlike the Sharpe Ratio, which penalizes both upside and downside volatility, the Sortino Ratio only considers downside deviation.
$$ \text{Sortino Ratio} = \frac{R_p - R_f}{\sigma_d} $$
Where:
- $R_p$ = Portfolio return
- $R_f$ = Risk-free rate
- $\sigma_d$ = Downside standard deviation
This encourages aggressive participation during uptrends while protecting capital during crashes.
2. Calmar Ratio – Managing Maximum Drawdown
Maximum drawdown measures peak-to-trough loss—the worst-case scenario for any investor.
The Calmar Ratio uses this metric in place of volatility:
$$ \text{Calmar Ratio} = \frac{R_p - R_f}{\text{Max Drawdown}} $$
It rewards consistent growth and penalizes large losses—ideal for long-term sustainability.
3. Omega Ratio – Full Distribution Analysis
The most sophisticated of all, the Omega Ratio evaluates the entire return distribution above and below a threshold:
$$ \Omega(r) = \frac{\int_r^\infty (1 - F(x))dx}{\int_{-\infty}^r F(x)dx} $$
Higher values indicate greater upside potential relative to downside risk.
All three ratios were computed efficiently using the empyrical Python package, integrated directly into the reward function at each time step.
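As a rough sketch of how such a reward might look in code, assuming returns is a pandas Series of the agent's net-worth returns up to the current step (the original project's exact reward wiring may differ):

```python
# Risk-adjusted reward computed with empyrical at each time step.
import numpy as np
import pandas as pd
from empyrical import calmar_ratio, omega_ratio, sortino_ratio

def risk_adjusted_reward(returns: pd.Series, metric: str = 'sortino') -> float:
    if metric == 'sortino':
        reward = sortino_ratio(returns)
    elif metric == 'calmar':
        reward = calmar_ratio(returns)
    elif metric == 'omega':
        reward = omega_ratio(returns)
    else:
        reward = returns.sum()  # plain profit fallback
    # empyrical can return NaN/inf on degenerate inputs (e.g., no losses yet)
    return float(reward) if np.isfinite(reward) else 0.0
```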
Hyperparameter Tuning with Bayesian Optimization
Even the best architecture fails without optimal settings. Enter Optuna, a powerful hyperparameter optimization framework leveraging Bayesian optimization.
Instead of random or grid searches, Optuna uses:
- Tree-structured Parzen Estimators (TPEs) to model promising regions of the search space.
- Parallel execution to speed up convergence.
Key parameters tuned:
- n_steps: number of steps between policy updates (log-uniform: 16–2048)
- cliprange: PPO clipping range (uniform: 0.1–0.4)
- Learning rate, batch size, and network layer sizes
Each trial trained and evaluated the agent in simulation, returning negative average return as the loss (since Optuna minimizes objectives).
After thousands of trials on GPU-accelerated hardware, Optuna identified the optimal configuration—maximizing performance while minimizing overfitting.
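A minimal sketch of such a study follows; evaluate_agent is a hypothetical helper standing in for the train-and-simulate loop described above:

```python
# Optuna study over the PPO2 hyperparameters mentioned above.
import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        'n_steps': int(trial.suggest_float('n_steps', 16, 2048, log=True)),
        'cliprange': trial.suggest_float('cliprange', 0.1, 0.4),
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True),
    }
    mean_reward = evaluate_agent(params)  # hypothetical train-and-evaluate step
    return -mean_reward  # negate: this study minimizes its objective

study = optuna.create_study()  # minimizes by default
study.optimize(objective, n_trials=100)
print(study.best_params)
```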
Performance Comparison: Does It Beat the Market?
The final test compared agents trained under different reward schemes against benchmark strategies using unseen data (last 20% of hourly OHLCV data from CryptoDataDownload).
Baseline Strategies
- Buy and Hold (HODL): Passive investment.
- RSI Divergence: Buy low, sell high based on momentum shifts.
- SMA Crossover: Golden/death cross signals.
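For concreteness, here is a minimal pandas sketch of the SMA-crossover baseline; the window lengths are illustrative, not the article's exact settings:

```python
# Long after a golden cross (fast SMA above slow SMA), flat after a death cross.
import pandas as pd

def sma_crossover_positions(close: pd.Series, fast: int = 50, slow: int = 200) -> pd.Series:
    fast_sma = close.rolling(fast).mean()
    slow_sma = close.rolling(slow).mean()
    # 1 = hold a long position, 0 = stay in cash
    return (fast_sma > slow_sma).astype(int)
```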
Results
| Strategy | Return (%) |
|---|---|
| Buy & Hold | ~120% |
| RSI Divergence | ~180% |
| SMA Crossover | ~210% |
| Profit-Based Reward | 350% |
| Calmar/Omega Rewards | ~370–400% |
| Sortino-Based Reward | 850% |
The Sortino-optimized agent outperformed all others by a wide margin—demonstrating superior risk management and timing precision.
Visual inspection revealed:
- Well-timed entries and exits (marked by green/red triangles on the trade charts).
- Avoidance of major market dips.
- Minimal overtrading compared to Omega-driven agents.
Frequently Asked Questions (FAQ)
Q1: Can this bot work in live markets?
While trained on historical data, real-time deployment requires additional safeguards: latency handling, exchange API integration, slippage modeling, and continuous retraining. Future plans include testing on Coinbase Pro with multiple cryptocurrencies.
Q2: Is this considered financial advice?
No. This research is strictly educational. Past performance does not guarantee future results. Always conduct independent due diligence before investing.
Q3: How do you prevent overfitting?
Multiple strategies reduce overfitting risk:
- Train/test split (80/20).
- Out-of-sample validation.
- Use of generalizable reward functions (e.g., Sortino).
Still, live testing remains essential for validation.
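For reference, a sketch of the chronological split; shuffling is deliberately avoided so the held-out 20% stays strictly out-of-sample:

```python
# Chronological 80/20 split; `df` is the full preprocessed hourly DataFrame.
split = int(len(df) * 0.8)
train_df = df.iloc[:split]   # first 80% for training
test_df = df.iloc[split:]    # last 20% held out for evaluation
```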
Q4: Why choose Python for this project?
Python dominates quantitative finance due to rich libraries (pandas, numpy, tensorflow, optuna, empyrical) and strong community support—making rapid prototyping and iteration possible.
Q5: What’s next for AI-driven trading?
Next steps include:
- Multi-currency support (Ethereum, Litecoin).
- Real-time execution engine.
- Ensemble modeling and uncertainty estimation.
The goal: production-ready autonomous trading systems.
Final Thoughts
This experiment demonstrates that deep reinforcement learning, when enhanced with proper feature engineering, stationarity treatment, and intelligent reward design, can significantly outperform traditional trading strategies—even achieving an 850% return in backtesting.
Core advancements include:
- LSTM-powered policy networks.
- Stationary input preprocessing.
- 38+ engineered features + SARIMAX forecasts.
- Risk-aware rewards via Sortino, Calmar, and Omega ratios.
- Optuna-driven hyperparameter optimization.
While not yet production-ready, this system represents a major leap forward in algorithmic trading. As AI continues to evolve, so too will its impact on finance.
Note: All methods discussed are for educational purposes only. Cryptocurrency trading involves substantial risk. Manage your funds responsibly.