LLM-powered agents for financial trading fuse language understanding with reinforcement learning for sequential decision-making, yet benchmarks reveal that most LLM agents struggle to outperform simple buy-and-hold strategies in realistic market conditions.
Financial trading demands reasoning over multimodal data (price time series, fundamentals, news), sequential decision-making under uncertainty, and risk management. Three research threads address these challenges: FLAG-Trader fuses LLMs with gradient-based RL for policy optimization, StockBench provides a contamination-free benchmark for realistic multi-month trading evaluation, and multi-agent investment teams deploy collaborative agent architectures for portfolio management.1)2)3)
FLAG-Trader uses a partially fine-tuned LLM as the policy network within a reinforcement learning framework:
Architecture: A parameter-efficient fine-tuning (PEFT) module encodes market data into textual state representations fed to the LLM policy network. Only a subset of LLM parameters is updated to balance domain adaptation with preservation of pre-trained knowledge.
State Representation: Temporal market data and textual streams (news, reports) are jointly processed into unified inputs:
<latex>s_t = \text{Encode}(x_t^{price}, x_t^{fund}, x_t^{text})</latex>
Policy Optimization: The LLM serves as policy $\pi_\theta(a|s)$ and is trained via policy gradient:
<latex>\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \quad J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]</latex>
where $r_t$ captures trading rewards (returns, risk-adjusted metrics) and $\gamma$ is the discount factor.
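As a toy illustration of this update rule, a single REINFORCE-style gradient step for a linear-softmax policy might look like the sketch below. All names here are illustrative (this is not FLAG-Trader's actual implementation, which updates LLM parameters via PEFT); the point is only the shape of the $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ update with discounted returns.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta: np.ndarray, trajectory: list, alpha: float = 0.1,
                   gamma: float = 0.99) -> np.ndarray:
    """One REINFORCE update: theta <- theta + alpha * grad J(theta).

    theta: (n_features, n_actions) weights of a linear-softmax policy.
    trajectory: list of (state_features, action_index, reward) tuples.
    """
    # Discounted return from each step t: G_t = sum_k gamma^k r_{t+k}
    G, returns = 0.0, []
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (s, a, _), G_t in zip(trajectory, returns):
        probs = softmax(s @ theta)
        # grad log pi(a|s) for a linear-softmax policy
        dlog = np.outer(s, -probs)
        dlog[:, a] += s
        grad += G_t * dlog
    return theta + alpha * grad
```

After a step in which an action earned positive reward, the policy's probability of that action increases, which is the core mechanism the policy-gradient objective relies on.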
Key Result: A 135M-parameter open-source model with RL fine-tuning surpasses larger proprietary models (e.g., GPT-o1-preview) in cumulative return and Sharpe ratio.
StockBench evaluates LLM agents in realistic, multi-month stock trading environments:
<latex>\text{Sortino} = \frac{R_p - R_f}{\sigma_d}</latex>
where $R_p$ is portfolio return, $R_f$ is risk-free rate, and $\sigma_d$ is downside deviation.
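A quick numeric version of this ratio for daily returns (assuming the common $\sqrt{252}$ annualization convention; the function name and defaults are illustrative, not StockBench's exact scoring code):

```python
import numpy as np

def sortino(returns, risk_free: float = 0.0, periods: int = 252) -> float:
    """Annualized Sortino ratio: mean excess return over downside deviation."""
    excess = np.asarray(returns) - risk_free / periods
    # Downside deviation penalizes only negative excess returns
    downside = np.sqrt(np.mean(np.minimum(excess, 0.0) ** 2))
    return np.mean(excess) / max(downside, 1e-12) * np.sqrt(periods)
```

Unlike the Sharpe ratio, upside volatility does not reduce the score, which is why StockBench favors it for comparing agents whose return distributions are skewed.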
Key Finding: Most LLMs struggle to outperform buy-and-hold, revealing that strong static QA performance does not translate to effective trading behavior. Only select models (DeepSeek-V3 with lowest return variance, some GPT variants) show potential for higher risk-adjusted returns.
Collaborative multi-agent architectures deploy specialized roles for portfolio management:
<latex>w^* = \arg\max_w \frac{\mathbb{E}[R_w] - R_f}{\sqrt{\text{Var}(R_w)}} \quad \text{s.t.} \quad \sum_i w_i = 1, \; w_i \geq 0</latex>
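A minimal sketch of this optimization, assuming the closed-form tangency portfolio $w \propto \Sigma^{-1}(\mu - R_f)$ with negative weights clipped and renormalized as a crude long-only heuristic. This is an illustration of the objective, not any specific system's allocator, and the clip-and-renormalize step only approximates the exact constrained solution:

```python
import numpy as np

def long_only_tangency(mu: np.ndarray, cov: np.ndarray,
                       risk_free: float = 0.0) -> np.ndarray:
    """Approximate long-only max-Sharpe weights.

    Unconstrained tangency weights are proportional to inv(cov) @ (mu - rf);
    clipping negatives and renormalizing enforces w_i >= 0 and sum(w) = 1
    only heuristically.
    """
    raw = np.linalg.solve(cov, mu - risk_free)
    w = np.clip(raw, 0.0, None)
    if w.sum() == 0:
        w = np.ones_like(mu)  # fall back to equal weights
    return w / w.sum()
```

With a diagonal covariance, the asset with the higher expected excess return receives the larger weight, matching the intuition behind the objective above.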
The decision loop above can be sketched in code as follows (`parse_action` and `execute` are simplified stubs standing in for real order parsing and trade execution):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MarketState:
    prices: np.ndarray
    fundamentals: dict
    news: list[str]
    timestamp: str


class FLAGTrader:
    def __init__(self, llm_policy, risk_threshold: float = 0.05):
        self.policy = llm_policy
        self.risk_threshold = risk_threshold
        self.portfolio = {"cash": 100000.0, "holdings": {}}

    def encode_state(self, state: MarketState) -> str:
        """Render market data as the textual state fed to the LLM policy."""
        price_summary = f"Prices: {state.prices[-5:].tolist()}"
        news_summary = " | ".join(state.news[:3])
        return (
            f"{price_summary}\n"
            f"Fundamentals: {state.fundamentals}\n"
            f"News: {news_summary}"
        )

    def decide(self, state: MarketState) -> dict:
        encoded = self.encode_state(state)
        action = self.policy.generate(
            f"Market state:\n{encoded}\n"
            f"Portfolio: {self.portfolio}\n"
            f"Decision (buy/sell/hold with sizing):"
        )
        return self.parse_action(action)

    def parse_action(self, action: str) -> dict:
        # Simplified stub: a real parser would extract side and sizing
        # from the LLM's free-text response.
        tokens = action.strip().lower().split()
        side = tokens[0] if tokens else "hold"
        return {"side": side if side in ("buy", "sell") else "hold", "size": 0.0}

    def execute(self, action: dict, state: MarketState) -> float:
        # Simplified stub: a real implementation would update the portfolio
        # and return the realized daily P&L as a fractional return.
        return 0.0

    def calculate_sortino(self, returns: np.ndarray,
                          risk_free: float = 0.02) -> float:
        # Annualized Sortino: mean daily excess return over downside deviation
        excess = returns - risk_free / 252
        downside = np.sqrt(np.mean(np.minimum(excess, 0) ** 2))
        return np.mean(excess) / max(downside, 1e-8) * np.sqrt(252)

    def calc_max_drawdown(self, returns: np.ndarray) -> float:
        # Largest peak-to-trough decline of the cumulative equity curve
        equity = np.cumprod(1 + returns)
        peaks = np.maximum.accumulate(equity)
        return float(np.max((peaks - equity) / peaks))

    def backtest(self, market_data: list[MarketState]) -> dict:
        daily_returns = []
        for state in market_data:
            action = self.decide(state)
            pnl = self.execute(action, state)
            daily_returns.append(pnl)
        returns = np.array(daily_returns)
        return {
            "cumulative_return": np.prod(1 + returns) - 1,
            "max_drawdown": self.calc_max_drawdown(returns),
            "sortino_ratio": self.calculate_sortino(returns),
        }
```
| System | Metric | Finding |
|---|---|---|
| FLAG-Trader (135M) | Sharpe ratio | Outperforms GPT-o1-preview |
| FLAG-Trader | Cumulative return | Best across trading scenarios |
| StockBench | Buy-and-hold comparison | Most LLMs underperform |
| StockBench | Best performer | DeepSeek-V3 (lowest variance) |
| Multi-agent teams | Portfolio management | Role specialization improves risk-adjusted returns |