Trading Made Botter: Essential Metrics for Evaluating a Backtest: Understanding and Analyzing Your Strategy

Wondering how to analyze your trading strategies without losing your mind? Perfect, because today, we're diving into the fascinating (and sometimes frustrating) world of backtesting metrics. No confusing jargon or endless math formulas here—promise! It'll be clear, practical, and, where possible, intuitive.

Before we dive in, if you're a beginner, I highly recommend checking out my articles on the basics of backtesting (so you don’t feel like you’ve walked into the middle of a movie) and on step-by-step backtesting, where I detail the process to obtain the numbers we'll be analyzing today. Because yes, before you judge a strategy, you need to understand how to get the results.

Now that the stage is set, picture this: you’ve followed every step of backtesting, carefully configured your parameters, launched your simulation, and here’s what you get:

------ Statistics ------

Number of trades: 19
Annualized Arithmetic Mean: 3.68%
Annualized Geometric Mean: 3.49%

Win Rate: 53.15%
Sharpe Ratio: -0.04
Sortino Ratio: -0.06

Maximum Drawdown: -8.63%
Maximum Drawdown Duration: 284 days
Value at Risk (1%): -1.05%
Conditional Value at Risk (1%): -1.50%

------------------------

PAUSE! Do you invest in this strategy? 🤔

With these results in hand, a few key questions come to my mind:

Will I actually make money with this strategy?
Will I have to sell my house to cover the losses?
Would my grandmother—who thinks Bitcoin sounds like a sci-fi movie—invest in this?

That’s exactly what we’re going to explore together. The goal is to see how to interpret these metrics so you can make an informed decision: tweak, improve, or abandon the strategy. Get ready to take a critical look and draw insights that can truly make a difference in your backtests.

Ready? Let’s get serious!

The Basics of Performance Metrics: Contextualizing Results

The first metrics I look at provide the context for my strategy—essentially the foundational information. For this, I focus on expectancy, geometric mean, and the number of trades.

Number of Trades

The goal here is to answer the question: “Is my strategy actually trading?” Yes, I know, it might sound obvious, but imagine this scenario: you’ve been working on your ML-powered bot project for a week, you’ve run an initial backtest with 80,000 data points split 50/50, and the results are just mind-blowing. You’re over the moon.

Then you think, “Alright, let’s test this on a shorter timeframe with a walk-forward approach.” You run a backtest with an 80/20 split on 200 data points, and… disaster. The results are terrible. You throw the strategy away, convinced it’s a lost cause.😔

Two weeks later, you revisit some ideas and start thinking, “Wait, wasn’t that a bit weird?” So you decide to take another look at all the metrics, and by pure chance, you stumble across the number of trades taken during your test periods. And that’s when it hits you. You realize your model simply didn’t have enough data to learn properly, and it took the easy way out by barely making any trades.😦

Yeah, it’s a bit ridiculous, but honestly, all of this could’ve been avoided if I’d just compared the number of trades taken during the training and testing phases. The moral of the story? A good backtest isn’t just about the raw results. Make sure you check the number of trades to confirm that your strategy actually had the opportunity to perform.
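As a rough illustration of that check, here is a minimal sketch that counts trades in a series of positions. The position encoding (1 = long, -1 = short, 0 = flat), the function name, and the sample data are all assumptions made for this example, not the output of a real backtest.

```python
def count_trades(positions):
    """Count entries: a trade starts whenever the position
    changes to a new non-zero value."""
    trades = 0
    previous = 0
    for current in positions:
        if current != 0 and current != previous:
            trades += 1
        previous = current
    return trades


# Hypothetical positions, one value per bar (1 = long, -1 = short, 0 = flat)
train_positions = [0, 1, 1, 0, -1, -1, 0, 1, 0, 0, -1, 0]
test_positions = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]

train_trades = count_trades(train_positions)  # 4
test_trades = count_trades(test_positions)    # 1

# Comparing trades per bar makes periods of different lengths comparable
print(train_trades / len(train_positions), test_trades / len(test_positions))
```

If the test phase trades far less often per bar than the training phase, that is a hint to look at the model before trusting (or trashing) the results.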

Expectation

Expectation —or expected value— is the average value of a random variable, weighted by its probability. In simpler terms, it answers the question: if I repeat this experiment over and over, what will the average result look like for each attempt?

Practically speaking, expectation is calculated for things like winning or losing trades, returns, and many other datasets. For example, if I calculate the expectation for 5-minute returns, I’m asking myself: if I repeat another 5-minute candlestick, what will the average return of my strategy be?

If the expectation is positive, it means you make money on average when the experiment is repeated many times (which is not the same as winning most of the time). Conversely, if the expectation is negative, you lose money on average. But here’s the catch: before calculating expectation, you must deduct all fees (commissions, spreads, slippage, etc.); otherwise, the calculation will be biased.

Mathematically, the formula is straightforward:

$\text{Expectation} = \frac{\sum_{i=1}^{n} X_i}{n}$

You sum up all the values and then divide by the total number of values. That’s essentially the average. While this metric is fundamental, it has two notable limitations:

All values are treated equally. That feels a bit counterintuitive, doesn’t it? The most recent values (which are often more relevant to the current situation) should carry more weight than older ones, but the formula doesn’t make any distinction.
Outliers have disproportionate influence. Every value enters the sum at face value, so a single extreme result (very high or very low) can skew the average, sometimes distorting it significantly.
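Here is a minimal sketch of the calculation, fees included. The returns and the flat fee per trade are made-up numbers chosen only to show how costs eat into the expectation.

```python
from statistics import mean


def expectation(returns):
    """Arithmetic mean of the returns (the formula above)."""
    return mean(returns)


# Hypothetical per-trade returns, before costs
gross_returns = [0.012, -0.004, 0.007, -0.010, 0.003]

# Assumed flat cost per trade (commission + spread + slippage)
fee_per_trade = 0.001
net_returns = [r - fee_per_trade for r in gross_returns]

gross_expectation = expectation(gross_returns)
net_expectation = expectation(net_returns)

print(f"Gross expectation: {gross_expectation:.4%}")
print(f"Net expectation:   {net_expectation:.4%}")
```

In this toy example the fee removes more than half of the expectation, which is exactly why costs have to be deducted first.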

One final note that must not be overlooked: as R. Vince explains in his book The Mathematics of Money Management:

"The difference between a negative expectation and a positive one is the difference between life and death."

In other words, if you truly want to make money, your expectation must be positive. Otherwise, it’s no different than playing against the house at a casino.😌

Geometric Mean

The geometric mean addresses a key limitation of the arithmetic mean: it accounts for the fact that returns are not simply added together but are multiplied across periods. In other words, it better represents the impact of compounded returns (i.e., your capital generates returns on returns).

To calculate the geometric mean, you take the n-th root of the product of all values, where n is the total number of periods. The formula is as follows:

$\text{Geometric Mean} = \left( \prod_{i=1}^{n} (1 + X_i) \right)^{\frac{1}{n}} - 1$

This method "smooths" the results: it gives the single constant return per period that, once compounded, leads to the same final capital as the actual sequence of gains and losses.

One of the greatest advantages of this method is that it accounts for significant drawdown periods. In other words, it reminds us that recovering from losses requires larger gains. By using the geometric mean, you answer the question:
"If I repeat this same 5-minute candlestick, what will my average percentage gain be over the long term?"

That said, there is one small caveat: the geometric mean cannot be computed directly on raw returns, because they can be negative. That is why we add 1 to each return (e.g., a 50% return becomes 1.50) before multiplying, and subtract 1 at the end. Even so, a return of -100% turns the whole product into zero (you have lost all your capital), and anything worse than -100% makes the calculation undefined.

Example: The Impact of the Geometric Mean

Let’s take a simple example to understand the difference between the two means:

You start with €100.
You gain 50% (you now have €150).
You lose 33.3% (you’re back to €100).

The arithmetic mean suggests an average return of about 8.3% per period ((50% - 33.3%) / 2), but the geometric mean is 0%! Your portfolio is back to its starting point, despite a +50% gain followed by a -33.3% loss. This demonstrates that the geometric mean more accurately reflects the real return of a strategy, especially in scenarios with successive gains and losses.

The arithmetic mean answers the question:
"What is the average return of a single period, taken on its own?"
The geometric mean answers the question:
"What constant return per period matches my real final capital, once compounding is taken into account?"

Over the long term, especially for strategies where compounding is involved, the geometric mean is a better metric for evaluating the real performance of your portfolio.
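To check the €100 example yourself, here is a small sketch of both means. The function names are mine, and the loss is written as exactly one third so that the portfolio lands back on €100.

```python
import math


def arithmetic_mean(returns):
    """Plain average of the per-period returns."""
    return sum(returns) / len(returns)


def geometric_mean(returns):
    """Constant per-period return that compounds to the same final capital."""
    if any(r < -1 for r in returns):
        raise ValueError("a return below -100% makes the geometric mean undefined")
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# +50% followed by -33.3% (exactly one third)
returns = [0.50, -1 / 3]

print(f"Arithmetic mean: {arithmetic_mean(returns):.2%}")
print(f"Geometric mean:  {geometric_mean(returns):.2%}")
```

The arithmetic mean comes out positive while the geometric mean is zero, matching the fact that the capital has not moved.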

Performance Metrics: Evaluating Gains, Not Just Results

Once we've ensured that the strategy works, with a positive expectation and geometric mean, we can move on to metrics that analyze gains and, more specifically, how those results are achieved.

Win Rate

Before diving into a deeper analysis of my strategy, I first want to understand how often I win versus how often I lose. This is where the famous win rate comes into play.

The win rate (WR) is simply the percentage of winning trades compared to the total number of trades executed. If I have a low win rate but remain profitable, it means that my wins are rare but large, while my losses are small and frequent. Conversely, if my WR is high, I win often, but my gains are likely small, and my losses are rare but potentially larger.

Here’s the formula for WR:

$\text{Win Rate} = \frac{\text{Number of Winning Trades}}{\text{Total Number of Trades}} \times 100$

An interesting aspect of the WR is its direct influence on risk management. For easier risk management, a low WR is often preferred because small losses are much easier to recover from than a sudden large loss, which could wipe out a significant portion of the capital. Statistically, the magnitude and frequency of small losses are better understood because they occur more frequently and are thus well-documented in our trading history. On the other hand, rare and significant losses introduce greater uncertainty—we don't know when they will occur or their exact impact.

This means that even though a low WR may seem discouraging at first, it can actually be more predictable and easier to manage in terms of risk, especially if each loss is relatively small. On the other hand, a high WR with large losses can make a strategy much riskier, as periods of significant losses may be unpredictable and harder for the capital to absorb.😊
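To illustrate the "low win rate but still profitable" case, here is a minimal sketch. The list of trade results is invented for the example (in arbitrary units of profit and loss), and counting break-even trades as non-winners is my own assumption.

```python
def win_rate(trade_pnls):
    """Percentage of trades with a strictly positive result."""
    wins = sum(1 for pnl in trade_pnls if pnl > 0)
    return 100 * wins / len(trade_pnls)


# Hypothetical trade results: many small losses, two large wins
trade_pnls = [-1, -1, -1, 5, -1, -1, 4, -1, -1, -1]

wins = [pnl for pnl in trade_pnls if pnl > 0]
losses = [pnl for pnl in trade_pnls if pnl < 0]

print(f"Win rate:     {win_rate(trade_pnls):.1f}%")  # 20.0%
print(f"Average win:  {sum(wins) / len(wins):.2f}")
print(f"Average loss: {sum(losses) / len(losses):.2f}")
print(f"Total result: {sum(trade_pnls)}")
```

Only two trades out of ten win here, yet the total stays positive because each win is several times larger than a typical loss.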

Sharpe Ratio

The Sharpe Ratio is an incredibly powerful metric for evaluating the return of an investment relative to the risk it carries. It is calculated as follows:

$\text{Sharpe Ratio} = \frac{\mathbb{E}[R_a - R_f]}{\sigma_a}$

In this formula, the numerator $\mathbb{E}[R_a - R_f]$ represents the excess return: the difference between the expected return of the asset ($R_a$) and the risk-free rate ($R_f$), such as government bond yields. This excess return is essentially the premium you earn for taking on additional risk compared to a risk-free investment. The denominator $\sigma_a$ measures the volatility or risk of the asset. In simple terms, the Sharpe Ratio evaluates how much excess return you get for each unit of risk taken. Because both terms are expressed in the same unit, the ratio itself has no unit.

For instance, a Sharpe Ratio of 1 means that for every unit of risk, you are earning one unit of excess return. That’s a good sign, but the higher the ratio, the better.

One of the greatest advantages of the Sharpe Ratio is that it allows you to compare investments with different risk profiles. It helps answer a critical question: Is a high return the result of excessive risk-taking, or is it genuinely sustainable in the long run? A high Sharpe Ratio indicates that an asset generates consistent excess returns, even during periods of market volatility. This makes it a valuable tool for portfolio allocation decisions and justifying investment choices.

In essence, this metric helps you not only assess the quality of an investment but also determine whether the risk taken is justified by the returns achieved.
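As a sketch of the formula, here is one way to compute it from a list of periodic returns. Two things are assumptions on my part: the risk-free rate is treated as a constant per period, and the result is annualized by multiplying by the square root of the number of periods per year, a common convention that the formula above does not impose.

```python
import math
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Mean excess return divided by its standard deviation, annualized.

    risk_free_rate is expressed per period (e.g. per day for daily returns).
    """
    excess = [r - risk_free_rate for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


# Hypothetical daily returns
daily_returns = [0.01, -0.01, 0.02, 0.0]

print(sharpe_ratio(daily_returns, risk_free_rate=0.0))
# With a risk-free rate above the average return, the ratio turns negative
print(sharpe_ratio(daily_returns, risk_free_rate=0.01))
```

The second call shows how a strategy with a positive average return can still end up with a negative Sharpe Ratio: it simply earns less than the risk-free rate.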

I could easily write an entire article on the Sharpe Ratio, but to save you time, I recommend this detailed video: The Sharpe Ratio Explained. It covers everything you need to know about mastering this metric, all clearly explained!

Sortino Ratio

The Sortino Ratio is a variation of the Sharpe Ratio that specifically focuses on downside volatility. It is calculated as follows:

$\text{Sortino Ratio} = \frac{\mathbb{E}[R_a - R_f]}{\sigma_{a,down}}$

In this formula, the numerator is the excess return, just like the Sharpe Ratio. However, instead of considering the total volatility of the asset ($\sigma_a$), the denominator of the Sortino Ratio only takes into account the downside volatility ($\sigma_{a,down}$). This means it only measures the impact of losses and ignores positive fluctuations. Why? Because, for an investor, the problem is primarily the negative volatility (the losses), rather than the gains.

Let’s imagine a strategy with strong gains: volatility will increase, and if we only used the Sharpe Ratio, we might think the strategy is becoming less efficient. However, the situation is actually favorable for the investor. The Sortino Ratio corrects this bias by considering only negative returns, thus providing a better picture of the real risk the investor is exposed to.👌

That being said, it’s important to note that in symmetric distributions (like the normal distribution), upside and downside volatility go hand in hand: high positive volatility usually comes with high negative volatility. In other words, even though the Sortino Ratio gives a more accurate view of risk, it should not be used in isolation, as there may still be collateral effects tied to positive fluctuations.
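Here is a minimal sketch of the Sortino Ratio. Downside volatility has several definitions in practice; this version takes the square root of the average squared negative excess return over all periods, which is one common convention and an assumption on my part. The annualization factor is the same assumption as for the Sharpe Ratio.

```python
import math
from statistics import mean


def sortino_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Mean excess return divided by downside deviation, annualized."""
    excess = [r - risk_free_rate for r in returns]
    # Positive excess returns count as zero: only losses add to the risk
    downside_squares = [min(0.0, e) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_squares) / len(excess))
    if downside_dev == 0:
        raise ValueError("no downside returns: the ratio is undefined")
    return mean(excess) / downside_dev * math.sqrt(periods_per_year)


# Hypothetical daily returns
daily_returns = [0.01, -0.01, 0.02, 0.0]

print(sortino_ratio(daily_returns))
```

Making the +2% day even bigger would raise this ratio, whereas it would also inflate the volatility used by the Sharpe Ratio.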

Risk Measures: Do Not Underestimate the Danger

Here, we ask a crucial question: how risky is my strategy? These metrics are the starting point for considering the risk management of a strategy, but they do not provide a complete analysis of it. If the strategy seems promising, this is when you’ll need to really dive into the concepts of risk management.

Maximum Drawdown

The Maximum Drawdown (MDD) is simply the largest loss the strategy experienced relative to its previous peak. This loss is often expressed as a percentage to make it easier to interpret. MDD is a crucial metric because it represents the worst case observed in the backtest (the future can always be worse), and it forms the foundation of concepts like modern money management (according to R. Vince).

A practical tip: to maximize geometric return, you can calculate the optimal portion of your stake to reinvest using the following formula:

$f = \frac{\text{Expectation}}{\text{MDD}}$

This approach is inspired by the Kelly criterion, and I plan to write a full article on money management in the near future. In the meantime, for those who can't wait, I highly recommend checking out the book "The Mathematics of Money Management: Risk Analysis Techniques for Traders" by R. Vince. I'm a huge fan of his work!🤩
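To make the definition of MDD concrete, here is a minimal sketch that computes it from an equity curve. The curve is invented for the example, and the result is reported as a negative number to match the statistics at the top of this article.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a negative fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)        # highest capital seen so far
        drawdown = value / peak - 1    # 0 at a new peak, negative below it
        worst = min(worst, drawdown)
    return worst


# Hypothetical capital after each period
equity = [100, 110, 99, 105, 120, 102, 125]

print(f"Maximum drawdown: {max_drawdown(equity):.2%}")
```

The curve falls from 110 to 99 (-10%) and later from 120 to 102 (-15%); only the deeper of the two is kept.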

Maximum Duration in Drawdown

Another key element to understand is the maximum duration in drawdown. Essentially, this allows you to mentally prepare for periods where you won’t be in profit. It helps you avoid overreacting, overtrading, or panicking during these phases. The goal is, of course, to minimize these periods so that the strategy remains as operational as possible.

It's important to note that the duration of a drawdown is not necessarily a negative aspect in itself. For example, it is entirely possible to combine a strategy with long drawdowns alongside other strategies to reduce this metric overall. So, it's not an end in itself, but rather a factor to consider when constructing a diversified portfolio.😉

Moreover, there are money management techniques that can be applied to strategies with prolonged drawdowns. For instance, by running tests or performing correlation analyses, you can determine when to increase or decrease your position size. You can also trade directly on the equity curve (the curve of your capital), always relying on statistical metrics or calculating moving averages to generate signals.
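The duration can be measured in the same spirit. This sketch counts the longest run of consecutive periods spent below the previous peak. Counting in periods rather than calendar days, and the sample curve, are assumptions for the example.

```python
def max_drawdown_duration(equity_curve):
    """Longest run of consecutive periods spent below the previous peak."""
    peak = equity_curve[0]
    current = 0
    longest = 0
    for value in equity_curve[1:]:
        if value >= peak:
            peak = value       # new high: the drawdown is over
            current = 0
        else:
            current += 1       # still under water
            longest = max(longest, current)
    return longest


# Hypothetical capital after each period
equity = [100, 110, 99, 105, 108, 112, 104, 113]

print(max_drawdown_duration(equity))  # 3
```

With daily data the result is a number of trading days, so it needs converting if you want calendar days.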

VaR / CVaR

The VaR (Value at Risk) and CVaR (Conditional Value at Risk) metrics are two powerful tools I use for deeper analysis. These metrics are part of the toolkit I reserve for evaluating extreme risk, complementing the other metrics I have already mentioned. I won’t go into the mathematical details or the deep intuition behind these metrics here, as they will be covered in dedicated articles.

The VaR answers the following question: "Given a certain probability, what is the maximum loss I can expect?" In other words, it is a way of quantifying the maximum risk of a strategy at a specified confidence level. For example, a 95% VaR is the loss that should not be exceeded 95% of the time: only the worst 5% of outcomes are expected to be worse. This is the same thing as a VaR at the 5% level, the notation used in the statistics above.

To go even further, the CVaR is often used, which focuses on extreme cases. In fact, CVaR calculates the expectation of the losses that occur beyond the VaR, giving us a more precise idea of the average losses in the worst-case scenarios.

Given a user-defined confidence level, these two metrics answer two crucial questions:

What is the maximum expected loss for a given probability?
What is the average expected loss in the worst cases (beyond the VaR)?

Here are the formulas underlying these two concepts:

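$\text{VaR}_{\alpha}(a) = -\inf \{ x \in \mathbb{R} \mid P(L_a \leq x) \geq \alpha \}$

$\text{CVaR}_{\alpha}(a) = -\mathbb{E}[L_a \mid L_a \leq -\text{VaR}_{\alpha}(a)]$

In these formulas, $L_a$ represents the losses associated with the asset or strategy in question, typically measured as the percentage change in price or strategy performance (so a loss is a negative value), and $\alpha$ is the tail probability (for example 1% or 5%). Because of the leading minus sign, both formulas return the size of the loss as a positive number; the statistics at the top of this article report the same quantities as negative returns.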

In a standard Gaussian distribution (mean 0, standard deviation 1), the VaR at the 5% level sits at -1.65 standard deviations: there is a 95% chance that the result stays above this threshold, and a 5% chance that the loss is worse. This represents an extreme, but still relatively likely, event. CVaR at 5% goes a step further, calculating the average loss in the cases that exceed the VaR, which is about -2.06 standard deviations.

In a Gaussian distribution, about 95% of events fall within the -2 to +2 standard deviation range (±2 $\sigma$). However, in finance, returns often don't follow a normal distribution; they have "fatter tails" leading to more extreme events. This is why VaR and CVaR, while useful, can underestimate the real risk of rare but catastrophic events, especially when they are computed under a Gaussian assumption or from a short history.
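One way to avoid the Gaussian assumption is to read VaR and CVaR straight from the historical returns. The sketch below does this in the simplest possible way: it sorts the returns, keeps the worst fraction, and reports both metrics as negative returns, like the statistics table at the top. The rounding rule for the tail size and the sample data are my own assumptions, and other quantile conventions exist.

```python
from statistics import mean


def historical_var_cvar(returns, alpha=0.05):
    """Empirical VaR and CVaR at tail probability alpha, as (negative) returns."""
    ordered = sorted(returns)                   # worst returns first
    k = max(1, round(alpha * len(ordered)))     # number of observations in the tail
    tail = ordered[:k]
    var = tail[-1]      # the least bad return of the worst alpha share
    cvar = mean(tail)   # average return inside that tail
    return var, cvar


# Hypothetical daily returns (20 observations)
daily_returns = [
    0.004, -0.012, 0.007, 0.001, -0.030, 0.010, -0.005, 0.003, 0.006, -0.020,
    0.002, 0.008, -0.001, 0.005, -0.008, 0.009, 0.000, -0.003, 0.011, 0.004,
]

var_10, cvar_10 = historical_var_cvar(daily_returns, alpha=0.10)
print(f"VaR (10%):  {var_10:.2%}")
print(f"CVaR (10%): {cvar_10:.2%}")
```

With only 20 observations the tail holds just two values, which shows how little data these metrics rest on when the history is short.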

Conclusion

Evaluating a trading strategy is not just about looking at the gross return. It's about understanding the overall performance and considering the associated risks to determine whether this strategy can realistically fit into a broader portfolio. Analyzing the metrics helps shed light on the hidden aspects of a strategy's performance.

To summarize what we've covered, we can analyze our results:

With only 19 trades, the sample size is far too small to determine the statistical significance of any metric. This is not enough to draw reliable conclusions.
Expectation and geometric mean show a positive overall return, but the first remains low (below 4%), suggesting that the gains, while positive, are far from exceptional.
The win rate of 53% is fairly solid, but it is not enough to offset the low return. The Sharpe ratio and Sortino ratio are slightly negative, which, given their formulas, means the average return does not even cover the risk-free rate: the risk taken is not justified by an adequate return.
On the other hand, the maximum drawdown of -8.63% remains acceptable, but the maximum drawdown duration of 284 days (nearly half the time) is concerning, indicating that the strategy can experience long periods without gains.
The VaR at 1% is reasonable (-1.05%), meaning there is only a 1% chance of losing more than 1.05% per day. If an extreme loss occurs, the CVaR shows that the average loss would be around -1.5%, which is also reasonable.

In conclusion, the strategy in question does not yet seem viable in the short term because it lacks the profitability to justify its risk. However, the metrics are close to those of a strategy that could become solid with more trades and better optimization. The risk metrics are also reasonable, which is often a good sign when the return is still low. If I were to achieve these kinds of results with more data, I would be optimistic, as it would provide a solid foundation on which to improve the strategy and potentially make it more profitable.

Ultimately, I encourage you to apply these analyses to your own strategies or those I have developed during the train-test in the article Backtest Step by Step. One crucial aspect to keep in mind is to monitor the evolution of these metrics over time, as they can fluctuate depending on market conditions. Furthermore, feel free to explore Bayesian methods to refine your predictions on the future performance of your KPIs.

Don’t hesitate to comment, share, and most importantly, code!
I wish you an excellent day and lots of success in your trading projects!
Cheers, and see you very soon! ✌️

References:

The Mathematics of Money Management: Risk Analysis Techniques for Traders, Ralph Vince, 1992
The Sharpe Ratio Explained (by a quant trader), Wall Street Quant, YouTube, 2024
