Predictor Discovery

Overview

Predictor discovery is a foundational capability in Merlin that helps identify which features have strong statistical relationships with strategy performance. Using statistical methods including Cramér's V and the MIFS (Mutual Information based Feature Selection) model, Merlin quantifies the association between potential predictors and trading outcomes, helping identify the features with the most predictive power.

This process helps to pre-select features for trading models to reduce the feature space from potentially hundreds of variables to a manageable set of predictive features.

How Predictor Discovery Works

The predictor discovery workflow in Merlin follows these steps:

  1. Feature Extraction: Calculate features from option chain data via MesoSim
  2. Target Calculation: Determine trading outcomes (profit/loss, win/loss)
  3. Feature Binning: Group continuous features into categories for analysis
  4. Statistical Measurement: Measure the statistical association between each feature and the target using Cramér's V or MIFS
  5. Feature Ranking: Sort features by their predictive strength

Cramér's V Statistical Test

Cramér's V is a statistical measure of association between two categorical variables, derived from the chi-squared statistic. It ranges from 0 (no association) to 1 (perfect association).

For trading strategy development, Cramér's V helps determine which features most strongly relate to strategy performance by:

  1. Converting continuous features into categorical bins
  2. Cross-tabulating each feature's groups (bins or tails) against the target's groups
  3. Calculating the chi-squared statistic from this contingency table
  4. Normalizing the result to produce the Cramér's V coefficient

A higher Cramér's V value indicates a stronger relationship between the feature and trading outcomes.
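
For illustration, the same calculation can be reproduced in a few lines of Python with pandas and SciPy. This is a minimal sketch of the statistic itself, not Merlin's internal implementation, and the feature and PnL series below are random placeholder data.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(feature_bins, target_bins):
    # Cross-tabulate the binned feature against the binned target
    table = pd.crosstab(feature_bins, target_bins)
    chi2, _, _, _ = chi2_contingency(table)               # chi-squared statistic
    n = table.to_numpy().sum()                            # number of observations
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))   # normalize to [0, 1]

# Placeholder data: a continuous feature binned into quintiles, PnL into 3 bins
feature = pd.Series(np.random.randn(500))
pnl = pd.Series(np.random.randn(500))
print(cramers_v(pd.qcut(feature, 5), pd.qcut(pnl, 3)))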

MIFS: Advanced Feature Selection

The MIFS (Mutual Information based Feature Selection) model provides an advanced alternative to Cramér's V for feature selection. MIFS offers several advantages:

  • Information Theory Based: Uses mutual information to measure predictive relationships, capturing both linear and nonlinear dependencies
  • Redundancy Minimization: Systematically selects features that add unique predictive information rather than just individual strength
  • Statistical Validation: Includes Monte Carlo Permutation Testing to validate statistical significance of each selected feature
  • Iterative Selection: Builds feature sets incrementally, ensuring each addition provides meaningful predictive value

MIFS also returns the respective Cramér's V values for each selected feature, allowing you to see both the individual and combined predictive power.
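
The sketch below illustrates the greedy "relevance minus redundancy" idea behind MIFS, using scikit-learn's mutual_info_score. It is a conceptual illustration only: Merlin's MIFS model adds Monte Carlo Permutation Testing and Cramér's V reporting on top of this kind of selection, and the feature names and data below are placeholders.

import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

def mifs_select(binned, target, k=5, beta=0.5):
    # Greedily pick k binned features that are informative about the target
    # while penalizing redundancy with the features already selected.
    selected = []
    candidates = list(binned.columns)
    for _ in range(min(k, len(candidates))):
        def mifs_score(col):
            relevance = mutual_info_score(binned[col], target)
            redundancy = sum(mutual_info_score(binned[col], binned[s]) for s in selected)
            return relevance - beta * redundancy
        best = max(candidates, key=mifs_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Placeholder usage: integer bin codes for 8 features and a 3-bin target
rng = np.random.default_rng(0)
binned = pd.DataFrame(rng.integers(0, 2, size=(500, 8)),
                      columns=[f"feature_{i}" for i in range(8)])
target = rng.integers(0, 3, size=500)
print(mifs_select(binned, target, k=3))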

When to use each method

Use Cramér's V when:

  • You need quick feature screening
  • You want to assess individual feature strength

Use MIFS when:

  • You want statistical validation of feature selection
  • You are working with complex feature sets where interactions matter
  • You are preparing features for SETS model optimization

Command Line Usage

merlin discover-predictors <strategy-json> [--data-coll-backtest-id=MESOSIM_BACKTEST_ID]
[--add-ivs] [--add-greeks] [--add-prices] [--add-derived-metrics]
[--add-user-vars] [--add-feature-transforms] [--add-csv=CSV_FILE_PATH]
[--feature-selection=METHOD]
[--feature-grouping=INDICATOR_GROUPING] [--bins-or-tails=BINS_OR_TAILS]
[--target-grouping=TARGET_GROUPING] [--target-bins=TARGET_BINS]
[--start=START_DATE] [--end=END_DATE]
[--result-dir=RESULT_DIR]

Example

This command discovers predictors for the Boxcar strategy, including implied volatility and Greek metrics, using the top and bottom 10% of values for feature binning.

merlin discover-predictors configs/strategies/boxcar-strategy.json --add-ivs --add-greeks --feature-selection=cramers_v --feature-grouping=tails --bins-or-tails=0.1

Configuration Parameters

Feature Selection

Parameter                     Description
--add-ivs                     Include implied volatility features
--add-greeks                  Include option Greeks features
--add-prices                  Include price-based features
--add-derived-metrics         Include derived and ratio metrics
--add-user-vars               Include user variables from the strategy definition file
--add-feature-transforms      Apply transformations to features
--add-csv=CSV_FILE_PATH       Include features from a CSV file
--feature-selection=METHOD    Feature selection method: "cramers_v" or "mifs"

Feature Selection Configuration

Parameter            Description
--feature-grouping   How to group features: "bins" (equal-sized bins) or "tails" (extremes)
--bins-or-tails      Number of bins, or fraction of values in each tail (0-1)
--target-grouping    How to group the target: "bins" (quantiles) or "sign" (positive/negative)
--target-bins        Number of target bins when using "bins" grouping

Date Range

Parameter   Description
--start     Start date for analysis (YYYY-MM-DD)
--end       End date for analysis (YYYY-MM-DD or "now")

Output

Parameter      Description
--result-dir   Directory to save analysis results

Feature Grouping Methods

Merlin supports two primary methods for grouping continuous features into categories:

Bins Grouping

Divides the feature's range into equal-sized bins:

  • Creates N equally populated bins across the full range
  • A higher number of bins improves accuracy, but may lead to too many ties in each group
  • Example: 5 bins would create quintiles (20% of data in each bin)
merlin discover-predictors <strategy-json> --feature-grouping=bins --bins-or-tails=5
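
For a rough picture of what bins grouping does, the pandas snippet below splits a continuous series into 5 equally populated bins; the feature name is taken from the example output further down, and the values are random placeholder data.

import numpy as np
import pandas as pd

feature = pd.Series(np.random.randn(1000), name="entry_underlying_hv")  # placeholder data
bins = pd.qcut(feature, q=5, labels=False, duplicates="drop")           # quintiles, ~20% of rows each
print(bins.value_counts().sort_index())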

Tails Grouping

Focuses on the extreme values at both ends of the distribution:

  • Creates 3 groups: bottom X%, middle, top X%
  • The middle range is ignored. Power is in the tails!
  • Example: 0.1 tails would use bottom 10%, middle 80%, top 10%
merlin discover-predictors <strategy-json> --feature-grouping=tails --bins-or-tails=0.1
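
The snippet below sketches the same idea in pandas: label the bottom 10% and top 10% of a feature's values and lump everything else into a middle group (random placeholder data).

import numpy as np
import pandas as pd

feature = pd.Series(np.random.randn(1000))              # placeholder data
lo, hi = feature.quantile(0.1), feature.quantile(0.9)   # 10% tail cut-offs
groups = pd.Series(
    np.select([feature <= lo, feature >= hi], ["bottom", "top"], default="middle"),
    index=feature.index,
)
print(groups.value_counts())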

Target Grouping Methods

Similarly, the target variable (the position's realized PnL) can be grouped in two ways:

Bins Grouping

Divides the target into equal-sized quantiles:

  • Creates N equally populated bins of performance outcomes
  • Good for strategies with varied performance distributions
  • Example: 3 bins would split outcomes into winners, negligible winners/losers, and losers
merlin discover-predictors <strategy-json> --target-grouping=bins --target-bins=3

Sign Grouping

Divides the target based on positive/negative outcomes:

  • Creates 2 groups: losing trades and winning trades
  • Good for simple win/loss analysis
  • Simplifies analysis to focus on directional correctness
merlin discover-predictors <strategy-json> --target-grouping=sign
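
Both target groupings can be reproduced in a few lines of pandas, as sketched below with random placeholder PnL values.

import numpy as np
import pandas as pd

pnl = pd.Series(np.random.randn(500))                                  # placeholder realized PnL
sign_groups = np.where(pnl > 0, "win", "loss")                         # --target-grouping=sign
bin_groups = pd.qcut(pnl, q=3, labels=["worst", "middle", "best"])     # --target-grouping=bins --target-bins=3
print(pd.Series(sign_groups, index=pnl.index).value_counts())
print(bin_groups.value_counts())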

Suggested configuration

When using Cramér's V, start with tails grouping at 10% on each side for the features and 3 bins for the target:

merlin discover-predictors <strategy-json> --feature-selection=cramers_v --feature-grouping=tails --bins-or-tails=0.1 --target-grouping=bins --target-bins=3

When using MIFS, set both the feature and the target grouping to bins, with 2 and 3 bins respectively:

merlin discover-predictors <strategy-json> --feature-selection=mifs --feature-grouping=bins --bins-or-tails=2 --target-grouping=bins --target-bins=3

Interpreting Results

Cramér's V

Merlin outputs a ranked list of features based on their Cramér's V scores. For example:

   $ merlin discover-predictors --feature-selection=cramers_v  --add-ivs --add-greeks --add-derived-metrics merlin-configs/strategies/boxcar-strategy.json

CramersV pValue
entry_leg_pcs_short_vega 0.628 0.0001
entry_leg_pds_long_wvega 0.617 0.0001
entry_leg_pcs_short_wvega 0.612 0.0001
entry_underlying_price 0.612 0.0001
entry_leg_pcs_long_wvega 0.591 0.0002
entry_underlying_hv 0.580 0.0003
entry_leg_pcs_long_vega 0.541 0.0009
entry_leg_pds_long_vega 0.528 0.0012
entry_pcs_short_by_pds_long_iv_ratio 0.516 0.0017
...

Interpretation:

Cramér's V     Association Strength
0.00 - 0.20    Weak
0.20 - 0.40    Moderate
0.40 - 1.00    Strong

Suggestion: start with a threshold of 0.3 or higher when filtering potential predictors.
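
As a hypothetical post-processing step, the ranked list can be filtered with pandas once it is loaded into a DataFrame; the column names below mirror the output above, but the exact on-disk result format and the weak-feature row are assumptions.

import pandas as pd

# Hypothetical results table mirroring the output above
results = pd.DataFrame(
    {"CramersV": [0.628, 0.617, 0.280], "pValue": [0.0001, 0.0001, 0.0400]},
    index=["entry_leg_pcs_short_vega", "entry_leg_pds_long_wvega", "some_weak_feature"],
)
strong = results[results["CramersV"] >= 0.3].sort_values("CramersV", ascending=False)
print(strong)   # keeps only features with Cramér's V of 0.3 or higher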

MIFS

When MIFS is used as the feature selection algorithm, Merlin returns a ranked list of features based on their Mutual Information Score (Score). For example:

   $ merlin discover-predictors --feature-selection=mifs  --add-ivs --add-greeks --add-derived-metrics merlin-configs/strategies/boxcar-strategy.json

Score MCpValue CramersV
Feature
entry_pos_wvega 0.8607 0.005 0.5275
entry_leg_pcs_long_iv 0.9154 0.005 0.4213

The Score is the Mutual Information Score, which indicates the strength of the relationship between the feature and the target variable. The MCpValue is the Monte Carlo Permutation test p-value, which indicates the statistical significance of the feature's selection; a lower value (e.g., < 0.1) suggests that the feature is statistically significant. The Cramér's V value is also provided for reference as an additional measure of the same feature-target association.
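
To give a feel for how a Monte Carlo permutation p-value such as MCpValue can be produced, the generic sketch below repeatedly shuffles the target and counts how often the shuffled mutual information matches or beats the observed one; it is not Merlin's exact procedure, and the bin codes are random placeholders.

import numpy as np
from sklearn.metrics import mutual_info_score

def mc_p_value(feature_bins, target_bins, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = mutual_info_score(feature_bins, target_bins)   # observed MI score
    target = np.asarray(target_bins)
    hits = sum(
        mutual_info_score(feature_bins, rng.permutation(target)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)   # add-one smoothing keeps the estimate above zero

# Placeholder usage with random integer bin codes
rng = np.random.default_rng(1)
print(mc_p_value(rng.integers(0, 2, 500), rng.integers(0, 3, 500)))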

Please refer to the MIFS model documentation for more details on how MIFS works and how to interpret its results.

Usage during Strategy and Portfolio Optimization

Cramér's V and MIFS parameters can be set in the Features section of the Model Config JSON file for Strategy Optimization. Please refer to the Strategy Optimization documentation for more details.

{
  "Features": {
    "AddIVs": true,
    "AddGreeks": false,
    "AddPrices": false,
    "AddDerivedMetrics": true,
    "AddUserVars": false,
    "AddFeatureTransforms": false,
    "AddCSV": null,
    "CramersV": {
      "Enabled": true,
      "Threshold": 0.3,
      "FeatureGrouping": "tails",
      "FeatureBinsOrTails": 0.1,
      "TargetGrouping": "bins",
      "TargetBins": 3
    },
    "MIFS": {
      "Enabled": false,
      "MCpThreshold": 0.1,
      "CramersVThreshold": null,
      "MIScoreThreshold": null,
      "FeatureGrouping": "bins",
      "FeatureBinsOrTails": 2,
      "TargetGrouping": "bins",
      "TargetBins": 3
    }
  }
}