Predictor Discovery
Overview
Predictor discovery is a foundational capability in Merlin that identifies which features have strong statistical relationships with strategy performance. Using statistical methods including Cramér's V and the MIFS (Mutual Information based Feature Selection) model, Merlin quantifies the association between potential predictors and trading outcomes and ranks the features by predictive power.
This process pre-selects features for trading models, reducing the feature space from potentially hundreds of variables to a manageable set of predictive features.
How Predictor Discovery Works
The predictor discovery workflow in Merlin follows these steps:
- Feature Extraction: Calculate features from option chain data via MesoSim
- Target Calculation: Determine trading outcomes (profit/loss, win/loss)
- Feature Binning: Group continuous features into categories for analysis
- Statistical Measurement: Measure the statistical association between each feature and the target with Cramér's V or MIFS
- Feature Ranking: Sort features by their predictive strength
Cramér's V Statistical Test
Cramér's V is a statistical measure of association between two categorical variables, derived from the chi-squared statistic. It ranges from 0 (no association) to 1 (perfect association).
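For reference, the standard textbook definition of the coefficient is given below. Here $\chi^2$ is the chi-squared statistic of the contingency table, $n$ is the number of observations, and $r$ and $c$ are the numbers of rows and columns in the table. Whether Merlin applies any small-sample bias correction on top of this is not stated here, so treat the formula as the general definition rather than a description of Merlin's internals.

$$
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
$$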
For trading strategy development, Cramér's V helps determine which features most strongly relate to strategy performance by:
- Converting continuous features into categorical bins
- Cross-tabulating each feature's bins against the target's bins or tails
- Calculating the chi-squared statistic from this contingency table
- Normalizing the result to produce the Cramér's V coefficient
A higher Cramér's V value indicates a stronger relationship between the feature and trading outcomes.
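The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard Cramér's V calculation on already-binned data, not Merlin's implementation; the function name `cramers_v` and the toy labels are assumptions made for the example.

```python
import math
from collections import Counter


def cramers_v(feature_bins, target_bins):
    """Cramér's V between two equal-length sequences of category labels."""
    n = len(feature_bins)
    if n == 0 or n != len(target_bins):
        raise ValueError("inputs must be non-empty and of equal length")

    # Contingency table stored as counts of (feature bin, target bin) pairs.
    observed = Counter(zip(feature_bins, target_bins))
    row_totals = Counter(feature_bins)
    col_totals = Counter(target_bins)

    # Chi-squared: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected

    # Normalize by n * min(rows - 1, cols - 1) so the result lies in [0, 1].
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))


# Toy data (hypothetical): the feature bin fully determines the outcome.
feature = ["low", "low", "high", "high", "low", "high"]
outcome = ["loss", "loss", "win", "win", "loss", "win"]
print(cramers_v(feature, outcome))
```

In this toy case every "low" observation is a loss and every "high" observation is a win, so the coefficient sits at the top of its range. Shuffling the outcome labels so they no longer line up with the feature bins pushes it toward 0.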
MIFS: Advanced Feature Selection
The MIFS (Mutual Information based Feature Selection) model provides an advanced alternative to Cramér's V for feature selection. MIFS offers several advantages:
- Information Theory Based: Uses mutual information to measure predictive relationships, capturing both linear and nonlinear dependencies
- Redundancy Minimization: Systematically selects features that add unique predictive information, rather than ranking on individual strength alone
- Statistical Validation: Includes Monte Carlo Permutation Testing to validate statistical significance of each selected feature
- Iterative Selection: Builds feature sets incrementally, ensuring each addition provides meaningful predictive value
MIFS also returns the respective Cramér's V values for each selected feature, allowing you to see both the individual and combined predictive power.
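To make the redundancy idea concrete, here is a rough sketch of mutual information on binned data combined with a greedy "relevance minus mean redundancy" rule. This is a generic criterion of the kind used in mutual-information feature selection; the exact scoring Merlin uses is not specified here. The function names and the toy features `a`, `a_dup`, and `b` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def greedy_select(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:

        def criterion(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            # Average information shared with the already selected features.
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy, already-binned data (hypothetical).
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 0],      # strong individual predictor
    "a_dup": [0, 0, 0, 0, 1, 1, 1, 0],  # exact copy of "a": fully redundant
    "b": [0, 0, 1, 1, 0, 1, 1, 1],      # weaker, but nearly independent of "a"
}
print(greedy_select(features, target, 2))
```

A ranking on individual strength alone would place `a_dup` second, because it scores exactly as well as `a`. The redundancy term penalizes it for repeating information already captured, so the weaker but complementary `b` is preferred instead.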
Use Cramér's V when:
- You need quick feature screening
- You want to assess the strength of each feature individually
Use MIFS when:
- You want statistical validation of the feature selection
- You are working with complex feature sets where interactions matter
- You are preparing features for SETS model optimization
Command Line Usage
merlin discover-predictors <strategy-json> [--data-coll-backtest-id=MESOSIM_BACKTEST_ID]
[--add-ivs] [--add-greeks] [--add-prices] [--add-derived-metrics]
[--add-user-vars] [--add-feature-transforms] [--add-csv=CSV_FILE_PATH]
[--feature-selection=METHOD]
[--feature-grouping=INDICATOR_GROUPING] [--bins-or-tails=BINS_OR_TAILS]
[--target-grouping=TARGET_GROUPING] [--target-bins=TARGET_BINS]
[--start=START_DATE] [--end=END_DATE]
[--result-dir=RESULT_DIR]
Example
This command discovers predictors for the Boxcar strategy, including implied volatility and Greek metrics, using the top and bottom 10% of values for feature binning.
merlin discover-predictors configs/strategies/boxcar-strategy.json --add-ivs --add-greeks --feature-selection=cramers_v --feature-grouping=tails --bins-or-tails=0.1
Configuration Parameters
Feature Selection
| Parameter | Description |
|---|---|
| --add-ivs | Include implied volatility features |
| --add-greeks | Include option Greeks features |
| --add-prices | Include price-based features |
| --add-derived-metrics | Include derived and ratio metrics |
| --add-user-vars | Include user variables from the strategy definition file |
| --add-feature-transforms | Apply transformations to features |
| --add-csv=CSV_FILE_PATH | Include features from a CSV file |
| --feature-selection=METHOD | Feature selection method: "cramers_v" or "mifs" |
Feature Selection Configuration
| Parameter | Description |
|---|---|
| --feature-grouping | How to group features: "bins" (equally populated bins) or "tails" (extremes) |
| --bins-or-tails | Number of bins, or the fraction of data in each tail (0-1) |
| --target-grouping | How to group the target: "bins" (quantiles) or "sign" (positive/negative) |
| --target-bins | Number of target bins when using "bins" grouping |
Date Range
| Parameter | Description |
|---|---|
| --start | Start date for analysis (YYYY-MM-DD) |
| --end | End date for analysis (YYYY-MM-DD or "now") |
Output
| Parameter | Description |
|---|---|
| --result-dir | Directory to save analysis results |
Feature Grouping Methods
Merlin supports two primary methods for grouping continuous features into categories:
Bins Grouping
Divides the feature's values into equally populated bins:
- Creates N equally populated bins across the full range
- A higher number of bins can improve accuracy, but may lead to too many ties in each group
- Example: 5 bins would create quintiles (20% of data in each bin)
merlin discover-predictors <strategy-json> --feature-grouping=bins --bins-or-tails=5
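As an illustration of what equally populated bins mean, the sketch below assigns bin labels by rank. It is a simplified stand-in for the idea, not Merlin's binning code: the function name `quantile_bins` is hypothetical, and tied values are split by sort order here, which may differ from how Merlin treats ties.

```python
from collections import Counter


def quantile_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 with near-equal bin counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Map the rank proportionally onto the bin indices.
        labels[i] = rank * n_bins // len(values)
    return labels


# Hypothetical feature values, e.g. an implied volatility reading at entry.
values = [0.31, 0.12, 0.55, 0.78, 0.05, 0.93, 0.44, 0.27, 0.66, 0.81]
labels = quantile_bins(values, 5)
print(labels)
print(Counter(labels))  # each of the 5 bins holds 2 of the 10 values
```

With 10 observations and 5 bins, each quintile receives exactly 2 observations, regardless of how unevenly the raw values are spread.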
Tails Grouping
Focuses on the extreme values at both ends of the distribution:
- Creates 3 groups: bottom X%, middle, top X%
- The middle range is ignored. Power is in the tails!
- Example: 0.1 tails would use the bottom 10% and top 10% of values, leaving out the middle 80%
merlin discover-predictors <strategy-json> --feature-grouping=tails --bins-or-tails=0.1
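The tails grouping can be pictured with the following sketch, which labels observations by rank and then drops the middle group before any cross-tabulation. Dropping the middle is one reading of "ignored" and is an assumption here, as are the function name `tail_groups` and the sample data.

```python
def tail_groups(values, tail_fraction):
    """Label each value 'bottom', 'middle', or 'top' according to its rank."""
    n = len(values)
    k = round(n * tail_fraction)  # number of observations in each tail
    order = sorted(range(n), key=lambda i: values[i])
    labels = ["middle"] * n
    for i in order[:k]:
        labels[i] = "bottom"
    for i in order[n - k:]:
        labels[i] = "top"
    return labels


# Hypothetical feature values and realized PnL per trade.
values = [3.2, 1.1, 7.7, 4.0, 9.9, 5.5, 2.4, 6.3, 8.1, 0.4]
pnl = [10.0, -5.0, 20.0, 3.0, -40.0, 7.0, 1.0, -2.0, 15.0, 60.0]

labels = tail_groups(values, 0.1)

# Keep only the tail observations for the association measurement.
tail_pairs = [(g, p) for g, p in zip(labels, pnl) if g != "middle"]
print(tail_pairs)
```

With 10 observations and a tail fraction of 0.1, a single observation lands in each tail, which shows why tails grouping needs a reasonably large sample to be meaningful.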
Target Grouping Methods
Similarly, the target variable (the position's realized PnL) can be grouped in two ways:
Bins Grouping
Divides the target into equal-sized quantiles:
- Creates N equally populated bins of performance outcomes
- Good for strategies with varied performance distributions
- Example: 3 bins would split trades into winners, negligible winners and losers, and losers
merlin discover-predictors <strategy-json> --target-grouping=bins --target-bins=3
Sign Grouping
Divides the target based on positive/negative outcomes:
- Creates 2 groups: losing trades and winning trades
- Good for simple win/loss analysis
- Simplifies analysis to focus on directional correctness
merlin discover-predictors <strategy-json> --target-grouping=sign
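Both target groupings can be illustrated on a small list of realized PnL values. The sketch below is an assumption-laden example rather than Merlin's code: the function names are hypothetical, a PnL of exactly zero is counted as a loss, and the three bins are formed by rank.

```python
def sign_groups(pnl):
    """Two groups: winning trades (PnL > 0) and losing trades (otherwise)."""
    return ["win" if p > 0 else "loss" for p in pnl]


def tercile_groups(pnl):
    """Three equally populated groups, ordered from worst to best PnL."""
    names = ["losers", "negligible", "winners"]
    order = sorted(range(len(pnl)), key=lambda i: pnl[i])
    labels = [""] * len(pnl)
    for rank, i in enumerate(order):
        labels[i] = names[rank * 3 // len(pnl)]
    return labels


# Hypothetical realized PnL per position.
pnl = [120.0, -80.0, 15.0, -5.0, 300.0, -250.0]

print(sign_groups(pnl))
print(tercile_groups(pnl))
```

Note how the trades with PnL of 15 and -5 fall on opposite sides under sign grouping, but share the "negligible" group under 3 bins. This is the practical difference between the two options.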
Suggested Configuration
When using Cramér's V, start with Tails Grouping at 10% on each side for the features and 3 bins for the target:
merlin discover-predictors <strategy-json> --feature-selection=cramers_v --feature-grouping=tails --bins-or-tails=0.1 --target-grouping=bins --target-bins=3
When using MIFS, set both the Feature and Target Grouping to Bins, with 2 and 3 bins respectively:
merlin discover-predictors <strategy-json> --feature-selection=mifs --feature-grouping=bins --bins-or-tails=2 --target-grouping=bins --target-bins=3
Interpreting Results
Cramér's V
Merlin outputs a ranked list of features based on their Cramér's V scores. For example:
$ merlin discover-predictors --feature-selection=cramers_v --add-ivs --add-greeks --add-derived-metrics merlin-configs/strategies/boxcar-strategy.json
CramersV pValue
entry_leg_pcs_short_vega 0.628 0.0001
entry_leg_pds_long_wvega 0.617 0.0001
entry_leg_pcs_short_wvega 0.612 0.0001
entry_underlying_price 0.612 0.0001
entry_leg_pcs_long_wvega 0.591 0.0002
entry_underlying_hv 0.580 0.0003
entry_leg_pcs_long_vega 0.541 0.0009
entry_leg_pds_long_vega 0.528 0.0012
entry_pcs_short_by_pds_long_iv_ratio 0.516 0.0017
...
Interpretation:
| Cramér's V | Association Strength |
|---|---|
| 0.00 - 0.20 | Weak |
| 0.20 - 0.40 | Moderate |
| 0.40 - 1.00 | Strong |
Suggestion: Start with a threshold of 0.3 or higher to filter potential predictors.
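A threshold filter over such a ranked list might look like the sketch below. The first three scores are taken from the sample output above; the last two entries are made-up placeholders added only so that the filter has something to remove. The helper name `strength` and the choice to treat each band's lower bound as inclusive are assumptions.

```python
def strength(v):
    """Map a Cramér's V score to the association bands in the table above."""
    if v < 0.20:
        return "Weak"
    if v < 0.40:
        return "Moderate"
    return "Strong"


scores = {
    "entry_leg_pcs_short_vega": 0.628,
    "entry_underlying_price": 0.612,
    "entry_underlying_hv": 0.580,
    "example_moderate_feature": 0.250,  # hypothetical, not from Merlin output
    "example_weak_feature": 0.120,      # hypothetical, not from Merlin output
}

threshold = 0.3
candidates = {name: v for name, v in scores.items() if v >= threshold}

for name, v in candidates.items():
    print(f"{name:28s} {v:.3f} {strength(v)}")
```

Raising the threshold shrinks the candidate set further, and lowering it lets moderate predictors through. Both are worth trying when the initial list is too short or too long.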
MIFS
When MIFS is used as the feature selection algorithm, Merlin returns a ranked list of features based on their Mutual Information Score (Score). For example:
$ merlin discover-predictors --feature-selection=mifs --add-ivs --add-greeks --add-derived-metrics merlin-configs/strategies/boxcar-strategy.json
Score MCpValue CramersV
Feature
entry_pos_wvega 0.8607 0.005 0.5275
entry_leg_pcs_long_iv 0.9154 0.005 0.4213
The Score is the Mutual Information Score, which indicates the strength of the relationship between the feature and the target variable. The MCpValue is the Monte Carlo Permutation test p-value, which indicates the statistical significance of the feature's selection. A lower value (e.g., < 0.1) suggests that the feature is statistically significant. The Cramér's V value is also provided for reference, as a measure of the feature's individual association with the target.
Please refer to the MIFS model documentation for more details on how MIFS works and how to interpret its results.
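For intuition about the MCpValue column, a Monte Carlo permutation test can be sketched as follows: shuffle the target many times, recompute the score each time, and count how often the shuffled score matches or beats the real one. This is the general technique only; the number of permutations, the seed handling, and the function names are assumptions, not Merlin's actual settings.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def permutation_p_value(feature, target, n_permutations=199, seed=0):
    """Share of shuffles whose score is at least as large as the real one."""
    rng = random.Random(seed)
    observed = mutual_information(feature, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real feature/target link
        if mutual_information(feature, shuffled) >= observed:
            hits += 1
    # The unshuffled arrangement counts as one of the permutations.
    return (hits + 1) / (n_permutations + 1)


# Toy binned data (hypothetical): one informative and one alternating feature.
target = [0] * 10 + [1] * 10
informative = [0] * 10 + [1] * 10
alternating = [0, 1] * 10

p_informative = permutation_p_value(informative, target)
p_alternating = permutation_p_value(alternating, target)
print(p_informative, p_alternating)
```

In this sketch the smallest attainable p-value is 1 / (n_permutations + 1), so 199 permutations give a floor of 0.005. A feature with no real relationship to the target should produce a p-value well above a cutoff such as 0.1.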
Use During Strategy and Portfolio Optimization
Cramér's V and MIFS parameters can be set in the Features section of the Model Config JSON file for Strategy Optimization. Please refer to the Strategy Optimization documentation for more details.
{
  "Features": {
    "AddIVs": true,
    "AddGreeks": false,
    "AddPrices": false,
    "AddDerivedMetrics": true,
    "AddUserVars": false,
    "AddFeatureTransforms": false,
    "AddCSV": null,
    "CramersV": {
      "Enabled": true,
      "Threshold": 0.3,
      "FeatureGrouping": "tails",
      "FeatureBinsOrTails": 0.1,
      "TargetGrouping": "bins",
      "TargetBins": 3
    },
    "MIFS": {
      "Enabled": false,
      "MCpThreshold": 0.1,
      "CramersVThreshold": null,
      "MIScoreThreshold": null,
      "FeatureGrouping": "bins",
      "FeatureBinsOrTails": 2,
      "TargetGrouping": "bins",
      "TargetBins": 3
    }
  }
}
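Before launching an optimization run, it can be handy to check which selection method a config actually enables. The sketch below parses a trimmed copy of the Features section with Python's standard `json` module. The key names come from the example above; the trimmed snippet and the variable names are only illustrative, and this check is not part of Merlin.

```python
import json

# Trimmed copy of the Features section shown above (illustrative only).
config_text = """
{
  "Features": {
    "AddIVs": true,
    "AddDerivedMetrics": true,
    "CramersV": {"Enabled": true, "Threshold": 0.3},
    "MIFS": {"Enabled": false, "MCpThreshold": 0.1}
  }
}
"""

features = json.loads(config_text)["Features"]

# Collect the selection methods whose "Enabled" flag is set.
enabled = [
    method
    for method in ("CramersV", "MIFS")
    if features.get(method, {}).get("Enabled")
]
print("Enabled selection methods:", enabled)
```

Whether both methods may be enabled at the same time is not stated here, so consult the Strategy Optimization documentation before relying on a combined setup.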