Predictor Discovery

Overview

Predictor discovery is a foundational capability in Merlin that identifies which features have strong statistical relationships with strategy performance. Using Cramér's V and the MIFS (Mutual Information based Feature Selection) model, Merlin quantifies the association between potential predictors and trading outcomes and ranks the features by predictive power.

This process pre-selects features for trading models, reducing the feature space from potentially hundreds of variables to a manageable set of predictive features.

How Predictor Discovery Works

The predictor discovery workflow in Merlin follows these steps:

1. Feature Extraction: Calculate features from option chain data via MesoSim
2. Target Calculation: Determine trading outcomes (profit/loss, win/loss)
3. Feature Binning: Group continuous features into categories for analysis
4. Statistical Measurement: Measure the statistical association between each feature and the target with Cramér's V or MIFS
5. Feature Ranking: Sort features by their predictive strength

Cramér's V Statistical Test

Cramér's V is a statistical measure of association between two categorical variables, derived from the chi-squared statistic. It ranges from 0 (no association) to 1 (perfect association).

For trading strategy development, Cramér's V helps determine which features most strongly relate to strategy performance by:

1. Converting continuous features into categorical bins
2. Cross-tabulating each feature's bins against the target's bins or tails
3. Calculating the chi-squared statistic from this contingency table
4. Normalizing the result to produce the Cramér's V coefficient

A higher Cramér's V value indicates a stronger relationship between the feature and trading outcomes.
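The four steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration of the textbook formula, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), and not Merlin's implementation: the function name `cramers_v` is hypothetical, and Merlin may apply corrections that are not shown here.

```python
import math
from collections import Counter

def cramers_v(feature_bins, target_bins):
    """Cramér's V between two equal-length sequences of category labels."""
    n = len(feature_bins)
    if n == 0 or n != len(target_bins):
        raise ValueError("inputs must be non-empty and of equal length")
    # Contingency table: counts of (feature category, target category) pairs.
    observed = Counter(zip(feature_bins, target_bins))
    row_totals = Counter(feature_bins)
    col_totals = Counter(target_bins)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # a constant variable carries no association
    # Chi-squared: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected
    # Normalize by n * (k - 1) so the result lies between 0 and 1.
    return math.sqrt(chi2 / (n * (k - 1)))

# Made-up labels: the feature bin fully determines the outcome.
perfect = cramers_v(["low", "low", "high", "high"], ["loss", "loss", "win", "win"])
# Made-up labels: the outcome is evenly spread across both feature bins.
independent = cramers_v(["low", "low", "high", "high"], ["loss", "win", "loss", "win"])
```

In this toy data `perfect` evaluates to 1.0 and `independent` to 0.0, the two ends of the range described above.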

MIFS: Advanced Feature Selection

The MIFS (Mutual Information based Feature Selection) model provides an advanced alternative to Cramér's V for feature selection. MIFS offers several advantages:

Information Theory Based: Uses mutual information to measure predictive relationships, capturing both linear and nonlinear dependencies
Redundancy Minimization: Systematically selects features that add unique predictive information rather than just individual strength
Statistical Validation: Includes Monte Carlo Permutation Testing to validate statistical significance of each selected feature
Iterative Selection: Builds feature sets incrementally, ensuring each addition provides meaningful predictive value

MIFS also returns the respective Cramér's V values for each selected feature, allowing you to see both the individual and combined predictive power.
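To make the redundancy idea concrete, the sketch below scores each candidate by its mutual information with the target minus a penalty for the information it shares with features already selected. This is one common formulation of mutual-information-based selection, written under assumptions: the names `mutual_information` and `greedy_select` and the `beta` penalty weight are hypothetical, and Merlin's actual criterion and stopping rule are described in the MIFS model documentation, not here.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def greedy_select(features, target, k, beta=1.0):
    """Pick up to k feature names by relevance minus beta * redundancy."""
    selected = []
    remaining = dict(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(remaining[name], target)
            redundancy = sum(
                mutual_information(remaining[name], features[s]) for s in selected
            )
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Made-up binned features: "b" duplicates "a", while "c" adds new information.
features = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
target = [0, 1, 2, 3]
chosen = greedy_select(features, target, k=2)
```

Here `chosen` is `["a", "c"]`: "b" is as strong as "a" on its own, but once "a" is selected it adds nothing new, so the independent feature "c" is preferred.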

When to use each method

Use Cramér's V when:

You need quick feature screening
You want to assess the strength of individual features

Use MIFS when:

You want statistical validation of feature selection
You are working with complex feature sets where interactions matter
You are preparing features for SETS model optimization

Command Line Usage

merlin discover-predictors <strategy-json> [--data-coll-backtest-id=MESOSIM_BACKTEST_ID]
                                           [--add-ivs] [--add-greeks] [--add-prices] [--add-derived-metrics]
                                           [--add-user-vars] [--add-feature-transforms] [--add-csv=CSV_FILE_PATH]
                                           [--feature-selection=METHOD]
                                           [--feature-grouping=INDICATOR_GROUPING] [--bins-or-tails=BINS_OR_TAILS]
                                           [--target-grouping=TARGET_GROUPING] [--target-bins=TARGET_BINS]
                                           [--start=START_DATE] [--end=END_DATE]
                                           [--result-dir=RESULT_DIR]

Example

This command discovers predictors for the Boxcar strategy, including implied volatility and Greek metrics, using the top and bottom 10% of values for feature binning.

merlin discover-predictors configs/strategies/boxcar-strategy.json --add-ivs --add-greeks --feature-selection=cramers_v --feature-grouping=tails --bins-or-tails=0.1

Configuration Parameters

Feature Selection

Parameter	Description
--add-ivs	Include implied volatility features
--add-greeks	Include option Greeks features
--add-prices	Include price-based features
--add-derived-metrics	Include derived and ratio metrics
--add-user-vars	Include user variables from the strategy definition file
--add-feature-transforms	Apply transformations to features
--add-csv=CSV_FILE_PATH	Include features from a CSV file
--feature-selection=METHOD	Feature selection method: "cramers_v" or "mifs"

Feature Selection Configuration

Parameter	Description
--feature-grouping	How to group features: "bins" (equally populated bins) or "tails" (extremes)
--bins-or-tails	Number of bins (with "bins") or fraction of the data in each tail, between 0 and 1 (with "tails")
--target-grouping	How to group the target: "bins" (quantiles) or "sign" (positive/negative)
--target-bins	Number of target bins when using "bins" grouping

Date Range

Parameter	Description
--start	Start date for analysis (YYYY-MM-DD)
--end	End date for analysis (YYYY-MM-DD or "now")

Output

Parameter	Description
--result-dir	Directory to save analysis results

Feature Grouping Methods

Merlin supports two primary methods for grouping continuous features into categories:

Bins Grouping

Divides the feature's values into equally populated bins:

Creates N equally populated bins across the full range
A higher number of bins can improve accuracy, but may lead to too many ties in each group
Example: 5 bins would create quintiles (20% of data in each bin)

merlin discover-predictors <strategy-json> --feature-grouping=bins --bins-or-tails=5
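As an illustration of equally populated bins, the sketch below assigns bin indices by rank. The helper name `quantile_bins` is hypothetical and the way Merlin computes its bin edges is not documented here. Note that this rank-based approach splits tied values across neighbouring bins, whereas an edge-based implementation would keep them together.

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 with near-equal bin counts."""
    n = len(values)
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Map rank 0..n-1 proportionally onto bin 0..n_bins-1.
        bins[i] = rank * n_bins // n
    return bins

# Made-up feature values; 5 bins gives quintiles with 2 observations each.
values = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
bins = quantile_bins(values, 5)
```

For these ten values `bins` is `[2, 0, 4, 1, 3, 0, 3, 1, 2, 4]`, so each quintile holds 20% of the data.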

Tails Grouping

Focuses on the extreme values at both ends of the distribution:

Creates 3 groups: bottom X%, middle, top X%
The middle range is ignored. Power is in the tails!
Example: 0.1 tails would split the data into bottom 10%, middle 80% (ignored), and top 10%

merlin discover-predictors <strategy-json> --feature-grouping=tails --bins-or-tails=0.1
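A rough sketch of tails grouping follows, assuming the tail size is the floor of the tail fraction times the number of observations. The helper name `tail_groups` and that rounding rule are assumptions made for illustration, not Merlin's documented behaviour.

```python
def tail_groups(values, tail_fraction):
    """Label each value 'bottom', 'middle', or 'top' by its rank."""
    n = len(values)
    k = int(n * tail_fraction)  # observations per tail (assumed: rounded down)
    order = sorted(range(n), key=lambda i: values[i])
    labels = ["middle"] * n
    for i in order[:k]:
        labels[i] = "bottom"
    for i in order[n - k:]:
        labels[i] = "top"
    return labels

# Made-up feature values with 10% tails: one observation in each tail.
values = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
labels = tail_groups(values, 0.1)
# The middle range is ignored, so only the tail observations are kept.
tail_indices = [i for i, label in enumerate(labels) if label != "middle"]
```

Only the observations at `tail_indices` (here the smallest and largest value) would then be cross-tabulated against the target.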

Target Grouping Methods

Similarly, the target variable (the position's realized PnL) can be grouped in two ways:

Bins Grouping

Divides the target into equal-sized quantiles:

Creates N equally populated bins of performance outcomes
Good for strategies with varied performance distributions
Example: 3 bins would create a split into winners, negligible winners and losers, and losers

merlin discover-predictors <strategy-json> --target-grouping=bins --target-bins=3

Sign Grouping

Divides the target based on positive/negative outcomes:

Creates 2 groups: losing trades and winning trades
Good for simple win/loss analysis
Simplifies analysis to focus on directional correctness

merlin discover-predictors <strategy-json> --target-grouping=sign
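Sign grouping amounts to a one-line transformation of the realized PnL. In the sketch below the helper names are hypothetical, and treating a PnL of exactly zero as a loss is an assumption; the text above does not say how Merlin handles break-even trades.

```python
def sign_groups(pnls):
    """Label each realized PnL as 'win' (> 0) or 'loss' (<= 0, an assumption)."""
    return ["win" if pnl > 0 else "loss" for pnl in pnls]

def win_rate(labels):
    """Fraction of trades labelled 'win'."""
    return labels.count("win") / len(labels)

# Made-up PnL values for four trades.
labels = sign_groups([120.0, -45.5, 0.0, 300.0])
rate = win_rate(labels)
```

With these four trades `labels` is `["win", "loss", "loss", "win"]` and `rate` is 0.5.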

Suggested configuration

When using Cramér's V, start with Tails Grouping at 10% on each side for the features and 3 bins for the target:

merlin discover-predictors <strategy-json> --feature-selection=cramers_v --feature-grouping=tails --bins-or-tails=0.1 --target-grouping=bins --target-bins=3

When using MIFS, set both the feature grouping and the target grouping to Bins, with 2 bins for the features and 3 bins for the target:

merlin discover-predictors <strategy-json> --feature-selection=mifs --feature-grouping=bins --bins-or-tails=2 --target-grouping=bins --target-bins=3

Interpreting Results

Cramér's V

Merlin outputs a ranked list of features based on their Cramér's V scores. For example:

   $ merlin discover-predictors --feature-selection=cramers_v  --add-ivs --add-greeks --add-derived-metrics merlin-configs/strategies/boxcar-strategy.json
   
                                             CramersV  pValue
   entry_leg_pcs_short_vega                  0.628  0.0001
   entry_leg_pds_long_wvega                  0.617  0.0001
   entry_leg_pcs_short_wvega                 0.612  0.0001
   entry_underlying_price                    0.612  0.0001
   entry_leg_pcs_long_wvega                  0.591  0.0002
   entry_underlying_hv                       0.580  0.0003
   entry_leg_pcs_long_vega                   0.541  0.0009
   entry_leg_pds_long_vega                   0.528  0.0012
   entry_pcs_short_by_pds_long_iv_ratio      0.516  0.0017
   ... 

Interpretation:

Cramér's V	Association Strength
0.00 - 0.20	Weak
0.20 - 0.40	Moderate
0.40 - 1.00	Strong

Suggestion: Start with a Cramér's V threshold of 0.3 or higher to filter potential predictors

MIFS

When MIFS is used as the feature selection algorithm, Merlin returns a ranked list of features based on their Mutual Information Score (Score). For example:

   $ merlin discover-predictors --feature-selection=mifs  --add-ivs --add-greeks --add-derived-metrics merlin-configs/strategies/boxcar-strategy.json
   
                            Score  MCpValue  CramersV
   Feature
   entry_pos_wvega        0.8607     0.005    0.5275
   entry_leg_pcs_long_iv  0.9154     0.005    0.4213

The Score is the Mutual Information Score, which indicates the strength of the relationship between the feature and the target variable. The MCpValue is the Monte Carlo Permutation test p-value, which indicates the statistical significance of the feature's selection. A lower value (e.g., < 0.1) suggests that the feature is statistically significant. The Cramér's V value is also provided for reference, indicating the strength of the relationship between the feature and the target variable.
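The idea behind a Monte Carlo permutation p-value can be sketched as follows: shuffle the target many times, recompute the statistic on each shuffle, and report how often a shuffled score matches or beats the real one. This is a generic illustration under assumptions, not Merlin's procedure: the names `permutation_p_value` and `agreement`, the default of 199 shuffles, and the toy statistic are all hypothetical.

```python
import random

def permutation_p_value(statistic, feature, target, n_permutations=199, seed=0):
    """Share of target shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(feature, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between feature and target
        if statistic(feature, shuffled) >= observed:
            hits += 1
    # Count the observed ordering as one permutation so p is never exactly 0.
    return (hits + 1) / (n_permutations + 1)

def agreement(xs, ys):
    """Toy statistic: fraction of positions where the two labels match."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

# Made-up data: the feature bin matches the outcome on every trade.
feature = [0, 1] * 20
target = [0, 1] * 20
p_value = permutation_p_value(agreement, feature, target)
```

Under this convention the smallest attainable p-value with 199 shuffles is 1/200 = 0.005, which is what `p_value` evaluates to here, because no shuffle reproduces the perfect match. In practice the statistic would be the selection score itself in place of the toy `agreement` function.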

Please refer to the MIFS model documentation for more details on how MIFS works and how to interpret its results.

Using during Strategy and Portfolio Optimization

Cramér's V and MIFS parameters can be set in the Features section of the Model Config JSON file for Strategy Optimization. Please refer to the Strategy Optimization documentation for more details.

{
  "Features": {
    "AddIVs": true,
    "AddGreeks": false,
    "AddPrices": false,
    "AddDerivedMetrics": true,
    "AddUserVars": false,
    "AddFeatureTransforms": false,
    "AddCSV": null,
    "CramersV": {
      "Enabled": true,
      "Threshold": 0.3,
      "FeatureGrouping": "tails",
      "FeatureBinsOrTails": 0.1,
      "TargetGrouping": "bins",
      "TargetBins": 3
    },
    "MIFS": {
      "Enabled": false,
      "MCpThreshold": 0.1,
      "CramersVThreshold": null,
      "MIScoreThreshold": null,
      "FeatureGrouping": "bins",
      "FeatureBinsOrTails": 2,
      "TargetGrouping": "bins",
      "TargetBins": 3
    }
  }
}
