Supported binning methods

Model Monitor uses bins to represent probability distributions and to compute divergence values for data drift. The number of bins affects the quality of the drift values and, more generally, the performance of Model Monitor itself. Using more than 20 bins can cause false alarms, which in turn can impact performance.

By default, Model Monitor uses the Freedman-Diaconis estimator to calculate the number of bins for numerical variables. If the estimator returns a count higher than 20, the count is capped at 20.
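The default rule can be approximated with NumPy, which ships its own Freedman-Diaconis estimator. The following is a minimal sketch, not Model Monitor's actual implementation; the function name `capped_fd_bin_count` and the constant `MAX_BINS` are hypothetical names chosen for illustration.

```python
import numpy as np

MAX_BINS = 20  # cap described in the text above


def capped_fd_bin_count(values, max_bins=MAX_BINS):
    """Estimate a bin count with the Freedman-Diaconis rule, capped at max_bins."""
    values = np.asarray(values, dtype=float)
    # NumPy returns the bin edges; the number of bins is one less than the edges.
    edges = np.histogram_bin_edges(values, bins="fd")
    return min(len(edges) - 1, max_bins)


# A large sample produces many Freedman-Diaconis bins, so the cap applies.
rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)
print(capped_fd_bin_count(sample))
```

Running the same function on your own training column gives a rough idea of whether the default would hit the cap of 20.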

For numerical variables, Model Monitor automatically adds one guard bin for values that fall outside the minimum-to-maximum range of the values present in the training data. For training data, this guard bin has a zero count (unless the binsEdges override strategy is used). For prediction data, however, values might fall in this bin, which indicates that the prediction data contains values outside the min-max range seen in the training data.
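To make the guard bin concrete, the sketch below counts values into bins built from the training range and tallies everything outside that range separately. It illustrates the idea only and assumes a single guard count; the function name `histogram_with_guard` is hypothetical and does not reflect Model Monitor's internals.

```python
import numpy as np


def histogram_with_guard(values, edges):
    """Count values per bin and count out-of-range values in a separate guard bin."""
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)
    # Values inside [first edge, last edge] go to regular bins.
    in_range = (values >= edges[0]) & (values <= edges[-1])
    counts, _ = np.histogram(values[in_range], bins=edges)
    # Everything else lands in the guard bin.
    guard = int((~in_range).sum())
    return counts.tolist(), guard


training = [0.0, 1.0, 2.0, 3.0, 4.0]
edges = np.linspace(min(training), max(training), 5)  # 4 equal-width bins

train_counts, train_guard = histogram_with_guard(training, edges)
pred_counts, pred_guard = histogram_with_guard([-1.0, 0.5, 5.0, 3.5], edges)
print(train_guard, pred_guard)
```

Because the edges come from the training minimum and maximum, the training guard count is zero, while the prediction data contributes the two values that fall outside the range.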

For categorical variables, the class values are used as bins, capped at 20 unique values. Model Monitor automatically adds one guard bin named 'Untrained Classes'. For training data, this guard bin has a zero count (unless the binsCategories override strategy is used or the variable has more than 20 unique values). For prediction data, however, the counts of all classes that were not present in the training data fall in this bin. You can use this bin to detect new classes that were not seen during training.

Use the following attributes in the Monitoring config JSON to override these defaults and fine-tune the bin creation.

Note: After a model is registered, you can't change its bins.

For numerical data columns, you can use one of the following approaches:

binsNum
- This takes a positive integer >= 2 and <= 20 as input.
- The Model Monitor will create that number of equal-width bins for the numerical variable.
- The Model Monitor uses the max and min value in the training dataset to determine the bin widths.
- The Model Monitor will add two guard bands in addition to the user-defined bins.
- For example:
  - "binsNum": 10
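As an illustration of how a binsNum value translates into bin edges, the sketch below derives equal-width edges from the training minimum and maximum. It assumes the valid range is 2 to 20, in line with the 20-bin cap; the function name `equal_width_edges` is hypothetical, and guard bands are not included.

```python
import numpy as np


def equal_width_edges(training_values, bins_num):
    """Return bins_num + 1 equally spaced edges spanning the training range."""
    if not isinstance(bins_num, int) or not 2 <= bins_num <= 20:
        raise ValueError("binsNum must be an integer between 2 and 20")
    values = np.asarray(training_values, dtype=float)
    # N bins need N + 1 edges, from the training minimum to the training maximum.
    return np.linspace(values.min(), values.max(), bins_num + 1)


edges = equal_width_edges([0.0, 2.5, 10.0], 10)
print(edges)
```

With a training range of 0 to 10 and "binsNum": 10, each bin is one unit wide.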
binsEdges
- This takes an array of real numbers as input.
- Edges can be both positive and negative decimal numbers (except Infinity).
- These correspond to actual bin edges.
- To create N user-defined bins, users must provide N+1 bin edges.
- You can provide a minimum of 3 and maximum of 20 numbers or edges in the array.
- They must monotonically increase (lowest to highest) from the start of the array to end of the array.
- This is similar to the histogram_bin_edges function in NumPy.
- The Model Monitor will add two guard bands in addition to the user-defined bins.
- All provided values must be unique.
- For example:
  - "binsEdges": [-10, -4.5, -0.25, 0, 3.2, 5.11111]
- Examples of invalid "binsEdges":
  - "binsEdges": [-10, 4, -0.25, 0, 3.2, 5.11111] -> not monotonically increasing
  - "binsEdges": [-10, "XYZ", -0.25, 0, 3.2, 5.11111] -> string value present
  - "binsEdges": [1,2] -> fewer than 3 edges provided
  - "binsEdges": [1,2,2,4,6] -> duplicates present
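The binsEdges rules above can be checked before you register a model. The following is a minimal sketch of such a pre-check; the function name `validate_bins_edges` and its messages are hypothetical, and rejecting NaN alongside Infinity is an assumption rather than documented behavior.

```python
import math


def validate_bins_edges(edges):
    """Return a list of problems found in a binsEdges array; empty means valid."""
    if not isinstance(edges, list):
        return ["binsEdges must be an array"]
    problems = []
    if not 3 <= len(edges) <= 20:
        problems.append("expected between 3 and 20 edges")
    for e in edges:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(e, bool) or not isinstance(e, (int, float)):
            problems.append(f"non-numeric value: {e!r}")
        elif not math.isfinite(e):
            problems.append(f"non-finite value: {e!r}")
    if any("non-" in p for p in problems):
        return problems  # ordering checks need purely numeric input
    if len(set(edges)) != len(edges):
        problems.append("duplicates present")
    if sorted(edges) != edges:
        problems.append("not monotonically increasing")
    return problems


print(validate_bins_edges([-10, -4.5, -0.25, 0, 3.2, 5.11111]))
print(validate_bins_edges([1, 2, 2, 4, 6]))
```

An empty list means the array satisfies every rule listed in this section; each invalid example above produces at least one message.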

For categorical data columns, you can use the following approach:

binsCategories
- This takes an array of strings as input (length must be less than 100) and creates a bin for each of them.
- The values should ideally correspond to class values present in the data column in the training data or to class values that you expect to find in the prediction data.
- Counts of all other class values of the training and prediction data columns will fall in the 'Untrained Classes' guard bin.
- If the user has specified an Untrained Classes bin as part of the binsCategories, then it will correspond to the internal Untrained Classes bin.
- Example:
  - "binsCategories": ["red", "blue", "green", "white", "yellow"]
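The effect of binsCategories on counting can be sketched as follows: every listed class gets its own bin, and any other class value is counted in the 'Untrained Classes' guard bin. This is an illustration only; the function name `count_categories` is hypothetical and does not reflect Model Monitor's internals.

```python
UNTRAINED = "Untrained Classes"


def count_categories(values, bins_categories):
    """Count values per listed category; unlisted values go to the guard bin."""
    # A user-supplied 'Untrained Classes' entry maps onto the single guard bin.
    counts = {c: 0 for c in bins_categories if c != UNTRAINED}
    counts[UNTRAINED] = 0
    for v in values:
        key = v if v in counts else UNTRAINED
        counts[key] += 1
    return counts


prediction = ["red", "blue", "purple", "red"]
result = count_categories(prediction, ["red", "blue", "green"])
print(result)
```

Here "purple" is not in the configured categories, so it is the only value counted in the guard bin.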