Features Shared Distribution

dinaberenbaum
Aug 27, 2020

A shared distribution heatmap is a useful, visual way to examine the spread and plausibility of the data and to understand how different attributes appear together.

For each pair of features, we count the number of samples that have each combination of their values. For categorical features, the count is done per category. For continuous features, the values are first binned and the count is done per bin.

The x-axis and y-axis of the heatmap represent the bins or categories of the two chosen features.
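The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the toy `samples`, the bin edges in `age_edges`, and the helper names `bin_index` and `shared_distribution` are all hypothetical, and clamping out-of-range values into the edge bins is a choice made for the sketch.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(value, edges):
    """Return the index of the half-open bin [edges[i], edges[i + 1]) holding value."""
    i = bisect_right(edges, value) - 1
    # Assumption for this sketch: out-of-range values fall into the first or last bin.
    return min(max(i, 0), len(edges) - 2)


def shared_distribution(samples, age_edges):
    """Count samples for every (age bin, workclass) combination."""
    counts = Counter()
    for age, workclass in samples:
        counts[(bin_index(age, age_edges), workclass)] += 1
    return counts


# Hypothetical toy data: (age, workclass) pairs.
samples = [
    (18, "Private"), (19, "Private"), (25, "Federal-gov"),
    (37, "Private"), (42, "Federal-gov"), (44, "Federal-gov"),
    (61, "Without-pay"),
]
age_edges = [17, 30, 45, 90]  # three bins: [17, 30), [30, 45), [45, 90)

counts = shared_distribution(samples, age_edges)
for (age_bin, workclass), n in sorted(counts.items()):
    print(age_bin, workclass, n)
```

Each key of `counts` corresponds to one cell of the heatmap. Combinations that never occur are simply absent from the counter and read as zero. With pandas, combining `pd.cut` with `pd.crosstab` would give a similar table.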

The color of each cell shows the number of samples in that combination of bins/categories. The range bar above the figure lets you adjust the maximum and minimum boundaries of the color bar. Any value above or below the chosen boundaries is colored according to the closest edge. This makes it possible to emphasize the variability between values in a specific sub-range. If the chosen range is low, the variability of the lower values becomes more visible, and vice versa.
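The effect of the range bar can be pictured as clamping the counts before they are mapped to colors. The sketch below assumes that behavior; the `grid` values and the `clip_counts` helper are hypothetical and only meant to show why a low range brings out differences among small counts.

```python
def clip_counts(grid, vmin, vmax):
    """Clamp every cell count to [vmin, vmax] before it is mapped to a color."""
    return [[min(max(cell, vmin), vmax) for cell in row] for row in grid]


# Hypothetical counts: rows are categories, columns are bins.
grid = [
    [2, 5, 40],
    [0, 3, 900],
]

# A low range: the large cells (40 and 900) saturate to the same top color,
# while the small counts are spread over the whole color scale.
low_range = clip_counts(grid, 0, 10)

# A high range: all small counts collapse to the bottom color instead.
high_range = clip_counts(grid, 10, 900)
```

In matplotlib, passing `vmin` and `vmax` to `imshow` has the same clamping effect on the color scale.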

The heatmap allows you to examine some important aspects, including:

- How well is the data spread across these two features? For example, in the heatmap above, we can see that the "Without-pay" category has no representation in many of the age bins. This could reflect the real distribution, or it could point to skewed data.

- Are there any implausible combinations of features? For example, we don't expect to see many 17-19 year olds working in the federal government.

#explainableai #eda #features
