The similarity or difference of two datasets can be assessed in various ways, but statistical methods are perhaps the most rigorous.
Statistics offers many measures that can serve as comparison criteria:
- different kinds of averages;
- percentiles and modes;
- different kinds of deviation and skewness (asymmetry);
- simple and confidence intervals;
- correlations of distributions;
- quantiles and quartiles;
- kurtosis of values, etc.
This is a huge area of data science, and each of these measures has its own formula, some of which are hard to follow.
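To give a feel for what such formulas look like, here is a minimal sketch in Python of four of the measures listed above, written out by hand. The function names and the sample values are invented for illustration. Skewness and kurtosis use the plain moment-based (population) definitions, which is an assumption; other tools may apply sample-size corrections and return slightly different numbers.

```python
import math


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    squared_dev = sum((x - m) ** 2 for x in values)
    return math.sqrt(squared_dev / (len(values) - 1))


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (no sample-size correction)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (no correction)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Made-up sample values, used only to exercise the functions.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print("mean:", mean(sample))
print("sample std:", sample_std(sample))
print("skewness:", skewness(sample))
print("excess kurtosis:", excess_kurtosis(sample))
```

Even these short functions show why doing the arithmetic by hand quickly becomes tedious for dozens of measures.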
Fortunately, modern data analysis software spares us from calculating them by hand. One of the most powerful and reliable tools for this job is Excel.
The link below leads to a study comparing two datasets (male and female weights) using Excel's statistical tools. The file analyzes dozens of statistical parameters and includes several graphs of the distributions and of the comparison between the two sets. Such an in-depth analysis lets us judge reliably whether our statistical hypotheses hold.
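For readers who want to try the same kind of comparison outside Excel, the following is a rough sketch using Python's standard `statistics` module. It is not the analysis from the linked file. The weights are invented for illustration, the helper names `summarize` and `welch_t` are hypothetical, and Welch's t statistic is just one possible way to compare two means, chosen here as an assumption.

```python
import math
import statistics

# Hypothetical weights in kilograms, invented for illustration only.
male_kg = [72.5, 80.1, 68.4, 90.3, 77.8, 85.0, 74.2, 79.6]
female_kg = [58.2, 63.7, 55.1, 70.4, 61.9, 66.3, 59.8, 64.5]


def summarize(values):
    """Return a few descriptive statistics for one dataset."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
    }


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    var_a = statistics.variance(a)  # sample variance
    var_b = statistics.variance(b)
    std_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_error


for label, data in (("male", male_kg), ("female", female_kg)):
    print(label, summarize(data))

print("Welch t statistic:", welch_t(male_kg, female_kg))
```

A large absolute value of the t statistic suggests that the two means differ by more than the scatter within each group would explain; turning it into a p-value requires the t distribution, which this sketch leaves out.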