How to get the maximum out of dirty data?

Analysts and data scientists often have to work with raw, dirty data. A dataset may have no column names, its columns may be out of order, and tens of percent of its values may be missing.

All of these problems can be solved, of course, but dirty data severely complicates the work, and bringing it into a usable form takes extra time.

I had the opportunity to analyze health data on horses suffering from intestinal colic. The dataset was poorly compiled: it had no column names, and the column order was scrambled. The number of columns, 28, complicated matters further. In addition, more than 30% of the values were missing, the missing values were marked with “?” instead of NaN, and the numbers were stored as strings.
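Problems like these can usually be handled when the file is loaded. The snippet below is a minimal sketch, not the code used for the original analysis: the three-column sample, the whitespace separator, and the column names are assumptions made up for illustration, while the real file has 28 columns.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for the real 28-column file.
raw = io.StringIO(
    "1 38.5 66\n"
    "2 ? 88\n"
    "1 39.1 ?\n"
)

# Hypothetical column names; the real ones have to come from the
# dataset's documentation.
names = ["surgery", "rectal_temp", "pulse"]

# header=None: the file has no header row.
# na_values="?": treat the "?" placeholder as a missing value (NaN).
df = pd.read_csv(raw, sep=r"\s+", header=None, names=names, na_values="?")

# Convert any column still stored as text to numbers;
# values that cannot be parsed become NaN.
df = df.apply(pd.to_numeric, errors="coerce")

# Share of missing values per column.
missing_share = df.isna().mean()
print(missing_share)
```

Once the placeholders are real NaN values and the columns are numeric, the usual pandas methods for statistics and missing-data handling work as expected.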

Of course, Python and pandas could not analyze the data properly in that state. So my first step was to bring the dataset into a normal form. After that, I started filling in the missing values. I decided to do this not just with column averages, but by taking the correlations between columns into account.
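The post does not spell out the exact procedure, so the following is only one possible reading of "considering correlation": for each column with gaps, find the most strongly correlated other column, fit a straight line between the two, and predict the missing values from it. The function name `impute_by_correlation`, the column names, and the fallback to the median are all assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def impute_by_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs in each numeric column using its most correlated column."""
    out = df.copy()
    # Pairwise Pearson correlations; pandas ignores NaNs pair by pair.
    corr = df.corr().abs()
    for col in df.columns:
        missing = df[col].isna()
        if not missing.any():
            continue
        # Other columns, ranked by absolute correlation with `col`.
        candidates = corr[col].drop(col).dropna().sort_values(ascending=False)
        if not candidates.empty:
            best = candidates.index[0]
            both = df[[col, best]].dropna()
            if len(both) >= 2 and both[best].nunique() > 1:
                # Least-squares line: col ~ slope * best + intercept.
                slope, intercept = np.polyfit(
                    both[best].to_numpy(), both[col].to_numpy(), deg=1
                )
                fillable = missing & df[best].notna()
                out.loc[fillable, col] = slope * df.loc[fillable, best] + intercept
        # Assumed fallback: rows that could not be predicted get the median.
        out[col] = out[col].fillna(df[col].median())
    return out


# Hypothetical data in which pulse is exactly twice the respiratory rate.
demo = pd.DataFrame({
    "pulse": [40.0, 60.0, 80.0, 100.0, np.nan],
    "resp_rate": [20.0, 30.0, 40.0, 50.0, 60.0],
})
filled = impute_by_correlation(demo)
print(filled)
```

In this toy example the missing pulse is predicted from the respiratory rate and comes out at roughly 120, whereas filling with the column mean would have given 70. A column that is entirely empty stays empty, since there is nothing to fit or to take a median of.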

Eventually, after a few hours of work, I had a completely new dataframe with no missing data at all. Even more interesting, the statistical parameters of the data did not change. In other words, I managed to enrich the dataset by 30% without distorting its overall characteristics.
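A claim like this can be checked by comparing summary statistics before and after the fill. The sketch below assumes that "statistical parameters" means the mean, standard deviation, and median of each column; the helper `compare_stats` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def compare_stats(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Relative change of mean, std and median for each column."""
    stats = ["mean", "std", "50%"]
    b = before.describe().loc[stats]  # describe() skips NaNs
    a = after.describe().loc[stats]
    # One row per column; a statistic that was zero before gives inf or NaN.
    return ((a - b) / b.abs()).T


# Hypothetical column with one gap, filled here with the plain column mean.
before = pd.DataFrame({"pulse": [40.0, 60.0, 80.0, np.nan]})
after = before.fillna(before.mean())

change = compare_stats(before, after)
print(change)
```

In this toy case the mean and median stay the same while the standard deviation shrinks, which is exactly the kind of drift such a comparison is meant to reveal and a reason to prefer a correlation-aware fill over plain averages.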
