Main challenges of Machine Learning
We are living in a wonderful era of Machine Learning, enjoying applications powered by machine learning and deep learning.
One of the keys that brought us to this stage is a four-letter word: “data”. To perform well, those applications rely on the huge quantity of data that is generated every day.
There is real competition between internet companies, and even between countries, over who can gather more data, the new oil. As Dr. Kai-Fu Lee said, China is the new Saudi Arabia.
Nevertheless, not everything is ready to work out of the box; there are still difficulties to be solved.
The core difficulty is an insufficient quantity of training data.
For a child to learn to identify an elephant, the kid needs just a few sample pictures or drawings of that animal. That’s not the case for a computer. To get close to the child’s performance at identifying elephants, a computer can need millions of training samples. This huge quantity of data is not always available and can be costly to gather.
In 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including very simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data.
In addition, sampling bias is another data-related challenge for Machine Learning. If the sample is not representative of the population we want to generalize to, the conclusions drawn from it will be skewed, no matter how large the sample is.
The 1936 US presidential election (Landon vs. Roosevelt) is a classic example of this problem. The Literary Digest conducted a very large poll, sending mail to about 10 million people. They got 2.4 million answers and predicted with high confidence that Landon would get 57% of the votes. But Roosevelt won the election with 62% of the votes. There is a reason for that. To obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, and club membership lists. All these lists tended to favour wealthier people, who were more likely to vote Republican, hence for Landon.
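To make the effect of a biased sampling frame concrete, here is a small simulation sketch. It is not a reconstruction of the 1936 poll: the share of wealthy voters, the voting probabilities, and the chance of appearing on a mailing list are all made-up numbers chosen for illustration, and the function name `simulate_poll` is hypothetical.

```python
import random


def simulate_poll(n_voters=100_000, seed=0):
    """Compare a random sample with a sample drawn from a biased list.

    All probabilities below are illustrative assumptions, not historical data.
    """
    rng = random.Random(seed)

    # Build a synthetic electorate: (is_wealthy, votes_for_landon).
    voters = []
    for _ in range(n_voters):
        wealthy = rng.random() < 0.30           # assumed share of wealthy voters
        p_landon = 0.70 if wealthy else 0.25    # assumed voting preferences
        voters.append((wealthy, rng.random() < p_landon))

    def landon_share(group):
        return sum(vote for _, vote in group) / len(group)

    # Ground truth over the whole electorate.
    true_share = landon_share(voters)

    # A simple random sample: every voter is equally likely to be polled.
    random_sample = rng.sample(voters, 5_000)

    # A biased frame: wealthy voters are far more likely to be on the lists
    # (phone directories, subscriptions, clubs) used to send the poll.
    biased_frame = [
        (wealthy, vote)
        for wealthy, vote in voters
        if rng.random() < (0.80 if wealthy else 0.10)
    ]

    return true_share, landon_share(random_sample), landon_share(biased_frame)


if __name__ == "__main__":
    truth, random_est, biased_est = simulate_poll()
    print(f"true Landon share:      {truth:.3f}")
    print(f"random sample estimate: {random_est:.3f}")
    print(f"biased list estimate:   {biased_est:.3f}")
```

Under these assumptions, the small random sample lands close to the true share, while the much larger biased list points to the wrong winner. The size of the poll does not compensate for a frame that over-represents one group.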
Another challenge is poor-quality data. A poor-quality data set is one with a large number of outliers, missing labels, duplicated items, incorrectly formatted samples, and so on.
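A quick audit can surface these problems before training. The sketch below is one possible way to do it using only the Python standard library. The row layout (a dict with `value` and `label` keys), the function name `audit_dataset`, and the choice of the 1.5 × IQR rule for outliers are all assumptions made for this example; real data sets will need checks adapted to their own schema.

```python
from statistics import quantiles


def audit_dataset(rows):
    """Count common data-quality problems in a list of {"value", "label"} dicts."""
    # Missing labels: None or an empty string.
    missing_labels = sum(1 for r in rows if r.get("label") in (None, ""))

    # Duplicates: rows whose (value, label) pair has already been seen.
    seen = set()
    duplicates = 0
    for r in rows:
        key = (r.get("value"), r.get("label"))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Incorrect format: the value is expected to be numeric.
    values = [
        r["value"]
        for r in rows
        if isinstance(r.get("value"), (int, float))
        and not isinstance(r.get("value"), bool)
    ]
    bad_format = len(rows) - len(values)

    # Outliers: numeric values outside 1.5 * IQR from the quartiles.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]

    return {
        "missing_labels": missing_labels,
        "duplicates": duplicates,
        "outliers": outliers,
        "bad_format": bad_format,
    }
```

Calling `audit_dataset(rows)` on a list of such dicts returns a small report that can guide the cleaning step: which rows to relabel, drop, or inspect by hand.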
The challenges mentioned above can really lower the performance of our model. I think that’s why a new approach is being developed: the data-centric approach. This new way of dealing with ML problems focuses more on improving data quality rather than only the code.