The myth of modern machine learning


There are lots of cargo cults in technology. Blockchain is the most obvious current one, with only two uses: 1. gambling on cryptocurrency, and 2. using it as a term to attract investment before pivoting to another, more suitable technology. Machine learning seems like it shouldn’t fall into this trap, but for a number of reasons it’s currently high on my list of cargo cult tech.

1. It’s expensive

To do machine learning you need an accurately tagged training data set. To do machine learning well on a large, general dataset you need a lot of accurately tagged data. Most companies that do this need dedicated teams of people to tag that data; companies like Google will have teams of PhDs tagging data.

2. It’s a forever process

You don’t just do machine learning once and stop. As soon as a model has been created it starts to decay, or more accurately the world changes, which makes the model less accurate. Sometimes this is due to an adversarial situation: think spammers trying to beat your amazing machine-learned spam filter. More often it will just be due to the environment changing: think new words being introduced into a language after you’ve trained your translation models. You need constant resources for tagging and for updating your models.
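As a minimal sketch of what that ongoing work looks like, assume you periodically score the live model against a freshly tagged sample of recent traffic and flag it for retraining when accuracy slips. The threshold and the retraining trigger here are hypothetical, chosen purely for illustration:

```python
# Hypothetical sketch: check a deployed model against freshly labelled data
# and flag it for retraining when accuracy decays past an assumed threshold.
from sklearn.metrics import accuracy_score

ACCURACY_FLOOR = 0.92  # assumed acceptable accuracy, for illustration only

def needs_retraining(model, recent_samples, recent_labels):
    """Score the live model on a freshly tagged sample of recent data."""
    predictions = model.predict(recent_samples)
    current_accuracy = accuracy_score(recent_labels, predictions)
    return current_accuracy < ACCURACY_FLOOR

# In a scheduled job: if the world has drifted, the expensive loop starts
# again -- re-tag new data, retrain, re-validate, redeploy.
# if needs_retraining(model, recent_samples, recent_labels):
#     retrain_pipeline.run()  # hypothetical pipeline trigger
```

The check itself is cheap; the point is that every time it fires you are back to paying for tagging, training and validation all over again.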

3. There are rapidly diminishing returns

Google Photos has a huge repository of photos to use as training data. They have the most modern machine learning systems running on custom hardware. Yet they perform object recognition at about the same level as my local NAS running an underpowered Intel CPU, using models that would have been generated from a far smaller corpus. The additional accuracy from more training data seems to diminish very rapidly, while the costs scale pretty much linearly.

4. It’s opaque

There are often two ways of solving a problem. Take the example of trying to predict batches with defects in a manufacturing plant. It would be reasonably trivial to load all of the input data into something like TensorFlow and build a model that accurately predicts which batches are likely to be faulty, so you can stop them being sent to your customer. You might even be able to get some nice PR out of how you solved the problem. What you generally won’t get from this method is an understanding of why defects are happening, which is what you need to prevent them in the future. More traditional statistical methods are more likely to reveal how the inputs really relate to each other and what the causes of the faulty batches are. It’s less impressive, but it’s probably the better business tool in the long run.
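To make the contrast concrete, here is a hedged sketch of the more traditional route: a plain logistic regression whose coefficients you can actually read, rather than a black box that only outputs a defect probability. The file name and column names (line_speed, temperature, supplier_b, defective) are hypothetical stand-ins for whatever process data a plant actually records:

```python
# Hypothetical sketch: an interpretable model of defect causes, not just a predictor.
import pandas as pd
import statsmodels.api as sm

# Assumed data: one row per batch, with process inputs and a 0/1 "defective" label.
batches = pd.read_csv("batches.csv")  # hypothetical file
inputs = sm.add_constant(batches[["line_speed", "temperature", "supplier_b"]])
model = sm.Logit(batches["defective"], inputs).fit()

# The summary shows which inputs are associated with defects and how strongly,
# which is what you need to change the process, not just to filter its output.
print(model.summary())
```

The coefficient table is less exciting than a deep model, but it tells you which dial on the production line to turn.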

5. It distracts from solving the real problems

One of the things I’m a real stickler for is data normalization. As early as possible in a data pipeline, all data should be in the same format, preferably at the point of generation. What I repeatedly see is companies using machine learning to cleanse dirty data instead of fixing the cause of the dirty data. More often than not this is a sales-led decision rather than an engineering one, made to avoid a conversation with a client about how they’ve implemented their tech incorrectly. That’s potentially fine for the first client, but then the next client comes along and the onboarding team don’t even try to get the data correct, because it’ll be magically fixed by the ML. Very quickly the inbound data is pretty much just junk that is extremely expensive to process, eating heavily into profits.
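What fixing it at the source looks like is boring: validate and normalize every record at ingestion and push bad data back to whoever produced it. The sketch below assumes a made-up schema (customer_id, event_type, timestamp) purely to show the shape of the idea:

```python
# Hypothetical sketch: normalize at the point of ingestion, so dirty data is
# rejected back to the source instead of being "fixed" downstream by ML.
from datetime import datetime, timezone

REQUIRED_FIELDS = {"customer_id", "event_type", "timestamp"}  # assumed schema

def normalize_record(raw: dict) -> dict:
    """Reject records missing required fields; coerce the rest to one format."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"rejected at ingestion, missing fields: {missing}")
    return {
        "customer_id": str(raw["customer_id"]).strip().upper(),
        "event_type": str(raw["event_type"]).strip().lower(),
        # One canonical timestamp format (ISO 8601, UTC) from day one.
        "timestamp": datetime.fromisoformat(raw["timestamp"])
                             .astimezone(timezone.utc)
                             .isoformat(),
    }
```

A rejected record forces the awkward conversation with the client once, instead of paying for an ML cleansing layer on every record forever.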


Obviously machine learning does have positives. Google and Amazon’s work on speech recognition models has made voice assistants feasible, and on-device voice recognition has improved very rapidly. Some of the work on analysing medical imagery to better diagnose cancers is amazing and still improving. For most scenarios, though, the correct tool is probably something more transparent, and if you’re building your whole business on delivering a machine learning product, just be aware that the cost of competing will only get higher, for diminishing returns.