Finding the data for your business problem
It seems an obvious, even simplistic, thing to say, yet many projects end up dying because this very first step is overlooked. We have loads of data and tons of open resources, but if they are not specific to the use case, relying on them amounts to answering a different question from the one the business is trying to address. Solving a problem with ‘any’ available data would be like building a chatbot that answers in English to native French speakers: technically feasible, but completely irrelevant.
The Data case: Go holistic, or get skewed (results)
Let’s take a well-known business case as an example. Say you are an analytics company and you want to predict the price of bread for the next quarter. You have past and forecast weather data, and you think it might be a good idea to base your price prediction on it, given the correlation between agricultural production rates and weather. Guess what: you’ll end up with a terrible result. Weather data is meaningful when it comes to predicting food prices, but it is incomplete, because the weather is not the only variable in the price equation. To get a relevant information structure for your prediction, you should add more contextual data about the bread production chain, such as global supplies, energy, and distribution costs, with some local focus, and more besides. Not doing so would probably spell the end of your company.
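To make the point concrete, here is a minimal sketch of what leaving a variable out does to a prediction. Everything in it is an assumption made for illustration: the data is synthetic, the "price" formula and its coefficients are invented, and `energy` stands in for any of the contextual variables mentioned above. It fits an ordinary least squares model twice, once on weather alone and once on weather plus energy, and compares the error on held-out rows.

```python
import random


def fit_least_squares(rows, targets):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def mean_abs_error(coefs, rows, targets):
    return sum(abs(predict(coefs, r) - y) for r, y in zip(rows, targets)) / len(targets)


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def make_rows(n):
    """Synthetic rows: the price formula below is invented, not real bread data."""
    rows, prices = [], []
    for _ in range(n):
        weather = rng.uniform(0.0, 1.0)  # hypothetical weather index
        energy = rng.uniform(0.0, 1.0)   # hypothetical energy cost index
        price = 2.0 - 0.5 * weather + 1.5 * energy + rng.gauss(0.0, 0.05)
        rows.append((weather, energy))
        prices.append(price)
    return rows, prices


def weather_only(rows):
    """Keep only the weather column, dropping the contextual variable."""
    return [(w,) for w, _ in rows]


train_rows, train_prices = make_rows(200)
test_rows, test_prices = make_rows(50)

coefs_weather = fit_least_squares(weather_only(train_rows), train_prices)
coefs_full = fit_least_squares(train_rows, train_prices)

error_weather = mean_abs_error(coefs_weather, weather_only(test_rows), test_prices)
error_full = mean_abs_error(coefs_full, test_rows, test_prices)

print(f"weather only     : mean abs error = {error_weather:.3f}")
print(f"weather + energy : mean abs error = {error_full:.3f}")
```

The model is deliberately trivial. The only thing that changes between the two runs is which columns the model is allowed to see, which is exactly the data selection decision discussed above.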
Data quality first!
The same goes for Natural Language Processing (NLP): practitioners can extract some understanding from raw text inputs, but the output is flawed because the structure of information in raw text is incomplete. Research in deep learning is all about automatically deducing this information structure (the syntactic and semantic features of a given language) from raw text, but there is still a long way to go. Today, encoding the relevant structure of information in a model comes before addressing the Natural Language Understanding / Natural Language Generation machine learning production issues, and the data quality component should be embedded in the AI system from inception, during the design phase. In NLP, it all comes down to linguistics: providing the data we need to drive significant results, with a large field of relevant, engineerable features able to summarize the complex information structures of text as well as domain knowledge.
The Product case: data quality = improved performance
If you are not convinced, here is another reason to want data quality in your NLP applications: in some cases, with significant information encoded in your data, a simple supervised learning model can outperform a complex deep learning one. Don’t get me wrong: I’m not saying deep learning is useless in NLP and that you shouldn’t even try (do try transfer learning and weak supervision!); I’m saying deep learning is not the default solution. Just as in any data science pipeline, deep learning models should always be benchmarked against competing statistical models, weighing prediction performance on the one hand against cost and interpretability on the other. (Plus: any Machine Learning (ML) model should be benchmarked against other possible technologies, but that is a topic for another post.) In some cases, such as sentiment analysis, a neural network that outperforms a well-defined Bayesian classifier could be so expensive that it potentially makes the cost of your product unviable. In other words, just because you can do it doesn’t mean you need to.
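As a rough illustration of the kind of baseline worth having before reaching for a neural network, here is a minimal sketch of a multinomial Naive Bayes sentiment classifier with add-one smoothing, written in plain Python. The class name, the whitespace tokenizer, and the handful of toy sentences are all assumptions made for this example, not real data or a production design. The `accuracy` helper accepts any prediction function, so a heavier candidate model could be scored on the same held-out set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; a real pipeline would use richer linguistic features."""
    return text.lower().split()


class NaiveBayesBaseline:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predict_fn, texts, labels):
    """Score any model exposing a text -> label function on the same held-out set."""
    hits = sum(predict_fn(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Toy, hand-written sentences for illustration only (not real data).
train_texts = [
    "great product love it",
    "terrible service very bad",
    "love the great support",
    "bad experience terrible app",
]
train_labels = ["pos", "neg", "pos", "neg"]
test_texts = ["great support", "terrible experience"]
test_labels = ["pos", "neg"]

baseline = NaiveBayesBaseline().fit(train_texts, train_labels)
baseline_acc = accuracy(baseline.predict, test_texts, test_labels)
print(f"Naive Bayes baseline accuracy: {baseline_acc:.2f}")
```

Accuracy is only one side of the comparison. Training time, inference cost, and how easily the model’s decisions can be explained belong in the same benchmark table.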
Ultimately, AI systems designed as products (with the end user in mind), data quality exploration, and benchmarked models lead the way to a faster time to market and significantly lower production costs.
One last takeaway: integrate the data collection pipeline into the core of your product (which, for sure, is now the new name of your AI system), and you’re good to go!