At the moment, not much.
I am contributing to a Kaggle competition which is trying to use ML techniques to answer freeform questions from medical literature. At first, the technical folks were running helter-skelter trying to come up with useful visualizations (but not coming up with much).
Then one day an epidemiologist showed up and started talking about visualizations in the form of evidence gap maps, and people realized that getting useful data out of this dataset wasn’t going to be easy with the ML tools we currently have.
That seems to have prompted a thread on what would actually make competition entries useful.
Someone on the forum wrote a great comment in that thread which very accurately summarizes the state of the art.
This task is much more like a real-world data science project than many Kaggle ML competitions, which are cleverly designed to give contestants a very specific task. The Covid-19 challenge is very broad and diverse, the deadline is “yesterday”, and contestants have to define the problem themselves to some extent.
So it makes a lot of sense to get the basics in place first:
- ETL – sourcing, cleaning, joining, and representing the data
- Filtering – how can dataset be reduced on a problem-by-problem basis (a paper about bats from 2007 isn’t going to tell you anything about the social impact of travel restrictions, you don’t need AI to tell you that)
- Additional metadata – how can you augment usefully (e.g. language detection, boolean fields on whether the paper covers Covid-19, etc.)
- Define your problem – which question are you trying to answer? How will you measure success (Kaggle cannot answer this – it depends on how you’ve defined your problem)?
Only at that point does it make any sense to even consider which ML / AI / NLP techniques you’re going to apply to the dataset.
And bear in mind, academic papers are not written in “natural language”. They are more structured and codified than, for example, a novel, a news article, or a Tweet. Terms are used in more precise ways. There are reporting conventions that are specific to the field of study. A lot of this stuff can be detected with old-fashioned data mining tools – i.e. keyword searches, regular expressions, pattern matching.
What isn’t remotely feasible (IMO) is to reinvent the ontology of epidemiology in the next month. (emphasis mine) Even if it were possible, would it help? Think about your audience, and what they’re up against. Pick one problem and see if you can hack together something that works. Iterate through it, get feedback from someone with domain knowledge. Do some background research of your own.
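To make the commenter's "basics first" point concrete, here is a minimal Python sketch of the filtering and metadata steps using nothing but keyword matching and a regular expression. It is only an illustration under stated assumptions: the `Paper` record, the field names, the `COVID_PATTERN` term list and both helper functions are hypothetical and are not taken from the competition dataset or from any entry.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical term list; a real project would agree on it with a domain expert.
COVID_PATTERN = re.compile(
    r"\b(covid[- ]?19|sars[- ]?cov[- ]?2|2019[- ]?ncov)\b", re.IGNORECASE
)


@dataclass
class Paper:
    # Hypothetical record layout, not the actual dataset schema.
    title: str
    abstract: str
    year: int


def mentions_covid(paper: Paper) -> bool:
    """Metadata step: a boolean flag for whether the paper mentions Covid-19."""
    return bool(COVID_PATTERN.search(f"{paper.title} {paper.abstract}"))


def filter_papers(papers: List[Paper], min_year: int, keywords: List[str]) -> List[Paper]:
    """Filtering step: keep papers from min_year onward that contain any keyword."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for paper in papers:
        text = f"{paper.title} {paper.abstract}".lower()
        if paper.year >= min_year and any(k in text for k in lowered):
            kept.append(paper)
    return kept


papers = [
    Paper("Coronaviruses in bats", "Survey of bat populations.", 2007),
    Paper(
        "Travel restrictions and COVID-19",
        "Social impact of travel restrictions during the outbreak.",
        2020,
    ),
]

# Narrow the dataset for one specific question before reaching for any ML.
relevant = filter_papers(papers, min_year=2019, keywords=["travel restriction"])
flags = [mentions_covid(p) for p in papers]
```

With this toy data, the 2007 bat paper is dropped by the filter without any AI involved, which is exactly the commenter's example.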
There are some very interesting takeaways from the specific comment, the full thread and even the entire competition in general.
A lot of ML/AI capabilities are overhyped
On the surface, the tasks posed in the challenge seem simple, but the ML tools we have today cannot actually do them.
You don’t need a background in ML to be able to help
Based on the comment above, you can choose one of the tasks you find interesting, formulate a good question and subquestion, get some reasonable answers, and you would have already made a contribution. As the commenter points out, while you do need to be technical, you don’t need a background in ML to be able to help out.
Get insights into NLU and ML for text by following the progress
I think following the progress of this competition will give you a lot of insight into the state of the art in using NLU and machine learning to analyze textual data.
Personally, I think we have a long way to go.
You cannot “auto-generate” a chatbot
Even in Dialogflow, you need to first define a set of questions and answers. Dialogflow won’t automatically generate the training phrases and responses for you.
And no, the knowledge connector feature cannot help you either. A good way to understand why is to try it: build a competition entry using the knowledge connector feature and see how far you get.
This is also why it isn’t so easy to “auto-generate” a chatbot by throwing a text corpus at some ML tool (quite a few clients have asked me about this).
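To illustrate what "define a set of questions and answers" means in practice, here is a rough Python sketch of hand-authored intent data. It does not use the Dialogflow API; the `Intent` class, the intent names, the placeholder text and the `incomplete_intents` helper are all hypothetical, chosen only to show that someone has to write both sides of every exchange.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Intent:
    # Hypothetical structure: one question type plus its hand-written answers.
    name: str
    training_phrases: List[str] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)


INTENTS = [
    Intent(
        name="travel_restrictions",
        training_phrases=[
            "What is known about the impact of travel restrictions?",
            "Do travel bans help?",
        ],
        responses=["<answer written and reviewed by a domain expert>"],
    ),
    # Nobody has written an answer for this one yet, so the bot cannot reply.
    Intent(
        name="incubation_period",
        training_phrases=["How long is the incubation period?"],
    ),
]


def incomplete_intents(intents: List[Intent]) -> List[str]:
    """Return the names of intents missing training phrases or responses."""
    return [
        intent.name
        for intent in intents
        if not intent.training_phrases or not intent.responses
    ]


missing = incomplete_intents(INTENTS)
print("Intents still needing human-written content:", missing)
```

The gap flagged by the check is the part no tool fills in for you: a text corpus does not tell you which questions matter or what a trustworthy answer looks like.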