Random Thoughts and Observations on Data Science and Beyond

The Data Science Summer


First Published on July 18, 2018

Data Science and Artificial Intelligence are closely related yet different. In this article I briefly describe three key differences between the two, explain why these differences matter, and finally share why I think we are in the midst of a Data Science summer!

How AI and Data Science differ

Research vs. Applied: All Artificial Intelligence problems were or are research problems. Think about playing chess, planetary exploration, driverless cars… These were or are exciting research problems. By its very nature, the outcome of research is not certain. The concerns about an AI winter¹ are probably overblown, just like the hype surrounding AI. Some research problems are likely to be solved earlier than others². For example, AI-based planetary exploration is likely to become a reality sooner than the self-driving car; it is already giving results³. And so, unless hype reaches stratospheric levels, concerns about an AI winter are probably misplaced, given that AI is essentially a research problem. Data Science, in contrast, is essentially about applications.

Domains: AI operates at the intersection of Computer Science, Statistics and Language/Computer Vision/Robotics⁴, while Data Science operates at the intersection of Computer Science, Statistics and a Domain (other than Language/Computer Vision/Robotics).

Problem Statements: AI problem statements are likely to be better understood, at least intuitively. For example, consider the problem statements ‘Find a new planet’, ‘Self-driving car’ or ‘Diagnose a cancer’. We often have some notion of what they mean because of our own experiences, like driving a car, or because of what we think astronomers or oncologists do. On the contrary, Data Science problem statements are likely to be less well understood intuitively. For example, consider the problem statement ‘Recommend a book’: it seems vague, and humanly impossible for most bookshops and libraries.

Note: The above three differences are generalizations, and there can be exceptions, such as using a machine to make an oncology assessment. For the purposes of this article, using a machine for an oncology assessment is an AI problem and not a Data Science problem (on two of the three parameters mentioned above it’s an AI problem). One could take the view that in such cases there isn’t a difference between AI and Data Science. However, for the purpose of judging whether there is a risk of a Data Science winter, the author believes this classification is both material and correct.

Does that mean AI is always about research? No. When the domain is Language, Computer Vision or Robotics, it’s possible that consumer AI research will be consumed directly by the market. Examples include Google Search (Language), personal assistants like Alexa (Language) or the self-driving car (Computer Vision and Robotics) once the pilots are successful. Think of this kind of research as Drug Discovery on steroids. Additionally, a huge amount of data gets collected, at a scale not generally seen in the pharmaceutical industry.

How AI supports Data Science

In the age of open-source software, everything that’s published gets applied, especially when it’s available in a public software code repository! This is also true of Artificial Intelligence. And so all applications of AI that are not pure research problems get subsumed into Data Science!

Does the difference between AI and Data Science matter?

Mostly, yes!

Difference in skill sets: The skill sets and gestation periods for AI and Data Science projects are very different. Research will generally take more time than applying the methodology invented through that research. While an AI researcher and a Data Science practitioner could at times morph into each other, the former relies more on analytical skills whereas the latter relies more on synthesis skills. Let me explain. Formulating a Data Science problem is tough! That’s because it often requires an inter-disciplinary understanding. Furthermore, it’s hard to imagine Data Science problems without being close to the problem.

In the midst of a Data Science summer

The availability of near cutting-edge AI work in the public domain for Data Science practitioners, cheap computational power thanks to a new variant of Moore’s Law⁵, and the Python⁶ programming language mean that inter-disciplinary work is relatively easier. Now, there is an almost never-ending supply of problems for those who are close to them. And if those close to the problems recognize them (that’s not always easy), then we have a long Data Science summer ahead.

End Notes:

1. AI Winter (Wikipedia article): https://en.wikipedia.org/wiki/AI_winter

2. We will probably trust self-driving cars long before we trust a machine to overrule our doctors (there is very little utility when the machine has the same opinion as the doctor). Also, see https://www.statnews.com/2017/09/05/watson-ibm-cancer/ and https://www.forbes.com/sites/matthewherper/2017/02/19/md-anderson-benches-ibm-watson-in-setback-for-artificial-intelligence-in-medicine/#57def9673774

3. NASA press release: https://www.nasa.gov/press-release/artificial-intelligence-nasa-data-used-to-discover-eighth-planet-circling-distant-star

4. This is very loosely based on the definition of AI given in Russell & Norvig.

5. Sebastian Raschka explains that Moore’s Law continues to be seen in practice when we consider GPUs. GPUs are inherently suitable for large computations. GPUs are cheap in comparison with CPUs.

6. Python is the lingua franca of Data Science. It’s less cryptic than many other programming languages and offers an excellent stepping stone for people without a Computer Science background to learn programming.

Aniruddha M Godbole has interests in Computer Science, Statistics and Finance. These are his personal views. These views are evolving and this article will likely get updated. This was first published at https://www.linkedin.com/pulse/data-science-summer-aniruddha-godbole/