It is interesting to read what some statisticians write about data science on the American Statistical Association (ASA) blog. Most of us don't care about our job title - there are so many breeds of statisticians and data scientists after all - and the two fields do overlap to some extent. While I was once a statistician, I now call myself a data scientist or business scientist. Anyway, below are some extracts from very lively and interesting discussions taking place on the ASA blog.
Tommy Jones posted The Identity of Statistics in Data Science on the American Statistical Association (ASA) website in December 2015. In his long and very interesting article, he wrote (this is just a tiny extract):
Judging by current statistics curricula, statistics is more closely tied to the mathematics of probability than to fundamentals of data management.[...] As models have become more accurate, they have also become more complex.
Dogling Yan commented:
In that data analyst job, I barely used any statistical models because people don’t really care about p-values. Also, with the size of current datasets, p-values are always very small. The models, analysis methods that most people learned at school are not very useful since the simple model and more valid and complex models tend to give the same conclusion when sample size is large.
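The large-sample point in this comment can be checked numerically. The sketch below is only an illustration under stated assumptions: a two-sample z-test with equal group sizes and a known standard deviation, where the observed mean difference is held fixed at a tiny value (0.01 standard deviations). The function name and the numbers are hypothetical, not taken from the ASA discussion. Note that p-values shrink this way only when some nonzero difference is present, however small.

```python
import math


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value of a two-sample z-test (equal group sizes, known sd)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Same tiny observed difference, increasingly large samples (assumed values).
effect = 0.01
for n in (100, 10_000, 1_000_000, 10_000_000):
    print(n, two_sided_p_value(effect, 1.0, n))
```

With a few hundred observations the difference is nowhere near significant. With millions of observations the same practically negligible difference produces a vanishingly small p-value, which is why significance alone says little on very large data sets.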
As a data scientist, I work on making models (actually, absence of models, but instead data-driven systems) simpler, not more sophisticated, and fit for black-box processing of big data in production mode. That is, robustness is more important than 100% accuracy, especially if your data is 70% accurate. I also work on designing a new statistical framework that is free of mathematics, traditional probability theory, random variables, and so on - so that anyone who knows Excel can learn it, even to compute confidence intervals or build more elaborate forecasting systems. It will be published in my upcoming book, Data Science 2.0.
Jennifer Lewis Priestley also posted on ASA, in January 2016: Data Science: The Evolution or the Extinction of Statistics?
In this article, she wrote:
While data scientists can do a great many things I can’t do—mainly in the areas of coding, API development, web scraping, and machine learning—they would be hard pressed to compete with a PhD student in statistics in supervised modeling techniques or variable reduction methods.
Read my article about a fast, efficient, combinatorial algorithm for feature selection that uses predictive power to jointly select variables. It is the data science approach to variable reduction and variable generation. Likewise, supervised modeling - which also belongs to machine learning - is not foreign to data scientists. Read about my automated indexation/tagging algorithm, used for taxonomy creation/maintenance or cataloguing: it performs clustering of n data points in O(n), and can cluster billions of web pages in very little time. It is also used to turn unstructured data into structured data.
And my reply to someone (Peter) who commented on LinkedIn, saying that "the feature selection method mentioned in the blog is still a heuristic method i.e. no guarantee to find the optimal subset of variables."
Peter, data scientists are usually interested in local optima that are easy to detect and that provide almost the same yield as the global optimum. The global optimum has two drawbacks: (1) it could be an unstable optimum, and (2) it might take far more time to compute if the data set is immense.
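To make the trade-off between local and global optima concrete, here is a minimal sketch. It is not the feature selection algorithm from my article, which is not described in this post. It is a generic greedy forward selection compared with an exhaustive search over all subsets of the same size. The scoring function (training accuracy of a majority-vote lookup table on the chosen features), the synthetic data, and all names are assumptions made for illustration only.

```python
import itertools
import random


def joint_score(rows, labels, subset):
    """Training accuracy of a majority-vote lookup table on the chosen features."""
    counts = {}
    for row, y in zip(rows, labels):
        key = tuple(row[j] for j in subset)
        counts.setdefault(key, [0, 0])[y] += 1
    return sum(max(c) for c in counts.values()) / len(rows)


def greedy_select(rows, labels, k):
    """Heuristic: add one feature at a time, keeping the best joint score."""
    selected, evaluations, best_s = [], 0, -1.0
    for _ in range(k):
        best_j, best_s = None, -1.0
        for j in range(len(rows[0])):
            if j in selected:
                continue
            s = joint_score(rows, labels, selected + [j])
            evaluations += 1
            if s > best_s:
                best_j, best_s = j, s
        selected.append(best_j)
    return sorted(selected), best_s, evaluations


def exhaustive_select(rows, labels, k):
    """Global optimum among all subsets of size k (cost grows combinatorially)."""
    best_subset, best_s, evaluations = None, -1.0, 0
    for subset in itertools.combinations(range(len(rows[0])), k):
        s = joint_score(rows, labels, subset)
        evaluations += 1
        if s > best_s:
            best_subset, best_s = list(subset), s
    return best_subset, best_s, evaluations


# Synthetic data (assumed): 8 binary features, label driven by the first three,
# with 10% of the labels flipped at random.
rng = random.Random(0)
rows = [[rng.randint(0, 1) for _ in range(8)] for _ in range(300)]
labels = [int(r[0] + r[1] + r[2] >= 2) ^ int(rng.random() < 0.1) for r in rows]

greedy = greedy_select(rows, labels, 3)
exhaustive = exhaustive_select(rows, labels, 3)
print("greedy:    ", greedy)
print("exhaustive:", exhaustive)
```

For 8 features and subsets of size 3, the greedy search scores 21 candidate subsets while the exhaustive search scores all 56, and the gap widens quickly as the number of features grows. The greedy score can never exceed the exhaustive one. How close it gets depends on the data, so this sketch illustrates the cost argument rather than proving the yield argument.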
Some opinions expressed in this article may be those of a guest author and not necessarily those of Analytikus. Staff authors are listed at http://www.datasciencecentral.com/profiles/blogs/what-statisticians-think-about-data-scientists