What are the skill sets data scientists need to succeed?
June 24, 2014
Thousands of new data jobs will be created in the next couple of years, making “data scientist” one of the hottest emerging job titles. New data jobs will center on “analytics, data architects, data scientists [and] data modelers,” according to Inhi Cho Suh, Vice President and General Manager of Big Data, Integration & Governance at IBM. But what skill sets do you need to develop to become a data scientist?
At the recently concluded Hadoop Summit 2014, IBM’s Suh and other guests joined theCUBE co-hosts Jeff Frick, John Furrier and Jeff Kelly to discuss the skill sets data scientists need as well as the talent shortage. “[The] talent gap continues to be a big issue,” summarized John Furrier, founder of SiliconANGLE. “There’s a huge demand for data scientists.”
A data scientist is described in many ways. Josh Wills, Senior Director of Data Science at Cloudera, defined one as a “person who is better at statistics than any software engineer and better at software engineering than any statistician” in a May 3, 2012 tweet (@josh_wills). Data scientists must also know math, statistics, experiments, causal inference, machine learning, and software, according to a blog post by data scientist Trey Causey.
Most knowledge workers today already possess some data scientist skills, according to Jeff Kelly, analyst at Wikibon. “It’s interesting that the [U.S.] government points out that…170,000 more data jobs [will be created in the next two years],” said Kelly at Hadoop Summit 2014. “But really, if you think about it, most knowledge workers are becoming data professionals in a lot of ways. You’ve got to understand how to interpret data and how to communicate with data. And that’s one of the softer problems, one of the non-technology problems that I think a lot of organizations run into.”
Data scientists should be able to program but don’t need to be masters of a language out of the gate. “It doesn’t matter what language you learn first. Pick a language and learn it,” Causey advised aspiring data scientists in his blog post. “Write bad code that breaks. Just learn it.” By the time you figure out what your language is bad at or can’t do, Causey wrote, you’ll already know enough about programming that you’ll know which language you need to learn next to solve your data problem.
As enterprise applications become more data-centric, the roles of the data scientist and the application developer are actually merging, according to Kelly. “In the short-term, this means the two roles must learn to collaborate more effectively and both must assume new ways of thinking,” Kelly recently wrote. “For data scientists, this means starting to think more about how the insights they uncover can be translated into repeatable form factors consumable by end-users. And application developers need to gain a better understanding of data flows and how analytic requirements impact application performance.”
Gaining data scientist skill sets
Filling the talent gap for data scientists may require a combination of efforts by both universities and industry vendors. “Universities are big, slow-moving beasts and they don’t necessarily have in place ‘data science’ schools,” Mark Lowerison, Director of Research and Academics at the University of Calgary, told theCUBE’s Furrier at Hadoop Summit 2014. “They have molecular biology schools and genetic schools and statistics schools and computer science programs. It’s the people who…[attend]…those schools that become the good candidates for working in our industry.”
Vendor-sponsored education programs for data scientists also help address the talent gap. Cloudera, for one, offers a course toward a certification called Cloudera Certified Professional: Data Scientist (CCP: DS). According to Cloudera, candidates must prove their abilities under real-world conditions by designing and developing a production-ready data science solution that is peer-evaluated for its accuracy, scalability, and robustness.
For the CCP: DS certification, Cloudera assesses 11 different areas, including the ability to ingest data, transform data, query complex math across that data and deploy machine learning algorithms, “[and to] deploy all of that at scale,” said Brad Johnson, Certification Manager at Cloudera, in a video on the company’s website.
Some vendors, such as IBM, have been working with colleges and universities to help churn out tomorrow’s data scientists. “We’re…actually working with several universities globally to actually put together a curriculum—both in the business school as well as in the technical schools—for certifications and advanced sort of Masters classes around various data type jobs,” said IBM’s Suh at Hadoop Summit 2014.