What is a Data Scientist?
It’s hard to define exactly what a data scientist is. A few years ago, when “data scientist” was hailed as a sexy job title, companies were all chasing unicorns capable of doing all things data. But it turns out that there are not many unicorns after all, and even if there are, they probably wouldn’t be asked to do everything.
The role is also hard to define because different companies see it differently. As Chris Dixon once said in a tweet,
A Data Scientist is a statistician who lives in San Francisco.
Josh Wills also has a famous tweet that goes:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
These definitions are not entirely accurate, but they are quite fun to read.
In practice, data scientists tend to be associated with a few interdisciplinary skills. You’ve probably seen the data scientist Venn diagram below in many places. WeCloudData’s faculty added some hard and soft skills to the diagram. In short,
- Data scientists need to have good programming skills.
  - More specifically, a data scientist should at least be comfortable with Linux commands so they can work with cloud platforms such as AWS and Azure.
  - A data scientist should also know at least one programming language such as Python, Scala, or Java.
  - A modern data scientist should also be familiar with cloud computing platforms such as AWS and GCP, and with big data frameworks such as Apache Spark, Hadoop, or Hive.
- Data scientists need to know the basics of math, statistics, and machine learning algorithms.
  - Strictly speaking, math is not required to work with modern machine learning packages. To be frank, it’s quite easy to pick up a sample Jupyter Notebook and start building models without knowing the nitty-gritty of the mathematics behind machine learning algorithms. But a data scientist without proper knowledge of math and algorithms will NOT be able to produce reliable results.
  - Data scientists are usually required to have knowledge of statistical learning and machine learning. After all, AI and ML are eating the software world. With machine learning skills, a data scientist will be able to apply advanced analytics to help businesses solve tougher challenges.
- Data scientists need to be good storytellers.
  - Storytelling is indeed very important. Even though data scientists work on exciting technical challenges, without the trust and understanding of the business, marketing, and product teams, the work usually goes straight to the garbage bin.
  - Data science needs to generate real business value, and that requires strong communication, interpersonal, and presentation skills.
Different Types of Data Scientists
The data science field is getting more specialized. Companies realize that a team made up only of data scientists who build statistical models is not enough unless all of them are unicorns. Unfortunately, that’s usually not the case, and such a dream team might be too expensive to keep. Instead, as their data science maturity grows, companies start to look for more specific data science skills among candidates. In general, there are 4 types of data scientists, as illustrated in the diagram below.
- Operational Data Scientist
  - 70%-80% of the data scientist jobs out there will likely fall into the Operational Data Scientist category. These data scientists typically have strong data munging skills and tend to work in a specific line of business such as sales, marketing, or product.
- ML Researcher
  - ML Researchers are usually not called data scientists within companies, but their machine learning work overlaps with that of data scientists, with more focus on the research side. ML research roles usually require an advanced academic degree, preferably a PhD or Master’s in Computer Science.
  - ML Researchers are strong in algorithms and programming and are able to implement novel ML and AI algorithms that become a company’s IP.
- ML Engineer
  - ML Engineering is on the rise. Many companies are starting to realize that they need more than data scientists who work with Jupyter Notebooks and Python packages for modeling. They need MLEs who can help tune advanced deep learning models and deploy ML models and pipelines in production.
  - ML Engineers usually have a strong software engineering background, or at least know how to write good code. They know how to orchestrate ML pipelines as well as how to deploy models in production in a scalable way.
  - If you’re interested in a career in ML Engineering, you should read WeCloudData’s ML Engineer Career Guide.
- Data Product (Owner) Manager
  - In the data science ecosystem, not all roles require coding and model building. Product managers usually work closely with the software and data teams. They are sometimes the data owners as well, so they work with data scientists to acquire the data required for machine learning.
  - Data product owners are sometimes senior leaders as well, such as a Director of Data Science or even a Chief Data Officer. They may come from a technical data science background, but their leadership roles require more communication and strategic planning.
  - If you’re coming from a product management, project management, or business management background, you have two options:
    - Train yourself to become a data scientist.
    - Learn enough data science and coding that you can appreciate the data science lifecycle and know how to work with a data science team.
The four types above cover most data science roles, but in recent years some new types of data scientist have also emerged. For example, the Citizen Data Scientist emerged due to the shortage of well-trained data scientists. Companies are therefore training domain experts and empowering them with modern analytics tools (and even AutoML tools) to build machine learning models that would otherwise require traditional data scientists. In some ways, Citizen Data Scientists are similar to Operational Data Scientists, but they tend to work with low-code (or no-code) analytics tools and apply AutoML models for prototyping.
A Day in the Life of a Data Scientist
What does a typical day look like for a data scientist? Well, it varies depending on what type of data scientist we’re talking about. Since the Operational Data Scientist is the most common type, let’s have a look at a typical day in the life of a data scientist on a marketing analytics team. Note that not all data scientists work in the same way, so take it with a grain of salt.
- 9:30 – 10:00 – Arrive at the office with her morning coffee, plan for the day and look into some emails
- 10:00 – 10:30 – Meet with the team in the daily project standup to discuss accomplishments, challenges, and roadblocks from the past day
- 10:30 – 11:00 – Join a business meeting to discuss A/B test results of the marketing campaigns
- 11:00 – 12:30 – Check the model dashboard to make sure pipelines are running properly. Then pick an assigned task from Jira and start working on a data extraction and cleaning task
- 12:30 – 13:30 – Lunch break
- 13:30 – 14:30 – Wrap up the data cleaning task and run a few SQL queries to double-check the quality of the expected output
- 14:30 – 15:00 – Join another business meeting to discuss the customer segmentation models with the marketing team. Present a few slides that explain the methodologies in layman’s terms
- 15:00 – 17:00 – Pick up another task from Jira that requires a new predictive churn model. Data engineers have prepared new time series features, so she wants to try out some deep-learning-based models. Search for some good articles about the new algorithms and learn on the fly
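To give a flavor of what sits behind a discussion of A/B test results, here is a minimal sketch of a two-proportion z-test, one common way to compare conversion rates between two campaign variants. The function name and the conversion counts are hypothetical and chosen only for illustration; a real analysis would also consider sample size planning and how long the test ran.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical campaign results: variant A vs. variant B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone, which is the kind of conclusion the data scientist would then explain to the marketing team in plain language.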
Data Scientist vs. Data Analyst vs. Data Engineer
We are often asked questions such as “What’s the difference between a data scientist, analyst, and engineer?” Since skills overlap across the different data roles, it can indeed be confusing.
Let’s clear this up.
Business Interaction
- Among the 3 roles, the Data Analyst role is usually the least technical, but it requires the most direct interaction with the business.
- Data analysts tend to take on more ad-hoc data analytics requests.
- Data analysts use SQL most frequently and may need to use Python as well, depending on the company’s toolbox. Many data analyst jobs still require lots of Excel and VBA.
- The Data Engineer role probably requires the least direct interaction with the business. But that doesn’t mean data engineers always work in a silo. It’s just that the nature of the job requires data engineers to interact more directly with data scientists, analysts, and IT.
Technical Skills
- Among the 3 roles, data engineering jobs usually require the strongest coding skills, followed by data scientist and then data analyst jobs.
- Data scientists also need strong programming skills, but they deal less with systems and automation tools. The focus is more on the advanced analytics side.
- Data Scientists, Data Engineers, and Data Analysts all need to be very good at SQL and Python programming. Being able to retrieve data from databases and data lakes is critical.
- Data Science and Data Engineering require more big data and cloud-related knowledge and experience, as data scientists and engineers tend to work with bigger datasets.
The following example may help clarify the difference. Say a startup has just released a new app feature, and the product manager expects it to help the company grow its Daily Active Users (DAU). Unfortunately, the growth curve doesn’t pick up; instead, there is a dip in DAU. This is when the product manager approaches the data team to get a better understanding of the situation.
Data Analyst
- The data analyst is probably the first person the product manager wants to talk to. The business will probably start with the “what” and “why” questions.
- In this case, the analyst will need to write a lot of ad-hoc queries and help the product manager validate her assumptions.
- The product manager may also request insights into user engagement and trends. If the analysis is helpful, the product manager may request a dashboard that requires regular updates.
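To make the analyst’s ad-hoc work a bit more concrete, here is a minimal sketch of a DAU query. The `events` table, its columns, and the sample rows are all hypothetical, and an in-memory SQLite database stands in for whatever database the company actually uses.

```python
import sqlite3

# Hypothetical event log: one row per user action (user_id, event_date).
events = [
    (1, "2024-03-01"), (2, "2024-03-01"), (3, "2024-03-01"), (1, "2024-03-01"),
    (1, "2024-03-02"), (2, "2024-03-02"),
    (1, "2024-03-03"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", events)

# DAU = number of distinct users seen on each day.
dau = conn.execute(
    """
    SELECT event_date, COUNT(DISTINCT user_id) AS dau
    FROM events
    GROUP BY event_date
    ORDER BY event_date
    """
).fetchall()

# Flag the days on which DAU fell compared with the previous day.
dips = [day for (_, prev), (day, cur) in zip(dau, dau[1:]) if cur < prev]

for day, users in dau:
    print(day, users)
print("Days with a dip:", dips)
```

A query like this answers the “what” question; the “why” usually takes further slicing, for example by platform, user segment, or app version.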
Data Scientist
- The product manager will probably go to the data scientist if she wants to know the potential impact of the product feature change. If the cause has been identified, the data scientist might be asked to build predictive models to identify high-risk customers who might cancel the service.
- The product manager may also want the data scientist to collect customer feedback data from Twitter or call center logs. An NLP model might need to be trained or applied to gauge the overall sentiment of the customer feedback.
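As a rough illustration of what “a predictive model to identify high-risk customers” can mean, here is a minimal sketch of a logistic regression trained with gradient descent in plain Python. The two features, the toy data, and the function names are hypothetical; in practice a data scientist would typically reach for a library such as scikit-learn and far more data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            pred = sigmoid(bias + sum(w * x for w, x in zip(weights, xi)))
            error = pred - yi
            for j, x in enumerate(xi):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def churn_probability(weights, bias, features):
    """Predicted probability that a customer cancels."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Hypothetical features: [weeks since last login, support tickets filed].
X = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 2], [5, 3], [6, 1], [5, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = customer churned

weights, bias = train_logistic_regression(X, y)
at_risk = churn_probability(weights, bias, [5, 2])
engaged = churn_probability(weights, bias, [0, 1])
print(f"inactive customer: {at_risk:.2f}, engaged customer: {engaged:.2f}")
```

The scores can then be used to rank customers so that the business can prioritize retention efforts on those most likely to cancel.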
Data Engineer
- None of the analysis above would be possible without the data engineers who help prepare the data and make sure it is clean and well organized.
- The data engineer may pull new data for the data scientist who wants to try out new features, and may help automate the ML pipelines as well if the models need to be deployed.
- The data engineer may also need to help the data analyst automate a data pipeline that extracts data from 8 different sources, transforms the data, and loads it into a Snowflake database running in the AWS cloud. The automation can be done using Apache Airflow.
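The extract, transform, and load steps of such a pipeline can be sketched in a few lines. This is only an illustration under stated assumptions: the two CSV sources, the `signups` table, and the function names are hypothetical, an in-memory SQLite database stands in for the warehouse, and the scheduling that a tool like Apache Airflow would provide is left out.

```python
import csv
import io
import sqlite3

# Hypothetical raw exports from two source systems (a real pipeline may have many more).
SOURCES = {
    "web": "user_id,signup_date\n1,2024-03-01\n2,2024-03-02\n",
    "mobile": "user_id,signup_date\n3,2024-03-02\n,2024-03-03\n",
}


def extract(raw_csv):
    """Parse one source's CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows, source_name):
    """Drop rows with a missing user_id and tag each row with its source."""
    return [
        (int(row["user_id"]), row["signup_date"], source_name)
        for row in rows
        if row["user_id"]
    ]


def load(conn, records):
    """Append the cleaned records to the target table."""
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", records)


conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute("CREATE TABLE signups (user_id INTEGER, signup_date TEXT, source TEXT)")

for name, raw in SOURCES.items():
    load(conn, transform(extract(raw), name))
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM signups").fetchone()[0]
print("rows loaded:", row_count)
```

Keeping each step as a small, separate function is a deliberate choice: it makes the steps easy to test on their own and easy to wire into an orchestration tool later.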
Whether you’re interested in becoming a Data Scientist, a Data Analyst, or a Data Engineer, WeCloudData can help. To learn more about the Data Engineer career path and programs, please go here, and to learn more about the BI/DA career path, go here. For Data Science, continue reading the rest of this career guide.