Of the data scientists I have interviewed, Shaohua Zhang struck me as unique for two reasons: 1) his incredible commitment to, and generosity in, sharing his experience, and 2) his transition from the corporate sector to the educational sector while staying abreast of the latest developments in Data Science.
Shaohua is the Co-founder and CEO at WeCloudData Academy, a data training company focusing on providing corporate and public data science training services. He is also a Data Growth Coach at Communitech where he helps provide data strategy mentorships to tech startups.
Before co-founding WeCloudData, Shaohua had extensive experience in the industry. He worked for data-driven companies such as Rogers, BlackBerry, and Kik Interactive. He helped lead the data science team at BlackBerry and played a key role in assisting the BlackBerry Messenger team to leverage big data and machine learning to drive advertising revenue.
I sat down with him to talk about his inspiring journey as a Data Scientist, from being one of the first people in the world to become a SAS Certified Predictive Modeller in 2007, to his observations about Data Science practices and his latest ventures.
Reena: In your 8 years of experience in Data Science in the corporate sector, if you had to undo any aspect about a project that you were a part of, what would it be?
Shaohua: If I could reverse something, I would significantly shorten the time and effort spent on building the MVP in one of the projects I led. It is also one of the many things I learned from my manager: tackling a Data Science problem is an iterative process.
There are many more challenges than doing research and building machine learning models. Since it was a new project that had not been done in the organization before, I probably should have spent more time on building the data pipelines, iterating faster on the machine learning piece, and getting the results out early so that we could get feedback from the business teams sooner. Creating the feedback loop is very important so that your time and effort are not wasted.
R: Data Scientists are not primarily Software Engineers. Can you illustrate from experience, why engineering skills are important in a production environment; and what practical engineering skills should Data Scientists know to be successful in a production environment?
S: Engineering skills are absolutely necessary for data scientists who work on building data products. In the traditional analytics environment, data scientists only need to focus on building models, while data engineers usually develop the modeling data set and software engineers or developers deploy the models to production.
One challenge is that data scientists and researchers sometimes do not speak the same language as engineers, so the communication cost becomes very high in the production environment. Developers/Engineers often need to take over the messy code written in a Jupyter Notebook environment and rewrite the data scientists’ code. This is problematic because the engineer may not understand exactly why the data scientist chose some methods over others, and modeling methodologies and processes are not always well documented.
Data scientists and ML researchers also need to be able to develop complex SQL queries to extract data and know how to deal with distributed systems such as Hadoop and Spark. Data democracy will enable the data scientist to dig into the many data sources and raw logs instead of only working with prepared data. It means they will be able to look at the raw data to gain better intuition and then come up with better features for the ML models.
For the Data Scientists who work in a production environment, being able to write efficient and reusable code is also important. A lot of data scientists build models in an interactive Jupyter Notebook environment, but when it comes to model deployment, an IDE such as PyCharm is probably a much better way to develop, debug and test your code. It means that the Data Scientists need to know how to modularize code, write functions and classes and know how to do version control and unit testing.
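As an illustration of what modularized, testable code can look like, here is a minimal sketch using only the Python standard library. The function `add_ratio_feature`, the class `ThresholdChurnModel`, and the column names are hypothetical examples invented for this sketch; they do not come from the interview or from any specific project.

```python
import unittest


def add_ratio_feature(rows, numerator, denominator, name):
    """Return new rows with rows[name] = numerator / denominator.

    A zero denominator yields 0.0 instead of raising an error.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is not mutated
        denom = row[denominator]
        new_row[name] = row[numerator] / denom if denom else 0.0
        out.append(new_row)
    return out


class ThresholdChurnModel:
    """Toy stand-in for a trained model (hypothetical).

    Flags a customer as likely to churn when a feature falls below a cutoff.
    """

    def __init__(self, feature, cutoff):
        self.feature = feature
        self.cutoff = cutoff

    def predict(self, rows):
        return [int(row[self.feature] < self.cutoff) for row in rows]


class TestPipeline(unittest.TestCase):
    """Unit tests that pin down the behavior of the pieces above."""

    def test_ratio_handles_zero_denominator(self):
        rows = [{"calls": 4, "days": 2}, {"calls": 3, "days": 0}]
        result = add_ratio_feature(rows, "calls", "days", "calls_per_day")
        self.assertEqual([r["calls_per_day"] for r in result], [2.0, 0.0])

    def test_predict_flags_low_activity(self):
        model = ThresholdChurnModel("calls_per_day", cutoff=1.0)
        rows = [{"calls_per_day": 0.5}, {"calls_per_day": 2.0}]
        self.assertEqual(model.predict(rows), [1, 0])
```

Saved in a file, the tests can be run with `python -m unittest <file>`. Because the feature logic lives in a plain function rather than in a notebook cell, an engineer can import it, test it, and track changes to it in version control.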
In some environments, a data scientist is required to build several models a week and iterate very fast. Keeping track of hundreds/thousands of models developed by different data scientists poses challenges in production. So it will be great if the data scientists know how to version control not only their code but also the models and the virtual environments in which the models are trained so that the results are reproducible.
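One simple way to approach this, sketched below under stated assumptions, is to store a small record next to every saved model: a hash of the serialized model, the Python version, the versions of the packages it depends on, and the training parameters. The helper `build_model_record` and the toy dictionary "model" are hypothetical; dedicated tracking tools such as MLflow cover the same ground far more thoroughly.

```python
import hashlib
import json
import pickle
import platform
from importlib import metadata


def build_model_record(model, packages, params):
    """Serialize a model and describe the environment it was trained in.

    Returns (model_bytes, record). The record can be committed to version
    control or stored next to the model file.
    """
    model_bytes = pickle.dumps(model)
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    record = {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "python": platform.python_version(),
        "packages": versions,
        "params": params,
    }
    return model_bytes, record


# Toy "model": a dict of coefficients standing in for a fitted estimator.
model = {"intercept": 0.1, "coef": [0.5, -0.2]}
model_bytes, record = build_model_record(
    model, packages=["pip"], params={"learning_rate": 0.1}
)
print(json.dumps(record, indent=2, sort_keys=True))
```

The hash ties the record to one exact model artifact, and the package versions tell a colleague which environment to rebuild before trying to reproduce the result.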
In short, the data/software engineering tools I would suggest a data scientist know include version control, cloud platforms (AWS/GCP), data pipeline tools (Airflow/Luigi), distributed systems (Hadoop/Spark), and model tracking tools (MLflow). They should also understand how Prediction APIs work.
R: While a student, how did you prepare yourself for your first job as a Predictive Modeller at Rogers?
S: Before I landed my first data job, I spent a lot of effort in learning hard skills (SAS, predictive modeling) and built a strong data project portfolio.
I did my Master’s of E-Commerce program at Dalhousie University and became interested in data analytics through the “Marketing Informatics” course taught by Prof Dr. Tony Schellinck. I had exposure to SAS programming in his class and worked on a hands-on project using real datasets from a famous grocery chain.
I became fascinated by predictive modeling and how models can be applied in retention campaigns to help reduce customer churn. So I decided to learn more. Unfortunately, back in 2007, there were not many training programs that taught practical predictive modeling and machine learning skills. So I decided to take several training classes offered by SAS Institute. I spent quite a bit of effort on finding interesting datasets to practice on, reading blog posts, and working on hands-on projects related to industry use cases. In 2007, I took the SAS Certified Predictive Modeller test in Las Vegas and was among the first few to get certified. Eventually, the effort paid off because the SAS programming and predictive modeling skills I acquired through those training classes convinced the employer that I could do the job.
I started my career working with SAS and later switched to open source when I was working for BlackBerry. Because of the massive amounts of data we had to deal with, and because some use cases involved recommender systems and unstructured data, we decided to go open source and adopted platforms such as Hadoop, Spark, and graph databases.
R: What were the differences between a startup and a large corporation that you observed, in its underlying data infrastructure? What were the changes in the mindsets and skills that you had to implement accordingly?
S: I was fortunate enough to have worked for some data-driven companies. Based on my personal experience, the differences I observed include the choice of data infrastructure, the role of a data scientist, and the data team culture. There are many more to be discussed, of course.
When I was working at BlackBerry, I started as the first Data Scientist in the marketing analytics team and later led the team to build data products for the BBM team. BlackBerry has a strong engineering culture and an awesome infrastructure team to support its big data initiatives. The good thing is that data scientists get to focus on building machine learning models and delivering data insight to the business. There are standard processes and lots of learning opportunities.
When I joined Kik Interactive as a data scientist, I had to learn one of the Cloud platforms very quickly because there was no in-house platform engineering team. That meant I had to spin up cloud instances to run data collection jobs, launch Hadoop clusters in the cloud, use Apache Spark to analyze big data, and terminate the clusters after the jobs were completed. I even automated the entire process using the Python API provided by the cloud provider. From time to time, I had to develop Python scripts to collect cluster utilization stats and send a summary report to the team. Sometimes I also worked as an engineer to build simple Data APIs and monitor the API utilization dashboards. In that kind of environment, I learned how to wear multiple hats (data scientist, engineer, DevOps) to get things done and started to appreciate the importance of data engineering and software engineering.
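The launch, run, and terminate cycle described here can be sketched in Python with a context manager that guarantees cleanup. The interview does not name the cloud provider or its API, so `StubCloudClient` and its methods (`launch_cluster`, `submit_job`, `terminate_cluster`) are entirely hypothetical stand-ins that only record calls; a real script would call the provider's actual SDK instead.

```python
from contextlib import contextmanager


class StubCloudClient:
    """Hypothetical stand-in for a cloud SDK client; it only records calls."""

    def __init__(self):
        self.calls = []
        self._next_id = 1

    def launch_cluster(self, nodes):
        cluster_id = f"cluster-{self._next_id}"
        self._next_id += 1
        self.calls.append(("launch", cluster_id, nodes))
        return cluster_id

    def submit_job(self, cluster_id, script):
        self.calls.append(("submit", cluster_id, script))

    def terminate_cluster(self, cluster_id):
        self.calls.append(("terminate", cluster_id))


@contextmanager
def temporary_cluster(client, nodes):
    """Launch a cluster and always terminate it, even if the job fails."""
    cluster_id = client.launch_cluster(nodes)
    try:
        yield cluster_id
    finally:
        # Runs on success and on error, so no cluster is left running.
        client.terminate_cluster(cluster_id)


client = StubCloudClient()
with temporary_cluster(client, nodes=3) as cluster_id:
    client.submit_job(cluster_id, "analyze_logs.py")
print(client.calls)
```

Putting the termination step in a `finally` block matters in practice: a failed job that leaves a cluster running keeps costing money until someone notices.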
R: How does the Data Science program at WeCloudData differ from other comparable programs?
S: At WeCloudData Academy, we are firm believers in project-based learning. While some other programs focus on teaching students essential tools and theories, we focus on teaching practical methodologies and industry use cases, and on mentoring students as they build their data science project portfolios.
For example, in our Machine Learning course, our instructors not only teach students how to build machine learning models in a Jupyter Notebook environment but also how to properly structure the project using PyCharm and deploy ML models using APIs. In the big data course, students learn how to build an end-to-end ML pipeline in the AWS cloud. The course covers several different tools from AWS, Kafka, Spark Streaming, Spark ML, to DynamoDB, Hive, and Airflow.
Data Science and Data Engineering skills are hard to teach because the best way to teach these skills is hands-on project-based learning, and it’s hard to find good instructors who have strong industry experience, who are passionate about teaching, and who also excel at teaching. We are fortunate enough to have worked with some very talented and passionate Data Scientists and Data Engineers who work for tech firms like Microsoft, Zero Gravity Labs, Adeptmind AI, Rubikloud, CapitalOne, Integrati.ai, etc.
Exposure to meetup communities and alumni networks is also one of our strengths. WeCloudData actively participates in the data communities in Canada. We run the Toronto Data Science & Engineering meetup, which has 5,700 members, and work with corporations such as Microsoft and LoyaltyOne. We also work with other awesome meetups such as TDLS and Women Who Code, and sponsor events such as Data For Good datathons. Our students have access to guest lectures offered by industry speakers, hands-on workshops, and our alumni network.
R: How do you ensure that the Data Science skills that you teach are relevant in an ever-changing industry?
S: This is a great question! In an ever-changing industry, we always need to make sure that the skills we teach stay current and relevant. That is why students from our data science programs have been able to find Data Scientist and Data Engineer jobs at companies such as LoyaltyOne, RBC, TD, Rubikloud, etc.
To give you an example, the ‘Data Science at Scale’ course we teach today looks very different from the one we taught two years ago and is much more advanced and practical than the courses offered by comparable institutions. The curriculum we used two years ago focused on Apache Hadoop, Hive, and Pig, and we chose sandboxes offered by Hadoop vendors as the teaching environments. The curriculum we use today covers Amazon Web Services, Docker, Kafka, NoSQL, and Airflow, and has a big focus on Apache Spark.
We make sure that the course materials are up to date by refining/updating the course material every quarter, actively engaging in the data communities, working with some of the best data scientists and engineers in the field, and actively seeking feedback from our industry partners.
Our Corporate Data Training Programs also help us stay very close to the industry trends because we work closely with our corporate clients to refine the course offerings and ensure we cover the skills that matter most to the industry.
Since I teach a big data course myself at WeCloudData, I make sure that I teach relevant skills by attending conferences, reading industry papers, and working with consulting clients. My role as a Data Growth Coach at Communitech also helps me stay close to the industry and understand new technology trends.
I’d really like to thank Shaohua for providing us with very actionable takeaways with his answers.
More information on WeCloudData can be found here: https://weclouddata.com/
I’d honestly appreciate any feedback and ideas for interviews that my lovely readers have for me. In case you wish to schedule an interview with me or know of a person or company that deserves to be covered, please feel free to reach me via LinkedIn here. If you liked this article, please give it a clap and share it on any platform you like :)