Career Guide, Guest Blog, Learning Guide

Data Engineering Series #1: 10 Key Tech Skills You Need to Become a Competent Data Engineer

December 3, 2020

Data Engineers bridge the gap between Application Developers and Data Scientists. Demand for them rose by up to 50% in 2020, driven largely by increased investment in AI-based SaaS products.

Graph showing the fastest growing tech occupation

After going through multiple job descriptions, and based on my own experience in the field, I have put together a detailed list of the skills you need to become a competent Data Engineer.

If you are a Backend Developer, some of your skills will overlap with the list below, so it will be considerably easier for you to make the jump, provided you address the skill gaps.

Must Haves

1️⃣ The Art of Scripting and Automating
Can’t stress this enough.
You need the ability to write reusable code and to know the common Python libraries and frameworks used for:

* Data wrangling operations – pandas, NumPy, re
* Data scraping – requests/BeautifulSoup/lxml/Scrapy
* Interacting with external APIs and other data sources, and logging
* Parallel processing libraries – Dask, multiprocessing
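
As a small illustration of reusable scripting, here is a minimal sketch that uses only the standard library's `re` and `logging` modules to parse raw text lines into records. The line format, the `parse_line` and `parse_lines` helpers, and the sample data are all assumptions made up for this example.

```python
import logging
import re
from typing import Dict, List, Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("wrangling")

# Hypothetical line format (an assumption): "<date> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.+)$"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one raw line into a dict, or return None if it is malformed."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        # Log and skip bad input instead of crashing the whole job.
        logger.warning("Skipping malformed line: %r", line)
        return None
    return match.groupdict()


def parse_lines(lines: List[str]) -> List[Dict[str, str]]:
    """Reusable entry point: keep only the lines that parsed cleanly."""
    parsed = (parse_line(line) for line in lines)
    return [record for record in parsed if record is not None]


# Made-up sample input for illustration.
raw_lines = [
    "2020-12-03 INFO job started",
    "not a valid line",
    "2020-12-03 ERROR upstream API timed out",
]
records = parse_lines(raw_lines)
```

In a real pipeline the same structure (small pure functions, logging on bad input) carries over when the parsing is done with pandas or the input comes from an external API.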

2️⃣ Cloud Computing Platforms
The rise of cloud storage and computing has changed a lot for data engineers, so much so that being well versed in at least one cloud platform is now required.

* Serverless computing, virtual instances, managed Docker and Kubernetes services
* Security standards, user authentication and authorization, Virtual Private Cloud, subnets

  • Start with either AWS or GCP services.

3️⃣ Linux OS
The importance of working with Linux is often overlooked.
“90% of the public cloud workloads are running on Linux based OS”

* Bash scripting concepts such as control flow, looping, and passing input parameters
* File System Commands
* Running daemon processes

4️⃣ Database Management – Relational Databases, OLAP vs OLTP, NoSQL

* Creating tables; read, write, update, and delete operations; joins, procedures, materialized views, aggregated views, and window functions
* Database vs. data warehouse; star and snowflake schemas; fact and dimension tables

  • Commonly preferred relational databases – PostgreSQL, MySQL, etc.
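
To make the join, fact/dimension, and window-function ideas concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table names, columns, and rows are invented for illustration, and the window function assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# One dimension table and one fact table (hypothetical schema).
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT
    );
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO fact_orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 40.0);
    """
)

# Join the fact table to its dimension, then use a window function to
# compute a running total of order amounts per customer.
rows = conn.execute(
    """
    SELECT c.name,
           o.order_id,
           o.amount,
           SUM(o.amount) OVER (
               PARTITION BY c.customer_id ORDER BY o.order_id
           ) AS running_total
    FROM fact_orders AS o
    JOIN dim_customer AS c ON c.customer_id = o.customer_id
    ORDER BY c.name, o.order_id
    """
).fetchall()

for row in rows:
    print(row)
```

The same SQL runs largely unchanged on PostgreSQL or MySQL 8+, which is one reason SQLite is convenient for practicing these concepts.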

5️⃣ Distributed Data Storage Systems

* Knowledge of how a distributed data store works
* Understanding concepts like partitioned data storage, sort keys, SerDes, data replication, caching, and persistence

  • Some of the most widely used ones – HDFS, AWS S3, or a NoSQL database (MongoDB, DynamoDB, Cassandra)
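
Partitioned storage and replication can be sketched in a few lines. The toy example below hashes a key to pick a partition and then places copies on consecutive nodes. It is not how any particular system (HDFS, Cassandra, DynamoDB) actually places data; the function names, node names, and placement rule are assumptions for illustration only.

```python
import hashlib
from typing import List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    MD5 is used here only because it is deterministic across runs
    (unlike Python's built-in hash() for strings), not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replica_nodes(partition: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Pick the node that owns the partition plus the next nodes in the ring."""
    return [nodes[(partition + i) % len(nodes)] for i in range(replication_factor)]


# Hypothetical four-node cluster storing three copies of every record.
nodes = ["node-a", "node-b", "node-c", "node-d"]
partition = partition_for("user-42", num_partitions=len(nodes))
replicas = replica_nodes(partition, nodes, replication_factor=3)
print(partition, replicas)
```

Because the hash is stable, every reader and writer computes the same partition for a given key, which is the property that lets a cluster locate data without a central lookup.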

6️⃣ Distributed Data Processing Systems

* Common techniques and patterns for data processing, such as partitioning, predicate pushdown, sorting within partitions, managing the size of shuffle blocks, and window functions
* Leveraging all the cores and memory available in the cluster to improve concurrency

  • Common distributed processing frameworks – MapReduce, Apache Spark (start with PySpark if you are already comfortable with Python)
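
The MapReduce pattern itself is easy to see in plain Python. The sketch below is a single-process toy, not Hadoop or Spark code: it counts words by running a map phase, a shuffle phase that groups values by key, and a reduce phase. The function names and sample documents are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


# In a real cluster each document (or file split) would be mapped on a
# different worker; here everything runs in one process.
documents = ["spark reads data", "spark writes data", "data everywhere"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

The shuffle step is the expensive part in a distributed setting because it moves data between machines, which is why shuffle block sizes and partitioning appear in the list above.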

7️⃣ ETL/ELT Tools and Modern Workflow Management Frameworks
Different companies pick ETL frameworks in different ways. One with an in-house data engineering team would prefer to have its batch-processing ETL jobs set up with a properly managed workflow management tool.

* ETL – ETL vs ELT, data connectivity, mapping, metadata, types of data loading
* When to use a workflow management system – directed acyclic graphs (DAGs), cron scheduling, operators

  • ETL Tools: Informatica, Talend
  • Workflow Management Frameworks: Airflow, Luigi
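
A workflow manager models a pipeline as a directed acyclic graph and runs each task only after its dependencies finish. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+). It is not Airflow or Luigi code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical pipeline).
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
}

# static_order() yields tasks so that every dependency comes first.
# It raises graphlib.CycleError if the graph is not acyclic.
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)
```

The two extract tasks have no dependency on each other, so a real scheduler would be free to run them in parallel; only their position relative to `transform_join` is fixed.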

Good To Have

8️⃣ Java / JVM-Based Frameworks
Knowledge of a JVM-based language such as Java or Scala will be extremely useful.

* Understand both functional and object-oriented programming concepts.
* Many of the high-performance data science frameworks built on top of Hadoop are written in Scala or Java.

  • JVM-based frameworks – Apache Spark, Apache Flink, etc.

9️⃣ Message Queuing Systems

* Understanding how data ingestion happens in message queues
* What producers and consumers are and how they are implemented
* Sharding, data retention, replay, de-duplication

  • Popular message queues: Kafka, RabbitMQ, Kinesis, SQS, etc.
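
The producer/consumer relationship and de-duplication can be sketched with the standard library's `queue` and `threading` modules. This is an in-process toy, not Kafka or RabbitMQ client code; the message shape `(message_id, payload)`, the sentinel convention, and the sample messages are assumptions for illustration.

```python
import queue
import threading
from typing import List, Optional, Tuple

Message = Tuple[int, str]  # (message_id, payload), a hypothetical shape


def producer(q: "queue.Queue[Optional[Message]]", messages: List[Message]) -> None:
    """Publish every message, then a None sentinel to signal the end."""
    for msg in messages:
        q.put(msg)
    q.put(None)


def consumer(q: "queue.Queue[Optional[Message]]", processed: List[str]) -> None:
    """Consume messages in order, dropping any ID that was already seen."""
    seen_ids = set()
    while True:
        msg = q.get()
        if msg is None:  # sentinel: the producer is done
            break
        msg_id, payload = msg
        if msg_id in seen_ids:
            continue  # duplicate delivery, skip it
        seen_ids.add(msg_id)
        processed.append(payload)


q: "queue.Queue[Optional[Message]]" = queue.Queue()
processed: List[str] = []
# Message 2 is delivered twice to simulate an at-least-once queue.
messages = [(1, "order created"), (2, "order paid"), (2, "order paid"), (3, "order shipped")]

producer_thread = threading.Thread(target=producer, args=(q, messages))
consumer_thread = threading.Thread(target=consumer, args=(q, processed))
producer_thread.start()
consumer_thread.start()
producer_thread.join()
consumer_thread.join()
print(processed)
```

Tracking seen IDs in memory only works for a toy; real consumers usually persist that state or make their writes idempotent so that replays are safe.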

🔟 Stream Data Processing

* Differentiating between real-time, stream, and batch processing
* Sharding, repartitioning, poll wait time, topics/groups, brokers

  • Commonly used frameworks: AWS Kinesis Streams, Apache Spark, Storm, Samza, etc.
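
One way to see the difference between batch and stream processing is that a stream job emits results as data arrives instead of waiting for the full dataset. The sketch below implements a simple tumbling-window sum as a Python generator. It assumes events arrive in timestamp order, and the function name, event shape, and sample data are invented for illustration; it is not the API of any streaming framework.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, int]  # (timestamp_seconds, value), a hypothetical shape


def tumbling_window_sums(events: Iterable[Event], window_seconds: int) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, total) as soon as each fixed-size window closes.

    Assumes events arrive ordered by timestamp; late or out-of-order
    events are not handled in this sketch.
    """
    current_window = None
    total = 0
    for timestamp, value in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_window is None:
            current_window = window_start
        if window_start != current_window:
            # The previous window is complete, so emit it immediately.
            yield current_window, total
            current_window, total = window_start, 0
        total += value
    if current_window is not None:
        yield current_window, total  # flush the final, still-open window


# Made-up events; a real source would be an unbounded iterator.
events = [(0, 1), (3, 2), (10, 5), (12, 1), (25, 4)]
window_totals = list(tumbling_window_sums(events, window_seconds=10))
print(window_totals)
```

A batch job would compute the same totals only after reading every event, whereas this generator produces each window's result as soon as the first event of the next window shows up.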

If you know of any other skill that would be helpful, comment below on the post.

Going forward, I’ll publish detailed posts on tools and frameworks used by Data Engineers day in and day out.


Follow for updates.

 
To read more posts from Srinidhi, check out her posts here.