Big Data is a term that seems to get thrown around a lot… but what exactly is it?
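We’ve broken this deceptively simple two-word term down into the following sections to explain Big Data for beginners:
- What is Big Data?
- How does it work?
- Working in Big Data
- The future of Big Data
- How to get started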
If you’re a Big Data beginner or would like a refresher – read on!
What is Big Data?
The ‘Big’ in Big Data refers to a massive volume of data. Although there is no concrete definition of Big Data, most interpretations hold that data becomes ‘Big’ when it can no longer be stored on a single computer or node. With the widespread adoption of smart devices and the Internet of Things (IoT), an ever-growing amount of data can be collected from just about anywhere. In 2015, Forbes wrote that by 2020 about 1.7 megabytes of new information would be created every second for every human being on the planet.
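To put the Forbes figure in perspective, the arithmetic below scales it from one second to one day. It is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1,000 MB), and the constant and function names are illustrative rather than taken from any source.

```python
# Back-of-the-envelope scaling of the per-second figure quoted above.
MB_PER_SECOND_PER_PERSON = 1.7   # figure attributed to Forbes in the text
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def daily_gigabytes_per_person(mb_per_second: float = MB_PER_SECOND_PER_PERSON) -> float:
    """Convert a per-second rate in megabytes into gigabytes per day."""
    megabytes_per_day = mb_per_second * SECONDS_PER_DAY
    return megabytes_per_day / 1000  # assumption: decimal units, 1 GB = 1,000 MB


if __name__ == "__main__":
    print(f"{daily_gigabytes_per_person():.2f} GB per person per day")
```

At that rate, a single person would account for roughly 147 GB of new data per day, which is more than many laptops can store.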
The ‘data’ in Big Data can refer to structured or unstructured data. Without getting into too many details, structured data refers to data that has a defined length or format (e.g., dates, numbers). Structured data includes click-stream data (i.e., every time you click a link), sensor data (e.g., GPS, medical devices), and the like.
Unstructured data refers to data that doesn’t easily live in a database – like audio files, email messages, and all the photos on Instagram captioned with the hashtag #pizza. Unstructured data is difficult to mine, and as a result, it's not utilized as much as structured data is.
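The difference between the two kinds of data is easier to see in code. The sketch below is a hypothetical illustration: the ClickEvent record, its fields, and the sample caption are invented for this example. The structured record can be queried directly, while the free-text caption has to be parsed before anything useful comes out of it.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ClickEvent:
    """A structured record: every field has a known name and type."""
    user_id: int
    url: str
    clicked_at: datetime


def extract_hashtags(caption: str) -> list[str]:
    """Pull hashtags out of free text, which has no fixed schema of its own."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", caption)]


if __name__ == "__main__":
    # Structured: fields can be filtered or sorted directly.
    event = ClickEvent(user_id=42, url="/pricing", clicked_at=datetime(2018, 5, 25, 9, 30))
    print(event.clicked_at.year)

    # Unstructured: meaning has to be extracted before it can be analyzed.
    caption = "Friday night done right #Pizza #nofilter"
    print(extract_hashtags(caption))
```

Even this tiny example hints at why unstructured data is harder to mine: the hashtag is the easy part, and the rest of the caption (or the photo itself) needs far more work to interpret.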
Although they sound similar, Big Data, Data Analytics, and Data Science are not the same thing.
Data Science refers to the cleansing, preparation, and analysis of data; it is the toolset used to ‘tackle’ Big Data. It is a function that combines skills in math, statistics, and programming. The top languages used for this work are Python, Java, R, Julia, SAS, and SQL.
Each language comes with its own set of strengths and weaknesses. For example, Python is easy to learn and can help you do a variety of tasks, but R is more statistics-driven and can be more conducive to data visualization. Here’s a handy guide to get a better idea of the backend languages to see which suits your needs best.
Data Analytics is the process of analyzing data to gather insights that then inform business decisions.
How does it work?
Big Data works on the basis that the more data points you have, the better you are able to make predictions and glean insights.
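A small simulation can make this idea concrete. The sketch below is a simplified illustration, not a description of any particular Big Data system: it draws random samples from a made-up population (the mean of 50 and spread of 10 are arbitrary assumptions) and measures how far the sample average lands from the true value as the sample grows.

```python
import random
import statistics


def average_estimation_error(sample_size: int, trials: int = 200, seed: int = 0) -> float:
    """Average absolute error when estimating a known mean from a random sample."""
    rng = random.Random(seed)
    true_mean, spread = 50.0, 10.0  # hypothetical population parameters
    errors = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, spread) for _ in range(sample_size)]
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(average_estimation_error(n), 3))
```

The error shrinks as the sample size grows, roughly in proportion to one over the square root of the number of data points. More data helps, but with diminishing returns.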
Using Data Science, you can answer five types of questions. A Microsoft blog post breaks it down:
- Is it A or B? Or questions with two possible answers
  - Which brings in more customers: a $5 coupon or a 25 percent discount?
- Is this weird? Or anomaly detection
  - Are these pressure gauges reading normal?
- How much? Or regression algorithms
  - What will be the temperature next Tuesday?
- How is this organized? Or clustering algorithms
  - Which viewers like the same type of movies?
- What should I do now? Or reinforcement learning algorithms
  - If I’m a self-driving car: At a yellow light, brake or accelerate?
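Two of these question types can be sketched in a few lines of plain Python. The functions below are deliberately simple stand-ins, not the algorithms the Microsoft post describes: "Is this weird?" is approximated with a median-based outlier rule, and "How much?" with an ordinary least-squares line. The function names, the threshold of 5, and the sample readings are all assumptions made for illustration.

```python
import statistics


def find_anomalies(readings: list[float], threshold: float = 5.0) -> list[float]:
    """'Is this weird?': flag readings far from the median.

    Distance is measured in units of the median absolute deviation (MAD).
    """
    median = statistics.median(readings)
    mad = statistics.median(abs(r - median) for r in readings)
    if mad == 0:
        return []  # no spread at all, so this rule cannot rank anything as unusual
    return [r for r in readings if abs(r - median) / mad > threshold]


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """'How much?': least-squares slope and intercept for y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical pressure gauge readings with one suspicious value.
    gauges = [30.1, 29.8, 30.3, 30.0, 45.2, 29.9, 30.2]
    print("Unusual readings:", find_anomalies(gauges))

    # Hypothetical daily temperatures for days 1 to 5; extrapolate to day 6.
    slope, intercept = fit_line([1, 2, 3, 4, 5], [18.2, 18.9, 20.1, 20.8, 22.0])
    print("Day 6 estimate:", round(slope * 6 + intercept, 1))
```

Real projects would normally reach for an established library and far more data, but the shape of the problem is the same: pick the question type first, then choose a method that answers it.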
With a limit on the types of questions one can ask, a good deal of creativity is needed when attempting to answer complex questions.
Briana Brownell, founder and CEO of PureStrategy.ai, says that within Data Science, “creativity is extremely important. You'll come across so many challenging problems in the field that need creative solutions.”
Which leads us to the third section:
Working in Big Data
We asked Big Data experts about common misconceptions in the field. Kavita Ganesan, a practicing Senior Data Scientist at GitHub, says that “it's an extremely broad field. Someone who specializes in NLP and text analytics may not necessarily be experts in analyzing images.”
As a result of the specializations within Big Data, Brownell says that “there is a huge amount of collaboration both within the data science team as well as with individuals throughout an organization.” She adds that “the biggest misconception about being a data scientist is that it's a mostly solo occupation. Not so!”
And while it is an exciting field to be in, Eric Brown, Data & Technology Consultant & Strategist, remarks that “data science is a lot of work. It’s a lot of data cleaning and a lot of dead end roads. You need to be OK with NOT finding answers when working with data.”
The future of Big Data
Big Data is becoming, and will continue to become, more accessible. Tools and technology related to Big Data, such as cloud storage and AI assistants, are increasingly widespread and affordable. Ganesan says:
“Currently, only big name companies like Google, Microsoft, and IBM have been able to use data science at its maximum potential because of all the IT infrastructure that they have and the high profile talent that they are able to hire. With all the new training programs and easy to use tools, more and more people are going to be able to make that same impact.”
Additionally, the slower a business is to adopt Big Data, the more likely it is to get left behind. According to an Accenture study, 79 percent of enterprise executives “agree that companies that do not embrace Big Data will lose their competitive position and may even face extinction.”
With the proliferation of sources from which data can be collected come concerns over privacy, data security, and discrimination. Big Data is powerful in the predictions it can make about us, and “with great power, there must also come – great responsibility.” One example of these concerns coming to fruition was the Cambridge Analytica scandal surrounding the 2016 US presidential election campaign.
Thankfully, on the 25th of May 2018, the General Data Protection Regulation (GDPR) came into effect in the EU. Wired describes the GDPR as a framework that “sets a new standard for data collection, storage, and usage among all companies that operate in Europe. It will change how companies handle consumer privacy and will give people new rights to access and control their own data on the internet.”
While the legal framework applies only to the personal data of people in the EU, it will hopefully set a precedent for other countries or blocs to follow.
How to get started
At present, demand for Big Data skills far outstrips supply. According to a 2015 MIT Sloan Management Review survey, 40 percent of the companies surveyed were struggling to find and retain data analytics talent.
To break into the Big Data world, you need a multidisciplinary skill set spanning math, statistics, and programming. Because jobs within the discipline are so diverse, it is difficult to say which specific skills you will need. A good bet is to check job boards for Big Data roles that interest you and note the skills they require or prefer.
While hard skills (e.g., coding and statistics) are a fundamental part of the industry, communication is just as vital. Brown says, “The programming is the easy part. Learn math and learn to communicate. A data scientist is almost worthless if they can't communicate their findings.”
A willingness to continue to learn and upskill is a must. Brownell says that “the science and technology is changing rapidly, along with customer expectations of how it should be implemented in business. So I think that a characteristic of good data science is someone who is able to continue learning throughout their career. Most of what you will learn in a course will be different when you get into the field.”
Ganesan recommends “find[ing] a niche that is interesting to you and really specialize in it.” Those who do that “will really start making a difference.”