One of the prime instigators of the digital revolution is the phenomenon we call “big data.” A term originally coined to describe datasets too large for the typical software tools of the time to capture, manage, and process within a reasonable time frame, big data has emerged in recent years as the leading driver of digital transformation. Now that analytics tools have become sophisticated enough to tackle the scope of these ever-growing datasets, virtually all of today’s digital decision-making involves some aspect of big data analysis. From traffic patterns and medical records to music downloads, the process of recording, storing, and analyzing today’s wealth of data is what enables the effective operation of the technology and services that power our digital world.
As our mastery of big data has evolved, so too have the characteristics we use to precisely define and understand it. Data scientists today usually use what have become known as “the four V’s” to explain exactly what big data is, how it works, and how we can make the most of it.
Volume

Given that it contains the word “big” right in its name, it’s hardly surprising that volume is the original defining attribute of big data. The sheer scale of big data has been, and remains, its most inescapable quality, thanks to the exponential rate at which it continues to grow. According to Fortune magazine, the total amount of data generated by humans prior to 2003 was 5 billion gigabytes (5 exabytes). In 2011, it took a mere two days to produce the same amount of data, and by 2013, we were producing even more than that every 10 minutes. In the last two years alone, an astonishing 90% of all the world’s data was created. Today, according to statistics from IBM, we generate 2.5 quintillion bytes of data every single day, which is enough to fill about 10 million Blu-ray discs. (Fun fact: stacked on top of one another, those discs would reach the height of four Eiffel Towers.)
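For readers who like to check the math, the unit conversions behind these figures can be verified with a quick back-of-the-envelope sketch (the constants below are simply the numbers cited above, using decimal units):

```python
# Back-of-the-envelope checks on the volume figures cited above.
GB = 10**9   # one gigabyte, in bytes (decimal units)
EB = 10**18  # one exabyte, in bytes

# "5 billion gigabytes" of pre-2003 data really is 5 exabytes:
pre_2003_bytes = 5_000_000_000 * GB
assert pre_2003_bytes == 5 * EB

# "2.5 quintillion bytes per day" works out to roughly 29 TB per second:
daily_bytes = 2.5 * 10**18
per_second = daily_bytes / 86_400  # seconds in a day
print(f"~{per_second / 10**12:.1f} TB of new data per second")  # ~28.9 TB/s
```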
Velocity

The rate at which new data is created, as well as the speed with which it is consumed, is another key characteristic of big data. The global flow of data today is massive and continuous. Every single minute, more than 200 million emails are sent, 216,000 photos are posted on Instagram, and 72 hours of footage are uploaded to YouTube. IBM estimates that by 2018, global Internet traffic will reach a rate of 50,000 gigabytes per second. In terms of data-driven decisions, one of the most significant consequences of the velocity of big data has been the development of dynamic, real-time processing capabilities. Some data, such as streaming video from traffic cameras, is only useful if it can be analyzed as soon as it is generated. With speed being of the essence, data scientists have focused on creating the analytical algorithms and data transmission infrastructure that make it possible to feed this data into business processes in real time.
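The kind of real-time, windowed analysis described above can be sketched in a few lines. The class below is purely illustrative (it is not from any particular streaming framework): it keeps only the last minute of event timestamps and reports the current event rate, mimicking how a stream processor analyzes data as it arrives rather than storing everything for later.

```python
from collections import deque
import time

class SlidingWindowCounter:
    """Counts events seen in the trailing window_seconds -- a toy model
    of analyzing a stream (e.g. traffic-camera events) as it arrives."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # timestamps, oldest first

    def record(self, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        self.events.append(now)
        self._evict(now)

    def rate(self, now=None):
        now = now if now is not None else time.time()
        self._evict(now)
        return len(self.events) / self.window  # events per second

    def _evict(self, now):
        # Drop anything older than the window; memory stays bounded
        # no matter how long the stream runs.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

counter = SlidingWindowCounter(window_seconds=60)
for t in range(0, 120, 2):      # one simulated event every 2 seconds
    counter.record(timestamp=t)
print(counter.rate(now=120))    # ~0.48 events/s: only the last 60 s count
```

The design choice worth noting is that old events are discarded as new ones arrive, which is what lets this style of processing keep up with an unbounded stream.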
Variety

In the early days of data, spreadsheets and databases were the most common, and indeed nearly the only, sources of data. Today, however, data comes in a staggering variety of forms and from an incredible range of sources. Modern data comprises everything from tweets and emails to customer purchase histories and customer service calls. Thanks to the prevalence of social media, the bulk of current data growth (about 80% of all data) consists of videos, images, and documents. However, as the Internet of Things becomes more widespread, we can expect rapid growth in the amount of data generated by things like wearable fitness devices, smart homes, and self-driving cars.
Veracity

Perhaps the most subjective attribute of big data, and one with particularly tricky implications for data-driven decision-making, is veracity. That is, with so much data coming so quickly from such a wide array of places, how can we be sure it’s all accurate? How can we know we are interpreting it correctly when most datasets contain significant amounts of abnormalities, biases, and other “noise”? Studies show this is no idle concern. IBM reports that one in every three business leaders does not trust the data and information they rely on to make business decisions. Furthermore, poor data quality is estimated to cost the US economy more than $3 trillion every year. Even though big data underpins so much of our world, the uncertainty surrounding its reliability is a vital attribute that should not be overlooked.
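One concrete way to confront veracity is to screen incoming data for implausible values before trusting it. The sketch below is illustrative only (real pipelines use far more sophisticated validation): it flags readings that sit far from the median, using a robust score based on the median absolute deviation, which a single large outlier cannot distort the way it distorts a mean.

```python
import statistics

def flag_suspect(values, threshold=3.5):
    """Flag values far from the median, using the modified z-score
    built on the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 scales the MAD to be comparable to a standard deviation.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# A temperature feed with one implausible spike:
readings = [21.0, 21.4, 20.9, 21.2, 250.0, 21.1, 20.8]
print(flag_suspect(readings))  # [250.0]
```

A screen like this does not decide what caused a bad reading; it only surfaces candidates for a human or a downstream rule to judge, which is exactly the trust problem veracity describes.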
The fifth V?
While these four characteristics are the main factors that define big data, it’s the so-called “fifth V” where big data’s true potential lies. This last V stands for value: the ability to leverage the insights generated by superior analytics capabilities to achieve better outcomes across all sectors and activities. Indeed, the “fifth V” represents what is perhaps the most important thing to understand about big data: it is only as useful as the actionable insights we can draw from it.