Big Data refers to massive amounts (high volume) of varied types of data that cannot be processed with traditional software tools such as spreadsheets or relational (SQL) databases. “High volume” usually means terabytes and petabytes (a terabyte is one thousand gigabytes, and a petabyte is one thousand terabytes). Today, sensors collect all sorts of information about the environment around us, from temperature, location, and atmospheric pressure to weather conditions and even traffic jams. We also use our digital devices throughout the day, at home, at school, at work, and on the go, and they collect location data wherever we are.
All this information is becoming increasingly available for processing with new technologies. Tools such as Hive, Pig, Cascading, and Apache Spark have been developed to analyze Big Data, some of them through SQL-like query languages. Storing large amounts of data is not enough on its own; filtering, sorting, and visualizing it also becomes necessary. NoSQL databases have gained popularity because they can manage these vast data sets, often by relaxing the strict consistency guarantees that traditional relational database management systems (RDBMS) enforce.
What is Big Data?
Big Data describes huge data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially those relating to human behavior and interactions.
Sensors that gather environmental information, which can be visualized as heat maps, help businesses increase efficiency by understanding how people interact with their environment.
Big Data is also characterized by the three V’s: velocity, variety, and volume. Velocity means that data becomes available in real time, which makes it relevant for immediate action or response. Variety refers to the different types of data collected (text, images, video, etc.), while volume refers to the sheer amount of it all. Big Data can be classified into two further categories: structured and unstructured data. Structured data is what we typically think of when dealing with databases: tables, columns, and rows.
Unstructured data
Unstructured data has no formal structure defined in its content, for example documents from word processing programs. In addition to these two classifications, there is a third category called semi-structured data, which combines elements of both: formats such as JSON or XML label their fields but do not enforce a fixed table layout. A further classification deals only with the volume aspect: data sets that are small in volume might be classified as microdata or small-volume data.
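The difference between the three categories is easier to see side by side. The following is a minimal sketch using only the Python standard library; the sensor names, field names, and sample values are made up for illustration and do not come from any real data set.

```python
import csv
import io
import json

# Structured: fixed columns and rows, as in a database table.
structured_text = "sensor_id,temperature_c\nA1,21.5\nB2,19.0\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: fields are labeled, but records need not share the same fields.
semi_structured_text = (
    '[{"sensor_id": "A1", "temperature_c": 21.5},'
    ' {"sensor_id": "B2", "note": "offline"}]'
)
records = json.loads(semi_structured_text)

# Unstructured: free text with no schema; any structure has to be inferred.
unstructured_text = "Sensor A1 looked fine this morning, but B2 went offline after lunch."
words = unstructured_text.split()


def field_names(items):
    """Return the set of field names seen across a list of dict records."""
    names = set()
    for item in items:
        names.update(item.keys())
    return names


print(field_names(rows))     # every row has the same two columns
print(field_names(records))  # the union of fields differs from any single record
print(len(words))            # for free text, only a raw word count comes for free
```

Every structured row exposes the same columns, the semi-structured records have to be inspected one by one to learn which fields exist, and the unstructured text offers nothing beyond the raw words.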
How to deal with Big Data
The first approach to dealing with Big Data focused on processing rather than storage, but now the two are intertwined. Even though it is technically possible to store all kinds of information nowadays, there are still real costs associated with such an unrestricted storage strategy. One alternative would be to prolong the time between collecting and storing data, working with it as it arrives (think about how video can be streamed online). Another way around this problem is to think about Big Data in terms of computation instead of storage.
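One way to picture "computation instead of storage" is to summarize readings as they arrive and keep only the summary. The sketch below is an illustration under assumptions, not a description of any particular Big Data product: `sensor_readings` is a hypothetical stand-in for a live data source, and the values are invented.

```python
from typing import Iterable, Iterator, Tuple


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream of temperature readings, produced one at a time."""
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value


def running_summary(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Consume a stream once and return (count, mean, maximum).

    Only three numbers are kept in memory, no matter how long the stream is.
    """
    count = 0
    mean = 0.0
    maximum = float("-inf")
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update of the mean
        maximum = max(maximum, value)
    return count, mean, maximum


count, mean, maximum = running_summary(sensor_readings())
print(count, mean, maximum)
```

The trade-off is that the raw readings are gone once they have been consumed, so any question that was not anticipated by the summary can no longer be answered.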
GPUs (graphics cards) boost computing power even further and play an important role in handling Big Data, because they speed up calculations by running many of them in parallel. GPUs have been around since the mid-’90s but have seen a big boost in usage over the last few years for this kind of application.
Side note: A heat map is an image that uses different color shades to depict how hot or cold specific locations are compared to others. For example, a dark shade of red might mean very warm temperatures, blue very cold ones, and white average temperatures.
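The color scheme described in the side note can be sketched in a few lines. This is a deliberately coarse, three-color version; the threshold values and the sample grid are assumptions chosen for illustration, and a real heat map would normally use a continuous color scale.

```python
from typing import List


def shade(temperature: float, low: float, high: float) -> str:
    """Map one temperature to a coarse color label (thresholds are assumptions)."""
    if temperature >= high:
        return "dark red"  # very warm
    if temperature <= low:
        return "blue"      # very cold
    return "white"         # average


def heat_map(grid: List[List[float]], low: float = 10.0, high: float = 30.0) -> List[List[str]]:
    """Convert a grid of temperatures (one value per location) into color labels."""
    return [[shade(value, low, high) for value in row] for row in grid]


# Hypothetical 2x2 grid of temperatures in degrees Celsius.
grid = [[5.0, 18.0], [20.0, 35.0]]
colors = heat_map(grid)
for row in colors:
    print(row)
```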
Big Data Challenges
One thing to remember is that scale-out architectures (adding more machines) are increasingly preferred over scaling up (adding more CPU, RAM, etc., to a single machine) because they can provide more scalable performance. That said, there are some challenges associated with Big Data that go beyond storage and processing. Privacy concerns, legal questions about ownership, and accountability play a role as well. With data volumes growing exponentially, it is vital to address these questions sooner rather than later, while new technologies permit us to make sense of Big Data quickly.
There are plenty of resources available online about Big Data, and course content is constantly being developed. Courses at the bachelor’s, master’s, and Ph.D. levels, as well as MOOCs (massive open online courses) on platforms such as Udacity or edX, can be found easily.
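The idea behind scaling out can be sketched as follows: split the records across several workers, let each worker compute a partial result, and merge the partial results at the end. This is a simplified single-process simulation; the round-robin split, the function names, and the sample records are assumptions for illustration, and real systems add networking, fault tolerance, and scheduling on top.

```python
from collections import Counter
from typing import List


def partition(records: List[str], num_workers: int) -> List[List[str]]:
    """Split records across workers round-robin."""
    parts: List[List[str]] = [[] for _ in range(num_workers)]
    for index, record in enumerate(records):
        parts[index % num_workers].append(record)
    return parts


def count_locally(part: List[str]) -> Counter:
    """Work done independently on one worker: count its own records."""
    return Counter(part)


def merge(partials: List[Counter]) -> Counter:
    """Combine the partial counts from all workers into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical location records collected from devices.
records = ["home", "work", "home", "school", "work", "home"]
parts = partition(records, num_workers=3)
merged = merge([count_locally(part) for part in parts])
print(merged)
```

Because each worker only ever sees its own partition, adding more workers shrinks the share of data each one has to handle, which is the property that makes scale-out attractive compared with enlarging a single machine.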