By Archita Goyal, batch of 2017
Every day, Facebook handles some 350 million photo uploads and the interactions of 936 million active users with more than 900 million objects (pages, groups, etc.).
More than 5 billion people are calling, texting and browsing the Web on mobile phones worldwide.
Twitter produces over 1 million tweets a minute.
More than a million customer transactions occur at Wal-Mart every single hour, feeding about 2.5 petabytes of data into its databases.
Organizations today are being inundated with data from every conceivable direction, and its expanse is growing exponentially. This voluminous data is Big Data. Wikipedia defines Big Data as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” Big Data can hence be visualized as data so large that it is difficult to capture, store, manage, share or analyze within current computational architecture. Big data is also a relative term: something termed big data today might not be big data tomorrow.
Of course, this data, collected involuntarily every day, or rather every single minute, wouldn’t be of much use unless concrete conclusions were drawn from it. This is exactly where “analytics” comes into play. A careful understanding of big data can enable companies to make more informed decisions. As an example,
Starbucks was introducing a new coffee product. To gauge the response, Starbucks monitored blogs, Twitter and other media platforms, along with discussion forums, to assess customers’ reactions. By mid-morning, Starbucks discovered that although the taste was being appreciated, customers found the product a little expensive. Starbucks catered to the need of the hour, leading to an overall positive response by the end of the day.
Now, the more traditional approach would have been to wait for the sales reports to come in, but that would obviously have taken a few weeks.
This is how big data offers an intelligent and smarter way to take decisions, and to work out solutions to many of the problems businesses face today.
Who Is Using Big Data And How?
An interesting instance that vouches for the prominence of Big Data and analytics beyond the realms of traditional database management is an application called TwitterHealth, a piece of software that analyzes Twitter feeds in search of social updates that could indicate someone is suffering from flu. Twitter users often tweet when they feel sick or intend to stay at home, and the application takes advantage of exactly that. The app went on to produce a surprisingly good real-time map of flu epidemics, nearly as accurate as those prepared by medical practitioners, yet easier and cheaper to build.
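To make the idea concrete, here is a toy sketch of the core trick, with made-up tweets and an invented keyword list (the real TwitterHealth pipeline is of course far more sophisticated):

```python
from collections import Counter

# Hypothetical sample of (user_location, tweet_text) pairs
tweets = [
    ("Delhi", "Feeling feverish, staying home today"),
    ("Delhi", "Great match last night!"),
    ("Mumbai", "Down with the flu, skipping work"),
    ("Delhi", "This cough just won't go away"),
]

FLU_KEYWORDS = {"flu", "fever", "feverish", "cough", "sick"}

def looks_sick(text):
    """Naive keyword check on a lower-cased, punctuation-stripped tweet."""
    words = {w.strip(",.!?'") for w in text.lower().split()}
    return bool(words & FLU_KEYWORDS)

# Count flu-like tweets per location to sketch a crude real-time 'flu map'
flu_counts = Counter(loc for loc, text in tweets if looks_sick(text))
print(flu_counts)  # Counter({'Delhi': 2, 'Mumbai': 1})
```

Aggregating such counts over time and geography is what turns chatty social updates into an epidemic map.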
Today, Quora and Facebook use Big Data tools to understand more about you and serve you a feed that, in theory, you should find interesting. The fact that the feed often isn’t interesting only shows how hard the problem is.
Amazon uses Big Data tools to fuel its ‘Frequently bought together’ and ‘customers who bought this item also bought’ features.
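At its simplest, “customers who bought this item also bought” boils down to counting how often pairs of items appear in the same order. A minimal sketch with invented order data:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical order histories (each order is a set of items)
orders = [
    {"keyboard", "mouse", "monitor"},
    {"keyboard", "mouse"},
    {"mouse", "mousepad"},
]

# Count how often each pair of items appears in the same order
pair_counts = defaultdict(int)
for order in orders:
    for a, b in combinations(sorted(order), 2):
        pair_counts[(a, b)] += 1

def also_bought(item, top=3):
    """Items most frequently bought together with `item`."""
    scores = defaultdict(int)
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(also_bought("mouse"))  # 'keyboard' ranks first (bought together twice)
```

At Amazon’s scale the same counting is distributed across clusters, but the underlying co-occurrence idea is this simple.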
LinkedIn uses Big Data tools to suggest to its users ‘people you may know’ or ‘companies you may want to follow’.
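A classic first cut at ‘people you may know’ is the friends-of-friends heuristic: rank strangers by how many connections you already share. A small sketch with a made-up connection graph:

```python
# Hypothetical connection graph: user -> set of direct connections
connections = {
    "asha": {"ben", "chen"},
    "ben": {"asha", "chen", "dev"},
    "chen": {"asha", "ben", "dev"},
    "dev": {"ben", "chen"},
}

def people_you_may_know(user):
    """Rank non-connections by the number of mutual connections with `user`."""
    direct = connections[user]
    scores = {}
    for friend in direct:
        for candidate in connections[friend]:
            if candidate != user and candidate not in direct:
                scores[candidate] = scores.get(candidate, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(people_you_may_know("asha"))  # ['dev'] (dev shares 2 mutual connections)
```

Production systems blend in many more signals (schools, employers, profile views), but mutual-connection counting remains the backbone.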
These days, start-ups primarily tap into big data via federated sources, including social media, to expand their marketing footprint and create awareness. Large enterprises, though a smaller fraction of the landscape, are the most aggressive in their big data roll-out strategies, seeking to understand their customer segments better so as to increase sales. From analyzing customer influences on preferred telecommunications carriers to personalized customer offers, Big Data analytics is playing an increasingly vital role. The public sector not only consumes but also generates a lot of data. Government sectors like security and finance already rely on big data across most functional operations; healthcare and industry policy reforms will be the next to embrace the wave. Increased data adoption in government will also considerably ease cross-functioning between sectors and lead to better citizen services, making Big Data more than just a technical paradigm.
The Building Blocks Of Big Data
Big Data is said to have three dimensions, viz. Variety, Volume and Velocity, sometimes referred to as the 3Vs. Variability and Complexity are often considered the fourth and fifth characteristics.
- Variety – Big data means much more than rows and columns. It means unstructured text, video and audio that can have important impacts on company decisions – if it’s analyzed properly in time. This ‘unstructured data’ accounts for up to 85 percent of an organization’s data, hence requiring different architecture and technologies for analysis.
- Volume – Big data as the name suggests is relatable to size, and hence the characteristic. The quantity or size of data determines the value and potential of the data under consideration and whether it can actually be termed as Big data or not.
- Velocity – It contextualizes the frequency of generation of data and how fast the data is processed to meet the necessary demands.
- Variability – In addition to high speed, the data flows can be highly variable, with daily, seasonal and event-triggered peak loads that can be challenging to manage.
- Complexity – Data management can become a very complex process, especially when large volumes of data come from multiple sources. The data must therefore be linked, connected and correlated in order to grasp the information it is supposed to convey.
Tools And Technologies
The three technologies vital for handling big data and its subsequent analysis are information management, high-performance analytics and flexible deployment options. Analytics on big data is made possible by a variety of tools available in the market.
- MapReduce – A software framework for writing applications which process vast amounts of data in parallel on large clusters of commodity hardware. A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in parallel. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. MapReduce jobs can be written in various languages such as Java, Python, R and Perl. Technologies that build on or abstract the MapReduce model include Apache Hadoop (Java-based), Hive, Pig, S4, Acun, Oozie and Greenplum.
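The map, shuffle and reduce phases described above can be sketched in plain Python as an in-memory word count, the canonical MapReduce example (a real framework runs the same phases distributed across many machines):

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input chunk."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(mapped):
    """Shuffle/sort: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key, here by summing counts."""
    return key, sum(values)

# Two independent input chunks, as if split by the framework
chunks = ["big data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because each chunk is mapped independently and each key is reduced independently, both phases parallelize naturally, which is the whole point of the model.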
Apache Hadoop, the most commonly used big data platform, was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data. It is 100 percent open source, and pioneered a fundamentally new way of storing and processing data. It enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email – just about anything one can think of!
- NoSQL (Not Only SQL) – An umbrella term for databases that provide mechanisms for storing and retrieving data modelled in ways other than relational tables. Cassandra, Redis, BigTable, HBase, Riak, Zookeeper etc. provide a NoSQL environment.
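The core departure from relational tables is that many NoSQL stores keep schema-free documents under a key. A toy Python sketch of that document-store idea (not the API of any real NoSQL client):

```python
import json

# A toy 'document store': keys map to schema-free JSON documents
store = {}

def put(key, document):
    """Serialize and save a document under a key, as a real store would."""
    store[key] = json.dumps(document)

def get(key):
    """Fetch and deserialize the document stored under a key."""
    return json.loads(store[key])

# Unlike relational rows, each document can have its own shape
put("user:1", {"name": "Asha", "tags": ["analytics"]})
put("user:2", {"name": "Ben", "email": "ben@example.com"})

print(get("user:1")["tags"])  # ['analytics']
```

Real systems such as Redis or HBase add persistence, replication and distribution on top, but the key-to-flexible-value model is the common thread.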
- Cloudera Impala – A fully integrated, state-of-the-art analytic database, architected specifically to leverage the flexibility and scalability strengths of Hadoop as well as NoSQL.
- Corona – A new scheduling framework that separates cluster resource management from job coordination. It introduces a cluster manager which tracks the nodes in the cluster and the amount of free resources.
A Step Forward
Here’s a small game:
- Do you like “peeling apart” problems and studying the relationships between data? Does the ‘itch of curiosity’ keep nagging at you once you have had a glance at surprising facts and figures?
- Do you like devising new, interesting solutions to already existing problems? A sign of creativity in excess!
- Do you like drawing conclusions from already available information pieces?
If the answers to these questions were mostly yes, data analysis is just the perfect arena for you to try your hand at!
A computer background is somewhat necessary for a career in Big Data. Big data skills include natural language processing and text mining, along with familiarity with Clojure, Scala, Python, Hadoop and Java. Data mining skills with tools like R and MATLAB are an added advantage!
A career in big data demands a strong hold on mathematics, especially linear algebra, calculus and probability. Being a coding-adjacent field, big data analytics also requires a fairly good background in at least one programming language, Python being the most preferred.
A few really interesting and free online courses which would help you get your foot in the door are as follows:
- Introduction to Data Science-
- Data Science specialization- https://www.coursera.org/specialization/jhudatascience/1
- Introduction to Apache Spark- https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x#.VKDqDF4BA
- Introduction to Hadoop and MapReduce- https://www.udacity.com/course/ud617
- Mining Massive Datasets- https://www.coursera.org/course/mmds
- Web Intelligence and Big Data- https://www.coursera.org/course/bigdata
Some brainstorming at Hortonworks Sandbox (http://hortonworks.com/products/hortonworks-sandbox/) would surely prove to be helpful.
The following websites are worth bookmarking if you are really interested in big data:
If you are already into the field of Big Data, you can try your hand at the various toy problems available on the internet, or take up a research project. Public data sets, on consumerism, healthcare and more, can be accessed from the web itself, and a careful analysis of them can make for a good research project at an intermediate level.
The present scenario in the Big Data world is that although most data collection is done by the corporate world, private startups generally do the analyzing, drawing conclusions from the data so made available. Data science is a booming industry that, as a study by McKinsey & Co. suggests, would attract nearly 5 million scientists worldwide in the forthcoming years: a new, but promising, showground!