BIG DATA FOUNDATIONS PART 1

Divith Raju
11 min read · Feb 10, 2024

Big Data Fundamentals

Big data is no longer merely a catchphrase. Today’s business environment
has made it crucial to the success of many firms: the advantages of a
comprehensive data and analytics platform have distinguished profitable
firms from their competitors. It is therefore worth everyone’s while to get
familiar with the fundamentals of big data.

What is Big Data?

Big Data is defined as a huge volume of continually expanding data that
cannot be stored and processed using conventional computing approaches
in a given amount of time.

Simply put, big data is significantly harder to process using conventional
techniques because of its complexity or sheer volume. On such data, even a
simple SELECT statement may need many days or weeks to complete.

Why is Big Data Important?

The importance of big data depends on more than simply how much data
you have; how you use it determines how much it is worth. Analysed well,
big data yields solutions that:

· Reduce Business Management Complexities and Help Improve Processes,

· Improve Operational Efficiency,

· Optimise Product Development and Services,

· Offer Potential for New Income and Growth, and

· Promote Effective and Intelligent Decision Making.

You can accomplish your company goals by combining big data and
powerful analytics. For instance, big data provides valuable consumer
information that organisations can use to enhance their marketing,
advertising, and promotions and increase customer engagement and
conversion rates. By analysing past and current data to determine the
shifting preferences of consumers or corporate buyers, businesses may
become more responsive to customer requests and needs.

Is Your Data Big Data, and to What Extent?

There are several misconceptions about the volume of data that qualifies
as big data. It is often assumed that Big Data must be measured in
gigabytes, terabytes, petabytes, or something even bigger.
This is where we need to understand:

Depending on the context, even a modest quantity of data
might be referred to as big data.

For instance, if we wanted to resize 100 images with a combined size of 20
MB using a standard desktop in a specific amount of time, we should be
able to do it, since the desktop’s computational power would be adequate
for the task. However, if we attempted to process 100 terabytes of picture
files on a single desktop, we would be unable to complete the operation
within the allotted time and would instead require more powerful servers
with cutting-edge computing resources. Relative to a single desktop
computer, then, 100 terabytes of picture files qualify as “Big Data.”
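To make the scale difference concrete, here is a quick back-of-envelope calculation. The assumed throughput of 100 MB per second per machine is purely illustrative, not a benchmark:

```python
# Back-of-envelope scaling: why 100 TB overwhelms a single machine.
# The 100 MB/s per-machine throughput is an assumed, illustrative figure.
MB = 1
TB = 1024 * 1024 * MB

throughput_mb_per_sec = 100        # assumed single-machine processing rate

small_job_mb = 20 * MB             # 100 images, 20 MB combined
big_job_mb = 100 * TB              # 100 TB of picture files

small_seconds = small_job_mb / throughput_mb_per_sec
big_seconds = big_job_mb / throughput_mb_per_sec

print(f"20 MB job:  {small_seconds:.1f} seconds on one machine")
print(f"100 TB job: {big_seconds / 86400:.0f} days on one machine")
print(f"100 TB job: {big_seconds / 1000 / 60:.0f} minutes on 1000 machines")
```

Under these assumptions, the 20 MB job finishes in a fraction of a second, while the 100 TB job would take roughly twelve days on one machine but under twenty minutes spread across a thousand machines, which is exactly the motivation for clustered processing.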

Other examples of big data include sensor data gathered as part of
industrial processes, user data gathered by Netflix, electronic data
obtained from medical equipment, campaign data used for marketing
purposes, trade data generated by the New York Stock Exchange, etc.

Characteristics Of Big Data

Big data can be described by the following characteristics (5Vs):

· Volume:

The amount of data being generated and captured.

· Velocity:

The speed at which big data is being generated.

· Variety:

The type and structure of the data being recorded.

· Veracity:

The quality and trustworthiness of the data.

· Value:

The business value that can be derived from the data collected.

What are the key drivers of such Data Growth?

The key drivers of data growth are the following:

· Increase in storage capacities.

· Cheaper storage.

· Increase in data processing capabilities by modern computing devices.

· Data generated and made available by different sectors.

How is Big Data Classified?

Big Data is classified into 3 different categories:

· Structured Data

Data that conforms to a fixed schema or model is referred to as structured
data. For instance, structured data includes the information found in
relational databases, CSV files, and Excel spreadsheets.

· Semi-Structured Data

Data that has some organisational markers but does not conform to a rigid
schema is referred to as semi-structured data. Semi-structured data, for
instance, includes the information found in emails, log files, and Word
documents.

· Unstructured Data

Data that has no structure attached to it at all is referred to as
unstructured data. Unstructured data includes, for instance, audio, video,
and picture files.
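A minimal sketch of how an ingestion pipeline might triage incoming payloads into these three categories. The heuristics (UTF-8 decodability, JSON parsing, consistent CSV columns) are simplified assumptions for illustration, not a production classifier:

```python
import csv
import io
import json

def classify(payload: bytes) -> str:
    """Crude triage of an incoming payload into the three categories."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return "unstructured"        # binary blobs: audio, video, images
    try:
        json.loads(text)
        return "semi-structured"     # tagged but schema-less (e.g. JSON)
    except json.JSONDecodeError:
        pass
    rows = list(csv.reader(io.StringIO(text)))
    if rows and len(rows[0]) > 1 and all(len(r) == len(rows[0]) for r in rows):
        return "structured"          # tabular, fixed columns (CSV-like)
    return "semi-structured"         # free text with some markers (logs)

print(classify(b"id,name\n1,Ada\n2,Alan"))         # structured
print(classify(b'{"event": "login", "user": 1}'))  # semi-structured
print(classify(b"\xff\xd8\xff\xe0"))               # unstructured
```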

Why can’t the Traditional Methods of Data Processing
and Storage be used with Big Data?

Conventionally, the data produced by businesses like banks, stock
exchanges, or hospitals is sent as input to an ETL (Extract, Transform,
and Load) system.

This data would be extracted, transformed (i.e., put into the right format),
and then loaded into the database using an ETL system.

After this procedure is finished, the end users will be able to carry out a
variety of tasks, including producing reports and doing analytics using this data.
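The extract-transform-load flow described above can be sketched with Python's standard library and an in-memory SQLite database; the trade-record format and table name here are invented for illustration:

```python
import sqlite3

# Extract: raw records as they might arrive from a source system.
raw_rows = [
    "2024-02-10, AAPL , 185.50",
    "2024-02-10, msft , 406.30",
]

def transform(line: str) -> tuple:
    """Transform: normalise whitespace and case, coerce price to a number."""
    date, symbol, price = (field.strip() for field in line.split(","))
    return date, symbol.upper(), float(price)

# Load: insert the cleaned rows into a relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (trade_date TEXT, symbol TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)",
                 (transform(line) for line in raw_rows))

# End users can now run reports and analytics against the loaded data.
for row in conn.execute("SELECT symbol, price FROM trades ORDER BY symbol"):
    print(row)
```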

However, as the amount of data increases, it becomes more difficult to
handle and process it using the conventional methods. This is one of the
justifications for not storing and processing Big Data in the conventional
manner.

1. The first disadvantage is cost: such a system is expensive and requires
significant investment to establish or upgrade, making it unaffordable
for small and midsize businesses.

2. The second problem is scalability: it is difficult to expand this
system as the amount of data increases.

3. The final flaw is time: because the system is built on antiquated
computer technologies, processing this data and extracting useful
information from it takes a long time.

I hope this makes it apparent why big data cannot be stored and
processed using the conventional method or antiquated computer technologies.

Challenges Associated With Big Data

There are 2 main challenges associated with Big Data.

1. How to effectively store and handle such a large amount of data.

2. How to quickly analyse and extract important information from
such a vast amount of data.

These are the two primary problems with storing and processing large
amounts of data that gave rise to the development of the Hadoop
framework.

How is Big Data Stored and Processed?

Storage:

Nowadays, a Data Lake is commonly used to store big data. Data lakes can
handle a wide range of data types. They are typically built on Cloud
Storage Services (e.g., Azure Storage), NoSQL databases, Hadoop clusters,
or some other big data platform. Data lakes store raw data, which can then
be filtered or structured as per the analytical requirements and coupled
with other platforms. Data Warehouses, by contrast, are built on
relational databases and contain only structured data.

Processing:

Big data processing exerts a high strain on computational infrastructure.
To address this, technologies like Hadoop and the Spark engine offer
clustered systems that divide processing workloads over hundreds or
thousands of nodes (computers), providing the required computational
capacity. To support such highly scalable and efficient compute services,
organizations use cloud hosting services to obtain processing capability
at reasonable cost. This type of service can be scaled up to perform the
necessary big data analytics, and the company pays only for the storage
and compute time it consumes.
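The divide-and-conquer idea behind such clustered engines can be illustrated locally with a split-map-reduce word count; real engines distribute the map step over many machines, whereas this sketch only uses local threads to show the pattern:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk: list) -> Counter:
    """Map step: each worker counts words in its own slice of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

lines = ["big data is big", "data needs scale", "scale is key"] * 100

# Split the input into one chunk per worker.
n_workers = 4
chunks = [lines[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(count_words, chunks))

# Reduce step: merge the per-worker partial counts into one result.
totals = sum(partials, Counter())
print(totals["big"], totals["needs"])   # 200 100
```

Hadoop MapReduce and Spark apply the same split, map, and reduce structure, but shuffle intermediate results between machines instead of summing local counters.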

How Should Businesses Use Their Big Data?

For organizations to use their big data, they should clearly define the
following steps for building a sustainable data fabric:

1. Set the Right Data Strategy, aligned with business goals and future
initiatives, to effectively devise their data journey.

2. Identify their Data Sources, such as streaming/real-time data,
application data, 3rd-party data integrations, and data from other open sources.

3. Plan how to Access, Manage and Store the data: set up the right data governance model, identify the right storage, define data integrations, focus on data quality, etc.

4. Define the Business Goal to identify and analyse the right data for
insights and AI endeavours.

5. Use the Insights to make intelligent, data-driven decisions
to stay competitive and efficient.

Advantages and Disadvantages of Big Data

The growth in data availability creates both benefits and challenges.

Large data collections give businesses the chance to do richer and
deeper analyses that are beneficial for growth. But it is equally important
that companies take steps to protect this data against misuse and data
breaches. Further, huge volumes of data can also lead to overload and
noise, which makes them less helpful for businesses. It is therefore
crucial for a business to identify the data relevant to its decisions.

Big Data Technologies

Here are a few Big Data technologies currently in use:

· Storage: Hadoop Distributed File System (HDFS) and Cloud
Storage Services like Amazon S3, Azure Blob Storage, Google Cloud
Storage, etc.

· Cluster Management: Kubernetes, Mesos, YARN, etc.

· Big Data Platforms: Amazon EMR (Elastic MapReduce), Google
Dataproc, Azure HDInsight, Cloudera Data Platform, MapR Data
Platform, etc.

· Data Lake and Data Warehouse: Amazon Redshift, Delta Lake,
Google BigQuery, Snowflake, Oracle DWH, SQL Server DWH, Azure
Data Warehouse, etc.

· SQL: SQL Server, Hive, Impala, Drill, Presto, etc.

· NoSQL: Couchbase, CouchDB, Cassandra, MongoDB, HBase,
Neo4j, Redis, etc.

· Streaming: Kafka, Storm, Flink, Spark Streaming, etc.

· ETL: Informatica, Azure Data Factory, Airflow, Hadoop
Ecosystem (Pig, Hive, Spark, Sqoop), etc.

· Reporting: Power BI, Tableau, QlikView, Looker, etc.

· Analytics: Custom Python/Spark Notebooks, Azure ML, Google ML,
AWS ML, KNIME, etc.

A Few Big Data Case Studies

1. Walmart: Walmart uses Big Data and data mining to generate
customised product suggestions. Walmart identifies useful trends
indicating the most commonly purchased goods, the most popular
goods, and even the most popular product bundles.

2. American Express: The world’s largest credit card company uses vast
amounts of consumer data to find signs that could indicate user loyalty. It also makes use of Big Data to build sophisticated prediction
models that analyse past transactions and forecast possible customer churn.

3. Uber: Uber tracks consumer behaviour to pinpoint the most used and popular services among its users.

4. Netflix: One of the biggest advocates of the recommendation engine is
Netflix. It gathers client information to comprehend the unique
requirements, tastes, and interests of users. It then makes customised
content suggestion lists for each user based on this data to estimate what they would like.

Big Data Architecture

What is Big Data Architecture?

This architecture is designed to handle the ingestion, processing, and
analysis of data that is too large or complex for traditional database
management systems.
Different organizations have different thresholds: for some, a few
hundred gigabytes counts as big data, while for others even several
terabytes is not a meaningful threshold. Meanwhile, the cost of
commodity systems and commodity storage has fallen significantly.
There is also a huge variety of data demanding different ways of being
catered for: some of it is batch-related data that arrives at particular
times, so jobs must be scheduled accordingly, while some of it belongs
to the streaming class, for which a real-time streaming pipeline has to
be built to meet all the requirements. Big data architecture is designed
to solve all these challenges.

Explanation of Big Data Architecture

Big Data systems involve more than one workload type, broadly
classified as follows:
1. Batch processing of big data sources at rest.
2. Real-time processing of big data in motion.
3. Interactive exploration of big data tools and technologies.
4. Machine learning and predictive analytics.

1. Data Sources

The examples include:
(i) Datastores of applications, such as relational databases.

(ii) Static files produced by applications, such as log files generated
by web servers.

(iii) IoT devices and other real-time data sources.

2. Data Storage

This includes the data managed for batch operations, stored in
distributed file stores capable of holding large volumes of big files in
different formats.

3. Batch Processing

All the data is segregated into categories or chunks, and long-running
jobs are used to filter, aggregate, and otherwise prepare the data into a
processed state for analysis. These jobs usually read from sources,
process the data, and write the output to new files. Batch processing is
done in various ways: with Hive jobs, U-SQL jobs, Sqoop, or Pig, along
with custom map-reduce jobs generally written in Java, Scala, or another
language such as Python.

4. Real-Time Message Ingestion

In contrast with batch processing, this covers the real-time streaming
systems that cater to data being generated sequentially and in a fixed
pattern. In the simplest case, this is a data mart or store where all
incoming messages are dropped into a folder for processing. Most
solutions, however, require a message-based ingestion store that acts as
a message buffer, supports scale-out processing, and provides
comparatively reliable delivery along with other message-queuing
semantics. Options include Apache Kafka, Apache Flume, Event Hubs
from Azure, etc.
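The buffer-between-producers-and-consumers idea can be sketched with Python's standard-library queue. This is a toy stand-in for the role a system like Kafka plays, not its API; the message shape is invented:

```python
import queue
import threading

# A toy message buffer: producers drop messages in, a consumer drains
# them for processing, decoupling the two sides' speeds.
buffer: queue.Queue = queue.Queue(maxsize=1000)
SENTINEL = None          # marks end of stream in this toy example
received = []

def producer(n: int) -> None:
    for i in range(n):
        buffer.put({"offset": i, "payload": f"event-{i}"})
    buffer.put(SENTINEL)

def consumer() -> None:
    while True:
        msg = buffer.get()
        if msg is SENTINEL:
            break
        received.append(msg["payload"])  # a real system would process here

t_prod = threading.Thread(target=producer, args=(5,))
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

print(received)   # ['event-0', 'event-1', 'event-2', 'event-3', 'event-4']
```

Real ingestion stores add durability, partitioning, and replayable offsets on top of this basic buffering pattern.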

5. Stream Processing

There is a slight difference between real-time message ingestion and
stream processing. The former collects the ingested data and then acts
as a publish-subscribe kind of tool. Stream processing, on the other
hand, handles the streaming data in windows or streams and then writes
it to the output sink. Tools include Apache Spark, Apache Flink,
Storm, etc.
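A minimal sketch of the window idea in stream processing: computing a sliding-window average over a stream of readings. The window size and data are illustrative, and real engines add event-time handling and fault tolerance:

```python
from collections import deque

def windowed_averages(stream, size=3):
    """Emit the average of each full sliding window over the stream."""
    window = deque(maxlen=size)    # oldest value drops out automatically
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield sum(window) / size

readings = [10, 20, 30, 40, 50]
print(list(windowed_averages(readings)))   # [20.0, 30.0, 40.0]
```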

6. Analytics-Based Datastore

This is the data store that is used for analytical purposes and
therefore the already processed data is then queried and analyzed by
using analytics tools that can correspond to the BI solutions. The data can also bepresented with the help of a NoSQL data warehouse technology like HBase or any interactive use of a hive database which can provide the
metadata abstraction in the data store. Tools include Hive, Spark SQL,
Hbase, etc.

7. Reporting and Analysis

Insights have to be generated from the processed data, and that is
done by reporting and analysis tools that use their embedded technology
and solutions to generate useful graphs, analyses, and insights for the
business. Tools include Cognos, Hyperion, etc.

8. Orchestration

Big-data solutions consist of repetitive data-related operations,
encapsulated in workflows, that transform source data, move data across
sources and sinks, load it into stores, and push it into analytical units.
Examples include Sqoop, Oozie, Azure Data Factory, etc.
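At its core, orchestration means executing a dependency graph of tasks in a valid order. A toy sketch using Python's standard-library graphlib follows; the task names are invented, and real orchestrators like Oozie or Azure Data Factory add scheduling, retries, and monitoring on top:

```python
from graphlib import TopologicalSorter

# A toy workflow DAG: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# Resolve the dependencies into an execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)   # ['extract', 'transform', 'load', 'report']
```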

Big Data Technologies and How They Are Used in Data Analysis

Introduction

Big data technologies refer to a set of tools and techniques that
are used to process, analyze, and extract insights from large
and complex data sets. These technologies have become
increasingly important in recent years as the amount of data
being generated and collected has grown exponentially.

Big Data Technologies

Hadoop

One of the main big data technologies is Hadoop, an open-source software framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop is often used in combination with other big data technologies, such as Apache Spark and Apache Storm, to perform real-time data processing and analysis.

NoSQL

Another important big data technology is NoSQL databases, which are designed to handle the storage and retrieval of unstructured or semi-
structured data. NoSQL databases, such as MongoDB and Cassandra, are commonly used in big data applications because they are highly scalable
and can handle large amounts of data.
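The schema-less storage idea can be illustrated with a toy in-memory document store. This is not MongoDB's or Cassandra's actual API, just the underlying concept: documents are self-describing objects that need not share a schema:

```python
import json

# A toy in-memory document store: schema-less documents keyed by id.
store = {}

def insert(doc_id: str, document: dict) -> None:
    # Store a deep copy via a JSON round-trip so callers can't mutate it.
    store[doc_id] = json.loads(json.dumps(document))

def find(predicate):
    """Return all documents matching an arbitrary predicate function."""
    return [doc for doc in store.values() if predicate(doc)]

# Documents need not share the same fields.
insert("u1", {"name": "Ada", "tags": ["admin"]})
insert("u2", {"name": "Alan", "city": "London"})

print(find(lambda d: d.get("city") == "London"))
```

Real NoSQL databases add indexing, persistence, and horizontal partitioning, but the flexible per-document shape shown here is what distinguishes them from fixed-schema relational tables.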

Data Visualization with Big Data

Big data technologies also include various data visualization tools, such as
Tableau and Power BI, that allow for the creation of interactive and
informative visualizations of data. These tools are used to explore and
analyze data in order to gain insights and make informed decisions.

Big Data with Machine Learning

In the field of machine learning and artificial intelligence, big
data technologies such as Apache Mahout, TensorFlow, and
PyTorch are used to train models on large datasets and
make predictions based on the data.

Big data technologies are used in a variety of industries, including finance,
healthcare, retail, and manufacturing. They are used to analyze customer
data, track inventory and supply chain, predict market trends, and much
more.

In conclusion, big data technologies are a vital part of the data analysis
process, allowing for the efficient processing and analysis of large and
complex data sets. These technologies are used to extract insights and
make informed decisions in various industries, making them a crucial part
of today’s business and research landscape.

“Thank you for reading! If you enjoyed this article and want to stay updated on my latest insights and projects, feel free to connect with me on LinkedIn.”
