What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
What is Spark?
Spark Core, the heart of the project, provides distributed task dispatching, scheduling and I/O functionality. It gives programmers a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied. Spark’s developers say it can run jobs 100 times faster than MapReduce when the data is processed in memory, and 10 times faster on disk.
How does Apache Spark work?
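To make the comparison concrete, here is a minimal sketch of the MapReduce programming model itself, written in plain Python rather than against the Hadoop or Spark APIs. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are made up for illustration; a real cluster would run each phase across many machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases, as a MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Illustrative input only.
sample = ["spark and hadoop", "spark on hadoop on windows"]
print(word_count(sample))
```

In classic MapReduce the intermediate results between these phases are written to disk, which is the step Spark can often avoid by keeping data in memory.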
Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores, such as Apache Hive. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.
The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users. It aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn’t have to define where specific files are sent or what computational resources are used to store or retrieve files.
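The partitioning idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Spark's API or its actual implementation: the helpers partition, map_partitions and reduce_partitions are hypothetical names, and everything runs in one process instead of on a cluster.

```python
from functools import reduce


def partition(data, num_partitions):
    """Split data into num_partitions chunks, assigning items round-robin."""
    parts = [[] for _ in range(num_partitions)]
    for i, item in enumerate(data):
        parts[i % num_partitions].append(item)
    return parts


def map_partitions(parts, fn):
    """Apply fn to every element, one partition at a time."""
    return [[fn(x) for x in part] for part in parts]


def reduce_partitions(parts, op, zero):
    """Reduce inside each partition first, then combine the partial results."""
    partials = [reduce(op, part, zero) for part in parts]
    return reduce(op, partials, zero)


numbers = list(range(1, 11))
parts = partition(numbers, 3)                       # data spread over 3 "nodes"
squares = map_partitions(parts, lambda x: x * x)    # per-partition computation
total = reduce_partitions(squares, lambda a, b: a + b, 0)
print(total)  # 385
```

The caller never says which partition holds which element, which mirrors the point above: the placement of data is handled for the user.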
What is Scala?
Setup & Installation Steps:
Step 1 : Pre-Requisites
Download and install the following:
- Download & Install JDK 8 (the paths below assume jdk1.8.0_191): https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
- Download Hadoop 2.7.2 files: https://archive.apache.org/dist/hadoop/core/hadoop-2.7.2/hadoop-2.7.2.tar.gz
- Download & Install Scala 2.11.8.msi: https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
- Download Spark 2.3.0 Folders: https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
Step 2 : Folder Configurations
Copy all the installation folders from their install location (C:\Program Files) to C:\work
Create Empty Folders:
Step 3 : Setting Environment Variables
- Set Environment variables (Replace or Remove Old/Earlier Configurations). User Variables:
JAVA_HOME C:\work\java\jdk1.8.0_191 (based on the version of JDK downloaded)
- System Variables for Path:
- Download Windows 7/8/10 pre configured files from : http://www.praveenkumarg.com/wp-content/uploads/2018/12/hadoop2.7.2.zip
– Delete Folders C:\work\hadoop-2.7.2\bin and C:\work\hadoop-2.7.2\etc
– Replace bin and etc folders from downloaded .zip file.
Edit C:\work\hadoop-2.7.2\etc\hadoop\hadoop-env.cmd (the Windows counterpart of hadoop-env.sh, since the line below uses cmd syntax) and set the JAVA_HOME path:
set JAVA_HOME=C:\work\java\jdk1.8.0_191 (based on the version of JDK downloaded)
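A wrong JAVA_HOME is a common reason for the later steps to fail, so it can help to check it before going on. The following is a small sketch, assuming Python is available on the machine; check_home is a hypothetical helper, and the bin\java.exe marker is an assumption about the usual JDK layout on Windows.

```python
import os


def check_home(env, name, marker):
    """Return a problem description, or None if env[name] looks valid.

    env is a mapping such as os.environ. marker is a path, relative to the
    directory the variable names, that is expected to exist there.
    """
    value = env.get(name)
    if not value:
        return f"{name} is not set"
    if not os.path.isdir(value):
        return f"{name} points to a missing directory: {value}"
    if not os.path.exists(os.path.join(value, marker)):
        return f"{name} is set, but {marker} was not found under {value}"
    return None


# Assumption: a JDK keeps its launcher at bin\java.exe under JAVA_HOME.
problem = check_home(os.environ, "JAVA_HOME", os.path.join("bin", "java.exe"))
print(problem or "JAVA_HOME looks fine")
```

The same helper can be pointed at any other variable you set in this step by changing the name and the marker path.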
Step 4 : Validate Configurations for Hadoop nodes
Set the following configuration in C:\work\hadoop-2.7.2\etc\hadoop\hdfs-site.xml
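Once hdfs-site.xml has been edited, one way to validate it is to list the properties it actually contains. The sketch below relies only on the standard Hadoop configuration layout (property elements with name and value children inside configuration). The function name read_hadoop_properties is hypothetical, and the SAMPLE document with its dfs.replication value is purely illustrative, not the configuration this guide prescribes.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


# Illustrative sample only; your file will hold its own properties.
SAMPLE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

for name, value in sorted(read_hadoop_properties(SAMPLE).items()):
    print(f"{name} = {value}")
```

To inspect the real file, read C:\work\hadoop-2.7.2\etc\hadoop\hdfs-site.xml in binary mode and pass its contents to read_hadoop_properties in place of SAMPLE.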
Step 5 : Commands to Run Hadoop & Spark
Execute the following commands in cmd.exe, step by step:
- > hdfs.cmd namenode -format
- > start-dfs.cmd && start-yarn.cmd (Note: if a popup appears, press Allow to grant access)
- > jps (validates that Hadoop is running)
- > cd c:\work\hadoop-2.7.2\bin
- > winutils.exe chmod 777 c:\tmp\hive
- > spark-shell.cmd (starts the Spark shell)
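The jps check in the list above can also be scripted. The sketch below parses jps output, where each line has the form "<pid> <name>", and reports which daemons are missing. The EXPECTED set is an assumption about what a single-node setup started with start-dfs.cmd and start-yarn.cmd should show, and the sample text with its process IDs is invented for the example.

```python
def running_daemons(jps_output):
    """Return the set of process names listed in jps output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Skip blank lines and lines that carry a pid but no name.
        if len(parts) >= 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


# Assumed daemon set for a single-node setup; adjust if yours differs.
EXPECTED = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def missing_daemons(jps_output, expected=EXPECTED):
    """Return the expected daemons that do not appear in the jps output."""
    return sorted(set(expected) - running_daemons(jps_output))


# Invented sample output in which NodeManager did not start.
sample = """4821 NameNode
5960 DataNode
7012 ResourceManager
7344 Jps
"""
print(missing_daemons(sample))  # ['NodeManager']
```

An empty list means every expected daemon was found; anything else points at the service to investigate before starting spark-shell.cmd.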