What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

What is Spark?

Spark Core, the heart of the project, provides distributed task dispatching, scheduling, and I/O functionality. It gives programmers a potentially faster and more flexible alternative to MapReduce, the software framework to which early versions of Hadoop were tied. Spark's developers say it can run jobs 100 times faster than MapReduce when processing in memory, and 10 times faster on disk.

How does Apache Spark work?

Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases and relational data stores, such as Apache Hive. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also perform conventional disk-based processing when data sets are too large to fit into the available system memory.

The Spark Core engine uses the resilient distributed dataset, or RDD, as its basic data type. The RDD is designed to hide much of the computational complexity from users: it aggregates data and partitions it across a server cluster, where it can then be computed and either moved to a different data store or run through an analytic model. The user doesn't have to define where specific files are sent or what computational resources are used to store or retrieve files.
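As a small sketch of the RDD model, the following can be pasted into the spark-shell started in Step 5 below (the collection and partition count are illustrative assumptions, not part of the original setup):

```scala
// Distribute a local collection across the cluster as an RDD
// with 4 partitions; Spark decides where each partition lives.
val nums = sc.parallelize(1 to 1000, numSlices = 4)

// Transformations are lazy; nothing runs until an action is called.
val squares = nums.map(n => n.toLong * n)

// Actions trigger the distributed computation and return a result.
println(squares.reduce(_ + _))   // sum of squares of 1..1000
println(nums.getNumPartitions)   // 4
```

Note that the user never specifies which machine holds which partition; that placement is exactly the complexity the RDD abstraction hides.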

What is Scala?

Scala combines object-oriented and functional programming in one concise, high-level language. Scala’s static types help avoid bugs in complex applications, and its JVM and JavaScript runtimes let you build high-performance systems with easy access to huge ecosystems of libraries.
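As an illustration of that blend of object-oriented and functional styles, here is a minimal standalone Scala sketch (the class and method names are invented for the example):

```scala
// A case class is an immutable, object-oriented data type with
// structural equality and a constructor generated by the compiler.
case class Employee(name: String, salary: Double)

object Payroll {
  // Functional style: transform and aggregate a collection with
  // higher-order functions instead of explicit loops.
  def totalAfterRaise(staff: List[Employee], pct: Double): Double =
    staff.map(e => e.salary * (1 + pct / 100)).sum

  def main(args: Array[String]): Unit = {
    val staff = List(Employee("Ada", 100000), Employee("Alan", 90000))
    println(totalAfterRaise(staff, 10))
  }
}
```

The static types catch mistakes (for example, passing a `String` where a `Double` salary is expected) at compile time rather than at run time.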

Setup & Installation Steps:

Step 1 : Prerequisites

Download and install the following (the versions match the folder names used in Step 2):

  • Java JDK 8 (jdk1.8.0_191)
  • Hadoop 2.7.2
  • Spark 2.3.0 (pre-built for Hadoop 2.7)
  • Scala

Step 2 : Folder Configurations

Copy each installation folder from its installed path (e.g., C:\Program Files) to c:\work, so the layout looks like:

  • C:\work\hadoop-2.7.2
  • C:\work\scala
  • C:\work\java\jdk1.8.0_191
  • C:\work\spark-2.3.0-bin-hadoop2.7

Create Empty Folders:

  • C:\tmp\hive
  • C:\work\hadoop272data\datanode
  • C:\work\hadoop272data\namenode

Step 3 : Setting Environment Variables

  • Set environment variables (replace or remove old/earlier configurations).
    User variables:
    JAVA_HOME   C:\work\java\jdk1.8.0_191 (based on the version of JDK downloaded)
    HADOOP_HOME C:\work\hadoop-2.7.2
    SPARK_HOME  C:\work\spark-2.3.0-bin-hadoop2.7
    SCALA_HOME  C:\work\scala
  • Append to the Path system variable (the commands in Step 5 require the Hadoop and Spark bin/sbin folders to be on the Path):

    • C:\work\scala\bin
    • %JAVA_HOME%\bin
    • %HADOOP_HOME%\bin
    • %HADOOP_HOME%\sbin
    • %SPARK_HOME%\bin
  • Download Windows 7/8/10 pre-configured files from: http://www.praveenkumarg.com/wp-content/uploads/2018/12/hadoop2.7.2.zip
    – Delete the folders C:\work\hadoop-2.7.2\bin and C:\work\hadoop-2.7.2\etc
    – Replace them with the bin and etc folders from the downloaded .zip file.

Edit the file C:\work\hadoop-2.7.2\etc\hadoop\hadoop-env.cmd (the Windows equivalent of hadoop-env.sh) and set the JAVA_HOME path:
set JAVA_HOME=C:\work\java\jdk1.8.0_191 (based on the version of JDK downloaded)

Step 4 : Validate Configuration for the Hadoop Nodes

Set the following configuration in C:\work\hadoop-2.7.2\etc\hadoop\hdfs-site.xml:
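The configuration itself did not survive in the text above; a minimal single-node hdfs-site.xml consistent with the folders created in Step 2 would look like this (the replication factor of 1 is an assumption appropriate for a one-machine setup):

```xml
<configuration>
  <!-- Single machine, so keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Point the NameNode and DataNode at the empty folders created in Step 2 -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/work/hadoop272data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/work/hadoop272data/datanode</value>
  </property>
</configuration>
```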


Step 5 : Commands to Run Hadoop & Spark

Execute the following commands (cmd.exe) step by step:

  1. > hdfs.cmd namenode -format
  2. > start-dfs.cmd && start-yarn.cmd (Note: if a firewall popup appears, click Allow access)
  3. > jps (validate that Hadoop is running)
  4. > cd c:\work\hadoop-2.7.2\bin
  5. > winutils.exe chmod 777 c:\tmp\hive
  6. > spark-shell.cmd (to start the Spark shell)
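Once the Spark shell is up, a quick smoke test can confirm the installation (the shell pre-creates the `spark` session and `sc` context used here; the job itself is an illustrative assumption):

```scala
// Confirm the shell picked up the expected Spark build.
println(spark.version)          // should report 2.3.0 for this download

// Run a trivial distributed job end to end.
val count = spark.range(1, 101).count()
println(count)                  // 100
```

If both lines print without errors, Spark is scheduling work correctly on top of the local Hadoop setup.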


Happy Programming!!