Distributed processing frameworks like Hadoop have opened a new path for processing big data. Below are the tools and technologies covered in this highly practical subject.

Hadoop

  • Learned the distributed computing (MapReduce) paradigm
  • Frameworks and languages used: MrJob, Spark, Hive, HiveQL
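The MapReduce paradigm behind Hadoop and MrJob can be sketched in plain Python without a cluster: a mapper emits key/value pairs, a shuffle groups them by key, and a reducer aggregates each group. This is a toy single-process sketch of the idea, not the actual MrJob API:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: group all emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    docs = ["big data big frameworks", "big data"]
    print(reduce_phase(map_phase(docs)))  # word counts across both lines
```

On a real cluster the shuffle and the parallel execution are what Hadoop provides; MrJob lets you write just the mapper and reducer in Python.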

Hive

  • Learned advanced concepts, such as how the ORC file format can be a more efficient storage format in Hive
  • Hive's integration with external tables, S3, etc.
  • Partitioning makes query processing much faster
  • Some concepts are specific to Hive
    • for example, the schema is created explicitly before loading the data into the database
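The Hive points above (explicit schema, ORC format, partitioning, external tables on S3) could all come together in a single piece of DDL. The table, column, and bucket names below are invented for illustration:

```sql
-- Hypothetical external table: schema is declared up front,
-- while the data files stay in S3
CREATE EXTERNAL TABLE employees (
  emp_id  INT,
  name    STRING,
  salary  DOUBLE
)
PARTITIONED BY (dept STRING)          -- partitioning enables partition pruning
STORED AS ORC                         -- columnar ORC is compact and fast to scan
LOCATION 's3://my-bucket/employees/'; -- bucket path is a placeholder
```

Because the table is EXTERNAL, dropping it removes only the metadata, not the files in S3.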

Various Databases

In this course I learned about and implemented the databases listed below. Finally, working in a group, we created a project using them (see Small Projects below).

  • Cassandra
  • MongoDB
  • DynamoDB

We also learned about streaming frameworks like Kafka.

Small projects completed:

  • Analyzed employee data and efficiently processed a large dataset by partitioning it
  • Used AWS EMR to work with various Hadoop technologies, including Spark, MrJob, etc.
  • Finally, experimented with the different NoSQL databases offered by the AWS cloud on the employee data, and created a project report (a kind of case study) on this experimentation, which was submitted to the college
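The partitioning project above presumably benefited from partition pruning: when a query filters on the partition column, Hive reads only the matching partition's files instead of scanning the whole table. A sketch, with table and column names assumed:

```sql
-- Assuming an employees table partitioned by dept, this query reads
-- only the files under the dept='sales' partition directory
SELECT name, salary
FROM employees
WHERE dept = 'sales';
```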