Showing posts with label Big Data. Show all posts

Wednesday, 30 March 2016

Book: Data Algorithms Recipes for Scaling Up with Hadoop and Spark

A very nice book that teaches how to implement machine learning and data mining techniques such as NBC (naive Bayes classifier), recommenders, clustering, etc. Implemented in Java, the book provides code for both Hadoop MapReduce and Apache Spark in a clean, simple-to-understand manner. I have re-coded most of the algorithms in the book, except for the chapters dealing with some of the bio-related material, which I am not particularly interested in at the moment.

Book: Apache Spark Graph Processing

This book guided me on how to use Apache Spark GraphX for graph processing in my project. While the book provides only basic implementations and an introduction to GraphX features such as visualization, aggregateMessages, and Pregel, I gained some useful insights after re-implementing most of the code in the book (knowledge of Scala is required).
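
To give a feel for the Pregel idea mentioned above, here is a minimal sketch in plain Python (not GraphX or Scala, and not code from the book). It computes single-source shortest paths by repeated message passing, assuming non-negative edge weights. The function name `pregel_sssp` and the edge-tuple layout are made up for illustration.

```python
import math


def pregel_sssp(edges, source):
    """Pregel-style shortest paths.

    edges:  list of (src, dst, weight) tuples, weights assumed non-negative
    source: the start vertex
    Returns a dict mapping each vertex to its distance from source.
    """
    vertices = {v for s, d, _ in edges for v in (s, d)} | {source}
    dist = {v: math.inf for v in vertices}
    dist[source] = 0.0
    active = {source}

    while active:  # one loop iteration == one superstep
        # Send phase: every active vertex offers dist + weight to its neighbours.
        inbox = {}
        for s, d, w in edges:
            if s in active:
                candidate = dist[s] + w
                # Merge phase: keep only the smallest message per destination.
                inbox[d] = min(inbox.get(d, math.inf), candidate)

        # Vertex program: accept an improvement and stay active only if changed.
        active = set()
        for v, msg in inbox.items():
            if msg < dist[v]:
                dist[v] = msg
                active.add(v)

    return dist
```

For example, `pregel_sssp([("a", "b", 1), ("b", "c", 2), ("a", "c", 5)], "a")` settles on a distance of 3 for `c` after the second superstep, because the path through `b` beats the direct edge.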

Link: https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing

Book: Hadoop MapReduce v2 Cookbook

I used the material in this book to build the MapReduce algorithms and HDFS infrastructure for my project. It is a very good introduction to Hadoop MapReduce and to setting up HDFS and YARN in a virtual environment.

Link: https://www.packtpub.com/big-data-and-business-intelligence/hadoop-mapreduce-v2-cookbook-second-edition


Wednesday, 3 February 2016

Online Course: Administering an ElasticSearch Cluster

Glad to have completed this course. It has wonderful material on things like rolling restarts for ES upgrades, rolling indices, snapshots via Curator, the head plugin, and ES health check tips. I learned a lot about how to run an ES cluster properly.

Link: https://app.pluralsight.com/library/courses/administering-elasticsearch-cluster/table-of-contents

Tuesday, 19 January 2016

Online Course: Patterns for Building Distributed Systems for The Enterprise

Completed this course. It was quite a mind opener, and I gained some new perspectives on distributed computing, particularly CQRS via MQ and ESB, as well as the append-only models (especially the event streaming and historical models). This course gave me some good ideas on how to proceed with refactoring a big data project I am working on.
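
As a rough illustration of how CQRS and an append-only model fit together, here is a small Python sketch. It is not taken from the course: the bank-account domain, the `Event` and `EventStore` classes, and the handler names are all assumptions chosen for brevity, and a real system would put a message queue between the command side and the read side.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """An immutable fact that has already happened."""
    account: str
    kind: str    # "deposited" or "withdrawn"
    amount: int


@dataclass
class EventStore:
    """Append-only log: events are added, never updated or deleted."""
    _log: list = field(default_factory=list)

    def append(self, event):
        self._log.append(event)

    def events(self):
        return tuple(self._log)  # read-only snapshot of the history


def project_balances(events):
    """Query side: rebuild a read model by replaying the whole history."""
    balances = {}
    for e in events:
        sign = 1 if e.kind == "deposited" else -1
        balances[e.account] = balances.get(e.account, 0) + sign * e.amount
    return balances


def handle_deposit(store, account, amount):
    """Command side: validate, then record the outcome as an event."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    store.append(Event(account, "deposited", amount))


def handle_withdraw(store, account, amount):
    """Command side: reject the command if the history does not allow it."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if project_balances(store.events()).get(account, 0) < amount:
        raise ValueError("insufficient funds")
    store.append(Event(account, "withdrawn", amount))
```

Because the log is never rewritten, any number of read models can be rebuilt from it later, which is what makes this style attractive when refactoring an existing pipeline.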

Link: https://app.pluralsight.com/library/courses/cqrs-theory-practice/table-of-contents

Friday, 8 January 2016

Online Course: SQL on Hadoop - Analyzing Big Data with Hive

Completed this course. While the pace of the course is a bit fast, it is packed with useful material on executing SQL on top of Hadoop using Hive. Features such as DISTRIBUTE BY, SORT BY, GROUPING SETS, multiple insert, UDAF, and UDTF are really cool. I will need to revisit the course materials when I start to use Hive in development.
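
To make the combination of DISTRIBUTE BY and SORT BY more concrete, here is a rough Python simulation of the semantics (it is not HiveQL and not how Hive is implemented). Rows with the same distribution key land on the same reducer, and each reducer sorts only its own rows, so there is no global ordering. The function name, the dict-based row format, and the use of CRC32 as the partitioning hash are assumptions for illustration.

```python
import zlib


def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Simulate DISTRIBUTE BY distribute_key SORT BY sort_key.

    rows: list of dicts. Returns one sorted list of rows per reducer.
    """
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Deterministic hash so equal keys always map to the same reducer.
        digest = zlib.crc32(str(row[distribute_key]).encode("utf-8"))
        reducers[digest % num_reducers].append(row)
    # Each reducer sorts locally; the concatenation is NOT globally sorted.
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]
```

A global ORDER BY would instead funnel everything through a single ordering, which is why the per-reducer variant scales better on large tables.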

Link: https://app.pluralsight.com/library/courses/sql-hadoop-analyzing-big-data-hive/table-of-contents

Saturday, 19 December 2015

Sunday, 13 September 2015

Friday, 12 December 2014

Storm Blueprints: Patterns for Distributed Real-time Computation

A very useful book for anyone interested in Storm for real-time processing. I particularly benefited from some of the practical use cases of Storm, especially those on Storm Trident. Compared to the other books on Storm that I have read so far, this one offers very good tutorials on several aspects of Trident, which answered some of my questions about Trident, such as:


  • How to implement a Trident State
  • How to implement a Trident State Factory
  • How to implement a Trident State Updater
  • How to effectively use combiners, reducers, and aggregators
  • How to implement non-transactional and repeat-transactional Trident topologies and an opaque Trident map state
  • How to implement a Trident spout, coordinator, emitter
  • How to implement recursive functions in Trident
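
The opaque state in the list above is the part I found hardest to picture, so here is a minimal Python sketch of the update rule as I understand it (it is not Storm's Java API and not code from the book). The state keeps the previous value, the current value, and the transaction id of the last applied batch. If the same transaction id shows up again, the batch is a replay whose contents may differ, so the update is redone from the previous value. The names `OpaqueValue` and `apply_batch` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpaqueValue:
    txid: Optional[int] = None  # id of the last batch applied to this key
    prev: int = 0               # value before that batch
    curr: int = 0               # value after that batch


def apply_batch(state, key, txid, delta):
    """Apply a batch's partial count for one key with opaque semantics.

    state: dict mapping key -> OpaqueValue (stands in for the database)
    Returns the new current value.
    """
    value = state.setdefault(key, OpaqueValue())
    if value.txid == txid:
        # Replayed batch: discard the earlier attempt and redo from prev.
        value.curr = value.prev + delta
    else:
        # New batch: the old current value becomes the new baseline.
        value.prev = value.curr
        value.curr = value.curr + delta
        value.txid = txid
    return value.curr
```

For instance, after applying batch 1 with a count of 2 and batch 2 with a count of 3, a replay of batch 2 carrying a count of 4 yields 6 rather than 9, because the first attempt at batch 2 is thrown away.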


While the later chapters involving Druid and Hadoop were a bit difficult for me to assimilate at this stage due to time constraints, I would definitely like to read those chapters again.

https://www.packtpub.com/big-data-and-business-intelligence/storm-blueprints-patterns-distributed-real-time-computation

Wednesday, 26 November 2014

Elastic Search Cookbook

The book is nice and easy for someone who is just starting to learn ElasticSearch. Interested readers can get the book from:

http://www.amazon.com/ElasticSearch-Cookbook-Alberto-Paro/dp/1782166629

There are some parts that I find a bit verbose, such as the explanations of installation and of the different ways to search, query, map, etc. (which will be useful for reference anyway), as I only needed to understand a subset of these commands and could explore the rest in the wild (e.g., http://java.dzone.com/articles/elasticsearch-java-api). While I did read through most of the material, the book seems to me a bit weak at presenting things in a short and concise way that gets someone immediately up to speed with the various features of ElasticSearch. However, coupled with a few online resources, it is actually quite easy to get started and toy with ElasticSearch.

Tuesday, 25 November 2014

Learning Storm

Just completed the book "Learning Storm". A very nice read; interested readers can go to the following link to buy it:

https://www.packtpub.com/big-data-and-business-intelligence/learning-storm

The book covers a wide range of topics and shows, in an easy-to-understand way, how other technologies work with Storm, including ZooKeeper, Kafka, Hadoop, YARN, Ganglia, JMX, HBase, Redis, MySQL, etc. I especially like the way it teaches Trident, which makes the concept much easier to grasp, and the last chapter on machine learning is extremely useful.

Typical readers can go through the book chapter by chapter for a slow and full exposure. For someone like me, who always likes to delve directly into practice, the best approach is actually to read the book three times, each time skipping some chapters.

On the first pass, the reader should go through chapters 1 to 4, skipping the Thrift library introduction in chapter 3, and then jump directly to chapter 8, which gives an example of log processing in Storm. The reader will build a level of confidence after practicing the simple cases in these chapters.

On the second pass, the reader should go through chapter 5 and chapter 9 to get a good idea of what Trident is and how it works, as well as how to do machine learning using Trident.

On the third pass, the reader can optionally go through the Thrift library in chapter 3, then go to chapter 7, which shows rich tools for interacting with Storm such as JMX and Ganglia. Finally, if there is a need for integration with Hadoop, go to chapter 6 and some other parts of chapter 7.

Wednesday, 12 November 2014

Getting Started with Storm

Just completed reading the "Getting Started with Storm" book. I would say this is one of the easiest-to-follow books I have read, yet it provides a good basic understanding of working with Storm, a distributed system for processing streaming data. A good read, highly recommended for anyone interested in learning Storm.

http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010

Sunday, 7 September 2014

Online Course: Intro to Hadoop and MapReduce

Udacity Link: Intro to Hadoop and MapReduce

A very easy-to-learn course (it should be possible to finish the tutorial and course in around 1.5 hours). Basically, the user learns how to write command lines to interact with Hadoop DFS and how to write simple mapper and reducer Python scripts to process files in Hadoop DFS.
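
For a flavour of what such mapper and reducer scripts look like, here is a minimal word-count sketch. It is not the course's exercise code: the function names are my own, and the two stages are written as plain Python generators with a local driver so the whole thing runs in one process. On a real cluster each stage would read lines from standard input and the framework would do the sorting in between.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum the counts per word.

    Expects pairs sorted by key, which is how Hadoop hands them to a reducer.
    """
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))
```

Calling `run_local(["a b a", "B c"])` counts `a` and `b` twice each and `c` once, since the mapper lowercases every word before emitting it.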