Apache Mahout is a free and open source project. It is a library of scalable machine-learning algorithms implemented on top of Apache Hadoop using the MapReduce paradigm. It implements the most popular machine-learning techniques, such as recommendation, classification, clustering, and collaborative filtering. Mahout mainly contains Java libraries for common math operations, focused on statistics and linear algebra, as well as primitive Java collections. It also provides data science tools to automatically find interesting patterns in big data sets. Many companies use Mahout internally, including Facebook, LinkedIn, Foursquare, Twitter, Yahoo, and Adobe.
The Mahout project was started by several people involved in the Apache Lucene (open source search) community with an active interest in machine learning and a desire for robust, well-documented, scalable implementations of common machine-learning algorithms for clustering and categorization. The community was initially driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore” (see Resources) but has since evolved to cover much broader machine-learning approaches. Mahout also aims to:
- Build and support a community of users and contributors such that the code outlives any particular contributor’s involvement or any particular company or university’s funding.
- Focus on real-world, practical use cases as opposed to bleeding-edge research or unproven techniques.
- Provide quality documentation and examples.
Mahout supports four main data science use cases:
- Collaborative filtering: mines user behaviour and makes product recommendations (e.g., Amazon's recommendations)
- Clustering: takes items in a particular class (such as web pages or newspaper articles) and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other
- Classification: learns from existing categorizations and then assigns unclassified items to the best category
- Frequent item-set mining: analyzes items in a group (e.g., items in a shopping cart or terms in a query session) and then identifies which items typically appear together (a small sketch of this idea follows the list)
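To make the last idea concrete, here is a minimal plain-Java sketch of pair co-occurrence counting, the core intuition behind frequent item-set mining. This is a toy illustration with made-up carts and a made-up support threshold, not Mahout's actual frequent-pattern implementation:

```java
import java.util.*;

public class PairCounter {
    public static void main(String[] args) {
        // Hypothetical shopping-cart transactions.
        List<List<String>> carts = List.of(
                List.of("bread", "milk", "butter"),
                List.of("bread", "butter"),
                List.of("milk", "coffee"),
                List.of("bread", "milk", "butter", "coffee"));

        // Count how often each unordered pair of items appears in the same cart.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> cart : carts) {
            List<String> items = new ArrayList<>(new TreeSet<>(cart)); // dedupe and sort
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String pair = items.get(i) + "+" + items.get(j);
                    pairCounts.merge(pair, 1, Integer::sum);
                }
            }
        }

        // Report pairs that meet a minimum support threshold (chosen arbitrarily here).
        int minSupport = 2;
        pairCounts.forEach((pair, count) -> {
            if (count >= minSupport) {
                System.out.println(pair + " appears together in " + count + " carts");
            }
        });
    }
}
```

Real frequent item-set miners generalize this to sets larger than pairs and use the support threshold to prune the search space rather than filtering at the end.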
Clustering is the procedure of organizing the elements of a data collection into groups based on the similarity between items; in other words, it groups data into characteristically similar sets. Mahout supports many different clustering mechanisms. The most important ones are:
- Canopy: It is used to create initial seeds for other clustering algorithms
- K-Means or Fuzzy K-Means: It creates k clusters based on the distance of items from the centres of the previous iteration (a one-dimensional sketch of this idea follows the list).
- Dirichlet: It creates clusters by combining one or more cluster models
- Mean shift: This algorithm doesn’t require any prior information about the number of clusters.
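As a concrete illustration of the K-Means item above, here is a compact, self-contained Java sketch of the algorithm on one-dimensional toy data. The points and seed centres are invented for the example; Mahout's own implementation runs this same loop as distributed MapReduce jobs:

```java
import java.util.Arrays;

public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0}; // toy 1-D data
        double[] centers = {1.0, 9.0};                     // initial seeds (k = 2)

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to the nearest centre
            // from the previous iteration.
            int[] assignment = new int[points.length];
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centers.length; c++) {
                    double d = Math.abs(points[p] - centers[c]);
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // Update step: move each centre to the mean of its assigned points.
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (int p = 0; p < points.length; p++) {
                sum[assignment[p]] += points[p];
                count[assignment[p]]++;
            }
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        System.out.println("Final centres: " + Arrays.toString(centers)); // [1.5, 8.5]
    }
}
```

Each pass assigns every point to its nearest centre from the previous iteration and then moves each centre to the mean of its assigned points, which is exactly the behaviour described in the list above.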
Unless you are highly proficient in Java, the coding itself is a big overhead; there is no way around it. If you don't know Java already, you are going to need to learn it, and it is not a language that flows! For R users, who are used to seeing their ideas realized immediately, the endless declaration and initialization of objects is going to feel like a drag. For that reason I would recommend sticking with R for data exploration and prototyping, and switching to Mahout as you get closer to production.
A recommendation engine is a subclass of information filtering systems that predicts the rating or preference a user would give to an item. Using Mahout's Taste library we can build a fast and flexible collaborative filtering engine. The primary components of the Taste library are:
- Data model: represents users, items, and their preferences
- User similarity: computes the similarity between two users
- Item similarity: computes the similarity between two items
- Recommender: provides recommendations
- User neighbourhood: computes a neighbourhood of similar users, which can be used by recommenders
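Putting these components together, a minimal user-based recommender built on the Taste API might look like the following sketch. The file name `ratings.csv`, the neighbourhood size, and the user ID are hypothetical; the CSV is assumed to hold `userID,itemID,preference` rows:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // Data model: userID,itemID,preference rows in a CSV file (path is hypothetical).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User similarity: Pearson correlation between users' preference vectors.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // User neighbourhood: the 10 users most similar to the target user.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // Recommender: wires the pieces above together.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " (estimated preference " + item.getValue() + ")");
        }
    }
}
```

Swapping `PearsonCorrelationSimilarity` for another `UserSimilarity` implementation, or replacing the neighbourhood with an item-similarity-based recommender, changes the behaviour without touching the rest of the wiring.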
Mahout runs on Hadoop MapReduce, while MLlib runs on Spark. To be more specific, the difference comes from per-job overhead. If your ML algorithm maps to a single MR job, the main difference will be start-up overhead only, which is a few dozen seconds for Hadoop MR versus, say, 1 second for Spark; for one-off model training that is not very important. Things are different if your algorithm maps to many jobs: then we pay the same overhead on every iteration, and that can be a game changer. For example, suppose we need 100 iterations, each needing 5 seconds of cluster CPU, and assume roughly 30 seconds of start-up overhead per Hadoop MR job:
On Hadoop MR (Mahout): 100*5 + 100*30 = 3,500 seconds.
On Spark: 100*5 + 100*1 = 600 seconds.
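The arithmetic above is just total = iterations * (compute per iteration + per-job overhead). The small sketch below reproduces it; the 30-second and 1-second overheads are the assumptions stated earlier, not measured values:

```java
public class OverheadMath {
    public static void main(String[] args) {
        int iterations = 100;
        double computePerIteration = 5.0; // seconds of cluster CPU per iteration
        double hadoopJobOverhead = 30.0;  // assumed MR job start-up cost (seconds)
        double sparkJobOverhead = 1.0;    // assumed Spark start-up cost (seconds)

        // total = iterations * (compute + per-job overhead)
        double hadoopTotal = iterations * (computePerIteration + hadoopJobOverhead);
        double sparkTotal = iterations * (computePerIteration + sparkJobOverhead);

        System.out.println("Hadoop MR (Mahout): " + hadoopTotal + " s"); // 3500.0 s
        System.out.println("Spark:              " + sparkTotal + " s");  // 600.0 s
    }
}
```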
At the same time, Hadoop MR is a much more mature framework than Spark, so if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.