NOTE: These notes are dated and will probably need quite a bit of tweaking
to work with more recent versions!

INSTALLING HADOOP

INSTALL UBUNTU

Use a spare partition or VirtualBox:
  http://www.virtualbox.org
  http://www.ubuntu.com
Use version 10.04 LTS.
You can install any packages you need/like using
System > Administration > Synaptic Package Manager.

INSTALL SUN/ORACLE JAVA FOR UBUNTU

Add the partner repository:
  sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
Update the source list:
  sudo apt-get update
Now install the Sun Java packages:
  sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
Alternatively:
  sudo apt-get install python-software-properties
  sudo add-apt-repository ppa:ferramroberto/java
  sudo apt-get update
  sudo apt-get install sun-java6-jdk sun-java6-plugin

INSTALL OPENSSH-SERVER

Install the server:
  sudo apt-get install openssh-server
Now check that you can ssh to localhost without a passphrase:
  $ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following
commands to create and install a key:
  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  $ ssh-add

INSTALL AND TEST HADOOP

* Download the latest version from:
  http://hadoop.apache.org/mapreduce/
* Go through the installation/configuration instructions at:
  http://hadoop.apache.org/common/docs/current/single_node_setup.html
  Use Pseudo-Distributed Operation on a single machine. If you have access
  to several machines, you can do the cluster setup instead.
* Start the Hadoop services:
  bin/start-all.sh
* Run the wordcount example:
  1. Put some text files into a local input directory.
  2. Copy the input directory to HDFS:
       bin/hadoop fs -put input input_HDFS
  3. Execute the mapreduce example:
       bin/hadoop jar hadoop-examples-*.jar wordcount input_HDFS output_HDFS
  4. Get the result back (to the local output directory):
       bin/hadoop fs -get output_HDFS output

You can also monitor the progress/HDFS using the web interfaces:
  NameNode (HDFS) - http://localhost:50070/
  JobTracker      - http://localhost:50030/

* Stop Hadoop:
  bin/stop-all.sh

Some hints:
  Delete data on HDFS (note that -rmr is recursive!):
    bin/hadoop fs -rmr input_HDFS
  Get help for the HDFS commands:
    bin/hadoop fs
  Finding the example code:
    src/examples/org/apache/hadoop/examples/
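To see what the wordcount job actually computes, here is a minimal sketch of
its map and reduce phases in plain Python, outside Hadoop. The function names
and sample input are made up for illustration; the real job is the Java code
in src/examples/org/apache/hadoop/examples/ referenced above.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop sorts the map output by key before reducing (the "shuffle");
    # sorted() plays that role here. Reduce then sums the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    for word, count in reducer(mapper(text)):
        print(word, count)
```

In the real job, the mapper and reducer run in parallel across the cluster
and the shuffle happens over the network; the per-word totals it writes to
output_HDFS are the same as this sketch produces.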