NOTE: These notes are dated and will probably need quite a bit of tweaking
to work with more recent versions!

INSTALLING HADOOP

INSTALL UBUNTU

Use a spare partition or VirtualBox:
  http://www.virtualbox.org
  http://www.ubuntu.com
Use version 10.04 LTS.
You can install any packages you need/like using
System > Administration > Synaptic Package Manager.

INSTALL SUN/ORACLE JAVA FOR UBUNTU

Add the partner repository:
  sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
Update the source list:
  sudo apt-get update
Now install the Sun Java packages:
  sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
Alternatively:
  sudo apt-get install python-software-properties
  sudo add-apt-repository ppa:ferramroberto/java
  sudo apt-get update
  sudo apt-get install sun-java6-jdk sun-java6-plugin

INSTALL OPENSSH-SERVER

Install the server:
  sudo apt-get install openssh-server
Now check that you can ssh to localhost without a passphrase:
  $ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following
commands to create and install a key:
  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  $ ssh-add

INSTALL AND TEST HADOOP

* Download the latest version from:
  http://hadoop.apache.org/mapreduce/
* Go through the installation/configuration instructions at:
  http://hadoop.apache.org/common/docs/current/single_node_setup.html
  Use Pseudo-Distributed Operation on a single machine. If you have access
  to several machines, you can do the cluster setup instead.
* Start the Hadoop services:
  bin/start-all.sh
* Run the wordcount example:
  1. Put some text files into a local input directory.
  2. Copy the input directory to HDFS:
       bin/hadoop fs -put input input_HDFS
  3. Execute the mapreduce example:
       bin/hadoop jar hadoop-examples-*.jar wordcount input_HDFS output_HDFS
  4. Get the result back (to the local output directory):
       bin/hadoop fs -get output_HDFS output

You can also monitor the progress/HDFS using the web interfaces:
  NameNode (HDFS) - http://localhost:50070/
  JobTracker      - http://localhost:50030/

* Stop Hadoop:
  bin/stop-all.sh

Some hints:
  Delete data on HDFS (note that -rmr is recursive!):
    bin/hadoop fs -rmr input_HDFS
  Get help for the HDFS commands:
    bin/hadoop fs
  Finding the example code:
    src/examples/org/apache/hadoop/examples/
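To see what the wordcount job actually computes, here is a minimal sketch of
its map and reduce phases in plain Python, outside Hadoop. The function names
and sample input are made up for illustration; the real job is the Java code
in src/examples/org/apache/hadoop/examples/ referenced above.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # Hadoop sorts the map output by key before reducing (the "shuffle");
    # sorted() plays that role here. Reduce then sums the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    for word, count in reducer(mapper(text)):
        print(word, count)
```

In the real job, the mapper and reducer run in parallel across the cluster
and the shuffle happens over the network; the per-word totals it writes to
output_HDFS are the same as this sketch produces.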