CS 7/5337 Information Retrieval and Web Search
- Instructor: Dr. Michael Hahsler
- Syllabus: Course Syllabus
- Assignments: Canvas
- All course material is provided under
News
- I am currently not offering this course. Please check my.smu.edu for current course offerings.
Course Description
Introduces the field of information retrieval with an emphasis on its application in Web search. Covers the basic concepts of tokenizing, stemming, and inverted indices, as well as text similarity metrics and the vector-space model. Popular Web search engines are studied, and the concepts are applied in several Java-based projects using frameworks such as MapReduce. Web search frameworks and indexing tools such as Apache Nutch and Lucene are also reviewed.
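To make the terms tokenizing and inverted index concrete, here is a minimal Java sketch. It is not taken from the course material: the class name `TinyInvertedIndex`, its methods, and the sample documents are made up for illustration. The tokenizer only lowercases and splits on non-alphanumeric characters, and stemming is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of any course framework.
public class TinyInvertedIndex {
    // Maps each term to its postings list: the sorted set of IDs of documents containing it.
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Simplified tokenizer (an assumption): lowercase, then split on runs of
    // characters that are neither letters nor digits. No stemming is applied.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Adds one document to the index under the given ID.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the postings list for a term (empty if the term is unknown).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Boolean AND query: documents that contain both terms.
    SortedSet<Integer> and(String a, String b) {
        SortedSet<Integer> result = new TreeSet<>(lookup(a));
        result.retainAll(lookup(b));
        return result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Web search engines crawl the Web.");
        index.add(2, "An inverted index maps terms to documents.");
        index.add(3, "Search engines use an inverted index.");
        System.out.println(index.and("search", "engines"));  // prints [1, 3]
        System.out.println(index.and("inverted", "search")); // prints [3]
    }
}
```

Sorted sets keep the example short. A production index such as Lucene stores postings far more compactly and also records term frequencies and positions.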
Text
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Online Edition)
Slides
- Introduction and Boolean Retrieval (IIR 1)
- The term vocabulary and postings lists (IIR 2)
- Web Search (IIR 19)
- Web crawling and indexes (IIR 20)
- Dictionaries and tolerant retrieval (IIR 3)
- Index construction (IIR 4)
- MapReduce (my installation notes for Hadoop)
- The Vector Space Model (IIR 6)
- Evaluation in information retrieval (IIR 8)
- Document clustering (IIR 16)
Assignments
- Homework 1 due on 2/6 at noon.
- Homework 2 due on 3/5 at midnight.
- Homework 3 due on 4/9 at midnight.
Project
- Project 1: Web Crawling and Data Extraction due on 3/2 at midnight.
- Project 2: Indexing (using MapReduce) due on 4/1 at midnight.
- HOWTO: Install Hadoop/MapReduce, Hadoop Java API: Mapper and Reducer, Simple Example
- Meeting schedule
- Project 3: Complete Search Engine due on 4/22 at midnight.
- Servlets: Java Servlets at Wikipedia, HOWTO: Install Tomcat, Simple Example
- Meeting schedule - Updated!
Links
- Software
- Install Linux (Ubuntu release 10.04 LTS) on a spare partition of your PC/laptop or in VirtualBox.
- Install Sun Java
- Help on how to use a Unix shell.
- Installation of other things (NetBeans, Eclipse, Apache, Tomcat, etc.): System/Administration/Synaptic Package Manager
- Web Crawler
- Indexing
- User Interface
- Advanced topics
- Interesting links
- Search engine statistics and coverage (The Search Guru)
- Google search terms
Tools for distance students
- CamStudio: Free screen recording software for Windows.