Course Duration: 120 Hours
Course Commencement: Monday & Thursday of every week
Meeting Times: 09:00 PM to 11:00 PM IST
Meeting Location: YouTube Live Streaming
Course Website: http://www.statsindia.guru/courses/natural-language-processing/
The course syllabus, reading material, and course-related resources will be made available on the course website. Additionally, Google Classroom (https://classroom.google.com/) will be used for posting lecture material, assignments, and announcements, and for handling submissions.
Tentative office hours: Saturday/Sunday 10:00 AM to 02:00 PM, or by appointment.
Natural language processing (NLP) is a field of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages and, in particular, with programming computers to process large amounts of natural language data effectively.
NLP requires an integrated skill set spanning mathematics, statistics, machine learning, databases and other branches of computer science along with a good understanding of the craft of problem formulation to engineer effective solutions. This course will introduce students to this rapidly growing field and equip them with some of its basic principles and tools.
Students will learn concepts, techniques and tools they need to deal with various facets of Natural Language Processing and Natural Language Understanding, including data collection and integration, exploratory data analysis, predictive modeling, descriptive modeling, data product creation, evaluation, and effective communication.
The focus in the treatment of these topics will be on breadth, rather than depth, and emphasis will be placed on integration and synthesis of concepts and their application to solving problems. To make the learning contextual, real datasets from a variety of disciplines will be used.
At the conclusion of the course, students should be able to:
- Describe what Data Science is and the skill sets needed to be a data scientist.
- Explain in basic terms what Statistical Inference means. Identify probability distributions commonly used as foundations for statistical modeling. Fit a model to data.
- Use R to carry out basic statistical modeling and analysis.
- Explain the significance of exploratory data analysis (EDA) in data science. Apply basic tools (plots, graphs, summary statistics) to carry out EDA.
- Describe the Data Science Process and how its components interact.
- Use APIs and other tools to scrape the Web and collect data.
- Apply EDA and the Data Science process in a case study.
- Apply basic machine learning algorithms (Linear Regression, k-Nearest Neighbors (k-NN), k-means, Naive Bayes) for predictive modeling. Explain why Linear Regression and k-NN are poor choices for Filtering Spam. Explain why Naive Bayes is a better alternative.
- Identify common approaches used for Feature Generation. Identify basic Feature Selection algorithms (Filters, Wrappers, Decision Trees, Random Forests) and use them in applications.
- Identify and explain the fundamental mathematical and algorithmic ingredients that constitute a Recommendation Engine (dimensionality reduction, singular value decomposition, principal component analysis). Build a recommendation system of their own using existing components.
- Create effective visualization of given data (to communicate or persuade).
- Work effectively (and synergistically) in teams on data science projects.
- Reason about ethical and privacy issues in the conduct of data science, and apply ethical practices.
This course is suitable for professionals working as data engineers, analytics professionals, marketing researchers, or software engineers in the finance, insurance, or e-commerce domains.
The course is also suitable for graduate or postgraduate students in computer science, computer engineering, electrical engineering, applied mathematics, business, computational sciences, and related analytic fields.
Students are expected to have basic knowledge of algorithms, reasonable programming skills, and some familiarity with basic linear algebra, probability, and statistics.
If you are interested in taking the course, but are not sure if you have the right background, talk to the instructor. You may still be able to take the course if you are willing to put in the extra effort to fill in any gaps.
The course consists of lectures (two times a week, 100 min each), and involves a set of assignments (about 3 to 4), a set of Kaggle problems (about 3 to 4) and an internship project.
The internship project could be a real project that solves a real-world problem by analyzing an interesting dataset using existing methods and software tools; building your own machine learning or statistical model; or creating a visualization of a complex dataset. Students are encouraged to work in teams of two or three on a project. Assignments and Kaggle problems, on the other hand, are to be completed and submitted individually.
Your final grade will be determined by your performance on each of the following items; the percentages in parentheses show the weight each item carries toward the final grade.
- Class participation (10%)
- Assignments (30%)
- Project (30%)
- Internship project (30%)
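As a rough illustration of how these weights combine, the sketch below computes a final grade as a plain weighted average. It rests on assumptions not stated in this syllabus: each component is scored on a 0 to 100 scale, no curving or rounding is applied, and Python is used here only for illustration (the course itself names R). The function name `final_grade` and the example scores are hypothetical.

```python
# Weights as listed in this syllabus, expressed as fractions of the final grade.
WEIGHTS = {
    "class_participation": 0.10,
    "assignments": 0.30,
    "project": 0.30,
    "internship_project": 0.30,
}


def final_grade(scores):
    """Return the weighted average of component scores (assumed 0-100 each)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student.
example = {
    "class_participation": 90,
    "assignments": 80,
    "project": 70,
    "internship_project": 85,
}
print(round(final_grade(example), 2))
```

With these example scores the components contribute 9, 24, 21, and 25.5 points respectively.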
Topics and Course Outline
- Introduction: What is Data Science?
  - Big Data and Data Science hype
  - Why now?
  - Current landscape of perspectives
  - Skill sets needed
- Statistical Inference
  - Populations and samples
  - Statistical modeling, probability distributions, fitting a model
  - Intro to R
- Exploratory Data Analysis and the Data Science Process
  - Basic tools (plots, graphs and summary statistics) of EDA
  - Philosophy of EDA
  - The Data Science Process
  - Case Study
- Three Basic Machine Learning Algorithms
  - Linear Regression
  - k-Nearest Neighbors (k-NN)
  - k-means
- One More Machine Learning Algorithm and Usage in Applications
  - Motivating application: Filtering Spam
  - Naive Bayes and why it works for Filtering Spam
  - Data Wrangling: APIs and other tools for scraping the Web
- Feature Generation and Feature Selection (Extracting Meaning From Data)
  - Motivating application: user (customer) retention
  - Feature Generation (brainstorming, role of domain expertise, and place for imagination)
  - Feature Selection algorithms
    - Filters; Wrappers; Decision Trees; Random Forests
- Recommendation Systems: Building a User-Facing Data Product
  - Algorithmic ingredients of a Recommendation Engine
  - Dimensionality Reduction
  - Singular Value Decomposition
  - Principal Component Analysis
  - Exercise: build your own recommendation system
- Mining Social-Network Graphs
  - Social networks as graphs
  - Clustering of graphs
  - Direct discovery of communities in graphs
  - Partitioning of graphs
  - Neighborhood properties in graphs
- Data Visualization
  - Basic principles, ideas and tools for data visualization
  - Examples of inspiring (industry) projects
  - Exercise: create your own visualization of a complex dataset
- Data Science and Ethical Issues
  - Discussions on privacy, security, ethics
  - A look back at Data Science
  - Next-generation data scientists
The lecture notes and reading material will be posted on the course's website or the associated Google Classroom page as the course proceeds.
Additional references and books related to the course:
- Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. (free online)
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
- Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning, Second Edition. ISBN 0387952845. 2009. (free online)
- Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note: this is a book currently being written by the three authors. The authors have made the first draft of their notes for the book available online. The material is intended for a modern theoretical course in computer science.)
- Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. 2014.
- Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, Third Edition. ISBN 0123814790. 2011.
Students are expected to submit assignments by the specified due date and time. Assignments turned in up to 48 hours late will be accepted with a 10% grade penalty for each 24 hours late. Except by prior arrangement, missing work or work submitted more than 48 hours late will receive a zero.
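The late policy can be sketched as a small calculation. The interpretation below is an assumption, not an official rule: each started 24-hour period counts as a full period, the 10% is taken from the earned score, and prior arrangements are not modeled. Python is used only for illustration, and `late_adjusted_score` is a hypothetical helper name.

```python
import math


def late_adjusted_score(score, hours_late):
    """Apply the late policy to an earned score (assumed interpretation).

    - on time: no penalty
    - up to 48 hours late: 10% off per started 24-hour period
    - more than 48 hours late: zero
    """
    if hours_late <= 0:
        return float(score)
    if hours_late > 48:
        return 0.0
    periods = math.ceil(hours_late / 24)  # 1 or 2 within the 48-hour window
    return score * (1 - 0.10 * periods)


# Hypothetical earned score of 80 submitted at different delays (in hours).
for delay in (0, 5, 30, 49):
    print(delay, round(late_adjusted_score(80, delay), 2))
```

Under this reading, a submission 5 hours late loses one period's penalty, one 30 hours late loses two, and one 49 hours late receives no credit.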
Important Dates and Deadlines
Please refer to the course page and academic calendar often to be aware of important dates and critical deadlines throughout the semester.
This syllabus is subject to change. Updates will be posted on the course website.