Designing My Own “Data Science & Machine Learning” Program

I have already completed my post-graduation in business administration with a specialization in finance, which included some exposure to analytics as well. Three years after completing it, I got an itch to learn data science. That said, I can't imagine going for another PG, but that should not stop me from learning it.

In the age of MOOCs, online learning, YouTube, and blogs, most resources are available to everyone. Yes, it is a pain to identify the good ones, but they are out there. I decided to go through the curricula of some reputed Data Science programs and derive my own program from freely available resources. I considered the following programs.

  • Master of Science in Business Analytics by UT at Austin
  • MS in Data Science by NYU
  • Master of Science in Data Science by Columbia University
  • MS in Computational Data Science by Carnegie Mellon University
  • Certificate Program in Business Analytics by ISB

I also looked at some already created road maps.

  • http://datasciencemasters.org/
  • http://www.datasciguide.com/

These road maps are huge. In fact, the authors mention that it is not humanly possible to go through every resource, so whoever follows them should do so selectively.

The list of machine learning topics is huge, and if you try to learn everything you will lose momentum. So when you start learning machine learning, consider the following points.

First and most importantly, choose your niche and start working on it. If you have not selected a niche yet, read more about the different subtopics, but don't delay deciding on one.

Secondly, don't wait until you have completed all the learning. Start getting your hands dirty as early as possible.

Data Science in Python by University of Michigan

Month#1

Python basics, NumPy, SciPy, Pandas and Matplotlib, using the following courses:

Introduction to Data Science in Python

Basic Statistics

Applied Plotting, Charting & Data Representation in Python

You can take the Applied Plotting, Charting & Data Representation course at a later stage, but I recommend completing the Python and basic statistics courses before starting the machine learning course.

Month#2

Applied Machine Learning in Python

I am still analysing the next steps and will add those sections once I figure them out. Since we now have basic programming skills as well as a basic understanding of machine learning, this is the right time to learn the required mathematics, statistics, and probability before we venture ahead.


Important blogs and Websites for Data Science

Learning and practicing Data Science

Important Courses

Free Books (All legit)

Data Science Blogs to Follow

fuzzywuzzy: String Matching in Python

When dealing with text analytics, we often need to compare pieces of text. There are multiple algorithms and approaches for the job. Let's have a look at the fuzzywuzzy library.

fuzzywuzzy
Installation
pip install fuzzywuzzy  
pip install python-Levenshtein  

fuzzywuzzy will work even if you don't install python-Levenshtein, but installing it improves performance.

Using fuzz.ratio

This is the basic comparison. The output looks like this:

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process
>>> fuzz.ratio("ABCD", "ABCD")
100  
>>> fuzz.ratio("ABCD", "ABCDE")
89  
>>> fuzz.ratio("ABCD", "ABCDEF")
80  
>>> fuzz.ratio("ABCD", "ABCDEFG")
73  
>>> fuzz.ratio("ABCD", "ABCDEFGH")
67  
>>> fuzz.ratio("ABCD", "ABCDEFGHI")
62  
>>> fuzz.ratio("ABCD", "ABCDEFGHIJ")
57  
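
The scores above are consistent with the formula 2 * M / T, where M is the number of matching characters and T is the total length of both strings. Here is a minimal sketch that reproduces them with the standard library's difflib, which, as far as I know, is what fuzzywuzzy falls back to when python-Levenshtein is not installed. The name simple_ratio is my own and is not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score (hypothetical helper, not fuzzywuzzy API)."""
    # SequenceMatcher.ratio() computes 2 * M / T as a float between 0 and 1.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


for other in ["ABCD", "ABCDE", "ABCDEF", "ABCDEFGHIJ"]:
    print(other, simple_ratio("ABCD", other))
```

For "ABCD" against "ABCDE", M is 4 and T is 9, so the score is 2 * 4 / 9 = 0.89, which matches the 89 shown above.
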
Using partial_ratio

ratio is a very simple comparison. You can use partial_ratio to do substring matching.

>>> fuzz.partial_ratio("ABCD", "ABCDEFGHIJ")
100  
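
To see why the substring match scores 100, here is a simplified sketch of the idea: slide the shorter string across the longer one and keep the best score of any window. This is an assumption-laden illustration, not fuzzywuzzy's exact algorithm (the library picks candidate windows differently), so scores can differ from the library on other inputs. Both function names are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity score based on difflib (hypothetical helper)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def sliding_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0  # nothing to compare
    width = len(shorter)
    # Compare the shorter string with every window of the same width.
    return max(
        simple_ratio(shorter, longer[i:i + width])
        for i in range(len(longer) - width + 1)
    )


print(sliding_partial_ratio("ABCD", "ABCDEFGHIJ"))
```

The first window of "ABCDEFGHIJ" is exactly "ABCD", so the best score is a perfect match.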

But even partial_ratio fails when the words are scrambled.

>>> fuzz.partial_ratio("India Vs Aus","Aus Vs India")
42  
Using token_sort_ratio

“India Vs Aus” and “Aus Vs India” mean the same thing; people may write it either way and both are valid. In cases where the word order might differ, you can use token_sort_ratio.

>>> fuzz.token_sort_ratio("India Vs Aus","Aus Vs India")
100  
>>> fuzz.token_sort_ratio("India cricket team Vs Aus cricket team","Aus Vs India")
48  
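
The idea behind token sorting can be sketched in a few lines: split each string into words, sort them, join them back, and then run the plain ratio. This is a simplified illustration under my own assumptions (it only lowercases and splits on whitespace, while the library does more preprocessing), and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity score based on difflib (hypothetical helper)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def sort_tokens(text: str) -> str:
    """Lowercase, split on whitespace, and rejoin the words in sorted order."""
    return " ".join(sorted(text.lower().split()))


def simple_token_sort_ratio(a: str, b: str) -> int:
    """Compare two strings after putting their words in the same order."""
    return simple_ratio(sort_tokens(a), sort_tokens(b))


print(sort_tokens("India Vs Aus"))
print(simple_token_sort_ratio("India Vs Aus", "Aus Vs India"))
```

Both inputs normalise to "aus india vs", so word order no longer matters. Extra words still lower the score, because the sorted strings then differ.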

Now let's add a further complication: if I add ‘cricket team’ to one of the strings, as in the second comparison, the match no longer works.

fuzz.token_set_ratio

In such cases, using token_set_ratio might help.

>>> fuzz.token_set_ratio("India cricket team Vs Aus cricket team","Aus Vs India")
100  
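
Here is a rough sketch of how a token-set comparison can work, based on my understanding of the approach: split both strings into sets of words, build one string from the shared words and two more from the shared words plus each side's leftovers, and take the best pairwise score. It is an illustration under those assumptions rather than the library's code, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity score based on difflib (hypothetical helper)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_token_set_ratio(a: str, b: str) -> int:
    """Score two strings by their shared words, ignoring order and repeats."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    # Words common to both strings, and the leftovers on each side.
    shared = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    combined_a = (shared + " " + only_a).strip()
    combined_b = (shared + " " + only_b).strip()

    # Take the best of the three pairwise comparisons.
    return max(
        simple_ratio(shared, combined_a),
        simple_ratio(shared, combined_b),
        simple_ratio(combined_a, combined_b),
    )


print(simple_token_set_ratio(
    "India cricket team Vs Aus cricket team", "Aus Vs India"))
```

Every word of "Aus Vs India" appears in the longer string, so the shared part equals one of the combined strings and the score is 100. The same logic means any string whose words are a subset of the other's scores 100, for example "India" against "India cricket team".
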
Which matching algorithm to use?

Well, just because fuzz.token_set_ratio works even with extra words does not mean it is suitable for your application. Ultimately, it boils down to what you want to compare and how you use the data.

You can take a look at other libraries such as jellyfish if you are looking for alternative approaches.