Machine Learning has piqued my interest since college (yes, yes, it piques everyone’s interest; it’s a field everyone likes and a term people throw around randomly). When an opportunity to develop ‘predictive models for order processing’ using machine learning paradigms came along in my team, I jumped at it.
The first few weeks were more of a ‘sponge’ phase: trying to understand what the project was about, its aims, what I would need to know, what was expected of me and so on. At the end of this, the key fields I realized I needed to know more about were ‘Big Data’ and ‘Data Science’.
‘Big Data’ is a relatively new field that deals with handling volumes and types of data (unstructured data like a Twitter feed, for example) that traditional databases may not be able to handle. The ecosystem is huge and is still developing at a tremendous pace; I come into contact with new technology from it almost every week. The stack we use for analysis is primarily Cloudera (CDH 5.8) with a bit of third-party integration (for example, StreamSets) for specific needs. I’ll be covering each element of the stack in separate posts later on (please watch out for the links in the comments section).
Learning ‘Data Science’ was a bit harder. ‘Data Science’ can be defined as the field of analyzing data, finding patterns or making predictions, and developing these into a feasible system. It aims to help drive business decisions using data. It’s a field that is again relatively new (courses for Data Science started in the early 2000s), with most of its experienced practitioners coming from Operations Research or Statistics backgrounds (having picked up coding/development on the way). Its practice varies from person to person, as different people handle data in different ways. In that sense it seems, at present, to be more of an art form than a science. Following are two specializations on Coursera that I found particularly useful for getting the basics of the field down. They help answer questions like: how to handle data, how to see patterns and how to formulate a model.
For the core developers
For the developers who want to understand the bigger picture
Statistics deserves a special mention. It took me some time to realize I would need a refresher on Statistics concepts. Understanding this field promotes a better line of thought for analysis and helps you develop better models. Basic concepts of statistics can be picked up here – Statistics For Data Science
In terms of programming languages – Python is a pretty powerful language with integration with all the technologies in the CDH stack. For a basic course check out the link below – Python For Data Science
This covers all you need to start looking at problems. So go out and start getting your hands dirty.
That’s all folks..