Learning from mistakes

The Use Case

My first endeavor in the field came in the form of a proof-of-concept: a near-real-time dashboard to monitor and alert on issues during a particular stage of order processing. As developers in Order Pipeline Engineering (OPE) at Dell, we take care of certain order validations before an order goes to manufacturing. A number of systems and teams are involved in these validations, and the number of dependencies and combinations can be voluminous and complicated.


Mock representation of the complexity of dependencies between systems during validation in OPE

To increase visibility while an order moved through OPE, we wanted to be able to inform stakeholders of the order's current location, its issues or areas of concern, and, where issues existed, what actions teams could take to help move the order along. The idea was to build an ‘Order Journey’ dashboard targeting the portion of the journey in OPE.

The aim of this proof-of-concept, however, was not only to track and monitor issues. We also wanted to collect data as orders moved through OPE, in the hope of building predictive models that could identify possible issues before they arose. This meant storing and processing voluminous application logs to track and identify order issues. This is where ‘Big Data’ and ‘Data Science’ came into the picture.
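To give a taste of the kind of log processing involved, here is a minimal Python sketch. The log format, field names, and stage names are invented for illustration; they are not the actual OPE formats.

```python
import re
from collections import Counter

# Hypothetical log line format:
# "2017-03-01 12:00:00 INFO order=1234 stage=ADDRESS_CHECK status=FAILED"
LOG_PATTERN = re.compile(
    r"order=(?P<order_id>\d+)\s+stage=(?P<stage>\w+)\s+status=(?P<status>\w+)"
)

def extract_order_events(lines):
    """Pull (order_id, stage, status) tuples out of raw application log lines."""
    events = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            events.append(
                (match.group("order_id"), match.group("stage"), match.group("status"))
            )
    return events

def failure_counts_by_stage(events):
    """Count how many validation failures occurred at each stage."""
    return Counter(stage for _, stage, status in events if status == "FAILED")
```

Once events are extracted into a structured form like this, they can be stored and aggregated to feed both the dashboard and, later, model training.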

During development we faced many issues and stumbled on multiple occasions. The three below are the ones I felt another novice would be most likely to make.

Mistake 1: Sacrificing Speed for Customizability

After the first rounds of requirement analysis and grooming, we believed that being able to customize the dashboard to a high degree was an absolute requirement.

Using R and Shiny (an R package for dashboards), we developed a dashboard where we had control of each and every pixel being displayed. The first phase of development went well: the proof-of-concept became a product and the dashboard gained a considerable number of users. However, this user base came from multiple teams with differing requirements, each requiring us to pursue different lines of analysis. Developing visuals and prototyping dashboards involved a fair amount of code, and we noticed it took considerable time to develop, test, and release a feature. Also, if an analysis proved inconclusive, we had to scrap a fair amount of code. We could not tackle requirements fast enough, the backlog grew, and we ended up losing users.

Now, the dashboard is built using Cloudera and Tableau. Tableau makes data visualization easier and prototyping interactive dashboards faster, which reduced overall development time. Granted, customization went down, but we are now able to cater to our wide audience within the constrained time frames.

Mistake 2: Dispersed Analysis

Certain user requirements tended to be vague and had us following long, discursive lines of analysis where, oftentimes, the need we were trying to meet mutated into a different form. This caused us to build models that tracked the wrong metrics and to alert on the wrong information.

The question is the most basic and essential thing in analytics to get right. Knowing what kind of question you have to tackle (do I have to predict something, or just see whether a relationship exists?) and having a clear, to-the-point question worded out is itself half the fight.

Now, the question is the first thing worded out of every customer requirement, and it is refined and clarified over multiple discussions with the users until the analysis and the expected result are absolutely clear.

For more on types of questions and how to frame them, see this course on Coursera – Managing Data Analysis

Mistake 3: Non-Reproducible Reports

Oftentimes, a new requirement coming in means extending previous work. Traditional documentation, with documented code and data-flow diagrams, was not enough. Data pre-processing steps, initial exploratory analysis and its results, results from sample-size estimates, etc.: these were a few of the things we needed to know before we could start an analysis. We saw that each developer was re-iterating through these stages, leading to a certain amount of redundant work.

Now, we use knitr and IPython notebooks to document analyses, including intermediate results. A brief reading of these can give a developer an understanding of the data that would otherwise have taken a day of coding to obtain. Additionally, each developer can do incremental analysis, enriching the report with interpretations, counter-views, and different lines of thought.
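As a rough illustration of what such a report captures, here is a minimal Python sketch of the sort of content a notebook carries: the pre-processing decision, its rationale, and the intermediate results, all recorded next to the code. The data and the threshold are invented for illustration.

```python
# Minimal sketch of the kind of intermediate results we now capture in a
# notebook instead of leaving them implicit in each developer's head.
import statistics

raw_durations = [12.0, 15.5, 14.2, 240.0, 13.8, 16.1]  # validation times (minutes)

# Pre-processing decision, recorded together with its rationale:
# values above 120 minutes are treated as stuck orders and excluded.
STUCK_THRESHOLD = 120.0
clean = [d for d in raw_durations if d <= STUCK_THRESHOLD]
excluded = len(raw_durations) - len(clean)

# Intermediate results that the next analyst would otherwise re-derive.
summary = {
    "n_raw": len(raw_durations),
    "n_excluded_as_stuck": excluded,
    "mean_minutes": round(statistics.mean(clean), 2),
    "stdev_minutes": round(statistics.stdev(clean), 2),
}
print(summary)
```

The point is not the computation itself but that the decision (the threshold), its effect (one record excluded), and the resulting summary all live in one readable document.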

Learning from these mistakes helped us save time and improve the quality of our analysis. I hope this helps you avoid our mistakes.

That’s all folks..


Getting the basics down

Machine Learning is a field that has piqued my interest since college (yes, yes... it piques everyone's interest; it's a field everyone likes and a term people throw around randomly). When an opportunity came along in my team to develop ‘predictive models for order processing’ using machine learning paradigms, I jumped at it.

The first few weeks were more of a ‘sponge’ phase: trying to understand what the project was about, its aims, what I would need to know, what was expected of me, and so on. At the end of this, the key fields I realized I needed to know more about were ‘Big Data’ and ‘Data Science’.

‘Big Data’ is a relatively new field that deals with handling amounts and types of data (unstructured data like Twitter feeds) that traditional databases may not be able to handle. The ecosystem is huge and still developing at a tremendous pace; I come into contact with new technology in it almost every week. The stack we utilize for analysis is primarily Cloudera (CDH 5.8) with a bit of third-party integration (for example, StreamSets) for specific needs. I'll be covering each element of the stack in separate posts later on (please watch out for the links in the comments section).

Learning ‘Data Science’ was a bit harder. ‘Data Science’ can be defined as the field of analyzing data to find patterns or make predictions, and of developing these analyses into feasible systems. It aims to help drive business decisions using data. It is, again, a relatively new field (dedicated courses began in the early 2000s), with most of its experienced practitioners coming from Operations Research or Statistics backgrounds (having picked up coding/development along the way). Its practice varies from person to person, as different people handle data in different ways; in that sense it seems, at present, more of an art form than a science. Following are two specializations on Coursera that I found particularly useful for getting the basics of the field down. They help answer questions like: how do I handle data, how do I see patterns, and how do I formulate a model?

Coursera – Data Science Specialization

For the core developers

Coursera – Executive Data Science Specialization

For the developers who want to understand the bigger picture

Statistics deserves a special mention. It took me some time to realize I would need a refresher on statistics concepts. Understanding this field promotes better lines of thought during analysis and helps you develop better models. Basic concepts of statistics can be picked up here – Statistics For Data Science

In terms of programming languages, Python is a pretty powerful language with integration into all the technologies in the CDH stack. For a basic course, check out the link below – Python For Data Science
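As a tiny taste of what basic analysis in Python looks like, here is a sketch that checks for a linear relationship between two variables by computing the Pearson correlation from first principles. The numbers are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented numbers: order size vs. time spent in validation.
order_lines = [1, 2, 3, 5, 8, 13]
validation_minutes = [10.0, 11.5, 14.0, 17.5, 25.0, 36.5]

r = pearson(order_lines, validation_minutes)
print(round(r, 3))  # close to 1, suggesting a strong linear relationship
```

In practice you would reach for a library (NumPy, pandas) rather than hand-rolling this, but writing it out once makes the statistics behind the library call concrete.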

This covers all you would need to start looking at problems. So go out and start getting your hands dirty.

That’s all folks..


A New Beginning

As part of my job, I often read blog posts from engineering teams who push the boundaries of big data analytics in every way: processing speeds, data compaction, and even visualizations that seem to touch the viewer at a personal level. These posts receive a lot of comments and criticism and spark debates which often leave both sides having learned something new, even if they do not agree on the final results.

I enjoy the work I do and want to improve at it and achieve mastery of it. To this end, I decided it is time to put the ideas and breakthroughs that I find at work and on independent projects onto an open platform to receive critique. This is the first blog post with that aim at heart.

