Learning from mistakes

The Use Case

My first endeavor in the field came in the form of a proof-of-concept – a near-real-time dashboard to monitor and alert on issues during a particular stage of order processing. As developers in Order Pipeline Engineering (OPE) at Dell, we take care of certain order validations before an order goes to manufacturing. A number of systems and teams are involved in these validations, and the dependencies and combinations can be quite numerous and complicated.


Mock representation of complexity of dependencies between systems during validation in OPE

To increase visibility while an order moved through Order Pipeline Engineering, we wanted to be able to inform stakeholders of the order's current location, its issues or areas of concern, and, if there were issues, what actions teams could take to help move the order along. The idea was to build an ‘Order Journey’ dashboard, targeting the portion of the journey in OPE.

The aim of this proof-of-concept was not only to track and monitor issues, however. We also wanted to collect data as orders moved through OPE, in hopes of building predictive models that could identify possible issues before they even happened. This meant storing and processing voluminous application logs to track and identify order issues. This is where ‘Big Data’ and ‘Data Science’ came into the picture.

During development we faced many issues and stumbled on multiple occasions. The three below are the mistakes I felt another novice would be most likely to make.

Mistake 1: Sacrificing Speed for Customizability

After the first rounds of requirement analysis and grooming, we believed that being able to customize the dashboard to a high degree was an absolute requirement.

Using R and Shiny (an R package for dashboards), we developed a dashboard where we had control of each and every pixel being displayed. The first phase of development went well: the proof-of-concept became a product and the dashboard attracted a considerable number of users. However, this user base came from multiple teams with differing requirements, each of which had us pursuing a different line of analysis. Developing visuals and prototyping dashboards involved a decent amount of code, and we noticed it took considerable time to develop, test and release a feature. Also, if an analysis proved inconclusive, we had to scrap a decent amount of code. We were not able to tackle requirements fast enough, the backlog grew, and we ended up losing users.

Now the dashboard is built using Cloudera and Tableau. Tableau makes data visualization easier and prototyping interactive dashboards faster, reducing overall development time. Granted, customizability went down, but now we are able to cater to our wide audience within the constrained time frames.

Mistake 2: Dispersed Analysis

Certain user requirements tended to be vague and had us following long, discursive lines of analysis, where the need we were trying to meet often mutated into a different form. As a result, we built models that tracked the wrong metrics and alerted on the wrong information.

The question is the most basic and essential thing to get right in analytics. Knowing what kind of question you have to tackle (do I have to predict something, or just see whether there is a relationship?) and having a clear, to-the-point question worded out is itself half the fight.

Now, the question is the first thing worded out of every customer requirement, and it is refined and clarified over multiple discussions with the users until the analysis and the expected result are absolutely clear.

For more on types of questions and how to frame them, see this course on Coursera: Managing Data Analysis.

Mistake 3: Non-Reproducible Reports

Oftentimes, a new requirement coming in meant extending previous work. Traditional documentation, with commented code and data-flow diagrams, was not enough. Data pre-processing steps, initial exploratory analysis and its results, results from sample size estimates, etc. – these were a few of the things we needed to know before we could start an analysis. We saw that each developer was re-iterating through these stages, leading to a certain amount of redundant work.

Now, we use knitr and IPython notebooks to document analyses, including intermediate results. A brief reading of such a report can give a developer an understanding of the data that would otherwise take a day of coding to obtain. Additionally, each developer can do incremental analysis, enriching the report with interpretations, counter-views and different lines of thought.
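To make the idea concrete, here is a minimal, hypothetical sketch (not our actual pipeline – the function and field names are invented for illustration) of the notebook style: every pre-processing step records its intermediate result alongside the code, so the next developer can read what happened to the data instead of re-deriving it.

```python
# Hypothetical sketch: each cleaning step on a batch of order validation
# timings appends a (step, value) entry to a report, mimicking how a
# knitr/IPython notebook interleaves code with intermediate results.
from statistics import mean

def preprocess(validation_hours, report):
    """Clean raw validation durations (in hours), logging each step's effect."""
    report.append(("raw records", len(validation_hours)))

    # Step 1: drop records with missing (None) timings.
    cleaned = [t for t in validation_hours if t is not None]
    report.append(("after dropping missing", len(cleaned)))

    # Step 2: drop impossible negative durations (e.g. a clock-skew artifact).
    cleaned = [t for t in cleaned if t >= 0]
    report.append(("after dropping negatives", len(cleaned)))

    report.append(("mean validation hours", round(mean(cleaned), 2)))
    return cleaned, report

cleaned, report = preprocess([4.5, None, 3.0, -1.0, 6.5], [])
for step, value in report:
    print(f"{step}: {value}")
```

In a real notebook these entries would be rendered inline with prose and plots; the point is simply that the "day of coding" (which records survived, what the distributions look like) is captured once and read thereafter.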

Learning from these mistakes helped us save time and improve the quality of our analysis. I hope this helps you avoid them.

That’s all, folks!


About Karun Thankachan

Working as an Assoc Software Development Engineer at Dell International Services Ltd. Areas of focus are Data Science and Engineering (Hadoop and Python).
This entry was posted in Data Science, Statistics.
