Three pillars to succeed in Data Science
by Anay Nayak - Sahaj
There are many innovative use cases now driven by Data Science, and given the low cost of storage and cloud infrastructure, it is tempting for enterprises both big and small to start this journey. While the rest of engineering has reached a phase where Agile/Agility has become an abused term, Data Science still needs to catch up on many of these practices.
Here are three ways to accelerate your Data Science journey:
Iterate on the models
For any new model, start with simple, explainable models with a measure of accuracy and roll it out to your users. Improve the model in subsequent cycles based on any new observations and stop once the cost exceeds the benefit. If required, build capabilities to A/B test your models.
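If A/B testing does become necessary, the routing itself need not be elaborate. A minimal sketch (the variant names and weights here are purely illustrative) might deterministically bucket users by hashing their id, so each user stays on the same variant across sessions:

```python
import hashlib

def assign_variant(user_id: str,
                   variants=("model_a", "model_b"),
                   weights=(0.5, 0.5)) -> str:
    """Deterministically bucket a user into a model variant.

    Hashing the user id (rather than assigning randomly) keeps each
    user on the same variant across sessions, so per-variant outcome
    metrics can be compared fairly. Variant names and weights are
    illustrative placeholders.
    """
    # Map the user id to a stable point in [0, 1).
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF

    # Walk the cumulative weights to pick the variant.
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if point < cumulative:
            return variant
    return variants[-1]
```

Because assignment is a pure function of the user id, no assignment table needs to be stored, and rebalancing is just a change of weights.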
Timebox exploration for your hypotheses so that you don’t get lost in the data. You will discover new patterns and outliers in your data; note them, and assess whether they are worth analysing further for future business use cases.
Rather than spending more time choosing the best model, pick one that provides acceptable accuracy and improve it over time. Delaying a release for minor accuracy gains prevents you from discovering the things you can only learn in production.
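A "simple, explainable" starting point can be as modest as a majority-class baseline. This hypothetical sketch shows the kind of yardstick a candidate model should beat before any extra complexity is justified:

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label.

    A trivial, fully explainable baseline: if a candidate model cannot
    clearly beat this number, its added complexity is not yet earning
    its keep.
    """
    # Find the most frequent label in the training data.
    majority = Counter(train_labels).most_common(1)[0][0]
    # Score that constant prediction against the test labels.
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)
```

On a heavily imbalanced dataset this baseline can look deceptively strong, which is itself a useful early signal about which accuracy measure to agree on with stakeholders.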
Work with Stakeholders and business users
Involve business users early and often for data analysis and validation. Provide access to users via interactive widgets/dashboards so that they can explore and come back with specific data points for any issues.
Ensure that your stakeholders understand how rolling out a model differs from rolling out a regular feature. Since much of the work is exploratory, the data-analysis steps and discoveries matter in addition to the outcome. Agree on an acceptable accuracy threshold. For some models, you can start by enabling people in their roles and later evolve towards providing direct recommendations.
Do not skip confidence-building measures like testing and validation to reach a deliverable faster. Instead, leverage them to evolve the model frequently. For systems dealing with big data, fixing a broken model costs far more in time, money and reputation than validating it up front.
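One lightweight confidence-building measure is an explicit rollout gate. In this sketch the function name and threshold value are illustrative, standing in for whatever criteria you agreed on with your stakeholders:

```python
def safe_to_ship(candidate_accuracy: float,
                 baseline_accuracy: float,
                 threshold: float = 0.75) -> bool:
    """Gate a model rollout on agreed accuracy criteria.

    The 0.75 threshold is a placeholder -- in practice it is whatever
    number the stakeholders signed off on. Ship only if the candidate
    clears the absolute threshold AND does not regress against the
    model already in production.
    """
    return (candidate_accuracy >= threshold
            and candidate_accuracy >= baseline_accuracy)
```

Running a check like this in the delivery pipeline turns "acceptable accuracy" from a conversation into an enforced contract, and makes frequent model evolution safe rather than risky.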
Leverage on-demand immutable cloud infrastructure to reduce upfront infrastructure costs and to run larger workloads faster. While each run costs more with more infrastructure behind it, you save on the overall time to ship the model.
Avoid silos: have Data Scientists and Data Engineers working closely on all aspects. Each possesses specific skills and benefits hugely from cross-pollination, which in turn improves both the rate and the quality of what is delivered. Reduce cycle time for your models by having more people work on them simultaneously rather than running one stream per person.