The Data Science Pipeline: Communication
Once you've decided on a model, the next step is to use it to actually solve your problem. For example, perhaps the housing price prediction is part of a real estate app. In that case, you would need to incorporate your model into the app. Typically you would have an app development team with whom to discuss the best way to do that, since the details often depend on various technical aspects of how the app works.
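One common way to hand a trained model to an app is to save it to a file that the app loads when it needs a prediction. The following is only a minimal sketch of that idea, assuming Scikit-Learn and joblib are available. The training numbers, the file name `housing_model.joblib`, and the function `predict_price` are all made up for illustration; a real app team may prefer a different mechanism entirely.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: square footage -> sale price (made-up numbers).
X_train = np.array([[800.0], [1200.0], [1500.0], [2000.0]])
y_train = np.array([160_000.0, 240_000.0, 300_000.0, 400_000.0])

model = LinearRegression().fit(X_train, y_train)


def predict_price(square_feet: float, path: str) -> float:
    """App side (hypothetical): load the saved model and return one prediction."""
    loaded = joblib.load(path)
    return float(loaded.predict(np.array([[square_feet]]))[0])


with tempfile.TemporaryDirectory() as tmp_dir:
    # Data science side: write the fitted model to disk.
    model_path = os.path.join(tmp_dir, "housing_model.joblib")
    joblib.dump(model, model_path)

    # App side: read the file back and ask for a prediction.
    estimate = predict_price(1000.0, model_path)

print(f"Estimated price for 1000 sq ft: {estimate:,.0f}")
```

Loading the file on every call keeps the sketch short; an app would more likely load the model once at startup and reuse it.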
Another common machine learning use case is informing business decisions. Typically, the people ultimately responsible for those decisions will want as robust an understanding as possible of the relevant details about how the model works and how reliable its predictions are. The data scientist is responsible for describing their process clearly and being forthright about any causes for concern. It might also be important at this stage to be able to think of machine learning models in more specific terms than "something that can be trained in Scikit-Learn".
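One concrete way to describe how reliable a model's predictions are is to report cross-validated scores rather than a single number. The sketch below assumes Scikit-Learn and uses synthetic data from `make_regression` as a stand-in for a real dataset; the choice of linear regression, five folds, and the R^2 metric are illustrative assumptions, not recommendations from the course.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset (assumption for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Score the model on 5 different train/validation splits.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

# Report the spread as well as the average, so that a decision-maker
# can see how much the score varies from one split to another.
summary = (
    f"R^2 over {len(scores)} folds: "
    f"mean {scores.mean():.3f}, min {scores.min():.3f}, max {scores.max():.3f}"
)
print(summary)
```

Reporting the minimum alongside the mean is one way of being forthright about a cause for concern: a model that does well on average but poorly on one fold deserves a closer look.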
You might be asked to give a report or a presentation or both, and your ability to have an impact in your organization might depend on your ability to inspire confidence through these media. It's a good idea to begin practicing supporting your quantitative work with clear explanations, well before the stakes are high.
Here are a few tips for writing about your data analysis:
- Know your audience. Learning to anticipate what will be understandable and meaningful to your readers is one of the most important skills for any sort of writing. It takes practice, and it requires targeting a specific audience throughout the writing process. If in doubt, provide an explanation rather than assuming your reader will be familiar with a necessary idea.
- Prioritize narrative cohesion. You don't want a report to read like a laundry list of things you did. The reader should be able to easily identify the motivating question, follow idea threads from section to section, appreciate any surprises, and spend attention on various components in proportion to their actual importance.
- Don't waste space. Leave out anything that doesn't really matter. No one can appreciate dozens of lines of data frame printouts, so they should not appear in a report. If it's important to show what a data frame looks like, you might print its head. The same goes for figures: it takes time for the reader to absorb the lessons of a graphic, so you want to write explicit captions to facilitate that process. It's also important to be mindful of reader fatigue. If you have 20 plots related to the same idea, you probably need to identify a handful of especially useful ones and do without the rest.
- Check for typos. Readers will, subconsciously or otherwise, take your work less seriously if it is riddled with errors. Use a service like Grammarly, or ask a colleague to proofread your work, to help make sure it is grammatically and typographically accurate.
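As a small illustration of the "Don't waste space" tip, here is a sketch of showing only the head of a data frame, assuming pandas and NumPy. The `homes` data frame and its columns are randomly generated placeholders, not real data.

```python
import numpy as np
import pandas as pd

# Placeholder data frame with 500 rows (random values, for illustration only).
rng = np.random.default_rng(0)
homes = pd.DataFrame(
    {
        "square_feet": rng.integers(600, 3500, size=500),
        "bedrooms": rng.integers(1, 6, size=500),
    }
)

# Show the first five rows instead of the full printout.
preview = homes.head()
print(preview)

# State the full size in one line so the reader knows what was omitted.
print(f"({len(homes)} rows total; first {len(preview)} shown)")
```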
Read this report and this report. Make five positive or negative observations about each report. It's OK if your observations are mostly negative about one and mostly positive about the other.
This course has been a very brief introduction to the data science pipeline. If you try to tackle a real data science problem, you will find that there are many important skills that we did not develop (for example, handling missing data or making categorical variables quantitative for purposes of training a model). However, getting a survey of the full pipeline will serve as a useful frame of reference as you learn more data science from more in-depth sources. My top recommendations are Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron for Python, and the free online book Introduction to Data Science by Rafael Irizarry for R.