"The Data Janitor 101", Daniel Molnar, Senior Data Scientist at Microsoft

Description
"The Data Janitor 101", Daniel Molnar, Senior Data Scientist at Microsoft. Watch videos from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo Visit the conference website to learn more: www.datanatives.io

About the Author: I'm a generalist in a tight-knit data team, enabling a data-driven company culture and operations as a data janitor, data analyst and occasional data scientist. I do ETL and data quality, define company-wide KPIs and metrics with management, produce BI, explore user behavior to trigger actionable changes in marketing and product, run A/B tests, do feature engineering for ML, and build a web app for a secret-sauce internal data tool. Long live Bayes and Occam! Tools are mostly bash, Python, Redshift, Tableau, SQL, Flask, Mustache, Wizard and Optimizely.

Transcript
  • 2. Data Janitor 101. Daniel Molnar, Microsoft. Data Natives 2016
  • 7. tl;dr
      • KISS is the philosophy,
      • take the long view, invest in durable knowledge,
      • strive for fast and good enough,
      • just because you can doesn't mean you should.
  • 8. CAP #1: BUSINESS ANALYST
  • 9. "... American MBA? ... if you don’t understand something it must be simple and only take five minutes." - Sean Murphy, PingThings
  • 13. Don't
      • unicorn my a**,
      • hockey stick here for me,
      • skip leg day.
  • 18. Do
      • make definitions,
      • show direction,
      • care about data quality,
      • rule dashboards.
  • 22. KPIs that matter
      • DAU, WAU, MAU, LTV, churn,
      • cohorts, segments, funnels,
      • first hour, first day.
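The active-user KPIs on this slide can be made concrete with a few lines of Python. This is only a minimal sketch: the `EVENTS` log, the `active_users` helper and the 1/7/30-day windows are assumptions for illustration, not definitions taken from the talk. Agreeing on exactly these details is the "make definitions" part of the job.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, day on which the user was active).
EVENTS = [
    ("u1", date(2016, 10, 1)),
    ("u2", date(2016, 10, 1)),
    ("u4", date(2016, 10, 3)),
    ("u1", date(2016, 10, 20)),
    ("u2", date(2016, 10, 21)),
    ("u3", date(2016, 10, 26)),
    ("u1", date(2016, 10, 26)),
]

def active_users(events, as_of, window_days):
    """Count distinct users active in the window_days ending on as_of (inclusive)."""
    start = as_of - timedelta(days=window_days - 1)
    return len({user for user, day in events if start <= day <= as_of})

as_of = date(2016, 10, 26)
dau = active_users(EVENTS, as_of, 1)    # daily active users
wau = active_users(EVENTS, as_of, 7)    # weekly active users (assumed 7-day window)
mau = active_users(EVENTS, as_of, 30)   # monthly active users (assumed 30-day window)
print(dau, wau, mau)
```

Whether a "month" is 28, 30 or a calendar month is a choice the team has to write down once and reuse everywhere.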
  • 27. Approach
      • KPIs must hurt (aka no feelgood metrics),
      • you are what you measure,
      • you can run in one direction,
      • is it actionable (the Friday 1700 test).
  • 31. Toolset
      • Excel,
      • SQL,
      • Metabase.
  • 32. Heroes of the day
      • Joel Spolsky: You Suck at Excel
      • Dan McKinley: Data Driven Products Now!
  • 33. CAP #2: DATA ENGINEER
  • 34. "Don't reinvent the flat tyre." - Alan Kay
  • 40. Don't
      • just Apache it,
      • build a Hadoop JENGA (10x-235x slow),
      • real-time it,
      • stream it,
      • overengineer it.
  • 45. Do
      • embrace dirty reality (entity recognition makes a data engineer),
      • ETL, events and DWH,
      • data quality (know your leakage),
      • testing (yes, you can even unit test data).
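One way to read "you can even unit test data" is to run plain assertions over each batch before it reaches the warehouse. The sketch below is an assumption-laden illustration: the `ROWS` sample, the three rules in `check_rows`, and the reading of "leakage" as the share of rows failing a check are all hypothetical, not taken from the talk.

```python
# Hypothetical batch of rows as they might arrive from an ETL step.
ROWS = [
    {"user_id": "u1", "signup_day": "2016-10-01", "revenue": 9.99},
    {"user_id": "u2", "signup_day": "2016-10-02", "revenue": 0.0},
    {"user_id": "u2", "signup_day": "2016-10-02", "revenue": 0.0},   # duplicate
    {"user_id": None, "signup_day": "2016-10-03", "revenue": -5.0},  # broken
]

def check_rows(rows):
    """Map row index -> list of data quality problems found in that row."""
    problems = {}
    seen = set()
    for i, row in enumerate(rows):
        found = []
        if not row["user_id"]:
            found.append("missing user_id")
        if row["revenue"] < 0:
            found.append("negative revenue")
        key = (row["user_id"], row["signup_day"])
        if key in seen:
            found.append("duplicate row")
        seen.add(key)
        if found:
            problems[i] = found
    return problems

problems = check_rows(ROWS)
# Assumed definition: leakage = share of rows that fail at least one check.
leakage = len(problems) / len(ROWS)
print(problems, leakage)
```

In a batch pipeline the same function could gate the load, for example by failing the job when `leakage` exceeds an agreed threshold.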
  • 50. Approach
      • avoid GIGO,
      • pedal to the metal, skip the overhead,
      • know that big RAM is eating big data,
      • use open source, pragmatic, cloud service agnostic tools.
  • 57. Toolset
      • UNIX (bash, make),
      • Python,
      • SQL,
      • ETL in batch (mETL, night-shift),
      • event tracking (Hamustro, logsanitizer, RPi?),
      • DWH = MPP SQL (Azure DWH, Redshift, Vertica...).
  • 58. Heroes of the day
      • James Mickens: Computers are a Sadness, I am the Cure
      • Dan McKinley: Choose Boring Technology
      • David Beazley: Discovering Python
  • 59. CAP #3: DATA SCIENTIST
  • 60. "Friends don’t let friends calculate p-values (without fully understanding them)." - Scott Weingart
  • 65. Don't
      • expect CSVs and produce models whatever it takes,
      • expect that you have to explore the laws of Universe,
      • forget about Occam's razor,
      • A/B test (only if it REALLY REALLY makes sense).
  • 70. Do
      • user testing to define context (usertesting.com),
      • talk to users via surveys,
      • embed yourself in departments (personas),
      • have common sense.
  • 75. Approach
      • you mostly tell what not to do,
      • it's hard, but still the only way,
      • persist when not finding anything or trivialities,
      • kill teh lurking causation.
  • 81. A/B
      • think twice about TCO,
      • the world isn’t identically distributed,
      • random variation will cheat you in small samples,
      • most A/B test results are illusory,
      • small data -> go Bayesian = less certainty.
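The "small data -> go Bayesian = less certainty" point can be illustrated with a Beta-Binomial comparison. This is a rough sketch, not the speaker's method: the uniform Beta(1, 1) prior, the conversion counts and the `prob_b_beats_a` helper are assumptions chosen for the example. Both scenarios have the same observed rates (8% vs 12%), only the sample size differs.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Same observed lift, very different amounts of evidence (made-up counts).
small = prob_b_beats_a(conv_a=4, n_a=50, conv_b=6, n_b=50)
large = prob_b_beats_a(conv_a=400, n_a=5000, conv_b=600, n_b=5000)
print(small, large)
```

With 50 users per arm the posterior leaves plenty of room for B being no better than A, while with 5000 per arm it is close to certain. The small-sample "winner" is exactly the kind of result random variation can fake.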
  • 86. Toolset
      • SQL,
      • Wizard,
      • Python,
      • R (only to anger CS peeps).
  • 87. Heroes of the day
      • Evan Miller: Wizard Statistical Analyzer
      • Chris Stucchio: talks and posts on testing
  • 88. CAP #4: MACHINE LEARNING
  • 93. Don't
      • need a PhD,
      • develop new unique matrix algos, please,
      • need more than Excel,
      • give false hope.
  • 98. Do
      • deploy good enough fast,
      • copy Kaggle (ensembles, random forest, XGBoost),
      • feature engineer,
      • build core data/feature (augment and enhance).
  • 103. Approach
      • the Mailchimp way (offline built model redeployed each quarter),
      • hybrid approaches (domain expert, vanilla ML),
      • you are a machine instructor,
      • TensorFlow (logic to clients, handle models).
  • 108. Toolset
      • Excel,
      • Wizard,
      • BigML,
      • Python.
  • 109. Heroes of the day
      • John Foreman: Data Smart
      • Jeroen Janssens: Data Science at the Command Line
  • 110. CAP #5: HEAD OF DATA
  • 111. "In God we trust; everybody else bring data to the table." - W. Edwards Deming
  • 116. Don't
      • believe the hype,
      • trust no-one, just benchmarks,
      • let black box take over,
      • expect hiring to be easy.
  • 121. Do
      • maintain data mythology,
      • keep the view backwards straight,
      • expect emotions,
      • see the future.
  • 125. Approach
      • train to be the bearer of the bad news,
      • laugh at endless growth without saturation,
      • handle the cargo cult (inverse causality).
  • 129. Marketing
      • Google Analytics (sampling, off by 20%, no user granularity, no raw, 150k per year),
      • CPA, FB CPA, mobile CPA, conversion, attribution,
      • Net Promoter Score.
  • 130. Heroes of the day
      • Dan Lyons: Disrupted
      • Venkatesh Rao: The Gervais Principle
  • 131. Thank you! @soobrosa
      • Visuals: @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk, jonolist, Kalexanderson