Issue 27: This Week in Data (Feb 16, 2021)
(1) Some great posts on data culture from Stitch Fix, DoorDash, and Airbnb, (2) Virginia consumer privacy law passes the state legislature, (3) Another week, another 20 fundraises in data
Hi TWID,
Hope you all had a great Valentine’s Day (and tried out GPT-3 generated Valentine’s cards) and relaxed on Presidents’ Day weekend if you are in the US. Can you believe we are already a month and a half into 2021? Onto the weekly.
Data Culture
What is Data Culture?
Data is not just about the technology, or the products, or the ethics. It’s about those who work in data, how we work, and how we build ecosystems that make each other’s lives a little better every day.
Many data teams are doing something for somebody else in an organization, whether it is providing a platform for analysts to use or providing recommendations for other teams. Or you could be building a data product for a platform team that helps analysts make recommendations (for their product that enables other platform teams to help other analysts make recommendations, and so on). The point is, in all this complexity, the people matter. Some great posts from this week:
Being a Good Platform Partner. The Stitch Fix Data Platform team explains how being aggressively helpful helps them gain adoption of their platform and trust from their data scientists. This includes not only talking to customers, but also documentation, monitoring, analytics, and pretty much treating their platform as a product.
“On a personal level, I’ve found this kind of user engagement to be incredibly rewarding. In the remote-first culture of the current workplace, there is nothing more de-isolating than developing strong relationships through copious quantities of unsolicited help and the mutually respectful exchange of feedback and product vision.” (Stitch Fix blog)
Communication with Business Partners. The DoorDash Analytics team gives 7 tips on communication best practices. Since I work here, I can say that this is a surprisingly accurate representation of the no-nonsense way our data science team operates. (Disclaimer in case you missed it: I work here.)
"In an effort to appear data-driven, many presentations and documents include a laundry list of metrics presented without context, which have little informational value to the audience [...] Any insight which is not actionable is trivia."
“When utilizing visualizations, avoid confusing the audience [...] Even insights that were initially discovered using an advanced visualization technique can often be summarized with a simple chart or table, which will be easier for all audiences to understand.” (DoorDash blog)
Dashboards at Scale. The Airbnb data team talks about how they built a large, successful BI ecosystem. The post discusses the engineering improvements and data policies they built over the past 5 years to make Apache Superset work for 125k weekly views across 2,000 weekly active users (WAU).
Ultimately, there are plenty of high-level articles, such as "why is it hard to become a data driven company" (HBR), but it's clear that a lot of the culture, work, and innovation comes from the ground up. There's no silver bullet, but we can continue learning from each other and working on ourselves.
Data Privacy
The Push and Pull of Privacy
On the side of protecting data privacy, a few bills we've been waiting for have finally been voted on. We are seeing movement across all layers of government, covering both corporate tracking and government surveillance. The fact that we can count these policies on our fingers is unfortunate, but the pace is picking up!
City ⇒ Minneapolis. The City Council voted on Friday to ban facial recognition for use by police or other city agencies.
State ⇒ Virginia. The Consumer Data Protection Act has passed the state house and senate. The law is modeled after California's CCPA, though a few privacy organizations believe it is still too weak.
Country ⇒ Sweden. Sweden has fined its local police €250K for inappropriate use of Clearview AI's facial recognition software.
Country ⇒ US Federal Trade Commission. The FTC has expressed interest in being involved in data privacy if federal legislation doesn't come through.
That being said, there are also forces eroding data privacy. While many of them exist underground or out of the public eye, some just happen in plain sight (in the form of patents).
Clearview. The company filed for a patent covering new use cases for its facial recognition, such as identifying homeless people and sex offenders to deny them access to buildings.
AnyVision. The company filed a patent to do facial recognition from the sky with drone cameras, tackling challenges that stem from the high camera angle.
Small Bytes
The Spotify team blogs about the massive data pipeline job behind Wrapped 2020. Using a combination of Google Dataflow and Beam, the team ran a large SortMergeBucket operation to join 1PB of data at an affordable price point and bring users their personalized Wrapped pages.
Lyft talks about its use of OpenStreetMap. They go through their process of testing the accuracy of the map tags, signs, and other metadata, and find the quality to be great across 30 US cities.
Try out the Data Visualization Crossword Puzzle, with clues drawn from the data visualization space. Take a break and try the puzzle!
LinkedIn discusses using its Fairness Toolkit (LiFT) at scale. The company open sourced LiFT half a year ago, and gives an example of how it is used internally in the "People You May Know" feature.
Varada has open sourced a Presto optimization tool called the Presto Workload Analyzer. The tool collects metrics to optimize clustering and other configurations.
Unite AI does a great interview with Julia Stoyanovich from the Center for Responsible AI at NYU. The interview is one of the more articulate conversations on what bias in AI means.
Thumbtack improves its Review Relevance to include attributes through extractive summarization. The post goes through the lifecycle of a model and is a window into building models at mid-sized orgs.
Industry and Fundraising
Rhino Health - $5M for federated learning with hospital data
NeuReality - $8M for an AI inference platform using modern hardware
Insightin Health - $12M for personalized health care guidance
Theator - $15.5M for computer vision on surgical footage
BigHat Biosciences - $19M for an AI antibody design platform
Monte Carlo - $25M for data observability
CYE - $120M for probing security vulnerabilities with AI and human hackers
SentinelOne acquires Scalyr for $155M for log management and observability
This Week in Data is a weekly newsletter to help you stay up to date with developments in the data ecosystem. My goal is to bring broader data trends into focus for data professionals and enthusiasts who are interested in data and its applications. Topics include infrastructure, AI/ML, experimentation, analytics/BI, privacy, and security.