Issue 26: This Week in Data (Feb 8, 2021)
(1) Canada goes after Clearview AI, (2) Alibaba Cloud profitable after 11 years, (3) Databricks raises $1B, (4) NYT warns about dangers of geo-location tracking of Capitol insurrectionists
Hi TWID readers,
Hope you all had a super weekend. The Super Bowl gave us something for everyone, whether it be television advertisements, the Mahomes heroics, the Weeknd memes or Brady’s family going on their annual vacation.
But this is a data newsletter, so onto data news:
Data Infrastructure
Learning infrastructure from the blogosphere
I'm starting to believe that reading technical blogs is some muted, technical equivalent of doomscrolling. You read to validate your understanding, enrich it slightly and pray that you find something actually new. This is especially true when company blogs are constrained and can't reveal that much. That being said, some are getting quite good, and here’s my tiny contribution collating them:
Gobblin. The PayPal team explains why they chose Apache Gobblin to standardize many ingestion formats.
Using Snapshots. The Capital One team blogs about how snapshots enable time travel inside their data pipelines.
Alibaba Data Lake. The Alibaba team writes a small book (43 min read!) on how they think about data lakes. This is part of Alibaba Cloud, which had its first profitable quarter after 11 years.
An Actual Book. Lecturer Chip Huyen publishes an early draft of the data engineering course material of her Stanford CS 329S class (Data Engineering for ML). Modern lectures need modern content (read: tweets and memes).
Google Cloud. Unlike Alibaba, Google Cloud lost $5.6B on $13B of revenue in 2020. This week they did land a success with a multi-year Twitter partnership.
Privacy & Surveillance
I spy (with) a little fly(ing drone)
There's little evidence that we'll be able to avoid having a standing privacy and ethics section in this newsletter moving forward, given how much news there is every week. Last week, government surveillance and corporate tracking stayed separate, but sometimes the lines begin to blur. You win some, you lose some.
Employee Spying = bad. Amazon is planning on monitoring drivers for "safety reasons". The video cameras use AI to give feedback when drivers are driving aggressively or not paying attention.
Flying Spying = bad. Baltimore terminates a spy plane program that used drones to monitor 90% of the city up to 10 hours each day.
Camera Spying = bad. San Francisco passes a resolution requiring board approval for surveillance plans in special business districts. This is in response to police obtaining surveillance footage from surrounding businesses during BLM protests.
Face Spying = bad. Canada rules that Clearview AI acted illegally by collecting the information of Canadians without their consent. The report says that scraping info from websites doesn't fall under the 'publicly available' exception.
Location Spying = bad, even on people you don't like. The New York Times warns of the dangerous implications of using geo-location data to track those who stormed the Capitol earlier this year. Even though we want justice for criminal behavior, tracking them sets an extremely dangerous precedent.
As people involved in data, it behooves us to understand how our work affects others around us. We may benefit from consumer tracking in order to build recommendations or conversion funnels, but pay the price elsewhere. There’s no fine line to a surveillance world, and what we can do is be vigilant and conscious of how data is used or abused.
Small Bytes
FiveThirtyEight shows the US economy slowly recovering, though the recovery is uneven across groups and industries.
Bezos steps down as Amazon CEO after 27 years, handing the reins to AWS lead Andy Jassy. It's not hard to find profiles of the new guy, and eulogy-like essays on the Bezos era. Now 4 out of the 5 FAAMG founders have been replaced with MBAs.
Vx2Text generates inferences from multi-modal media. It uses text, video, and audio to create a combined context that can be used to interpret what a comic book or movie means.
VentureBeat does a special issue on the Future of Data in Healthcare. In a six-part series, the reporters cover challenges in data management, telemedicine and chatbots, AI decision making, and robotics. I love what comes out of the VB team - honestly great reporting.
Ford F-150 production cut due to chip shortage. It's unclear if this affects the broader cloud market, but it may influence future fabrication plans.
Two additional engineers leave Google over the Timnit Gebru firing last month. A very difficult decision to make, and we will see how things unfold.
Researchers at CMU show predictive policing continues to be racist even when the datasets are changed from arrest data to crime reports.
Corpo Updates
Free product marketing for some data companies.
Forbes magazine this month features how Frank Slootman took Snowflake to the next level. Read this (suck-up piece) for a fun story of executive glory. (link)
Mixpanel is creating a free tier for up to 100K monthly tracked users (MTUs). (link)
Some fun products from my recent ProductHunt crawl (Filters = Top 10, has "AI" in the name)
CopyAI - GPT-3 for doing copywriting
SnazzyAI - GPT-3 for advertising tag lines
Removal.ai - background removal.
Tableau announces its Community Hub, a reorganization and extension of the existing community features. The new little welcome page helps show us their realigned focus for its fans.
Industry and Fundraising
Aflorithmic - $1.3M for AI-driven synthetic audio
Polytomic - $2.4M for syncing warehouse data to business applications
HealthTensor - $5M for augmenting and correcting medical records
Iteratively - $5.4M for event data pipeline validation
Brightloom - $15M for eCommerce recommendations
Reverie Labs - $25M for AI drug discovery
Weights & Biases - $45M for developer tools for ML
DroneDeploy - $50M for drone imaging platform
Rescale - $50M for High Performance Computing infrastructure on the cloud.
Vivino - $155M for wine discovery and recommendations (and is amazing at identifying wines based on a photo of the label)
Databricks - $1B for Spark & Delta Lake
Collibra acquires OwlDQ for data quality for AI
Rapid7 acquires Alcide for $50M for container-based security
This Week in Data is a weekly newsletter to help you stay up to date with developments in the data ecosystem. My goal is to bring focus on broader data trends to data professionals and enthusiasts who are interested in data and its applications. Topics include infrastructure, AI/ML, experimentation, analytics/BI, privacy, security.