Hi TWID Readers!
Thank you for your support and readership this past year. I started this newsletter in June to stay in touch with the innovations, developments, and movement of capital in the data ecosystem. My first issue was back when GPT-3 came out and Privacy Shield was struck down. What a time! I went back through my posts to find patterns, look for interesting trends, and see how these trends may be relevant in the future.
This is a mega post! It may be easier to read on the website (share button below will take you to the post).
Topics
For those who work with data - Data Discovery, Experimentation, Infrastructure
For the data enthusiasts - Data for Science, Data Visualization, Language/Robotics
How data affects our day to day - Consumer Privacy, Government Surveillance
Long term data trends - AI Ethics, Hardware in Data/ML
Companion Link: My tracking spreadsheet for data-related fundraising
For those who work with data
Many of you work in the data space, and there have been plenty of developments across the field, especially in the core foundations. Findings from this year:
Data Discovery and Quality has been hot! The category is still taking shape: data discovery, data catalog, data classification, data downtime, and data dictionary are all terms being thrown around.
In just the past half year, we've seen a number of data discovery & quality posts about internal tools, including Nemo (Facebook), Artifact (Shopify), Amundsen (Lyft), Databook (Uber), and Midas/Minerva/etc. (Airbnb). These made up a large fraction of my "most clicked" links in this weekly newsletter, and it's worth remembering that companies blog on topics they think will attract talent.
There's also been a slew of data discovery products in the market competing with the existing enterprise incumbents (Alation, Collibra). We've seen fundraising and great content from companies such as Data.world ($26M, blog), Monte Carlo ($16M, blog), Anomalo ($6M, blog), and dozens I'm sure I've missed. Each tool tackles data quality, discoverability, lineage, and cataloging in its own way.
Full-stack experimentation is relatively mature but has seen a lot of coverage on the tech blogs I follow. I'm not sure what to make of this, but when the blogging picks up, I can't help but wonder if a small revitalization is coming. Here's what happened this year:
Optimizely got acquired for less than desired in September by customer experience company Episerver. In the full-stack experimentation world, we had fundraises from AB Tasty (July, $40M) and LaunchDarkly (Jan, $54M), but it's unclear whether there is opportunity for innovation or the waves are all heading toward monetization.
In blog land, companies talked about all sorts of things, including quasi-experiments: alternatives such as diff-in-diff or counterfactuals for when traditional A/B testing isn't possible (see blogs: [1] Netflix Quasi-experiments, [2] Shopify Quasi-experiments). Experimentation infrastructure is a tough scaling problem, and T-REX parts 1 & 2 (LinkedIn), Curie (DoorDash), and Verdict (Shopify) each discuss the specific challenges of scaling it. Stitch Fix showed how they use sequential testing to end experiments early and enable customized multi-armed bandit experiments for faster iteration.
None of these techniques are new, but the renewed attention is a good excuse for organizations to revisit internal A/B platforms that need some love or innovation. A few thoughts: (1) Is there AI that can be added to experimentation (e.g. to create better counterfactuals for non-A/B tests)? (2) Will newer streaming technology or fast databases allow for faster iteration, such as efficient multi-armed bandits, better caching, etc.?
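To make the bandit idea concrete, here's a minimal sketch of a Thompson-sampling multi-armed bandit for splitting traffic between experiment variants. Everything here (variant names, conversion rates) is made up for illustration; it's a sketch of the technique, not any company's implementation.

```python
import random

class ThompsonBandit:
    """Thompson sampling over Bernoulli rewards (e.g. click / no click)."""

    def __init__(self, variants):
        # Beta(1, 1) prior on each variant's conversion rate.
        self.stats = {v: {"successes": 1, "failures": 1} for v in variants}

    def choose(self):
        # Sample a plausible conversion rate per variant, serve the best draw.
        draws = {
            v: random.betavariate(s["successes"], s["failures"])
            for v, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def update(self, variant, reward):
        # reward is 1 (converted) or 0 (did not convert).
        key = "successes" if reward else "failures"
        self.stats[variant][key] += 1

# Toy loop: traffic gradually shifts toward the better-performing variant.
bandit = ThompsonBandit(["control", "treatment"])
true_rates = {"control": 0.10, "treatment": 0.12}  # unknown in real life
for _ in range(10_000):
    arm = bandit.choose()
    bandit.update(arm, 1 if random.random() < true_rates[arm] else 0)
print(bandit.stats)
```

The appeal for experimentation platforms is that the allocation adapts as data comes in, which is exactly where fast databases and streaming infrastructure would help.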
Data Infrastructure continues to iterate and innovate. We're starting to stabilize on a tech stack and better understand user access patterns. The recent a16z data architecture diagram is well done, and innovation will happen both within and between the boxes.
Hadoop/Hive is pretty much out. The increasingly popular Data Lake concept means a number of things, ranging from the use cases Hadoop served, to highly unstructured data processing, to staging areas for keeping your data archives around. I expect it to be confusing for years to come. SQL stays king, w/ Databricks launching the SQL-based Delta Engine on top of its Delta Lake (the "lakehouse"), and Presto picking up momentum.
Steady growth and fundraising for companies trying to monetize innovative early/mid-stage technologies w/ traction, such as Altinity ($3M | Tech: ClickHouse), Anyscale ($40M | Tech: Ray), Rockset ($40M | Tech: RocksDB), Materialize ($40M | Tech: Materialize), and Fishtown Analytics ($30M | Tech: dbt). Many of these have the momentum to become integral parts of emerging data ecosystems, and will have to find their niche against more integrated enterprise solutions.
Big movements in the industry. Snowflake (SNOW), Palantir (PLTR), and C3.ai (AI) had public debuts in 2020 and are seeing significant success in the public market, with market caps of $82B, $48B, and $13B respectively as of this writing. Other big deals include Twilio acquiring Segment for $3.2B and Idera acquiring Qubole.
For the data enthusiasts
Depending on how you define it, data can be columns and rows, or it can be a superset of information through which we perceive the world. It all gets a bit too metaphysical for me, but how we use data is honestly pretty interesting. If you love data as an observer, here are some fun things that have happened this year:
AI in Science has helped shift research from the descriptive (detection & analysis), to the predictive (predicting protein structure), to the generative (creating simulations), all of which are important. I imagine we will see AI and data used in novel ways across all branches of science (biology, chemistry, physics, space, climate, etc.), both in research and in application.
Advancements in the AI ecosystem have an increasing impact on science. From DeepMind "solving" protein folding with AlphaFold II, to Facebook's Open Catalyst Project for atomic interactions, to a 3D space map built by neural nets, advancements in data processing have extended their value to the basic sciences. The State of AI Report 2020 devotes a significant number of pages to the impact of ML on science research.
AI in health and medicine has been frothy for fundraising, as this space has (1) a ton of data, and (2) high-value actionability. It's not just drug discovery, but also genetics and diagnostics. With the increase in telemedicine due to COVID, home medical tests may be where edge devices, sensors, AI, and data infrastructure collide.
Data Visualization. There hasn't been that much innovation in data visualization in 2020, but it's clear that data is a powerful accessory in effective storytelling. Life is complex and stories are what help us make sense of the world.
High-effort visualization needs to generate an equivalent amount of value, and stories are the perfect venue. Major media sites (NYTimes, Washington Post, Bloomberg, ProPublica) all showed up multiple times with banging data journalism, especially around open data such as COVID, wildfires, and the election. I am hopeful that open data lets organizations win on superior visualization, constructing cohesive, thoughtful narratives.
It's not always the big machines. I love the small publication The Pudding and will continue to feature it whenever possible. They've also come up with great meta-articles such as How To Make Dope Shit Part 3: Storytelling and Idea to Data Story. Among independent bloggers, Nathan Yau maintains the great blog FlowingData, and his visualization "A Timeline of California Wildfires" was the most-clicked link of this newsletter this year.
Innovation in business data visualization has been flat. From the big BI tool exits in 2019 (Looker, Tableau, Periscope) to the relatively underwhelming hype around embedded analytics, I wonder if the space has gradually moved from innovation to monetization. A few crazy ideas:
(1) Visualization of AI. AI operates as a black box and has few visual aids. With advances in AI explainability, maybe we will find novel visual representations of what AI is actually doing (today's explainability tooling hints at this; see the sketch after this list).
(2) AR/VR. What if we could navigate visualizations in 3D space? 2D visualizations already attempt to create additional dimensions through color, ranking, size, and shape, and an interactive dimension could help.
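On idea (1), open-source explainability libraries already give a taste of what these visuals could look like. Here's a rough sketch using the shap package to plot which features drive a model's predictions; the model and dataset are just placeholders, and exact API details may vary across shap versions.

```python
# Sketch: visualize what a black-box model is "looking at" with SHAP values.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Placeholder data and model; swap in your own.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Model-agnostic explainer over the model's prediction function.
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X.iloc[:200])  # explain a sample of rows

# One dot per row per feature: a visual summary of what the model relies on.
shap.plots.beeswarm(shap_values)
```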
AI continues to make advancements on human tasks. Whether it's teaching machines to talk like a person, understand conversations, or mimic human movement, we're creating the building blocks for robots to take in sensory inputs and act on them more effectively.
Language models made a big splash with GPT-3, the latest and biggest language model. It produced eerily good sentences, wrote literature, and even wrote code. For the best GPT-3 generated content, I point you to Gwern. It also hit some negative press, as the model revealed flaws such as the blatant racism it picked up from the internet, and OpenAI decided on an exclusive partnership w/ Microsoft. Outside of the dabbler tutorials, Twilio is promoting GPT-3 usage, ostensibly for chatbots on its platform.
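GPT-3 itself sits behind OpenAI's API, but you can get a feel for how these autoregressive models complete text using the much smaller, openly available GPT-2 via Hugging Face's transformers library. A minimal sketch; the prompt and sampling settings are arbitrary:

```python
# Text generation with GPT-2, a small open cousin of GPT-3.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "This week in data,"
outputs = generator(prompt, max_length=60, num_return_sequences=3, do_sample=True)
for out in outputs:
    print(out["generated_text"])
    print("---")
```

The difference with GPT-3 is mostly scale: same mechanism, vastly more parameters and training data, which is what produces the eerily good completions (and the scraped-from-the-internet biases).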
Voice transcription and NLP are hot in fields that involve a lot of structured communication, namely Sales and Support. Funding went to Chorus, Gong, Acto, Observe, Balto, TechSee, Slintel, Cogito, Verbit, and Ultimate.AI, alongside acquisitions (RingCentral ← DeepAffects, Snapchat ← Voca.ai). In the past half year alone I've seen 10 raises totaling $460M.
Robotics + Self-driving cars. Self-driving cars made a number of regulatory leaps this year, with partnerships and government approvals to run driverless tests. Amazon Prime Air and Flytrex got FAA certification for delivery drones. In business news, Uber dumped ATG to Aurora, Boston Dynamics was sold to Hyundai in a $1.1B deal, and Scale AI raised $200M to provide data labeling services to the space (as data labeling clones spawned overnight).
How data affects our day to day
One way data affects us each day is that all our digital behavior is tracked and then used for monetization or surveillance. This has become a significant social issue (what are our privacy rights?) and has expanded into two major categories: Consumer Privacy and Government Surveillance.
Consumer privacy has become a significant industry. Since Europe passed GDPR in 2016, the baseline consumer privacy requirements to operate a business have become more clear, and companies have little excuse to plead ignorance on handling user data responsibly.
Privacy Laws. Privacy Shield was struck down (Schrems II) as insufficient in protecting the privacy rights of EU citizens. California voters passed a revised data privacy law, the CPRA, via Prop 24. The US Senate has attempted national legislation through the SAFE DATA Act, though it has been criticized as too weak and still has a while to go. Other states and countries are at varying stages of legislation.
GDPR compliance created a data privacy industry, with a growing ecosystem of data privacy companies trying to replicate OneTrust's 484x three-year revenue growth. There are various angles, from sensitive data classification (BigID, $70M; CryptoNumerics, acq. by Snowflake) to data subject requests and streamlining internal operations.
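For a sense of what "sensitive data classification" means at its simplest, here's a toy sketch that scans table columns for PII-looking values with regexes. Real products (BigID and friends) go far beyond this, layering on ML, metadata, and lineage; the patterns and column names below are purely illustrative.

```python
import re

# Toy PII scanner: flag columns whose sampled values match known sensitive patterns.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?1?[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"),
}

def classify_column(values, threshold=0.8):
    """Return a PII type if most sampled values match a pattern, else None."""
    values = [v for v in values if v]
    if not values:
        return None
    for pii_type, pattern in PII_PATTERNS.items():
        hits = sum(bool(pattern.match(str(v).strip())) for v in values)
        if hits / len(values) >= threshold:
            return pii_type
    return None

# Hypothetical sample rows pulled from a table
sample = {
    "contact": ["ada@example.com", "grace@example.org"],
    "notes": ["called twice", "left voicemail"],
}
print({col: classify_column(vals) for col, vals in sample.items()})
# -> {'contact': 'email', 'notes': None}
```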
The Value of Your Data. The US military was found to have indirectly bought data from a Muslim prayer app, and the California and Arizona DMVs were found to be selling data to private investigators. On the advertising front, Andrew Yang has become a spokesperson for getting paid for your data via Data Dividends, though the concept is heavily opposed by organizations such as the EFF, which believes privacy is a right, not an economic instrument.
Government Surveillance is the other category of privacy. This year, a lot of focus shifted from broad government surveillance (Patriot Act, NSA, Snowden) toward local surveillance and how that data collection impacts civil liberties at the local level.
Thousands of police departments use facial recognition through contracts with Clearview AI, a facial recognition company that matches faces against a database of images scraped from social media. Given this privacy fear, some cities such as LA and Portland banned facial recognition this year, though not everyone is on board (Massachusetts Gov. Baker, for one). The year was capped off by an article on the mistaken arrest of Nijeer Parks in Feb 2019, the third documented false arrest due to incorrect facial recognition.
Another popular police partnership has been with Amazon Ring (a doorbell + camera combo) to obtain neighborhood video surveillance after crimes. Civil and tech liberty groups such as the EFF strongly oppose these partnerships due to bias and a lack of transparency about when these tools are used. Communities have also been installing Automated License Plate Readers (ALPRs), which come with similar police partnerships and dangerous downstream consequences, as biases and fear fan the flames of paranoia.
The impact of data in government is predictable and depressing. In Florida's Pasco County (near Tampa), predictive policing led to harassment of the same families over and over again. EFF's Atlas of Surveillance was created this year to show how pervasive this tooling is across the US. Of course, the broader government surveils us too, with DHS ramping up its biometric collection from immigrants.
Long term data trends
AI Ethics continued to be an important topic this year. There is less emphasis on killer robots, and the conversation has moved primarily toward algorithmic bias and how to steer AI toward fairness.
This year gave us Timnit Gebru's forced resignation from Google, which shocked the AI ethics community and called into question Google's commitment to ethics and diversity. The narrative is shifting: we can no longer trust tech to regulate itself.
Some orgs have attempted to improve the transparency of AI models. In the second half of 2020, LinkedIn open sourced the LinkedIn Fairness Toolkit (LiFT), Salesforce put out its Simulation Scorecard, and Google came out with its Model Card Toolkit. The city of Amsterdam pledged to be transparent about where it uses AI through the Amsterdam AI Registry.
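I won't vouch for the exact APIs of these toolkits here, but the core of most fairness checks is simple: compute the same metric per demographic group and compare. A toy demographic-parity check, with entirely hypothetical numbers:

```python
import numpy as np

def demographic_parity_gap(y_pred, groups):
    """Gap between the highest and lowest positive-prediction rate across groups."""
    rates = {g: y_pred[groups == g].mean() for g in np.unique(groups)}
    return max(rates.values()) - min(rates.values()), rates

# Hypothetical model outputs: 1 = approved, 0 = denied
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
groups = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

gap, rates = demographic_parity_gap(y_pred, groups)
print(rates)                                  # per-group approval rates
print(f"demographic parity gap: {gap:.2f}")   # a large gap warrants investigation
```

Real toolkits layer on many more metrics (equalized odds, calibration, and so on) plus statistical tests, but they all start from this kind of group-wise comparison.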
It's a huge space, and I recommend the following articles: Carly Kind's piece in VentureBeat, "The Term 'Ethical AI' Is Finally Starting to Mean Something," and Karen Hao's piece in the MIT Tech Review on her hopes for AI ethics in 2021.
Hardware is on a steady march, as more software applications rely on increasingly high computational power to give consumers the products they want.
Processors are becoming more specialized. Amazon (Inferentia), Microsoft, Google (Pixel chips, TPUs), Apple (M1), etc. have all come up with their own chip initiatives, and we'll see who follows. I enjoyed Alex Irpan's piece on AI timelines and A16Z's article on AI economics, both touching on the effect of compute innovation on AI products and capabilities. With the potential $40B Nvidia + ARM acquisition, we will see how market dynamics shift as the big players all try to do a little bit of everything.
Edge processing is finding its niche. AI applications need low-latency processing. One big market is robotics and self-driving cars, e.g. Tesla's HW 4.0 w/ TSMC. In October, AMD agreed to acquire Xilinx for $35B, shifting its portfolio to include edge devices. Consolidation and innovation will continue as more devices start to benefit from on-device processing.
Final Thoughts
The data world is complicated and "What is data?" is a metaphysical question I'm not going to attempt to answer. But as we look around, it's increasingly clear that the way we approach data is foundational to the way we contribute to and craft the society we live in. I'm privileged to have a chance to aggregate and make sense of data, and I'm privileged to have a great audience who cares.
Thank you so much for your support, and I hope you have a great 2021!