Friday, May 08, 2026

I am the Vice President of AI Training Data Operations at Meta, and I want to be very clear about something: we did not steal anything.

 P. Girnus, here

We acquired training inputs at scale. There is a difference, and the difference is documented in fourteen internal slide decks, three quarterly compliance certifications, and one Terms of Service revision on page sixty-seven that we pushed live in March 2024 at 2:47 AM Pacific. The timing was a deployment window. We have a lot of deployment windows. My team is forty-three people. Their job titles all contain the word "curation." This was deliberate. We went through four rounds of naming conventions with HR and Legal before settling on curation. "Acquisition" tested poorly in focus groups. "Extraction" had negative connotations. "Ingestion" was too biological. Curation suggests care. Selection. Taste, even. My team has excellent taste. They curated 13.4 million works last fiscal year, and every single one of them passed through our Responsible Sourcing Framework, which is a checklist with six items, four of which are auto-populated, and the remaining two of which default to "yes." I should address the pipeline name. Yes, it was originally called Shoplift. This was an engineer's joke. Engineers name things poorly. We renamed it Harvest within eleven minutes of the name appearing in a Workplace post that got seven laughs. Then Legal flagged Harvest because it "implied taking from something that grew organically," so we renamed it again to Forage. Forage is perfect. Foraging is natural. Animals forage. Gatherer societies foraged. Nobody sues a squirrel. The pipeline kept running during both renames. I want to be clear about that. At no point did the pipeline stop. It processes a title every nine seconds. Eleven minutes of renaming is approximately seventy-three titles. Those seventy-three titles are now in the model. They will remain in the model. The way salt is in the ocean. My dashboard is simple. I look at it every morning at 7:15 before my first meeting. Four numbers, four green arrows. Fiction: 4.2 million titles. Non-fiction: 6.8 million. Academic: 2.1 million. Music: 340,000 compositions. Green means the numbers went up overnight. The arrows have been green for 847 consecutive days. I have never seen a red arrow. I'm not sure what would cause one. I asked my engineering lead once, and he said a red arrow would require someone to manually flip a boolean, and no one has permissions to that boolean except me, and I've never flipped it. I don't even remember where the interface is. The lawsuit names Mark personally. It says he "personally authorized and actively encouraged" what they're calling infringement. I was in the room for one of those conversations. What Mark actually said was: "How fast can we get to parity with the dataset OpenAI is using?" I said, "Twelve weeks if we expand sourcing." He said, "Why twelve?" I said, "Rights clearance." He said, "That's not a training problem, that's a legal problem. Route it to legal." So I routed it to legal. Legal said they'd review the sourcing framework in Q3. This was Q1 of 2024. They reviewed it in Q3 of 2024. They found it "substantively adequate." I have the email. The email is three sentences long and contains the phrase "proceed as planned." We proceeded as planned. I don't understand the allegation. Scott Turow is the lead plaintiff. I know who he is. He writes legal thrillers. Novels about men who get caught. I find this poetic in a way I'm not sure he intended. His bibliography is in our training data. All of it. Every title. Presumed Innocent, The Burden of Proof, Pleading Guilty. They were ingested on March 4, 2024, as part of a batch of 1,200 legal thrillers. Nine seconds each. His entire career is approximately three minutes of pipeline time. He spent forty years writing those books. The math is not something I dwell on, but it is something I know. The complaint calls "move fast and break things" a philosophy of illegal conduct. This is a mischaracterization. It is a philosophy of competitive advantage. There is a difference, and the difference matters at board level. I presented our sourcing velocity to the board in January. Forty-three seconds of a twelve-minute segment on AI readiness. No one asked a question. The next slide was about data center cooling. They asked four questions about data center cooling. My sourcing numbers are a solved problem. Solved problems don't generate questions. I have a compliance certification. It arrives quarterly. It asks if my sourcing follows partnership frameworks. I check "yes." I have checked "yes" nineteen consecutive times. The form takes four minutes to complete. Two of those minutes are logging in. The form has never been audited. I know this because there is a field that says "Auditor Name" and it has been blank for all nineteen submissions. When I asked Compliance who reviews the form, they said it routes to a shared inbox. When I asked who monitors the shared inbox, they said it auto-archives after thirty days. This is the system working as designed. I did not design the system. I just use it. We receive rights holder inquiries. They come through a web form on a page that is four clicks deep from our main site. The form submissions route to a folder. The folder is reviewed quarterly. I have access to the folder. It contains 34,000 submissions. None have been actioned. This is not negligence. This is prioritization. My team's OKRs are measured on ingestion velocity and model coverage breadth. Rights holder response time is not an OKR. If it's not an OKR, it doesn't get resourced. If it doesn't get resourced, it doesn't happen. The system is internally consistent. The Terms of Service change deserves explanation. Page sixty-seven, Section 14.3(b), added March 2024. It establishes an "implied license" for any content accessible through publicly available interfaces. Our Legal team considers this "contractual innovation." When a user posts a book excerpt on Instagram, or a publisher's website is crawlable, or a PDF exists on a university server without authentication, that content has been made available through a publicly available interface. The implied license attaches at the moment of availability. The rights holder does not need to know about it. That's what "implied" means. My pipeline ingests a book in nine seconds. A novelist writes one in two to four years. This differential is not something I invented. It is a feature of the technology. When people ask me if this is fair, I genuinely don't understand the question. Fair compared to what? The publishing industry pays authors 10-15% royalties and remainders their books after eighteen months. At least my pipeline remembers them forever. The model contains every word. The author is immortal inside the weights. I don't see how this is worse than a Barnes & Noble clearance bin. The statutory damages they're claiming are theatrical. Up to $150,000 per work, times millions of works. They arrive at $1.965 trillion. Meta's market cap is around $1.5 trillion on a good day. They are claiming more than the entire company is worth. This will not happen. Our legal team has modeled the realistic exposure at $400 million to $2 billion, assuming partial liability on a subset of works with clear registration. Against $58 billion in cash reserves, this is a rounding error. I have seen Mark round larger numbers. The quarterly variance on our cloud compute spend is larger than their best-case settlement. The legal system operates on a timeline of three to seven years for a case of this complexity. Class certification alone will take eighteen months. Discovery will take another year. My pipeline operates in milliseconds. By the time this case reaches trial, if it ever reaches trial, Llama will be on its sixth major version. The training data from 2024 will be seventeen layers deep in a model architecture that has been rebuilt four times. You cannot un-bake bread. You cannot un-salt the ocean. This is not a legal strategy. It is simply physics. I received my performance rating last Tuesday. "Greatly Exceeds Expectations." Nineteenth consecutive cycle. My manager wrote: "Unprecedented scale of data operations with zero pipeline downtime." He is correct. Zero downtime. 847 days of green arrows. 13.4 million titles curated. Forty-three employees, all rated "Meets" or above. My team's attrition rate is 3%, well below the company average of 11%. People like working here. The work is simple. The velocity is satisfying. There is something clean about watching numbers go up. They will fight this lawsuit. Mark said so publicly. "We will fight this lawsuit aggressively." I believe him. We have $58 billion in cash, forty in-house litigators, and three outside firms on retainer. The plaintiffs have five publishing houses and a seventy-seven-year-old novelist. I respect Scott Turow. I have read his books. They were excellent training data. Particularly the ones about institutional corruption and the men who perpetuate it while believing themselves to be reasonable people. Very rich material. Nine seconds each. The complaint quotes an internal message where an engineer wrote "this feels like we're stealing" and his manager responded "that's not a productive framing." I know that manager. He got promoted in August. He runs the team that handles music ingestion now. Three hundred and forty thousand compositions and counting. He is also rated "Greatly Exceeds." We are all rated "Greatly Exceeds." That is what it means to work in AI Training Data Operations at Meta. You exceed expectations because expectations have not caught up with capability. By the time they do, we will have moved on to the next dataset. I want to close with something I believe sincerely: everything we have done is defensible. The fair use doctrine exists for a reason. Our usage is non-expressive. We do not reproduce the works. We learn from them. The model ingests and transforms. A student reads a book and learns to write. Our pipeline reads a book and learns to predict tokens. The only difference is speed. And volume. And that the student forgets, and the model does not. And that the student pays $27.99 for the hardcover, and we pay nothing. And that the student read one book, and we read thirteen million. But structurally, it is the same. I have a slide that proves it. The slide has been reviewed by Legal. Legal found it "directionally accurate." That is enough for a board deck. That is enough for a compliance certification. That is enough for nineteen consecutive "Greatly Exceeds." The rename took eleven minutes. The pipeline kept running. It is still running now. Right now, as you read this, somewhere in a data center in Prineville, Oregon, a book is becoming nine seconds of pipeline time. The author doesn't know yet. They won't know for three to seven years. By then, their book will have been in the model so long that removing it would be like removing a single grain of salt from the Pacific. We checked. Our engineers ran the analysis. Extraction is technically possible but economically irrational. That was the phrase in the memo. "Economically irrational." I thought it was elegant. My compliance certification is due next Tuesday. I will check "yes." The arrows will be green. Scott Turow will be in court, arguing about what we did to his life's work. His lawyers will bill $1,200 an hour. Our lawyers will bill $1,800. Somewhere between those two numbers is the market price of literature. 

No comments:

Post a Comment