CS 525: Training Data for AI
Vaishaal Shankar
Switching gears from image datasets to text datasets & language models
A language model learns to predict the next token given the previous tokens:
loss = xent(model(seq[:-1]), seq[1:])
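The pseudocode above can be run end to end; a toy numpy sketch, where the linear "model" `W` is an invented stand-in for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10
W = rng.normal(size=(vocab_size, vocab_size))  # stand-in "model": one row of logits per previous token

def model(tokens):
    # next-token logits for each position: row t predicts tokens[t+1]
    return W[tokens]

def xent(logits, targets):
    # mean cross-entropy between rows of logits and integer targets
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

seq = np.array([3, 1, 4, 1, 5, 9, 2, 6])
loss = xent(model(seq[:-1]), seq[1:])  # predict token t+1 from token t
```

The shift by one position is the whole trick: inputs are `seq[:-1]`, targets are `seq[1:]`.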
Tokens are the atomic units the model reads — subword pieces, not full words:
Loss decreases as a power law (Kaplan et al. 2020):
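For reference, Kaplan et al. fit each resource separately when the others are not bottlenecks (coefficients approximate, from the paper):

\( L(N) = (N_c / N)^{\alpha_N}, \quad \alpha_N \approx 0.076 \)

\( L(D) = (D_c / D)^{\alpha_D}, \quad \alpha_D \approx 0.095 \)

where N is parameter count and D is dataset size in tokens.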
Hoffmann et al. (2022) trained 400+ models from 70M to 16B parameters:
For every doubling of model size, you should also double the training tokens:
| Model | Parameters | Training Tokens | Tokens/Param |
|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 |
| Gopher | 280B | 300B | 1.1 |
| MT-NLG | 530B | 270B | 0.5 |
| Chinchilla | 70B | 1.4T | 20 |
Scaling laws assume infinite high-quality IID data. But quality degrades as you scale:
Most open-source models today train far beyond Chinchilla-optimal:
| Model | Parameters | Chinchilla Optimal | Actual Tokens | Overtraining Factor |
|---|---|---|---|---|
| Llama 2 7B | 7B | ~140B | 2T | 14× |
| Mistral 7B | 7B | ~140B | ~8T (est.) | ~57× |
| Llama 3 8B | 8B | ~160B | 15T | 94× |
| Qwen 2.5 7B | 7.6B | ~150B | 18T | 120× |
Chinchilla optimal ≈ 20 tokens per parameter
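The rule of thumb in the tables above is easy to check in a couple of lines (a sketch; parameter and token counts are the rounded public figures):

```python
# Chinchilla rule of thumb: optimal training tokens ≈ 20 × parameters.
def chinchilla_optimal_tokens(params):
    return 20 * params

def overtraining_factor(params, actual_tokens):
    return actual_tokens / chinchilla_optimal_tokens(params)

# Llama 3 8B: 15T tokens on 8B parameters
factor = overtraining_factor(8e9, 15e12)  # ≈ 94, matching the table
```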
Why train beyond Chinchilla-optimal?
What about models trained with less data than Chinchilla-optimal? (GPT-3 era)
LLMs convert training data into capability. More data generally means better models.
We use the methodology from DataComp for Language Models (DCLM).
There may be better datasets now, but the methodology for evaluating datasets remains sound.
To compare datasets, we need to isolate the effect of the data:
MMLU (Massive Multitask Language Understanding) — 57 subjects, 15,908 questions
Question (World History):
Which of the following nations was a combatant in World War I for the longest period?
(A) Japan (B) USA (C) Ottoman Empire (D) Germany
Answer: (D)
Tests factual knowledge across STEM, humanities, social sciences, law, medicine...
HellaSwag — Commonsense reasoning about everyday situations (70K questions)
Context:
A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She...
(A) rinses the bucket off with water from the hose.
(B) uses a hose to blow water into the dog's mouth.
(C) chases the dog around the yard with the bucket.
(D) gets the dog wet, then lathers it with soap.
Answer: (D)
Humans score ~95%; doing well requires models to understand the physical world and common activities
WinoGrande — Pronoun resolution requiring commonsense (44K problems)
Sentence:
The trophy doesn't fit into the brown suitcase because it is too [large/small].
If "large" → "it" refers to the trophy
If "small" → "it" refers to the suitcase
Requires understanding relative sizes, physical constraints, world knowledge
We group tasks into two aggregates:
We evaluate at 4 compute scales:
| Scale | Parameters | Tokens | ~GPU Hours |
|---|---|---|---|
| XS | 412M | 8B | ~50 |
| S | 1B | 29B | ~500 |
| M | 3B | 56B | ~2,500 |
| L | 7B | 138B | ~10,000 |
Similar to DataComp (images), we see strong correlation across scales for text datasets:
~1.4% of original documents retained
Common Crawl provides raw HTML. How do we extract text?
| Method | Description | CORE | Speed |
|---|---|---|---|
| WET files | Pre-extracted by Common Crawl | 20.7 | - |
| trafilatura | Strict extraction, removes boilerplate | 24.5 | 1x |
| resiliparse | Fast, good quality extraction | 24.1 | 8x |
Site Index The New York Times Site Index Navigation News World U.S. Politics N.Y. Business ... Skip to content Skip to navigation Subscribe Now Log In ... HERE is a sampling of some of the better antiques... ... © 2019 The New York Times Company Terms of Service Terms of Sale Site Map Help
This is a digitized version of an article from The Times's print archive, before the start of online publication in 1996. May 10, 1990, Page 00006 The New York Times Archives HERE is a sampling of some of the better antiques and flea markets around the United States. Two or Three Times a Year BRIMFIELD Route 20, Brimfield, Mass. 01010; 413-245-3436...
resiliparse removes navigation, footers, boilerplate → cleaner content
We found RefinedWeb outperformed all other open-source datasets (of the time):
| Dataset | Sources | CORE |
|---|---|---|
| C4 | Common Crawl + heuristics | 34.2 |
| Dolma-V1 | Common Crawl + Wiki + Books + ... | 35.0 |
| RedPajama | Common Crawl + Wiki + Books + ... | 35.3 |
| RefinedWeb | Common Crawl only | 36.9 |
We reproduce the filters from RefinedWeb (Penedo et al., 2023) — a key paper we build on:
| Filter | Example removed | % Removed |
|---|---|---|
| Non-English | "Bienvenue sur notre site web..." | ~55% |
| Too short / too long | "Click here \| Home \| About" | ~10% |
| Repetition | "Buy now! Buy now! Buy now!..." | ~8% |
| Low stop-word ratio | "img_2847.jpg img_2848.jpg img_2849.jpg" | ~5% |
Combined, these heuristics remove ~80% of documents
A subsample may preserve quality distribution, but duplication behavior is more complex.
| Method | Granularity | Type | Key Idea |
|---|---|---|---|
| Exact hash | Document | Exact | Hash entire document |
| MinHash + LSH | Document | Fuzzy | Approximate Jaccard similarity |
| Suffix Array | Substring | Exact | Find repeated substrings |
| Bloom Filter (BFF) | Paragraph + Doc | Exact/Fuzzy | N-gram membership |
Prior work (GPT-3, RefinedWeb) used MinHash + Suffix Array
Goal: Efficiently estimate the Jaccard similarity between documents:
\( J(A, B) = \frac{|A \cap B|}{|A \cup B|} \)
where A, B are the sets of n-grams (shingles) from two documents
More hash functions → better estimate. Typical: 100-1000 hash functions.
Total hashes k = b × r. Trade-off: more hashes = more accurate but slower
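A minimal MinHash sketch, using the \( h(x) = (ax + b) \bmod p \) family from the appendix (hashing shingles via md5 is an illustrative choice, not the production implementation):

```python
import hashlib
import random

P = 2**61 - 1  # a large Mersenne prime

def shingles(text, n=5):
    # word-level n-grams, hashed to integers mod P
    words = text.split()
    out = set()
    for i in range(len(words) - n + 1):
        digest = hashlib.md5(" ".join(words[i:i + n]).encode()).digest()
        out.add(int.from_bytes(digest[:8], "big") % P)
    return out

def signature(shingle_set, hash_params):
    # for each hash function h(x) = (a*x + b) mod P, keep the minimum over all shingles
    return [min((a * x + b) % P for x in shingle_set) for a, b in hash_params]

rng = random.Random(0)
hash_params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(256)]

doc1 = "the quick brown fox jumps over the lazy dog " * 5
doc2 = "the quick brown fox leaps over the lazy dog " * 5
s1, s2 = shingles(doc1), shingles(doc2)

exact = len(s1 & s2) / len(s1 | s2)
sig1, sig2 = signature(s1, hash_params), signature(s2, hash_params)
estimate = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

The fraction of matching signature positions is an unbiased estimate of the exact Jaccard similarity.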
Reshape the signature vector into a b × r matrix (b bands of r hashes each). Two documents match if ANY band is identical:
k = b × r total hash functions. More bands b → catches more duplicates. More rows r → fewer false positives.
Why does this work? The probability that two documents match in at least one band depends on their Jaccard similarity s:
\( P(\text{match}) = 1 - (1 - s^r)^b \)
This creates an S-curve: low similarity → ~0% match, high similarity → ~100% match
| Setting | Value |
|---|---|
| N-gram size | 5 |
| Target Jaccard threshold | 0.8 |
| Bands (b) | 93 |
| Rows per band (r) | 15 |
| Total hashes (k = b × r) | 1,395 |
| Hash family | \( h(x) = (ax + b) \bmod p \) |
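Plugging these settings into the band-match probability \( 1 - (1 - s^r)^b \) shows the S-curve is steep around the 0.8 target:

```python
# LSH match probability for signatures split into b bands of r rows each,
# using the DCLM settings b=93, r=15 from the table above.
def p_match(s, b=93, r=15):
    return 1 - (1 - s**r)**b

low = p_match(0.6)   # ≈ 0.04: pairs well below threshold almost never match
mid = p_match(0.7)   # ≈ 0.36: the transition region
high = p_match(0.8)  # ≈ 0.96: pairs at the target threshold almost always match
```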
Find and remove repeated substrings across the corpus
Problem: Check if an n-gram has been seen before (in a set of billions)
Solution: A bit array + multiple hash functions
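A minimal Bloom filter along these lines (sizes here are illustrative, not the production settings):

```python
import hashlib

class BloomFilter:
    # bit array of m bits + k hash functions; no false negatives,
    # false positives at a rate controlled by m and k
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # derive k independent positions by personalizing the hash per index
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), person=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
for ngram in ["the quick brown fox jumps", "over the lazy dog today"]:
    bf.add(ngram)
```

Membership checks never miss an added item; a "yes" for an unseen item happens only with small, tunable probability.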
We extend Bloom filters to work at both paragraph and document level:
| Parameter | Value | Reasoning |
|---|---|---|
| min_ngram_size | 13 | Avoid removing short list items (recipes, MCQs) |
| max_ngram_size | 13 | Sufficient for uniqueness |
| threshold | 0.8 | Match Jaccard target from MinHash |
| false_positive_rate | 0.01 | Hoeffding bound shows this is safe |
We compare different dedup strategies on the same data (1B scale):
| Method | Tokens Left | Removed | Eval Score | Δ |
|---|---|---|---|---|
| No dedup | 76B | 0% | 24.7 | - |
| Exact hash only | 66B | 13% | 26.0 | +1.3 |
| MinHash only | 62B | 18% | 25.6 | +0.9 |
| Suffix Array only | 51B | 33% | 26.6 | +1.9 |
| Bloom Filter (ours) | 56B | 26% | 26.8 | +2.1 |
| All three combined | 45B | 41% | 26.8 | +2.1 |
Bloom filter alone matches the full combined pipeline!
| Method | min_ngram | Shards | MMLU | Aggregate | Tokens |
|---|---|---|---|---|---|
| Bloom Filter | 5 | 32 | 32.5 | 44.5 | 3.9T |
| Bloom Filter | 13 | 10 | 44.3 | 45.3 | 3.8T |
| Bloom Filter | 20 | 10 | 43.6 | 45.8 | 3.9T |
| MinHash + Suffix Array | N/A | 16 | 44.4 | 45.5 | 3.2T |
Sharding: Splitting the corpus into independent chunks for parallel processing
If doc A appears in shard 1 and shard 50, both copies survive.
We hoped cross-shard duplicates would be rare...
Heuristics and deduplication help, but can we do better with machine learning?
fastText (Facebook, 2016) is a fast, shallow neural network for text classification:
The classifier needs positive examples. What should we use?
OpenHermes 2.5, a dataset of ~1 million instruction-response pairs:
We also add ELI5 (Explain Like I'm 5) — a Reddit Q&A dataset with upvoted explanations.
| Positive Reference Data | Keep Top | CORE | MMLU |
|---|---|---|---|
| Wikipedia | 10% | 35.7 | 27.0 |
| OpenWebText2 | 10% | 34.7 | 25.0 |
| Wiki + Books + OpenWebText | 10% | 37.5 | 24.4 |
| OpenHermes 2.5 + ELI5 | 10% | 41.0 | 29.2 |
Honest answer: We don't really know!
Some hypotheses:
| Keep Top % | CORE | MMLU |
|---|---|---|
| 10% | 41.0 | 29.2 |
| 15% | 39.8 | 27.2 |
| 20% | 38.7 | 24.2 |
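Mechanically, "keep top p%" just thresholds at a score percentile; a stdlib sketch with made-up scores (the classifier scores here are random placeholders):

```python
import random

def keep_top_percent(docs_with_scores, p=10):
    # rank documents by classifier score and keep the top p%
    ranked = sorted(docs_with_scores, key=lambda ds: ds[1], reverse=True)
    n_keep = max(1, int(len(ranked) * p / 100))
    return [doc for doc, _ in ranked[:n_keep]]

rng = random.Random(0)
docs = [(f"doc{i}", rng.random()) for i in range(1000)]
kept = keep_top_percent(docs, p=10)  # 100 documents survive
```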
We tried many approaches to identify high-quality documents:
| Method | Description | CORE |
|---|---|---|
| Heuristics only | RefinedWeb filters | 27.5 |
| PageRank top 20% | Keep well-linked pages | 26.1 |
| Perplexity filtering | Keep low-perplexity text | 29.0 |
| LLM-as-judge | Ask Mistral-7B if doc is useful | 28.6 |
| fastText classifier | Trained on instruction data | 30.2 |
Concern: might using instruction data for filtering "use up" the gains from instruction tuning later?
If evaluation examples appear in training data, benchmarks become meaningless.
After running decontamination on DCLM-BASELINE:
< 0.01%
of documents removed due to contamination
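Decontamination here is an n-gram overlap check against the eval sets; a simplified sketch (the word-level 13-gram choice is illustrative, not the exact DCLM procedure):

```python
def ngrams(text, n=13):
    # word-level n-grams, lowercased
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_doc, eval_texts, n=13):
    # flag a training document that shares any n-gram with an eval example
    eval_grams = set().union(*(ngrams(t, n) for t in eval_texts))
    return bool(ngrams(train_doc, n) & eval_grams)

eval_q = "alpha bravo charlie delta echo foxtrot golf hotel india juliett kilo lima mike"
flagged = contaminated("intro text " + eval_q + " outro", [eval_q])
clean = contaminated("a totally unrelated document", [eval_q])
```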
For our final model, we combined DCLM-BASELINE with code and math data:
64% MMLU
with 7B parameters, 2.6T tokens
Does DCLM-BASELINE pretrained model instruction-tune well?
| Model | Pretrain Data | IFEval | GSM8K | MMLU | BBH |
|---|---|---|---|---|---|
| Mistral-7B-Instruct | Closed | 57.2 | 40.0 | 53.9 | 42.2 |
| Llama-2-7B-Chat | Closed | 42.5 | 23.3 | 48.0 | 35.6 |
| DCLM-7B-Instruct | Open (ours) | 59.3 | 51.2 | 63.2 | 45.1 |
I'll show you documents from our pipeline. Guess which stage they came from:
Raw Common Crawl text. No filtering. Mostly junk.
After heuristic filters. Better, but still noisy.
After heuristic + ML filtering. Best eval performance.
My Photo
MY Resume
May 2013
Sun Mon Tue Wed Thu Fri Sat
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
Become a Fan
Blog powered by TypePad
« how about a little Echo Park | Main | go check this out! »
April 22, 2011
TrackBack
TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00e54ee675c6883301538e102cfc970b
Listed below are links to weblogs that reference Have a fantastic Easter break:
Comments
jill
thinking of you lately!!!!! enjoy your family today.
xox,
j.
Verify your Comment
Previewing your Comment
This is only a preview...
Pool, RefinedWeb, or DCLM-BASELINE?
Actual sample from DCLM-Pool — a blog sidebar:
Heuristic filters remove this — too short, mostly boilerplate
[ style: dark classic gorilla ] Photo Gallery : photo No. 1 photo No. 2 photo No. 3 photo No. 4 photo No. 5 photo No. 6 photo No. 7 photo No. 8 photo No. 9 photo No. 10 photo No. 11 photo No. 12 photo No. 13 photo No. 14 photo No. 15 photo No. 16 photo No. 17 photo No. 18 photo No. 19 photo No. 20 photo No. 21 photo No. 22 photo No. 23 photo No. 24 photo No. 25 photo No. 26 photo No. 27 photo No. 28 photo No. 29 photo No. 30 photo No. 31 photo No. 32 photo No. 33 photo No. 34 photo No. 35 photo No. 36 photo No. 37 photo No. 38 photo No. 39 photo No. 40 photo No. 41 photo No. 42 photo No. 43 photo No. 44 photo No. 45 photo No. 46 photo No. 47 photo No. 48 photo No. 49 photo No. 50 photo No. 51 photo No. 52 photo No. 53 photo No. 54 photo No. 55 photo No. 56 photo No. 57 photo No. 58 photo No. 59 photo No. 60 photo No. 61 photo No. 62 photo No. 63 photo No. 64 photo No. 65 photo No. 66 photo No. 67 photo No. 68 photo No. 69 photo No. 70 photo No. 71 photo No. 72 photo No. 73 photo No. 74 photo No. 75 photo No. 76 photo No. 77 photo No. 78 photo No. 79 photo No. 80 photo No. 81 photo No. 82 photo No. 83 photo No. 84 photo No. 85 photo No. 86 photo No. 87 photo No. 88 photo No. 89...
Pool, RefinedWeb, or DCLM-BASELINE?
Photo gallery navigation page — actual sample from DCLM-Pool:
Heuristic filters catch this: repetition filter removes it
Fishing Lures Product Image Item Name- Price Early Shurebite Frog Fishing Lure w/Box Nice early Shurebite Frog Fishing Lure. Lure comes with its original box and was made by Shurebite, Inc. Bronson, Michigan. Im guessing Lure is from the 1950s to early 60s. Lure itself appears to be in good condition. Some light wear to the wood top portion but Lure does not appear to have been used. The end of the Box lid does have a couple punctures in it and the top lid has in ink writing #250. Lure would make a fine addition to any Vintage Fishing collection. Buyer to pay shipping & insurance. Fishing Lure Panatella by South Bend Fishing Lure by South Bend a Panatella fishing lure. Red and white (now creamy) paint with tack eye. Three sets of hooks, silver propellor. The paint is crazed & stained in some areas. 3 3/4 inches wooden lure. The panatella minnow was made (1912-1942) Red white paint tack eye. Heddon 4in Dowagic Crab Wiggler Fishing Lure Heddon 4 in Dowagesic Crab Wiggler Fishing Lure. Patented 1916 wooden lure has seen some action. Original yellow paint has cracks in the paint & some dings...
Pool, RefinedWeb, or DCLM-BASELINE?
Actual sample from DCLM-RefinedWeb. Passed heuristic filters.
But the ML classifier rejected it — not in DCLM-BASELINE.
We don't know exactly why.
Take the 2-minute tour × Here what happened with me today. TimeMachine asked me whether I want to set a backup disk, I've answered yes, but then, when I've realized that in order to backup anything TimeMachine will clean the disk, I've changed my mind and canceled everything. And my disk suddenly became read only. What I've tried before Googling: $ sudo chflags -R nouchg Elements/ $ sudo chmod -R a+w Elements/ But I've failed with both of this, getting "read-only file system" messages. What I've tried after Googling: 1. Open Disk Utilities 2. Click Repair Disk Permissions But this button is disabled, and I have no idea what exactly should be done to enable it. I have been using this disk for a quite a long time, and never had any permission issues with it. (Disk is formatted as NTFS, if that helps. Capacity is 2TB, of which 1.92TB are available.) I'd really appreciate if someone will give me a hint how this can be resolved. — Try booting from the Recovery HD and see if the button is still disabled.
Pool, RefinedWeb, or DCLM-BASELINE?
Actual sample from DCLM-BASELINE (apple.stackexchange.com).
We don't know exactly why the classifier scored it highly — that's part of the mystery we're exploring.
Host: Zander Program Category: Music Frequency: Weekly Length: 2 Hours Terms: Barter Delivery Method: Internet "Zander's knowledge of music and his straight-forward approach has struck a huge interest among our listeners. The Rockin' 80's is EXACTLY what we've been looking for!" - Terry West, WQLA The Rockin' 80's is the only 80's show with a mix of the best rock from the decade of excess plus "oh wow" tracks that add spice to the weekly line up. The two hour version of the show features rarities from the 80's "Lost and Found", an 80's "Two-Fer," spotlighting two contrasting songs from one band played back to back. |3733 Park East Drive • Room 222 • Cleveland, Ohio 44122 P: 216-831-3761 • F: 216-514-4699 |©2014 Envision Networks. All rights reserved.
Pool, RefinedWeb, or DCLM-BASELINE?
This passed heuristic filters.
But the ML classifier rejected it — not in DCLM-BASELINE.
Interesting — it has some real content about the radio show. Why was it rejected? We don't really know.
Is it necessary to purchase a travel book or is it realistic that we can get similar information from other resources? Usually, most individuals have a major question on buying a travel book. So here are the pros and cons of purchasing one such book. Advantages of a Travel Book A travel book, which may be a paperback or e-book, comes in handy while traveling. Glancing through a travel book enables you to understand the custom and culture of a particular place in the world. - They Come In Handy — The travel guide comes in various forms such as, e-books, paperbacks and the file formats. You can have easy access to these books, which would assist you with all details compatible to the region you are traveling to. - They Provide Enormous Information — Electronic or traditional travel guides provide you with answers to all types of questions such as how to learn some sayings that can be used in the place where you are traveling to? Disadvantages of Travel Book - The Price — The e-book and paperback travel guides are very expensive compared to the information obtained from travel websites. - Travel Books Make The Trip Less Natural — Traveling can be made more spontaneous by acquiring suggestions from locals than from travel books.
Pool, RefinedWeb, or DCLM-BASELINE?
This document passed both heuristic filters and the ML classifier.
Why did the classifier like this one? We don't really know — and that's the point.
If you found it tricky, you're not alone...
| Method | Agreement with Humans | Downstream Score |
|---|---|---|
| AskLLM (Mistral-7B) | 82% | 28.6 |
| fastText (instruction data) | 73% | 30.2 |
The method that disagrees more with humans produces better training data!
Hypothesis: Humans may over-value "polished" content and under-value diversity.
We can't define it a priori. Instead:
Why we chose Bloom filter deduplication:
Two papers analyzed DCLM after publication:
We did not know this while writing the paper.
Paper: arxiv.org/abs/2406.11794
Website: datacomp.ai/dclm
Code: github.com/mlfoundations/dclm
Questions?
(Yes, Claude made these slides too.)
I gave Claude Code access to:
- The paper source (neurips_data_2024.tex), tables, and figures
- data.commoncrawl.org
- index.html (~2800 lines) + figures/ (10 PNGs)
Optimal number of hash functions: \( k = \frac{m}{n} \ln 2 \)
Optimal size m for n elements at false-positive rate ε: \( m = -\frac{n \ln \varepsilon}{(\ln 2)^2} \)
For 1T tokens with ε=10⁻¹² → 6.5TB RAM
With ε=0.01 → much smaller, still safe due to threshold
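The standard sizing formulas \( m = -n \ln \varepsilon / (\ln 2)^2 \) and \( k = (m/n) \ln 2 \) make this trade-off concrete (a sketch; the slide's RAM figures also account for implementation overhead):

```python
import math

def bloom_size_bits(n, eps):
    # optimal bit-array size: m = -n * ln(eps) / (ln 2)^2
    return -n * math.log(eps) / math.log(2) ** 2

def bloom_num_hashes(m, n):
    # optimal number of hash functions: k = (m / n) * ln 2
    return (m / n) * math.log(2)

n = 1e12  # roughly 1T n-grams
ratio = bloom_size_bits(n, 1e-12) / bloom_size_bits(n, 0.01)  # ≈ 6x more memory for eps=1e-12
```

The memory cost scales with \( -\ln \varepsilon \), so relaxing ε from 10⁻¹² to 0.01 cuts the bit array by roughly 6×.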
Why ε=0.01 is safe with threshold T=0.8:
For a document with N n-grams of which S are true duplicates, a false flag requires at least NT − S of the N − S novel n-grams to be false positives. Hoeffding's inequality bounds this:
\( P(\text{false duplicate}) \le \exp\!\left(-2(N-S)\left(\tfrac{NT - S}{N - S} - \varepsilon\right)^{2}\right) \)
With N=100, T=0.8, ε=0.01, S=60:
P(false duplicate) < 10⁻⁸
| Dataset | Docs | Dedup Applied | Shards | MinHash Remaining Duplicates |
|---|---|---|---|---|
| DCLM-BASELINE | 3.2B | BFF | 100 | 85% |
| RefinedWeb (official) | 968M | MinHash+SA | 1 | 0% |
| RefinedWeb (ours) | 2.0B | MinHash+SA | 16 | 45% |
| Dolma V1 | 4.6B | Exact Bloom | 1 | 36% |
BFF and MinHash define "duplicate" differently. Remaining duplicates don't seem to hurt!
| Scale | Layers | Heads | d_model | LR | WD | Batch |
|---|---|---|---|---|---|---|
| 400M-1x | 24 | 8 | 1024 | 3e-3 | 0.033 | 512 |
| 1B-1x | 24 | 16 | 2048 | 3e-3 | 0.033 | 256 |
| 7B-1x/2x | 32 | 32 | 4096 | 2e-3 | 0.05 | 2048 |
Architecture: decoder-only Transformer, LayerNorm, qk-LayerNorm, SwiGLU, RoPE, seq_len=2048
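As a sanity check, the usual heuristic of ~12 · n_layers · d_model² block parameters plus embeddings roughly reproduces the scale names (the 50K vocab size is our assumption here, not from the table):

```python
# Rough transformer parameter count: ~12 * n_layers * d_model^2 for the blocks,
# plus input and output embeddings (vocab_size * d_model each when untied).
def approx_params(n_layers, d_model, vocab_size=50_000, tied_embeddings=False):
    blocks = 12 * n_layers * d_model**2
    emb = vocab_size * d_model * (1 if tied_embeddings else 2)
    return blocks + emb

p_400m = approx_params(24, 1024)  # ≈ 0.4B, matching the 400M-1x row
p_1b = approx_params(24, 2048)    # ≈ 1.4B, matching the 1B-1x row
p_7b = approx_params(32, 4096)    # ≈ 6.9B, matching the 7B rows
```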
| Dataset | MMLU | HellaSwag |
|---|---|---|
| DCLM-BASELINE | 51.8 | 77.9 |
| DCLM-BASELINE (decontaminated) | 52.7 | 78.4 |
Removing overlapping examples does NOT decrease performance
| CC Subset | Base CORE | + Wiki/Books/etc | Δ |
|---|---|---|---|
| C4 | 23.7 | 25.9 | +2.2 |
| RefinedWeb | 25.1 | 26.5 | +1.4 |
| DCLM-BASELINE | 31.1 | 29.9 | -1.2 |
When filtering is good, "high-quality" sources can actually hurt!
| Model | Params | Tokens | CORE | MMLU |
|---|---|---|---|---|
| OLMo-1B | 1.2B | 3T | 29.7 | 26.0 |
| Gemma-2B | 2.5B | 3T | 43.3 | 40.8 |
| DCLM-BASELINE | 1.4B | 4.3T | 45.2 | 47.5 |