Releases: huggingface/datasets
5.0.0
Datasets Features
Agent traces
- Parse Agent traces messages for SFT using teich by @lhoestq in #8232
  - Agent traces from claude_code/pi/codex and others can now be loaded with load_dataset
  - Using the teich library (new optional dependency), traces are parsed to messages to enable training on traces using e.g. trl
  - Load the data:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("lhoestq/agent-traces-example", split="train")
    >>> ds[0]["messages"]
    [{'role': 'user', 'content': 'Download a random dataset from Hugging Face, use DuckDB to inspect it, and come back with a short report about it. Be concise and include: dataset name, what files/format you found, row count or rough size if you can determine it,...' ...]
  - Train on agent traces:
    trl sft --dataset-name lhoestq/agent-traces-example ...
  - Find all the Agent traces datasets on HF here: https://huggingface.co/datasets?format=format:agent-traces&sort=trending
Next-level shuffling in streaming mode
- Use multiple input shards for shuffle buffer by @lhoestq in #8194
  ds = load_dataset(..., streaming=True)
  ds = ds.shuffle(seed=42)
  # or configure local buffer shuffling manually, default is:
  ds = ds.shuffle(seed=42, buffer_size=1000, max_buffer_input_shards=10)

  toy example comparison
  from datasets import IterableDataset
  ds = IterableDataset.from_dict({"i": range(123_456_789)}, num_shards=1024)
  ds = ds.shuffle(seed=42)
  print("Cold start ids:")
  print(list(ds.take(10)["i"]))
  print("Nominal regime ids:")
  print(list(ds.skip(10_000).take(10)["i"]))

  before👎:
  Cold start ids: [6148853, 6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858]
  Nominal regime ids: [6149537, 6149418, 6149202, 6149197, 6149622, 6148849, 6149461, 6148965, 6148858, 6149290]

  after✨:
  Cold start ids: [7836668, 9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871]
  Nominal regime ids: [9283505, 95847927, 482299, 9283471, 482341, 112003312, 59920157, 43764666, 95847871, 16758448]

  Note: ds.state_dict() and ds.load_state_dict() are still supported for this improved shuffling :) enabling dataset checkpointing
  Note 2: it uses threads to fetch the first examples in parallel from the input shards
  Note 3: This is a BREAKING CHANGE: the default shuffling mechanism now uses multiple input shards. You can get the old mechanism by passing max_buffer_input_shards=1 to IterableDataset.shuffle()
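To see why reading from several input shards helps at cold start, here is a minimal pure-Python sketch of buffer shuffling. It is not the datasets implementation: the helper name buffer_shuffle, the round-robin interleaving of shards, and the toy shard layout are all assumptions made for illustration.

```python
import random


def buffer_shuffle(shards, buffer_size, max_input_shards, seed):
    """Yield items from `shards` through a fixed-size shuffle buffer.

    Hypothetical helper (not the library's code): it reads from up to
    `max_input_shards` shards at a time, round-robin, so the buffer
    mixes items coming from several shards.
    """
    rng = random.Random(seed)
    pending = [iter(shard) for shard in shards]
    active = pending[:max_input_shards]
    pending = pending[max_input_shards:]
    buffer = []

    def interleaved():
        # Round-robin over the active shards; when one is exhausted,
        # replace it with the next pending shard (if any).
        while active:
            for it in list(active):
                try:
                    yield next(it)
                except StopIteration:
                    active.remove(it)
                    if pending:
                        active.append(pending.pop(0))

    for item in interleaved():
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            # Emit a random buffered item and store the new one in its slot.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item

    # Flush what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


# Toy data: 8 shards, each holding 100 consecutive ids.
shards = [range(k * 100, (k + 1) * 100) for k in range(8)]
single = list(buffer_shuffle(shards, buffer_size=20, max_input_shards=1, seed=42))
multi = list(buffer_shuffle(shards, buffer_size=20, max_input_shards=4, seed=42))
print("1 input shard, first ids: ", single[:10])
print("4 input shards, first ids:", multi[:10])
```

With a single input shard the first ids can only come from the start of shard 0, which mirrors the "before" output above. With four input shards the buffer can already hold ids from four different shards when the first example is emitted.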
New batching features for robotics datasets
- Add batch(by_column=...) by @lhoestq in #8172
  from datasets import Dataset
  ds = Dataset.from_dict({"episode": [0] * 10 + [1] * 10, "frame": list(range(10)) * 2})
  # ds = ds.to_iterable_dataset()
  ds = ds.batch(by_column="episode")
  for x in ds:
      print(x)
  # {'episode': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
  # {'episode': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'frame': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
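The grouping shown above can be pictured with a small pure-Python sketch. It is only an illustration of the semantics, not the library's code: the helper name batch_by_column is hypothetical, and it assumes rows sharing a value are contiguous (as episodes are in the example). Whether the library also groups non-adjacent rows is not stated in these notes.

```python
from itertools import groupby


def batch_by_column(rows, column):
    """Group consecutive rows sharing the same `column` value into one batch.

    Hypothetical helper: each batch is a dict of lists, matching the shape
    of the printed output in the example above.
    """
    for _, group in groupby(rows, key=lambda row: row[column]):
        group = list(group)
        yield {name: [row[name] for row in group] for name in group[0]}


# Two short "episodes" of three frames each.
rows = [{"episode": e, "frame": f} for e in (0, 1) for f in range(3)]
batches = list(batch_by_column(rows, "episode"))
for batch in batches:
    print(batch)
```

Each yielded batch holds one full episode, so batch sizes follow the data rather than a fixed batch_size.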
New supported formats
- Add Apache Iceberg format support by @frankliee in #8148
- feat: add TsFile (Apache IoTDB) packaged builder with per-device wide format by @JackieTien97 in #8160
- feat: add 3D mesh support and MeshFolder builder by @Vinay-Umrethe in #8055
- Add .conll/.conllu dataset format loader (CoNLL-2003 / 2000 / U) by @CrypticCortex in #8219
Other improvements and bug fixes
- Pass library_name/version to HfApi in dataset push and delete paths by @davanstrien in #8161
- Fix storage_options lookup for streaming Lance datasets by @ericjaebeom in #8166
- add agent trace prompt, sent_at, count fields by @cfahlgren1 in #8163
- fix: add num_proc argument to Dataset.to_sql by @EricSaikali in #7791
- Support fsspec 2026.4.0 by @lhoestq in #8175
- Fix Parquet streaming hangs at the end of script by @lhoestq in #8176
- ClassLabel docs: Correct value for unknown labels by @l-uuz in #7645
- fix parquet reshard by @lhoestq in #8193
- Fix parquet columns arg by @lhoestq in #8210
- update readme by @lhoestq in #8208
- update single seg repos in ci by @lhoestq in #8213
- Fix single lance file from pylance 7.0 by @lhoestq in #8225
- fix(map): fix progress bar exceeding total when load_from_cache_file=False by @Nitin-Rajasekar in #8170
- fix: embed_external_files=True for mesh support by @Vinay-Umrethe in #8224
- Fix iterable skip over full Arrow blocks by @my17th2 in #8236
- Keep None as a real null in Json() columns instead of the string "null" by @adityasingh2400 in #8231
- Support composed splits in streaming datasets by @lanarkite99 in #8220
New Contributors
- @ericjaebeom made their first contribution in #8166
- @EricSaikali made their first contribution in #7791
- @l-uuz made their first contribution in #7645
- @CrypticCortex made their first contribution in #8219
- @frankliee made their first contribution in #8148
- @Vinay-Umrethe made their first contribution in #8055
- @Nitin-Rajasekar made their first contribution in #8170
- @JackieTien97 made their first contribution in #8160
- @my17th2 made their first contribution in #8236
- @adityasingh2400 made their first contribution in #8231
- @lanarkite99 made their first contribution in #8220
Full Changelog: 4.8.5...5.0.0
4.8.5
Main bug fixes
- fix: decode Json() values before calling DataFrame.to_json() (#8116) by @Brianzhengca in #8122
- Fix: decode JSON type before to_list or to_dict is called by @ItsTania in #8137
- Fix batching for table-formatted datasets by @bluehyena in #8126
- Fix iterable map resume state by @Brianzhengca in #8147
- don't embed remote files in download_and_prepare to parquet by @lhoestq in #8150
Other improvements and bug fixes
- Parse agent traces by @lhoestq in #8113
- 🔒 Pin GitHub Actions to commit SHAs by @paulinebm in #8114
- chore: bump doc-builder SHA for PR upload workflow by @rtrompier in #8134
- Remove print statement in JSON processing by @lhoestq in #8136
- Don't include files list DatasetInfo (and remove old stuff) by @lhoestq in #8128
- update ci uer by @lhoestq in #8139
- fix warning in ci by @lhoestq in #8140
- fix mask in embed_storage for remote files by @lhoestq in #8151
- fix original_files missing in ci json test by @lhoestq in #8152
- Fix null in embed storage by @lhoestq in #8154
- Fix base_path in integration tests by @lhoestq in #8155
New Contributors
- @paulinebm made their first contribution in #8114
- @Brianzhengca made their first contribution in #8122
- @bluehyena made their first contribution in #8126
- @rtrompier made their first contribution in #8134
- @ItsTania made their first contribution in #8137
Full Changelog: 4.8.4...4.8.5
4.8.4
4.8.3
What's Changed
- Fix split_dataset_by_node step by @lhoestq in #8081
- Fix docstring of Json.cast_storage by @albertvillanova in #8080
Full Changelog: 4.8.2...4.8.3
4.8.2
4.8.1
What's Changed
- Fix formatted iter arrow double yield by @HaukurPall in #8063
Full Changelog: 4.8.0...4.8.1
4.8.0
Dataset Features
- Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in #8064
  from datasets import load_dataset
  # load raw data from a Storage Bucket on HF
  ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
  # or manually, using hf:// paths
  ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
  # process, filter
  ds = ds.map(...).filter(...)
  # publish the AI-ready dataset
  ds.push_to_hub("username/my-dataset-ready-for-training")
  This also fixes multiprocessed push_to_hub on macOS, which was causing a segfault (it now uses spawn instead of fork).
  It also bumps the dill and multiprocess versions to support Python 3.14
- Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in #8068
  - added max_shard_size to IterableDataset.push_to_hub (but it requires iterating over the dataset twice to know its full size - improvements are welcome)
  - more arrow-native iterable operations for IterableDataset
  - better support of glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
  - fixes for to_pandas, videofolder, load_dataset_builder kwargs
What's Changed
- fix reshard_data_sources by @lhoestq in #8061
- Improve error message for invalid data_files pattern format by @kushalkkb in #8060
- fix null filling in missing jsonl columns by @lhoestq in #8069
New Contributors
- @kushalkkb made their first contribution in #8060
- @Michael-RDev made their first contribution in #8068
Full Changelog: 4.7.0...4.8.0
4.7.0
Datasets Features
- Add Json() type by @lhoestq in #8027
  - JSON Lines files that contain arbitrary JSON objects like tool calling datasets are now supported. When there is a field or subfield containing mixed types (e.g. mix of str/int/float/dict/list or dictionaries with arbitrary keys), the Json() type is used to store such data that would normally not be supported in Arrow/Parquet
  - Use the Json() type in Features() for any dataset, it is supported in any function that accepts features= like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
  - Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()
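One way to picture how a single column can hold mixed types is a JSON round trip. The sketch below is pure Python and rests on an assumption: that values are encoded as JSON strings for storage and decoded on access. These notes do not spell out the actual storage strategy of Json(), so treat it as an illustration of the idea only.

```python
import json

# A column mixing int, str, dict and list values, which a single
# Arrow type cannot represent directly.
values = [0, "foo", {"subfield": "bar"}, [1, 2.5, None]]

# Assumed strategy: encode every value as a JSON string so the column
# has one uniform type (str).
encoded = [json.dumps(value) for value in values]

# Decode on access to get the original Python objects back.
decoded = [json.loads(text) for text in encoded]
print(decoded == values)
```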
Examples:
You can use on_mixed_types="use_json" or specify features= with a Json() type:
>>> from datasets import Dataset, Features, Json, List
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
...
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64
>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]
This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]] # missing fields are filled with None
>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK
Another example with tool calling data and the on_mixed_types="use_json" argument (useful to not have to specify features= manually):
>>> messages = [
... {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
... {"role": "assistant", "tool_calls": [
... {"type": "function", "function": {
... "name": "control_light",
... "arguments": {"room": "living room", "state": "on"}
... }},
... {"type": "function", "function": {
... "name": "play_music",
... "arguments": {"playlist": "electronic"} # mixed-type here since keys ["playlist"] and ["room", "state"] are different
... }}]
... },
... {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
... {"role": "tool", "name": "play_music", "content": "The music is now playing."},
... {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
What's Changed
- Fix typos in iterable_dataset.py by @omkar-334 in #8049
- Fix non-deterministic by sorting metadata extensions (#8034) by @Nexround in #8039
- Use num_examples instead of len(self) for iterable_dataset's SplitInfo by @HaukurPall in #8041
- Fix silent data loss in push_to_hub when num_proc > num_shards by @HaukurPall in #8044
- Don't extract bad files by @lhoestq in #8056
- fix(iterable_dataset): preserve features when chaining filter() on typed IterableDataset by @s-zx in #8053
- fix: handle nested null types in feature alignment for multi-proc map by @ain-soph in #8047
- Fix unstable tokenizer fingerprinting (enables map cache reuse) by @KOKOSde in #7982
- Limit dataset listing to first 20 entries in readme by @lhoestq in #8057
New Contributors
- @omkar-334 made their first contribution in #8049
- @Nexround made their first contribution in #8039
- @HaukurPall made their first contribution in #8041
- @s-zx made their first contribution in #8053
- @ain-soph made their first contribution in #8047
- @KOKOSde made their first contribution in #7982
Full Changelog: 4.6.1...4.7.0
4.6.1
4.6.0
Dataset Features
- Support Image, Video and Audio types in Lance datasets
  >>> from datasets import load_dataset
  >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
  >>> ds.features
  {'video_blob': Video(), 'video_path': Value('string'), 'caption': Value('string'), 'aesthetic_score': Value('float64'), 'motion_score': Value('float64'), 'temporal_consistency_score': Value('float64'), 'camera_motion': Value('string'), 'frame': Value('int64'), 'fps': Value('float64'), 'seconds': Value('float64'), 'embedding': List(Value('float32'), length=1024)}
- Push to hub now supports Video types
  >>> from datasets import Dataset, Video
  >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
  >>> ds = ds.cast_column("video", Video())
  >>> ds.push_to_hub("username/my-video-dataset")
- Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in #7976
  - this enables cross-format Xet deduplication for image/audio/video, e.g. deduplicate videos between Lance, WebDataset, Parquet files and plain video files and make downloads and uploads faster to Hugging Face
  - E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
  - See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
- Add IterableDataset.reshard() by @lhoestq in #7992
  Reshard the dataset if possible, i.e. split the current shards further into more shards.
  This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
  Equality may happen if no shard can be split further.
  The resharding mechanism depends on the dataset file format:
  - Parquet: shard per row group instead of per file
  - Other: not implemented yet (contributions are welcome !)
  >>> from datasets import load_dataset
  >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
  >>> ds
  IterableDataset({
      features: ['label', 'title', 'content'],
      num_shards: 4
  })
  >>> ds.reshard()
  IterableDataset({
      features: ['label', 'title', 'content'],
      num_shards: 3600
  })
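The "shard per row group instead of per file" rule can be sketched in a few lines of plain Python. This is an illustration only: the helper reshard_by_row_group, the (path, num_row_groups) description of a file and the file names are hypothetical, and the real implementation reads row group information from Parquet metadata.

```python
def reshard_by_row_group(files):
    """Expand per-file shards into one shard per row group.

    Hypothetical helper: `files` is a list of (path, num_row_groups) pairs,
    and each returned shard is a (path, row_group_index) pair.
    """
    return [
        (path, row_group)
        for path, num_row_groups in files
        for row_group in range(num_row_groups)
    ]


# Two files with 3 and 2 row groups become 5 shards.
files = [("train-0.parquet", 3), ("train-1.parquet", 2)]
shards = reshard_by_row_group(files)
print(len(files), "->", len(shards))

# A file with a single row group cannot be split further,
# so the shard count stays the same (the equality case).
unsplittable = reshard_by_row_group([("tiny.parquet", 1)])
print(1, "->", len(unsplittable))
```

More shards give streaming shuffles and multi-worker data loading more units to distribute, which is the motivation for splitting below the file level.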
What's Changed
- Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in #7919
- Revert "feat: avoid some copies in torch formatter (#7787)" by @lhoestq in #7961
- docs: fix grammar and add type hints in splits.py by @Edge-Explorer in #7960
- Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in #7955
- Add examples for Lance datasets by @prrao87 in #7950
- Support null in json string cols by @lhoestq in #7963
- handle blob lance by @lhoestq in #7964
- Count examples in lance by @lhoestq in #7969
- Use temp files in push_to_hub to save memory by @lhoestq in #7979
- Drop python 3.9 by @lhoestq in #7980
- Support pandas 3 by @lhoestq in #7981
- Remove unused data files optims by @lhoestq in #7985
- Remove pre-release workaround in CI for transformers v5 and huggingface_hub v1 by @hanouticelina in #7989
- very basic support for more hf urls by @lhoestq in #8003
- Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in #7995
- Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in #8000
- More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in #8009
- Support empty shard in from_generator by @lhoestq in #8023
- Allow import polars in map() by @lhoestq in #8024
New Contributors
- @omarfarhoud made their first contribution in #7919
- @Edge-Explorer made their first contribution in #7960
- @prathamk-tw made their first contribution in #7955
- @prrao87 made their first contribution in #7950
- @hanouticelina made their first contribution in #7989
- @jayzuccarelli made their first contribution in #7995
- @AnkitAhlawat7742 made their first contribution in #8000
Full Changelog: 4.5.0...4.6.0

