DataSAIL
DataSAIL is an open-source software tool that splits machine learning datasets while minimizing Information Leakage. It formulates the splitting of a dataset as a constrained minimization problem and optimizes the data split towards an objective function that accounts for information leakage.
Description
DataSAIL: Data Splitting Against Information Leaking
DataSAIL, short for Data Splitting Against Information Leakage, is a versatile tool designed to partition data while minimizing similarities between the partitions. Inter-sample similarities can lead to information leakage, resulting in an overestimation of the model's performance in certain training regimes.
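To make the idea of leakage through inter-sample similarity concrete, the following minimal sketch computes how much of the total pairwise similarity crosses a split boundary. The sample names, similarity values, and the `cross_split_similarity` helper are hypothetical illustrations, and this toy measure is not DataSAIL's actual objective function.

```python
def cross_split_similarity(similarity, assignment):
    """Return the fraction of total pairwise similarity that crosses split boundaries."""
    names = sorted(assignment)
    crossing = 0.0
    total = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s = similarity[frozenset((a, b))]
            total += s
            if assignment[a] != assignment[b]:
                crossing += s
    return crossing / total if total else 0.0


# Hypothetical toy data: p1/p2 and p3/p4 are near-duplicates of each other.
similarity = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p3", "p4")): 0.8,
    frozenset(("p1", "p3")): 0.1,
    frozenset(("p1", "p4")): 0.1,
    frozenset(("p2", "p3")): 0.1,
    frozenset(("p2", "p4")): 0.1,
}

# A split that separates near-duplicates versus one that keeps them together.
leaky = {"p1": "train", "p2": "test", "p3": "train", "p4": "test"}
clean = {"p1": "train", "p2": "train", "p3": "test", "p4": "test"}

leaky_score = cross_split_similarity(similarity, leaky)
clean_score = cross_split_similarity(similarity, clean)
print(f"leaky split: {leaky_score:.3f}, clean split: {clean_score:.3f}")
```

A model evaluated on the leaky split sees near-duplicates of its training samples at test time, which is the situation that inflates performance estimates.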
DataSAIL was initially developed for machine learning workflows involving biological datasets, but it can be applied to any type of dataset. It can be used through a command-line interface or integrated as a Python package, making it accessible and user-friendly. The tool is released under the MIT license, ensuring it remains open source and freely available on GitHub.
Detailed documentation of the package, including explanations and examples, is available on DataSAIL's ReadTheDocs page.
Installation
DataSAIL is available for all modern versions of Python (v3.9 or newer). We ship two versions of DataSAIL:
- DataSAIL: The full version of DataSAIL, which includes all third-party clustering algorithms and is available on conda for Linux and OSX (called datasail).
- DataSAIL-lite: A lightweight version of DataSAIL, which does not include any third-party clustering algorithms and is available on PyPI (called datasail) and conda (called datasail-lite).
NOTE: There is a naming inconsistency between the conda and PyPI versions of DataSAIL. The lite version is called datasail-lite on conda, while it is called datasail on PyPI. This will be fixed in the future, but for now, please be aware of this inconsistency.
Usage
DataSAIL is installed as a command-line tool. In the conda environment into which DataSAIL has been installed, you can run
datasail --e-type P --e-data <path_to_fasta> --e-sim mmseqs --output <path_to_output_path> --technique C1e
to split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run datasail -h and check out ReadTheDocs, which provides a more detailed explanation of the arguments as well as example notebooks. The runtime largely depends on the number and type of splits to be computed and on the size of the dataset. For small datasets (fewer than 10k samples), DataSAIL finishes within minutes. On large datasets (more than 100k samples), it can take several hours to complete.
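When the command-line call needs to be scripted, the same invocation can be assembled from Python. The sketch below only uses the flags shown in the command above; the file names, the `build_datasail_command` and `run_datasail` helpers, and the default values are hypothetical.

```python
import shlex
import shutil
import subprocess


def build_datasail_command(fasta_path, output_dir, technique="C1e", similarity="mmseqs"):
    """Assemble the argument list for the protein-splitting call shown above."""
    return [
        "datasail",
        "--e-type", "P",
        "--e-data", str(fasta_path),
        "--e-sim", similarity,
        "--output", str(output_dir),
        "--technique", technique,
    ]


def run_datasail(command):
    """Run the command, failing early if the datasail executable is not on PATH."""
    if shutil.which(command[0]) is None:
        raise RuntimeError("datasail executable not found; activate the right environment")
    return subprocess.run(command, check=True)


# Hypothetical input and output paths.
cmd = build_datasail_command("proteins.fasta", "splits_out")
command_line = shlex.join(cmd)
print(command_line)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces; `run_datasail(cmd)` is left uncalled here because it requires an installed DataSAIL and a real FASTA file.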
Regardless of which installation command was used, DataSAIL can be executed by running
datasail -h
on the command line, which lists the parameters DataSAIL takes. DataSAIL can also be imported directly as a regular package in your Python program using
from datasail.sail import datasail
splits = datasail(...)
For more information about the parameters, please read through the documentation page.
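As a rough sketch of what such a call might look like, the example below mirrors the command-line invocation from above. Only the import path is taken from this page; the keyword names (`techniques`, `splits`, `names`, `e_type`, `e_data`, `e_sim`), the split ratio, and the helper functions are assumptions that should be checked against the documentation before use.

```python
def build_split_arguments(fasta_path):
    """Collect keyword arguments for a cluster-based protein split.

    The keyword names are assumptions modelled on the CLI flags
    (--technique, --e-type, --e-data, --e-sim); verify them in the docs.
    """
    return {
        "techniques": ["C1e"],      # assumed counterpart of --technique C1e
        "splits": [8, 2],           # hypothetical 80/20 split ratio
        "names": ["train", "test"], # hypothetical split names
        "e_type": "P",              # proteins, as in --e-type P
        "e_data": str(fasta_path),  # as in --e-data <path_to_fasta>
        "e_sim": "mmseqs",          # as in --e-sim mmseqs
    }


def run_split(arguments):
    """Call DataSAIL with the collected arguments."""
    # Imported lazily so this file can be loaded without DataSAIL installed.
    from datasail.sail import datasail
    return datasail(**arguments)


arguments = build_split_arguments("proteins.fasta")
print(arguments)
```

Calling `run_split(arguments)` requires an installed DataSAIL and an existing FASTA file, so it is not executed in this sketch.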
When to use DataSAIL and when not to use it
DataSAIL offers a variety of ways to split one-dimensional and multi-dimensional data, for example for a generic protein property prediction task or for a protein-ligand interaction prediction dataset.
The data split employed should always reflect the inference reality the model is facing. So, if the model is intended to perform well on unseen data, the validation and test data should be new relative to the training data.
For more information, please see our guidelines for selecting data splits in the documentation.
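The principle that validation and test data should be new to the model can be illustrated with a simple cluster-aware split, in which all members of a cluster of similar samples end up in the same split. The greedy heuristic below is a toy stand-in under hypothetical data and names; DataSAIL itself solves a constrained optimization problem rather than using this heuristic.

```python
def split_by_cluster(cluster_of, fractions):
    """Assign whole clusters to splits so that no cluster spans two splits.

    Clusters are processed from largest to smallest, and each one goes to
    the split that is currently furthest below its target size.
    """
    clusters = {}
    for sample, cluster in cluster_of.items():
        clusters.setdefault(cluster, []).append(sample)

    total = len(cluster_of)
    sizes = {name: 0 for name in fractions}
    assignment = {}
    for members in sorted(clusters.values(), key=len, reverse=True):
        # Pick the split with the largest remaining deficit.
        target = max(fractions, key=lambda name: fractions[name] * total - sizes[name])
        for sample in members:
            assignment[sample] = target
        sizes[target] += len(members)
    return assignment


# Hypothetical samples grouped into three similarity clusters A, B, and C.
cluster_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "c1": "C"}
assignment = split_by_cluster(cluster_of, {"train": 0.7, "test": 0.3})
print(assignment)
```

With this toy input, clusters A and C go to the training split and cluster B goes to the test split, so no test sample has a close relative in the training data.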
Citation
If you used DataSAIL to split your data, please cite DataSAIL in your publication.
@article{joeres2025datasail,
title={Data splitting to avoid information leakage with DataSAIL},
author={Joeres, Roman and Blumenthal, David B. and Kalinina, Olga V.},
journal={Nature Communications},
volume={16},
pages={3337},
year={2025},
doi={10.1038/s41467-025-58606-8},
}