A Unified NLP Framework for LLMs.

Terms of Use

The omnes-flores Python module is published under Apache License Version 2.0 and the dedicated models for omnes-flores are distributed under the license inherited from the Universal Dependencies treebanks used for training.

To use the base model google/gemma-2-9b, you must agree to the terms of use in your HuggingFace account.

Requirements

The omnes-flores technology preview requires a Linux environment with an NVIDIA GPU (Ampere or later). Running inference on the 9B-parameter base model plus LoRA in bfloat16 requires at least 24GB of GPU memory. The following environment has been tested:

  • NVIDIA RTX Pro 6000 Blackwell 96GB
  • CPU RAM 64GB
  • Ubuntu 24.04
  • CUDA 12.8
  • Python 3.12
  • vLLM 0.16.0
  • Transformers 4.57.6
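
The 24GB figure can be sanity-checked with back-of-the-envelope arithmetic: the bfloat16 weights of a 9B-parameter model alone occupy roughly 17 GiB, leaving the remainder for the LoRA adapter, activations, and KV cache. A rough sketch (illustrative numbers, not exact vLLM allocations):

```python
# Rough GPU memory estimate for a 9B-parameter model in bfloat16.
# These are illustrative figures, not actual vLLM allocation sizes.
params = 9e9
bytes_per_param = 2  # bfloat16 = 2 bytes per parameter
weights_gib = params * bytes_per_param / 1024**3
headroom_gib = 24 - weights_gib  # left for LoRA, activations, KV cache
print(f"weights: {weights_gib:.1f} GiB, headroom on a 24 GiB GPU: {headroom_gib:.1f} GiB")
```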

We are planning to support Apple Silicon (MLX) in the near future.

Install

Installing the library is straightforward:

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install omnes-flores

Setup HuggingFace

To use the base model google/gemma-2-9b, you must agree to its terms of use by following the steps below:

  • Log in to Hugging Face with your account.
  • Open the google/gemma-2-9b page.
  • Read the description in the Access Gemma on Hugging Face panel and click Acknowledge license if you agree to the terms.

Next, generate an access token on your Hugging Face account settings page:

  • Open the Access Tokens page and click the + Create new token button.
  • In the User permissions (your-account-name) section of the Create new Access Token page:
    • Select Fine-grained in the Token type field (the default).
    • Enter read-gated-repos in the Token name field.
    • Check the following items in the Repositories section:
      • Read access to contents of all repos under your personal namespace
      • View access requests for all gated repos under your personal namespace
      • Read access to contents of all public gated repos you can access
    • Click the Create token button at the bottom of the page.
    • In the Save your Access Token dialog, click Copy to copy the access token (it begins with hf_) and save it in a secure place.

Finally, log in via the CLI with the access token:

  • From the Python environment in which you installed omnes-flores, run hf auth login and paste the access token.
    • If the login succeeds, output like the following is displayed:
      $ hf auth login
      ...
      Enter your token (input will not be visible): 
      ...
      Login successful.
      The current active token is: `read-gated-repos`
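
If an interactive login is inconvenient (for example in CI), huggingface_hub also reads the token from the HF_TOKEN environment variable. A sketch; hf_xxxxxxxx is a placeholder, not a real token:

```shell
# Non-interactive alternative to `hf auth login`:
# export the access token before running omnes-flores.
export HF_TOKEN=hf_xxxxxxxx
```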
      

Run Models

40-lang-41-treebank-v0 (CC BY-SA 4.0)

This model is available for commercial use.

$ omnes-flores < text_file > conllu_file

In the above command, the input text_file is plain text in any language, and sentences do not need to be separated by line breaks. To improve inference efficiency, the input is automatically batched; batch boundaries always fall at blank lines in the input. When processing interactively via standard input, you can press Enter twice to trigger inference immediately. For details of the CoNLL-U output format, see the Universal Dependencies official page.
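
The CoNLL-U output can be post-processed with a few lines of standard-library Python. A minimal sketch that extracts the FORM, UPOS, HEAD, and DEPREL columns; the sample lines are illustrative, not actual model output:

```python
# Minimal CoNLL-U reader: yields (form, upos, head, deprel) per token,
# skipping comment lines ("#"), multiword-token ranges (IDs like "1-2"),
# and empty nodes (IDs like "1.1").
def read_conllu(lines):
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        yield cols[1], cols[3], int(cols[6]), cols[7]

sample = [
    "# text = Hello world",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_",
    "2\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\t_",
]
print(list(read_conllu(sample)))
# → [('Hello', 'INTJ', 0, 'root'), ('world', 'NOUN', 1, 'vocative')]
```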

This model was trained using training data from 40 UD languages, consisting of 41 treebanks.

The Japanese word unit is LUW (NINJAL Long Unit Word).

The following 40 UD treebanks, each with a license permitting commercial use and over 40k UD tokens in its train set, were selected to train the LoRA models of omnes-flores-40-lang-41-treebank-v0.

In addition, a proprietary treebank was used for training, which was specially licensed from the National Institute for Japanese Language and Linguistics exclusively for training this model.

40-lang-42-treebank-v0 (CC BY-SA 4.0)

This model is available for commercial use.

$ omnes-flores --m megagonlabs/omnes-flores-40-lang-42-treebank-v0 < text_file > conllu_file

This model uses the Corpus of Everyday Japanese Conversation (CEJC) as part of its training data, and its Japanese word unit is SUW (NINJAL Short Unit Word), which does not presuppose bunsetsu structure, in order to handle the ungrammatical contexts found in the fragmentary utterances of everyday conversation.

The following 40 UD treebanks, each with a license permitting commercial use and over 40k UD tokens in its train set, were selected to train the LoRA models of omnes-flores-40-lang-42-treebank-v0.

In addition, the following datasets were used for training, which were specially licensed from the National Institute for Japanese Language and Linguistics exclusively for training this model.

84-lang-99-treebank-non-commercial-v0 (CC BY-NC-SA 4.0)

This model is made available for non-commercial use, including academic use; commercial use is strictly prohibited.

$ omnes-flores --m megagonlabs/omnes-flores-84-lang-99-treebank-non-commercial-v0 < text_file > conllu_file

The Japanese word unit is LUW (NINJAL Long Unit Word).

The following 40 UD treebanks, each with a license permitting commercial use and over 40k UD tokens in its train set, were selected to train the LoRA models of omnes-flores-84-lang-99-treebank-non-commercial-v0.

In addition, the following 59 treebanks have been added to the training in this model for academic purposes:

Method

The analysis pipeline components use the following prompts:

Figure 1: An example of a language identification and sentence segmentation prompt instance. The parts that change from instance to instance are shown in italics. The SHADED REGION in the assistant role corresponds to the range over which the loss gradient is computed during training, and to the decoded text during inference. At inference time, the span from the system role up to the assistant-role header is provided as input, and decoding of the subsequent segment continues until <eos> is generated.
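
The shaded-region behaviour described above is the standard chat SFT recipe: label positions outside the assistant span are masked so the cross-entropy loss ignores them. A framework-agnostic sketch (the token IDs and the -100 ignore index are common conventions for illustration, not the project's actual code):

```python
# Build SFT labels: keep the assistant span, mask everything else with
# -100, the ignore index conventionally skipped by cross-entropy loss.
IGNORE = -100

def make_labels(input_ids, assistant_start, assistant_end):
    """Mask all positions outside [assistant_start, assistant_end)."""
    return [
        tok if assistant_start <= i < assistant_end else IGNORE
        for i, tok in enumerate(input_ids)
    ]

# Toy prompt: 4 system/user tokens followed by a 3-token assistant answer.
ids = [11, 12, 13, 14, 21, 22, 23]
labels = make_labels(ids, assistant_start=4, assistant_end=7)
print(labels)  # → [-100, -100, -100, -100, 21, 22, 23]
```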

Figure 2: An example of a word segmentation and language-specific part-of-speech tagging prompt instance.

Figure 3: An example of a dependency parsing prompt instance.

Evaluation Results

(Image: omnes-flores-40-lang-41-treebank-v0-eval.png)

Table 1: Accuracy of the proposed method and UDPipe2 on 41 treebanks (average of 4 trials ± sample standard deviation). Yellow highlights indicate relatively small training data or relatively low accuracy. Green highlights indicate relatively large sample standard deviation.

Read the NLP2026 paper 多言語統語解析処理のためのMulti-task LoRA SFT方式の評価 (Evaluation of a Multi-task LoRA SFT Method for Multilingual Syntactic Parsing) and its poster material (both written in Japanese) for details.

Acknowledgements

This work was conducted as part of a collaborative research project between Recruit Co., Ltd. and the National Institute for Japanese Language and Linguistics.

Citations

You are encouraged to cite one of the following papers if you use omnes-flores models:

@inproceedings{matsuda-etal-2025-step,
    title = "Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of {LLM}s",
    author = "Matsuda, Hiroshi  and
      Ma, Chunpeng  and
      Asahara, Masayuki",
    editor = "Sagae, Kenji  and
      Oepen, Stephan",
    booktitle = "Proceedings of the 18th International Conference on Parsing Technologies (IWPT, SyntaxFest 2025)",
    month = aug,
    year = "2025",
    address = "Ljubljana, Slovenia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.iwpt-1.2/",
    pages = "11--19",
    ISBN = "979-8-89176-294-7",
    abstract = "Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, where universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, and a simplified CoNLL-U like output format, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches."
}

@misc{matsuda2025stepbystepinstructionssimpletabular,
      title={Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs}, 
      author={Hiroshi Matsuda and Chunpeng Ma and Masayuki Asahara},
      year={2025},
      eprint={2506.09983},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.09983}, 
}

Version History

0.1.0

  • 2026-03-09 Release 0.1.0