BigScience Large Open-science Open-access Multilingual Language Model

Current Checkpoint: Training Iteration 95000

BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks.

This section provides information about the model type, version, license, funders, release date, developers, and contact information. It is useful for anyone who wants to reference the model.

All collaborators are either volunteers or have an agreement with their employer. (Further breakdown of participants forthcoming.)
Model Type: Transformer-based Language Model
Checkpoints format: transformers (Megatron-DeepSpeed format available here)
License: RAIL License v1.0 (link / article and FAQ)
Release Date Estimate: Monday, 11.July.2022
Send Questions to:
Cite as: BigScience, BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model.
(Further breakdown of organizations forthcoming.)

This section includes details about the model objective and architecture, and the compute infrastructure. It is useful for people interested in model development. Please see the BLOOM training README for full details on replicating training.

Modified from Megatron-LM GPT2 (see paper, BLOOM Megatron code):
ALiBi positional encodings (see paper), with GeLU activation functions
Layer normalization applied to the word embeddings layer (StableEmbedding; see code, paper)
Sequence length of 2048 tokens (see BLOOM tokenizer, tokenizer description)
Objective Function: Cross Entropy with mean reduction (see API documentation).

Jean Zay Public Supercomputer, provided by the French government (see announcement).
Additional 32 A100 80GB GPUs (4 nodes) in reserve
8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
Inter-node connect: Omni-Path Architecture (OPA)
NCCL-communications network: a fully dedicated subnet
Disc IO network: shared network with other types of nodes
PyTorch (pytorch-1.11 w/ CUDA-11.5; see Github link)

This section provides information about the training data, the speed and size of training elements, and the environmental impact of training. It is useful for people who want to learn more about the model inputs and training footprint.

This high-level overview of the training data is relevant for anyone who wants to know the basics of what the model is learning. Details for each dataset are provided in individual Data Cards, and the sizes of each of their contributions to the aggregated training data are presented in an Interactive Corpus Map. The training data comprises 1.6TB of pre-processed text, converted into 350B unique tokens (see the tokenizer section for more).

The pie chart shows the distribution of languages in the training data. The following tables show the further distribution of Niger-Congo and Indic languages, and of programming languages, in the training data.

Distribution of Niger-Congo and Indic languages.
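As a concrete illustration of the objective function noted in the technical specifications above (cross entropy with mean reduction over next-token predictions), here is a minimal PyTorch sketch. The batch size, vocabulary size, and shift-by-one target construction are illustrative assumptions, not the actual Megatron-DeepSpeed training code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the stated objective: next-token prediction scored with
# cross entropy, averaged ("mean" reduction) over all predicted positions.
# Batch and vocabulary sizes are illustrative; only the 2048-token sequence
# length matches the specification above.
batch_size, seq_len, vocab_size = 2, 2048, 1000

logits = torch.randn(batch_size, seq_len, vocab_size)            # model outputs, one distribution per position
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # input token ids

# Each position predicts the following token, so targets are the inputs shifted left by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels, reduction="mean")
print(loss.item())
```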
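Because the checkpoints are released in transformers format, instructing the model by casting a task as text generation can be sketched as below. The checkpoint identifier, prompt, and generation settings are assumptions made for illustration; the full model is very large, so in practice a smaller BLOOM variant or sharded loading would typically be used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical usage sketch: an instruction cast as a text-generation prompt.
# Checkpoint name, prompt, and generation settings are illustrative assumptions.
checkpoint = "bigscience/bloom"  # a smaller BLOOM variant may be more practical on a single machine

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# A task the model was not explicitly trained for, phrased as text continuation.
prompt = "Translate to French: The model was trained on text in 46 languages.\nTranslation:"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Greedy decoding (do_sample=False) is used here only to keep the sketch deterministic; sampling parameters would normally be tuned to the task.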