With great enthusiasm, we unveil the Prem-1B series, an open-source, multipurpose small language model (SLM) developed by Prem AI. This cutting-edge SLM offers the open community and enterprises the opportunity to harness capabilities that were once exclusively available through closed-model APIs, empowering them to build their own advanced language models. The weights of the base model (Prem-1B base) and the fine-tuned chat model (Prem-1B Chat) are available on Hugging Face under the Apache License 2.0.
🎯 Our Objective
We aim to develop a model that excels at Retrieval-Augmented Generation (RAG). While Large Language Models (LLMs) store a vast amount of information within their parameters, RAG operates differently by ingesting information during runtime. This approach suggests that for RAG applications, we may not require models of immense size. With this initiative, we aim to create a Small Language Model (SLM) with an extended context length of 8192 tokens, enabling it to handle multi-turn conversations effectively. This endeavor represents our inaugural attempt to craft an SLM tailored for RAG tasks. Read more about our hypothesis here.
💻 Infra Setup
Our model-training infrastructure consists of 16 H100 GPUs distributed across two nodes, each hosting 8 GPUs. To enable multi-GPU training, the nodes are interconnected using Ray, a distributed computing framework. We faced a few challenges while setting up the environment, which we explored in our previous blog, linked below 👇
SLM Journey Unveiled (Prem, Nicola Sosio): details how we trained a 1B-parameter Small Language Model with an 8K context length, covering dataset challenges, Distributed Data Parallelism (DDP) with Ray, and optimization techniques for data partitioning and gradient synchronization.
🏛️ Architecture
Prem-1B is a transformer-based, decoder-only SLM trained with next-token prediction. The architecture follows the Llama 2 design used by TinyLlama and uses flash-attention. Note that TinyLlama was trained with a context length of 2048, whereas Prem-1B supports a context length of up to 8192. Given the recent releases of Llama 2 and Llama 3 and their strong performance on benchmarks, we went with this Llama-style transformer architecture. We also explored the Mamba architecture, Mixture of Experts (MoE) architectures, and the recent technical reports of H2O-Danube-1.8B, Stable LM 2 1.6B, Phi-3, and Llama 3, and concluded that at this scale it is less about the architecture and more about diverse, high-quality data.
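For illustration only, here is a minimal sketch (not our exact training code) of how a Llama-2-style TinyLlama checkpoint can be loaded with transformers, its context window extended to 8192, and flash-attention enabled; the model name mirrors the configs shown later in this post.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Start from the TinyLlama (Llama-2-style) configuration and extend the maximum
# context length from 2048 to 8192. Llama uses rotary position embeddings, so no
# positional weights need to be resized.
base = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
config = AutoConfig.from_pretrained(base)
config.max_position_embeddings = 8192

model = AutoModelForCausalLM.from_pretrained(
    base,
    config=config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package and a supported GPU
)
print(model.config.max_position_embeddings)  # 8192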
🏋️‍♂️ Pre-training
During the pre-training stage, we used the SlimPajama dataset and adopted Llama's tokenizer to process the corpus. In the pre-processing phase, we packed multiple data instances up to the defined context length of 8192 tokens, minimizing the need for excessive padding. The core objective of pre-training is to ingest information and teach the model sentence formation and text completion. We tried pre-training the model without packing the datasets, but it didn't perform well, mainly because most available open-source datasets lack long-context data points: without packing, most tokens end up as pad tokens and the model learns very little.
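As a rough illustration of what packing means here (not our actual pipeline), tokenized documents can be concatenated, separated by an EOS token, and the resulting stream sliced into fixed 8192-token blocks so that no padding is needed:
from typing import Iterable, List

def pack_sequences(docs_tokens: Iterable[List[int]], eos_id: int, block_size: int = 8192) -> List[List[int]]:
    """Concatenate tokenized documents (EOS-separated) and cut fixed-size blocks."""
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for tokens in docs_tokens:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks  # any leftover tokens shorter than block_size are dropped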
To prepare the packed dataset, we used Lightning Data (litdata), a tool designed for efficient data handling and pre-processing. Since the primary goal of this model is to perform well on English content, we filtered out data points containing code-specific information. After pre-processing, we had accumulated 600B tokens, which we trained on for two epochs, for a total of 1.2T tokens. In line with our research objective of building an excellent RAG SLM, we adopted an extended context length of 8192 tokens. We spent a total of 8500 GPU hours on pre-training.
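For reference, here is a hedged sketch of this kind of litdata workflow, loosely based on the library's public tokenization example; treat the exact function signatures, serialization details, and chunk sizes as assumptions that may differ across versions:
from functools import partial
import numpy as np
from litdata import optimize
from litdata.streaming import StreamingDataset, TokensLoader
from transformers import AutoTokenizer

def tokenize_file(filepath, tokenizer):
    # Read one raw text file and yield its token ids as a flat array
    # (assumption: TokensLoader expects flat token buffers).
    with open(filepath, "r", encoding="utf-8") as f:
        yield np.asarray(tokenizer.encode(f.read()), dtype=np.int32)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
    files = ["corpus/doc1.txt", "corpus/doc2.txt"]  # placeholder inputs
    # One-off job: write the corpus into litdata's optimized, chunked format.
    optimize(fn=partial(tokenize_file, tokenizer=tok), inputs=files,
             output_dir="slimpajama-optimized", chunk_size=8192 * 256)
    # During training, stream back fixed 8192-token blocks.
    dataset = StreamingDataset("slimpajama-optimized",
                               item_loader=TokensLoader(block_size=8192))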
Here is the final training config for pre-training:
model:
  model_args:
    model_name: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
    max_position_embeddings: 8192
    flash_attention: true
    dtype: bfloat16
  optimizer_args:
    lr: 0.0004
    betas: [0.9, 0.95]
    weight_decay: 0.1
  lr_scheduler_args:
    num_warmup_percentage: 0.1
data:
  train_path: "<train_data>"
  val_path: "<val_data>"
  max_seq_length: 8192
  batch_size: 2
trainer:
  accelerator: auto
  precision: bf16-mixed
  log_every_n_steps: 1
  gradient_clip_val: 1
  accumulate_grad_batches: 16
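  # Illustrative note (assumption): if batch_size is per GPU, one optimizer step
  # covers 2 (batch) x 16 (grad accumulation) x 16 (GPUs) = 512 sequences,
  # i.e. roughly 512 x 8192 ≈ 4.2M tokens.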
  max_epochs: 2
  val_check_interval: 92000
  limit_val_batches: 1.0
  limit_train_batches: 1.0
  reload_dataloaders_every_n_epochs: 1
💬 Chat-Finetuning (SFT)
The pre-trained model serves as a foundation, a base model. However, base models are not designed for conversational interactions, so they are unsuitable for chat applications. To transform the base model into a capable assistant, we employ a process called chat fine-tuning. At a high level, this approach involves creating a structured prompt and ingesting it instead of raw data. The structured prompt is designed to simulate a conversation between a human and an assistant, and the model is trained to predict the assistant's response. The process of chat fine-tuning can be summarized as follows:
- Added a prompt template. For this, we adopted the Llama 3 chat template. 
- Used datasets with multi-turn conversation data points. A few of the datasets didn't include a system prompt, so we added a generic base system prompt.
- The model was trained on 4 H100 GPUs for 12 hours.
- No data packing was used, unlike in the pre-training stage.
Following are the config/hyperparameters for the chat-finetuning stage:
model:
  model_args:
    model_name: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
    max_position_embeddings: 8192
    flash_attention: true
    dtype: bfloat16
  optimizer_args:
    lr: 0.00005
    betas: [0.9, 0.95]
    weight_decay: 0.1
  lr_scheduler_args:
    num_warmup_percentage: 0.1
data:
  train_path: "<train_dataset>"
  val_path: "<val_dataset>"
  max_seq_length: 8192
  batch_size: 2
  dataset_tokenizer: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
trainer:
  accelerator: auto
  precision: bf16-mixed
  log_every_n_steps: 1
  gradient_clip_val: 1
  accumulate_grad_batches: 16
  max_epochs: 3
  limit_val_batches: 1.0
  limit_train_batches: 1.0
We masked the whole prompt except the assistant responses while calculating the loss, ensuring that only the assistant tokens contribute to it. Even for multi-turn conversation data points, all the assistant responses are left unmasked. For example, consider the following data point formatted with the template:
<s><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>hi<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Hello! How can I help you today?<|eot_id|>     (Not MASKED)
<|start_header_id|>user<|end_header_id|>
who is the CEO of google?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
The CEO of Google is Sundar Pichai<|eot_id|>   (Not MASKED)
<|start_header_id|>user<|end_header_id|>
who is the CEO of Twitter?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
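A minimal sketch of this masking (illustrative names, not our exact code): labels are set to -100, the index PyTorch's cross-entropy loss ignores, everywhere except inside the assistant responses.
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss

def build_labels(input_ids, assistant_spans):
    """Mask everything except assistant tokens.

    assistant_spans: list of (start, end) token-index pairs covering the
    assistant responses (the "Not MASKED" regions in the example above).
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels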
We used the following datasets for fine-tuning, selected for their quality and the diverse nature of their prompts (a short loading sketch follows the list):
- Ultrachat 200k 
- Deita 10K V0 
- Slim Orca 
- WizardLM Evol Instruct V2 
- Capybara 
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models 
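To make this concrete, here is a hedged sketch of loading one of these datasets and rendering it with the chat template; the dataset id and the "messages" column follow the public UltraChat 200k release, and the tokenizer of the released chat model is used purely for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("premai-io/prem-1B-chat")
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def render(example):
    # Each example carries a "messages" list of {"role", "content"} dicts.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(render)
print(dataset[0]["text"][:300])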
🤝 DPO and Alignment
We followed SFT with Direct Preference Optimization (DPO), one of the techniques used to align the model so that it generates better responses. Because large language models are trained without supervision, precise control over their behavior is difficult. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) use complex procedures to fine-tune models to align with human preferences. DPO is a stable and computationally efficient algorithm that solves the RLHF problem with a simple classification loss, eliminating the need for sampling or significant hyperparameter tuning. You can learn more about model alignment in this blogpost. The following datasets were used for DPO fine-tuning:
- UltraFeedback Binarized 
- Orca DPO Pairs 
- OASST2 DPO Pairs 
This stage of training was performed using the Alignment Handbook.
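To make the "simple classification loss" concrete, here is a minimal, illustrative sketch of the DPO objective (not the Alignment Handbook's implementation): given summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, the loss is a logistic loss on the scaled difference of their log-ratios.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.01):
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Binary-classification (sigmoid) form of the DPO objective.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()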
We used the following config for DPO finetuning. You can check the parameters in DPOConfig.
bf16: true
beta: 0.01
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 4.0e-6
lr_scheduler_type: cosine
max_length: 8192
max_prompt_length: 1000
num_train_epochs: 1
optim: adamw_torch
per_device_train_batch_size: 2
seed: 42
warmup_ratio: 0.1
loss_type
🔢 Results
Prem-1B-SQL has been well received by the open-source community: it has reached more than 10K monthly downloads on Hugging Face, and the PremSQL library has crossed 8K downloads. Check out our Prem-1B-SQL release blog to learn more.
Prem-1B-SQL: Fully Local Performant SLM for Text to SQL (Prem, Anindyadeep Sannigrahi): last week, we open-sourced PremSQL, a local-first library for building customised Text-to-SQL solutions.
🚀 Try it now!
Try it on Hugging Face: https://huggingface.co/premai-io/prem-1B-chat.
Or you can use the models with Hugging Face pipelines.
With model and tokenizer:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("premai-io/prem-1B-chat")
model = AutoModelForCausalLM.from_pretrained('premai-io/prem-1B-chat', torch_dtype=torch.bfloat16)
model = model.to('cuda')
terminators = [tokenizer.eos_token_id, tokenizer.encode('<|eot_id|>', add_special_tokens=False)[0]]
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant. You should give concise responses to very simple questions, but provide thorough responses to more complex and open-ended questions."
    },
    {
        'role': 'user',
        'content': 'Help me understand machine learning.'
    }
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_attention_mask=False, return_tensors="pt", add_special_tokens=False)
input_ids = inputs['input_ids']
input_ids = input_ids.to(model.device)
res = model.generate(input_ids=input_ids, max_new_tokens=400, pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)
generated_text = tokenizer.decode(res[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(generated_text)
Using pipelines:
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="premai-io/prem-1B-chat", torch_dtype=torch.bfloat16, device=0)
messages = [
    {
        "role": "system",
        "content": "You are a helpful AI assistant. You should give concise responses to very simple questions, but provide thorough responses to more complex and open-ended questions."
    },
    {
        'role': 'user',
        'content': 'Help me understand machine learning.'
    }
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
terminators = [pipe.tokenizer.eos_token_id, pipe.tokenizer.encode('<|eot_id|>', add_special_tokens=False)[0]]
outputs = pipe(prompt, max_new_tokens=400, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, pad_token_id=pipe.tokenizer.pad_token_id, eos_token_id=terminators)
print(outputs[0]["generated_text"][len(prompt):])
📚 References
- RAG Strategies. Prem blog, Sayantan Das. Explores Retrieval-Augmented Generation methods, detailing Naive RAG, Advanced RAG, and Modular RAG, and introduces RAFT, a fine-tuning technique for RAG tasks.
- TinyLlama. GitHub, jzhang38/TinyLlama. An open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
- Meta Llama 2. Meta AI. Llama 2 was pretrained on publicly available online data sources; the fine-tuned Llama 2 Chat leverages publicly available instruction datasets and over 1 million human annotations.
- Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv.
- LLM Mixture of Experts Explained. TensorOps blog, Miguel Carreira Neves.
- Philipp Singer et al. H2O-Danube-1.8B Technical Report. arXiv.
- Marco Bellagente et al. Stable LM 2 1.6B Technical Report. arXiv.
- Marah Abdin et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv.
- litdata. GitHub, Lightning-AI/litdata. Transform datasets at scale; optimize datasets for fast AI model training.
- Rafael Rafailov et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv.
- alignment-handbook. GitHub, huggingface/alignment-handbook. Robust recipes to align language models with human and AI preferences.