devinterp.utils module

devinterp.utils.tokenize_and_concatenate(dataset, tokenizer, streaming: bool = False, max_length: int = 1024, column_name: str = 'text', add_bos_token: bool = True, num_proc: int = 10)[source]

Tokenize and concatenate a text dataset into fixed-length sequences.

Based on TransformerLens (MIT License, Copyright 2022 TransformerLensOrg): https://github.com/TransformerLensOrg/TransformerLens. The core algorithm is unchanged, with local additions: input validation (BOS token and max_length checks), a NumPy reshape in place of einops, and the output column renamed from “tokens” to “input_ids”.

Joins all texts with EOS-token separators, tokenizes the result, then reshapes it into (num_sequences, max_length) chunks.
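The reshape-and-BOS step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not devinterp's actual code; the function name and arguments here are hypothetical:

```python
import numpy as np

def chunk_token_ids(token_ids, max_length, bos_token_id):
    # Reserve one slot per sequence for the BOS token
    # (this is why add_bos_token=True reduces usable length by 1).
    seq_len = max_length - 1
    # Drop the trailing remainder so the array reshapes evenly.
    num_sequences = len(token_ids) // seq_len
    trimmed = np.asarray(token_ids[: num_sequences * seq_len])
    chunks = trimmed.reshape(num_sequences, seq_len)
    # Prepend BOS to every sequence.
    bos = np.full((num_sequences, 1), bos_token_id)
    return np.concatenate([bos, chunks], axis=1)
```

With `add_bos_token=False`, the same idea applies without the reserved slot, reshaping directly to `(num_sequences, max_length)`.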

Parameters:
  • dataset – HuggingFace text dataset.

  • tokenizer – HuggingFace tokenizer with bos_token_id and eos_token_id.

  • streaming – If True, disables parallel tokenization.

  • max_length – Context window length.

  • column_name – Name of the text column.

  • add_bos_token – Prepend BOS to each sequence (reduces usable length by 1).

  • num_proc – Number of processes for dataset.map().

Returns:

Dataset with an “input_ids” column of torch tensors.
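The end-to-end contract (join texts with EOS, tokenize, chunk, prepend BOS) can be illustrated with a toy whitespace tokenizer. This is a self-contained sketch, not the real implementation; the real function expects a HuggingFace dataset and a tokenizer exposing `bos_token_id` and `eos_token_id`:

```python
import numpy as np

# Toy stand-in for a HuggingFace tokenizer: maps each word to an id.
VOCAB = {"<bos>": 0, "<eos>": 1}

def toy_tokenize(text):
    return [VOCAB.setdefault(w, len(VOCAB)) for w in text.split()]

def toy_tokenize_and_concatenate(texts, max_length=4, add_bos_token=True):
    bos, eos = VOCAB["<bos>"], VOCAB["<eos>"]
    # Tokenize each text and splice an EOS id between consecutive texts.
    ids = []
    for i, text in enumerate(texts):
        if i > 0:
            ids.append(eos)
        ids.extend(toy_tokenize(text))
    # Chunk into fixed-length sequences, reserving one slot for BOS.
    seq_len = max_length - 1 if add_bos_token else max_length
    n = len(ids) // seq_len
    chunks = np.asarray(ids[: n * seq_len]).reshape(n, seq_len)
    if add_bos_token:
        chunks = np.concatenate([np.full((n, 1), bos), chunks], axis=1)
    return chunks  # shape: (num_sequences, max_length)
```

Note that sequence boundaries ignore document boundaries: a chunk may contain the end of one text, an EOS separator, and the start of the next.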