devinterp package
Subpackages
- devinterp.optim package
- Submodules
- devinterp.optim.metrics module
Metrics, Metrics.DOT_FIELDS, Metrics.NORM_FIELDS, Metrics.add_dot_products_(), Metrics.add_sum_squared_(), Metrics.aggregate(), Metrics.distance, Metrics.dot_grad_noise, Metrics.dot_grad_prior, Metrics.dot_prior_noise, Metrics.localization, Metrics.noise, Metrics.numel, Metrics.prior, Metrics.scaled_grad, Metrics.sqrt_norms_(), Metrics.to(), Metrics.unscaled_grad, Metrics.weight_decay, Metrics.zero_()
- devinterp.optim.preconditioner module
- devinterp.optim.prior module
- devinterp.optim.sgld module
SGLD, SGLD.add_param_group(), SGLD.get_metrics(), SGLD.get_sketches(), SGLD.iter_group_metrics(), SGLD.load_state_dict(), SGLD.register_load_state_dict_post_hook(), SGLD.register_load_state_dict_pre_hook(), SGLD.register_state_dict_post_hook(), SGLD.register_state_dict_pre_hook(), SGLD.register_step_post_hook(), SGLD.register_step_pre_hook(), SGLD.state_dict(), SGLD.step(), SGLD.zero_grad()
- devinterp.optim.sgmcmc module
SGMCMC, SGMCMC.add_param_group(), SGMCMC.get_metrics(), SGMCMC.get_params(), SGMCMC.get_sketches(), SGMCMC.iter_group_metrics(), SGMCMC.load_state_dict(), SGMCMC.register_load_state_dict_post_hook(), SGMCMC.register_load_state_dict_pre_hook(), SGMCMC.register_state_dict_post_hook(), SGMCMC.register_state_dict_pre_hook(), SGMCMC.register_step_post_hook(), SGMCMC.register_step_pre_hook(), SGMCMC.rmsprop_sgld(), SGMCMC.sgld(), SGMCMC.sgnht(), SGMCMC.state_dict(), SGMCMC.step(), SGMCMC.zero_grad()
- devinterp.optim.sgnht module
SGNHT, SGNHT.add_param_group(), SGNHT.load_state_dict(), SGNHT.register_load_state_dict_post_hook(), SGNHT.register_load_state_dict_pre_hook(), SGNHT.register_state_dict_post_hook(), SGNHT.register_state_dict_pre_hook(), SGNHT.register_step_post_hook(), SGNHT.register_step_pre_hook(), SGNHT.state_dict(), SGNHT.zero_grad()
- devinterp.optim.sketch module
- devinterp.optim.utils module
- Module contents
- devinterp.slt package
- Submodules
- devinterp.slt.bif module
- devinterp.slt.config module
SamplerConfig, SamplerConfig.batch_size, SamplerConfig.bounding_box_size, SamplerConfig.copy(), SamplerConfig.epoch_mode, SamplerConfig.gradient_accumulation_steps, SamplerConfig.init_noise, SamplerConfig.init_seed, SamplerConfig.llc_weight_decay, SamplerConfig.localization, SamplerConfig.lr, SamplerConfig.match_sampling_input_ids_across_chains, SamplerConfig.model_computed_fields, SamplerConfig.model_config, SamplerConfig.model_construct(), SamplerConfig.model_copy(), SamplerConfig.model_dump(), SamplerConfig.model_dump_json(), SamplerConfig.model_extra, SamplerConfig.model_fields, SamplerConfig.model_fields_set, SamplerConfig.model_json_schema(), SamplerConfig.model_parametrized_name(), SamplerConfig.model_post_init(), SamplerConfig.model_rebuild(), SamplerConfig.model_validate(), SamplerConfig.model_validate_json(), SamplerConfig.model_validate_strings(), SamplerConfig.n_beta, SamplerConfig.noise_level, SamplerConfig.num_burnin_steps, SamplerConfig.num_chains, SamplerConfig.num_draws, SamplerConfig.num_init_loss_batches, SamplerConfig.num_steps_bw_draws, SamplerConfig.sampling_method, SamplerConfig.sampling_method_kwargs, SamplerConfig.save_metrics, SamplerConfig.shuffle
- devinterp.slt.covariance module
- devinterp.slt.llc module
- devinterp.slt.lm_loss module
- devinterp.slt.observables module
- devinterp.slt.sampler module
- devinterp.slt.sampling module
- devinterp.slt.susceptibilities module
- devinterp.slt.weight_restrictions module
- devinterp.slt.writing module
- devinterp.slt.zarr_schema module
- Module contents
Submodules
devinterp.utils module
- devinterp.utils.tokenize_and_concatenate(dataset, tokenizer, streaming: bool = False, max_length: int = 1024, column_name: str = 'text', add_bos_token: bool = True, num_proc: int = 10)
Tokenize and concatenate a text dataset into fixed-length sequences.
Based on TransformerLens (MIT License, Copyright 2022 TransformerLensOrg): https://github.com/TransformerLensOrg/TransformerLens. The core algorithm is unchanged, with local additions: input validation (BOS token and max_length checks), a numpy reshape in place of einops, and the output column renamed from “tokens” to “input_ids”.
Joins all texts separated by EOS tokens, tokenizes the result, then reshapes it into (num_sequences, max_length) chunks.
- Parameters:
dataset – HuggingFace text dataset.
tokenizer – HuggingFace tokenizer with bos_token_id and eos_token_id.
streaming – If True, disables parallel tokenization.
max_length – Context window length.
column_name – Name of the text column.
add_bos_token – Prepend BOS to each sequence (reduces usable length by 1).
num_proc – Number of processes for dataset.map().
- Returns:
Dataset with an “input_ids” column of torch tensors.
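A minimal usage sketch follows. The dataset and tokenizer below are illustrative choices, not requirements of the function; any HuggingFace text dataset and any tokenizer with bos_token_id and eos_token_id should work under the documented signature.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

from devinterp.utils import tokenize_and_concatenate

# Illustrative inputs: a small public text dataset and the GPT-2 tokenizer
# (which defines both bos_token_id and eos_token_id).
dataset = load_dataset("roneneldan/TinyStories", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenized = tokenize_and_concatenate(
    dataset,
    tokenizer,
    max_length=512,       # each output row is a 512-token chunk
    column_name="text",   # name of the raw-text column in the dataset
    add_bos_token=True,   # BOS is prepended, leaving 511 usable tokens per row
    num_proc=4,           # parallel workers passed to dataset.map()
)

# Each row carries one fixed-length sequence of token ids under "input_ids".
first_row = tokenized[0]["input_ids"]
print(len(first_row))  # 512
```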