Search - Tencent Cloud Developer Community - Tencent Cloud


From the column: Tencent Cloud Stream Compute Oceanus
Managing Large State in Apache Flink®: An Intro to Incremental Checkpointing
We created the incremental checkpointing feature after we observed that writing the full state for every ... Incremental checkpointing instead maintains the differences (or ‘delta’) between each checkpoint and ... on checkpointing for full details, but in summary, you enable checkpointing as normal and also enable incremental checkpointing in the constructor by setting the second parameter to true. ... Overall, the process reduces the checkpointing time during normal operations but can lead to a longer ...
98950 Published on 2018-08-10
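The "delta" idea in the excerpt above can be illustrated with a toy sketch. This is not how Flink or its state backend implements incremental checkpoints; `IncrementalCheckpointer`, its methods, and the dict-based state are hypothetical names chosen for illustration, and values are assumed to be immutable. Each checkpoint records only the keys that changed or disappeared since the previous one, and a restore replays the whole chain of deltas, which hints at why recovery can take longer.

```python
from typing import Any, Dict, List, Optional

_DELETED = object()  # sentinel: key was removed since the previous checkpoint


class IncrementalCheckpointer:
    """Toy model: every checkpoint stores only changes since the last one."""

    def __init__(self) -> None:
        self._last_snapshot: Dict[str, Any] = {}
        self._deltas: List[Dict[str, Any]] = []

    def checkpoint(self, state: Dict[str, Any]) -> Dict[str, Any]:
        """Record and return the delta between `state` and the last checkpoint."""
        delta: Dict[str, Any] = {}
        for key, value in state.items():
            if key not in self._last_snapshot or self._last_snapshot[key] != value:
                delta[key] = value
        for key in self._last_snapshot:
            if key not in state:
                delta[key] = _DELETED
        self._deltas.append(delta)
        self._last_snapshot = dict(state)  # shallow copy; values assumed immutable
        return delta

    def restore(self, upto: Optional[int] = None) -> Dict[str, Any]:
        """Rebuild state by replaying deltas 0..upto (all deltas if None)."""
        deltas = self._deltas if upto is None else self._deltas[: upto + 1]
        state: Dict[str, Any] = {}
        for delta in deltas:
            for key, value in delta.items():
                if value is _DELETED:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state


cp = IncrementalCheckpointer()
d1 = cp.checkpoint({"a": 1, "b": 2})          # first checkpoint holds everything
d2 = cp.checkpoint({"a": 1, "b": 3, "c": 4})  # only "b" and "c" changed
d3 = cp.checkpoint({"a": 1, "c": 4})          # only the removal of "b" is recorded
restored = cp.restore()
```

Normal-path checkpoints stay small because unchanged keys are never rewritten; the cost moves to `restore`, which has to walk every delta.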
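From the column: From Watershed to Sea
Fundamentals of Efficient Large-Model Training: Gradient Checkpointing
Prerequisite: Fundamentals of Large-Model Training: Gradient Accumulation ... Gradient Checkpointing: today (2023), large models have enormous parameter counts ... Gradient checkpointing strikes a balance between the two approaches above: it uses a strategy that keeps a selected subset of the activations in the computation graph and discards the rest, so the discarded activations have to be recomputed when gradients are computed ... training_args = TrainingArguments( per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing ... Trainer(model=model, args=training_args, train_dataset=ds) result = trainer.train() ... References: 1. Gradient Checkpointing 2. PyTorch model training: fp16, apm, multi-GPU models, gradient checkpointing, and other GPU-memory optimizations
4.3K30 Edited on 2023-10-12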
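The keep-some, recompute-the-rest trade-off described in the excerpt above can be sketched without any framework. The example below is a minimal illustration under stated assumptions, not the implementation used by any library: the scalar `tanh` layers, the function names, and the fixed `segment` size are all hypothetical. It stores only the input of every `segment`-th layer during the forward pass and recomputes the other activations segment by segment during the backward pass.

```python
import math
from typing import Dict, List, Tuple


def layer(w: float, x: float) -> float:
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grads(w: float, x: float, grad_out: float) -> Tuple[float, float]:
    """Return (d out / d w, d out / d x) scaled by the incoming gradient."""
    y = math.tanh(w * x)
    local = (1.0 - y * y) * grad_out
    return local * x, local * w


def backward_full(weights: List[float], x0: float) -> Tuple[List[float], int]:
    """Baseline: store every layer input, then backpropagate."""
    inputs: List[float] = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0  # gradient of the final output with respect to itself
    grads_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grads_w[i], grad = layer_grads(weights[i], inputs[i], grad)
    return grads_w, len(inputs)


def backward_checkpointed(
    weights: List[float], x0: float, segment: int
) -> Tuple[List[float], int]:
    """Store only segment-start inputs; recompute the rest during backward."""
    checkpoints: Dict[int, float] = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # the only activations kept from the forward pass
        x = layer(w, x)
    grad = 1.0
    grads_w = [0.0] * len(weights)
    peak = len(checkpoints)  # upper bound on values held at once
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        inputs: List[float] = []
        x = checkpoints[start]
        for i in range(start, end):  # recompute this segment's activations
            inputs.append(x)
            x = layer(weights[i], x)
        peak = max(peak, len(checkpoints) + len(inputs))
        for i in reversed(range(start, end)):
            grads_w[i], grad = layer_grads(weights[i], inputs[i - start], grad)
    return grads_w, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = backward_full(weights, 0.3)
g_ckpt, stored_ckpt = backward_checkpointed(weights, 0.3, segment=4)
```

Both functions produce the same gradients; the checkpointed version holds fewer activations at any one time and pays for it with one extra forward pass per segment.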
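In-Depth Analysis of Distributed Lock Services: Apache Flink's Checkpointing Mechanism as an Example
How to use it: taking Apache Flink's Checkpointing mechanism as an example, Checkpointing is one of Flink's fault-tolerance mechanisms. The steps for using it are as follows. Enable Checkpointing: enable the Checkpointing mechanism in the Flink job and set the Checkpointing interval. ... parameters: configure the Checkpointing parameters as needed, such as the storage location and timeout. ... handle Checkpointing events in the job, such as saving and restoring state. ... We enabled the Checkpointing mechanism and set the Checkpointing interval. In MyStatefulMapFunction, we use the ValueState provided by Flink to store intermediate results.
57721 Edited on 2024-10-19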
From the column: Natural Language Processing
[LLM Training Series 01] How QLoRA Loads, Trains, and Merges Large Models
We can see the source code in the peft library: def prepare_model_for_kbit_training(model, use_gradient_checkpointing=True, gradient_checkpointing_kwargs ... if "use_reentrant" not in gradient_checkpointing_kwargs or gradient_checkpointing_kwargs["use_reentrant ... (gradient_checkpointing_kwargs) > 0: warnings.warn( "gradient_checkpointing_kwargs ... ": gradient_checkpointing_kwargs} ) # enable gradient checkpointing for memory efficiency ... Checks whether the model supports gradient_checkpointing_kwargs and issues a warning (if the version is too old).
1K10 Edited on 2024-11-23
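From the column: That Village at the Foot of Langlang Mountain
[Flink] [Being Updated] State Backends and Checkpoints
Default value / Description ... execution.checkpointing.interval: generally recommended to be set to around 1-10 min. execution.checkpointing.mode (default EXACTLY_ONCE): EXACTLY_ONCE guarantees exactly-once; AT_LEAST_ONCE ... (default 1): maximum number of checkpoints in progress at the same time. execution.checkpointing.min-pause ... number of consecutive checkpoint failures tolerated. execution.checkpointing.aligned-checkpoint-timeout (default 0) ... see execution.checkpointing.aligned-checkpoint-timeout (deprecated). execution.checkpointing.force
1.5K30 Edited on 2023-09-08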
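From the column: The Road to Big Data Mastery
Apache Flink In-Depth Analysis: Fault Tolerance
The mechanism for creating snapshots in Apache Flink is called Checkpointing. The theoretical basis of Checkpointing: Stephan, in Lightweight Asynchronous Snapshots for ... Apache Flink uses the Checkpointing mechanism for fault tolerance; Checkpointing produces binlog-like data files that can be used to restore task state. ... As we saw above, Checkpointing runs repeatedly over the whole stream as time passes, continuously producing snapshots that are stored in the state backend; so how often should Checkpointing run? ... From the above we learned that Exactly-Once and At-Least-Once in Apache Flink are only configuration modes applied during checkpointing; the principle of checkpointing is the same in both modes ... buffer; when the Operator has received the barriers from all upstream inputs, the current Operator performs Checkpointing, generating a snapshot and persisting it; when Checkpointing completes, the barrier ...
99320 Published on 2019-03-19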
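The barrier handling mentioned at the end of the excerpt above (hold records in a buffer until barriers have arrived from every upstream input, then snapshot) can be sketched as a toy. This is an illustration under simplifying assumptions, not Flink's implementation: `AligningOperator` and its fields are hypothetical, the operator state is a single running sum, and only one checkpoint is assumed to be in flight at a time.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple


class AligningOperator:
    """Toy multi-input operator that sums records and aligns barriers."""

    def __init__(self, num_channels: int) -> None:
        self.num_channels = num_channels
        self.total = 0                                 # operator state
        self.blocked: Set[int] = set()                 # channels whose barrier arrived
        self.buffered: Deque[Tuple[int, int]] = deque()  # records held during alignment
        self.snapshots: Dict[int, int] = {}            # checkpoint id -> persisted state
        self.forwarded: List[int] = []                 # barriers sent downstream

    def on_record(self, channel: int, value: int) -> None:
        if channel in self.blocked:
            # Arrived after this channel's barrier: belongs to the next epoch.
            self.buffered.append((channel, value))
        else:
            self.total += value

    def on_barrier(self, channel: int, checkpoint_id: int) -> None:
        self.blocked.add(channel)
        if len(self.blocked) < self.num_channels:
            return  # still waiting for barriers from other inputs
        # All barriers received: snapshot, forward the barrier, then unblock.
        self.snapshots[checkpoint_id] = self.total
        self.forwarded.append(checkpoint_id)
        self.blocked.clear()
        pending, self.buffered = self.buffered, deque()
        for ch, value in pending:
            self.on_record(ch, value)


op = AligningOperator(num_channels=2)
op.on_record(0, 1)
op.on_record(1, 10)
op.on_barrier(0, checkpoint_id=1)   # channel 0 is now blocked
op.on_record(0, 2)                  # buffered: it came after channel 0's barrier
op.on_record(1, 20)                 # processed: channel 1's barrier has not arrived
op.on_barrier(1, checkpoint_id=1)   # alignment complete: snapshot is taken
```

The snapshot contains exactly the records that preceded the barrier on each input (1, 10 and 20) and excludes the buffered record 2, which is applied only after the snapshot is taken.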
From the column: DataLink Data Middle Platform
Dinky: Automatically Recovering Whole-Database Sync Jobs from Checkpoints and Savepoints
... = 6000; SET execution.checkpointing.tolerable-failed-checkpoints = 10; SET execution.checkpointing.timeout = 600000; SET execution.checkpointing.externalized-checkpoint-retention = RETAIN_ON_CANCELLATION; SET execution.checkpointing.mode = EXACTLY_ONCE; SET execution.checkpointing.unaligned = true; SET execution.checkpointing.max-concurrent-checkpoints ... = 6000; SET execution.checkpointing.tolerable-failed-checkpoints = 10; SET execution.checkpointing.timeout ... execution.checkpointing.mode = EXACTLY_ONCE; SET execution.checkpointing.unaligned = true; SET execution.checkpointing.max-concurrent-checkpoints ...
1.5K50 Edited on 2023-02-26
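From the column: That Village at the Foot of Langlang Mountain
[Flink] [Being Updated] State Backends and Checkpoints
... generally recommended to be set to around 1-10 min. execution.checkpointing.mode (default EXACTLY_ONCE): EXACTLY_ONCE guarantees exactly-once; AT_LEAST_ONCE guarantees at-least-once. ... recommended to leave disabled. execution.checkpointing.unaligned.forced (default false): whether to force unaligned checkpoints. execution.checkpointing.max-concurrent-checkpoints (default 1): maximum number of checkpoints in progress at the same time. execution.checkpointing.min-pause (default 0): minimum pause between two checkpoints. execution.checkpointing.tolerable-failed-checkpoints (default -): number of consecutive checkpoint failures tolerated. execution.checkpointing.aligned-checkpoint-timeout (default 0): timeout for aligned checkpoints. execution.checkpointing.alignment-timeout (default 0): see execution.checkpointing.aligned-checkpoint-timeout (deprecated). execution.checkpointing.force (default false): whether to force checkpoints.
1.9K30 Edited on 2023-10-17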
From the column: DataLink Data Middle Platform
Sharing Our Experience Extending Dinky with ClickHouse
... = 60000; SET execution.checkpointing.tolerable-failed-checkpoints = 10; SET execution.checkpointing.timeout = 10000; SET execution.checkpointing.externalized-checkpoint-retention = RETAIN_ON_CANCELLATION; SET execution.checkpointing.mode = EXACTLY_ONCE; SET execution.checkpointing.unaligned = true; SET restart-strategy ... = 60000; SET execution.checkpointing.tolerable-failed-checkpoints = 10; SET execution.checkpointing.timeout ... execution.checkpointing.mode = EXACTLY_ONCE; SET execution.checkpointing.unaligned = true; SET restart-strategy ...
1.6K20 Edited on 2023-02-26
From the column: Computer Vision Theory and Implementation
Understanding Checkpoints and Their Use in TensorFlow, Keras, and PyTorch
Checkpointing Tutorial for TensorFlow, Keras, and PyTorch. This post will demonstrate how to checkpoint ... Here are the steps to run the TensorFlow checkpointing example on FloydHub. ... according to the checkpointing strategy we adopted in our example. ... Finally, we are ready to see this checkpointing strategy applied during model training. ... We'll need to write our own solution according to our chosen checkpointing strategy.
6.5K30 Edited on 2022-09-03
From the column: DataLink Data Middle Platform
Sharing Our Experience Extending Dinky with Iceberg
... = 60s; set execution.checkpointing.timeout = 15000000; set execution.checkpointing.max-concurrent-checkpoints = 500; set execution.checkpointing.min-pause = 500; -- set the state backend type to rocksdb, enable incremental snapshots, and enable checkpoints to record data state; if not enabled ... = 60s; set execution.checkpointing.timeout = 15000000; set execution.checkpointing.max-concurrent-checkpoints = 500; set execution.checkpointing.min-pause = 500; -- set the state backend type to rocksdb, enable incremental snapshots, and enable checkpoints to record data state; if not enabled ... = 3s; set execution.checkpointing.timeout = 15000000; set execution.checkpointing.max-concurrent-checkpoints ...
2.1K10 Edited on 2022-09-02
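From the column: AI Yanxishe
OpenAI Open-Sources Its Latest Toolkit: Models 10x Larger for Only 20% Extra Compute Time
AI Yanxishe news: OpenAI recently open-sourced its latest toolkit, gradient-checkpointing, on GitHub. The toolkit saves memory by setting gradient checkpoints ... Leiphone's AI Yanxishe has compiled the release information as follows. Saving memory with gradient checkpointing: training very deep neural networks requires a lot of memory. Using the tools in the gradient-checkpointing package, developed jointly by Tim Salimans and Yaroslav Bulatov, you can trade computation time for memory to work around a shortage of memory and train your model more effectively. ... A comparison of memory usage and execution time for ResNet networks of different depths, using the regular gradient function versus the new memory-optimized function. via: GitHub (https://github.com/openai/gradient-checkpointing#saving-memory-using-gradient-checkpointing)
91070 Published on 2018-03-16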
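From the column: The Road to Big Data Mastery
Flink 1.19 in Production: An Explainer!
Checkpoint: version 1.19 adds the ability to configure, through parameters, the checkpointing.interval a Flink job uses while reading data from different sources. What does that mean? ... These two phases can then be given different checkpointing.interval values. execution.checkpointing.interval: 30sec execution.checkpointing.interval-during-backlog: 30min ... The above is 1.19 ...
1.2K10 Edited on 2024-03-21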
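From the column: QbitAI
OpenAI Releases New Package: GPUs Fit 10x Larger Models for Only 20% More Training Time
No need to worry: with the gradient-checkpointing package released by OpenAI, for feed-forward models the GPU can fit a model ten times larger at the cost of only 20% more computation time. Is that even possible? ... OpenAI research scientist Tim Salimans and data scientist Yaroslav Bulatov, a former Google Brain engineer, released a python/TensorFlow package named gradient-checkpointing ... If you want to know how the package saves memory, head over to GitHub: https://github.com/openai/gradient-checkpointing
807110 Published on 2018-03-22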
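From the column: Rossi's Thoughts
[Source Code Analysis] Deep Learning Pipeline Parallelism GPipe (3): Recomputation
At this point another important technique has to be used: recomputation (Checkpointing). ... So an important technique is introduced: recomputation (Checkpointing). ... There are two approaches to sublinear memory optimization, Checkpointing and CPU offload. The core idea of Checkpointing is to mark a small number of Tensors in the forward network (the checkpointed ... 2.3.2.2 Checkpointing optimization: the figure above compares the computation graph before and after Checkpointing. The gray part on the left is the network configuration. ... Checkpointing essentially trades computation for memory. ... Checkpointing does not store all of the intermediate activations of the computation graph that the backward pass needs; instead it recomputes them during backpropagation.
1.4K20 Published on 2021-09-08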
From the column: Trust in Data, Live Forever
ChatGLM2 Source Code Analysis: `GLMTransformer`
..., dtype=config.torch_dtype) self.gradient_checkpointing ... # `presents` stores each layer's KV cache presents = () if use_cache else None if self.gradient_checkpointing ... logger.warning_once( "`use_cache=True` is incompatible with gradient checkpointing ... _get_layer(index) if self.gradient_checkpointing and self.training: layer_ret ...
56620 Edited on 2023-10-13
From the column: GiantPandaCV
[DeepSpeed Tutorial Translation] Part 2: Megatron-LM GPT2, Zero and ZeRO-Offload
... and args.deepspeed_activation_checkpointing: set_deepspeed_activation_checkpointing(args) def set_deepspeed_activation_checkpointing(args): deepspeed.checkpointing.configure(mpu, ... activation partitioning, contiguous checkpointing, and CPU checkpointing. ... deepspeed.checkpointing.configure(mpu_, deepspeed_config=None, partition_activations=None, contiguous_checkpointing ... is configured through deepspeed.checkpointing.configure.
4K10 Edited on 2023-08-22
From the column: Container Computing
An Accessible Deep Dive into Spark's Checkpoint Mechanism
/** * Performs the checkpointing of this RDD by saving this. ... /** * This class contains all the information related to RDD checkpointing. It manages process of checkpointing of the associated RDD, * as well as, manages the post-checkpoint ... * * Subclasses should override this method to define custom checkpointing behavior. ... /** * Enumeration to manage state transitions of an RDD through checkpointing * * [ Initialized -- ...
1.4K10 Published on 2020-08-06
From the column: Rossi's Thoughts
[Source Code Analysis] PyTorch Pipeline Parallelism Implementation (6): Parallel Computation
_ensure_copy_streams() # The micro-batch index where the checkpointing stops. ... If checkpointing is enabled, here the # recomputation is scheduled at backpropagation. ([4] in ... When checkpointing is used, it must be scheduled before the backpropagation task B_{i,j} and after B_{i+1,j} has completed. This requires encoding it in the autograd engine and in the computation graph. ... For this kind of fine-grained ordering control, torchgpipe implements the checkpointing operation with two separate autograd functions, Checkpoint and Recompute.
2K20 Published on 2021-10-13
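From the column: Broadview
Spark Application Performance Tuning Revealed
checkpointing: a common pattern in graph algorithms is to update the graph with the new data computed in each iteration. This means that the chain behind the vertex RDD or edge RDD that actually makes up the graph keeps getting longer. ... A feature provided by RDD and inherited by Graph, checkpointing, solves the problem of long RDD lineage. The code in the listing below demonstrates how to use checkpointing so that vertices can be output continuously and the resulting graph updated. ... A Graph marked for checkpointing causes its underlying vertex RDD and edge RDD to be checkpointed. ... checkpointing cannot relieve memory pressure here either. When you hit this problem, the first thing to consider is serializing the Graph object.
1.3K20 Published on 2020-06-11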
