Commit f9579c4

Merge remote-tracking branch 'origin/master' into snippets-arm-jit-binary-call-emitter

2 parents: a20f240 + 338fa1e

49 files changed: +4965 −5165 lines

docs/articles_en/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.rst
1 addition, 0 deletions

@@ -289,6 +289,7 @@ Specifying ``EXPORT_BLOB`` and ``BLOB_PATH`` parameters works similarly to ``CAC
 * To export a blob with weights you need to pass ``"CACHE_MODE" : "OPTIMIZE_SPEED"`` in the config.
 * If the blob is exported as weightless you also need to either provide
   ``"WEIGHTS_PATH" : "path\\to\\original\\model.bin"`` or ``"MODEL_PTR" : original ov::Model object``.
+* Ahead-of-time import in weightless mode has been optimized to consume less memory than during regular compilation or using ``CACHE_DIR``.

 .. tab-set::
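
For orientation, the export/import flow this doc change touches might look roughly like the C++ sketch below. It is not part of the commit: the config keys are the ones named in the documentation, while the `"YES"` value, the `ov::genai::LLMPipeline` entry point, and the assumption that passing `BLOB_PATH` without `EXPORT_BLOB` triggers import are illustrative guesses.

```cpp
// Sketch only; key semantics beyond the doc text are assumptions.
#include <openvino/genai/llm_pipeline.hpp>

int main() {
    // Ahead-of-time export: compile for NPU and write a blob to disk.
    ov::AnyMap export_config{
        {"EXPORT_BLOB", "YES"},       // assumed enabling value
        {"BLOB_PATH", "model.blob"}   // where the blob is written
        // add {"CACHE_MODE", "OPTIMIZE_SPEED"} to embed weights instead
    };
    ov::genai::LLMPipeline exporter("model_dir", "NPU", export_config);

    // Weightless import: the blob carries no weights, so the original
    // weights file must be supplied, as the added doc line describes.
    ov::AnyMap import_config{
        {"BLOB_PATH", "model.blob"},
        {"WEIGHTS_PATH", "path\\to\\original\\model.bin"}
    };
    ov::genai::LLMPipeline pipe("model_dir", "NPU", import_config);
    return 0;
}
```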

docs/articles_en/openvino-workflow/running-inference/model-input-output/dynamic-shapes.rst
9 additions, 7 deletions

@@ -18,8 +18,8 @@ As it was demonstrated in the :doc:`Changing Input Shapes <changing-input-shape>
 Reshaping models provides an ability to customize the model input shape for the exact size required in the end application.
 This article explains how the ability of model to reshape can further be leveraged in more dynamic scenarios.

-Applying Dynamic Shapes
-#######################
+When to Use Dynamic Shapes
+##########################

 Conventional "static" model reshaping works well when it can be done once per many model inference calls with the same shape.
 However, this approach does not perform efficiently if the input tensor shape is changed on every inference call. Calling the ``reshape()`` and ``compile_model()`` methods each time a new size comes is extremely time-consuming.

@@ -40,12 +40,14 @@ The methods are sensitive to model internals, do not always give optimal perform
 For a short overview of the methods, refer to the :doc:`When Dynamic Shapes API is Not Applicable <dynamic-shapes/openvino-without-dynamic-shapes-api>` page.
 Apply those methods only if native dynamic shape API described in the following sections does not work or does not perform as expected.

-The decision about using dynamic shapes should be based on proper benchmarking of a real application with real data.
-Unlike statically shaped models, dynamically shaped ones require different inference time, depending on input data shape or input tensor content.
-Furthermore, using the dynamic shapes can bring more overheads in memory and running time of each inference call depending on hardware plugin and model used.
+It is recommended to benchmark your application with real data to see if you need dynamic shapes and how it affects performance and resource use. Dynamic shapes can change inference performance and memory requirements compared to static shapes. The impact depends on the hardware plugin used, such as CPU, GPU, or NPU, and on the specific model.

-Handling Dynamic Shapes
-#######################
+.. note::
+
+   **GPU Dynamic Shape Support:** GPUs support dynamic shapes, but optimization is still in progress for a broader range of models. Performance may vary depending on the specific model and use case. Consider testing with your specific workload to evaluate performance.
+
+How to Use Dynamic Shapes
+#########################

 This section describes how to handle dynamically shaped models with OpenVINO Runtime API version 2022.1 and higher. When using dynamic shapes, there are three main differences in the workflow than with static shapes:
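
Since the renamed section covers the native dynamic shape API, a minimal C++ sketch of that workflow may help as context. It is not part of this commit; the model path, the 2-D `f32` input, and the dimension bounds are illustrative.

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // illustrative path

    // Reshape once: dynamic batch, second dimension bounded to [1, 512].
    model->reshape({ov::Dimension::dynamic(), ov::Dimension(1, 512)});

    auto compiled = core.compile_model(model, "CPU");
    auto request = compiled.create_infer_request();

    // Each inference call may now use a different concrete shape
    // without re-running reshape() or compile_model().
    for (size_t len : {16, 128, 512}) {
        ov::Tensor input(ov::element::f32, ov::Shape{1, len});
        request.set_input_tensor(input);
        request.infer();
    }
    return 0;
}
```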

src/plugins/intel_gpu/src/graph/debug_helper.cpp
24 additions, 0 deletions

@@ -466,7 +466,31 @@ NodeDebugHelper::~NodeDebugHelper() {
                               dump_raw);
         }
     }
+    for (size_t i = 0; i < m_inst.get_intermediates_memories().size(); i++) {
+        std::string name = get_file_prefix() + "_intermediates_" + std::to_string(i);
+        auto output_mem = m_inst.get_intermediates_memories()[i];
+        if (output_mem == nullptr) {
+            GPU_DEBUG_COUT << " intermediates_mem is nullptr. Nothing to dump." << std::endl;
+            continue;
+        }

+        auto& output_layout = output_mem->get_layout();
+        if (config.get_dump_tensors_format() == ov::intel_gpu::DumpFormat::binary) {
+            // Binary dump : raw
+            auto filename = get_file_path_for_binary_dump(output_layout, name, config.get_dump_tensors_path());
+
+            mem_lock<char, mem_lock_type::read> lock(output_mem, m_stream);
+            ov::util::save_binary(filename, lock.data(), output_mem->size());
+            GPU_DEBUG_COUT << " Dump layer dst : " << layer_name << " to " << filename << std::endl;
+            debug_str_for_bin_load += (filename + ",");
+        } else {
+            const bool dump_raw = config.get_dump_tensors_format() == ov::intel_gpu::DumpFormat::text_raw;
+            GPU_DEBUG_COUT << " Dump " << (dump_raw ? "raw " : "") << name << std::endl;
+            auto filename = config.get_dump_tensors_path() + get_name_for_dump(name) + ".txt";
+            // Text dump
+            log_memory_to_file(output_mem, output_layout, m_stream, filename, dump_raw);
+        }
+    }
     if (config.get_dump_tensors_format() == ov::intel_gpu::DumpFormat::binary && m_inst.is_input()) {
         debug_str_for_bin_load[debug_str_for_bin_load.size()-1] = '\"';
         GPU_DEBUG_COUT << debug_str_for_bin_load << std::endl;;

src/plugins/intel_gpu/src/graph/impls/ocl/kernels_cache.cpp
0 additions, 7 deletions

@@ -213,13 +213,6 @@ void kernels_cache::get_program_source(const kernels_code& kernels_source_code,

         current_batch.has_microkernels |= kernel_string->has_microkernels;

-        // TODO: Technically, microkernels doesn't require specific headers, but we don't want to include
-        // some headers to all batches as it may lead to compilation error on some driver versions.
-        // Need to generalize work with headers to include only necessary parts
-        if (current_batch.has_microkernels) {
-            current_batch.source.insert(current_batch.source.begin(), current_batch.micro_headers.begin(), current_batch.micro_headers.end());
-        }
-
         current_batch.source.push_back(std::move(full_code));
         current_batch.kernels_counter++;
     }
