
Merge pull request #2159 from Andrew-Luo1:main
PiperOrigin-RevId: 691868879
Change-Id: If34d40aef885970895aa2e1dd11ce2206fa729be
copybara-github committed Oct 31, 2024
2 parents c9f78ad + 9504a97 commit 09212a9
Showing 1 changed file with 25 additions and 3 deletions.
28 changes: 25 additions & 3 deletions mjx/training_apg.ipynb
@@ -66,10 +66,10 @@
"$$\n",
"\n",
"$$\n",
"\\frac{\\partial x_t}{\\partial \\theta} = \\color{blue}{\\frac{\\partial f(x_{t-1}, a_{t-1})}{\\partial x_{t-1}}}\\frac{\\partial x_{t-1}}{\\partial \\theta} + \\color{blue}{\\frac{\\partial f(x_{t-1}, a_{t-1})}{\\partial a_{t-1}}} \\frac{\\partial a_{t-1}}{\\partial \\theta}\n",
"$$\n",
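This recursion is exactly what reverse-mode automatic differentiation computes when backpropagating through an unrolled trajectory. A minimal sketch in JAX, using a hypothetical linear dynamics function `step` and a linear policy as stand-ins for MJX:

```python
import jax

# Hypothetical stand-ins for the dynamics f(x, a) and the policy pi(x; theta).
def step(x, a):
    return x + 0.1 * a

def policy(theta, x):
    return theta * x

def final_state(theta, x0, T=5):
    # Unroll the trajectory; jax.grad accumulates d x_T / d theta through
    # every intermediate state and action, following the recursion above.
    x = x0
    for _ in range(T):
        x = step(x, policy(theta, x))
    return x

print(jax.grad(final_state)(0.5, 1.0))  # d x_T / d theta at theta = 0.5
```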
"\n",
    "The blue-colored terms in the above expression are enabled by MJX's differentiability and are the key difference between FoPGs and ZoPGs. An important consideration is what these Jacobians look like near contact points. To see why certain gradients within the Jacobian can be pathological, imagine a hard sphere falling toward a block of marble. How does its velocity change with respect to distance ($\\frac{\\partial \\dot{z}_t}{\\partial z_t}$) the instant before it touches the ground? This is a case of an **uninformative gradient**, due to [hard contact](https://arxiv.org/html/2404.02887v1). Fortunately, the default contact settings in MuJoCo are sufficiently [soft](https://mujoco.readthedocs.io/en/stable/computation/index.html#soft-contact-model) for learning via FoPGs. With soft contacts, the ground applies an increasing force on the ball as the ball penetrates it, unlike rigid contacts, which instantly provide enough force for deflection.\n",
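To make the hard- vs. soft-contact distinction concrete, here is a toy sketch (hypothetical functions, not the MuJoCo contact model) of differentiating a post-impact vertical velocity with respect to height:

```python
import jax
import jax.numpy as jnp

# Hard contact: the post-impact velocity jumps discontinuously at z = 0,
# so the derivative w.r.t. height is zero almost everywhere.
def hard_contact_vel(z, v=-1.0):
    return jnp.where(z <= 0.0, 0.0, v)

# Soft contact: penetration produces a restoring impulse that grows with
# depth, so the derivative near the surface is informative.
def soft_contact_vel(z, v=-1.0, stiffness=100.0, dt=0.01):
    penetration = jnp.maximum(-z, 0.0)
    return v + stiffness * dt * penetration

print(jax.grad(hard_contact_vel)(0.01))   # 0.0  -- uninformative
print(jax.grad(soft_contact_vel)(-0.01))  # -1.0 -- informative
```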
"\n",
    "A helpful way to think about FoPGs is via the chain rule and computation graphs, as illustrated below for how $r_2$ influences the policy gradient, again for the case where the reward does not depend on the action:\n",
"\n",
@@ -85,7 +85,29 @@
"\n",
    "Last, despite their sample efficiency, FoPG methods can still struggle with wall-clock time. Because the gradients have low variance, they do not benefit significantly from massive parallelization of data collection, unlike [RL](https://arxiv.org/abs/2109.11978). Additionally, the policy gradient is typically calculated via automatic differentiation. This can be 3-5x slower than unrolling the simulation forward, and memory intensive, with memory requirements scaling as $O(m \\cdot (m+n) \\cdot T)$, where $m$ and $n$ are the state and control dimensions, $m \\cdot (m+n)$ is the Jacobian dimension, and $T$ is the number of steps propagated through.\n",
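One common way to tame this memory cost is gradient checkpointing (rematerialization), which recomputes intermediate states during the backward pass instead of storing all $T$ of them. A sketch in JAX, with a hypothetical toy rollout standing in for the simulator:

```python
import jax
import jax.numpy as jnp

def rollout(theta, x0, T=1000):
    def body(x, _):
        a = theta * x                        # toy linear policy
        return x + 0.01 * jnp.tanh(a), None  # toy dynamics step
    # jax.checkpoint recomputes body's intermediates during the backward
    # pass, trading extra compute for a much smaller activation footprint.
    x_final, _ = jax.lax.scan(jax.checkpoint(body), x0, None, length=T)
    return x_final

# Gradient w.r.t. theta, backpropagated through all T steps.
print(jax.grad(rollout)(0.3, jnp.float32(1.0)))
```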
"\n",
    "Note that with certain models, using automatic differentiation through `mjx.step` currently causes [NaN gradients](https://github.com/google-deepmind/mujoco/issues/1517). For now, we address this issue by using double-precision floats, at the cost of doubling the memory requirements and training time.\n"
]
},
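The double-precision workaround mentioned above can be enabled in JAX via the `jax_enable_x64` flag, which must be set before any arrays are created:

```python
import jax

# Must run before any JAX arrays are created in the process.
jax.config.update("jax_enable_x64", True)

import jax.numpy as jnp
print(jnp.zeros(1).dtype)  # float64
```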
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"**Publications**\n",
"\n",
"If you use this work in an academic context, please cite the following publication:\n",
"\n",
"```\n",
"@misc{luo2024residual,\n",
" title={Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation},\n",
" author={Luo, Jing Yuan and Song, Yunlong and Klemm, Victor and Shi, Fan and Scaramuzza, Davide and Hutter, Marco},\n",
" year={2024},\n",
" eprint={2410.03076},\n",
" archivePrefix={arXiv},\n",
" primaryClass={cs.RO},\n",
" url={https://doi.org/10.48550/arXiv.2410.03076}\n",
"}\n",
"```"
]
},
{
