Commit 3b2dca7

334 handle spark session creation when already exists (#335)
* Enhance demo notebook and SparkModel.js for improved Spark session management
  - Added a new code cell to the demo notebook that initializes a Spark session with detailed configuration settings, including the application ID and a Spark UI link.
  - Updated SparkModel.js to check for existing Spark application IDs before storing new session information, improving error handling and preventing duplicate entries.
  - Enhanced logging for better visibility into Spark session creation and management.

* Update Dockerfile to install npm dependencies with the legacy peer dependencies flag
  - Modified the npm install command to include the --legacy-peer-deps flag, ensuring compatibility with older peer dependencies during the React application's build.

* Update Dockerfile to use Node 18 and enhance the build process
  - Upgraded Node.js from 14 to 18 for improved performance and compatibility.
  - Cleared the npm cache before installing dependencies to ensure a clean environment.
  - Added installation of @jridgewell/gen-mapping to support additional functionality.
  - Increased memory for the build by setting NODE_OPTIONS to 4096 MB.

* Update Dockerfile to use Node 14 and streamline the build process
  - Downgraded Node.js from 18 to 14 for compatibility.
  - Simplified npm installation by removing cache cleaning and the legacy peer dependencies flag.
  - Removed the increased build memory allocation, keeping the Dockerfile straightforward.

* Update Dockerfile to use Node 18 and optimize the build process
  - Upgraded Node.js from 14 to 18 for improved performance.
  - Implemented a clean install of npm dependencies with legacy peer dependencies support.
  - Added specific package installations for @jridgewell/gen-mapping and @babel/generator.
  - Increased memory for the build by setting NODE_OPTIONS to 4096 MB.

* Refactor Dockerfile for improved npm dependency management and build process
  - Updated the npm installation commands to set legacy peer dependencies and install packages in a specific order.
  - Cleaned the npm cache and rebuilt before running the build command to ensure a fresh environment.
  - Improved the clarity and efficiency of the web application's Dockerfile.

* Enhance Spark session management and update the demo notebook
  - Updated the demo notebook to reflect a successful Spark session execution, including updated execution metadata and the application ID.
  - Refactored Spark session creation in the backend to streamline the process, removing unnecessary parameters and improving error handling.
  - Modified SparkModel.js to ensure proper session initialization and validation of Spark application IDs.
  - Improved logging for better visibility during Spark session creation and management.

* Refactor demo notebook and SparkModel.js for improved Spark session handling
  - Removed outdated code cells from the demo notebook to improve clarity and usability.
  - Updated SparkModel.js to improve validation of Spark application IDs, ensuring they start with 'app-' and are correctly extracted from the HTML.
  - Simplified the logic for storing Spark session information in the Notebook component.

* Refactor Notebook.js and SparkModel.js for improved Spark app ID handling
  - Updated Notebook.js to extract and store the Spark app ID more efficiently, ensuring it is only stored if valid.
  - Enhanced logging to give clearer visibility of the extracted Spark app ID.
  - Added a console log in SparkModel.js to confirm successful extraction of the Spark app ID, improving debuggability.

* Refactor Notebook.js to streamline Spark app ID logging
  - Removed the redundant console log for the Spark app ID, retaining a single log statement for clarity.
  - Enhanced error handling in the Notebook component for better debugging during cell execution.

* Implement Spark app status endpoint and enhance logging in SparkModel.js
  - Added a new endpoint in spark_app.py that retrieves the status of a Spark application by its ID.
  - Enhanced logging in SparkModel.js for better visibility while storing Spark application information, including status checks and error handling.
  - Improved validation of Spark application IDs so that only valid IDs are processed.

* Refactor Spark app status retrieval and enhance error handling
  - Moved the status retrieval logic from the route handler in spark_app.py to a static method on the SparkApp service for better separation of concerns.
  - Improved error handling and logging in the new get_spark_app_status method, with clearer responses for "application not found" and internal errors.
  - Simplified the route handler to return the SparkApp method's response directly.

* Enhance notebook path handling in the getSparkApps method
  - Safely handle notebook paths by simplifying them when they match the pattern work/user@domain/notebook.ipynb.
  - Log the simplified notebook path for better debugging and visibility.

* Refactor the Spark app route and simplify notebook path handling
  - Removed unused JWT and user identification decorators from the Spark app route in spark_app.py.
  - Simplified notebook path handling in the getSparkApps method of NotebookModel.js by removing the path simplification logic and using the provided notebook path directly.

* Add create_spark_app endpoint and enhance error handling in the SparkApp service

* Enhance the create_spark_app endpoint with user authentication and error handling
  - Added JWT authentication and user identification decorators to the create_spark_app route in spark_app.py so that only authenticated users can create Spark applications.
  - Implemented user context validation in the SparkApp service, returning a 401 response if the user is not found.
  - Added a database rollback on error during Spark app creation to maintain data integrity.

* Enhance Spark session management and update the demo notebook
  - Updated the demo notebook with details of a successful Spark session execution, including execution metadata and the application ID.
  - Removed the create_spark_session endpoint from spark_app.py to streamline session management.
  - Removed the create_spark_session method from the SparkApp service, as session creation is now handled directly in the notebook.
  - Improved SparkModel.js to ensure proper validation and storage of Spark application IDs, with enhanced logging.
1 parent 02e3486 commit 3b2dca7

File tree

6 files changed, +192 −131 lines changed


examples/[email protected]/demo.ipynb

Lines changed: 33 additions & 0 deletions
@@ -10,6 +10,39 @@
     "- This is just a demo notebook\n",
     "- For testing only"
    ]
+  },
+  {
+   "cell_type": "code",
+   "isExecuted": true,
+   "lastExecutionResult": "success",
+   "lastExecutionTime": "2024-12-10 10:26:03",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       " <div style=\"border: 1px solid #e8e8e8; padding: 10px;\">\n",
+       " <h3>Spark Session Information</h3>\n",
+       " <p><strong>Config:</strong> {'spark.driver.memory': '1g', 'spark.driver.cores': 1, 'spark.executor.memory': '1g', 'spark.executor.cores': 1, 'spark.executor.instances': 1, 'spark.dynamicAllocation.enabled': False}</p>\n",
+       " <p><strong>Application ID:</strong> app-20241210080310-0003</p>\n",
+       " <p><strong>Spark UI:</strong> <a href=\"http://localhost:18080/history/app-20241210080310-0003\">http://localhost:18080/history/app-20241210080310-0003</a></p>\n",
+       " </div>\n",
+       " "
+      ],
+      "text/plain": [
+       "Custom Spark Session (App ID: app-20241210080310-0003) - UI: http://0edb0a63b2fb:4040"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "spark = create_spark(\"work/[email protected]/demo.ipynb\")\n",
+    "spark"
+   ]
   }
  ],
  "metadata": {

server/app/routes/spark_app.py

Lines changed: 11 additions & 23 deletions
@@ -8,14 +8,6 @@
 
 logging.basicConfig(level=logging.INFO)
 
-@spark_app_blueprint.route('/spark_app/<path:spark_app_id>', methods=['POST'])
-def create_spark_app(spark_app_id):
-  data = request.get_json()
-  notebook_path = data.get('notebookPath', None)
-  return SparkApp.create_spark_app(spark_app_id=spark_app_id, notebook_path=notebook_path)
-
-# @jwt_required()
-# @identify_user
 @spark_app_blueprint.route('/spark_app/<path:notbook_path>/config', methods=['GET'])
 def get_spark_app_config(notbook_path):
   logging.info(f"Getting spark app config for notebook path: {notbook_path}")
@@ -27,20 +19,16 @@ def update_spark_app_config(notbook_path):
   data = request.get_json()
   return SparkApp.update_spark_app_config_by_notebook_path(notbook_path, data)
 
-@spark_app_blueprint.route('/spark_app/session', methods=['POST'])
-def create_spark_session():
+@spark_app_blueprint.route('/spark_app/<spark_app_id>/status', methods=['GET'])
+def get_spark_app_status(spark_app_id):
+  logging.info(f"Getting spark app status for app id: {spark_app_id}")
+  return SparkApp.get_spark_app_status(spark_app_id)
+
+@spark_app_blueprint.route('/spark_app/<spark_app_id>', methods=['POST'])
+@jwt_required()
+@identify_user
+def create_spark_app(spark_app_id):
+  logging.info(f"Creating spark app with id: {spark_app_id}")
   data = request.get_json()
   notebook_path = data.get('notebookPath')
-  spark_config = data.get('config')
-
-  try:
-    spark_app_id = SparkApp.create_spark_session(notebook_path, spark_config)
-    return jsonify({
-      'status': 'success',
-      'sparkAppId': spark_app_id
-    })
-  except Exception as e:
-    return jsonify({
-      'status': 'error',
-      'message': str(e)
-    }), 500
+  return SparkApp.create_spark_app(spark_app_id=spark_app_id, notebook_path=notebook_path)
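
For orientation, here is a hedged usage sketch of the two routes defined above. The base URL and token handling are assumptions, not part of this commit; only the POST route carries @jwt_required():

// Usage sketch for the routes above (base URL and token storage assumed).
const BASE = 'http://localhost:5000'; // assumed server address

// POST /spark_app/<spark_app_id> — requires a JWT per @jwt_required().
async function createSparkApp(sparkAppId, notebookPath, token) {
  const res = await fetch(`${BASE}/spark_app/${encodeURIComponent(sparkAppId)}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ notebookPath }),
  });
  return res.json(); // spark_app.to_dict() on success
}

// GET /spark_app/<spark_app_id>/status — {"status": ...} on 200, 404 if unknown.
async function getSparkAppStatus(sparkAppId) {
  const res = await fetch(`${BASE}/spark_app/${encodeURIComponent(sparkAppId)}/status`);
  return res.status === 404 ? null : res.json();
}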

server/app/services/spark_app.py

Lines changed: 62 additions & 41 deletions
@@ -1,7 +1,7 @@
 from app.models.spark_app import SparkAppModel
 from app.models.notebook import NotebookModel
 from app.models.spark_app_config import SparkAppConfigModel
-from flask import Response
+from flask import g, Response
 from datetime import datetime
 import json
 from database import db
@@ -143,45 +143,66 @@ def update_spark_app_config_by_notebook_path(notebook_path: str = None, data: di
       status=200)
 
   @staticmethod
-  def create_spark_app(spark_app_id: str = None, notebook_path: str = None):
-    logger.info(f"Creating spark app with id: {spark_app_id} for notebook path: {notebook_path}")
-
-    if spark_app_id is None:
-      logger.error("Spark app id is None")
-      return Response(
-        response=json.dumps({'message': 'Spark app id is None'}),
-        status=404)
-
-    if notebook_path is None:
-      logger.error("Notebook path is None")
-      return Response(
-        response=json.dumps({'message': 'Notebook path is None'}),
-        status=404)
-
+  def get_spark_app_status(spark_app_id: str):
+    logger.info(f"Getting spark app status for app id: {spark_app_id}")
     try:
-      # Get the notebook id
-      notebook = NotebookModel.query.filter_by(path=notebook_path).first()
-      notebook_id = notebook.id
-
-      # Create the spark app
-      spark_app = SparkAppModel(
-        spark_app_id=spark_app_id,
-        notebook_id=notebook_id,
-        user_id=notebook.user_id,
-        created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S")
-      )
-
-      db.session.add(spark_app)
-      db.session.commit()
-
-      logger.info(f"Spark app created: {spark_app}")
+      spark_app = SparkAppModel.query.filter_by(spark_app_id=spark_app_id).first()
+      if spark_app is None:
+        logger.error("Spark application not found")
+        return Response(
+          response=json.dumps({'message': 'Spark application not found'}),
+          status=404
+        )
+      return Response(
+        response=json.dumps({'status': spark_app.status}),
+        status=200
+      )
     except Exception as e:
-      logger.error(f"Error creating spark app: {e}")
-      return Response(
-        response=json.dumps({'message': 'Error creating spark app: ' + str(e)}),
-        status=404)
-
-    return Response(
-      response=json.dumps(spark_app.to_dict()),
-      status=200
-    )
+      logger.error(f"Error getting spark app status: {e}")
+      return Response(
+        response=json.dumps({'message': str(e)}),
+        status=500
+      )
+
+  @staticmethod
+  def create_spark_app(spark_app_id: str, notebook_path: str):
+    logger.info(f"Creating spark app with id: {spark_app_id} for notebook: {notebook_path}")
+    try:
+      if not g.user:
+        logger.error("User not found in context")
+        return Response(
+          response=json.dumps({'message': 'User not authenticated'}),
+          status=401
+        )
+
+      # Get the notebook
+      notebook = NotebookModel.query.filter_by(path=notebook_path).first()
+      if notebook is None:
+        logger.error("Notebook not found")
+        return Response(
+          response=json.dumps({'message': 'Notebook not found'}),
+          status=404
+        )
+
+      # Create new spark app
+      spark_app = SparkAppModel(
+        spark_app_id=spark_app_id,
+        notebook_id=notebook.id,
+        user_id=g.user.id,
+        created_at=datetime.now()
+      )
+
+      db.session.add(spark_app)
+      db.session.commit()
+
+      return Response(
+        response=json.dumps(spark_app.to_dict()),
+        status=200
+      )
+    except Exception as e:
+      logger.error(f"Error creating spark app: {e}")
+      db.session.rollback()  # Add rollback on error
+      return Response(
+        response=json.dumps({'message': str(e)}),
+        status=500
+      )
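
Since get_spark_app_status returns a single {'status': ...} snapshot, a caller that wants to wait for a state change has to poll. A minimal polling sketch follows; the terminal state names, interval, and timeout are assumptions, not defined by this commit:

// Hypothetical polling helper built on GET /spark_app/<id>/status.
async function waitForSparkAppStatus(sparkAppId, {
  terminalStates = ['FINISHED', 'FAILED', 'KILLED'], // assumed state names
  intervalMs = 2000,
  timeoutMs = 60000,
} = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`/spark_app/${sparkAppId}/status`);
    if (res.ok) {
      const { status } = await res.json();
      if (terminalStates.includes(status)) return status;
    }
    // Not terminal yet (or 404 while the record is being created): wait and retry.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out waiting for Spark app ${sparkAppId}`);
}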

webapp/Dockerfile

Lines changed: 19 additions & 3 deletions
@@ -1,10 +1,26 @@
 # Stage 1: Build the React application
-FROM node:14 as build
+FROM node:18 as build
 WORKDIR /app
+
+# Copy package files
 COPY package*.json ./
-RUN npm install
+
+# Clean and setup npm
+RUN npm cache clean --force && \
+    npm set legacy-peer-deps=true
+
+# Install dependencies in a specific order
+RUN npm install && \
+    npm install @jridgewell/[email protected] && \
+    npm install @babel/[email protected] && \
+    npm install @babel/[email protected]
+
+# Copy the rest of the application
 COPY . .
-RUN npm run build
+
+# Build with increased memory limit
+ENV NODE_OPTIONS="--max-old-space-size=4096"
+RUN npm rebuild && npm run build
 
 # Stage 2: Serve the app with nginx
 FROM nginx:stable-alpine

webapp/src/components/notebook/Notebook.js

Lines changed: 8 additions & 8 deletions
@@ -257,11 +257,13 @@ function Notebook({
 
       // Check if contains a spark app id
       if (result[0] && result[0].data && result[0].data['text/html'] && SparkModel.isSparkInfo(result[0].data['text/html'])) {
-        setSparkAppId(SparkModel.extractSparkAppId(result[0].data['text/html']));
-        SparkModel.storeSparkInfo(SparkModel.extractSparkAppId(result[0].data['text/html']), notebook.path)
+        const appId = SparkModel.extractSparkAppId(result[0].data['text/html']);
+        setSparkAppId(appId);
+        if (appId) {
+          SparkModel.storeSparkInfo(appId, notebook.path);
+        }
+        console.log('Spark app id:', appId);
       }
-      console.log('Spark app id:', sparkAppId);
-
     } catch (error) {
       console.error('Failed to execute cell:', error);
     }
@@ -288,7 +290,7 @@ function Notebook({
   const handleCreateSparkSession = async () => {
     console.log('Create Spark session clicked');
     try {
-      const { sparkAppId, initializationCode } = await SparkModel.createSparkSession(notebookState.path);
+      const { initializationCode } = await SparkModel.createSparkSession(notebookState.path);
 
       // Create a new cell with the initialization code
       const newCell = {
@@ -306,12 +308,10 @@ function Notebook({
         content: { ...notebookState.content, cells }
       });
 
-      // Execute the cell (now need to use the last index)
+      // Execute the cell
      const newCellIndex = cells.length - 1;
       await handleRunCodeCell(newCell, CellStatus.IDLE, (status) => setCellStatus(newCellIndex, status));
 
-      console.log('Spark session created with ID:', sparkAppId);
-      setSparkAppId(sparkAppId);
     } catch (error) {
       console.error('Failed to create Spark session:', error);
       alert('Failed to create Spark session. Please check the configuration.');
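
Notebook.js now guards the call with if (appId), and the commit message says SparkModel.js additionally checks for existing Spark application IDs before storing new session information. A sketch of how storeSparkInfo could combine the status and create endpoints for that duplicate check (the body is an assumption; the actual SparkModel.js diff is not shown on this page):

// Hypothetical sketch of storeSparkInfo; the real SparkModel.js changes
// are part of this commit but not reproduced here.
async function storeSparkInfo(sparkAppId, notebookPath) {
  // Only IDs of the form "app-..." are considered valid (per the commit notes).
  if (!sparkAppId || !sparkAppId.startsWith('app-')) {
    console.warn('Skipping invalid Spark app id:', sparkAppId);
    return null;
  }

  // Duplicate check: the status route answers 200 for known apps, 404 otherwise.
  const existing = await fetch(`/spark_app/${sparkAppId}/status`);
  if (existing.ok) {
    console.log('Spark app already stored, skipping:', sparkAppId);
    return existing.json();
  }

  console.log('Storing Spark app info:', sparkAppId, notebookPath);
  const res = await fetch(`/spark_app/${sparkAppId}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }, // JWT header omitted; see the route sketch above
    body: JSON.stringify({ notebookPath }),
  });
  if (!res.ok) throw new Error(`Failed to store Spark app info: ${res.status}`);
  return res.json();
}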
