Zum Inhalt

Dataset Management

The Dataset Management module is a core component of DeepExtension, designed for structured data processing. It supports standardized dataset uploads, versioned management, and end-to-end tracking, providing high-quality data for model training and evaluation.

Key Features

  • Supports JSONL format datasets
  • Handles both single-file and multimodal datasets
  • Asynchronous background processing for stable large-file uploads
  • Comprehensive validation and error logging
  • Coming Soon: Version control (under development)

Note: Currently, only JSONL format is supported. For other formats (JSON/CSV/Parquet), please preprocess using conversion tools. Native support for additional formats is on our roadmap.


Dataset Upload Guide

Upload Process

  • Navigate to Datasets → Click 【Upload Dataset】
  • Select dataset type:

    • Single-File Dataset: Standalone JSONL file
    • Multimodal Dataset: Folder containing JSONL + images
  • Provide metadata:

    • Dataset Name (required)
    • Description (recommended for traceability)
  • File selection:

    • Single-file: Upload a JSONL file
    • Multimodal: Select a properly structured folder
  • Submit (processed asynchronously in the background)

  • Check results:
    • Success: Preview dataset
    • Failure: View detailed error logs

Format Specifications

Single-File Dataset

  • Encoding: UTF-8
  • Structural requirements:

    • Each line = valid JSON object
    • All objects must match the field structure of the first line
    • Empty values ("") allowed, but fields must exist
  • Technical limits:

    • Max 4,000 characters per line
    • No empty lines or comments

Multimodal Dataset

  • Folder structure:
    dataset_folder/  
    ├── metadata.jsonl  # Primary data file  
    └── images/        # Associated images  
    
  • JSONL example:
    {  
        "images": [{"imageId": "sample.jpg"}],  
        "qa": []  
    }  
    
  • Image requirements:
    • Must reside in the /images subfolder
    • Filename must exactly match imageId

Note: Please refer to the "Before You Begin" section in Quick Start: Run Your First Training Task for relevant dataset examples.


Dataset Lifecycle Management

Each upload creates an independent dataset entity, supporting:

  • Training: Fine-tuning data source
  • Evaluation: Test question sets (supports automated grading with reference answers)
  • Versioning: Dataset rollback coming soon (Roadmap Q4)

Best Practice: Use the description field to document data sources, preprocessing steps, and key characteristics for future reuse.


DeepExtension — Make your data structured, reusable, and ready for intelligent learning