-
Notifications
You must be signed in to change notification settings - Fork 13
/
codenet_mlm.yaml
65 lines (57 loc) · 2.27 KB
/
codenet_mlm.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# Copyright 2021 The MLX Contributors
#
# SPDX-License-Identifier: Apache-2.0
id: codenet-mlm
name: CodeNet - MLM
description: A small subset of the Project CodeNet dataset, containing 55,000 samples in the C programming language to demonstrate the use of a Masked Language Model for tokenization tasks.
version: 1.0.0
created: 2021-05-05
updated: 2021-05-05
format:
- type: C
url: https://en.wikipedia.org/wiki/The_C_Programming_Language
domain: Solutions to Programming Problems
# Information about the entity that makes the data set available
provider:
name: Data Asset eXchange
url: https://developer.ibm.com/exchanges/data/all/project-codenet/
# identifies where the data set is stored and how it is stored (REQUIRED)
repository:
type: HTTP
url: https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet_MLM.tar.gz
mime_type: application/x-tar
sha_512:
size: 51K
# REQUIRED; data set license information
license:
commercial: false
name: CDLA Permissive v2.0
url: https://cdla.dev/permissive-2-0/
# REQUIRED; describes relevant files in the data set archive
content:
- pattern: train, test
description: The dataset comprises 55,000 submissions in the C programming language.
records: 55000
size: 51K
type: file
format: C
mime_type: text/plain
# OPTIONAL; Identifies where the data set was obtained from
source:
name: Data Asset eXchange
url: https://developer.ibm.com/exchanges/data/all/project-codenet/
# OPTIONAL; but recommended
seo_tags:
- Solutions to Programming Problems
- Text
# OPTIONAL; assets that complement this data set, e.g. notebooks
related_assets:
- name: Project CodeNet Masked Language Model
description: Experiment investigates whether a popular attention model to construct a masked language model (MLM) can be used for source code instead of natural language sentences
mime_type: application/x-ipynb+json
url: https://github.com/machine-learning-exchange/katalog/blob/main/notebook-samples/src/codenet/Project_CodeNet_MLM.ipynb
application:
name: MLX
asset_id: notebooks/project-codenet-mlm
# OPTIONAL; url for the markdown file which describes the asset
readme_url: https://raw.githubusercontent.com/machine-learning-exchange/katalog/main/dataset-samples/codenet_mlm/codenet_mlm.md