Getting Started With KW-COCO¶
This document is a work in progress, and does need to be updated and refactored.
FAQ¶
Q: What is kwcoco?
A: An extension of the MS-COCO data format for storing a “manifest” of
categories, images, and annotations.
Q: Why yet another data format? A: MS-COCO did not have support for video or multimodal imagery. These are important problems in computer vision, and it seems reasonable (although challenging) that a single data format could serve as an interchange for almost all vision problems.
Q: Why extend MS-COCO and not create something else? A: To draw on the existing adoption of the MS-COCO format.
Q: What’s so great about MS-COCO? A: It has an intuitive data structure that’s simple to interface with.
Q: Why not pycocotools?
A: That module doesn’t allow you to edit the dataset programmatically, and it requires a C backend.
This module allows dynamic modification, addition, and removal of images /
categories / annotations / videos, and it goes beyond the functionality of the
pycocotools module in other places as well. We have a much more
configurable / expressive way of computing and recording object detection
metrics. If we are using an mscoco-compliant database (which can be verified
/ coerced with the kwcoco conform CLI tool), then we do call
pycocotools for functionality not directly implemented here.
Q: Would you ever extend kwcoco to go beyond computer vision? A: Maybe, it would be something new though, and only use kwcoco as an inspiration. If extending past computer vision I would want to go back and rename / reorganize the spec.
Examples¶
These Python files have a few example use cases of kwcoco.
Design Goals¶
Always be a strict superset of the original MS-COCO format
Extend the scope of MS-COCO to broader computer-vision domains.
Have a fast pure-Python API to perform lower level tasks. (Allow optional C backends for features that need speed boosts)
Have an easy-to-use command line interface to perform higher level tasks.
Use cases¶
KWCoco has been designed to work with the following tasks and image modalities.
Tasks¶
Captioning
Classification
Segmentation
Keypoint Detection / Pose Estimation
Object Detection
Modalities¶
Single Image
Video
Multispectral Imagery
Images with auxiliary information (2.5d, flow, disparity, stereo)
Combinations of the above.
Pseudo Spec¶
The following describes pseudo-code for the high-level spec (some of which may
not have full support in the Python API). A formal json-schema is defined in
the kwcoco.coco_schema module.
# All object categories are defined here.
category = {
'id': int,
'name': str, # unique name of the category
'supercategory': str, # parent category name
}
# Videos are used to manage collections of sequences of images.
video = {
'id': int,
'name': str, # a unique name for this video.
}
# Specifies how to find sensor data of a particular scene at a particular
# time. This is usually paths to rgb images, but auxiliary information
# can be used to specify multiple bands / etc...
image = {
'id': int,
'name': str, # an encouraged but optional unique name
'file_name': str, # relative path to the "base" image data
'width': int, # pixel width of "base" image
'height': int, # pixel height of "base" image
'channels': <ChannelSpec>, # a string encoding of the channels in the main image
'auxiliary': [ # information about any auxiliary channels / bands
{
'file_name': str, # relative path to associated file
'channels': <ChannelSpec>, # a string encoding
'width': <int>,  # pixel width of auxiliary image
'height': <int>,  # pixel height of auxiliary image
'warp_aux_to_img': <TransformSpec>, # transform from auxiliary image space to "base" image space. (identity if unspecified)
}, ...
]
'video_id': int,  # if this image is a frame in a video sequence, this id is shared by all frames in that sequence.
'timestamp': str | int,  # an ISO-format timestamp string or an integer in flicks.
'frame_index': int,  # ordinal frame index which can be used if the timestamp is unknown.
'warp_img_to_vid': <TransformSpec>,  # a transform from image space to video space (identity if unspecified); can be used for sensor alignment or video stabilization
}
TransformSpec:
Currently there is only one spec that works with anything:
{'type': 'affine', 'matrix': <a-3x3 matrix>},
In the future we may do something like this:
{'type': 'scale', 'factor': <float|Tuple[float, float]>},
{'type': 'translate', 'offset': <float|Tuple[float, float]>},
{'type': 'rotate', 'radians_ccw': <float>},
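To make the affine form concrete, here is a minimal stdlib-only sketch of how a 3x3 matrix maps a point between coordinate spaces. The apply_affine helper is hypothetical (kwcoco's real transform handling is more general); it only illustrates what the matrix in a TransformSpec means.

```python
# Hypothetical helper: apply a 3x3 affine matrix (as nested lists) to an
# (x, y) point using homogeneous coordinates.  Not part of the kwcoco API.

def apply_affine(matrix, xy):
    """Map an (x, y) point through a 3x3 affine matrix."""
    x, y = xy
    new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return (new_x, new_y)

# A TransformSpec that scales by 2 and translates by (10, 5)
warp_img_to_vid = {
    'type': 'affine',
    'matrix': [[2, 0, 10],
               [0, 2, 5],
               [0, 0, 1]],
}

print(apply_affine(warp_img_to_vid['matrix'], (3, 4)))  # (16, 13)
```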
ChannelSpec:
This is a string that describes the channel composition of an image.
For the purposes of kwcoco, separate different channel names with a
pipe ('|'). If the spec is not specified, methods may fall back on
grayscale or rgb processing. There are special strings: for instance,
'rgb' will expand into 'r|g|b'. In other applications you can "late
fuse" inputs by separating them with a "," and "early fuse" them by
separating with a "|". Early fusion returns a solid array/tensor; late
fusion returns separated arrays/tensors.
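The fusion convention above can be sketched with a tiny, hypothetical parser (kwcoco ships a real ChannelSpec class; this stdlib-only version only illustrates the "," = late fuse / "|" = early fuse rule and the special-string expansion):

```python
# Hypothetical ChannelSpec parser sketch -- NOT the kwcoco implementation.
SPECIAL = {'rgb': 'r|g|b'}  # special strings expand to pipe-separated names

def parse_channel_spec(spec):
    """Return a list of early-fused streams, each a list of channel names."""
    streams = []
    for stream in spec.split(','):        # "," separates late-fused streams
        stream = SPECIAL.get(stream, stream)
        streams.append(stream.split('|'))  # "|" joins early-fused channels
    return streams

print(parse_channel_spec('rgb,disparity|flow'))
# [['r', 'g', 'b'], ['disparity', 'flow']]
```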
# Ground truth is specified as annotations, each belongs to a spatial
# region in an image. This must reference a subregion of the image in pixel
# coordinates. Additional non-schema properties can be specified to track
# location in other coordinate systems. Annotations can be linked over time
# by specifying track-ids.
annotation = {
'id': int,
'image_id': int,
'category_id': int,
'track_id': <int | str | uuid>,  # indicates association between annotations across frames
'bbox': [tl_x, tl_y, w, h],  # xywh format
'score': float,
'prob': List[float],
'weight': float,
'caption': str,  # a text caption for this annotation
'keypoints': <Keypoints | List[int]>,  # an accepted keypoint format
'segmentation': <RunLengthEncoding | Polygon | MaskPath | WKT>,  # an accepted segmentation format
}
# A dataset bundles a manifest of all aforementioned data into one structure.
dataset = {
'categories': [category, ...],
'videos': [video, ...],
'images': [image, ...],
'annotations': [annotation, ...],
'licenses': [],
'info': [],
}
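A minimal hand-written dataset following the pseudo spec above might look like this (the values are illustrative; in practice kwcoco.CocoDataset manages this structure for you). Because the manifest is plain JSON, it round-trips with the standard library:

```python
import json

# An illustrative minimal dataset matching the pseudo spec above.
dataset = {
    'info': [],
    'licenses': [],
    'categories': [{'id': 1, 'name': 'cat', 'supercategory': 'animal'}],
    'videos': [],
    'images': [{'id': 1, 'file_name': 'limecat.jpg',
                'width': 640, 'height': 480}],
    'annotations': [{'id': 1, 'image_id': 1, 'category_id': 1,
                     'bbox': [0, 0, 100, 100]}],
}

# The manifest is plain JSON, so it serializes and deserializes losslessly.
text = json.dumps(dataset)
assert json.loads(text) == dataset
```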
The Python API¶
Creating a dataset¶
The Python API can be used to load an existing dataset or initialize an empty dataset. In both cases the dataset can be modified by adding/removing/editing categories, videos, images, and annotations.
You can load an existing dataset as such:
import kwcoco
dset = kwcoco.CocoDataset('path/to/data.kwcoco.json')
You can initialize an empty dataset as such:
import kwcoco
dset = kwcoco.CocoDataset()
In both cases you can add and remove data items. When you add an item, it returns the internal integer primary id used to refer to that item.
cid = dset.add_category(name='cat')
gid = dset.add_image(file_name='/path/to/limecat.jpg')
aid = dset.add_annotation(image_id=gid, category_id=cid, bbox=[0, 0, 100, 100])
The CocoDataset class has an instance variable dset.dataset which is the
loaded JSON data structure. This dataset can be interacted with directly.
# Loop over all categories, images, and annotations
for cat in dset.dataset['categories']:
    print(cat)
for img in dset.dataset['images']:
    print(img)
for ann in dset.dataset['annotations']:
    print(ann)
Continuing the above example, this will result in:
OrderedDict([('id', 1), ('name', 'cat')])
OrderedDict([('id', 1), ('file_name', '/path/to/limecat.jpg')])
OrderedDict([('id', 1), ('image_id', 1), ('category_id', 1), ('bbox', [0, 0, 100, 100])])
You can display the underlying dataset structure as follows:
print(dset.dumps(indent=' ', newlines=True))
This results in
{
"info": [],
"licenses": [],
"categories": [
{"id": 1, "name": "cat"}
],
"videos": [],
"images": [
{"id": 1, "file_name": "/path/to/limecat.jpg"}
],
"annotations": [
{"id": 1, "image_id": 1, "category_id": 1, "bbox": [0, 0, 100, 100]}
]
}
In addition to accessing dset.dataset directly, the CocoDataset object
maintains an index which allows the user to quickly look up objects by primary
or secondary keys. The available indexes are:
dset.index.anns # a mapping from annotation-ids to annotation dictionaries
dset.index.imgs # a mapping from image-ids to image dictionaries
dset.index.videos # a mapping from video-ids to video dictionaries
dset.index.cats # a mapping from category-ids to category dictionaries
dset.index.gid_to_aids # a mapping from an image id to annotation ids contained in the image
dset.index.cid_to_aids # a mapping from a category id to the annotation ids with that category
dset.index.vidid_to_gids # a mapping from a video id to the image ids contained in that video
dset.index.name_to_video # a mapping from a video name to the video dictionary
dset.index.name_to_cat # a mapping from a category name to the category dictionary
dset.index.name_to_img # a mapping from an image name to the image dictionary
dset.index.file_name_to_img # a mapping from an image file name to the image dictionary
These indexes are dynamically updated when items are added or removed.
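To see why these indexes are useful, here is a sketch of how a secondary index like gid_to_aids can be derived from the flat annotation list. kwcoco builds and maintains this for you, keeping it in sync as items are added or removed; the sample annotations below are made up for illustration.

```python
from collections import defaultdict

# Illustrative flat annotation list (made-up ids).
annotations = [
    {'id': 1, 'image_id': 1, 'category_id': 1},
    {'id': 2, 'image_id': 1, 'category_id': 2},
    {'id': 3, 'image_id': 2, 'category_id': 1},
]

# Derive the image-id -> annotation-ids index in one pass.
gid_to_aids = defaultdict(set)
for ann in annotations:
    gid_to_aids[ann['image_id']].add(ann['id'])

print(dict(gid_to_aids))  # {1: {1, 2}, 2: {3}}
```

With the index in hand, "all annotations in image 1" is a constant-time lookup instead of a scan over every annotation.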
Using kwcoco to write a torch dataset¶
The easiest way to write a torch dataset with kwcoco is to combine it with
ndsampler (https://gitlab.kitware.com/computer-vision/ndsampler).
Examples of kwcoco + ndsampler being used to write torch datasets to train deep
networks can be found in netharn's (https://gitlab.kitware.com/computer-vision/netharn)
examples for:
detection: https://gitlab.kitware.com/computer-vision/netharn/-/blob/master/netharn/examples/object_detection.py
classification: https://gitlab.kitware.com/computer-vision/netharn/-/blob/master/netharn/examples/classification.py
segmentation: https://gitlab.kitware.com/computer-vision/netharn/-/blob/master/netharn/examples/segmentation.py
Technical Debt¶
Based on design decisions made in the original MS-COCO and in KW-COCO, there are a few quirks:
The “bbox” field gives no indication it should be xywh format.
We can’t use “vid” as a variable name for “video-id” because “vid” is also an abbreviation for “video”. Hence, while category, image, and annotation all have a nice 1-letter prefix for their ids in the standard variable names (i.e. cid, gid, aid), I have to use vidid to refer to “video-ids”.
I’m not in love with the way “keypoint_categories” are handled.
Are “images” always “images”? Are “videos” always “videos”?
Would we benefit from using JSON-LD?
The “prob” field needs to be better defined
The name “video” might be confusing. It’s just a temporally ordered group of images.
Code Examples.¶
See the README and the doctests.