# Getting Started With KW-COCO

This document is a work in progress and needs to be updated and refactored.

## FAQ

Q: What is `kwcoco`?

A: An extension of the MS-COCO data format for storing a "manifest" of categories, images, and annotations.

Q: Why yet another data format?

A: MS-COCO did not have support for video and multimodal imagery. These are important problems in computer vision, and it seems reasonable (although challenging) that a single data format could serve as an interchange for almost all vision problems.

Q: Why extend MS-COCO and not create something else?

A: To draw on the existing adoption of the MS-COCO format.

Q: What's so great about MS-COCO?

A: It has an intuitive data structure that is simple to interface with.

Q: Why not pycocotools?

A: That module does not allow you to edit the dataset programmatically, and it requires a compiled C backend. This module allows dynamic modification, addition, and removal of images / categories / annotations / videos, and it goes beyond the functionality of pycocotools in other places as well. We have a much more configurable / expressive way of computing and recording object detection metrics. If we are using an mscoco-compliant database (which can be verified / coerced with the `kwcoco conform` CLI tool), then we do call pycocotools for functionality not directly implemented here.

Q: Would you ever extend kwcoco to go beyond computer vision?

A: Maybe. It would be something new though, and would only use kwcoco as an inspiration. If extending past computer vision, I would want to go back and rename / reorganize the spec.

## Examples

These Python files contain example use cases of kwcoco:

* [draw_gt_and_predicted_boxes](https://github.com/Kitware/kwcoco/blob/master/kwcoco/examples/draw_gt_and_predicted_boxes.py)
* [modification_example](https://github.com/Kitware/kwcoco/blob/master/kwcoco/examples/modification_example.py)
* [simple_kwcoco_torch_dataset](https://github.com/Kitware/kwcoco/blob/master/kwcoco/examples/simple_kwcoco_torch_dataset.py)
* [getting_started_existing_dataset](https://github.com/Kitware/kwcoco/blob/master/kwcoco/examples/getting_started_existing_dataset.py)

## Design Goals

* Always be a strict superset of the original MS-COCO format.
* Extend the scope of MS-COCO to broader computer-vision domains.
* Have a fast pure-Python API to perform lower-level tasks. (Allow optional C backends for features that need speed boosts.)
* Have an easy-to-use command line interface to perform higher-level tasks.

## Use cases

KWCoco has been designed to support the following tasks across the following image modalities.

### Tasks

* Captioning
* Classification
* Segmentation
* Keypoint Detection / Pose Estimation
* Object Detection

### Modalities

* Single Image
* Video
* Multispectral Imagery
* Images with auxiliary information (2.5d, flow, disparity, stereo)
* Combinations of the above

## Pseudo Spec

The following is pseudocode for the high-level spec (some of which may not have full support in the Python API). A formal JSON schema is defined in `kwcoco.coco_schema`.

```
# All object categories are defined here.
category = {
    'id': int,
    'name': str,           # unique name of the category
    'supercategory': str,  # parent category name
}

# Videos are used to manage collections of sequences of images.
video = {
    'id': int,
    'name': str,  # a unique name for this video.
}

# Specifies how to find sensor data of a particular scene at a particular
# time. This is usually a path to an rgb image, but auxiliary information
# can be used to specify multiple bands / etc...
image = {
    'id': int,
    'name': str,       # an encouraged but optional unique name
    'file_name': str,  # relative path to the "base" image data

    'width': int,   # pixel width of the "base" image
    'height': int,  # pixel height of the "base" image
    'channels': <ChannelSpec>,  # a string encoding of the channels in the main image

    'auxiliary': [  # information about any auxiliary channels / bands
        {
            'file_name': str,           # relative path to the associated file
            'channels': <ChannelSpec>,  # a string encoding
            'width': int,   # pixel width of the auxiliary image
            'height': int,  # pixel height of the auxiliary image
            'warp_aux_to_img': <TransformSpec>,  # transform from auxiliary image space to "base" image space. (identity if unspecified)
        }, ...
    ],

    'video_id': str,         # if this image is a frame in a video sequence, this id is shared by all frames in that sequence.
    'timestamp': str | int,  # an iso-string timestamp or an integer in flicks.
    'frame_index': int,      # ordinal frame index which can be used if timestamp is unknown.
    'warp_img_to_vid': <TransformSpec>,  # a transform from image space to video space (identity if unspecified); can be used for sensor alignment or video stabilization
}

TransformSpec:
    Currently there is only one spec that works with anything:
        {'type': 'affine', 'matrix': <a 3x3 matrix>},
    In the future we may do something like this:
        {'type': 'scale', 'factor': <float | Tuple[float, float]>},
        {'type': 'translate', 'offset': <float | Tuple[float, float]>},
        {'type': 'rotate', 'radians_ccw': <float>},

ChannelSpec:
    This is a string that describes the channel composition of an image.
    For the purposes of kwcoco, separate different channel names with a
    pipe ('|'). If the spec is not specified, methods may fall back on
    grayscale or rgb processing. There are special strings; for instance,
    'rgb' will expand into 'r|g|b'. In other applications you can "late
    fuse" inputs by separating them with a "," and "early fuse" by
    separating with a "|". Early fusion returns a solid array/tensor,
    late fusion returns separated arrays/tensors.

# Ground truth is specified as annotations, each of which belongs to a
# spatial region in an image. This must reference a subregion of the image
# in pixel coordinates. Additional non-schema properties can be specified
# to track location in other coordinate systems. Annotations can be linked
# over time by specifying track-ids.
annotation = {
    'id': int,
    'image_id': int,
    'category_id': int,

    'track_id': <int | str>,  # indicates association between annotations across frames

    'bbox': [tl_x, tl_y, w, h],  # xywh format
    'score': float,
    'prob': List[float],
    'weight': float,

    'caption': str,     # a text caption for this annotation
    'keypoints': ...,     # an accepted keypoint format
    'segmentation': ...,  # an accepted segmentation format
}

# A dataset bundles a manifest of all aforementioned data into one structure.
dataset = {
    'categories': [category, ...],
    'videos': [video, ...],
    'images': [image, ...],
    'annotations': [annotation, ...],
    'licenses': [],
    'info': [],
}
```
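Channel specs can also be manipulated programmatically. The following is a minimal sketch assuming a recent kwcoco version that exports the `ChannelSpec` helper class (with `coerce`, `normalize`, and `streams` methods):

```python
import kwcoco

# ',' separates late-fused streams; '|' early-fuses channels in a stream.
spec = kwcoco.ChannelSpec.coerce('rgb,disparity')

# normalize() expands special strings, e.g. the 'rgb' alias becomes 'r|g|b'
print(spec.normalize())

# Each stream would be loaded as its own solid array/tensor.
for stream in spec.streams():
    print(stream)
```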
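To make the spec concrete, here is a hypothetical minimal manifest for a two-frame video in which each frame carries an auxiliary disparity band (all file names are made up for illustration). Note how the shared `track_id` links the two annotations over time.

```python
import kwcoco

dataset = {
    'categories': [{'id': 1, 'name': 'cat', 'supercategory': 'animal'}],
    'videos': [{'id': 1, 'name': 'toy_video'}],
    'images': [
        {'id': 1, 'name': 'frame1', 'file_name': 'frames/frame1.jpg',
         'width': 600, 'height': 400, 'channels': 'r|g|b',
         'video_id': 1, 'frame_index': 0,
         'auxiliary': [{'file_name': 'aux/disparity1.tif',
                        'channels': 'disparity',
                        'width': 600, 'height': 400}]},
        {'id': 2, 'name': 'frame2', 'file_name': 'frames/frame2.jpg',
         'width': 600, 'height': 400, 'channels': 'r|g|b',
         'video_id': 1, 'frame_index': 1,
         'auxiliary': [{'file_name': 'aux/disparity2.tif',
                        'channels': 'disparity',
                        'width': 600, 'height': 400}]},
    ],
    'annotations': [
        {'id': 1, 'image_id': 1, 'category_id': 1, 'track_id': 1,
         'bbox': [10, 20, 100, 100]},
        {'id': 2, 'image_id': 2, 'category_id': 1, 'track_id': 1,
         'bbox': [15, 22, 100, 100]},
    ],
    'licenses': [],
    'info': [],
}

# An in-memory dict conforming to the spec can be wrapped directly.
dset = kwcoco.CocoDataset(dataset)
```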
## The Python API

### Creating a dataset

The Python API can be used to load an existing dataset or initialize an empty dataset. In both cases the dataset can be modified by adding/removing/editing categories, videos, images, and annotations.

You can load an existing dataset as follows:

```python
import kwcoco
dset = kwcoco.CocoDataset('path/to/data.kwcoco.json')
```

You can initialize an empty dataset as follows:

```python
import kwcoco
dset = kwcoco.CocoDataset()
```

In both cases you can add and remove data items. When you add an item, the internal integer primary id used to refer to that item is returned.

```python
cid = dset.add_category(name='cat')
gid = dset.add_image(file_name='/path/to/limecat.jpg')
aid = dset.add_annotation(image_id=gid, category_id=cid, bbox=[0, 0, 100, 100])
```

The `CocoDataset` class has an instance variable `dset.dataset`, which is the loaded JSON data structure. This structure can be interacted with directly.

```python
# Loop over all categories, images, and annotations
for cat in dset.dataset['categories']:
    print(cat)

for img in dset.dataset['images']:
    print(img)

for ann in dset.dataset['annotations']:
    print(ann)
```

In the above example, this will result in:

```
OrderedDict([('id', 1), ('name', 'cat')])
OrderedDict([('id', 1), ('file_name', '/path/to/limecat.jpg')])
OrderedDict([('id', 1), ('image_id', 1), ('category_id', 1), ('bbox', [0, 0, 100, 100])])
```

You can display the underlying `dataset` structure as follows:

```python
print(dset.dumps(indent='    ', newlines=True))
```

This results in:

```
{
    "info": [],
    "licenses": [],
    "categories": [
        {"id": 1, "name": "cat"}
    ],
    "videos": [],
    "images": [
        {"id": 1, "file_name": "/path/to/limecat.jpg"}
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [0, 0, 100, 100]}
    ]
}
```

In addition to accessing `dset.dataset` directly, the `CocoDataset` object maintains an `index` that allows the user to quickly look up objects by primary or secondary keys. The available indexes are:

```python
dset.index.anns    # a mapping from annotation-ids to annotation dictionaries
dset.index.imgs    # a mapping from image-ids to image dictionaries
dset.index.videos  # a mapping from video-ids to video dictionaries
dset.index.cats    # a mapping from category-ids to category dictionaries

dset.index.gid_to_aids    # a mapping from an image id to annotation ids contained in the image
dset.index.cid_to_aids    # a mapping from a category id to annotation ids with that category
dset.index.vidid_to_gids  # a mapping from a video id to image ids contained in the video

dset.index.name_to_video     # a mapping from a video name to the video dictionary
dset.index.name_to_cat       # a mapping from a category name to the category dictionary
dset.index.name_to_img       # a mapping from an image name to the image dictionary
dset.index.file_name_to_img  # a mapping from an image file name to the image dictionary
```

These indexes are dynamically updated when items are added or removed.
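For example, continuing the small dataset built above, the indexes can be combined to go from an image back to its annotations and their categories:

```python
# Look up the image dictionary by its file name
img = dset.index.file_name_to_img['/path/to/limecat.jpg']

# Find the annotations on that image and the category of each
for aid in dset.index.gid_to_aids[img['id']]:
    ann = dset.index.anns[aid]
    cat = dset.index.cats[ann['category_id']]
    print(cat['name'], ann['bbox'])
```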
### Using kwcoco to write a torch dataset

The easiest way to write a torch dataset with kwcoco is to combine it with `ndsampler`. Examples of using kwcoco + ndsampler to write torch datasets that train deep networks can be found in netharn's examples for detection, classification, and segmentation. A standalone sketch is also included under "Code Examples" at the end of this document.

## Technical Debt

Based on design decisions made in the original MS-COCO and KW-COCO, there are a few weird things:

* The "bbox" field gives no indication that it is in xywh format.
* We can't use "vid" as a variable name for "video-id" because "vid" is also an abbreviation for "video". Hence, while category, image, and annotation all have a nice one-letter prefix for their ids in the standard variable names I use (i.e. cid, gid, aid), I have to use "vidid" to refer to video-ids.
* I'm not in love with the way "keypoint_categories" are handled.
* Are "images" always "images"? Are "videos" always "videos"?
* Would we benefit from using JSON-LD?
* The "prob" field needs to be better defined.
* The name "video" might be confusing. It is just a temporally ordered group of images.

## Code Examples

See the README and the doctests.
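As a concluding example, here is a minimal, hypothetical sketch of a detection-style torch dataset backed directly by a `CocoDataset`. It only uses the indexes described earlier plus `dset.load_image`, and it assumes 3-channel imagery on disk; the ndsampler/netharn examples referenced above are more complete (they handle subwindow sampling and augmentation).

```python
import kwcoco
import torch
from torch.utils.data import Dataset


class CocoDetectionDataset(Dataset):
    """
    A minimal torch dataset over a kwcoco file. Each item is an image tensor
    plus the xywh boxes and category ids of its annotations.
    """

    def __init__(self, coco_fpath):
        self.dset = kwcoco.CocoDataset(coco_fpath)
        # index.imgs / gid_to_aids / anns are the indexes described above
        self.gids = sorted(self.dset.index.imgs.keys())

    def __len__(self):
        return len(self.gids)

    def __getitem__(self, index):
        gid = self.gids[index]
        # load_image reads the "base" image as an HxWxC numpy array
        imdata = self.dset.load_image(gid)
        aids = sorted(self.dset.index.gid_to_aids[gid])
        anns = [self.dset.index.anns[aid] for aid in aids]
        if anns:
            boxes = torch.tensor([ann['bbox'] for ann in anns], dtype=torch.float32)
            labels = torch.tensor([ann['category_id'] for ann in anns], dtype=torch.int64)
        else:
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels = torch.zeros((0,), dtype=torch.int64)
        # HxWxC uint8 -> CxHxW float in [0, 1] (assumes 3-channel rgb data)
        imchw = torch.from_numpy(imdata).permute(2, 0, 1).float() / 255.0
        return imchw, {'boxes': boxes, 'labels': labels}


torch_dset = CocoDetectionDataset('path/to/data.kwcoco.json')
```

Note that batching items with variable numbers of boxes requires a custom `collate_fn` when wrapping this in a `torch.utils.data.DataLoader`.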