kwcoco.coco_sql_dataset

  • [ ] We get better speeds with raw SQL over alchemy. Can we mitigate the speed difference so we can take advantage of alchemy’s expressiveness?

Finally got a baseline implementation of an SQLite backend for COCO datasets. This mostly plugs into my existing tools (as long as only read operations are used; I haven’t implemented writing yet) by duck-typing the dict API.

This solves the issue of forking and then accessing nested dictionaries in the JSON-style COCO objects. (When you access a dictionary, Python increments a reference count, which triggers copy-on-write for whatever memory page that data happens to live in. Non-contiguous access resulted in excessive memory copies.)
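To make the mechanism concrete, here is a small standalone demonstration (not part of kwcoco) of refcount-driven copy-on-write on a POSIX system; the data and numbers are illustrative only:

import multiprocessing
import resource

def touch_everything(data):
    # Merely reading each nested object bumps its refcount, dirtying the
    # memory pages it lives on and forcing the OS to copy them in the child.
    total = sum(len(row) for row in data)
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('child touched {} keys, max RSS ~{} (KB on Linux)'.format(total, rss))

if __name__ == '__main__':
    # A large list of small dicts, similar in spirit to COCO annotations
    data = [{'id': i, 'bbox': [i, i, 10, 10]} for i in range(1000000)]
    ctx = multiprocessing.get_context('fork')
    proc = ctx.Process(target=touch_everything, args=(data,))
    proc.start()
    proc.join()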

For “medium sized” datasets it’s quite a bit slower. Running 10,000 images through a torch DataLoader with 4 workers, the in-memory dictionary backend executes at a rate of 100Hz but takes 850MB of RAM. Using the duck-typed SQL backend only uses 500MB (which includes the cost of caching), but runs at 45Hz (which includes the benefit of caching).

However, once I scale up to 100,000 images I start seeing benefits. The in-memory dictionary interface chugs at 1.05Hz and was taking more than 4GB of memory by the time I killed the process (the ETA was over an hour). The SQL backend ran at 45Hz, took about 3 minutes, and used about 2.45GB of memory.

Without a cache, SQL runs at 30Hz and takes 400MB for 10,000 images, and for 100,000 images it gets 30Hz with 1.1GB. There is also a much larger startup time. I’m not exactly sure what causes it yet, but it’s probably some preprocessing I’m doing.

Using an LRU cache we get 45Hz and 1.05GB of memory, so that’s a clear win. We do need to be sure to disable the cache if we ever implement write mode.
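As a rough sketch of what row-level caching can look like (this is not the actual kwcoco proxy implementation; the class and names below are made up for illustration):

import functools
import sqlite3

class CachedRowLookup:
    """Sketch of a memoized single-row lookup; not the actual kwcoco proxy."""
    def __init__(self, conn, table='annotations', maxsize=1000):
        self.conn = conn
        self.table = table
        # The cache must be dropped / disabled if write mode is ever implemented.
        self._cached_lookup = functools.lru_cache(maxsize=maxsize)(self._lookup)

    def _lookup(self, rowid):
        cur = self.conn.execute(
            'SELECT * FROM {} WHERE id = ?'.format(self.table), (rowid,))
        row = cur.fetchone()
        if row is None:
            raise KeyError(rowid)
        return row

    def __getitem__(self, rowid):
        return self._cached_lookup(rowid)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE annotations (id INTEGER PRIMARY KEY, image_id INTEGER)')
conn.execute('INSERT INTO annotations VALUES (1, 42)')
proxy = CachedRowLookup(conn)
assert proxy[1] == proxy[1]  # the second access is served from the cache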

I’d like to be a bit faster on the medium sized datasets (I’d really like to avoid caching rows, which is currently what keeps the speed semi-reasonable), but I don’t think I can do any better than this because single-row lookup time is O(log(N)) for sqlite, whereas it’s O(1) for dictionaries. (I wish sqlite had an option to create a hash-table index for a table, but I don’t think it does.) I optimized as many of the dictionary operations as possible (for instance, iterating through keys, values, and items should be O(N) instead of O(N log(N))), but the majority of the runtime cost is in the single-row lookup time.
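For intuition, here is a standalone sqlite3 sketch (not kwcoco code) contrasting the two access patterns: per-key indexed lookups pay a B-tree descent per row, while a single streaming SELECT walks the table once.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE annotations (id INTEGER PRIMARY KEY, image_id INTEGER)')
conn.executemany('INSERT INTO annotations VALUES (?, ?)',
                 [(i, i % 100) for i in range(1, 10001)])

# O(N log N): one indexed lookup per key
ids = [r[0] for r in conn.execute('SELECT id FROM annotations ORDER BY id')]
rows_slow = [conn.execute('SELECT * FROM annotations WHERE id = ?', (i,)).fetchone()
             for i in ids]

# O(N): a single pass over the table in id order
rows_fast = list(conn.execute('SELECT * FROM annotations ORDER BY id'))
assert rows_slow == rows_fast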

There are a few questions I still have if anyone has insight:

  • Say I want to select a subset of K rows from a table with N entries, and I have a list of all of the rowids that I want. Is there any way to do this better than O(K log(N))? I tried using a SELECT col FROM table WHERE id IN (?, ?, ?, ?, ...) query, filling in as many ? as there are rows in my subset. I’m not sure what the complexity of a query like this is, or what the IN implementation looks like. Can this be done more efficiently with a temporary table and a JOIN? (A sketch of both approaches follows this list.)

  • There really is no way to do O(1) row lookup in sqlite, right? Is there a way in PostgreSQL or some other backend sqlalchemy supports?
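Here is an exploratory sqlite3 sketch of the two candidate approaches from the first question; which one scales better is exactly the open question (note that SQLite also caps the number of ? placeholders allowed per statement):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE annotations (id INTEGER PRIMARY KEY, category_id INTEGER)')
conn.executemany('INSERT INTO annotations VALUES (?, ?)',
                 [(i, i % 7) for i in range(1, 100001)])
wanted = list(range(10, 3000, 5))  # the K rowids we want

# Approach 1: WHERE id IN (?, ?, ..., ?) with one placeholder per rowid
placeholders = ','.join('?' * len(wanted))
rows_in = conn.execute(
    'SELECT category_id FROM annotations WHERE id IN ({})'.format(placeholders),
    wanted).fetchall()

# Approach 2: load the rowids into a temporary table and JOIN against it
conn.execute('CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)')
conn.executemany('INSERT INTO wanted_ids VALUES (?)', [(i,) for i in wanted])
rows_join = conn.execute(
    'SELECT a.category_id FROM wanted_ids w JOIN annotations a ON a.id = w.id'
).fetchall()

assert sorted(rows_in) == sorted(rows_join)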

I found that PostgreSQL does support hash indexes (https://www.postgresql.org/docs/13/indexes-types.html), but I’m really not interested in setting up a global service 😞. I also found a 10-year-old thread with a hash-index feature request for SQLite, which I unabashedly resurrected: http://sqlite.1065341.n5.nabble.com/Feature-request-hash-index-td23367.html

Module Contents

Classes

Category

KeypointCategory

Video

Image

Annotation

SqlListProxy

A view of an SQL table that behaves like a Python list

SqlDictProxy

Duck-types an SQL table as a dictionary of dictionaries.

SqlIdGroupDictProxy

Similar to SqlDictProxy, but maps ids to groups of other ids.

CocoSqlIndex

Simulates the dictionary provided by kwcoco.coco_dataset.CocoIndex

CocoSqlDatabase

Provides an API nearly identical to kwcoco.CocoDataset, but uses

Functions

orm_to_dict(obj)

_orm_yielder(query, size=300)

TODO: figure out the best way to yield, in batches or otherwise

_raw_yielder(result, size=300)

TODO: figure out the best way to yield, in batches or otherwise

_new_proxy_cache()

By returning None, we won’t use item caching

_handle_sql_uri(uri)

Temporary function to deal with URIs. Modern tools seem to use RFC 3986

cached_sql_coco_view(dct_db_fpath=None, sql_db_fpath=None, dset=None, force_rewrite=False)

Attempts to load a cached SQL-View dataset, only loading and converting the

ensure_sql_coco_view(dset, db_fpath=None, force_rewrite=False)

Create a cached on-disk SQL view of an on-disk COCO dataset.

demo(num=10)

assert_dsets_allclose(dset1, dset2, tag1='dset1', tag2='dset2')

_benchmark_dset_readtime(dset, tag='?')

Helper for understanding the time differences between backends

_benchmark_dict_proxy_ops(proxy)

Get insight on the efficiency of operations

devcheck()

Scratch work for things that should eventually become unit or doc tests

Attributes

CocoBase

ALCHEMY_MODE_DEFAULT

TBLNAME_TO_CLASS

tblname

kwcoco.coco_sql_dataset.CocoBase[source]
class kwcoco.coco_sql_dataset.Category[source]

Bases: CocoBase

__tablename__ = categories[source]
id[source]
name[source]
alias[source]
supercategory[source]
extra[source]
class kwcoco.coco_sql_dataset.KeypointCategory[source]

Bases: CocoBase

__tablename__ = keypoint_categories[source]
id[source]
name[source]
alias[source]
supercategory[source]
reflection_id[source]
extra[source]
class kwcoco.coco_sql_dataset.Video[source]

Bases: CocoBase

__tablename__ = videos[source]
id[source]
name[source]
caption[source]
width[source]
height[source]
extra[source]
class kwcoco.coco_sql_dataset.Image[source]

Bases: CocoBase

__tablename__ = images[source]
id[source]
name[source]
file_name[source]
width[source]
height[source]
video_id[source]
timestamp[source]
frame_index[source]
channels[source]
auxiliary[source]
extra[source]
class kwcoco.coco_sql_dataset.Annotation[source]

Bases: CocoBase

__tablename__ = annotations[source]
id[source]
image_id[source]
category_id[source]
track_id[source]
segmentation[source]
keypoints[source]
bbox[source]
_bbox_x[source]
_bbox_y[source]
_bbox_w[source]
_bbox_h[source]
score[source]
weight[source]
prob[source]
iscrowd[source]
caption[source]
extra[source]
kwcoco.coco_sql_dataset.ALCHEMY_MODE_DEFAULT = 0[source]
kwcoco.coco_sql_dataset.TBLNAME_TO_CLASS[source]
kwcoco.coco_sql_dataset.tblname[source]
kwcoco.coco_sql_dataset.orm_to_dict(obj)[source]
kwcoco.coco_sql_dataset._orm_yielder(query, size=300)[source]

TODO: figure out the best way to yield, in batches or otherwise

kwcoco.coco_sql_dataset._raw_yielder(result, size=300)[source]

TODO: figure out the best way to yield, in batches or otherwise

kwcoco.coco_sql_dataset._new_proxy_cache()[source]

By returning None, we won’t use item caching

class kwcoco.coco_sql_dataset.SqlListProxy(proxy, session, cls)[source]

Bases: ubelt.NiceRepr

A view of an SQL table that behaves like a Python list

__len__(proxy)[source]
__nice__(proxy)[source]
__iter__(proxy)[source]
__getitem__(proxy, index)[source]
__contains__(proxy, item)[source]
__setitem__(proxy, index, value)[source]
__delitem__(proxy, index)[source]
class kwcoco.coco_sql_dataset.SqlDictProxy(proxy, session, cls, keyattr=None, ignore_null=False)[source]

Bases: kwcoco.util.dict_like.DictLike

Duck-types an SQL table as a dictionary of dictionaries.

The key is specified by an indexed column (by default it is the id column). The values are dictionaries containing all data for that row.

Note

With SQLite, indexes are B-Trees, so lookup is O(log(N)) rather than the O(1) of regular dictionaries. Iteration should still be O(N), but databases have much more overhead than Python dictionaries.

Parameters
  • session (Session) – the sqlalchemy session

  • cls (Type) – the declarative sqlalchemy table class

  • keyattr – the indexed column to use as the keys

  • ignore_null (bool) – if True, ignores any keys set to NULL, otherwise NULL keys are allowed.

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> import pytest
>>> sql_dset, dct_dset = demo(num=10)
>>> proxy = sql_dset.index.anns
>>> keys = list(proxy.keys())
>>> values = list(proxy.values())
>>> items = list(proxy.items())
>>> item_keys = [t[0] for t in items]
>>> item_vals = [t[1] for t in items]
>>> lut_vals = [proxy[key] for key in keys]
>>> assert item_vals == lut_vals == values
>>> assert item_keys == keys
>>> assert len(proxy) == len(keys)
>>> goodkey1 = keys[1]
>>> badkey1 = -100000000000
>>> badkey2 = 'foobarbazbiz'
>>> badkey3 = object()
>>> assert goodkey1 in proxy
>>> assert badkey1 not in proxy
>>> assert badkey2 not in proxy
>>> assert badkey3 not in proxy
>>> with pytest.raises(KeyError):
>>>     proxy[badkey1]
>>> with pytest.raises(KeyError):
>>>     proxy[badkey2]
>>> with pytest.raises(KeyError):
>>>     proxy[badkey3]
>>> # xdoctest: +SKIP
>>> from kwcoco.coco_sql_dataset import _benchmark_dict_proxy_ops
>>> ti = _benchmark_dict_proxy_ops(proxy)
>>> print('ti.measures = {}'.format(ub.repr2(ti.measures, nl=2, align=':', precision=6)))

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> import kwcoco
>>> # Test the variant of the SqlDictProxy where we ignore None keys
>>> # This is the case for name_to_img and file_name_to_img
>>> dct_dset = kwcoco.CocoDataset.demo('shapes1')
>>> dct_dset.add_image(name='no_file_image1')
>>> dct_dset.add_image(name='no_file_image2')
>>> dct_dset.add_image(name='no_file_image3')
>>> sql_dset = dct_dset.view_sql(memory=True)
>>> assert len(dct_dset.index.imgs) == 4
>>> assert len(dct_dset.index.file_name_to_img) == 1
>>> assert len(dct_dset.index.name_to_img) == 3
>>> assert len(sql_dset.index.imgs) == 4
>>> assert len(sql_dset.index.file_name_to_img) == 1
>>> assert len(sql_dset.index.name_to_img) == 3
>>> proxy = sql_dset.index.file_name_to_img
>>> assert len(list(proxy.keys())) == 1
>>> assert len(list(proxy.values())) == 1
>>> proxy = sql_dset.index.name_to_img
>>> assert len(list(proxy.keys())) == 3
>>> assert len(list(proxy.values())) == 3
>>> proxy = sql_dset.index.imgs
>>> assert len(list(proxy.keys())) == 4
>>> assert len(list(proxy.values())) == 4
__len__(proxy)[source]
__nice__(proxy)[source]
__contains__(proxy, key)[source]
__getitem__(proxy, key)[source]
keys(proxy)[source]
values(proxy)[source]
items(proxy)[source]
class kwcoco.coco_sql_dataset.SqlIdGroupDictProxy(proxy, session, valattr, keyattr, parent_keyattr, group_order_attr=None)[source]

Bases: kwcoco.util.dict_like.DictLike

Similar to SqlDictProxy, but maps ids to groups of other ids.

Simulates a dictionary that maps ids of a parent table to all ids of another table corresponding to rows where a specific column has that parent id.

The items in the group can be sorted by the group_order_attr if specified.

For example, imagine two tables: images with one column (id) and annotations with two columns (id, image_id). This class can help provide a mapping from each image.id to a Set[annotation.id] where those annotation rows have annotation.image_id = image.id.
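For intuition, a standalone sqlite3 sketch (not the proxy’s actual query) of the mapping being simulated:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE images (id INTEGER PRIMARY KEY)')
conn.execute('CREATE TABLE annotations (id INTEGER PRIMARY KEY, image_id INTEGER)')
conn.executemany('INSERT INTO images VALUES (?)', [(1,), (2,), (3,)])
conn.executemany('INSERT INTO annotations VALUES (?, ?)', [(1, 1), (2, 1), (3, 2)])

# Build the image.id -> {annotation.id, ...} mapping the proxy simulates
gid_to_aids = {gid: set() for (gid,) in conn.execute('SELECT id FROM images')}
query = 'SELECT image_id, id FROM annotations WHERE image_id IS NOT NULL'
for image_id, aid in conn.execute(query):
    gid_to_aids[image_id].add(aid)
print(gid_to_aids)  # {1: {1, 2}, 2: {3}, 3: set()}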

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> sql_dset, dct_dset = demo(num=10)
>>> proxy = sql_dset.index.gid_to_aids
>>> keys = list(proxy.keys())
>>> values = list(proxy.values())
>>> items = list(proxy.items())
>>> item_keys = [t[0] for t in items]
>>> item_vals = [t[1] for t in items]
>>> lut_vals = [proxy[key] for key in keys]
>>> assert item_vals == lut_vals == values
>>> assert item_keys == keys
>>> assert len(proxy) == len(keys)
>>> # xdoctest: +SKIP
>>> from kwcoco.coco_sql_dataset import _benchmark_dict_proxy_ops
>>> ti = _benchmark_dict_proxy_ops(proxy)
>>> print('ti.measures = {}'.format(ub.repr2(ti.measures, nl=2, align=':', precision=6)))

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> import kwcoco
>>> # Test the group sorted variant of this by using vidid_to_gids
>>> # where the "gids" must be sorted by the image frame indexes
>>> dct_dset = kwcoco.CocoDataset.demo('vidshapes1')
>>> dct_dset.add_image(name='frame-index-order-demo1', frame_index=-30, video_id=1)
>>> dct_dset.add_image(name='frame-index-order-demo2', frame_index=10, video_id=1)
>>> dct_dset.add_image(name='frame-index-order-demo3', frame_index=3, video_id=1)
>>> dct_dset.add_video(name='empty-video1')
>>> dct_dset.add_video(name='empty-video2')
>>> dct_dset.add_video(name='empty-video3')
>>> sql_dset = dct_dset.view_sql(memory=True)
>>> orig = dct_dset.index.vidid_to_gids
>>> proxy = sql_dset.index.vidid_to_gids
>>> from kwcoco.util.util_json import indexable_allclose
>>> assert indexable_allclose(orig, dict(proxy))
>>> items = list(proxy.items())
>>> vals = list(proxy.values())
>>> keys = list(proxy.keys())
>>> assert len(keys) == len(vals)
>>> assert dict(zip(keys, vals)) == dict(items)
__nice__(self)[source]
__len__(proxy)[source]
__getitem__(proxy, key)[source]
__contains__(proxy, key)[source]
keys(proxy)[source]
items(proxy)[source]
values(proxy)[source]
class kwcoco.coco_sql_dataset.CocoSqlIndex(index)[source]

Bases: object

Simulates the dictionary provided by kwcoco.coco_dataset.CocoIndex

build(index, parent)[source]
kwcoco.coco_sql_dataset._handle_sql_uri(uri)[source]

Temporary function to deal with URIs. Modern tools seem to use RFC 3986 URIs, but sqlalchemy uses RFC 1738. Attempt to gracefully handle special cases. With a better understanding of the above specs, this function could probably be written more eloquently.
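For illustration only, a hedged sketch of the kind of normalization involved (normalize_sqlite_uri is a hypothetical helper, not the real function): sqlalchemy expects RFC 1738 style URLs such as sqlite:///relative/path.sqlite or sqlite:////absolute/path.sqlite, so bare filesystem paths need a scheme prefix.

import os

def normalize_sqlite_uri(uri):
    if uri == ':memory:':
        return 'sqlite:///:memory:'
    if uri.startswith('sqlite:'):
        return uri  # already an sqlalchemy-style URL
    # Bare path: absolute paths end up with four slashes, relative with three
    return 'sqlite:///' + uri

print(normalize_sqlite_uri('special.kwcoco.view.sqlite'))
print(normalize_sqlite_uri('/data/special.kwcoco.view.sqlite'))
print(normalize_sqlite_uri(':memory:'))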

class kwcoco.coco_sql_dataset.CocoSqlDatabase(uri=None, tag=None, img_root=None)[source]

Bases: kwcoco.abstract_coco_dataset.AbstractCocoDataset, kwcoco.coco_dataset.MixinCocoAccessors, kwcoco.coco_dataset.MixinCocoObjects, kwcoco.coco_dataset.MixinCocoStats, kwcoco.coco_dataset.MixinCocoDraw, ubelt.NiceRepr

Provides an API nearly identical to kwcoco.CocoDataset, but uses an SQL backend data store. This makes it robust to the copy-on-write memory issues that arise when forking, as discussed in [1].

Note

By default, constructing an instance of CocoSqlDatabase does not create a connection to the database. Use the connect() method to open a connection.

References

[1] https://github.com/pytorch/pytorch/issues/13246

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> sql_dset, dct_dset = demo()
>>> dset1, dset2 = sql_dset, dct_dset
>>> tag1, tag2 = 'dset1', 'dset2'
>>> assert_dsets_allclose(sql_dset, dct_dset)
MEMORY_URI = sqlite:///:memory:[source]
__nice__(self)[source]
classmethod coerce(self, data)[source]

Create an SQL CocoDataset from the input pointer.

Example

import kwcoco
dset = kwcoco.CocoDataset.demo('shapes8')
data = dset.fpath
self = CocoSqlDatabase.coerce(data)

from kwcoco.coco_sql_dataset import CocoSqlDatabase
import kwcoco
dset = kwcoco.CocoDataset.coerce('spacenet7.kwcoco.json')
self = CocoSqlDatabase.coerce(dset)

from kwcoco.coco_sql_dataset import CocoSqlDatabase
sql_dset = CocoSqlDatabase.coerce('spacenet7.kwcoco.json')

# from kwcoco.coco_sql_dataset import CocoSqlDatabase
import kwcoco
sql_dset = kwcoco.CocoDataset.coerce('_spacenet7.kwcoco.view.v006.sqlite')

__getstate__(self)[source]

Return only the minimal info when pickling this object.

Note

This object IS pickled when the multiprocessing context is “spawn”.

This object is NOT pickled when the multiprocessing context is “fork”. In this case the user needs to be careful to create new connections in the forked subprocesses.
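A hedged sketch of what that can look like with a torch DataLoader (this assumes torch is installed and the file name below is hypothetical): with the “fork” start method this object is inherited rather than pickled, so each worker reopens its own read-only connection in worker_init_fn instead of reusing the parent’s session.

import torch.utils.data
from kwcoco.coco_sql_dataset import CocoSqlDatabase

class AnnotIds(torch.utils.data.Dataset):
    def __init__(self, uri):
        self.dset = CocoSqlDatabase(uri)
        self.dset.connect(readonly=True)
        self.aids = list(self.dset.anns.keys())

    def __len__(self):
        return len(self.aids)

    def __getitem__(self, index):
        return self.dset.anns[self.aids[index]]['id']

def worker_init_fn(worker_id):
    # Re-establish a fresh read-only connection inside each forked worker
    worker_dset = torch.utils.data.get_worker_info().dataset
    worker_dset.dset.connect(readonly=True)

loader = torch.utils.data.DataLoader(
    AnnotIds('sqlite:///data.kwcoco.view.sqlite'), batch_size=4,
    num_workers=4, worker_init_fn=worker_init_fn)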

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> sql_dset, dct_dset = demo()
>>> # Test pickling works correctly
>>> import pickle
>>> serialized = pickle.dumps(sql_dset)
>>> assert len(serialized) < 3e4, 'should be very small'
>>> copy = pickle.loads(serialized)
>>> dset1, dset2, tag1, tag2 = sql_dset, copy, 'orig', 'copy'
>>> assert_dsets_allclose(dset1, dset2, tag1, tag2)
>>> # --- other methods of copying ---
>>> rw_copy = CocoSqlDatabase(
>>>     sql_dset.uri, img_root=sql_dset.img_root, tag=sql_dset.tag)
>>> rw_copy.connect()
>>> ro_copy = CocoSqlDatabase(
>>>     sql_dset.uri, img_root=sql_dset.img_root, tag=sql_dset.tag)
>>> ro_copy.connect(readonly=True)
>>> assert_dsets_allclose(dset1, ro_copy, tag1, 'ro-copy')
>>> assert_dsets_allclose(dset1, rw_copy, tag1, 'rw-copy')
__setstate__(self, state)[source]

Reopen readonly connections when unpickling the object.

disconnect(self)[source]

Drop references to any SQL or cache objects

connect(self, readonly=False)[source]

Connects this instance to the underlying database.

References

# details on read-only mode; some of these didn’t seem to work https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/sqlite/pysqlite.py#L71 https://github.com/pudo/dataset/issues/136 https://writeonly.wordpress.com/2009/07/16/simple-read-only-sqlalchemy-sessions/

property fpath(self)[source]
delete(self)[source]
populate_from(self, dset, verbose=1)[source]

Copy the information in a CocoDataset into this SQL database.

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import _benchmark_dset_readtime  # NOQA
>>> import kwcoco
>>> from kwcoco.coco_sql_dataset import *
>>> dset2 = dset = kwcoco.CocoDataset.demo()
>>> dset1 = self = CocoSqlDatabase('sqlite:///:memory:')
>>> self.connect()
>>> self.populate_from(dset)
>>> assert_dsets_allclose(dset1, dset2, tag1='sql', tag2='dct')
>>> ti_sql = _benchmark_dset_readtime(dset1, 'sql')
>>> ti_dct = _benchmark_dset_readtime(dset2, 'dct')
>>> print('ti_sql.rankings = {}'.format(ub.repr2(ti_sql.rankings, nl=2, precision=6, align=':')))
>>> print('ti_dct.rankings = {}'.format(ub.repr2(ti_dct.rankings, nl=2, precision=6, align=':')))
property dataset(self)[source]
property anns(self)[source]
property cats(self)[source]
property imgs(self)[source]
property name_to_cat(self)[source]
raw_table(self, table_name)[source]

Loads an entire SQL table as a pandas DataFrame

Parameters

table_name (str) – name of the table

Returns

DataFrame

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> self, dset = demo()
>>> table_df = self.raw_table('annotations')
>>> print(table_df)
_column_lookup(self, tablename, key, rowids, default=ub.NoParam, keepid=False)[source]

Convenience method to look up only a single column of information

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> self, dset = demo(10)
>>> tablename = 'annotations'
>>> key = 'category_id'
>>> rowids = list(self.anns.keys())[::3]
>>> cids1 = self._column_lookup(tablename, key, rowids)
>>> cids2 = self.annots(rowids).get(key)
>>> cids3 = dset.annots(rowids).get(key)
>>> assert cids3 == cids2 == cids1
_all_rows_column_lookup(self, tablename, keys)[source]

Convenience method to look up all rows from a table and only a few columns.

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> self, dset = demo(10)
>>> tablename = 'annotations'
>>> keys = ['id', 'category_id']
>>> rows = self._all_rows_column_lookup(tablename, keys)
tabular_targets(self)[source]

Convenience method to create an in-memory summary of basic annotation properties with minimal SQL overhead.

Example

>>> # xdoctest: +REQUIRES(module:sqlalchemy)
>>> from kwcoco.coco_sql_dataset import *  # NOQA
>>> self, dset = demo()
>>> targets = self.tabular_targets()
>>> print(targets.pandas())
property bundle_dpath(self)[source]
property data_fpath(self)[source]

data_fpath is an alias of fpath

kwcoco.coco_sql_dataset.cached_sql_coco_view(dct_db_fpath=None, sql_db_fpath=None, dset=None, force_rewrite=False)[source]

Attempts to load a cached SQL-View dataset, only loading and converting the json dataset if necessary.

kwcoco.coco_sql_dataset.ensure_sql_coco_view(dset, db_fpath=None, force_rewrite=False)[source]

Create a cached on-disk SQL view of an on-disk COCO dataset.

Note

This function is fragile. It depends on looking at file modified timestamps to determine if it needs to write the dataset.

kwcoco.coco_sql_dataset.demo(num=10)[source]
kwcoco.coco_sql_dataset.assert_dsets_allclose(dset1, dset2, tag1='dset1', tag2='dset2')[source]
kwcoco.coco_sql_dataset._benchmark_dset_readtime(dset, tag='?')[source]

Helper for understanding the time differences between backends

kwcoco.coco_sql_dataset._benchmark_dict_proxy_ops(proxy)[source]

Get insight on the efficiency of operations

kwcoco.coco_sql_dataset.devcheck()[source]

Scratch work for things that should eventually become unit or doc tests

from kwcoco.coco_sql_dataset import *  # NOQA
self, dset = demo()