:py:mod:`kwcoco.coco_sql_dataset`
=================================

.. py:module:: kwcoco.coco_sql_dataset

.. autoapi-nested-parse::

   .. todo::

      - [ ] We get better speeds with raw SQL over alchemy. Can we mitigate
        the speed difference so we can take advantage of alchemy's
        expressiveness?

   Finally got a baseline implementation of an SQLite backend for COCO
   datasets. This mostly plugs into my existing tools (as long as only read
   operations are used; haven't implemented writing yet) by duck-typing the
   dict API.

   This solves the issue of forking and then accessing nested dictionaries in
   the JSON-style COCO objects. (When you access a dictionary, Python
   increments a reference count, which triggers copy-on-write for whatever
   memory page that data happens to live in. Non-contiguous access had the
   effect of excessive memory copies.)

   For "medium sized" datasets it is quite a bit slower. Running through a
   torch DataLoader with 4 workers for 10,000 images executes at a rate of
   100Hz but takes 850MB of RAM. Using the duck-typed SQL backend only uses
   500MB (which includes the cost of caching), but runs at 45Hz (which
   includes the benefit of caching).

   However, once I scale up to 100,000 images I start seeing benefits. The
   in-memory dictionary interface chugs at 1.05Hz and was taking more than
   4GB of memory at the time I killed the process (the ETA was over an hour).
   The SQL backend ran at 45Hz, took about 3 minutes, and used about 2.45GB
   of memory.

   Without a cache, SQL runs at 30Hz and takes 400MB for 10,000 images, and
   for 100,000 images it gets 30Hz with 1.1GB. There is also a much larger
   startup time. I'm not exactly sure what it is yet, but it's probably some
   preprocessing I'm doing.

   Using an LRU cache we get 45Hz and 1.05GB of memory, so that's a clear
   win. We do need to be sure to disable the cache if we ever implement write
   mode.

   I'd like to be a bit faster on the medium sized datasets (I'd really like
   to avoid caching rows, which is why the speed is currently
   semi-reasonable), but I don't think I can do any better than this because
   single-row lookup time is ``O(log(N))`` for sqlite, whereas it is ``O(1)``
   for dictionaries. (I wish sqlite had an option to create a hash-table
   index for a table, but I don't think it does.)

   I optimized as many of the dictionary operations as possible (for
   instance, iterating through keys, values, and items should be ``O(N)``
   instead of ``O(N log(N))``), but the majority of the runtime cost is in
   the single-row lookup time.

   There are a few questions I still have if anyone has insight:

   * Say I want to select a subset of K rows from a table with N entries, and
     I have a list of all of the rowids that I want. Is there any way to do
     this better than ``O(K log(N))``? I tried using a
     ``SELECT col FROM table WHERE id IN (?, ?, ?, ?, ...)`` query, filling
     in as many ``?`` placeholders as there are rows in my subset. I'm not
     sure what the complexity of a query like this is, or what the ``IN``
     implementation looks like. Could this be done more efficiently with a
     temporary table and a ``JOIN``? (A sketch of both approaches follows
     this list.)

   * There really is no way to do ``O(1)`` row lookup in sqlite, right? Is
     there a way in PostgreSQL or some other backend sqlalchemy supports? I
     found that PostgreSQL does support hash indexes:
     https://www.postgresql.org/docs/13/indexes-types.html
     I'm really not interested in setting up a global service though 😞.
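   A minimal sketch of the two subset-selection strategies mentioned above,
   using the standard-library ``sqlite3`` module. The throwaway table and
   column names here are hypothetical and are not part of the kwcoco schema::

       import sqlite3

       conn = sqlite3.connect(':memory:')
       conn.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, value TEXT)')
       conn.executemany('INSERT INTO demo VALUES (?, ?)',
                        [(i, str(i)) for i in range(1000)])
       wanted = [3, 500, 999]

       # Strategy 1: a parameterized IN clause, one "?" per requested rowid
       placeholders = ','.join('?' * len(wanted))
       rows1 = conn.execute(
           'SELECT value FROM demo WHERE id IN ({})'.format(placeholders),
           wanted).fetchall()

       # Strategy 2: stage the requested ids in a temporary table and JOIN
       conn.execute('CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)')
       conn.executemany('INSERT INTO wanted_ids VALUES (?)',
                        [(i,) for i in wanted])
       rows2 = conn.execute(
           'SELECT demo.value FROM demo '
           'JOIN wanted_ids ON demo.id = wanted_ids.id').fetchall()

       assert sorted(rows1) == sorted(rows2)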
   I also found a 10-year-old thread with a hash-index feature request for
   SQLite, which I unabashedly resurrected:
   http://sqlite.1065341.n5.nabble.com/Feature-request-hash-index-td23367.html


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   kwcoco.coco_sql_dataset.Category
   kwcoco.coco_sql_dataset.KeypointCategory
   kwcoco.coco_sql_dataset.Video
   kwcoco.coco_sql_dataset.Image
   kwcoco.coco_sql_dataset.Annotation
   kwcoco.coco_sql_dataset.SqlListProxy
   kwcoco.coco_sql_dataset.SqlDictProxy
   kwcoco.coco_sql_dataset.SqlIdGroupDictProxy
   kwcoco.coco_sql_dataset.CocoSqlIndex
   kwcoco.coco_sql_dataset.CocoSqlDatabase


Functions
~~~~~~~~~

.. autoapisummary::

   kwcoco.coco_sql_dataset.orm_to_dict
   kwcoco.coco_sql_dataset._orm_yielder
   kwcoco.coco_sql_dataset._raw_yielder
   kwcoco.coco_sql_dataset._new_proxy_cache
   kwcoco.coco_sql_dataset._handle_sql_uri
   kwcoco.coco_sql_dataset.cached_sql_coco_view
   kwcoco.coco_sql_dataset.ensure_sql_coco_view
   kwcoco.coco_sql_dataset.demo
   kwcoco.coco_sql_dataset.assert_dsets_allclose
   kwcoco.coco_sql_dataset._benchmark_dset_readtime
   kwcoco.coco_sql_dataset._benchmark_dict_proxy_ops
   kwcoco.coco_sql_dataset.devcheck


Attributes
~~~~~~~~~~

.. autoapisummary::

   kwcoco.coco_sql_dataset.CocoBase
   kwcoco.coco_sql_dataset.ALCHEMY_MODE_DEFAULT
   kwcoco.coco_sql_dataset.TBLNAME_TO_CLASS
   kwcoco.coco_sql_dataset.tblname


.. py:data:: CocoBase

.. py:class:: Category

   Bases: :py:obj:`CocoBase`

   .. py:attribute:: __tablename__
      :annotation: = categories

   .. py:attribute:: id

   .. py:attribute:: name

   .. py:attribute:: alias

   .. py:attribute:: supercategory

   .. py:attribute:: extra


.. py:class:: KeypointCategory

   Bases: :py:obj:`CocoBase`

   .. py:attribute:: __tablename__
      :annotation: = keypoint_categories

   .. py:attribute:: id

   .. py:attribute:: name

   .. py:attribute:: alias

   .. py:attribute:: supercategory

   .. py:attribute:: reflection_id

   .. py:attribute:: extra


.. py:class:: Video

   Bases: :py:obj:`CocoBase`

   .. py:attribute:: __tablename__
      :annotation: = videos

   .. py:attribute:: id

   .. py:attribute:: name

   .. py:attribute:: caption

   .. py:attribute:: width

   .. py:attribute:: height

   .. py:attribute:: extra


.. py:class:: Image

   Bases: :py:obj:`CocoBase`

   .. py:attribute:: __tablename__
      :annotation: = images

   .. py:attribute:: id

   .. py:attribute:: name

   .. py:attribute:: file_name

   .. py:attribute:: width

   .. py:attribute:: height

   .. py:attribute:: video_id

   .. py:attribute:: timestamp

   .. py:attribute:: frame_index

   .. py:attribute:: channels

   .. py:attribute:: auxiliary

   .. py:attribute:: extra


.. py:class:: Annotation

   Bases: :py:obj:`CocoBase`

   .. py:attribute:: __tablename__
      :annotation: = annotations

   .. py:attribute:: id

   .. py:attribute:: image_id

   .. py:attribute:: category_id

   .. py:attribute:: track_id

   .. py:attribute:: segmentation

   .. py:attribute:: keypoints

   .. py:attribute:: bbox

   .. py:attribute:: _bbox_x

   .. py:attribute:: _bbox_y

   .. py:attribute:: _bbox_w

   .. py:attribute:: _bbox_h

   .. py:attribute:: score

   .. py:attribute:: weight

   .. py:attribute:: prob

   .. py:attribute:: iscrowd

   .. py:attribute:: caption

   .. py:attribute:: extra


.. py:data:: ALCHEMY_MODE_DEFAULT
   :annotation: = 0

.. py:data:: TBLNAME_TO_CLASS

.. py:data:: tblname

.. py:function:: orm_to_dict(obj)

.. py:function:: _orm_yielder(query, size=300)

   TODO: figure out the best way to yield, in batches or otherwise

.. py:function:: _raw_yielder(result, size=300)

   TODO: figure out the best way to yield, in batches or otherwise

.. py:function:: _new_proxy_cache()

   By returning None, we won't use item caching
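The ``_orm_yielder`` and ``_raw_yielder`` helpers above share the open
question noted in their TODOs: how best to stream rows in batches. The
following is only a sketch of the general batching pattern using the
standard-library ``sqlite3`` cursor and a throwaway table; it is not the
implementation of either helper::

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY)')
    conn.executemany('INSERT INTO demo VALUES (?)', [(i,) for i in range(1000)])

    def batched_yielder(cursor, size=300):
        # yield rows in chunks of ``size`` instead of materializing them all
        while True:
            chunk = cursor.fetchmany(size)
            if not chunk:
                break
            yield from chunk

    cursor = conn.execute('SELECT id FROM demo')
    assert sum(1 for _ in batched_yielder(cursor, size=300)) == 1000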
.. py:class:: SqlListProxy(proxy, session, cls)

   Bases: :py:obj:`ubelt.NiceRepr`

   A view of an SQL table that behaves like a Python list

   .. py:method:: __len__(proxy)

   .. py:method:: __nice__(proxy)

   .. py:method:: __iter__(proxy)

   .. py:method:: __getitem__(proxy, index)

   .. py:method:: __contains__(proxy, item)

   .. py:method:: __setitem__(proxy, index, value)

   .. py:method:: __delitem__(proxy, index)


.. py:class:: SqlDictProxy(proxy, session, cls, keyattr=None, ignore_null=False)

   Bases: :py:obj:`kwcoco.util.dict_like.DictLike`

   Duck-types an SQL table as a dictionary of dictionaries.

   The key is specified by an indexed column (by default it is the ``id``
   column). The values are dictionaries containing all data for that row.

   .. note::

      With SQLite, indexes are B-Trees, so lookup is O(log(N)) and not O(1)
      as with regular dictionaries. Iteration should still be O(N), but
      databases have much more overhead than Python dictionaries.

   :Parameters: * **session** (*Session*) -- the sqlalchemy session
                * **cls** (*Type*) -- the declarative sqlalchemy table class
                * **keyattr** -- the indexed column to use as the keys
                * **ignore_null** (*bool*) -- if True, ignores any keys set to
                  NULL, otherwise NULL keys are allowed.

   .. rubric:: Example

   >>> # xdoctest: +REQUIRES(module:sqlalchemy)
   >>> from kwcoco.coco_sql_dataset import * # NOQA
   >>> import pytest
   >>> sql_dset, dct_dset = demo(num=10)
   >>> proxy = sql_dset.index.anns
   >>> keys = list(proxy.keys())
   >>> values = list(proxy.values())
   >>> items = list(proxy.items())
   >>> item_keys = [t[0] for t in items]
   >>> item_vals = [t[1] for t in items]
   >>> lut_vals = [proxy[key] for key in keys]
   >>> assert item_vals == lut_vals == values
   >>> assert item_keys == keys
   >>> assert len(proxy) == len(keys)
   >>> goodkey1 = keys[1]
   >>> badkey1 = -100000000000
   >>> badkey2 = 'foobarbazbiz'
   >>> badkey3 = object()
   >>> assert goodkey1 in proxy
   >>> assert badkey1 not in proxy
   >>> assert badkey2 not in proxy
   >>> assert badkey3 not in proxy
   >>> with pytest.raises(KeyError):
   >>>     proxy[badkey1]
   >>> with pytest.raises(KeyError):
   >>>     proxy[badkey2]
   >>> with pytest.raises(KeyError):
   >>>     proxy[badkey3]
   >>> # xdoctest: +SKIP
   >>> from kwcoco.coco_sql_dataset import _benchmark_dict_proxy_ops
   >>> ti = _benchmark_dict_proxy_ops(proxy)
   >>> print('ti.measures = {}'.format(ub.repr2(ti.measures, nl=2, align=':', precision=6)))

   .. rubric:: Example

   >>> # xdoctest: +REQUIRES(module:sqlalchemy)
   >>> from kwcoco.coco_sql_dataset import * # NOQA
   >>> import kwcoco
   >>> # Test the variant of the SqlDictProxy where we ignore None keys
   >>> # This is the case for name_to_img and file_name_to_img
   >>> dct_dset = kwcoco.CocoDataset.demo('shapes1')
   >>> dct_dset.add_image(name='no_file_image1')
   >>> dct_dset.add_image(name='no_file_image2')
   >>> dct_dset.add_image(name='no_file_image3')
   >>> sql_dset = dct_dset.view_sql(memory=True)
   >>> assert len(dct_dset.index.imgs) == 4
   >>> assert len(dct_dset.index.file_name_to_img) == 1
   >>> assert len(dct_dset.index.name_to_img) == 3
   >>> assert len(sql_dset.index.imgs) == 4
   >>> assert len(sql_dset.index.file_name_to_img) == 1
   >>> assert len(sql_dset.index.name_to_img) == 3
   >>> proxy = sql_dset.index.file_name_to_img
   >>> assert len(list(proxy.keys())) == 1
   >>> assert len(list(proxy.values())) == 1
   >>> proxy = sql_dset.index.name_to_img
   >>> assert len(list(proxy.keys())) == 3
   >>> assert len(list(proxy.values())) == 3
   >>> proxy = sql_dset.index.imgs
   >>> assert len(list(proxy.keys())) == 4
   >>> assert len(list(proxy.values())) == 4

   .. py:method:: __len__(proxy)

   .. py:method:: __nice__(proxy)
   .. py:method:: __contains__(proxy, key)

   .. py:method:: __getitem__(proxy, key)

   .. py:method:: keys(proxy)

   .. py:method:: values(proxy)

   .. py:method:: items(proxy)


.. py:class:: SqlIdGroupDictProxy(proxy, session, valattr, keyattr, parent_keyattr, group_order_attr=None)

   Bases: :py:obj:`kwcoco.util.dict_like.DictLike`

   Similar to :class:`SqlDictProxy`, but maps ids to groups of other ids.

   Simulates a dictionary that maps ids of a parent table to all ids of
   another table corresponding to rows where a specific column has that
   parent id. The items in the group can be sorted by the
   ``group_order_attr`` if specified.

   For example, imagine two tables: images with one column (id) and
   annotations with two columns (id, image_id). This class can help provide
   a mapping from each ``image.id`` to a ``Set[annotation.id]`` where those
   annotation rows have ``annotation.image_id = image.id``.

   .. rubric:: Example

   >>> # xdoctest: +REQUIRES(module:sqlalchemy)
   >>> from kwcoco.coco_sql_dataset import * # NOQA
   >>> sql_dset, dct_dset = demo(num=10)
   >>> proxy = sql_dset.index.gid_to_aids
   >>> keys = list(proxy.keys())
   >>> values = list(proxy.values())
   >>> items = list(proxy.items())
   >>> item_keys = [t[0] for t in items]
   >>> item_vals = [t[1] for t in items]
   >>> lut_vals = [proxy[key] for key in keys]
   >>> assert item_vals == lut_vals == values
   >>> assert item_keys == keys
   >>> assert len(proxy) == len(keys)
   >>> # xdoctest: +SKIP
   >>> from kwcoco.coco_sql_dataset import _benchmark_dict_proxy_ops
   >>> ti = _benchmark_dict_proxy_ops(proxy)
   >>> print('ti.measures = {}'.format(ub.repr2(ti.measures, nl=2, align=':', precision=6)))

   .. rubric:: Example

   >>> # xdoctest: +REQUIRES(module:sqlalchemy)
   >>> from kwcoco.coco_sql_dataset import * # NOQA
   >>> import kwcoco
   >>> # Test the group sorted variant of this by using vidid_to_gids
   >>> # where the "gids" must be sorted by the image frame indexes
   >>> dct_dset = kwcoco.CocoDataset.demo('vidshapes1')
   >>> dct_dset.add_image(name='frame-index-order-demo1', frame_index=-30, video_id=1)
   >>> dct_dset.add_image(name='frame-index-order-demo2', frame_index=10, video_id=1)
   >>> dct_dset.add_image(name='frame-index-order-demo3', frame_index=3, video_id=1)
   >>> dct_dset.add_video(name='empty-video1')
   >>> dct_dset.add_video(name='empty-video2')
   >>> dct_dset.add_video(name='empty-video3')
   >>> sql_dset = dct_dset.view_sql(memory=True)
   >>> orig = dct_dset.index.vidid_to_gids
   >>> proxy = sql_dset.index.vidid_to_gids
   >>> from kwcoco.util.util_json import indexable_allclose
   >>> assert indexable_allclose(orig, dict(proxy))
   >>> items = list(proxy.items())
   >>> vals = list(proxy.values())
   >>> keys = list(proxy.keys())
   >>> assert len(keys) == len(vals)
   >>> assert dict(zip(keys, vals)) == dict(items)

   .. py:method:: __nice__(self)

   .. py:method:: __len__(proxy)

   .. py:method:: __getitem__(proxy, key)

   .. py:method:: __contains__(proxy, key)

   .. py:method:: keys(proxy)

   .. py:method:: items(proxy)

   .. py:method:: values(proxy)


.. py:class:: CocoSqlIndex(index)

   Bases: :py:obj:`object`

   Simulates the dictionary provided by :class:`kwcoco.coco_dataset.CocoIndex`

   .. py:method:: build(index, parent)


.. py:function:: _handle_sql_uri(uri)

   Temporary function to deal with URIs. Modern tools seem to use RFC 3986
   URIs, but sqlalchemy uses RFC 1738. Attempt to gracefully handle special
   cases. With a better understanding of the above specs, this function may
   be able to be written more eloquently.
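   A hedged illustration of the mismatch between the two conventions, using
   the standard-library and sqlalchemy URL parsers. This only demonstrates
   the discrepancy this function must reconcile; it is not the function's
   implementation, and the path in the example is hypothetical::

       from urllib.parse import urlparse            # parses RFC 3986 style URIs
       from sqlalchemy.engine.url import make_url   # parses RFC 1738 style URLs

       uri = 'sqlite:///relative/path.sqlite'
       urlparse(uri).path      # -> '/relative/path.sqlite' (keeps a leading slash)
       make_url(uri).database  # -> 'relative/path.sqlite'  (the sqlite database path)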
.. py:class:: CocoSqlDatabase(uri=None, tag=None, img_root=None)

   Bases: :py:obj:`kwcoco.abstract_coco_dataset.AbstractCocoDataset`, :py:obj:`kwcoco.coco_dataset.MixinCocoAccessors`, :py:obj:`kwcoco.coco_dataset.MixinCocoObjects`, :py:obj:`kwcoco.coco_dataset.MixinCocoStats`, :py:obj:`kwcoco.coco_dataset.MixinCocoDraw`, :py:obj:`ubelt.NiceRepr`

   Provides an API nearly identical to :class:`kwcoco.CocoDataset`, but uses
   an SQL backend data store. This makes it robust to copy-on-write memory
   issues that arise when forking, as discussed in [1]_.

   .. note::

      By default, constructing an instance of CocoSqlDatabase does not create
      a connection to the database. Use the :func:`connect` method to open a
      connection.

   .. rubric:: References

   .. [1] https://github.com/pytorch/pytorch/issues/13246

   .. rubric:: Example

   >>> # xdoctest: +REQUIRES(module:sqlalchemy)
   >>> from kwcoco.coco_sql_dataset import * # NOQA
   >>> sql_dset, dct_dset = demo()
   >>> dset1, dset2 = sql_dset, dct_dset
   >>> tag1, tag2 = 'dset1', 'dset2'
   >>> assert_dsets_allclose(sql_dset, dct_dset)

   .. py:attribute:: MEMORY_URI
      :annotation: = sqlite:///:memory:

   .. py:method:: __nice__(self)

   .. py:method:: coerce(self, data)
      :classmethod:

      Create an SQL CocoDataset from the input pointer.

      .. rubric:: Example

      .. code-block:: python

         import kwcoco
         dset = kwcoco.CocoDataset.demo('shapes8')
         data = dset.fpath
         self = CocoSqlDatabase.coerce(data)

         from kwcoco.coco_sql_dataset import CocoSqlDatabase
         import kwcoco
         dset = kwcoco.CocoDataset.coerce('spacenet7.kwcoco.json')
         self = CocoSqlDatabase.coerce(dset)

         from kwcoco.coco_sql_dataset import CocoSqlDatabase
         sql_dset = CocoSqlDatabase.coerce('spacenet7.kwcoco.json')

         # from kwcoco.coco_sql_dataset import CocoSqlDatabase
         import kwcoco
         sql_dset = kwcoco.CocoDataset.coerce('_spacenet7.kwcoco.view.v006.sqlite')

   .. py:method:: __getstate__(self)

      Return only the minimal info when pickling this object.

      .. note::

         This object IS pickled when the multiprocessing context is "spawn".
         This object is NOT pickled when the multiprocessing context is
         "fork". In this case the user needs to be careful to create new
         connections in the forked subprocesses.

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import * # NOQA
      >>> sql_dset, dct_dset = demo()
      >>> # Test pickling works correctly
      >>> import pickle
      >>> serialized = pickle.dumps(sql_dset)
      >>> assert len(serialized) < 3e4, 'should be very small'
      >>> copy = pickle.loads(serialized)
      >>> dset1, dset2, tag1, tag2 = sql_dset, copy, 'orig', 'copy'
      >>> assert_dsets_allclose(dset1, dset2, tag1, tag2)
      >>> # --- other methods of copying ---
      >>> rw_copy = CocoSqlDatabase(
      >>>     sql_dset.uri, img_root=sql_dset.img_root, tag=sql_dset.tag)
      >>> rw_copy.connect()
      >>> ro_copy = CocoSqlDatabase(
      >>>     sql_dset.uri, img_root=sql_dset.img_root, tag=sql_dset.tag)
      >>> ro_copy.connect(readonly=True)
      >>> assert_dsets_allclose(dset1, ro_copy, tag1, 'ro-copy')
      >>> assert_dsets_allclose(dset1, rw_copy, tag1, 'rw-copy')

   .. py:method:: __setstate__(self, state)

      Reopen new readonly connections when unpickling the object.

   .. py:method:: disconnect(self)

      Drop references to any SQL or cache objects

   .. py:method:: connect(self, readonly=False)

      Connects this instance to the underlying database.
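      .. rubric:: Example

      A minimal sketch of reconnecting to an existing view in read-only mode.
      It assumes the :func:`demo` helper and mirrors the pattern already used
      in the :func:`__getstate__` example above rather than documenting any
      additional behavior.

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import * # NOQA
      >>> sql_dset, dct_dset = demo()
      >>> ro_view = CocoSqlDatabase(
      >>>     sql_dset.uri, img_root=sql_dset.img_root, tag='ro-view')
      >>> ro_view.connect(readonly=True)
      >>> # reads go through the same duck-typed index as the original view
      >>> assert len(ro_view.index.imgs) == len(sql_dset.index.imgs)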
      .. rubric:: References

      Details on read-only mode; some of these didn't seem to work:

      * https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/sqlite/pysqlite.py#L71
      * https://github.com/pudo/dataset/issues/136
      * https://writeonly.wordpress.com/2009/07/16/simple-read-only-sqlalchemy-sessions/

   .. py:method:: fpath(self)
      :property:

   .. py:method:: delete(self)

   .. py:method:: populate_from(self, dset, verbose=1)

      Copy the information in a :class:`CocoDataset` into this SQL database.

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import _benchmark_dset_readtime  # NOQA
      >>> import kwcoco
      >>> from kwcoco.coco_sql_dataset import *
      >>> dset2 = dset = kwcoco.CocoDataset.demo()
      >>> dset1 = self = CocoSqlDatabase('sqlite:///:memory:')
      >>> self.connect()
      >>> self.populate_from(dset)
      >>> assert_dsets_allclose(dset1, dset2, tag1='sql', tag2='dct')
      >>> ti_sql = _benchmark_dset_readtime(dset1, 'sql')
      >>> ti_dct = _benchmark_dset_readtime(dset2, 'dct')
      >>> print('ti_sql.rankings = {}'.format(ub.repr2(ti_sql.rankings, nl=2, precision=6, align=':')))
      >>> print('ti_dct.rankings = {}'.format(ub.repr2(ti_dct.rankings, nl=2, precision=6, align=':')))

   .. py:method:: dataset(self)
      :property:

   .. py:method:: anns(self)
      :property:

   .. py:method:: cats(self)
      :property:

   .. py:method:: imgs(self)
      :property:

   .. py:method:: name_to_cat(self)
      :property:

   .. py:method:: raw_table(self, table_name)

      Loads an entire SQL table as a pandas DataFrame

      :Parameters: **table_name** (*str*) -- name of the table

      :returns: DataFrame

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import * # NOQA
      >>> self, dset = demo()
      >>> table_df = self.raw_table('annotations')
      >>> print(table_df)

   .. py:method:: _column_lookup(self, tablename, key, rowids, default=ub.NoParam, keepid=False)

      Convenience method to look up only a single column of information

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import * # NOQA
      >>> self, dset = demo(10)
      >>> tablename = 'annotations'
      >>> key = 'category_id'
      >>> rowids = list(self.anns.keys())[::3]
      >>> cids1 = self._column_lookup(tablename, key, rowids)
      >>> cids2 = self.annots(rowids).get(key)
      >>> cids3 = dset.annots(rowids).get(key)
      >>> assert cids3 == cids2 == cids1

   .. py:method:: _all_rows_column_lookup(self, tablename, keys)

      Convenience method to look up all rows from a table and only a few
      columns.

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import * # NOQA
      >>> self, dset = demo(10)
      >>> tablename = 'annotations'
      >>> keys = ['id', 'category_id']
      >>> rows = self._all_rows_column_lookup(tablename, keys)

   .. py:method:: tabular_targets(self)

      Convenience method to create an in-memory summary of basic annotation
      properties with minimal SQL overhead.

      .. rubric:: Example

      >>> # xdoctest: +REQUIRES(module:sqlalchemy)
      >>> from kwcoco.coco_sql_dataset import * # NOQA
      >>> self, dset = demo()
      >>> targets = self.tabular_targets()
      >>> print(targets.pandas())

   .. py:method:: bundle_dpath(self)
      :property:

   .. py:method:: data_fpath(self)
      :property:

      data_fpath is an alias of fpath


.. py:function:: cached_sql_coco_view(dct_db_fpath=None, sql_db_fpath=None, dset=None, force_rewrite=False)

   Attempts to load a cached SQL-View dataset, only loading and converting
   the json dataset if necessary.
.. py:function:: ensure_sql_coco_view(dset, db_fpath=None, force_rewrite=False)

   Create a cached on-disk SQL view of an on-disk COCO dataset.

   .. note::

      This function is fragile. It depends on looking at file modified
      timestamps to determine if it needs to write the dataset.

.. py:function:: demo(num=10)

.. py:function:: assert_dsets_allclose(dset1, dset2, tag1='dset1', tag2='dset2')

.. py:function:: _benchmark_dset_readtime(dset, tag='?')

   Helper for understanding the time differences between backends

.. py:function:: _benchmark_dict_proxy_ops(proxy)

   Get insight on the efficiency of operations

.. py:function:: devcheck()

   Scratch work for things that should eventually become unit or doc tests

   .. code-block:: python

      from kwcoco.coco_sql_dataset import * # NOQA
      self, dset = demo()