What is unstructured data?
Unstructured data refers to data that cannot be stored in a tabular or key-value format. Nearly all human-generated data (images, video, text, etc.) is unstructured; some market analysts estimate that over 80% of data generated by 2024 will be unstructured. Towhee is the first open-source project designed to process a variety of unstructured data using ETL pipelines.
To accomplish this, we built Towhee atop popular machine learning and unstructured data processing libraries, such as transformers. Models or functions from different libraries are wrapped as standard Towhee operators, and can be freely integrated into application-oriented pipelines using a Pythonic API. To ensure user-friendliness, pre-built pipelines can also be called in just a single line of code, without the need to understand the underlying models or modules used to build them.
For more information, take a look at our quick start page.
Problems Towhee solves
Modern ML applications require far more than a single neural network. Running a modern ML application in production requires a combination of online pre-processing, data transformation, the models themselves, and other ML-related tools. Building an application that recognizes objects within a video, for example, involves decompression, key-frame extraction, image deduplication, object detection, etc. This necessitates a platform that offers a fast and robust method for developing end-to-end application pipelines that use ML models in addition to supporting data parallelism and resource management.
Towhee solves this problem by reintroducing the concept of Pipeline as being application-centric instead of model-centric. Where model-centric pipelines are composed of a single model followed by auxiliary code, application-centric pipelines treat every single data processing step as a first-class citizen. Towhee also exposes a Pythonic API for developing more complex applications in just a couple of lines of code.
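To make the application-centric idea concrete, here is a minimal sketch in plain Python. It is not Towhee's API: the `LinearPipeline` class and the four step functions are hypothetical stand-ins, and the "video" is just a list of integers. The point is only that decoding, key-frame extraction, deduplication, and detection are all registered as equal, named steps rather than hidden around a single model.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LinearPipeline:
    """Hypothetical stand-in for illustration; not Towhee's Pipeline class."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add(self, name: str, fn: Callable[[Any], Any]) -> "LinearPipeline":
        # Every step is registered the same way, model or not.
        self.steps.append((name, fn))
        return self

    def __call__(self, data: Any) -> Any:
        # Feed the output of each step into the next one.
        for _name, fn in self.steps:
            data = fn(data)
        return data


# Toy steps: a "video" is a list of integers standing in for frames.
def decode(video):
    return list(video)


def key_frames(frames):
    return frames[::2]  # keep every second frame


def dedupe(frames):
    return list(dict.fromkeys(frames))  # drop repeats, keep order


def detect(frames):
    # Placeholder "model": labels a frame by parity.
    return [{"frame": f, "label": "even" if f % 2 == 0 else "odd"} for f in frames]


video_pipeline = (
    LinearPipeline()
    .add("decode", decode)
    .add("key_frames", key_frames)
    .add("dedupe", dedupe)
    .add("detect", detect)
)

result = video_pipeline([4, 4, 7, 9, 7, 1])
step_names = [name for name, _ in video_pipeline.steps]
```

Because each step is an ordinary named entry, any of them can be swapped or tested on its own without touching the others.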
Too many model implementations exist without any interface standard. Machine learning models (NN-based and traditional) are ubiquitous. Different implementations of machine learning models require different auxiliary code to support testing and fine-tuning, making model evaluation and productionization a tedious task.
Towhee solves this by providing a universal Operator wrapper for dataset loading, basic data transformations, ML models, and other miscellaneous scripts. Operators have a pre-defined API and glue logic to make Towhee work with a number of machine learning and data processing libraries. Operators can be chained together in a DAG to form entire ML applications.
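The idea of a uniform wrapper whose instances are chained in a DAG can be sketched as follows. This is an illustration under stated assumptions, not Towhee's implementation: the `Operator` class, the `run_dag` helper, and the reserved node name `"input"` are all hypothetical, and the ordering relies on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Sequence


class Operator:
    """Hypothetical uniform wrapper: a name, upstream node names, a callable."""

    def __init__(self, name: str, fn: Callable[..., Any], inputs: Sequence[str]) -> None:
        self.name = name
        self.fn = fn
        self.inputs = tuple(inputs)

    def __call__(self, *args: Any) -> Any:
        return self.fn(*args)


def run_dag(operators: List[Operator], source: Any) -> Dict[str, Any]:
    """Run operators in dependency order; "input" names the source value."""
    by_name = {op.name: op for op in operators}
    # Map each node to the operators it depends on (the source is not a node).
    graph = {op.name: {i for i in op.inputs if i != "input"} for op in operators}
    results: Dict[str, Any] = {"input": source}
    for name in TopologicalSorter(graph).static_order():
        op = by_name[name]
        results[name] = op(*(results[i] for i in op.inputs))
    return results


# A small diamond-shaped DAG: one node fans out to two, which join again.
ops = [
    Operator("tokenize", str.split, ["input"]),
    Operator("count", len, ["tokenize"]),
    Operator("longest", lambda toks: max(toks, key=len), ["tokenize"]),
    Operator("report", lambda n, w: f"{n} tokens, longest: {w}", ["count", "longest"]),
]

results = run_dag(ops, "towhee wraps models as operators")
```

Here a built-in (`len`), a method (`str.split`), and two lambdas all sit behind the same interface, which is what lets the scheduler treat them interchangeably.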
ETL pipelines for unstructured data are nearly nonexistent. ETL, short for extract, transform, and load, is a framework used by data scientists, ML application developers, and other engineers to extract data from various sources, transform the data into a format that can be understood by computers, and load the data into downstream platforms for recommendation, analytics, and other business intelligence tasks.
Towhee solves this by providing an open-source vision for ETL in the era of unstructured data. We provide: 1) over 300 pre-built pipelines across a multitude of different data transformation tasks (including but not limited to image embedding, audio embedding, and text summarization), and 2) a way to build pipelines of arbitrary complexity through an intuitive Python API.
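As a rough picture of what extract, transform, and load mean for unstructured data, the sketch below runs the three stages over a few text snippets. Everything in it is an assumption made for illustration: the function names are hypothetical, the "embedding" is a toy hashed bag-of-words vector standing in for a real model, and the "downstream platform" is an in-memory dictionary.

```python
import math
import zlib
from typing import Dict, Iterator, List, Tuple

DIM = 16  # size of the toy embedding vector


def extract(sources: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Extract: yield (document id, raw text) pairs from a source."""
    yield from sources.items()


def transform(text: str) -> List[float]:
    """Transform: map raw text to a unit-length vector (toy embedding)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def load(store: Dict[str, List[float]], doc_id: str, vector: List[float]) -> None:
    """Load: write the vector to the downstream store."""
    store[doc_id] = vector


def search(store: Dict[str, List[float]], query: str) -> str:
    """Return the id of the stored vector most similar to the query."""
    q = transform(query)
    return max(store, key=lambda d: sum(a * b for a, b in zip(store[d], q)))


sources = {
    "a": "reverse image search",
    "b": "music recognition from audio snippet",
}

store: Dict[str, List[float]] = {}
for doc_id, text in extract(sources):
    load(store, doc_id, transform(text))

best = search(store, "reverse image search")
```

In a real deployment the transform stage would call an embedding model and the load stage would write to a vector database or similar service; the three-stage shape stays the same.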
Convenient: Towhee pipelines can be created to implement a variety of data transformation tasks. Creating a pipeline or running an embedding task takes no more than 10 lines of code. We provide a number of pre-built pipelines on our hub.
Extensible: Individual operators have standard interfaces, and can be reconfigured/reused in different pipelines. Pipelines can be deployed anywhere you want - on your local machine, on a server with 4 GPUs, or even in the cloud.
Application-oriented: Instead of being "just another model hub", we provide full end-to-end embedding pipelines. Each pipeline can make use of any number of machine learning models or Python functions in a variety of configurations - ensembles, flows, or any combination thereof.
Where to go from here
- Reverse image search: search for similar or related images.
- Image deduplication: detect and remove identical or near-identical photos.
- Music recognition: identify music from a full-length song or a snippet.