Overview
Towhee is a framework that provides ETL (extract, transform, load) for unstructured data using state-of-the-art (SoTA) machine learning models.
What is unstructured data?
Unstructured data refers to data that cannot be stored in a tabular or key-value format. Nearly all human-generated data (images, video, text, etc.) is unstructured; some market analysts estimate that over 80% of data generated by 2024 will be unstructured. Towhee is the first open-source project designed to process a variety of unstructured data using ETL pipelines.
To accomplish this, we built Towhee atop popular machine learning and unstructured data processing libraries such as torch, timm, and transformers. Models or functions from different libraries are wrapped as standard Towhee operators, and can be freely integrated into application-oriented pipelines using a Pythonic API. To ensure user-friendliness, pre-built pipelines can also be called in just a single line of code, without the need to understand the underlying models or modules used to build them.
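The idea of wrapping functions from different libraries behind one standard interface can be illustrated with a minimal sketch. This is not Towhee's actual API: the `Operator` and `Pipeline` classes and the `tokenize` and `embed` functions below are hypothetical stand-ins, written in plain Python only to show the shape of the approach.

```python
from typing import Any, Callable, List


class Operator:
    """Hypothetical wrapper that gives any callable a uniform call interface."""

    def __init__(self, name: str, fn: Callable[[Any], Any]) -> None:
        self.name = name
        self.fn = fn

    def __call__(self, data: Any) -> Any:
        return self.fn(data)


class Pipeline:
    """Hypothetical linear pipeline: feeds each operator's output to the next."""

    def __init__(self, operators: List[Operator]) -> None:
        self.operators = operators

    def __call__(self, data: Any) -> Any:
        for op in self.operators:
            data = op(data)
        return data


# Stand-ins for functions that would normally come from different libraries.
def tokenize(text: str) -> List[str]:
    return text.lower().split()


def embed(tokens: List[str]) -> List[float]:
    # Toy "embedding": token count and mean token length.
    n = len(tokens)
    return [float(n), sum(len(t) for t in tokens) / n if n else 0.0]


text_embedding = Pipeline([Operator("tokenize", tokenize), Operator("embed", embed)])
vector = text_embedding("Towhee processes unstructured data")
```

Once the pipeline is assembled, the caller only sees a single callable, which is the property that lets a pre-built pipeline be used without knowing which models or modules sit inside it.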
For more information, take a look at our quick start page.
Problems Towhee solves
Modern ML applications require far more than a single neural network. Running a modern ML application in production requires a combination of online pre-processing, data transformation, the models themselves, and other ML-related tools. Building an application that recognizes objects within a video, for example, involves decompression, key-frame extraction, image deduplication, object detection, etc. This necessitates a platform that offers a fast and robust method for developing end-to-end application pipelines that use ML models in addition to supporting data parallelism and resource management.
Towhee solves this problem by reintroducing the concept of Pipeline as being application-centric instead of model-centric. Where model-centric pipelines are composed of a single model followed by auxiliary code, application-centric pipelines treat every single data processing step as a first-class citizen. Towhee also exposes a Pythonic API for developing more complex applications in just a couple lines of code.
Too many model implementations exist without any interface standard. Machine learning models (NN-based and traditional) are ubiquitous. Different implementations of machine learning models require different auxiliary code to support testing and fine-tuning, making model evaluation and productionization a tedious task.
Towhee solves this by providing a universal Operator wrapper for dataset loading, basic data transformations, ML models, and other miscellaneous scripts. Operators have a pre-defined API and glue logic to make Towhee work with a number of machine learning and data processing libraries. Operators can be chained together in a DAG to form entire ML applications.
ETL pipelines for unstructured data are nearly nonexistent. ETL, short for extract, transform, and load, is a framework used by data scientists, ML application developers, and other engineers to extract data from various sources, transform the data into a format that can be understood by computers, and load the data into downstream platforms for recommendation, analytics, and other business intelligence tasks.
Towhee solves this by providing an open-source vision for ETL in the era of unstructured data. We provide:
- over 300 pre-built pipelines across a multitude of different data transformation tasks (including but not limited to image embedding, audio embedding, text summarization)
- a way to build pipelines of arbitrary complexity through an intuitive Python API.
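Chaining operators together in a DAG, as described above, can be sketched with the standard library alone. The `run_dag` helper and the node names below are assumptions made for illustration and are not part of Towhee; the sketch relies on `graphlib.TopologicalSorter` (Python 3.9+) to order the steps so that each one runs after its inputs are ready.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Tuple

# Each node maps a name to (function, names of the upstream nodes it consumes).
Node = Tuple[Callable[..., Any], Tuple[str, ...]]


def run_dag(nodes: Dict[str, Node], inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run every node after its dependencies and return all results by name."""
    graph = {name: set(deps) for name, (_, deps) in nodes.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in results:  # a pipeline input, nothing to compute
            continue
        fn, deps = nodes[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


# Hypothetical four-step graph: one shared step fans out to two toy "models",
# whose outputs are then averaged in a final ensemble step.
nodes: Dict[str, Node] = {
    "tokens": (lambda text: text.lower().split(), ("text",)),
    "length_score": (lambda tokens: float(len(tokens)), ("tokens",)),
    "vocab_score": (lambda tokens: float(len(set(tokens))), ("tokens",)),
    "ensemble": (lambda a, b: (a + b) / 2, ("length_score", "vocab_score")),
}

out = run_dag(nodes, {"text": "to be or not to be"})
```

Because every step is a named node rather than auxiliary code around a single model, intermediate results such as `out["tokens"]` stay inspectable, and branches can be added or swapped without touching the rest of the graph.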
Design philosophy
Convenient: Towhee pipelines can be created to implement a variety of data transformation tasks. Creating a pipeline or running an embedding task takes no more than 10 lines of code. We provide a number of pre-built pipelines on our hub.
Extensible: Individual operators have standard interfaces, and can be reconfigured/reused in different pipelines. Pipelines can be deployed anywhere you want - on your local machine, on a server with 4 GPUs, or even in the cloud.
Application-oriented: Instead of being "just another model hub", we provide full end-to-end embedding pipelines. Each pipeline can make use of any number of machine learning models or Python functions in a variety of configurations - ensembles, flows, or any combination thereof.
Where to go from here
Getting started:
- Check out our Quick Start: install Towhee and try your first pipeline.
- Try pre-built Operators.
- Towhee API Reference.
Tutorials:
- Reverse image search: search for similar or related images.
- Image deduplication: detect and remove identical or near-identical photos.
- Music recognition: identify music from a full-length song or a snippet.
Community:
- GitHub: https://github.com/towhee-io/towhee
- Slack: https://slack.towhee.io
- Twitter: https://twitter.com/towheeio