Each cloud machine learning platform includes capabilities for managing the entire machine learning lifecycle.
FREMONT, CA: The major cloud providers, along with several smaller ones, have invested significant resources in developing their machine learning platforms to support the entire machine learning lifecycle, from project planning to model maintenance. The following is a list of 12 capabilities that every end-to-end machine learning platform should include.
Keep data close at hand
If a user has the massive amounts of data necessary to construct precise models, he does not want to ship it halfway around the world. The issue here is not distance but time: even on an ideal network with infinite bandwidth, data transmission speed is ultimately restricted by the speed of light, so extended distances imply latency. For extensive data sets, the ideal case is to build the model where the data already resides, avoiding the need for mass data transmission; numerous databases support this to varying degrees. The second-best case is that the data sits on the same high-speed network as the model-building software, typically within the same data centre. Even moving terabytes (TB) or more of data from one data centre to another within a cloud availability zone can introduce significant delays. This can be mitigated by performing incremental updates, as sketched below.
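Where data is day-partitioned in object storage, an incremental update can be as simple as tracking a watermark and copying only the partitions that appeared since the last sync. Here is a minimal sketch, assuming hypothetical bucket names and a dt=YYYY-MM-DD partition layout (reading s3:// paths via pandas also assumes the s3fs package is installed):

```python
# A minimal sketch of an incremental update: instead of re-shipping the full
# data set, copy only partitions added since the last sync. Bucket names and
# the partition layout are illustrative assumptions.
import pandas as pd

WATERMARK = "last_synced.txt"  # stores the most recent partition already copied

def last_synced() -> str:
    try:
        with open(WATERMARK) as f:
            return f.read().strip()
    except FileNotFoundError:
        return ""  # first run: every partition counts as new

def sync_new_partitions(partitions: list, source: str, dest: str) -> None:
    """Copy only partitions newer than the watermark, not the whole data set."""
    done = last_synced()
    for part in sorted(partitions):
        if part <= done:
            continue  # already shipped on a previous run
        df = pd.read_parquet(f"{source}/dt={part}")          # read one new partition
        df.to_parquet(f"{dest}/dt={part}/part-0.parquet")    # write it to the destination
    if partitions:
        with open(WATERMARK, "w") as f:
            f.write(max(partitions))

# Usage: only 2024-06-02 is copied if 2024-06-01 was synced on a prior run.
sync_new_partitions(["2024-06-01", "2024-06-02"],
                    "s3://acme-raw-events", "s3://acme-training-data")
```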
Assist in the development of an ETL or ELT pipeline
ETL (extract, transform, and load) and ELT (extract, load, and transform) are two frequently used data pipeline configurations in the database world. Machine learning and deep learning intensify the requirement for these, particularly the transform component. ELT provides greater flexibility when transformations change, as the load phase is typically the most time-consuming phase for big data and only has to happen once. By and large, data collected in the wild is noisy and must be filtered.
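As an illustration, here is a minimal ELT sketch in Python, with SQLite standing in for the warehouse; the file, table, and column names are assumptions for the example:

```python
# A minimal ELT sketch: land the raw data once, then transform it in place.
# SQLite stands in for the warehouse; names are illustrative assumptions.
import sqlite3

import pandas as pd

con = sqlite3.connect("warehouse.db")

# Load: ingest the raw file untouched. For big data this is the slow step,
# which is why ELT does it exactly once.
pd.read_csv("events_raw.csv").to_sql(
    "events_raw", con, if_exists="replace", index=False
)

# Transform: filter noise and derive columns with SQL over the loaded copy.
# Because the raw table already lives in the warehouse, this step can be
# rewritten and re-run cheaply without re-ingesting anything.
con.executescript("""
    DROP TABLE IF EXISTS events_clean;
    CREATE TABLE events_clean AS
    SELECT user_id,
           value,
           DATE(event_ts) AS event_date   -- derived column
    FROM events_raw
    WHERE value IS NOT NULL;              -- drop noisy, incomplete rows
""")
con.commit()
```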
Additionally, data collected in the wild spans a range of scales: one variable's maximum value could be in the millions, while another's range could be -0.1 to -0.001. To prevent variables with large ranges from dominating the model, variables must be transformed to standardised ranges before machine learning. Which standardised range to use depends on the algorithm behind the model.
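For instance, scikit-learn ships scalers for the two most common approaches; the toy values below mirror the text, with one feature in the millions and another between -0.1 and -0.001:

```python
# A minimal sketch of rescaling variables with very different ranges before
# training, using scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([
    [2_500_000.0, -0.050],
    [7_100_000.0, -0.002],
    [1_200_000.0, -0.090],
])

# Min-max scaling maps each column onto [0, 1]; a common choice for
# neural networks.
print(MinMaxScaler().fit_transform(X))

# Standardisation gives each column zero mean and unit variance; a common
# choice for algorithms such as SVMs or regularised linear models.
print(StandardScaler().fit_transform(X))
```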
Contribute to the development of an online environment for model building
The conventional wisdom used to be that a user should download his data to his desktop for model construction. However, the sheer volume of data required to build effective machine learning and deep learning models alters the landscape: while users can download a small sample of data to a desktop for exploratory data analysis and model building, production models need full access to the data.
Web-based development environments are well suited to model building. For example, if the user's data is stored in the same cloud as the notebook environment, he can run the analysis directly where the data lives, avoiding time-consuming data movement.
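For example, a notebook running in the same cloud region as the data can read it in place from object storage; the bucket and path below are hypothetical, and reading s3:// URLs through pandas assumes the s3fs package is installed:

```python
# A minimal sketch of analysing data in place from a cloud-hosted notebook,
# assuming the data sits in object storage in the same region.
import pandas as pd

# Pull only the columns needed for exploration; nothing is copied to a desktop.
df = pd.read_parquet(
    "s3://acme-ml-data/clickstream/2024/",
    columns=["user_id", "event_type", "event_ts"],
)
print(df["event_type"].value_counts().head())
```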