Best Practices for Imbalanced Data and Partitioning

Presented by:
DataRobot
DataRobot is the leader in enterprise AI, delivering trusted AI technology and ROI enablement services to global enterprises. DataRobot’s enterprise AI platform democratizes data science with end-to-end automation for building, deploying, and managing machine learning models.
In this two-part learning session, we discuss best practices around data partitioning and working with imbalanced datasets.
Five-fold cross-validation is often treated as the silver bullet for partitioning your data for validation, but there are some dangerous caveats you have to be aware of to make sure that you’re building robust models. In part 1 of this learning session, we talk about those pitfalls and outline strategies for handling them.
Binary target variables are very common in data science use cases, and many of them are severely imbalanced. When you’re building models for infrequent events, such as predicting fraud or identifying product failures, it’s important to watch out for imbalance in your data. In part 2 of this learning session, we discuss strategies for working with imbalanced datasets and provide some rules of thumb for these types of use cases.
Supported By

Alegion
Alegion offers an industry-leading platform that provides annotation for image, video, and text data.

Amazon Web Services (AWS)
AWS is the world’s most comprehensive and broadly adopted cloud platform, offering over 175 fully featured services from datacenters globally. Millions of customers are using AWS to lower costs, become more agile, and innovate faster.

Appen
Appen collects and labels images, text, speech, audio, video, and other data used to build and continuously improve the world’s most innovative artificial intelligence systems.

Carahsoft
Carahsoft is a trusted Government IT Solutions Provider working with reseller partners, system integrators, and manufacturers to provide leading IT solutions to government markets.

CloudFactory
CloudFactory is a global leader in combining people and technology to provide a workforce in the cloud for machine learning and core business data processing.

Databricks
Databricks is the data and AI company. Thousands of organizations worldwide rely on Databricks’ open and unified platform for data engineering, machine learning and analytics. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to solve the world’s toughest problems.

DataRobot
DataRobot is the leader in enterprise AI, delivering trusted AI technology and ROI enablement services to global enterprises. DataRobot’s enterprise AI platform democratizes data science with end-to-end automation for building, deploying, and managing machine learning models.

HEAVY.AI
HEAVY.AI provides advanced analytics that empower businesses and the government to visualize high-value opportunities and risks hidden in their big location and time data, supporting time-sensitive, high-impact decisions.

Labelbox
Labelbox is an end-to-end training data platform that is used to create and manage high-quality training data. The platform provides fast labeling tools and collaboration features, and supports any data type (e.g., images, videos, and text).

Maverick Quantum Inc (mavQ)
Maverick Quantum Inc (mavQ) is a low-code and artificial intelligence platform that helps organizations with digital transformation while creating valuable insights and outcomes.

Microsoft
Microsoft allows you to improve your agency’s collaboration, transparency, and sustainability by using more secure and compliant tools.

Veritone
Transform audio, video, and other data sources into actionable intelligence with Veritone’s aiWARE.