Publications - Zifan Liu

TSDS: Data Selection for Task-Specific Model Finetuning

Zifan Liu, Amin Karbasi, Theodoros Rekatsinas

Conference on Neural Information Processing Systems (NeurIPS) 2024

Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task.

[PDF] [Code]

Rapidash: Efficient Detection of Constraint Violations

Zifan Liu, Shaleen Deep, Anna Fariha, Fotis Psallidas, Ashish Tiwari, Avrilia Floratou

International Conference on Very Large Data Bases (VLDB) 2024

Denial Constraint (DC) is a well-established formalism that captures a wide range of integrity constraints commonly encountered, including candidate keys, functional dependencies, and ordering constraints, among others. We establish a connection between orthogonal range search and DC violation detection. We then introduce Rapidash, a novel algorithm that demonstrates near-linear time and space complexity, representing a theoretical improvement over prior work.

[PDF] [Code]

AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis

Zifan Liu, Evan Rosen, Paul Suganthan G. C.

NeurIPS Workshop on Challenges in Deploying and Monitoring Machine Learning Systems 2022

Automated slicing aims to identify subsets of evaluation data where a trained model performs anomalously. This is an important problem for machine learning pipelines in production since it plays a key role in model debugging and comparison, as well as the diagnosis of fairness issues. We present AutoSlicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing.

[PDF]

Picket: Guarding Against Corrupted Data in Tabular Data during Learning and Inference

Zifan Liu, Zhechun Zhou, Theodoros Rekatsinas

The VLDB Journal 2022

Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework to safeguard against data corruptions during both training and deployment of machine learning models over tabular data.

[PDF] [Code]

On Robust Mean Estimation under Coordinate-Level Corruption

Zifan Liu, Jongho Park, Theodoros Rekatsinas, Christos Tzamos

International Conference on Machine Learning (ICML) 2021

We study the problem of robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior works, and present an information-theoretic analysis of robust mean estimation in these settings.

[PDF]
