Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
Conference on Neural Information Processing Systems (NeurIPS) 2024
Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task.
Zifan Liu, Shaleen Deep, Anna Fariha, Fotis Psallidas, Ashish Tiwari, Avrilia Floratou
International Conference on Very Large Data Bases (VLDB) 2024
Denial constraints (DCs) are a well-established formalism that captures a wide range of common integrity constraints, including candidate keys, functional dependencies, and ordering constraints. We establish a connection between orthogonal range search and DC violation detection. We then introduce Rapidash, a novel algorithm with near-linear time and space complexity, a theoretical improvement over prior work.
Zifan Liu, Evan Rosen, Paul Suganthan G. C
NeurIPS Workshop on Challenges in Deploying and Monitoring Machine Learning Systems 2022
Automated slicing aims to identify subsets of evaluation data where a trained model performs anomalously. This is an important problem for machine learning pipelines in production since it plays a key role in model debugging and comparison, as well as the diagnosis of fairness issues. We present Autoslicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing.
Zifan Liu, Zhechun Zhou, Theodoros Rekatsinas
The VLDB Journal 2022
Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework to safeguard against data corruption during both training and deployment of machine learning models over tabular data.
Zifan Liu, Jongho Park, Theodoros Rekatsinas, Christos Tzamos
International Conference on Machine Learning (ICML) 2021
We study the problem of robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior work, and present an information-theoretic analysis of robust mean estimation in these settings.