Job Description
Overview:
Pioneer cutting-edge multimodal research, contributing to innovative prototypes and scalable systems.
Responsibilities:
* Design novel AI architectures for multimodal language models integrating text, visual, and audio modalities.
* Engineer training and inference pipelines optimized for large-scale multimodal datasets and distributed GPU systems.
* Optimize systems and algorithms for efficient data processing and model execution.
* Develop tools for preprocessing, analyzing, and managing multimodal data assets.
* Collaborate with research and engineering teams to translate model innovations into production-grade solutions.
* Prototype generative AI applications showcasing new capabilities of multimodal foundation models.
* Develop benchmarking tools to evaluate model performance across diverse multimodal tasks.
Requirements:
* Bachelor's degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent practical experience.
* Expertise in Python and PyTorch, including practical experience across the full model development pipeline.
* Experience working with large-scale text data or interleaved data spanning multiple modalities.
* Direct hands-on experience developing or benchmarking at least one of the following: large language models (LLMs), vision-language models, audio language models, or generative video models.
Nice-to-Have Skills:
* PhD in Computer Vision, Machine Learning, NLP, Computer Science, Applied Statistics, or a closely related field.
* Demonstrated expertise in vision, video-generation foundation models, and/or multimodal research.
* First-author publications at leading AI conferences such as CVPR, ICCV, ECCV, ICML, ICLR, or NeurIPS.