Student Thesis Published
The training of generative artificial intelligence (AI) models demands extensive datasets, often sourced through web scraping. However, current practices frequently overlook copyright compliance, which poses significant ethical and legal challenges. This project develops a tool for license-aware web crawling that uses natural language processing (NLP) techniques to detect and extract licensing information from websites automatically. The tool achieved 100% accuracy in license type detection and moderate effectiveness in extracting license text, with ROUGE-L precision of 0.588, recall of 0.503, and an F1 score of 0.499. By identifying the specific license type, the algorithm facilitates the creation of legally compliant datasets, which are essential for responsible AI training. The tool not only supports adherence to copyright laws but also promotes ethical data usage, thereby supporting the sustainable advancement of AI technologies.
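For readers unfamiliar with the ROUGE-L figures quoted in this abstract, the following is a minimal sketch of how ROUGE-L precision, recall, and F1 are commonly computed from the longest common subsequence (LCS) of tokens. It is an illustration only and not the thesis's implementation: the function names `lcs_length` and `rouge_l`, the lowercase whitespace tokenization, and the balanced F1 (beta = 1) are assumptions, as is the example license sentence.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall, and balanced F1 for two strings.

    Assumption: lowercase whitespace tokenization; real evaluations
    may tokenize or stem differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)  # share of extracted tokens that match
    recall = lcs / len(ref)      # share of reference tokens recovered
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: extracted license text vs. the full reference text.
scores = rouge_l(
    "licensed under the MIT license",
    "this work is licensed under the MIT license",
)
print(scores)
```

In this hypothetical case every extracted token appears in order in the reference, so precision is 1.0, while recall is 5/8 because three reference tokens were missed.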
Call for Contributions
Generative AI is trained on vast amounts of data. The current practice of collecting data to train such models rarely takes copyright and intellectual property rights into account. While the models themselves and the data they are trained on are often multilingual and multinational, legal requirements and rules vary between jurisdictions. Open questions remain not just about the input data used to train generative AI, but also about the artifacts it produces and which rules apply to them. These can include the extent to which the output is a derivative of existing works, specific rules that apply to AI systems (e.g. the AI Act), and general rules, e.g. with regard to trademarks.