I. INTRODUCTION
The role of the internet within today’s digital ecosystem is no longer limited, as it once was, to merely enabling access to information; rather, it also facilitates the systematic processing and analysis of information and its transformation into economic or strategic value. In the process of efficiently handling large volumes of data and converting them into meaningful information, methods based solely on human labour have proven inadequate, thereby leading to an increasing reliance on automation-based technologies. In this context, techniques such as web crawling, web scraping, and screen scraping are employed as essential technical tools for the large-scale systematic collection, extraction, and structuring of raw data available online. Recent judicial approaches in the international sphere demonstrate that the transformation of raw data into value-added information is increasingly regarded as a distinct form of production and, as such, falls within the scope of transformative use. Accordingly, it is observed that data not protected by technical barriers may be processed for transformative purposes, including the training of artificial intelligence systems, thereby leaving an open pathway for digital innovation and technological development.
Today, a wide range of business models—including search engines, price comparison platforms, training datasets for large language models, and financial analysis tools—are built upon the automated collection and processing of data available on the internet. While the early web was largely confined to the static display of information, data has since evolved into a resource that can be systematically processed and transformed as a core input of technological innovation. Databases created through the investments of commercial actors—such as e-commerce platforms, flight information systems, or real estate listings—have moved beyond serving solely as content presented to end users and have acquired the character of data pools from which new forms of added value are generated within the artificial intelligence and data analytics ecosystem. In this context, just as search engines such as Google or Bing are regarded as a necessity of the digital age for facilitating access to information through indexing the internet, the processing of data by third-party ventures for transformative purposes and the development of new technologies has increasingly come to be viewed as part of the freedom of innovation. International judicial decisions indicate an evolving framework in which data not restricted by technical barriers is increasingly permitted to be used in artificial intelligence training and analytical processes.
The increasing use of data, particularly in the training of artificial intelligence systems, data analytics, and the development of new digital services, has placed automation-based data mining techniques at the centre of legal debate. Determining under which circumstances data mining activities constitute an infringement of intellectual property rights, and under which they may be regarded as lawful and transformative use, requires, first and foremost, a proper understanding of the technical functioning and conceptual foundations of these technologies.
II. TECHNICAL AND CONCEPTUAL FOUNDATIONS OF DATA SCRAPING TECHNOLOGIES
Data scraping is an overarching concept that encompasses the automated collection and processing of data from digital environments and, despite its widespread use today, does not yet have a universally accepted standard definition. In general terms, this process refers to the extraction of data from digital sources through automated means and consists of the stages of data collection, pre-processing, and the subsequent use of the extracted data1.
Web crawling, in its most basic form, refers to the systematic visiting, scanning, and indexing of data dispersed across the internet through specialized software commonly referred to as “bots,” “spiders,” or “crawlers.” This process is based on HTTP requests, similar to the way a user navigates individual web pages through a web browser. However, owing to the speed and scale enabled by automation, data is not merely read but is also extracted, parsed, and structured, thereby being transformed into processable information. In today’s digital economy, web crawling is employed across a broad spectrum of applications, ranging from search engines to the training of artificial intelligence models, and enables information to be extracted from its raw form and rendered systematically processable2.
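For illustration, the following minimal Python sketch reproduces the crawling loop described above: it fetches a page over HTTP, parses the hyperlinks out of the HTML, and queues the discovered URLs for subsequent visits. The seed URL, user-agent string, and page limit are illustrative assumptions rather than features of any particular production crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags encountered in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: fetch a page, harvest its links, queue new URLs."""
    frontier = deque([seed_url])
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
            html = urlopen(request, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable or failing page; move on
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if urlparse(absolute).scheme in ("http", "https"):
                frontier.append(absolute)
    return visited


if __name__ == "__main__":
    print(crawl("https://example.com"))
```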
Web scraping constitutes a subsequent stage of web crawling and refers to the automated extraction and storage of specific content from crawled web pages. Screen scraping, by contrast, relies on obtaining data through the visual interface displayed on the screen rather than directly extracting it from the page’s source code, thereby representing a distinct technical method. The common feature of these techniques lies in the collection of digital content made publicly available on the internet through software-based tools without human intervention. Nevertheless, each method differs in terms of its impact on data and its intended use: while crawling primarily denotes a general and continuous scanning activity, scraping refers to a targeted and purpose-specific extraction of data.
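The distinction can be made concrete with a short sketch. Whereas the crawler above merely harvests links, a scraper extracts specific fields from the page's source code. In the following Python example, the "price" class attribute and the sample markup are purely hypothetical; screen scraping would instead read the same values from the rendered visual interface rather than from the source code.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Extracts the text of elements whose class attribute is 'price'."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # 'price' is a hypothetical class name; real pages differ.
        if ("class", "price") in attrs:
            self._capture = True

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.prices.append(data.strip())


# Invented markup standing in for a crawled page.
html = """
<ul>
  <li>Widget A <span class="price">19.90</span></li>
  <li>Widget B <span class="price">24.50</span></li>
</ul>
"""

scraper = PriceScraper()
scraper.feed(html)
print(scraper.prices)  # ['19.90', '24.50']
```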
Crawling activities are employed for both commercial and scientific purposes across a broad range of applications, including search engine indexing processes, market and competition analyses, the preparation of training datasets for artificial intelligence models, as well as price monitoring and inventory tracking systems. In large-scale and systematic data collection projects in particular, web crawling typically constitutes the initial stage of the scraping process and forms the foundational infrastructure of the overall data collection workflow. The operational logic of crawler software assumes that publicly available data may be processed so long as access is not explicitly restricted through technical protocols such as robots.txt. Nevertheless, the inherent inability of these systems to comprehend natural human language raises the question of the extent to which the intent of content owners can be rendered binding through technical means, a consideration that may give rise to legal consequences under intellectual property law. In this regard, website owners seek to restrict automated data collection activities through terms of service agreements and technical protection measures3.
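The robots.txt protocol referred to above can be honoured programmatically with Python's standard library, as in the following sketch. The robots.txt content and the bot names are hypothetical; note also that the protocol is purely advisory, so the exclusion is effective only insofar as crawlers choose to consult it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content: a hypothetical analytics bot is
# excluded from the listings section, while other bots are unrestricted.
ROBOTS_TXT = """\
User-agent: example-analytics-bot
Disallow: /listings/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler consults the protocol before each request.
print(parser.can_fetch("example-analytics-bot", "https://example.com/listings/1"))  # False
print(parser.can_fetch("example-analytics-bot", "https://example.com/about"))       # True
print(parser.can_fetch("some-other-bot", "https://example.com/listings/1"))         # True
```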
By virtue of its technical operation and its position within the data extraction chain, web crawling necessitates a multi-layered assessment under intellectual property law. The nature of the automated scanning activity, the processing operations performed on the data, and the resulting patterns of use play a decisive role in determining the scope of exclusive rights. Indeed, judicial jurisprudence evolving on a global scale has developed along these conceptual foundations and serves to delineate the legal boundaries applicable to web crawling activities.
III. ASSESSMENT OF WEB CRAWLING UNDER INTELLECTUAL PROPERTY LAW
Assessing web crawling activities from the perspective of intellectual property law requires an accurate determination of the technical methods by which online content is scanned and the types of operations performed on the data as a result of such scanning. In most cases, crawling consists merely of automated access to publicly available web pages and the indexing of those pages; however, when followed by scraping and analytical activities, it becomes part of a broader data processing chain. Accordingly, the classification of infringement claims under the relevant categories of rights depends on criteria such as whether the scanned content qualifies as a protected work, the scope of reproduction, and whether the use is transformative in nature. The balance between technical freedom of access and the scope of the right holder’s exclusive rights is therefore established differently in each individual case. Within this framework, an integrated assessment of the approaches developed in comparative law together with the protection regime envisaged under Turkish law serves to concretize the legal characterization of web crawling activities.
1. The European Union Approach
Under European Union law, the assessment of web crawling activities from the perspective of intellectual property law forms part of the broader debate on the automated processing of online data. Within the framework of EU law, the legitimacy of data mining activities, without prejudice to the scientific research exception, is not based on a model requiring prior authorization from the right holder, but rather on an “opt-out” approach that focuses on whether the right holder has expressly restricted such activities through technical means. This approach brings to the forefront the question of how the balance between technical freedom of access and the exclusive rights of the right holder should be struck when determining the legal characterization of web crawling activities.
In assessments concerning Article 4 of Directive (EU) 2019/790 on copyright in the Digital Single Market (“Directive”), the use of machine-readable technical measures—such as robots.txt—has been recognized as a mandatory prerequisite for right holders to render their intentions legally effective. Accordingly, where no explicit technical “lock” has been imposed through such protocols, the crawling and processing of data by commercial entities is regarded, under the prevailing legal norms, not as an infringement of rights but as falling within a lawful sphere of freedom. In this respect, the European Union acquis assigns a decisive role to technical restriction mechanisms in determining the conditions under which web crawling activities are deemed lawful, thereby providing a framework that shapes the standard of conduct for both content providers and data collectors.
2. The United States Perspective
Under U.S. law, the assessment of web crawling activities from the perspective of intellectual property law is primarily shaped by the manner in which data is processed and by the concept of transformative use. Pursuant to the fair use doctrine and recent judicial developments, the scanning of works by commercial entities for the purposes of building databases or conducting analytical activities is, in many instances, not considered to constitute copyright infringement. The central criterion under this approach is whether the content obtained through crawling and data extraction is directly consumed, namely whether the resulting use substitutes for the market of the original work.
In U.S. judicial decisions, web crawling is characterized as a technical access activity and is assessed differently from traditional acts of reproduction. Accordingly, legal analysis focuses less on the stage at which crawler software automatically visits web pages and more on the purpose for which the data obtained through such visits is ultimately used. Where the output generated through data extraction acquires a new and independent function distinct from the original work, the use is more readily recognized as transformative for the purposes of fair use analysis. Conversely, where the extracted data is made available to third parties without being subjected to any analytical processing, the use may be considered substitutive in nature, thereby giving rise to claims of copyright infringement. Consequently, under U.S. law, the decisive factor in determining the lawfulness of web crawling is not the access technique itself, but the concrete impact of that technique on rights protected by copyright.
3. The Japanese Approach
In the construction of the technological innovation and artificial intelligence ecosystem, Japan redefined the normative framework governing data processing activities through a structural reform of its Copyright Act in 2018. Pursuant to the “Non-Enjoyment Purpose” principle, which constitutes the doctrinal foundation of this reform, the use of a copyrighted work—by any means, including for commercial purposes—is deemed lawful without the consent of the right holder, provided that the use does not aim at the direct “enjoyment” or experiential appreciation of the ideas or emotions embodied in the work, nor at enabling such enjoyment by third parties. The theoretical underpinning of this approach rests on a fundamental assumption regarding the nature of copyright protection: technical uses that do not pursue the enjoyment or experiential consumption of the work are considered to fall outside the natural scope of copyright protection, on the grounds that they do not, in principle, undermine the right holder’s opportunities to derive remuneration or economic value from the work4.
Article 30-4 of the Japanese Copyright Act has concretized the boundaries of this broad freedom within the framework of “Data Analysis,” explicitly including processes such as the extraction, comparison, classification, and statistical analysis of large datasets within the scope of the exception. However, this regime of permissibility is not absolute and is subject to the condition that the relevant activity must not cause “unreasonable prejudice” to the legitimate interests of the right holder. In light of the preparatory materials, such prejudice arises only where the resulting output directly substitutes for the market of the original work or obstructs its potential channels of commercialization.
The reform approach adopted under the Japanese Copyright Act clearly delineates the distinction between technical use and the purpose of enjoyment of content in the legal assessment of web crawling activities. Accordingly, automated crawling is predominantly regarded as a technological process consisting of the software-based visiting of web pages and the temporary processing of data; the impact of this process on rights protected by copyright is determined by the way the data obtained through crawling is ultimately used. Given that the will of the author cannot be interpreted by machines, the restrictive role of usage conditions is considered legally contentious; by contrast, the processing of publicly accessible content for transformative purposes is regarded as a legitimate sphere of freedom that enables the development of new technologies.
4. The Turkish Copyright Law Approach under Law No. 5846
Under Turkish law, the assessment of web crawling and similar automated data extraction activities from the perspective of intellectual property law raises the issue of how the protection regime envisaged under Law No. 5846 on Intellectual and Artistic Works (“FSEK”) is to be applied in digital environments. The fact that a significant portion of online content does not qualify as a protected work, while datasets created through commercial websites may nonetheless acquire economic value, further accentuates the question of the legal criteria under which crawling activities should be examined. In this context, legal analysis focuses first on whether the extracted data constitutes a protected work or an original database within the meaning of FSEK, and subsequently on whether the automated crawling activity gives rise to a concrete interference with the exclusive rights granted to the author.
4.1. Protection of the “Work”
In Turkish intellectual property law, the first step in assessing web crawling activities from a copyright perspective is to determine the legal nature of the content subject to automated scanning. Although digital data available online may differ in structure and methods of creation, copyright protection arises only in respect of content that qualifies as a protected work under FSEK. Accordingly, the lawfulness analysis of data scraping activities is concretized through the question of whether the extracted data bears the author’s own intellectual creation. Where the data consists solely of publicly available facts, by contrast, the scope of copyright protection is limited in many disputes.
In analysing the status of data scraping activities under FSEK, the assessment must first determine whether the extracted data or the database qualifies as a “work,” and subsequently whether the activity infringes upon the exclusive rights of the author. Pursuant to Article 1/B of FSEK, a work is defined as “any intellectual or artistic product that bears its author’s own intellectual creation and falls within the categories of literary and scientific works, musical works, works of fine art, or cinematographic works.” Under the same provision, the author is defined as the person who creates the work. The author holds exclusive ownership over the economic and moral rights vested in the work.
A substantial portion of the data subject to data scraping—such as stock market data, weather information, sports match scores, flight schedules, telephone numbers, addresses, and product prices—essentially consists of facts. Under the well-established idea–expression dichotomy5 recognized in intellectual property law doctrine, ideas and facts, as such, do not benefit from copyright protection. The legally protected interest lies not in the facts themselves, but in the way those facts are expressed.
Accordingly, in its decision dated 10 December 2024, the Court of Cassation upheld the legal reasoning adopted by the court of first instance, which held that the gallop and sprint records published on the claimant’s website (www.idmanmerkezi.com)—and recorded by the claimant’s teams—constituted ordinary information that could equally be identified by other individuals observing the horses’ training sessions or races; that such data did not bear the originality of the persons recording it; that it did not fall within the scope of database protection; and that it did not qualify as a “work” within the meaning of FSEK. The court further found that the claimants had failed to substantiate, through concrete and undisputed evidence, their assertions that the data at issue belonged to them and had been extracted and used without authorization by the defendants. It was therefore concluded that the allegations concerning the unauthorized use of the claimants’ data had not been proven and that, in the absence of any conduct attributable to the defendants that could be assessed as unfair competition, there was no legal basis to uphold the claimants’ claims for damages6.
When the facts underlying the aforementioned ruling are examined, it becomes apparent that the data in question consist not of an intellectual creation, but rather of raw measurements containing information on the horses’ speed and distance. No creative arrangement capable of conferring originality or the author’s own intellectual creation on the database was found to be present. The information obtained within the scope of the dispute was characterised solely as data recorded through timekeeping and possessing an objective nature. Accordingly, it was deemed legally untenable to afford protection to this compilation as an “original database bearing the author’s own intellectual creation” within the meaning of Article 6/11 of FSEK.
The assessment of web crawling activities under copyright law requires, first and foremost, the correct identification of the legally protected interest. Given that automated crawling actions, in most cases, are based on the technical processing of data of a factual nature, the possibility of infringement arises only where original expressions are reproduced or reused within the scope of copyright protection. In this respect, the crawling of content that does not qualify as a work and the protection of databases created through an original systematic arrangement are subject to distinct legal regimes.
4.2. Database Protection and Doctrinal Approaches
Web crawling activities are often directed not at isolated items of content, but at the automated collection of large volumes of data aggregated on commercial websites. For this reason, an intellectual property law analysis cannot be confined solely to the copyright status of the expressions contained on the crawled pages; it must also address whether the datasets formed by these pages as a whole fall within the scope of database protection. The increasing distinction in comparative law between copyright protection and database protection makes it necessary to assess web crawling activities under Turkish law within this systematic framework.
For a database to be recognised as such, certain criteria must be satisfied independently of the conditions required for copyright protection of a work. A database is defined as a collection or compilation of works, data, or other materials, arranged and organised according to a specific purpose and a particular systematic plan, and made accessible through a system by means of a technical tool7. Accordingly, in order to speak of a database within the scope of the FSEK, there must first be existing content; such content must be compiled in line with a specific objective and pursuant to a special systematic structure; and finally, the data must be individually accessible, independently of one another8.
The doctrinal approaches underpinning the legal protection of databases are generally structured around two main axes: “creativity” and “labour”9. According to the first approach, for a database to benefit from copyright protection as a work, the selection or arrangement of the data must involve a creative effort and, in terms of Turkish law, must reflect the author’s own intellectual creation. Data compilations lacking this characteristic fall outside the scope of copyright protection. The second approach, commonly referred to as the “sweat of the brow” doctrine, shifts the focus from intellectual creativity to the labour expended. According to this view, the compilation of large-scale data constitutes a laborious process requiring substantial time, cost, and effort; therefore, even in the absence of original creativity, the resulting product should benefit from legal protection solely in order to reward the investment and labour involved10.
Pursuant to Articles 6/1 and 6/11 of FSEK, protection is confined to the original structure of databases and to the manner in which their contents are selected or arranged, and cannot be directly extended to cover non-systematic or dispersed data contained therein. In the context of web scraping, the scope of protection afforded under Article 6/11 of FSEK generally remains limited. This is because many commercial websites, such as telephone directories or flight listings, present data solely in standard and logical orders, including alphabetical, chronological, or numerical sequences. Whether such systematic arrangement and compilation of data satisfy the requirement of being an author’s own intellectual creation must be assessed by the courts on a case-by-case basis.
This general regime governing the protection of databases necessitates a separate assessment of web crawling activities, particularly with regard to the “substantial part” criterion. The possibility that database contents may be systematically extracted through automated crawling may give rise, unlike copyright protection, to the application of sui generis database rights. Accordingly, the impact of web crawling activities on database rights must be examined separately within the framework of the concepts of extraction and re-utilisation.
4.3. Sui Generis Protection under FSEK and the Criteria of “Extraction and Re-utilisation”
Another mechanism for the protection of databases under the FSEK is the sui generis right established pursuant to Additional Article 8. The first requirement for the application of this protection regime is that a substantial investment has been made in the creation, verification, or presentation of the database. Accordingly, in assessing liability for web scraping directed at databases, it is necessary to examine separately the actions undertaken by the third parties carrying out the activity and whether those actions infringe the rights granted to the database producer.
The purpose of sui generis database protection is not to protect the creation of the data itself, but to safeguard the financial and professional investments made in the process of compiling and collecting existing data in a manner that forms a database11. The concept of a substantial investment encompasses not only financial expenditures, but also the human effort, time, technical tools, and methodical work devoted to the preparation and processing of the database. The stage of acquisition refers to the collection and systematic compilation of independent data and materials; the stage of verification entails testing the reliability of the information contained therein; and the stage of presentation involves arranging and displaying the data in a manner appropriate to a specific purpose12.
The second fundamental requirement for sui generis protection is that the database producer must be able to prevent third parties from extracting or re-utilising all or a substantial part of the database contents, either qualitatively or quantitatively. The concept of a “substantial part” should be assessed in comparison to the database as a whole and determined to reflect the scope of the producer’s investment, both in terms of quality and quantity. Within this framework, the extraction or re-utilisation of a substantial part of the database without the producer’s consent is legally preventable, thereby securing the protected database with respect to the producer’s economic and professional investments13.
When assessing the impact of web crawling activities on database rights, a distinction should be made between “General Crawling” and “Targeted Crawling.” Through general crawling, the exact contents of a database may be copied and presented on a competing platform as a substitute product, which can be regarded as a competition-distorting act. The situation is different in targeted crawling projects. Here, the web scraper does not aim at the entire database or its systematic integrity; rather, it processes only the specific data fragments (e.g., prices) required to generate analytical reports, treating them as “raw material.”
In this context, when assessing the “substantial part” criterion, it can be argued that targeted crawling constitutes a quantitatively negligible portion of the source database. Moreover, the use of the extracted data for transformative purposes—providing new and distinct added value to the market, such as risk analysis or valuation, rather than substituting the function of the original database—qualitatively reduces the likelihood of infringement. As in Japanese law, the requirement that the activity does not cause unjust harm to the legitimate interests of the rights holder also forms the basis of the legality threshold for sui generis rights under Turkish law.
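The qualitative side of this argument can be illustrated with a brief sketch: a targeted crawler that retains only price fragments and publishes aggregate statistics reproduces no individual record of the source database, so the output functions as an analytical report rather than a substitute. The figures below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical price fragments extracted by a targeted crawler
# (e.g., the output of the PriceScraper sketch above).
prices = [19.90, 24.50, 21.00, 23.75, 20.40]

# The published output is an aggregate market report, not a copy of
# the source database: no individual record is re-utilised.
report = {
    "sample_size": len(prices),
    "average_price": round(mean(prices), 2),
    "median_price": median(prices),
    "price_range": (min(prices), max(prices)),
}
print(report)
```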
Moreover, it is particularly important to assess data scraping activities not only through copyright law but also within the framework of competition law. In Germany and certain other European countries, systematically extracting substantial database content through automated means can be completely prohibited by contracts or technical measures. Such prohibitions may draw scrutiny from competition authorities, as they could reinforce monopolistic structures in digital markets. Preventing third parties from processing large-scale or strategically valuable data is often seen as an abuse of exclusive rights and a barrier to innovation, and restrictions that strengthen the market power of data holders are assessed under competition law principles. The global trend of allowing the use of publicly accessible content not blocked by robots.txt for analytical and transformative purposes highlights the need, in Turkish legal evaluations, to balance copyright protection with the freedoms provided under competition law.
Ultimately, the legality of database rights in the context of web crawling is determined less by the crawling technique itself than by the level of extraction and the concrete impact of the subsequent re-utilisation of the obtained content. Where the extracted output does not substitute for the market of the original work and serves legitimate purposes such as artificial intelligence training or market analysis, web crawling can be regarded as a complementary element of the digital economy.
IV. WEB CRAWLING PRACTICES THROUGH THE LENS OF INTERNATIONAL CASE LAW
Web crawling, as it constitutes automated access to online content and serves as the initial stage of large-scale data processing chains, has been subject to numerous judicial assessments across different legal systems. Courts, when examining this activity, evaluate the technical nature of the crawling, whether the extracted content qualifies as a work or merely as facts, and the scope of reproductions resulting from automated access. The likelihood of infringement is frequently assessed based on whether the extraction leads to permanent and systematic transfers or the creation of a substitute product. The treatment of temporary copies as an essential part of the technological process, coupled with the requirement that the rights holder’s objections be expressed through machine-readable means, represents a common trend in global case law. These approaches provide a general evaluative framework for understanding the boundaries within which web crawling activities can be considered lawful under intellectual property law.
1. Field v. Google14
One of the most significant precedents clarifying the status of web scraping, indexing, and caching activities under U.S. intellectual property law is the 2006 case Field v. Google, Inc. This decision holds particular significance as it establishes the legal legitimacy of the operational principles of search engines and automated data collectors. Blake Field, an attorney admitted to the Nevada Bar, filed a copyright infringement lawsuit against Google, alleging that 51 literary works published on his personal website were copied and distributed by Google. Field contended that Google’s automated crawler bots, known as Googlebot, violated his exclusive reproduction and distribution rights15 by crawling his website, caching the content, and presenting these copies to users in search results. The court ruled that Google’s actions did not constitute direct copyright infringement and dismissed the case.
The court held that the plaintiff, Field, had the ability to prevent Googlebot from crawling or caching his website by simply placing a meta-tag on the site. However, by failing to exercise this technical option, the court considered that he had effectively granted Google an implied license to crawl and display the content. Furthermore, because the plaintiff remained passive and thereby facilitated Google’s use of the data, his subsequent copyright claim was barred under the doctrine of estoppel.
In its reasoning, the court found that Google’s caching activity satisfied the four-factor “fair use” test. Under the factor of “purpose and character of the use,” the court emphasized that Google’s use was transformative. While Field’s works served an “artistic/literary” purpose, Google’s caching function served a completely different and functional objective, namely facilitating access to information, archiving, and tracking changes on websites. The court further held that Google’s status as a commercial entity did not diminish the transformative and public-benefit nature of the use.
The court also noted that Google’s activities were protected under the “caching” safe harbour provisions of the Digital Millennium Copyright Act (DMCA). Google’s presentation of content through temporary storage and retrieval fell within the scope of these safe harbour rules, which shield service providers from liability.
In conclusion, the court established that, when assessing the legality of web scraping and Google’s activities as a search engine, including web crawling, indexing, and caching, the “use of available technical measures” is a decisive criterion. Accordingly, a content creator is deemed to have given implied consent for search engines and crawlers to process their content as long as they do not block bots using industry-standard technical controls, such as robots.txt or no-archive meta-tags.
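By way of illustration, the no-archive signal discussed in the decision is conveyed through a standard robots meta-tag, which a compliant crawler can detect as in the following Python sketch; the sample page is hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaChecker(HTMLParser):
    """Collects directives from <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


# Invented page carrying the industry-standard directives.
page = '<html><head><meta name="robots" content="noarchive, noindex"></head></html>'

checker = RobotsMetaChecker()
checker.feed(page)

if "noarchive" in checker.directives:
    print("Do not serve a cached copy of this page.")
if "noindex" in checker.directives:
    print("Do not include this page in the search index.")
```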
2. Meltwater (Public Relations Consultants Association Ltd v. The Newspaper Licensing Agency Ltd)16
The legal status of temporary copies that inevitably arise during the processing of digital data was clarified by the Court of Justice of the European Union (CJEU) in its 5 June 2014 decision, Public Relations Consultants Association Ltd (PRCA) v Newspaper Licensing Agency Ltd (NLA). The CJEU held that “on-screen copies” and “cached copies” created on a user’s computer during internet browsing do not constitute copyright infringement under Article 5(1) of Directive 2001/29. This ruling is particularly significant for web crawling and web scraping technologies in relation to the “technical necessity” defence.
In the ruling, the act of copying was interpreted as part of the “technological process,” and the Court emphasized that these copies are an integral part of the process. More importantly, the “essentiality” criterion was defined in terms of efficiency. The Court acknowledged that without cached copies, the system would be unable to handle the existing volume of data, and the process would become “less efficient.” This finding legally strengthens the argument that copies created by web crawlers in RAM or temporary storage are necessary for the system to operate correctly and efficiently.
The ruling confirmed the “temporary” nature of on-screen copies that are deleted when the user leaves the site and cached copies that are overwritten once capacity is full. Accordingly, it was acknowledged that cached copies cannot exist independently of the technological process. This provides a strong legal defence in scenarios where web crawling for AI training involves retaining data only during the analysis process and deleting the raw data once the final model is generated.
The CJEU held that, since the copies form an integral part of the technological process, machine-driven copying in the background does not require copyright authorization as long as the final use accessible to human perception is lawful. This precedent supports the view that in web crawling activities, intermediary technical copies can be considered instrumental acts without independent economic value, provided that the ultimate purpose, such as data analysis or AI training, is lawful.
3. Robert Kneschke v LAION17
In the context of AI training, where data scraping has become a widespread practice and large-scale datasets are generated, the Hanseatic Higher Regional Court in Hamburg issued a significant decision on 10 December 2025. In the legal dispute initiated by photographer Robert Kneschke against the non-profit association LAION, the exceptions under the German Copyright Act (UrhG) were assessed at the appellate level, and Kneschke’s appeal was dismissed. The Court held that downloading and analysing copyrighted images for verification purposes before their inclusion in the dataset, even as part of a pre-processing activity prior to the main model training, falls within the text and data mining (TDM) exception under Article 44b UrhG18. The Court further explained that testing the alignment between visual content and textual descriptions using software methods should not be regarded merely as a technical verification step, but as an analytical process aimed at extracting meaningful relationships from the data, consistent with the legislature’s intent to promote innovation.
The decision also concerns the validity requirements of rights holders’ “opt-out” declarations intended to protect their exclusive rights. While the Court recognized that rights holders possess the authority to prevent data scraping, it emphasized that, under § 44b(3) of the German Copyright Act (UrhG § 44b Abs. 3), any reservation of rights over works accessible online must be made in a machine-readable form to have legal effect19. In this context, disclaimers included in general website terms of service, or expressed in natural language comprehensible only to humans, are not binding on automated crawling systems.
Moreover, in its “Three-Step Test” analysis, the Court held that the internal data processing at issue did not directly compete with the normal exploitation of the work, and that any potential market harm from the outputs generated at this stage was legally too abstract. Accordingly, the German Hanseatic Higher Regional Court established that proactive technical measures are a precondition for enforcing property claims in the digital environment, thereby recognizing a zone of legitimacy for data scraping activities where no technical barriers exist20.
In conclusion, the decision of the German Hanseatic Higher Regional Court signals a shift in intellectual property law from “human-centred” protection mechanisms to “machine-centred” protocols. In this context, web crawling is recognized as a fundamental method for data collection in the artificial intelligence ecosystem, and the legality of the activity is assessed based on whether it serves a transformative data-mining purpose and is not explicitly blocked through protocols such as robots.txt. In cases where no technical barriers are in place, categorical prohibitions based solely on the database owner’s general terms of service are deemed insufficient to trigger copyright liability.
4. hiQ Labs v LinkedIn21
hiQ Labs used automated software to crawl publicly accessible user profiles on LinkedIn and provided data analytics services to employers, reporting the “employee attrition risk” of company staff. When LinkedIn sought to block this activity on the grounds of “unauthorized access,” hiQ Labs filed a lawsuit against LinkedIn, seeking the removal of the restriction and a declaration that its activities were lawful.
In evaluating the dispute, the U.S. Court of Appeals for the Ninth Circuit did not limit its analysis to copyright law criteria alone; it also examined whether the automated access constituted unauthorized system use under the Computer Fraud and Abuse Act (CFAA) or amounted to unfair competition. In this context, the Court focused on whether hiQ Labs’ activities created a “substitute product.” The Court found that hiQ Labs did not use LinkedIn data to establish a competing social networking platform; rather, it processed the data to provide employers with services such as “employee engagement and attrition risk analysis,” which LinkedIn at that time did not offer.
In its assessment, the Court drew a legal distinction between copying raw data to create a competing product within the same market and processing the data to generate a distinct, non-substitutive product. The decision noted that allowing data providers absolute control over publicly available information could restrict the flow of information; consequently, contractual mechanisms imposing unilateral limitations on web crawling cannot be interpreted broadly in a manner that would stifle competition. The Court also held that unilateral statements of intent or cease-and-desist notices issued by the data provider do not constitute a sufficient legal basis to prevent the use of publicly available data in non-competitive and complementary business models. Within this framework, data analysis activities that do not substitute for the provider’s market share and do not directly compete with the source platform were found not to constitute unfair competition. The ruling’s recognition that automated crawling technologies can provide a legitimate usage domain for innovative, data-driven services is significant for understanding the legal status of web crawling on a global scale.
V. CONCLUSION
The act of web crawling constitutes a fundamental component of the digital ecosystem, as it represents a scalable scanning activity based on automated access to publicly available online data. Comparative legal approaches reveal that, particularly under EU law, crawling activities are considered to fall within a lawful domain in the absence of machine-readable technical barriers such as robots.txt. Similarly, in U.S. case law, the processing of content for transformative rather than directly consumptive purposes largely defeats copyright infringement claims under the doctrine of fair use. This demonstrates that, in many jurisdictions, the balance between promoting innovation and protecting rights holders is increasingly struck through technical criteria. In this context, it is not merely contractual prohibitions or general disclaimers addressed to human users, but machine-readable technical barriers like robots.txt that have become determinative. Consequently, the protection paradigm in intellectual property law can be said to be evolving from “human-centred” norms toward “machine-centred” protocols.
Under the FSEK framework, the scope of copyright protection remains limited, particularly with respect to raw factual data and compilations lacking originality or the author’s own intellectual creation. Sui generis database protection, on the other hand, becomes relevant only where a substantial investment has been made and a significant part of the database is reproduced in a substitutive manner. Within this context, transformative, targeted, and supplementary data analysis activities may be considered legitimate uses that support digital competition and innovation. Considering developments in international case law, the legal regime governing web crawling is expected to evolve toward a clearer and more predictable framework. Taken together, these considerations demonstrate that the assessment of web crawling under intellectual property law is not unidimensional but requires a multi-layered analysis that accounts for technical, contractual, and economic effects simultaneously.
VI. KEY TAKEAWAYS
(1) Pursuant to FSEK Art. 1/B and the case law of the Court of Cassation of Türkiye, “facts” such as stock market data, weather reports, or real estate listings do not reflect the author’s own intellectual creation and therefore are not eligible for copyright protection.
(2) Temporary records created in computer memory (RAM) to enable data analysis during web crawling should not be considered acts of “copying” under the CJEU’s Meltwater decision and contemporary doctrine; they are integral elements of the technological process, lack independent economic value, and are recognized as “temporary technical necessities” that enhance system efficiency.
(3) The Hamburg/LAION decision requires that, in the digital era, rights holders employ “machine-readable” technical protocols such as robots.txt to effectively assert their proprietary claims. Data that is publicly available and not protected by a technical “lock” may, under some circumstances, be crawled lawfully.
(4) A distinction must be drawn between “General Crawling,” in which the source database is copied wholesale and used to create a competing platform, and “Targeted Crawling,” in which the data serves solely as input for analytical purposes.
(5) Approaches in Japanese and U.S. law recognize that data processing intended for computational analysis rather than consumption constitutes a “transformative” use. Such use does not conflict with the normal exploitation of the work, does not substitute for its market, and contributes value to the innovation ecosystem.
(6) Under FSEK Additional Article 8, the “substantial part” criterion should be assessed not only by the quantitative proportion of the extracted data but also through a qualitative impact analysis, examining whether the extraction undermines the commercial value of the original database or the return on investment.
(7) Recent judicial decisions tend to consider the use of datasets for AI training as lawful under TDM exceptions, viewing it as part of innovation and knowledge acquisition, particularly where no technical barriers are in place.
(8) The legal future of data scraping is evolving toward a hybrid regime shaped by technical protocols such as robots.txt, principles of transparency, and licensing models that mediate between data providers and collectors.
(9) In assessing the legality of data processing, a key determinant is whether the newly produced output constitutes a substitute product that replaces the market for the original work; analytical reports that do not narrow the market but provide complementary value should be protected under competition law principles.