• Sailor Sega Saturn@awful.systems
    4 days ago

    Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What’s the point of that anyway?

    This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.

    Any crawler that doesn’t know what it’s doing and doesn’t respect robots.txt but wants to crawl an entire domain will end up following these sorts of links naturally. It has no sense that the requests are “complex”, just that it’s fetching a URL with a few more query parameters than the one it started at.
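To make that concrete, here is a rough sketch of what such a naive crawler might look like. Everything in it is made up for illustration: the `wiki.example` URLs, the `SITE` dictionary standing in for real HTTP fetches, and the `naive_crawl` helper. It is not how any particular scraper works, just the "fetch, extract links, repeat" loop in its simplest form.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "wiki" (URL -> HTML) so the sketch needs no network.
SITE = {
    "https://wiki.example/Page":
        '<a href="/Page?action=history">history</a>',
    "https://wiki.example/Page?action=history":
        '<a href="/Page?diff=2&amp;oldid=1">diff</a> '
        '<a href="/Page?diff=3&amp;oldid=2">diff</a>',
    "https://wiki.example/Page?diff=2&oldid=1": '<a href="/Page">back</a>',
    "https://wiki.example/Page?diff=3&oldid=2": '<a href="/Page">back</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def naive_crawl(start, fetch):
    """Breadth-first crawl: fetch a page, queue every new link, repeat."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        fetched.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            # No notion of "cheap" vs. "expensive": a diff URL is just a URL.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


crawled = naive_crawl("https://wiki.example/Page", SITE.get)
```

Starting from one article, the loop ends up requesting the history page and every diff it links to, simply because those were links on a page it had already fetched.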

    The article even alludes to how to take advantage of this with its “trap the bots in a maze of fake pages” suggestion. Even crawlers that know what they’re doing will sometimes struggle with infinite URL spaces.
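As a sketch of why infinite URL spaces are a problem, and of one common mitigation (a page and depth budget, which is my assumption here and not something the article describes): the `maze_links` function below is a hypothetical trap where every page links to two pages that have never been seen before, so a "visited" set alone never stops the crawl.

```python
from collections import deque


def maze_links(url):
    """Hypothetical trap: every page links to two deeper, brand-new URLs."""
    return [f"{url}/a", f"{url}/b"]


def crawl_with_budget(start, get_links, max_pages=50, max_depth=5):
    """Breadth-first crawl that gives up after max_pages or below max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper than the limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


pages = crawl_with_budget("https://trap.example/maze", maze_links)
```

Without the two limits the queue would grow forever, since the maze never runs out of fresh URLs. The limits stop the crawl, but they cannot tell a maze from a genuinely large site, which is presumably part of why even careful crawlers struggle here.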

    • HedyL@awful.systems
      4 days ago

      This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.

      It’s so ridiculous: supposedly these people have access to a super-smart AI (which is supposedly going to take all our jobs soon), but the AI can’t even tell them which pages are worth scraping multiple times per second and which are not. Instead, they regularly appear to kill their hosts like maladapted parasites. It’s probably not surprising, but still absurd.

      Edit: Of course, I strongly assume that the scrapers don’t use the AI in this context (I guess they only used it to write their code based on old Stack Overflow posts). That doesn’t make it any less ridiculous though.