
Technology Tarpits - A Class of Malware Designed to Deter Scraping

4DQSAR

Bluelighter

I found the above article to be quite interesting. It's been decades since I've done any programming but I think I understand the basic concept.

But I do have a couple of technical questions and a more philosophical question.

I'm guessing that tarpits contain HTML files that continually reference each other so that a crawler bot never reaches the end of a file but is instead trapped in a cycle where, on one hand, huge amounts of data are scraped, but on the other, that data is of no value. Would that be a correct analysis?

In the article 'Aaron' mentions that his tarpit (Nepenthes) only uses the resources of quite a modest machine, a Raspberry Pi being the example. So does the virtualization simply keep on generating virtual files, possibly fiddling with the attributes of those files? E.g. the header of a file might state that the file is, say, 16K in size, while in actuality it is practically endless?

The issue I sense with using Markov babble is that AI crawlers could potentially choose to reject data that is at odds with previously 'learnt' patterns.

On a much more ominous note, could someone adapt a tarpit to specifically inject untrue 'facts'? Could tarpits be used to rewrite history?

I really know nothing about this topic. But I was shocked by the potential implications. AI has reached the point where much of the technology employed to ensure the user isn't a robot is now redundant. Are there any emerging technologies that can potentially defeat AI?

I will conclude with a slightly OT point (but possibly of value). I've read that AI can learn to play many games, but one criterion I have read is that 'the board must be of a finite size'. With that in mind, is it therefore impossible for AI to master games such as infinite chess?
 
I really need a "mini scraper". Something that can download a list of URLs in a format similar to "Webpage, Complete". Grok has attempted to make me two using wget and Python's requests library, but he just can't get it right, and I don't understand why since it's such a simple function and he's almost got it.
 
I really need a "mini scraper". Something that can download a list of URLs in a format similar to "Webpage, Complete". Grok has attempted to make me two using wget and Python's requests library, but he just can't get it right, and I don't understand why since it's such a simple function and he's almost got it.
I wouldn't necessarily consider the string formatting an easy thing to work into a prompt for an LLM like Grok. This sounds a lot like a school assignment I've tutored before; it's almost like deja-vu or some shit. Either way, this can be done with a single for loop and a print statement combining a couple of variables and a string literal.
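Something along these lines would be my starting sketch (assuming Python 3 and the requests library you mentioned; urls.txt, the pages/ folder and the naming scheme are just placeholders I made up). Bear in mind this only saves the main HTML document - a true "Webpage, Complete" also pulls images/CSS/JS, which is what wget's -p/--page-requisites and --convert-links flags are for, so don't expect requests alone to match that.

# mini_scraper.py - minimal sketch: fetch each URL from urls.txt and save the raw HTML.
# Note: this is NOT "Webpage, Complete" - it saves only the main document, no assets.
import os
import re
import requests

with open("urls.txt") as f:                       # one URL per line
    urls = [line.strip() for line in f if line.strip()]

os.makedirs("pages", exist_ok=True)

for i, url in enumerate(urls):
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException as e:
        print(f"[{i}] failed: {url} ({e})")
        continue
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", url)[:100]   # turn the URL into a safe filename
    path = os.path.join("pages", f"{i:04d}_{name}.html")
    with open(path, "w", encoding="utf-8", errors="replace") as out:
        out.write(resp.text)
    print(f"[{i}] saved {url} -> {path}")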

A program as simple as the one you're asking about here is also highly vulnerable to the tarpits this thread is about; detecting anomalies like that requires an entire system to keep track of short-term memory, and it's a significant pain in the ass.
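To make the tarpit idea concrete, here's a toy sketch of my own (emphatically NOT Nepenthes' actual code, just an illustration on Python's standard http.server): every path returns a page generated on the fly, seeded from the URL itself, whose links lead only to more generated pages. Nothing is stored on disk, which is why this kind of thing runs happily on a Raspberry Pi, and a crawler that follows links will never run out of URLs.

# tarpit_sketch.py - toy illustration of the tarpit idea (NOT Nepenthes itself).
# Every path returns a page generated deterministically from the URL, whose links
# lead only to more generated pages, so nothing is stored and the URLs never run out.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["alpha", "bravo", "cortex", "delta", "ember", "fulcrum", "gamma", "helix"]

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Seed a PRNG from the requested path so the same URL always yields the same page.
        seed = int.from_bytes(hashlib.sha256(self.path.encode()).digest()[:8], "big")
        rng = random.Random(seed)
        filler = " ".join(rng.choice(WORDS) for _ in range(200))            # babble body
        links = " ".join(f'<a href="/{rng.getrandbits(64):x}">more</a>'
                         for _ in range(10))                                # ten fresh links
        body = f"<html><body><p>{filler}</p>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()

As I understand it, a real tarpit like Nepenthes also drip-feeds each response a few bytes at a time to tie the connection up, and fills the pages with Markov babble rather than a fixed word list, but the basic shape is the same.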
 
Very interesting article (thanks for sharing). I knew about sites combating web crawlers, and that AI was basically getting trained on the internet, but I hadn't thought through the next logical step of AI bots crawling for data. Nor had I considered the counter to that; tarpits is an apt name. I'm not sure how well these tarpit solutions will hold up, since poisoning the AI feed will quickly be identified and eliminated. There's still the waste of time and effort by the AI companies in gathering useless data, but they have a lot more money and staff to put into improving their data gathering. Guys like the ones in the article are Davids against Goliaths, so I doubt they'll hold up for long. But it is always good to see the fight in the Davids, and to know they have a vast network of people like them, all with new and innovative takes on how to counter the AI feast. Nobody is simply shrugging and giving up, which is inspiring.
 
It may well come to resemble the models more frequently used to explain the attacker/defender dynamic in warfare.

However a tarpit detects and traps a crawler, researchers will quickly find a way to defeat it. Scraping relies on just how cheap it is to collect all of the data. As soon as it becomes more costly, the entire dynamic changes.

Elsewhere I mentioned that, at least as of 2022, no chatbot could produce a sensible proposition based on the Tractatus Logico-Philosophicus, possibly because it has no way of knowing what is a formal statement, what is an aside and what is a joke. Similarly, how would scraping handle something like Finnegans Wake, which uses the sound of words as a form of association (along with many other non-logical associations)?

So I really can imagine a situation where someone like CrowdStrike simply places a tarpit onto every computer. Now you may reasonably assert that virus-detection software would spot it, but over a decade ago ARM patented what it was pleased to term 'Chameleon code', in which every copy of the code is different but functionally does the same thing.

I suggest that the advantage lies with the defender. One need not stop all scraping. One need only make it too costly to continue.
 
'Chameleon code', in which every copy of the code is different but functionally does the same thing.
My degree is in InfoSec and I've got close to a decade and a half of working in cybersecurity. There exists a thing called "mutagenic malware" that acts under this same concept of "mutagenic non-volatile payloads", where the payload gets rewritten in code that people refer to as synonymous. I've tried to write variants of this, even just for a VM that I wrote myself to be simple and straightforward, and it's shockingly difficult. Especially if you're trying to get it to target an exploit with only a single vector of attack, you often have to be sneaky and time things just right, and that can be a pain in the ass to find synonymous code for. It helps to bounce things back and forth through assembly, various forms of embedded .exes (like a Wine file, for example); hell, I've managed to get mutagenic malware to hide itself as a renderable .png once before. If you ever want to talk cybersecurity don't hesitate to hit me up! It's a big passion of mine.
 
My degree is in InfoSec and I've got close to a decade and a half of working in cybersecurity. There exists a thing called "mutagenic malware" that acts under this same concept of "mutagenic non-volatile payloads", where the payload gets rewritten in code that people refer to as synonymous. I've tried to write variants of this, even just for a VM that I wrote myself to be simple and straightforward, and it's shockingly difficult. Especially if you're trying to get it to target an exploit with only a single vector of attack, you often have to be sneaky and time things just right, and that can be a pain in the ass to find synonymous code for. It helps to bounce things back and forth through assembly, various forms of embedded .exes (like a Wine file, for example); hell, I've managed to get mutagenic malware to hide itself as a renderable .png once before. If you ever want to talk cybersecurity don't hesitate to hit me up! It's a big passion of mine.

I suppose it largely depends on the CPU the code is for. ARM Thumb-2 is largely orthogonal, so right away you can just swap around register usage. A step up from that would, I assume, be for each subroutine to be compiled separately and the whole thing linked in a different order each time. Then you get into all the odd cases that maybe only someone with 8-bit experience would know. On the Z80 it's faster to XOR A than to LD A,#0. Well, Thumb has some of those as well. Not faster, but different.

Then it's almost certainly going to be the case that if a piece of code is calling all of these subroutines, there will be many cases in which the order doesn't matter.

The first example I saw was an 8086 virus with just one valid instruction. It was an INT 21h and, as it turns out, when DOS loads a file, it's in GCR format and the data needed to decode it is returned in the load. So if one can apply 'Chameleon code' to a code fragment that decodes the payload as and when needed, I suspect it would be hard for any traditional virus-checker to spot... or at least not UNTIL it's executed and given control. As I'm sure you know, zero-day exploits are worth money - LOTS of money. But if one can modify a process, off you go.

A more recent one actually managed to alter the standard libraries of a compiler so anything compiled came complete with malware.

But that is extreme. As soon as you wind up with constant user intervention due to false alerts, that's enough.

Hence my conclusion that the defender has the advantage. They do not seek to 'win', nor even for scrapers to 'lose', but rather to make it such a costly exercise that it no longer makes financial sense.

I wonder if AI would be able to spot 'Chameleon Code'? Because you can bet ARM will want to ensure the SecurCore line remains secure.
 
Hence my conclusion that the defender has the advantage. They do not seek to 'win', nor even for scrapers to 'lose', but rather to make it such a costly exercise that it no longer makes financial sense.
I definitely agree with this; there's much more effort in the creation of mutagenic malware than there is in defending against it.
I wonder if AI would be able to spot 'Chameleon Code'? Because you can bet ARM will want to ensure the SecurCore line remains secure.
I actually have a script I wrote back in the day, during my Perl phase, that would boot up a sandbox and run something in it, watching the heap and registers and looking for similarities to flagged malware I had in a SQL database. The hardest part was just getting Perl to not forget to commit to that SQL database, as the library I was using would throw some of the most classically esoteric Perl errors I'd ever seen in my entire life.
I suppose it largely depends on the CPU the code is for. ARM Thumb-2 is largely orthogonal, so right away you can just swap around register usage. A step up from that would, I assume, be for each subroutine to be compiled separately and the whole thing linked in a different order each time. Then you get into all the odd cases that maybe only someone with 8-bit experience would know. On the Z80 it's faster to XOR A than to LD A,#0. Well, Thumb has some of those as well. Not faster, but different.

Then it's almost certainly going to be the case that if a piece of code is calling all of these subroutines, there will be many cases in which the order doesn't matter.

The first example I saw was an 8086 virus with just one valid instruction. It was an INT 21h and, as it turns out, when DOS loads a file, it's in GCR format and the data needed to decode it is returned in the load. So if one can apply 'Chameleon code' to a code fragment that decodes the payload as and when needed, I suspect it would be hard for any traditional virus-checker to spot... or at least not UNTIL it's executed and given control. As I'm sure you know, zero-day exploits are worth money - LOTS of money. But if one can modify a process, off you go.
I suspect that the majority of these would be targeting x86s and ARMs, because Windows and Android devices are the bread and butter of most users. Swapping register usage is actually a quick way to get caught by an AV; I found that storing obfuscated data in the same registers triggered AVs far less often when I was writing my old 'Mutagenerator' when I first got into this as a teenager. I'm glad another BLer on here is familiar with assembly code; I've almost used XOR in sentences on here before because I've spent too much time coding and now I'm just codebrained, I suppose. Fragmenting the subroutines into various locations helped avoid AV detection, and honestly the mutagenic part doesn't necessarily need to be fast (like your XOR A vs LD A,#0 trick); in fact, if it operates slowly and waits for just the right time - say, checking for a pid with an open handle on an inode known not to be the malware's, so it's 'busy' at the moment - that could be the best time to deploy the payload. Deobfuscation of the payload, and keeping track of the pointers if it's been fragmented throughout memory, must be an absolute goddamn nightmare though, and I'm sure it would be possible for AVs to eventually look for unexpected fragmentation in memory, especially if it's not of any recognized file format/encoding.
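To show the "synonymous code" idea in something high-level - a throwaway Python toy, nothing to do with real ARM or x86 payloads, and every name in it is made up - you can rename the 'registers' and shuffle the independent lines, and every generated copy computes the same result while its bytes (and therefore any naive signature hash) differ:

# toy_variants.py - harmless illustration of "synonymous code": permute names and
# reorder independent lines; each variant computes the same result but hashes differently.
import hashlib
import random

TEMPLATE = [
    "{a} = x * 3",
    "{b} = x + 7",
    "result = {a} + {b}",
]

def make_variant(rng):
    a, b = rng.sample(["r0", "r1", "r2", "r3", "tmp", "acc"], 2)
    first_two = TEMPLATE[:2]
    rng.shuffle(first_two)                  # these two lines are independent of each other
    lines = first_two + [TEMPLATE[2]]       # the dependent line stays last
    body = "\n    ".join(lines).format(a=a, b=b)
    return f"def f(x):\n    {body}\n    return result\n"

rng = random.Random(1)
for i in range(3):
    src = make_variant(rng)
    ns = {}
    exec(src, ns)                           # build this variant's f()
    digest = hashlib.sha256(src.encode()).hexdigest()[:12]
    print(f"variant {i}: f(5)={ns['f'](5)}  sha256={digest}")

Real mutagenic engines do the same thing at the instruction level, which is where it gets genuinely hard, as you said.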

The 8086 came out 21 years before I was born; what was it like working on them? This is the first time I've heard of a security investigation involving 8086s. I've always been fascinated with emulators of 8086s, PDP-11s, Commodore 64s, etc.; they have always been passions of mine.
 
BTW there are at least two flaws in the design of the SecurCore SOCs.

One is that certain bitfields in the SR are 'privileged', BUT SecurCore supports Thumb, Thumb-2 and Jazelle, and if you exit Jazelle (bytecode $ff followed by $00,$00,$00) it will change a privileged bit without triggering a FAULT. That done, you can go on to alter the other bits. Whatever prevents the FAULT persists.

But building on that, SecurCore's MMU allocates RAM in 128-byte blocks - so a well-crafted piece of JavaCard can drop out of Jazelle and have 64 instructions with which to further exploit the as yet unproven third zero-day (if that's the term) design flaw. To wit, JavaCard relies on static testing during compilation, which allows malformed applets that contain something more complex. It's been shown that one can corrupt the pointer list so as to call unchecked bytecode instructions.

At the end of the day, SecurCore is just a Cortex M3 with bits added and due to conflicts in the design, there are a host of odd things.

One I noted was that the bit-shift instructions support getting the number of shifts from one register and using that value to shift the number in another register. But for no reason anyone at ARM could give, it uses the bottom EIGHT bits, even though anything beyond 5 bits won't officially do anything. Better, you can specify the same register, i.e. take the bottom 8 bits of R0 and shift R0 by that number. And it's not the only example by far. You can perform bit-shifts on IP, SP and LR. The last of these strikes me as the most useful. IF you can alter the contents of the LR without privileges, I see a hole...
 
BTW there are at least two flaws in the design of the SecurCore SOCs.

One is that certain bitfields in the SR are 'privileged', BUT SecurCore supports Thumb, Thumb-2 and Jazelle, and if you exit Jazelle (bytecode $ff followed by $00,$00,$00) it will change a privileged bit without triggering a FAULT. That done, you can go on to alter the other bits. Whatever prevents the FAULT persists.

But building on that, SecurCore's MMU allocates RAM in 128-byte blocks - so a well-crafted piece of JavaCard can drop out of Jazelle and have 64 instructions with which to further exploit the as yet unproven third zero-day (if that's the term) design flaw. To wit, JavaCard relies on static testing during compilation, which allows malformed applets that contain something more complex. It's been shown that one can corrupt the pointer list so as to call unchecked bytecode instructions.

At the end of the day, SecurCore is just a Cortex M3 with bits added and due to conflicts in the design, there are a host of odd things.

One I noted was that the bit-shift instructions support getting the number of shifts from one register and using that value to shift the number in another register. But for no reason anyone at ARM could give, it uses the bottom EIGHT bits, even though anything beyond 5 bits won't officially do anything. Better, you can specify the same register, i.e. take the bottom 8 bits of R0 and shift R0 by that number. And it's not the only example by far. You can perform bit-shifts on IP, SP and LR. The last of these strikes me as the most useful. IF you can alter the contents of the LR without privileges, I see a hole...
This is fascinating. I haven't dived into ARM exploits in a while; my focus has been much more on the development of an audio glitch-making tool for music producers.

The exploits you describe don't even seem challenging to pull off given an assembler or some C code with inline assembly. Did you come across these vulnerabilities through NVD-profiled CVEs, a whitepaper (or whitepapers), or from a messageboard/IRC server?
 
Yeah - I wrote all the music/SFX drivers for quite a famous UK-based game developer. We just ripped the sound chips out of each console and had them mounted as boards that plugged into a PC. So one tracker allowed the musician to write tunes and SFX for everything from the C64 to Gameboy to NES, SNES, Master System, Megadrive and even the GBA and N64. All 100% assembly language. For the Megadrive version, I used the Z80, so it took anywhere up to 128 cycles to trigger a new tune AND six SFX.

I sat down with the manual and read it, then got hold of an M0+ based SoC to practice Thumb coding (and to write a fixed-point MP2/MP3 decoder which works on a 32 MHz M0+ based system) - partly to sell, but also to find every optimization.

For example, unlike 8-bit CPUs like the 6502, the M0 doesn't have a zero-page. However, you can load an 8-bit value into a register in a single instruction AKA single cycle. So I placed all of the variables I accessed a lot in the bottom 1K of RAM. I should explain that when using Rx,Ry addressing, if you access a 16-bit value, it doubles the index. If you access a 32-bit value, it multiplies the index by four. So with careful planning, you can access 256 variables, whatever their size.

But then I actually got hold of a SIM with an SC300 SecurCore. As you know, adding applets to SIMs requires OTA updates, so it's slow. Having identified flaws, I tested them. But in truth, I would really have needed to obtain the object code of the SIM's applets and OS, run it on an emulator and fuzz all hell out of it.

I always felt that security through obfuscation was a REALLY bad idea. Eventually people waded through all of that paperwork, and now we find that at base JavaCard isn't secure, and likewise ARM still largely relies on obfuscation itself. I did ask and could get no rational answers - mostly along the lines of an admission that what I had noted MIGHT be true... but they couldn't see a practical attack. Now I have NOTHING on my phone. NOTHING. But I suppose if someone was a high enough priority target, someone with the resources CAN get in there.

You have to remember how old I am. It's not genius, it's just age. I just do not trust any digital device.
 
Yeah - I wrote all the music/SFX drivers for quite a famous UK-based game developer. We just ripped the sound chips out of each console and had them mounted as boards that plugged into a PC. So one tracker allowed the musician to write tunes and SFX for everything from the C64 to Gameboy to NES, SNES, Master System, Megadrive and even the GBA and N64. All 100% assembly language. For the Megadrive version, I used the Z80, so it took anywhere up to 128 cycles to trigger a new tune AND six SFX.

I sat down with the manual and read it, then got hold of an M0+ based SoC to practice Thumb coding (and to write a fixed-point MP2/MP3 decoder which works on a 32 MHz M0+ based system) - partly to sell, but also to find every optimization.

For example, unlike 8-bit CPUs like the 6502, the M0 doesn't have a zero-page. However, you can load an 8-bit value into a register in a single instruction AKA single cycle. So I placed all of the variables I accessed a lot in the bottom 1K of RAM. I should explain that when using Rx,Ry addressing, if you access a 16-bit value, it doubles the index. If you access a 32-bit value, it multiplies the index by four. So with careful planning, you can access 256 variables, whatever their size.

But then I actually got hold of a SIM with an SC300 SecurCore. As you know, adding applets to SIMs requires OTA updates, so it's slow. Having identified flaws, I tested them. But in truth, I would really have needed to obtain the object code of the SIM's applets and OS, run it on an emulator and fuzz all hell out of it.

I always felt that security through obfuscation was a REALLY bad idea. Eventually people waded through all of that paperwork, and now we find that at base JavaCard isn't secure, and likewise ARM still largely relies on obfuscation itself. I did ask and could get no rational answers - mostly along the lines of an admission that what I had noted MIGHT be true... but they couldn't see a practical attack. Now I have NOTHING on my phone. NOTHING. But I suppose if someone was a high enough priority target, someone with the resources CAN get in there.

You have to remember how old I am. It's not genius, it's just age. I just do not trust any digital device.
Wow, I haven't even thought about trackers in a long time. Growing up for me, it was the age of DAWs like FL Studio and Ableton Live. I've personally released a handful of albums all made entirely in FL Studio, and used to work as a professional audio engineer using Pro Tools and FL, depending on the client. I can't even IMAGINE trying to write meaningful audio DSP in assembly code.

Obfuscation is terrible security; however, I'd argue that obscurity can provide fantastic security. I've known some math PhD students who work in cryptography and understand cryptographic primitives well enough to play with writing algos that are bordering on intractable to crack; I used to try to help crack them by devising scripts that generated rainbow tables we'd throw at them to see if any keys clashed. My primary concern with ARM exploits (at least of those I'm aware of - like I said, I've been in the dev world more so than the sec world recently) is drive-by downloads. I've personally written some and tested them on myself, even on the most recent versions of a variety of phones (plus in Android Studio), and they all worked perfectly. That means that one accidental pop-up could initiate a download, and if that DL made it past the AV, the job is done.

With age comes wisdom. I'm not expecting genius out of you, but I do value the conversation with somebody else who obviously holds passion and has dedicated a lot of time to refining their understanding of computing. Computing is such a beautiful thing that I'd arguably call it the closest thing I have to a religion.

I also trust no digital devices; Faraday bags and keeping them in a pillow in the freezer (to block out being listened to) have been helpful for me in the past. I'm currently reworking a really bizarre piece of cryptography software I wrote a while back to make it significantly more difficult to crack; if you ever want to talk tech at all, feel free to reach out! I've been coding for 18 years, hacking various things (mostly cracking software) for 13, and worked in the industry for 4. Most of all though, I'm just a huge nerd for all of this, hahaha.
 
I'm afraid I don't get on well with any form of OS. I just about stood still for DOS, but I'm strictly assembly language, and strictly on the basis of hand-optimized code. The M0+ claimed 0.95 MIPS/MHz, but for the MP3 decode the inner loop of the FFT is 134 instructions, and Thumb's conditional branch only supports a signed 8-bit offset, so I ended up having to decrement the loop counter, conditionally branch to the exit, and use an unconditional B back to the top of the loop.

Now the JMP is also an interesting one. It's still PC-relative, and goodness knows how the MMU of a SecurCore would handle it. But that's the point, isn't it? To do things so unexpected that the hardware designers don't fully think through the security implications. Because it's a rare example of an instruction that takes two cycles, and if you B (or BX or BXJ) to a 32-bit aligned address, it will switch to full ARM if present, or to Thumb if it's not aligned. But when does that switch occur? Also, BX swaps the register set, soooo what if you can modify that alternate set?

FYI ARM explains how its assembly-language instructions work in the form of snippets of C (or some manner of C), BUT I've already found lots of cases in which the CPU certainly does NOT do what the C does. The fact that I see essentially useless but provably different behaviors kind of makes me wonder just how many there may be...

I spent a couple of decades writing code for commercial games - always the guy who hand-optimized the parts of the C code the profiler showed to be using a lot of cycles. Hence I DO notice how pipelining works (and doesn't). I've had cases where code single-stepped one instruction at a time behaved differently from when it was just run, due to pipelines trying to be clever. Give me VLIW and I will manage 4 or 8 instructions in parallel. NOP is a word I hate.

But what you do is WAY out of my league.
 
FYI ARM explains how its assembly-language instructions work in the form of snippets of C (or some manner of C), BUT I've already found lots of cases in which the CPU certainly does NOT do what the C does. The fact that I see essentially useless but provably different behaviors kind of makes me wonder just how many there may be...
Getting to know these incongruities between documentation and reality could lead to some very interesting exploits; this shit's got me feeling devious now, hahaha.
I spent a couple of decades writing code for commercial games - always the guy who hand-optimized the parts of the C code the profiler showed to be using a lot of cycles. Hence I DO notice how pipelining works (and doesn't). I've had cases where code single-stepped one instruction at a time behaved differently from when it was just run, due to pipelines trying to be clever. Give me VLIW and I will manage 4 or 8 instructions in parallel. NOP is a word I hate.
If NOP is a word you hate (and I hate to use it once more), have you ever had to NOP-sled your way out of a branch of assembly? I've used them before when writing malware payloads to scoot the instruction pointer over to where I want it to be, as NOPs on Windows XP Service Pack 2 would almost never trigger AVs. Many a shoddily written piece of malware was created by teenaged me using this technique, but I was always curious whether developers who are heavy on assembly did similar things.
But what you do is WAY out of my league.
Eh, just different leagues; all talents develop kind of like progressing down different avenues. There's no way I'm more proficient with assembly code than you, for example, and I'm probably more experienced at something like hardening a server or pentesting a network - just different skillsets from different eras.
 
Do something useful with the cycles?

Write PC-relative code so you can just copy it anywhere in RAM and it will execute fine?

The latter is apparently how early 68000-based Apple Macs worked. You had to write code where EVERYTHING was PC-relative. Think about that for a moment. The MegaCD sound driver I wrote used PC-relative code/data so it could be included as a binary. My goodness, what a nightmare.
 
Jesus, man, I get frustrated keeping track of C++ pointers when I'm writing audio DSP code; I couldn't imagine trying to write an entire sound driver in assembly, and having to keep everything PC-relative on top of that is unreal.

If you ever want to play around and build a CPU from scratch with logic gates, the book Nand2Tetris was mind-blowing for me when I was younger and had a ton to learn about CPUs and why their ISAs vary. With your background in assembly, I bet you'd be able to slap the hardware together quickly and probably write some fascinating things if you wanted, and the best part is that you'd get to develop your own ISA for it!
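The whole premise of the book is that everything above the transistor is just NAND gates composed together; here's a tiny Python sketch of that idea (the function names are mine, purely for illustration):

# nand_toy.py - the Nand2Tetris premise in miniature: build every gate from NAND.
def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def not_(a: int) -> int:          # NOT is NAND of a signal with itself
    return nand(a, a)

def and_(a: int, b: int) -> int:  # AND is a negated NAND
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:   # OR via De Morgan
    return nand(not_(a), not_(b))

def xor(a: int, b: int) -> int:   # XOR from four NANDs
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", and_(a, b), or_(a, b), xor(a, b))

If I remember right, the book then walks you up through an ALU, registers, a CPU and eventually an assembler and compiler for its own toy ISA.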

Thanks again for the conversation here, I appreciate it a ton.
 
BTW for the M0, a NOP was actually a MOV R0,R0 and they bragged that for the M0+, it had a REAL NOP.

In fact, isn't the x86 NOP actually just a MOV AX,AX?

Jazz it up using MOV BX,BX or something...
 