
The Ntrepid Podcast

Click on any of the podcasts below to learn more about Ntrepid products

Subscribe with your podcatcher: https://www.ntrepidcorp.com/podcasts/rss

Episode 1 – Improving Data Harvesting
Episode 2 – Malware Protection
Episode 3 – International Data Scraping
Episode 4 – Cookies and Web Scraping
Episode 5 – Browser Fingerprints and Scraping

Full transcripts for each episode appear below.

Trends in Cybersecurity

Click on any of the podcasts below to listen to our conversations with cybersecurity industry thought leaders

Subscribe with your podcatcher: https://www.ntrepidcorp.com/trends-podcast/rss

Episode 1 – Jon Miller, Cylance

Video Podcasts

Click on any of the video podcasts below to learn about online threats and how to protect yourself

You Can't Trust Your Browser

Three newly discovered zero-day Adobe Flash exploits are just the latest in a continuous flood of serious browser vulnerabilities.

Videos

Click on any of the videos below to learn more about Ntrepid products

Ion for Web Scraping

Empower Your Data Harvesting Tools. ION disperses data harvesting activity over a vast network of anonymous IP addresses.

Documents

Click on any of the documents below to learn more about Ntrepid products

Passages Secure Virtual Browser – Data Sheet
Nfusion Secure Virtual Desktop – Data Sheet
Modern Malware and Corporate Information Security – White Paper
Ion Web Scraping Solutions – Data Sheet
Peeling Back the Layers of the Tor Network – White Paper

Events

Jun 13 – Jun 15: AFCEA Defensive Cyber Operations Symposium

AFCEA's Defensive Cyber Operations Symposium provides an ethical forum where government and industry will focus on "Connect and Protect." This year's summit will be held June 13–15 in Baltimore, MD. Discussions will revolve around innovative technology, advancing cybersecurity, and building new relationships to ensure that the networks within DoD are adaptive, resilient, and effective against diverse threats. Our team will be at booth #774 demonstrating how Ntrepid's solutions empower government and industry online research and data collection, while eliminating threats to an online workforce.

Press Releases

March 1, 2017 – Passages Honored with Cutting Edge Anti-Malware Solution Award
February 14, 2017 – Passages Wins Three 2017 Info Security Products Guide Global Excellence Awards
December 7, 2016 – ESG Report Illustrates the Case for Secure Virtual Browsers
November 21, 2016 – Ntrepid's Hire of CIA Cyber Security Architect Bolsters Cyber Defense Expertise
October 19, 2016 – Ntrepid Corporation Founder Richard Helms to Speak at CyberMaryland 2016
September 13, 2016 – Ntrepid Reveals Technical Advisory Board for Passages
September 9, 2016 – Passages Chief Scientist Lance Cottrell to Speak at (ISC)² Security Congress 2016
June 7, 2016 – Ntrepid Recognized As Top Innovator by NATO Communications and Information Agency
April 13, 2016 – Ntrepid to Host Second Georgetown SSP Cyber Symposium
March 29, 2016 – Ntrepid Announces Key Enhancements to Passages Enterprise
March 2, 2016 – Ntrepid's Passages Takes Home Two Awards During RSA 2016 Conference
February 29, 2016 – Ntrepid Announces Sign Up for Secure Web Browser for OPM Breach Victims
February 17, 2016 – Ntrepid Collaborates with Cylance for Industry Leading Protection Against Drive-by Malware
February 3, 2016 – Ntrepid Offers Secure Web Browser to Victims of OPM Breach
December 15, 2015 – Ntrepid Announces General Availability of Passages Enterprise
November 19, 2015 – Ntrepid Teams with Georgetown to Present SSP-Ntrepid Cyber Symposium
September 28, 2015 – Ntrepid's Lance Cottrell Headlines Two Presentations at 2015 (ISC)² Security Congress
April 20, 2015 – Ntrepid Protects Against Targeted Attacks, Isolates the Network With Passages Secure Virtual Browser
September 29, 2014 – Ntrepid Awards Students Prize Money for Coursework
July 24, 2014 – Ntrepid Timestream Software Ranked Among Hottest Legal Products of 2014
June 19, 2014 – Ntrepid Makes Interactive Timeline Software Available to Universities
February 24, 2014 – Ntrepid Announces Passages, the Secure Browsing Solution for Online Safety and Identity Protection
February 21, 2014 – Ntrepid Corporation Issued Patent for Secure Network Privacy System
February 3, 2014 – Ntrepid Introduces Presentations and Reports to Timestream, Export Both Directly from the Case Timeline
October 17, 2013 – Ntrepid Introduces Timestream Capture, an iOS App that Simplifies Field Evidence Collection
September 3, 2013 – Ntrepid Introduces Timestream Connect, a Streamlined Approach to Case Visualization and Collaboration
April 30, 2013 – Ntrepid Timestream Interactive Timeline Software Makes It Easy to Visualize and Explain Events Over Time
February 25, 2013 – Ntrepid Corporation Issued Patent for Online Identity Protection System

Episode 1

Improving Data Harvesting

Welcome to the Ntrepid audio briefs: Issue Number 1. My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation. In this issue, I will be talking about collecting big data against resisting targets.

Big data is the big buzzword right now, and rightly so. There’s really two kinds of big data out there: there’s what you collect in the course of business, internally generated big data, and the big data that you go out and get. And I’m really going to focus on the second here. Going from basic Internet data collection to big data Internet collection introduces some real problems.

So let’s consider a couple scenarios. In the first case, imagine you’re trying to collect a large amount of data from a web search engine to look at your SEO (Search Engine Optimization) rankings. So you’re going to want to look at lots and lots of different search terms and not just the first page of results, but many pages of results, and this is going to add up to a lot of hits on the search engine site. They’re fairly quickly going to detect this activity and you’ll hit their throttles and they’ll block your activity – they’ll prevent you from being able to do the searches. And staying below that threshold may make your activity take hours or days versus just minutes if you could go as fast as you possibly can.

Another scenario would be looking for competitive intelligence. So, imagine you need to be getting information on pricing or product information, trademark infringement, monitoring your resellers – lots of different reasons you’d want to look at your competitors or even subsidiaries on the Internet. And we see a lot of blocking here too when you’re doing too much activity and exceeding some kind of threshold. But we’re also seeing sites getting really smart.

So, imagine you’re an airline and you want to look up pricing for your competitors. So, Airline A wants to look at Airline B’s prices, and they don’t want to just look at one price, they want to look at every pair of cities for every departure time for every day between now and several months from now, because we know these prices aren’t static, they’re changing continuously. Now what happens is that if you’re detected, you actually get fed wrong information. The prices will be systematically incorrect: they may make all the prices appear higher than they really are, to trick you into competing against those prices so you won’t get to fill your seats. Or they’ll make them look lower than they really are, to get you to underprice and undercut your margins. So it’s really very important to avoid detection when you’re going about these kinds of activities.

Now there’s a lot of things that can lead to these variations in information. It may not just be who you are, it may be by location, or time of day or many other kinds of characteristics. For example, Orbitz for quite a while was showing more expensive hotels to people searching from Mac computers versus Windows computers.

The general principle here is that websites aren’t things. We often talk about “the” Internet, but that’s really very misleading. Much of the web is now created on the fly, it’s all dynamic, it’s more of a process than a thing. So, when you go to the webpage, it’s created in the moment you look at it, based on who you are, where you’re coming from, what information they have in the database. And then they assemble that page to order, just for you.

So the Internet, rather than being some thing that you can look at, is more like a hologram: you need to be able to look at it from multiple perspectives to really understand what it looks like.

So when I talk about the main obstacles to big data collection, I’m usually thinking about blocking and cloaking. And blocking is what it says it is, the website simply prevents access, and I talked about that in the initial scenarios. And cloaking is when a website is set up to provide different, false information, and that was what I talked about in the airline example – you need to get access to some kind of data, and it’s important that you be able to access it, that you not be blocked, and that when you do access it, the data you’re getting is correct and real, and that’s avoiding cloaking.

In some cases, you just want to understand the targeting. So, if a website is providing different information to different people, you may simply want to understand who they show which information to, because that may be important from a competitive positioning point of view.

The real thing that sets big data collection on the Internet apart from simple data collection on the Internet is volume. You could be hitting a website hundreds of thousands to millions of times in a relatively short period.

And so even if you’re anonymous, even if you’ve done a thorough job of hiding who you are and where you’re coming from, it’s still going to be obvious to the website that someone is hitting them a hundred thousand times. It’s like shining a huge spotlight on their website. They’re going to see this activity; your IP address will show up right at the top of their logs. So the trick here is to diffuse your activity – rather than looking like one huge visitor hitting a hundred thousand times, you need to look like a huge number of relatively low-activity visitors, all of which are behaving in a normal way, at normal levels of intensity.

So what’s the give-away? The IP address is the real common denominator, it’s the thing everyone tracks, and it’s one of the hardest things to hide. And the magic metric that you want to watch is the hits per target, per source IP address, per time period.

So you need a realistic number of connections coming, not just per day, but also per hour and per minute, to look plausible. You need to stay human. So, when you’re looking at hits per source IP per day, you might want to stay below, say, fifty pages, while on a per-minute basis you probably need to make sure you’re staying below five pages, depending on the website. And you’ll notice here that the number per minute, multiplied by the number of minutes in a day, does not add up to the number per day, because no one sits at their computer clicking continuously all day on the same website, right? So, looking realistic involves all different timescales.
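
To make that kind of throttling concrete, here is a minimal Python sketch that spreads requests across a pool of source IP addresses and enforces per-IP budgets per minute and per day. The proxy addresses, page limits, and back-off interval are illustrative assumptions, not Ntrepid defaults.

    import random
    import time
    from collections import defaultdict, deque

    import requests  # assumes the third-party 'requests' library is installed

    # Hypothetical proxy pool and per-IP budgets -- tune for the target site.
    PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
    MAX_PER_MINUTE = 5
    MAX_PER_DAY = 50

    minute_hits = defaultdict(deque)  # proxy -> timestamps within the last minute
    day_hits = defaultdict(int)       # proxy -> hits so far today

    def pick_proxy():
        """Return a proxy that still has budget left, or None if all are exhausted."""
        now = time.time()
        candidates = []
        for proxy in PROXIES:
            window = minute_hits[proxy]
            while window and now - window[0] > 60:   # drop hits older than a minute
                window.popleft()
            if len(window) < MAX_PER_MINUTE and day_hits[proxy] < MAX_PER_DAY:
                candidates.append(proxy)
        return random.choice(candidates) if candidates else None

    def fetch(url):
        proxy = pick_proxy()
        if proxy is None:
            time.sleep(30)    # every source IP is at its budget; back off
            return None
        minute_hits[proxy].append(time.time())
        day_hits[proxy] += 1
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)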

Now, some more paranoid sites are also looking for realistic surfing patterns. They’re looking more closely at how you visit the website, how you load the pages – do you, say, just grab the text off the pages and not the images, which is very common for basic web harvesting because it cuts down a lot on the amount of data you need to grab. But it also really stands out – it looks very mechanical, it’s not the way a human accesses things. And also most scraping is faster than humans can access the web – if you’re clicking to a new page every second, that doesn’t leave a lot of time for reading the information that’s out there. So, when trying to go against more sophisticated or paranoid websites, it’s very important to make sure your patterns look appropriate.
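
One rough way to approximate a human pattern, assuming you are driving an ordinary HTTP client, is to pull the images referenced on each page and pause for a human-scale, randomized interval before the next load. The delay range and the crude image extraction below are placeholders for illustration.

    import random
    import re
    import time
    from urllib.parse import urljoin

    import requests

    def human_paced_get(session, url, min_pause=8.0, max_pause=45.0):
        """Fetch a page and its images, then pause roughly as long as a reader might."""
        resp = session.get(url, timeout=30)
        # Crude image extraction; a real harvester would use an HTML parser.
        for src in re.findall(r'<img[^>]+src="([^"]+)"', resp.text):
            try:
                session.get(urljoin(url, src), timeout=30)
            except requests.RequestException:
                pass    # a missing image is not worth aborting the page over
        time.sleep(random.uniform(min_pause, max_pause))   # human-scale reading pause
        return resp

    # Usage: human_paced_get(requests.Session(), "https://target.example/page/1")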

Cookies and other tracking mechanisms are another give-away. If they’re blocked entirely, many sites will just fail. But they also need to be turned over frequently or all the activity gets correlated. If you’re pretending to be a hundred people, you can’t have all hundred people using the same cookie, or you’ve undone all the work.

Many sites also check that all traffic with a given cookie comes from the same IP address. In many cases, they’ll embed an encrypted or scrambled version of the IP address in one of their cookies, so they can very quickly check to make sure that you haven’t changed addresses in mid-session. They’re mostly doing this to avoid session hijacking, but it always causes problems for scrapers.
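
To show how simple such a check can be on the website’s side, here is a toy sketch; the cookie contents, the secret key, and the use of an HMAC are assumptions for illustration, not a description of any particular site.

    import hashlib
    import hmac

    SECRET = b"server-side-secret"   # hypothetical key known only to the website

    def ip_token(ip_address: str) -> str:
        """Scrambled form of the client IP that the site stores in a cookie."""
        return hmac.new(SECRET, ip_address.encode(), hashlib.sha256).hexdigest()

    def request_is_consistent(cookie_token: str, current_ip: str) -> bool:
        """True only if the cookie was issued to the same IP making this request."""
        return hmac.compare_digest(cookie_token, ip_token(current_ip))

    # A scraper that rotates its IP mid-session while keeping the old cookie fails
    # this check and gets flagged, bounced, or handed an error page.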

So Ntrepid solutions enable quick integration with your existing scraping solutions to allow you to spread your activity across thousands of different source addresses.

For more sophisticated targets, we enable the creation of massively parallel independent sessions to emulate large numbers of individual realistic agents, ensuring the traffic will stand up to even detailed scrutiny.

For more information about this, and other Ntrepid products, please visit us at ntrepidcorp.com. You can also reach me directly with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com. Thank you for listening.

Episode 2

Malware Protection

Welcome to the Ntrepid Podcast: Episode 2. My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation. In this episode, I will be talking about the threats from the new breed of hackers and malware and how virtualization can be used to protect yourself.

The nature and purpose of malware has changed a lot in the last few years, but on the whole, our counter-measures have not kept pace. Historically, malware was developed by individuals or small groups of hackers looking to make a name for themselves. It could be about reputation, revenge, curiosity or even counting coup against a huge organization. These days, the real problems come from criminal hackers and state or pseudo-state-sponsored hackers. All of these groups share a few characteristics. They are interested in specific results, not reputation. They are going to try to avoid detection if possible, rather than advertising their actions and they have the resources and skills to discover and exploit new vulnerabilities.

Once a computer is compromised, the payloads that are delivered have also become much more sophisticated. They are able to monitor activity, capture passwords, credit cards and other credentials. They can even capture tokens from multi-factor authentication to allow session hijacking.

Hacking activities come in two main flavors: mass attacks and targeted attacks. Mass attacks are designed to capture as many computers as possible. They spread indiscriminately and try to infect any computer that appears vulnerable. While they are often very sophisticated, the sheer size of the activity makes detection very likely, which in turn allows for the development of anti-malware rules and fingerprints, although these do take time to create and disseminate.

Targeted attacks are very different. The malware is typically targeted by hand. It does not spread automatically or does so only within very tightly constrained limits. Because only a small number of computers are compromised, detection is much more difficult, and even heuristic and pattern-based detection is going to have a difficult time with the very low level of activity required to infect machines and deploy these tools.

Attackers have built tools that allow them to test their malware against all known anti-malware tools. This basically ensures that any new malware created will not be detected by any of the commercial anti-malware tools.

Spear phishing and waterhole attacks have become the preferred techniques for these targeted kinds of attacks. In both cases, the victim is lured into executing the malware by couching it in a context that feels safe and meets the user’s expectations. The links or documents look real, and seem to come from a trusted source, and generally make sense in context.

With spear phishing, the attack generally comes through email, while waterhole attacks are centered around websites frequented by the target population. One particularly effective waterhole attack is to plant malware on an internal server of the target company. Placing the payload in an update to the HR time-keeping system, for example, is very likely to catch almost everyone in the company.

Between the time lag to detect a new mass malware and to create and deploy new rules, and the difficulty of discovering targeted malware at all, computers are very vulnerable to attack. Most security experts feel that any computer or network that is not completely isolated can be compromised by a resourceful and capable attacker.

Of course, even air gaps aren’t perfect. It’s really difficult to make a system or network completely isolated. The Iranian nuclear centrifuges attacked by Stuxnet were controlled by systems with no outside network connectivity, but targeted malware was able to get in through removable storage media.

We think that virtualization is a key technology to help protect you against these new breeds of attacks. It provides two critical capabilities: system isolation and rollback. System isolation is the separation of your high risk and high probability of compromise systems from your core network and valuable data. Conducting your high risk activities in a virtualized environment with no access to internal networks or servers helps prevent the loss of data and makes it extremely difficult for an attacker to use an initial breach of an isolated system as a beachhead from which to attack the rest of your network.

The best implementations of system isolation place the servers running the virtual machines completely outside the sensitive network environment. This is superior to virtualization on the desktop because even if the virtual container is breached, it still does not give access to sensitive data. Either one may be an effective solution depending on your resources and the threat level under which you’re operating.

When you use virtualized servers in an isolated environment, those servers are accessed using remote desktop protocols, generally over a secure VPN. The only connection then, between the desktop and the virtual environment, is this remote desktop session which may be only initiated over the VPN from the user’s end. We have never seen attacks back across such a path. If a virtual machine is compromised, that malware only has access to that single virtual machine, and cannot access any of the other servers, networks, data or storage.

This is also where the rollback capability comes in, because it may be impossible to detect a compromise of your virtual machines. You must assume that they have been compromised, even after a fairly limited amount of use. With virtualization, you can revert to a known good and clean version of your computer and file system daily, or even after each session. This gets around the problem of detecting and surgically removing malware by basically burning your virtual computer to the ground, and effectively dropping in a new one. The one twist is that you will be destroying any data you might have created or stored on that machine. If you want to keep that data it can be done, but only with great care. Any residual information kept around could be a vector for reinfecting your virtual computer. Which information you choose to persist between rollbacks of your virtual environment and how you store that information is critical.
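
As a rough sketch of that rollback idea, the snippet below reverts a VirtualBox guest to a known-clean snapshot after each session. The VM name, the snapshot name, and the choice of VirtualBox are illustrative assumptions; this is not how Nfusion itself is implemented.

    import subprocess

    VM_NAME = "research-browser"       # hypothetical guest used for risky browsing
    CLEAN_SNAPSHOT = "clean-baseline"  # snapshot taken while the VM was known good

    def rollback_vm():
        """Power the guest off and restore the known-clean snapshot."""
        # Ignore the error if the VM is already powered off.
        subprocess.run(["VBoxManage", "controlvm", VM_NAME, "poweroff"], check=False)
        subprocess.run(["VBoxManage", "snapshot", VM_NAME, "restore", CLEAN_SNAPSHOT],
                       check=True)
        subprocess.run(["VBoxManage", "startvm", VM_NAME, "--type", "headless"],
                       check=True)

    # Run rollback_vm() at the end of each session (or nightly from a scheduler) so
    # any compromise is discarded along with the dirty disk state.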

Similarly, how you export data and documents from the virtual environment back to your work computer and network is critical. That, too, can be the path for infection, in this case of your core IT infrastructure. Such data needs to be heavily tested and quarantined. Best practices would be to never actually open any such documents directly on your internal computers, but to always view them in virtualized environments.

Ntrepid offers a line of products called Nfusion, specifically designed for this purpose. They automate the whole process of managing virtual machines and keeping them properly isolated from your network, persisting key information, and safely moving documents between the virtual Nfusion environment and your desktop.

The full version of Nfusion runs the virtual machines in an isolated and dedicated server cluster, either hosted in Ntrepid’s secure cloud infrastructure, or in your data centers outside your firewall.

Nfusion Web is a lightweight, rapidly deployable solution for web surfing only. It runs in a virtual machine on your local desktop and uses VPNs to keep all traffic segregated from your internal traffic until it’s well outside of your security perimeter.

Both of these are designed to be used by non-technical users. Because human error is the single most common cause of security breaches, we have built the systems to be extremely user friendly and to protect against accidental compromise through carelessness or oversight. Whatever the reason for your excursions beyond the firewall, let us help you ensure that you’re not bringing back anything dangerous or contagious.

For more information about this, and any other Ntrepid products, please visit us on the web at ntrepidcorp.com. And follow us on Facebook and on Twitter @ntrepidcorp.

You can also reach me directly with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.

Thanks for listening.

Episode 3

International Data Scraping

Welcome to the Ntrepid Podcast, Episode #3.

My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.

In this episode, I will be talking about a global perspective on information scraping.

I have a problem with the phrase “The Internet”, because it implies that there is a “thing” out there, and that if we all look we will all see the same thing. In reality, the Internet is really more like a hologram, it looks different to every viewer and from every direction.

In the early days, web pages were simply flat files. If you requested a web page, that file was just sent to you. The same file would be sent to everyone who asked. That is not how things work any more. These days, most web pages are dynamically generated. The page literally does not exist except as a set of rules and logic for how to create the page when requested. Those rules can include information about date, time, recent events, evolving content on the server, the location of the user, and that visitor’s history of activity on the website. The server then pulls together and delivers the website the visitor sees, which might be slightly or significantly different from any other visitor.

A news site, for example, might show stories about your local area, a search engine could rank results based on your previous patterns of interest, and storefronts might adjust prices based on income levels in your area. There have even been examples of targeting based on computer brand, where more expensive hotels were shown to Mac users than to Windows users.
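
To give a sense of how little server-side logic that kind of targeting takes, here is a toy Python sketch of a dynamically generated page; the framework choice, header names, and pricing rule are purely illustrative assumptions.

    from flask import Flask, jsonify, request  # assumes Flask is installed

    app = Flask(__name__)

    @app.route("/hotel-offers")
    def hotel_offers():
        # Toy targeting rules: region from a header, platform from the User-Agent.
        country = request.headers.get("X-Geo-Country", "US")  # e.g. set by a CDN
        user_agent = request.headers.get("User-Agent", "")
        base_price = 120 if country == "US" else 100
        if "Macintosh" in user_agent:
            base_price = int(base_price * 1.15)   # pricier options for Mac visitors
        return jsonify(nightly_rate=base_price, currency="USD")

    # Two visitors requesting the same URL can get different pages; the page only
    # exists as the rules that generate it.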

Consider this scenario: You’re traveling to Australia for a summer vacation. You plan to fly into Sydney and use that as your home base. Throughout the three weeks you will be Down Under, you will be making trips to Brisbane, Perth, and Melbourne.

Being the early planner that you are, you book your flights within Australia before you leave the U.S., from your U.S.-based IP address. Now, flash forward to your vacation… Once settled into your hotel room in Sydney, happily connected to the local hotel WiFi, you happen to browse flight prices from Sydney to your other Aussie locations. Not only are you getting killed by the exchange rate of the Australian dollar to the American, but the Australian airline also knocks an additional 10% off for its domestic travelers.

So it is not enough to get just one picture of a website. To really understand what is there, it must be observed from multiple different perspectives. One of the most important perspectives is location. Altering content based on the country or region of the visitor is really quite common.

Imagine that you are the Product Manager for a high-tech consumer product. You are constantly keeping your eyes on your competitors to make sure you are staying ahead of them in technology, market share, and price. You are in the U.S., but your main competitors are overseas.  So you conduct your research from your work computer, unaware that your corporate-branded U.S. IP address stands out like a sore thumb, every time you hit their site. In fact, they noticed your pattern of checking pricing on Mondays and Fridays, tech specs every Tuesday, and financials on the first of every month. After a while, you might notice that their site is getting quite stagnant. While they used to adjust their pricing weekly and their tech specs every month, they have not changed a thing in the last couple of months… or so you thought.

Some emails from overseas partners suggest that you are missing something. Turns out, your competition got wise to what you were doing and is now spoofing you by posting old data every time their website is visited by your company’s IP address range. If you had access to non-attributable U.S. IP addresses, or better yet, IP addresses that are regionally close to your competitor, you would be able to get the scoop on what they were doing, and they would be none the wiser.

Obviously this pattern would have been even clearer, and the change probably less noticeable, if you had been doing automated scraping, as opposed to just being a human at the keyboard. In order to detect this, your scraping activity needs to be duplicated and originate from different areas. Any given website should be tested to detect if they are doing this kind of modification by scraping random samples of data from the site and comparing them to your standard scraping results. If they are different, then you may need to repeat most or all of your activity from one or even more than one other location in addition to your primary scraping location.
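
One way to automate that comparison, assuming you have exit points in two different regions, is to re-fetch a random sample of pages through the second location and flag any that differ from your primary results. The proxy addresses and sample size below are placeholders.

    import random

    import requests

    PRIMARY_PROXY = {"https": "http://us-exit.example:8080"}    # hypothetical exits
    SECONDARY_PROXY = {"https": "http://eu-exit.example:8080"}

    def fetch(url, proxies):
        return requests.get(url, proxies=proxies, timeout=30).text

    def detect_cloaking(scraped_urls, sample_size=20):
        """Re-fetch a random sample from a second location and flag mismatches."""
        suspicious = []
        for url in random.sample(scraped_urls, min(sample_size, len(scraped_urls))):
            if fetch(url, PRIMARY_PROXY) != fetch(url, SECONDARY_PROXY):
                # Real checks would compare extracted fields, not raw HTML, so that
                # harmless per-request variation does not trigger false alarms.
                suspicious.append(url)
        return suspicious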

Ntrepid maintains facilities in many different countries around the world specifically for this purpose. It is easy to specify the location of origin of any given scraping activity. Our large pools of IP addresses in each location allow you to disguise your activity just as you would when scraping from our domestic IP address space.

For more information about anonymous web scraping tools, best practices, and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook, and on Twitter @ntrepidcorp.

You can also reach me by email with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.

Thanks for listening.

Episode 4

Cookies and Web Scraping

Welcome to the Ntrepid Podcast, Episode #4.

My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.

In this episode I will be talking about how cookies and other information you provide can impact your web scraping success.

When setting up a web scraping process, many people’s first instinct is to remove as much identifying information as possible in order to be more anonymous. Unfortunately, that actually can make you stand out even more, and cause you to be quickly flagged and blocked by the websites against which you are trying to collect.

Take cookies for example, the best known and easiest to remove identifiers. While they can be used to track visitors, they are often required for the website to function correctly. When a website tries to set a cookie, either in the response header or in JavaScript, that cookie should be accepted and returned to the website.

That is not to say that you should let them hang around forever, and therein lies the art. The key is to keep them around for a moderate number of queries, but only a number that a human might reasonably be expected to do in a single sitting.

Cookies need to be managed in concert with many other identifiers, and changed together between those sessions. The most important identifier after cookies is the IP address. It is particularly important that these change together. Many websites will actually embed a coded version of the visitor’s IP address in a cookie, and then in every page, check that they still match. If you change IP mid-stream while keeping the cookies, the website will flag your activity, and is likely to return an error page or bounce you back to the home page without the data you were looking for.

When switching to a new session, we suggest going back to an appropriate landing page, and working down through the website from there. Some websites will set a cookie on their landing pages. If they don’t see it when a visitor hits a deep page, it is evidence that the hit is from a scraper, and not from a real person who came to the website and navigated to that page.
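
Pulling those pieces together, here is a minimal sketch of session rotation in which cookies and the source IP change together, every new session starts at the landing page, and no session outlives a plausible number of requests. The proxy pool, landing-page URL, and request cap are assumptions.

    import random

    import requests

    PROXY_POOL = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # hypothetical exits
    LANDING_PAGE = "https://target.example/"                       # hypothetical site
    MAX_REQUESTS_PER_SESSION = 30   # roughly what one person might do in a sitting

    class ScrapingSession:
        def __init__(self):
            proxy = random.choice(PROXY_POOL)
            self.session = requests.Session()   # fresh cookie jar
            self.session.proxies = {"http": proxy, "https": proxy}
            self.requests_made = 0
            # Start where a real visitor would, so landing-page cookies get set.
            self.get(LANDING_PAGE)

        def get(self, url):
            self.requests_made += 1
            return self.session.get(url, timeout=30)

    def fetch_all(urls):
        session = ScrapingSession()
        for url in urls:
            if session.requests_made >= MAX_REQUESTS_PER_SESSION:
                session = ScrapingSession()   # rotate IP and cookies together
            yield session.get(url)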

When you change sessions, it is also a good time to change your browser fingerprint. Browsers and OS versions, supported languages, fonts, and plugins can collectively create an almost unique identifier of your computer. Changing these slightly between sessions reduces the likelihood of being detected and blocked.

Finally, you can get tripped up by the information that you explicitly pass to the target website. Many scraping activities require filling out search fields or other forms. We had one situation where a customer was tripped up because they used the same shipping zip code for every query. That zip became so dominant for the website that they investigated and discovered the scraping activity.
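
A simple guard against that kind of tell, sketched here with made-up field names and zip codes, is to draw form inputs from a realistic pool instead of reusing a single value.

    import random

    # Hypothetical pool of plausible shipping zip codes; in practice it should
    # roughly mirror where the target site's real customers are.
    ZIP_POOL = ["10001", "60614", "94110", "30339", "75201", "98109", "02139"]

    def build_query(product_id: str) -> dict:
        """Form fields for one price lookup, varying the zip code each time."""
        return {
            "item": product_id,
            "ship_to_zip": random.choice(ZIP_POOL),
        }

    # Over thousands of queries, no single zip code dominates the site's logs the
    # way a constant value would.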

It is important to avoid detection if at all possible because it keeps the target at a lower level of alertness. Once they are aware of scraping activity, they are more likely to take countermeasures, and to look more carefully for future scraping. Staying below the radar from the start will make things much easier in the long run.

For more information about anonymous web scraping tools, best practices, and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook, and on Twitter @ntrepidcorp.

You can reach me by email with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.

Thanks for listening.

Episode 5

Browser Fingerprints and Scraping

Welcome to the Ntrepid Podcast, Episode #5.

My name is Lance Cottrell, Chief Scientist for Ntrepid Corporation.

In this episode I will be talking about how browser fingerprinting can impact your web scraping activities.

In the last podcast I touched on the issue of browser fingerprinting. In this episode I want to dig a little deeper.

The three primary identifiers that a website can track are IP address, cookies, and browser fingerprint.

By browser fingerprint, I mean all the information a website can obtain about your web browser and computer from within a web page, using Javascript and/or Flash. It turns out that there is a lot more information there than you might guess.

Of course, the website can tell if you are using Firefox, IE, Safari, Chrome, or whatever other browser. It also knows what version you are running, and what operating system and version of the operating system you are running on; Windows 8, Mac Mountain Lion, or Linux for example.

Using Javascript and Flash, the website can see much more. It can get your time zone, screen size and color depth. But the real goldmine is in the fonts and plugins.

You almost certainly have a ton of both. Many programs and websites install fonts or plugins. For example, if you download audio from Amazon, you get a plugin. If you update your GPS from your computer, you get a plugin. If you configure your Jambox Bluetooth speaker, you get a plugin, and so on.

Lots of software uses non-standard fonts to make them look unique, or to allow the user more design flexibility. At the moment I have 299 fonts installed on my home computer, and I have made no particular effort to collect fonts.

Taken together, all this information creates a virtually unique pattern, your browser fingerprint. Even if you change your IP address and delete all your cookies, a website can recognize you just by recognizing your browser’s fingerprint.
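
As a toy illustration of why that combination is so identifying, the sketch below hashes a handful of reported attributes into a single fingerprint a site could store and match on later; the attribute names and example values are hypothetical.

    import hashlib
    import json

    def fingerprint(attributes: dict) -> str:
        """Stable hash of whatever the page could read about the browser."""
        canonical = json.dumps(attributes, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    visitor = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.2; rv:23.0) Gecko/20100101 Firefox/23.0",
        "timezone_offset_minutes": 300,
        "screen": "1920x1080x24",
        "fonts": ["Arial", "Calibri", "Jambox Sans"],    # made-up font list
        "plugins": ["Shockwave Flash", "GPS Updater"],   # made-up plugin list
    }

    print(fingerprint(visitor))  # same browser, same hash -- new IP or no cookies needed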

But do they actually do that?

A recent study showed that over 400 of the top 10,000 websites are actively using this technique to track users who may be trying to prevent that by changing their IP addresses or deleting cookies. These are major mainstream websites, not hackers, not security agencies, and they are using browser fingerprinting to identify visitors to their websites, and this practice is growing quickly.

So, how does this impact you if you are engaged in web scraping?

I will assume that you are already addressing cookies and IP addresses in a way that emulates many different virtual visitors. This would include making sure that any multi-step process on a website would be conducted using a single IP address and keeping cookies, until the process is complete, then changing them all at once.

If you are not also addressing your browser fingerprint, however, any website could still identify you as being the same person, obviating your attempts to hide. You can reduce the size of your browser fingerprint by blocking Flash and/or Javascript. Now, many people block Flash for security reasons, so you will not stand out too much if you choose to do the same. However, blocking Javascript will really stand out because for a real person it would break most of the interesting websites on the Internet.

So, for each virtual visitor you are trying to create, you should have an individual fingerprint discoverable by the website. Those fingerprints need to be created with care; you can’t just randomly create them. For example, a very new browser might not be able to run on an older operating system, certain fonts might be unique and specific to a particular OS, and certain plugins are only compatible with certain browsers (and even specific versions of browsers).
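
Here is a minimal sketch of one way to do that: draw each virtual visitor from a small table of known-consistent combinations and vary only the attributes that can safely vary. The specific browser, font, and plugin combinations shown are illustrative examples, not a vetted dataset.

    import random

    # Each template keeps browser, platform, fonts, and plugins mutually consistent.
    PROFILE_TEMPLATES = [
        {
            "user_agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
            "platform": "Win32",
            "fonts": ["Arial", "Calibri", "Segoe UI"],
            "plugins": ["Shockwave Flash"],
        },
        {
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.71 "
                          "(KHTML, like Gecko) Version/6.1 Safari/537.71",
            "platform": "MacIntel",
            "fonts": ["Helvetica Neue", "Lucida Grande"],
            "plugins": ["QuickTime Plug-in"],
        },
    ]

    def new_virtual_visitor() -> dict:
        """Pick a coherent template, then vary only low-risk attributes."""
        profile = dict(random.choice(PROFILE_TEMPLATES))
        profile["screen"] = random.choice(["1366x768", "1440x900", "1920x1080"])
        return profile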

In many cases, mobile devices may be the best thing to emulate. Most do not allow installing any additional plugins or fonts, and so there is much less variation, and therefore the fingerprint is much smaller. A tradeoff is that you may be shown the mobile version of the website, but because that is usually smaller and less graphics intensive, that might actually be an advantage for you.

Ntrepid can help you optimize your browser fingerprints, and other web scraping tools and techniques, to stay ahead in this accelerating arms race.

For more information about anonymous web scraping tools and other Ntrepid products, please visit us on the web at ntrepidcorp.com, and follow us on Facebook, and on Twitter @ntrepidcorp.

You can reach me directly by email with any questions or suggestions for future topics at lance.cottrell@ntrepidcorp.com.

Thanks for listening.
