special support for selfhosted/federated services #50

Closed
opened 2024-03-01 20:49:35 +01:00 by lnl · 5 comments

Typical extraction flow (for context):

  • check the _VALID_URL for each extractor,
  • use the first extractor that matches the URL,
  • if no extractor matches the URL, use generic extractor.
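The flow above can be sketched roughly like this (the extractor names and `_VALID_URL` patterns are illustrative placeholders, not haruhi-dl's actual extractors):

```python
import re

# Hypothetical minimal extractors; _VALID_URL follows the youtube-dl convention.
class VimeoIE:
    _VALID_URL = r'https?://(?:www\.)?vimeo\.com/\d+'

class GenericIE:
    _VALID_URL = r'.*'  # matches any URL, used as the last resort

# GenericIE is deliberately placed last in the list.
EXTRACTORS = [VimeoIE, GenericIE]

def pick_extractor(url):
    """Return the first extractor whose _VALID_URL matches the URL,
    falling back to the generic extractor."""
    for ie in EXTRACTORS:
        if re.match(ie._VALID_URL, url):
            return ie
    return GenericIE
```

With this ordering, `pick_extractor('https://vimeo.com/12345')` selects `VimeoIE`, while any unrecognized URL falls through to `GenericIE`.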

The problem with this flow for federated/selfhosted services is that they run on a shitload of domains, and they can never all be listed. [fediverse.network](https://fediverse.network/) lists over 2500 running instances. There could even (at least theoretically) be selfhosted servers in local networks (as an alternative to Facebook Workplace or something).

Extractors for selfhosted services cannot simply skip the domain part of the URL, as the path scheme may (and does) overlap with other services.
As an example: Gab Social's https://gab.com/ACT1TV/posts/104450493441154721 overlaps with Facebook's https://www.facebook.com/aniaainagrodzka/posts/10222580971026685.
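To illustrate the overlap: a single domain-agnostic pattern for `/<user>/posts/<id>` URLs (a hypothetical regex, not the actual Gab or Facebook `_VALID_URL`) matches both example links, so the domain cannot simply be wildcarded away:

```python
import re

# Hypothetical domain-agnostic pattern for "/<user>/posts/<id>" URLs.
POSTS_RE = re.compile(r'https?://[^/]+/(?P<user>[\w.]+)/posts/(?P<id>\d+)')

gab = 'https://gab.com/ACT1TV/posts/104450493441154721'
facebook = 'https://www.facebook.com/aniaainagrodzka/posts/10222580971026685'

# Both services match the same pattern, which is exactly the ambiguity
# described above.
print(POSTS_RE.match(gab).group('id'))       # 104450493441154721
print(POSTS_RE.match(facebook).group('id'))  # 10222580971026685
```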

For this reason, we should implement a separate loader for selfhosted services. The extraction flow should work like this:

  • check the _VALID_URL for each (normal) extractor,
  • use the first extractor that matches the URL,
  • if no extractor matches the URL, use generic extractor,
  • generic extractor downloads the webpage (as it does currently) and checks its content for [matches to selfhosted extractors](https://github.com/AliasIO/wappalyzer/pull/3438/files#diff-ad6b5aefcc8c0ba3d7c9f4c9ba096652f7bae1235363a40bba4759f3abb374c4),
  • if a match is found, pass the URL and webpage content to the selfhosted extractor.
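A rough sketch of the proposed probing step. The class name, attribute name, and the PeerTube fingerprint string are all assumptions for illustration (Wappalyzer-style content detection), not verified PeerTube markup:

```python
import re

# Hypothetical selfhosted extractor: instead of matching on _VALID_URL,
# it declares a fingerprint to look for in the downloaded page content.
class PeerTubeSHIE:
    # Assumed fingerprint; a real extractor would need a verified marker
    # that PeerTube instances actually emit.
    _SH_VALID_CONTENT_RE = r'<meta property="og:platform" content="PeerTube"'

SELFHOSTED_EXTRACTORS = [PeerTubeSHIE]

def probe_selfhosted(url, webpage):
    """Called by the generic extractor after it has downloaded `webpage`;
    returns the first selfhosted extractor whose fingerprint matches,
    or None if none do."""
    for ie in SELFHOSTED_EXTRACTORS:
        if re.search(ie._SH_VALID_CONTENT_RE, webpage):
            return ie  # real code would hand (url, webpage) to the extractor
    return None
```

The key design point is that matching happens on the page *content* rather than the URL, so the instance's domain never needs to be known in advance.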

Existing selfhosted extractors (PeerTube, and any others if they exist) should then be migrated to use this mechanism.

lnl added the
enhancement
label 2024-03-01 20:49:35 +01:00
lnl closed this issue 2024-03-01 20:49:35 +01:00
mentioned in commit 889005bab33fb318e971bb265921697b5157023f
mentioned in issue #17

mentioned in issue #11

changed the description
Reference: laudom/haruhi-dl#50