special support for selfhosted/federated services #50

Closed
opened 2024-03-01 20:49:35 +01:00 by lnl · 5 comments

Typical extraction flow (for context):

  • check the _VALID_URL for each extractor,
  • use the first extractor that matches the URL,
  • if no extractor matches the URL, use generic extractor.
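The flow above can be sketched roughly like this (the extractor names and `_VALID_URL` patterns are illustrative placeholders, not haruhi-dl's actual extractors):

```python
import re

# Hypothetical minimal extractors; _VALID_URL follows the youtube-dl convention.
class VimeoIE:
    _VALID_URL = r'https?://(?:www\.)?vimeo\.com/\d+'

class GenericIE:
    _VALID_URL = r'.*'  # matches any URL, used as the last resort

# GenericIE is deliberately placed last in the list.
EXTRACTORS = [VimeoIE, GenericIE]

def pick_extractor(url):
    """Return the first extractor whose _VALID_URL matches the URL,
    falling back to the generic extractor."""
    for ie in EXTRACTORS:
        if re.match(ie._VALID_URL, url):
            return ie
    return GenericIE
```

With this ordering, `pick_extractor('https://vimeo.com/12345')` selects `VimeoIE`, while any unrecognized URL falls through to `GenericIE`.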

The problem with this flow for federated/selfhosted services is that they run on a shitload of domains, and they can never all be listed. [fediverse.network](https://fediverse.network/) lists over 2500 running instances. There could even (at least theoretically) be selfhosted servers in local networks (as an alternative to Facebook Workplace or something).

Extractors for selfhosted services cannot simply skip the domain part of the URL, as the path scheme may (and does) overlap with other services.
As an example: Gab Social's https://gab.com/ACT1TV/posts/104450493441154721 overlaps with Facebook's https://www.facebook.com/aniaainagrodzka/posts/10222580971026685.
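To illustrate the overlap: a single domain-agnostic pattern for `/<user>/posts/<id>` URLs (a hypothetical regex, not the actual Gab or Facebook `_VALID_URL`) matches both example links, so the domain cannot simply be wildcarded away:

```python
import re

# Hypothetical domain-agnostic pattern for "/<user>/posts/<id>" URLs.
POSTS_RE = re.compile(r'https?://[^/]+/(?P<user>[\w.]+)/posts/(?P<id>\d+)')

gab = 'https://gab.com/ACT1TV/posts/104450493441154721'
facebook = 'https://www.facebook.com/aniaainagrodzka/posts/10222580971026685'

# Both services match the same pattern, which is exactly the ambiguity
# described above.
print(POSTS_RE.match(gab).group('id'))       # 104450493441154721
print(POSTS_RE.match(facebook).group('id'))  # 10222580971026685
```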

For this reason, we should implement a separate loader for selfhosted services. The extraction flow should work like this:

  • check the _VALID_URL for each (normal) extractor,
  • use the first extractor that matches the URL,
  • if no extractor matches the URL, use generic extractor,
  • generic extractor downloads the webpage (as it does currently) and checks its content for [matches to selfhosted extractors](https://github.com/AliasIO/wappalyzer/pull/3438/files#diff-ad6b5aefcc8c0ba3d7c9f4c9ba096652f7bae1235363a40bba4759f3abb374c4),
  • if a match is found, pass the URL and webpage content to the selfhosted extractor.
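A rough sketch of the proposed probing step. The class name, attribute name, and the PeerTube fingerprint string are all assumptions for illustration (Wappalyzer-style content detection), not verified PeerTube markup:

```python
import re

# Hypothetical selfhosted extractor: instead of matching on _VALID_URL,
# it declares a fingerprint to look for in the downloaded page content.
class PeerTubeSHIE:
    # Assumed fingerprint; a real extractor would need a verified marker
    # that PeerTube instances actually emit.
    _SH_VALID_CONTENT_RE = r'<meta property="og:platform" content="PeerTube"'

SELFHOSTED_EXTRACTORS = [PeerTubeSHIE]

def probe_selfhosted(url, webpage):
    """Called by the generic extractor after it has downloaded `webpage`;
    returns the first selfhosted extractor whose fingerprint matches,
    or None if none do."""
    for ie in SELFHOSTED_EXTRACTORS:
        if re.search(ie._SH_VALID_CONTENT_RE, webpage):
            return ie  # real code would hand (url, webpage) to the extractor
    return None
```

The key design point is that matching happens on the page *content* rather than the URL, so the instance's domain never needs to be known in advance.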

Existing selfhosted extractors (PeerTube, and any others if they exist) should then be migrated to use this mechanism.

lnl added the
enhancement
label 2024-03-01 20:49:35 +01:00
lnl closed this issue 2024-03-01 20:49:35 +01:00
mentioned in commit 889005bab33fb318e971bb265921697b5157023f
mentioned in issue #17

mentioned in issue #11

changed the description
Reference: laudom/haruhi-dl#50